Datasets: HTML File

Create a dataset from HTML files when you already have a collection of HTML files that you want to analyze. Google Harvest and Web Harvest datasets also consist of HTML files, but those documents have not yet been collected when the dataset is created; they are harvested afterward. With a Google Harvest, Google searches for particular terms and retrieves documents that contain them. With a Web Harvest, IN-SPIRE starts from a URL that you supply and retrieves the documents linked to it.

Creating a Dataset from HTML Files

  1. Follow the basic steps for creating a new dataset, selecting HTML Dataset.
    Click Next.
  2. Choose Add or Add from Folder, and select the local HTML files that you want to analyze.
  3. Edit the Optional Settings if you wish.
  4. Click Finish to start processing. The Processing dialog opens, informing you that the dataset is being processed. Click OK. The dataset appears in the list of datasets in the Dataset Editor window. To monitor the status of the dataset as it is processed, click the Refresh button at the top of the Dataset Editor window.
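
Before using Add from Folder, it can help to check which HTML files a folder actually contains. The following Python sketch is not part of IN-SPIRE; it is a stand-alone helper that lists files ending in .html or .htm under a folder. The function name find_html_files, the choice of extensions, and the recursive search of subfolders are all assumptions made for illustration. Whether Add from Folder itself descends into subfolders is not stated here.

```python
from pathlib import Path

# Extensions treated as HTML here; this set is an assumption, not an IN-SPIRE rule.
HTML_SUFFIXES = {".html", ".htm"}


def find_html_files(folder):
    """Return a sorted list of HTML file paths found under `folder`.

    Hypothetical helper: searches subfolders recursively and matches
    extensions case-insensitively.
    """
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
    )


if __name__ == "__main__":
    import sys

    # Folder to inspect: first command-line argument, or the current directory.
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    files = find_html_files(folder)
    print(f"{len(files)} HTML file(s) found under {folder}")
    for path in files:
        print(path)
```

Running the script with a folder path as its only argument prints the number of HTML files found, followed by one path per line, which gives a quick way to confirm that the folder holds the documents you expect before you create the dataset.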