
Datasets: Web Harvest

If you know the URLs of the web sites you want to harvest, use the Web Harvest option. Follow Basic Steps 1 and 2 of Creating New Datasets, choosing "Web Harvest". The following window opens:

Dataset Wizard Web Harvest Window

  1. Enter a Dataset Name.

  2. The Web Address List is a list of URLs, one per line, of the web sites you wish to harvest. You can:

    When your list is complete, click Next > to go to Step 3. To accept the default settings and begin processing immediately, click Finish and go to Basic Step 12.
    If any of the web sites require authentication (in other words, if you must log in to the site using a username and password), go to Step 3.

  3. The following window opens:
    Dataset Wizard Web Harvest Scope Window

    The settings on this screen serve as controls for the duration of a harvest and can be useful if you are experiencing any of the following problems:

  4. When you are done, click Finish to accept all the defaults on the following screens and start processing immediately. To set filters, click Next > and go to Step 5.

  5. The Filters window appears:
    Dataset Wizard Web Harvest Window Filter Hosts Window

    Filters help you to deal with the following problems:

    Enter the hosts or URL words to filter, then click Next > and go to Optional Settings. To use the default settings for the remaining options and start processing immediately, click Finish and go to Basic Step 12.
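
The Web Address List in Step 2 expects one URL per line. If you keep your URLs in a text file, a short script can tidy the list before you paste it into the wizard. The following is a minimal sketch, not part of IN-SPIRE: the function name `clean_url_list` and the sample addresses are hypothetical, and the rule that every entry must be an absolute http or https address is an assumption about what the harvester accepts.

```python
from urllib.parse import urlparse


def clean_url_list(lines):
    """Return unique, absolute http(s) URLs in their original order."""
    seen = set()
    cleaned = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parts = urlparse(url)
        # Assumption: only absolute http/https addresses are wanted.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue  # drop exact duplicates
        seen.add(url)
        cleaned.append(url)
    return cleaned


# Hypothetical sample input: one candidate address per line.
raw = """http://www.example.com

http://www.example.com
not a url
https://www.example.org/news
"""

# Join the cleaned entries back into one-URL-per-line text.
web_address_list = "\n".join(clean_url_list(raw.splitlines()))
print(web_address_list)
```

The printed text can be pasted directly into the Web Address List field.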

Start Processing

The Processing dialog opens, informing you that the dataset is being processed. Click OK. The dataset appears in the list of datasets in the Dataset Editor window. You can monitor its status as it is processed by clicking the Refresh button at the top of the Dataset Editor window.

Control the Harvest While it is Running

The progress of the harvest is reflected in the Harvest Progress window:

Harvest Requests Completed is the number of requests the harvester has made and completed; this number will not be the same as the number of Usable Docs Retrieved. There are several reasons why a request does not result in a usable document:

All requests that result in unusable data are logged in the HarvestLog.txt file, which can be found at C:\Documents and Settings\<username>\INSPIRE\DatasetRoot\<dataset handle>\Harvest\HarvestLog.txt. The default dataset handle is "00000003". The default location for IN-SPIRE is C:\Documents and Settings\<username>\INSPIRE. If you have installed IN-SPIRE in a different location, look there for the HarvestLog.txt file.
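
The log location described above can be assembled from its parts. The sketch below is illustrative only: the function `harvest_log_path`, its parameters, and the user name "jsmith" are hypothetical, and it assumes the folder layout is exactly as stated in the preceding paragraph. It uses `PureWindowsPath`, so it builds the path without touching the file system.

```python
from pathlib import PureWindowsPath


def harvest_log_path(username, dataset_handle="00000003", inspire_root=None):
    """Build the HarvestLog.txt path from the layout given in the text.

    inspire_root overrides the default IN-SPIRE location for installs
    in a different folder (hypothetical parameter for illustration).
    """
    if inspire_root is None:
        # Default location stated in the documentation.
        inspire_root = (
            PureWindowsPath("C:/Documents and Settings") / username / "INSPIRE"
        )
    return (
        PureWindowsPath(inspire_root)
        / "DatasetRoot"
        / dataset_handle
        / "Harvest"
        / "HarvestLog.txt"
    )


# "jsmith" is a placeholder user name.
print(harvest_log_path("jsmith"))
```

If IN-SPIRE is installed elsewhere, pass that folder as `inspire_root` and the rest of the path is appended unchanged.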
