Datasets: Web Harvest

If you know the URLs of the websites you want to harvest, use the Web Harvest option. Follow Steps 1 and 2 of the Creating New Datasets Basic Steps. The following window is displayed:

  1. Enter a Dataset Name.

  2. The Web Address List is a list of URLs, one per line, of the websites you wish to harvest. You can:

    When your list is complete, click Next> and go to Step 3. To accept the default settings and begin processing immediately, click Finish and go to Basic Step 12.
    Note: If any of the websites requires authentication (that is, you must log in to the site with a username and password), go to Step 3.

  3. The following window opens:

    The settings on this screen control the duration of a harvest and can be useful if you experience any of the following problems:

  4. When you are done, click Finish to accept all the defaults on the following screens and start processing immediately. To set filters, click Next> and go to Step 5.

  5. The Filters window appears:

    Filters help you to deal with the following problems:

    Enter the hosts or URL words to filter, then click Next> and go to Optional Settings. To use the default settings for the remaining options and start processing immediately, click Finish and go to Basic Step 12.
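
The Web Address List in Step 2 expects one URL per line. If the list is assembled outside INSPIRE, a small script can tidy it before it is pasted in. The following Python sketch is not part of INSPIRE; the function name clean_url_list and the example addresses are hypothetical, and it assumes that each entry should be a full http or https URL, which this document does not state.

```python
from urllib.parse import urlparse


def clean_url_list(text):
    """Return unique, well-formed http(s) URLs from text, one URL per line."""
    seen = set()
    urls = []
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parts = urlparse(url)
        # Assumption: keep only entries with an http/https scheme and a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:  # drop exact duplicates, keep first occurrence
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical input: one duplicate and one entry without a scheme.
raw = """
http://www.example.com/news
http://www.example.com/news
www.example.org
https://www.example.net/
"""

# Join with newlines to get the one-URL-per-line form for the Web Address List.
web_address_list = "\n".join(clean_url_list(raw))
print(web_address_list)
```

Entries that the sketch drops (here, the one without a scheme) may still be acceptable to the harvester, so review them by hand rather than relying on the script alone.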

Start Processing

The Processing dialog opens, informing you that the dataset is being processed. Click OK. The dataset appears in the list of datasets in the Dataset Editor window. You can monitor its status as it is processed by clicking the Refresh button at the top of the Dataset Editor window.

Control the Harvest While It Is Running

The progress of the harvest is reflected in the Harvest Progress window:

Harvest Requests Completed is the number of requests the harvester has made and completed; this number will not be the same as the number of Usable Docs Retrieved. There are several reasons why a request does not result in a usable document:

All requests that result in unusable data are logged in the HarvestLog.txt file, located at INSPIRE_HOME\DatasetRoot\Harvest\HarvestLog.txt. The default location for INSPIRE_HOME is C:\Program Files\INSPIRE.
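
As an illustration of how the log location is put together, the following Python sketch builds the full path from an INSPIRE_HOME folder. It is not part of INSPIRE; the helper name harvest_log_path is hypothetical, and the default folder is the one stated above. Pass your own folder if INSPIRE is installed elsewhere.

```python
from pathlib import PureWindowsPath

# Default install folder as stated in this document.
DEFAULT_INSPIRE_HOME = r"C:\Program Files\INSPIRE"


def harvest_log_path(inspire_home=DEFAULT_INSPIRE_HOME):
    """Return INSPIRE_HOME\\DatasetRoot\\Harvest\\HarvestLog.txt as a Windows path."""
    # PureWindowsPath only manipulates the path text; it does not touch the disk.
    return PureWindowsPath(inspire_home) / "DatasetRoot" / "Harvest" / "HarvestLog.txt"


print(harvest_log_path())
```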

7/18/05