Datasets: RSS XML Harvest

Video tutorial available.


If you know the URLs of the RSS feeds you want to harvest, the RSS XML Harvest option is the one to use. It enables you to collect and visualize documents from a list of RSS feed web addresses. Follow the Creating New Datasets Basic Steps 1 and 2, choosing "Web / RSS XML Harvest". The following window opens:

  1. Enter a Dataset Name, or accept the default dataset name, which is "RSS Feed URLs <date>".
  2. The RSS Feed Address List is a list of URLs, one per line, of the RSS feeds you wish to harvest. You can:

    When your list is complete, click Next > to go to Step 3. To accept the default settings and begin processing immediately, click Finish and go to Basic Step 12.


  3. Dataset Wizard Web Harvest Scope Window
    The settings on this screen control the duration of a harvest and can be useful if you are experiencing any of the following problems:
  4. When you are done, click Finish to accept all the defaults on the following screens and start processing immediately. To set filters, click Next > and go to Step 5.
  5. The Filters window appears:
    Dataset Wizard Web Harvest Window Filter Hosts Window
    Filters help you to deal with the following problems:
    Enter hosts or URL words to filter, click Next >, and go to Optional Settings. To use the default settings for the remainder of the options and start processing immediately, click Finish and go to Basic Step 12.
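
The RSS Feed Address List in Step 2 expects one URL per line. If you are assembling a long list outside the wizard, it may help to tidy it before pasting it in. The following is a minimal Python sketch and is not part of IN-SPIRE: the function name clean_feed_list and the sample addresses are hypothetical, and the rule that only http and https addresses are kept is an assumption made for this example.

```python
from urllib.parse import urlsplit


def clean_feed_list(raw_text):
    """Return feed URLs in their original order, dropping blank lines,
    exact duplicates, and lines that are not http/https addresses."""
    seen = set()
    cleaned = []
    for line in raw_text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # malformed address, e.g. a broken IPv6 host
        # Assumption: only web addresses with a host are worth harvesting.
        if parts.scheme.lower() not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue  # keep only the first occurrence
        seen.add(url)
        cleaned.append(url)
    return cleaned


if __name__ == "__main__":
    # Hypothetical sample list; replace with your own feed addresses.
    sample = """
    http://example.com/news/rss.xml
    http://example.com/news/rss.xml
    not a url
    https://example.org/feed
    """
    # Print one URL per line, ready to paste into the address list.
    print("\n".join(clean_feed_list(sample)))
```

The output is one address per line, which matches the format the RSS Feed Address List describes.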

Start Processing

The Processing dialog opens, informing you that the dataset is being processed. Click OK. The dataset appears in the list of datasets in the Dataset Editor window. You can monitor its status as it is processed by clicking the Refresh button at the top of the Dataset Editor window.

Control the Harvest While it is Running

The progress of the harvest is reflected in the Harvest Progress window:

Harvest Requests Completed is the number of requests the harvester has made and completed; this number will not be the same as the number of Usable Docs Retrieved. There are several reasons why a request does not result in a usable document:

All requests that result in unusable data are logged in the HarvestLog.txt file, which can be found at C:\Documents and Settings\<username>\INSPIRE\DatasetRoot\<dataset handle>\Harvest\HarvestLog.txt. The default dataset handle is "00000003". The default location for IN-SPIRE is C:\Documents and Settings\<username>\INSPIRE. If you have installed IN-SPIRE in a different location, look there for the HarvestLog.txt file.
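
If you check the log often, the path pattern above can be assembled programmatically. The following is a rough Python sketch, not an IN-SPIRE feature: the function name harvest_log_path and the user name jsmith are hypothetical, and it assumes that a non-default installation keeps the same DatasetRoot\<dataset handle>\Harvest layout beneath the IN-SPIRE folder.

```python
from pathlib import PureWindowsPath


def harvest_log_path(username, dataset_handle="00000003", inspire_root=None):
    """Build the expected HarvestLog.txt path for a dataset.

    inspire_root overrides the default IN-SPIRE location; the folder
    layout beneath it is assumed to be unchanged.
    """
    if inspire_root is None:
        # Default location described in this topic.
        root = PureWindowsPath("C:/Documents and Settings") / username / "INSPIRE"
    else:
        root = PureWindowsPath(inspire_root)
    return root / "DatasetRoot" / dataset_handle / "Harvest" / "HarvestLog.txt"


if __name__ == "__main__":
    # Hypothetical user name; the handle defaults to "00000003".
    print(harvest_log_path("jsmith"))
```

PureWindowsPath only builds the path string and does not touch the disk, so the sketch can be run on any system; confirm that the file exists before relying on the result.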
