Follow the basic steps for creating a new dataset. The Dataset Wizard for a Google harvest will display.
In the Dataset Name field, enter a dataset name.
Type the query text in one or more of the Google Query boxes. The boxes you fill in are joined together using the Boolean AND operator. For example, if you type "cat" in the field "with at least one of the words" and type "dog" in the field "without the words", Google search will look for documents containing "cat" that do not also contain "dog". Finally, click Next>. The Advanced Google Options panel displays.
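As a rough illustration of how the Google Query boxes described above combine, the sketch below builds a single search string from the two fields named in the example. The function name build_query and its parameters are hypothetical, and it is only an assumption that the wizard produces a string of this form; the OR and minus operators are standard Google search syntax.

```python
def build_query(any_words="", without_words=""):
    """Combine two query boxes into one Google-style search string.

    Hypothetical helper for illustration only; it is not part of the
    Dataset Wizard. Filled-in boxes are joined with an implicit AND
    (a space), which mirrors the behavior described in the text.
    """
    parts = []

    # "with at least one of the words": alternatives are joined with OR.
    any_terms = any_words.split()
    if any_terms:
        parts.append(" OR ".join(any_terms))

    # "without the words": each excluded word gets a leading minus sign.
    parts.extend("-" + word for word in without_words.split())

    # A space between parts acts as Boolean AND in Google search syntax.
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(any_words="cat", without_words="dog"))
```

With "cat" in the first box and "dog" in the second, the sketch yields a query for pages that contain "cat" and do not contain "dog", matching the example in the step above.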
The defaults for Language, Date, and Occurrences are shown in the above example. You are not required to change them, although you may wish to hone your query by specifying:
The language of the pages (any language, or specify a single language).
The period in which pages have been updated (anytime, or within 3 months, 6 months, a year).
Where the query terms must appear (anywhere, or in the title, in the text, in the URL, or in links).
When you are done, click Next>.
The Settings panel will display.
Note: At this point, you can click Finish to accept the defaults on the following panels and start processing.
The settings on the above panel serve as controls for the duration of a harvest and can be useful if you are experiencing any of the following problems:
An excessive number of documents are being retrieved.
Use the Maximum Documents drop-down list to set a reasonable number,
and the harvest will be terminated when that number is reached.
The harvest retrieves "linked to" pages that are not relevant to the analysis.
Specify a Harvest Depth of 1 level, which means that links from the top-level pages will be followed, but links on the pages they lead to will not be.
Harvests seem to take a long time to complete.
Parallel Fetches sets how many web servers the harvester downloads from at the same time. If you have the processor power, running several parallel fetches at once can shorten the harvest time. Attempting too many parallel fetches, however, can actually slow the harvest.
Frequent connection timeouts.
Connection Timeout refers to how long the harvester should wait
for a reply after it has contacted a web server. If you are experiencing
frequent connection timeouts, you may want to increase the timeout interval.
Pages with very large graphics or slow or overloaded servers.
Download Timeout limits how long it may take to actually download a web page, and ensures that the harvest will not be stuck trying to fetch a page from a very slow or unresponsive server.
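To make the effect of Maximum Documents and Harvest Depth more concrete, here is a minimal sketch of a depth-limited, count-limited traversal over a small in-memory link graph. The function harvest, its parameters, and the sample page names are all hypothetical, and the breadth-first order is an assumption; the actual harvester may visit pages differently.

```python
from collections import deque


def harvest(start_pages, links, max_documents, harvest_depth):
    """Return the pages a simplified harvest would fetch, in order.

    Illustrative assumptions (not the product's actual algorithm):
    - links maps each page to the list of pages it links to
    - start_pages are the top-level pages (depth 0)
    - links are followed breadth-first, up to harvest_depth levels
    - the harvest stops once max_documents pages have been fetched
    """
    fetched = []
    seen = set()
    queue = deque()
    for page in start_pages:
        if page not in seen:
            seen.add(page)
            queue.append((page, 0))

    while queue and len(fetched) < max_documents:
        page, depth = queue.popleft()
        fetched.append(page)
        if depth >= harvest_depth:
            continue  # depth limit reached: ignore links on this page
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return fetched


if __name__ == "__main__":
    # Hypothetical link graph: A is a top-level search result.
    sample_links = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"]}
    print(harvest(["A"], sample_links, max_documents=10, harvest_depth=1))
```

In this sketch a depth of 1 fetches the top-level page and the pages it links to, but nothing further, and lowering max_documents cuts the run short as soon as the limit is reached.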
When you are done, click Next>.
The Filters panel will display.
Filters help you to deal with the following problems:
A large site that you know contains pages irrelevant to your analysis will dominate the harvest and obscure the pages that are most interesting to you.
A "host" is a web server; its address appears in the
URL for a page, immediately after http://. For example, if the URL is
http://www.amazon.com/stores/books, the host is "www.amazon.com".
Enter names of hosts you want to avoid in the Filter Hosts box, one per
line.
Your query terms include words which have several meanings, only one of which is interesting for the analysis, or you may want to exclude certain sections of websites (on-line catalogs, for example).
Enter the hosts or URL words to filter, then click Next> to go to Optional Settings. To use the default settings for the remainder of the options and start processing immediately, click Finish instead.
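The filtering described above can be sketched as follows. This is only an illustration: the helper names host_of, parse_filter_list, and keep_url are hypothetical, and treating a URL word as a case-insensitive substring match is an assumption about how the filter behaves. The host is extracted with Python's standard urllib.parse module.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL, e.g. the text right after http://."""
    return (urlsplit(url).hostname or "").lower()


def parse_filter_list(text):
    """Split the contents of a filter box into entries, one per line."""
    return [line.strip().lower() for line in text.splitlines() if line.strip()]


def keep_url(url, filter_hosts, filter_url_words=()):
    """Decide whether a URL survives the filters (hypothetical logic).

    A URL is dropped if its host is listed in filter_hosts, or if any
    entry of filter_url_words appears anywhere in the URL.
    """
    if host_of(url) in filter_hosts:
        return False
    lowered = url.lower()
    return not any(word in lowered for word in filter_url_words)


if __name__ == "__main__":
    hosts = parse_filter_list("www.amazon.com\n")
    words = parse_filter_list("catalog\n")
    print(host_of("http://www.amazon.com/stores/books"))
    print(keep_url("http://www.amazon.com/stores/books", hosts, words))
```

Under these assumptions, every page served from a filtered host is skipped regardless of its path, while a URL word removes only the pages whose addresses contain that word.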
The Processing dialog opens, informing you that the dataset is being processed. Click OK. The dataset name displays in the list of datasets in the Dataset Editor window. You can monitor its status as it is processed by clicking the Refresh button at the top of the Dataset Editor window.
The progress of the harvest is reflected in the Status column of the Dataset Editor window. To see the complete status details, click the Status button. The Dataset Details window will display.
Use the tabs to view status information about the dataset harvesting, preprocessing, and processing phases of the Google harvest. This information can be used to refine your harvest.
6/18/05