Stopwords are defined as non-information-bearing words. Words like "the",
"and", and "asked" are in the stopwords list so they
will be ignored by the text engine. Words that are ignored cannot be
major terms or be used in a vocabulary query.
If you go to the Query tool and perform a Words
in Document query on the phrase "he asked", for example, you
will not find any matching documents.
Particular types of data will generally have additional words
that should be ignored. IN-SPIRE
allows you to edit the stopwords list for a dataset, and to save a customized
stopwords list for use in subsequent analyses.
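The effect of a stopwords list on a query can be illustrated with a minimal Python sketch. This is not IN-SPIRE's text engine; the STOPWORDS set and the content_terms function are hypothetical names, and the set holds only a few example words (including "he", which is assumed here to be a stopword because the "he asked" query above returns nothing).

```python
# Hypothetical stand-in for a stopwords list; the real list ships with IN-SPIRE.
STOPWORDS = {"the", "and", "asked", "he"}


def content_terms(text, stopwords=STOPWORDS):
    """Return the lowercased words in text that are not stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]


# Every word in "he asked" is a stopword, so nothing is left to match on.
print(content_terms("he asked"))
print(content_terms("He asked the King"))
```

When every word in a phrase is a stopword, no terms remain to search for, which is consistent with the empty result described above.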
Accessing a Dataset Stopwords List
The Stopwords list is accessible from the Dataset Editor's Dataset Wizard,
which will be open when you create a new dataset.
From the IN-SPIRE main toolbar window, select File
> Datasets... The Dataset Editor window opens.
Click on the name of the dataset of interest to select
it, and click Edit, or
click New. The Dataset
Wizard opens. Note: If
a dataset is open, you will be given the option to close it and continue,
or cancel.
If you are editing an existing dataset, click Next in the Dataset Wizard until you
see the Stopwords panel. If this is a new dataset, you must select a dataset
name and data type to go on to the Stopwords panel for an ASCII dataset.
If the Dataset Wizard is already open, use the Back and Next buttons
to find the Stopwords panel.
Adding or Removing Words from the Stopwords List
From the Dataset Wizard, move through the wizard
until you reach the dataset Stopwords panel (example shown is for the
creation of an ASCII dataset).
Scroll through the list of words to review the default
list. Some words in the list will look strange (e.g., "theyd").
For details on how the punctuation list is created and what it does, see
Punctuation Rules.
To add one or more words to the stopwords list, click
Add. The Add Stopwords window will
display.
Enter the new stopwords, separated by a space, and
click Add, or, if you are finished, click Done.
The Add Stopwords window will close and the words will display in the
Current Stopwords list, in lower case, without punctuation.
You can use Ctrl-V to paste a list of words from the clipboard to the
Add Stopwords window. This is useful if you want to use the Gist tool
to collect additions to the Current Stopwords list. See Gist
for more information on how to do this.
What words might you want to add? Are there frequently-occurring non-information-bearing
words in your dataset that are not yet on the stopwords list? For example,
a dataset of Shakespearean documents would contain many instances of "thee",
"thou", "thy", and "thine", which you would
probably want to ignore during an analysis. Add these words to the stopwords
list.
To add words from an existing stopwords list, click
Load. For details, see
the following procedure, Using Stopwords from Another Dataset Stopwords
File.
To remove a stopword, click on a word that you want
included in the analysis of this dataset, and click Delete.
To save the modified list of stopwords, click Next to go to the next panel in the Dataset
Wizard. Your changes will be saved. To
save the modified stopwords list to be used in another dataset, see
the procedure To Save the Current
Stopwords File for Use by Other Datasets.
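If you prepare a list of words outside IN-SPIRE before pasting it into the Add Stopwords window, a short Python sketch such as the following may help. It is only an approximation of the behavior described above (space-separated input, shown in lower case without punctuation); the function names are hypothetical, and the exact punctuation rules IN-SPIRE applies are assumed, not confirmed.

```python
import string
from collections import Counter

# Assumption: "without punctuation" means ASCII punctuation is removed.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_stopwords(raw):
    """Split on whitespace, strip punctuation, lowercase, and de-duplicate."""
    words = (word.translate(_PUNCT).lower() for word in raw.split())
    return sorted({word for word in words if word})


def candidate_stopwords(text, current, top_n=5):
    """List the most frequent words in text that are not already stopwords."""
    tokens = (word.translate(_PUNCT).lower() for word in text.split())
    counts = Counter(t for t in tokens if t and t not in current)
    return [word for word, _ in counts.most_common(top_n)]


print(normalize_stopwords("Thee THOU they'd thy,"))
print(candidate_stopwords("Thou art thou, and thee and thee and thee", {"and"}, 2))
```

The second function is one rough way to spot frequently occurring words, such as "thee" and "thou" in the Shakespeare example. Whether a frequent word is truly non-information-bearing remains a judgment call.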
Using Stopwords from Another Dataset Stopwords File
From the Dataset Wizard, move through the wizard
until you reach the dataset Stopwords panel (example shown is for the
creation of an ASCII dataset).
On the Stopwords panel, click Load.
A file open dialog will display the IN-SPIRE Stopwords directory. The
files that are accessible are stopwords files that have been explicitly
saved for use by other datasets. See To
Save the Current Stopwords File for Use by Other Datasets.
Select the stopwords file in the list that you want
to open, and click Load.
The file open dialog closes, and the chosen stopwords list is loaded
into the list to the left of the Stopwords panel, with the name of the
file immediately above it. Note: The
name of the stopwords file is not a stopword itself and it will not be
added if you decide to add all the stopwords in the file to your current
list.
From the newly loaded stopwords list, select the
terms you want to add to the Current Stopwords list.
To add all the terms in the file to your new
stopwords list, click in the list box that contains the terms you want
to add (it will be the one on the left) and press Ctrl-A to select all
terms in that list.
To add a single word from the file to your new
stopwords list, click on the word in the list to select it.
To add several words from the list, select them
using Ctrl-click for non-contiguous words or Shift-click for contiguous
words.
Click the arrow to add the selected terms to the current stopwords list.
The selected terms will display in the Current Stopwords
list.
If you want to use stopwords from another stopwords file as well, click
Load again and repeat the above steps.
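Conceptually, the procedure above merges selected words from a loaded list into the current list. The Python sketch below models that as a set union; it is an illustration only, not how IN-SPIRE stores its lists, and add_from_loaded is a hypothetical name.

```python
def add_from_loaded(current, loaded, selected=None):
    """Return the current list plus the chosen words from a loaded list.

    selected=None stands in for Ctrl-A (take every loaded word).
    Words in selected that are not in the loaded list are ignored.
    """
    loaded_set = set(loaded)
    if selected is None:
        chosen = loaded_set
    else:
        chosen = {word for word in selected if word in loaded_set}
    return sorted(set(current) | chosen)


# Take everything from the loaded list, then only a single word.
print(add_from_loaded(["and", "the"], ["thee", "thou", "the"]))
print(add_from_loaded(["and"], ["thee", "thou"], ["thou", "zounds"]))
```

Because the merge is a union, loading a second file and repeating the steps never duplicates a word that is already in the current list.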
To Save the Current Stopwords File for Use by Other Datasets
On the upper right of the Stopwords panel, click
Save... A file save
dialog will open.
Enter a name for this stopwords file. Caution: The
name of a stopwords file must end in ".stop".
Click Save.
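If you generate stopwords file names in a script, a small check like the following can guard against forgetting the required extension. This is a sketch with a hypothetical function name; it assumes the extension is expected in lower case, which the text above does not state explicitly.

```python
def with_stop_extension(name):
    """Return name with ".stop" appended unless it already ends with it."""
    name = name.strip()
    if not name:
        raise ValueError("a stopwords file needs a non-empty name")
    return name if name.endswith(".stop") else name + ".stop"


print(with_stop_extension("shakespeare"))
print(with_stop_extension("legal.stop"))
```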
Editing the Default Stopwords File
The default stopwords file is used for all new datasets. To modify this
file, you can either edit it in a text editor or use the Stopwords panel
as described in Adding
or Removing Words from the Stopwords List, to create the desired stopwords
file. Using the Stopwords panel is the preferred method, as you do not
need to worry about keeping the stopwords file alphabetized. To use the
Stopwords panel:
Add all or some of the terms in the stopwords file
to the Current Stopwords list.
Add and delete terms as appropriate.
Save the Current Stopwords list in the INSPIRE\DatasetRoot\
folder as "00000000.stop". Be careful of the number of zeroes;
there are 8 of them.
Alternatively, edit the default stopwords file in a text editor:
In the main IN-SPIRE install directory (by default,
this is C:\Program Files\INSPIRE\), find a file named 00000000.stop.
This file contains the default stopwords list.
Make a backup copy of 00000000.stop.
Edit 00000000.stop in a text editor. Caution: The
stopwords file is in alphabetical order and must remain so. Corrupting
the stopwords file will cause problems with the datasets.
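For the text-editor route, the backup and alphabetical-order requirements can be handled by a script instead of by hand. The Python sketch below is hedged throughout: it assumes the stopwords file is plain text with one word per line and that "alphabetical" means a simple sort of lower-case strings, neither of which is confirmed here. The function names are hypothetical.

```python
import shutil
from pathlib import Path


def is_alphabetized(words):
    """Return True if each word sorts at or after the one before it."""
    return all(a <= b for a, b in zip(words, words[1:]))


def add_words_sorted(stop_path, new_words):
    """Back up the file, then rewrite it with new_words merged in sorted order.

    Assumption: the file holds one stopword per line.
    """
    path = Path(stop_path)
    # Keep a backup copy next to the original before changing anything.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    existing = [line.strip() for line in path.read_text().splitlines() if line.strip()]
    merged = sorted(set(existing) | {word.lower() for word in new_words})
    path.write_text("\n".join(merged) + "\n")
    return merged
```

Try this on a copy of the file first and inspect the result, since a corrupted stopwords file causes problems with the datasets.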