Stopwords

Stopwords are non-information-bearing words. Words such as "the", "and", and "asked" are on the stopwords list, so the text engine ignores them. Ignored words cannot be major terms and cannot be used in a vocabulary query. For example, if you go to the Query tool and perform a Words in Document query on the phrase "he asked", you will not find any matching documents.
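
The effect of a stopwords list on a query can be pictured with a minimal Python sketch. It is not IN-SPIRE's text engine: the function name searchable_terms and the small stopword set are hypothetical, and the set assumes that "he" is a stopword, as the example above implies.

```python
# Hypothetical stopwords for illustration only; not the IN-SPIRE default list.
STOPWORDS = {"the", "and", "asked", "he"}

def searchable_terms(phrase, stopwords=STOPWORDS):
    """Return the words of a phrase that are not on the stopwords list."""
    return [word for word in phrase.lower().split() if word not in stopwords]

print(searchable_terms("he asked"))             # [] : nothing left to match
print(searchable_terms("he asked the doctor"))  # ['doctor']
```

When every word in the phrase is a stopword, no terms remain to match against documents, which is why such a query returns nothing.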

Particular types of data will generally have additional words that should be ignored. IN-SPIRE allows you to edit the stopwords list for a dataset and to save a customized stopwords list for use in subsequent analyses.

Accessing a Dataset Stopwords List

The Stopwords list is accessible from the Dataset Editor's Dataset Wizard, which opens when you create a new dataset or edit an existing one.

  1. From the IN-SPIRE main toolbar window, select File > Datasets... The Dataset Editor window opens.
  2. To edit an existing dataset, click its name to select it, and then click Edit. To create a dataset, click New. The Dataset Wizard opens.
    Note: If a dataset is open, you will be given the option to close it and continue, or to cancel.
  3. If you are editing an existing dataset, click Next in the Dataset Wizard until you see the Stopwords panel. If this is a new dataset, you must select a dataset name and data type to go on to the Stopwords panel for an ASCII dataset.

If the Dataset Wizard is already open, use the Back and Next buttons to find the Stopwords panel.

Adding or Removing Words from the Stopwords List

  1. From the Dataset Wizard, move through the wizard until you reach the dataset Stopwords panel (example shown is for the creation of an ASCII dataset).
  2. Scroll through the list of words to review the default list. Some words in the list will look strange (e.g., "theyd") because punctuation has been removed. For details on how the punctuation list is created and what it does, see Punctuation Rules.
  3. To add one or more words to the stopwords list, click Add. The Add Stopwords window opens.
  4. Enter the new stopwords, separated by spaces, and click Add or, if you are finished, click Done. The Add Stopwords window closes and the words appear in the Current Stopwords list, in lower case and without punctuation.
    You can use Ctrl-V to paste a list of words from the clipboard to the Add Stopwords window. This is useful if you want to use the Gist tool to collect additions to the Current Stopwords list. See Gist for more information on how to do this.
    What words might you want to add? Are there frequently-occurring non-information-bearing words in your dataset that are not yet on the stopwords list? For example, a dataset of Shakespearean documents would contain many instances of "thee", "thou", "thy", and "thine", which you would probably want to ignore during an analysis. Add these words to the stopwords list.
  5. To add words from an existing stopwords list, click Load. For details, see the procedure Using Stopwords from Another Dataset Stopwords File.
  6. To remove a stopword, click on a word that you want included in the analysis of this dataset, and click Delete.
  7. To save the modified list of stopwords, click Next to go to the next panel in the Dataset Wizard. Your changes will be saved. To save the modified stopwords list for use in another dataset, see the procedure To Save the Current Stopwords File for Use by Other Datasets.
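
If you prepare a list of words outside IN-SPIRE before pasting it into the Add Stopwords window, a sketch like the following shows roughly what to expect in the Current Stopwords list. It only approximates the behavior described above (lower case, no punctuation). The function name normalize_stopwords is hypothetical, and IN-SPIRE's actual punctuation rules may differ from Python's string.punctuation.

```python
import string

# Translation table that deletes ASCII punctuation. This is an assumption:
# IN-SPIRE's own punctuation rules may treat some characters differently.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)

def normalize_stopwords(text):
    """Split space-separated text into lower-case words without punctuation."""
    words = (word.lower().translate(_STRIP_PUNCTUATION) for word in text.split())
    return [word for word in words if word]  # drop tokens that were only punctuation

print(normalize_stopwords("Thee, Thou THY they'd"))
# ['thee', 'thou', 'thy', 'theyd']
```

This also illustrates why entries such as "theyd" appear in the default list: the apostrophe in "they'd" has been removed.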

Using Stopwords from Another Dataset Stopwords File

  1. From the Dataset Wizard, move through the wizard until you reach the dataset Stopwords panel (example shown is for the creation of an ASCII dataset).  
  2. On the Stopwords panel, click Load. A file open dialog will display the IN-SPIRE Stopwords directory. The files that are accessible are stopwords files that have been explicitly saved for use by other datasets. See To Save the Current Stopwords File for Use by Other Datasets.
  3. Select the stopwords file that you want to open, and click Load. The file open dialog closes, and the chosen stopwords list is loaded into the list on the left side of the Stopwords panel, with the name of the file immediately above it.
    Note:  The name of the stopwords file is not a stopword itself and it will not be added if you decide to add all the stopwords in the file to your current list.
  4. From the newly loaded stopwords list, select the terms you want to add to the Current Stopwords list.  
  5. Click the right arrow button to add the selected terms to the Current Stopwords list. The selected terms will display in the Current Stopwords list.

If you want to use stopwords from another stopwords file as well, click Load again and repeat the above steps.
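
Conceptually, adding selected terms from one or more loaded files amounts to a merge without duplicates. The sketch below is illustrative only. The function name add_selected is hypothetical, and it assumes the resulting list is kept in alphabetical order, which this page suggests but does not spell out.

```python
def add_selected(current, selected):
    """Return the current stopwords plus the selected terms, de-duplicated and sorted."""
    return sorted(set(current) | set(selected))

current = ["and", "the"]
current = add_selected(current, ["thee", "the"])   # terms chosen from a first file
current = add_selected(current, ["thou", "thy"])   # terms chosen from a second file
print(current)  # ['and', 'the', 'thee', 'thou', 'thy']
```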

To Save the Current Stopwords File for Use by Other Datasets

  1. On the upper right of the Stopwords panel, click Save... A file save dialog will open.
  2. Enter a name for this stopwords file.
    Caution: The name of a stopwords file must end in ".stop".
  3. Click Save.
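
If you generate stopwords file names in a script, the naming rule above can be enforced with a small helper. This is a sketch only. The function name stop_filename is hypothetical, and it assumes the extension must be exactly the lower-case ".stop".

```python
def stop_filename(name):
    """Return the name with the ".stop" extension appended if it is missing."""
    name = name.strip()
    return name if name.endswith(".stop") else name + ".stop"

print(stop_filename("shakespeare"))       # shakespeare.stop
print(stop_filename("shakespeare.stop"))  # shakespeare.stop
```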

Editing the Default Stopwords File

The default stopwords file is used for all new datasets. To modify this file, you can either edit it in a text editor or use the Stopwords panel, as described in Adding or Removing Words from the Stopwords List, to create the desired stopwords file. Using the Stopwords panel is the preferred method, because you do not need to worry about keeping the stopwords file alphabetized. To use the Stopwords panel:

  1. Load the default stopwords file that is distributed with IN-SPIRE (Default.stop) into the Stopwords panel, as described in Using Stopwords from Another Dataset Stopwords File.
  2. Add all or some of the terms in the stopwords file to the Current Stopwords list.
  3. Add and delete terms as appropriate.
  4. Save the Current Stopwords list in the INSPIRE\DatasetRoot\ folder as "00000000.stop". Be careful of the number of zeroes; there are 8 of them.

Alternatively, edit the default stopwords file in a text editor:

  1. In the main IN-SPIRE install directory (by default, this is C:\Program Files\INSPIRE\), find a file named 00000000.stop. This file contains the default stopwords list.
  2. Make a backup copy of 00000000.stop.
  3. Edit 00000000.stop in a text editor.
    Caution:  The stopwords file is in alphabetical order and must remain so. Corrupting the stopwords file will cause problems with the datasets.
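
Before saving a hand-edited file, it can help to check that the list is still in alphabetical order. The sketch below rests on assumptions that this page does not confirm: that the file is plain text with whitespace-separated words, that it is UTF-8 compatible, and that "alphabetical" means ordinary string comparison of lower-case words. All function names are hypothetical.

```python
import bisect

def first_out_of_order(words):
    """Return the index of the first word that sorts before its predecessor, or None."""
    for i in range(1, len(words)):
        if words[i] < words[i - 1]:
            return i
    return None

def insert_sorted(words, word):
    """Insert a lower-cased word into an already sorted list, skipping duplicates."""
    word = word.lower()
    i = bisect.bisect_left(words, word)
    if i == len(words) or words[i] != word:
        words.insert(i, word)
    return words

def check_stop_file(path):
    """Read a stopwords file (assumed format: whitespace-separated words) and check its order."""
    with open(path, encoding="utf-8") as handle:
        words = handle.read().split()
    return first_out_of_order(words)

words = ["and", "the", "thou"]
insert_sorted(words, "Thee")
print(words)                      # ['and', 'the', 'thee', 'thou']
print(first_out_of_order(words))  # None
```

The check only reads the file. Make the backup copy described in step 2 before changing anything.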

6/29/05