Settings: Stopwords

Stopwords are words that the text engine ignores during text processing. As a consequence, you cannot search for them, Probe will not find them, and they will not appear in Galaxy or ThemeView Classic labels.

Stopwords are defined as non-information-bearing words. Words like "the", "and", and "asked" are in the stopwords list.

Some datasets will have additional words that should be ignored. IN-SPIRE allows you to edit the stopwords list for a dataset and to save a customized stopwords list for use in subsequent analyses.

Accessing a Dataset Stopwords List

The Stopwords list is accessible from the Dataset Editor's Dataset Wizard, which opens when you create a new dataset or edit an existing one.

  1. From the IN-SPIRE main toolbar window, select File > Datasets.... The Dataset Editor window opens.
  2. Click on the name of the dataset of interest to select it, and click Edit or New. The Dataset Wizard opens.
    If a dataset is open, you will be given the option to close it and continue, or cancel.
  3. If you are editing an existing dataset, click Next in the Dataset Wizard until you see the Stopwords panel. If this is a new dataset, you must select a dataset name and data type to go on to the Stopwords panel for a text dataset.

If the Dataset Wizard is already open, use the Back and Next buttons to find the Stopwords panel.

Adding or Removing Words from the Stopwords List

  1. From the Dataset Wizard, move through the wizard until you reach the dataset Stopwords panel (example shown is for the creation of a Text dataset).
    Dataset Stopwords Window
  2. Scroll through the list of words to review the default list. Some words in the list will look strange (e.g., "theyd") because stopwords are listed without punctuation. For details on how the punctuation list is created and what it does, see Punctuation Rules.
  3. To add one or more words to the stopwords list, click Add.... The Add Stopwords window will appear.
    Add Stopwords Window
  4. Enter the new stopwords, separated by spaces, and click Add or, if you are finished, click Done. The Add Stopwords window closes, and the words are displayed in the Current Stopwords list in lower case, without punctuation.

    You can use CTRL-V to paste a list of words from the clipboard to the Add Stopwords window. This is useful if you want to use the Summary tool to collect additions to the Current Stopwords list. See Summary for more information on how to do this.

    What words might you want to add? Are there frequently-occurring non-information-bearing words in your dataset that are not yet on the stopwords list? For example, a dataset of Shakespearean documents would contain many instances of "thee", "thou", "thy", and "thine", which you would probably want to ignore during an analysis. Add these words to the stopwords list.
  5. To add words from an existing stopwords list, click Load. For details, see the procedure Using Stopwords from Another Dataset Stopwords File.
  6. To remove a stopword, click on a word that you want included in the analysis of this dataset, and click Delete.
  7. To save the modified list of stopwords, click Next to go to the next panel in the Dataset Wizard. Your changes will be saved. To save the modified stopwords list for use in another dataset, see the procedure To Save the Current Stopwords File for Use by Other Datasets.
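
When you collect candidate words outside IN-SPIRE and plan to paste them into the Add Stopwords window, it can help to clean them first. The following Python sketch is not part of IN-SPIRE; the function name prepare_stopwords is hypothetical, and the cleanup (lower case, ASCII punctuation removed, duplicates dropped) only approximates the lower-case, punctuation-free form described above. IN-SPIRE's own punctuation rules may differ.

```python
import string

# Translation table that deletes ASCII punctuation characters.
# Assumption: this approximates, but may not match, IN-SPIRE's punctuation rules.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)

def prepare_stopwords(raw_words):
    """Return a space-separated string of cleaned, de-duplicated words."""
    seen = set()
    cleaned = []
    for word in raw_words:
        token = word.lower().translate(_STRIP_PUNCTUATION)
        # Skip empty results and words already collected.
        if token and token not in seen:
            seen.add(token)
            cleaned.append(token)
    return " ".join(cleaned)

paste_text = prepare_stopwords(["Thee", "thou", "Thy,", "thine", "thou", "They'd"])
print(paste_text)  # thee thou thy thine theyd
```

Copy the printed line to the clipboard and paste it into the Add Stopwords window with CTRL-V.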

Using Stopwords from Another Dataset Stopwords File

  1. From the Dataset Wizard, move through the wizard until you reach the dataset Stopwords panel (example shown is for the creation of a Text dataset).
  2. On the Stopwords panel, click Load. A file open dialog will display the IN-SPIRE Stopwords directory. The files that are accessible are stopwords files that have been explicitly saved for use by other datasets. See To Save the Current Stopwords File for Use by Other Datasets.
  3. Select the stopwords file that you want to open, and click Load. The file open dialog closes, and the chosen stopwords list is loaded into the list on the left side of the Stopwords panel, with the name of the file immediately above it.
    The name of the stopwords file is not a stopword itself and it will not be added if you decide to add all the stopwords in the file to your current list.
    Load Stopwords Window
  4. From the newly loaded stopwords list, select the terms you want to add to the Current Stopwords list.
    • To add all the terms in the file to your new stopwords list, click in the list box that contains the terms you want to add (the one on the left) and press CTRL-A to select all terms in that list.
    • To add a single word from the file to your new stopwords list, click on the word in the list to select it.
    • To add several words from the list, select them using CTRL-click for non-contiguous words or SHIFT-click for contiguous words.
  5. Click the right arrow button to add the selected terms to the Current Stopwords list.
  6. The selected terms will display in the Current Stopwords list.

If you want to use stopwords from another stopwords file as well, click Load again and repeat the above steps.

To Save the Current Stopwords File for Use by Other Datasets

  1. On the upper right of the Stopwords panel, click Save. A file save dialog will open.
  2. Enter a name for this stopwords file.
    Warning: The name of a stopwords file must end in ".stop". The default file installed is 00000000.stop.
  3. Click Save.
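
If you generate stopwords file names in a script before saving, a small check can enforce the ".stop" ending. This is a minimal Python sketch, not an IN-SPIRE feature; the function name stopwords_filename is hypothetical, and it assumes the ending must be lower case exactly as written above.

```python
def stopwords_filename(name):
    """Return name with the required ".stop" ending, adding it if missing."""
    name = name.strip()
    if not name:
        raise ValueError("a stopwords file needs a non-empty name")
    # Assumption: the required ending is the literal lower-case ".stop".
    if name.endswith(".stop"):
        return name
    return name + ".stop"

print(stopwords_filename("shakespeare"))  # shakespeare.stop
```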

Editing the Default Stopwords File

The default stopwords file is used for all new datasets. To modify this file, you can either edit it in a text editor or use the Stopwords panel as described in Adding or Removing Words from the Stopwords List to create the desired stopwords file. Using the Stopwords panel is the preferred method, as you do not need to worry about keeping the stopwords file alphabetized. To use the Stopwords panel:

  1. Load the default stopwords file that is distributed with IN-SPIRE (Default.stop) into the Stopwords panel, as described in Using Stopwords from Another Dataset Stopwords File.
  2. Add all or some of the terms in the stopwords file to the Current Stopwords list.
  3. Add and delete terms as appropriate.
  4. Save the Current Stopwords list in the C:\Documents and Settings\<username>\INSPIRE\DatasetRoot\ folder as "00000000.stop". Be careful to use the correct number of zeroes; there are eight.

Alternatively, edit the default stopwords file in a text editor:

  1. In the IN-SPIRE directory (by default, C:\Documents and Settings\<username>\INSPIRE\DatasetRoot), find a file named 00000000.stop. This file contains the default stopwords list.
  2. Make a backup copy of 00000000.stop.
  3. Edit 00000000.stop in a text editor.
    Warning: The stopwords file is in alphabetical order and must remain so. Corrupting the stopwords file will cause problems with the datasets.
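
The backup-and-edit steps above can also be scripted so that the list stays alphabetized. The Python sketch below is not an IN-SPIRE tool and rests on several assumptions that this page does not confirm: that the file is plain UTF-8 text with one stopword per line, and that Python's default string ordering matches the alphabetical order IN-SPIRE expects. The function name add_words_sorted and the ".bak" backup suffix are hypothetical. Inspect your own 00000000.stop before relying on it.

```python
import shutil
from pathlib import Path

def add_words_sorted(stop_file, new_words):
    """Back up stop_file, merge in new_words, and rewrite it in sorted order."""
    path = Path(stop_file)

    # Step 2 of the procedure: keep a backup copy next to the original.
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)

    # Assumption: one stopword per line, blank lines ignored.
    lines = path.read_text(encoding="utf-8").splitlines()
    existing = {line.strip() for line in lines if line.strip()}

    # Merge, lower-case the additions, and sort so the file stays alphabetical.
    additions = {word.strip().lower() for word in new_words if word.strip()}
    merged = sorted(existing | additions)

    path.write_text("\n".join(merged) + "\n", encoding="utf-8")
    return merged

# Example call (path is illustrative; substitute your own DatasetRoot):
# add_words_sorted(r"C:\path\to\DatasetRoot\00000000.stop", ["thee", "thou"])
```

If the rewritten file causes problems, restore the ".bak" copy over 00000000.stop.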