Settings: Stopwords

What are stopwords?

Stopwords are defined as non-information-bearing words. Words such as "the", "and", and "asked" are on the stopword list, so the text engine ignores them. Ignored words cannot be major terms and cannot be used in a vocabulary query. For example, if you go to the Query tool and run a "vocab" query on the phrase "he asked", you will not find any matching documents.

Particular types of data generally contain additional words that should be ignored, so IN-SPIRE allows you to edit the stopword list for a data set and to save a customized stopword list for use in subsequent analyses.
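The effect of a stopword list on a vocabulary query can be illustrated with a small Python sketch. This is not IN-SPIRE's text engine; the function names, the sample documents, and the inclusion of "he" in the list are assumptions made for illustration only.

```python
# Illustrative sketch only; IN-SPIRE's actual text engine is not shown here.
# "he" is assumed to be a stopword for the purposes of this example.
STOPWORDS = {"the", "and", "asked", "he"}


def index_terms(text, stopwords=STOPWORDS):
    """Return the lowercase words of `text` that are not stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]


def vocab_query(phrase, documents, stopwords=STOPWORDS):
    """Return the documents that contain every non-stopword term of `phrase`."""
    terms = index_terms(phrase, stopwords)
    if not terms:
        # Every word in the query was a stopword, so nothing can match.
        return []
    return [
        doc for doc in documents
        if all(term in index_terms(doc, stopwords) for term in terms)
    ]


docs = ["He asked about the harvest", "The harvest failed"]
print(vocab_query("he asked", docs))  # []
print(vocab_query("harvest", docs))   # both documents
```

Because both words of "he asked" are dropped before matching, the query has no terms left and returns nothing, which mirrors the behavior described above.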

Accessing a data set's Stopword list

The Stopword list is accessible from the Data Set Editor's Data Set Wizard, which will be open when you create a new data set. If it is not:

  1. Choose File > Data Sets... The Data Set Editor window opens.
  2. Click on the name of the data set of interest to select it, and click Edit, or click New. The Data Set Wizard opens.
    If a data set is open, you will be given the option to close it and continue, or cancel.
  3. If you are editing an existing data set, click Next until you see the Stopwords panel (for an ASCII data set, this is Step 3 of 6). If this is a new data set, you must select a data set name and data type before you can go on to the Stopwords panel.

If the Data Set Wizard is already open, use the < Back and Next > buttons to find the Stopwords panel, which is Step 3 of 6.

Adding or removing words from the Stopword list

  1. Access the data set's Stopwords panel.
    [Figure: Stopwords panel]
  2. Scroll through the list of words to review the default list. Some words in the list will look strange (e.g., "theyd"). For an explanation, see Punctuation Rules.
  3. To add one or more words to the stopword list, click Add. Enter the word(s), separated by white space, and click Add, or, if you are finished, click Done. The Add Stopwords window closes and the words appear in the Current Stopwords list, in lower case without punctuation.
    You can use CTRL-V to paste a list of words from the clipboard to the Add Stopwords window. This is useful if you want to use Gist to collect additions to the Current Stopwords list. See Gist for more information on how to use Gist for this.
    What words might you want to add? Are there frequently-occurring non-information-bearing words in your data set that are not yet on the stopword list? For example, a data set of Shakespearean documents would contain many instances of "thee", "thou", "thy", and "thine", which you would probably want an analysis to ignore. Add these to the stopword list.
  4. To add words from an existing stopword list, click Load. For details, see Using stopwords from another data set's Stopword file, below.
  5. To remove a stopword, click on a word that you wish to include in the analysis of this data set, and click Delete.
  6. To save the modified list, click Next.
  7. To reuse this list in other data sets, see To save the Current Stopwords file for use by other data sets.
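The way added words end up in the Current Stopwords list (split on white space, lower case, no punctuation) can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, and it strips all ASCII punctuation, whereas the real behavior depends on the data set's Punctuation Rules.

```python
import string


def normalize_stopwords(entry):
    """Split `entry` on white space, lowercase each word, and strip punctuation.

    Assumption: every ASCII punctuation character is removed. IN-SPIRE's
    actual handling is governed by the Punctuation Rules settings.
    """
    table = str.maketrans("", "", string.punctuation)
    words = (word.lower().translate(table) for word in entry.split())
    # Drop entries that consisted only of punctuation.
    return [word for word in words if word]


print(normalize_stopwords("Thee THOU thy, they'd"))
# ['thee', 'thou', 'thy', 'theyd']
```

Under this assumption, "they'd" becomes "theyd", which is why some entries in the default list look unusual.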

Using stopwords from another data set's Stopword file

  1. Access the Stopwords panel. On the Stopwords panel, click Load. A file open dialog opens to the IN-SPIRE Stopwords directory. The files that are accessible are Stopword files that have been explicitly saved for use by other data sets. See To save the Current Stopwords file for use by other data sets.
  2. Choose the stopwords file in the list that you want to open, and click Load. The file open dialog closes, and the chosen stopword list is loaded into the list on the left of the Stopwords panel, with the name of the file immediately above it.
    [Figure: Stopwords panel after loading a stopwords file]
  3. Choose which terms you want to add:
    To add all terms in the file to your new stopwords list:
    Click in the list box which contains the terms you want to add (it will be the one on the left).
    Press CTRL-A to select all terms in that list.

    To add a single word from the file to your new stopwords list:
    Click on the word in the list to select it.

    To add several words in the list, select them using CTRL-click for non-contiguous words, SHIFT-click for contiguous words.
  4. Click the right arrow button to add the selected terms to the current stopword list.
  5. The selected terms appear in the Current Stopwords list.

If you want to use stopwords from another stopwords file as well, click Load again and repeat the above steps.
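Conceptually, loading terms from several stopword files amounts to taking the union of the selected terms with the current list. A minimal sketch, assuming each list is simply a collection of lowercase words (the function name is hypothetical and this is not IN-SPIRE code):

```python
def combine_stopwords(current, *loaded_lists):
    """Merge the current stopwords with terms selected from loaded lists.

    Duplicates are collapsed and the result is returned in alphabetical
    order, matching the ordering the Stopwords panel maintains for you.
    """
    merged = set(current)
    for words in loaded_lists:
        merged.update(words)
    return sorted(merged)


print(combine_stopwords(["the", "and"], ["thee", "the"], ["thou"]))
# ['and', 'the', 'thee', 'thou']
```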

To save the Current Stopwords file for use by other data sets

  1. On the upper right of the Stopwords panel, click Save... A file save dialog opens.
  2. Enter a name for this stopwords file.
    Warning: The name of a stopword file must end in ".stop".
  3. Click Save.
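If you generate stopword file names in a script, the ".stop" requirement can be checked before saving. The helper below is a hypothetical sketch, not part of IN-SPIRE; it appends the extension only when it is missing.

```python
from pathlib import Path


def stopword_file_name(name):
    """Return `name` as a Path that is guaranteed to end in ".stop"."""
    path = Path(name)
    if path.suffix == ".stop":
        return path
    # Append rather than replace, so "my.list" becomes "my.list.stop".
    return path.with_name(path.name + ".stop")


print(stopword_file_name("shakespeare"))  # shakespeare.stop
print(stopword_file_name("news.stop"))    # news.stop
```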

Editing the default stopwords file

The default stopwords file is used for all new data sets. To modify this file, you can either edit it in a text editor or use the IN-SPIRE Stopwords panel, as above, to create the desired stopwords file. Using the Stopwords panel is the preferred method, as you needn't worry about keeping the stopwords file alphabetized; it's taken care of for you. To use the Stopwords panel:

  1. Load the default stopword file that is distributed with IN-SPIRE (Default.stop) into the Stopwords panel, as above.
  2. Add all or some of the terms in the file to the Current Stopwords list.
  3. Add and delete terms as appropriate.
  4. Save the Current Stopwords list in the INSPIRE\DatasetRoot\ folder as "00000000.stop". Be careful of the number of zeroes; there are 8 of them.

Alternatively, edit the default stopwords file in a text editor:

  1. In the main IN-SPIRE install directory (by default, this is C:\Program Files\INSPIRE\), find a file named 00000000.stop. This file contains the default stopword list.
  2. Make a backup copy of 00000000.stop.
  3. Edit 00000000.stop in a text editor.
    Warning: The stopword file is in alphabetical order and must remain so. Corrupting the stopword file will cause problems with the data sets.
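If you do edit the file by hand, a small script can make the backup and restore alphabetical order afterwards. The sketch below rests on assumptions the documentation does not confirm: that the file is plain text with one lowercase word per line, and that plain character-code ordering is what IN-SPIRE expects. Check both against your own 00000000.stop before relying on it. The function name is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def rewrite_sorted(stop_path):
    """Back up a stopword file, then rewrite it sorted and de-duplicated.

    Assumes one word per line; blank lines are dropped and words are
    lowercased. The backup is written next to the file with ".bak" appended.
    """
    stop_path = Path(stop_path)
    backup = stop_path.with_name(stop_path.name + ".bak")
    shutil.copy2(stop_path, backup)

    lines = stop_path.read_text().splitlines()
    words = {line.strip().lower() for line in lines if line.strip()}
    ordered = sorted(words)
    stop_path.write_text("\n".join(ordered) + "\n")
    return ordered


# Demonstration on a throwaway file rather than a real 00000000.stop.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.stop"
    demo.write_text("thou\nand\nthee\nand\n")
    print(rewrite_sorted(demo))  # ['and', 'thee', 'thou']
```

Using the Stopwords panel remains the safer route, since it keeps the list ordered without any of these assumptions.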