Settings: Stopwords
What are stopwords?
Stopwords are non-information-bearing words. Words like "the",
"and", and "asked" are in the stopword list, so the text engine
ignores them. Words that are ignored can't be major terms
or be used in a vocabulary query. For example, if you go to the Query tool
and do a "vocab" query on the phrase "he asked",
you will not find any matching documents.
Particular types of data will generally have other additional words that
should be ignored, so IN-SPIRE allows you to edit the stopword list for
a data set, and to save a customized stopword list for use in subsequent
analyses.
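The effect of a stopword list on a vocabulary query can be sketched in a few lines of Python. This is only an illustration, not IN-SPIRE's text engine: the function name `searchable_terms` is hypothetical, the list is a tiny stand-in for the real default list, and "he" is assumed to be a stopword because the "he asked" query above finds nothing.

```python
# Illustrative stand-in for a stopword list (not IN-SPIRE's default list).
# "the", "and", and "asked" come from the text above; "he" is an assumption.
STOPWORDS = {"the", "and", "asked", "he"}

def searchable_terms(text, stopwords=STOPWORDS):
    """Return the words of `text` that remain after stopwords are ignored."""
    return [word for word in text.lower().split() if word not in stopwords]

# Every word in the phrase is a stopword, so nothing is left to match.
print(searchable_terms("he asked"))  # []

# Only the information-bearing words survive.
print(searchable_terms("the senator asked and the witness answered"))
# ['senator', 'witness', 'answered']
```

A query made up entirely of stopwords reduces to an empty list, which is why it cannot match any document.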
Accessing a data set's Stopword list
The Stopword list is accessible from the Data Set Editor's Data Set Wizard,
which will be open when you create a new data set. If it is not:
- Choose File > Data Sets... The Data Set Editor window opens.
- Click on the name of the data set of interest to select it, and click
Edit, or click New. The Data Set Wizard opens.
If a data set is open, you will be given the option to close it and
continue, or cancel.
- If you are editing an existing data set, click Next until
you see the Stopwords panel (for an ASCII data set, this is Step 3 of
6). If this is a new data set, you must select a data set name and data
type before you can go on to the Stopwords panel.
If the Data Set Wizard is already open, use the < Back and Next >
buttons to find the Stopwords panel, which is Step 3 of 6.
Adding or removing words from the Stopword list
- Access the data set's Stopwords panel.
- Scroll through the list of words to review the default list. Some
words in the list will look strange (e.g., "theyd"). For an
explanation, see Punctuation Rules.
- To add one or more words to the stopword list, click Add.
Enter the word(s), separated by white space, and click Add, or, if you
are finished, click Done. The Add Stopwords window closes and
the words appear in the Current Stopwords list, in lower case without
punctuation.
You can use CTRL-V to paste a list of words from the clipboard to the
Add Stopwords window. This is useful if you want to use Gist to collect
additions to the Current Stopwords list. See Gist
for more information on how to use Gist for this.
What words might you want to add? Are there frequently-occurring non-information-bearing
words in your data set that are not yet on the stopword list? For example,
a data set of Shakespearean documents would contain many instances of
"thee", "thou", "thy", and "thine",
which you would probably want an analysis to ignore. Add these to the
stopword list.
- To add words from an existing stopword list, click Load. For
details, see Combining Stopword
Files, below.
- To remove a stopword, click on a word that you wish to include in
the analysis of this data set, and click Delete.
- To save the modified list, click Next.
- To reuse the modified list in other data sets, see To save the Current
Stopwords file for use by other data sets, below.
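The Add step above stores words "in lower case without punctuation", which is also why entries such as "theyd" appear in the default list. A minimal Python sketch of that kind of normalization follows. It assumes that all ASCII punctuation is simply dropped; IN-SPIRE's actual behavior is governed by its Punctuation Rules, and the function name `normalize_stopwords` is hypothetical.

```python
import string

def normalize_stopwords(raw):
    """Split pasted text on white space, lowercase it, and drop punctuation.

    Assumption: every ASCII punctuation character is removed outright.
    """
    strip_punctuation = str.maketrans("", "", string.punctuation)
    words = (word.lower().translate(strip_punctuation) for word in raw.split())
    # Discard entries that were nothing but punctuation.
    return [word for word in words if word]

print(normalize_stopwords("Thee THOU thy, thine They'd"))
# ['thee', 'thou', 'thy', 'thine', 'theyd']
```

Running a pasted list through a step like this before comparing it with the Current Stopwords list makes it easier to spot words that are already present.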
Combining Stopword Files
- Access the Stopwords panel and click Load.
A file open dialog opens to the IN-SPIRE Stopwords directory. The only files
available are stopword files that have been explicitly saved
for use by other data sets. See To
save the Current Stopwords file for use by other data sets.
- Choose the stopwords file in the list that you want to open, and click
Load. The file open dialog closes, and the chosen stopword
list is loaded into the list on the left of the Stopwords panel, with
the name of the file immediately above it.
- Choose which terms you want to add:
To add all terms in the file to your new stopwords list:
Click in the list box which contains the terms you want to add (it will
be the one on the left).
Press CTRL-A to select all terms in that list.
To add a single word from the file to your new stopwords list:
Click on the word in the list to select it.
To add several words in the list, select them using CTRL-click for non-contiguous
words, SHIFT-click for contiguous words.
- Click the add button to add the selected terms to the Current Stopwords list.
- The selected terms appear in the Current Stopwords list.
If you want to use stopwords from another stopwords file as well, click
Load again and repeat the above steps.
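Conceptually, combining stopword files amounts to taking the union of several word lists and keeping the result alphabetized. The sketch below shows that idea in Python. It is not how IN-SPIRE implements the Stopwords panel; the function name `combine_stopwords` and the sample lists are hypothetical.

```python
def combine_stopwords(current, *loaded_lists):
    """Merge the current stopword list with any number of loaded lists.

    Duplicates are dropped and the result is returned in alphabetical order.
    """
    combined = set(current)
    for words in loaded_lists:
        combined.update(words)
    return sorted(combined)

current = ["and", "the"]
shakespeare = ["thee", "thou", "the"]   # hypothetical saved list
interviews = ["asked"]                  # hypothetical saved list

print(combine_stopwords(current, shakespeare, interviews))
# ['and', 'asked', 'the', 'thee', 'thou']
```

Using a set means that a word present in more than one file, such as "the" here, appears only once in the combined list.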
To save the Current Stopwords file for use by other data sets
- On the upper right of the Stopwords panel, click Save...
A file save dialog opens.
- Enter a name for this stopwords file.
The name of a stopword file must end in ".stop".
- Click Save.
Editing the default stopwords file
The default stopwords file is used for all new data sets. To modify this
file, you can either edit it in a text editor or use the IN-SPIRE Stopwords
panel, as above, to create the desired stopwords file. Using the Stopwords
panel is the preferred method, as you needn't worry about keeping the
stopwords file alphabetized; it's taken care of for you. To use the Stopwords
panel:
- Load the default stopword file that is distributed with IN-SPIRE (Default.stop)
into the Stopwords panel, as above.
- Add all or some of the terms in the file to the Current Stopwords
list.
- Add and delete terms as appropriate.
- Save the Current Stopwords list in the INSPIRE\DatasetRoot\ folder
as "00000000.stop". Be careful of the number of zeroes; there
are 8 of them.
Alternatively, edit the default stopwords file in a text editor:
- In the main IN-SPIRE install directory (by default, this is C:\Program
Files\INSPIRE\), find a file named 00000000.stop. This file contains
the default stopword list.
- Make a backup copy of 00000000.stop.
- Edit 00000000.stop in a text editor.
The stopword file is in alphabetical order and must remain so. Corrupting
the stopword file will cause problems with the data sets.
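If you do edit the file by hand, a quick check of the ordering before you start IN-SPIRE can catch mistakes. The following Python sketch shows one way to do that. It rests on assumptions that are not stated in this documentation: that the file is plain text with one word per line, that it can be read as UTF-8, and that "alphabetical" matches Python's default string ordering. The function names are hypothetical.

```python
from pathlib import Path

def out_of_order(words):
    """Return (position, word) pairs for words that sort before their predecessor.

    Positions are 1-based, so they match line numbers when the file holds
    one word per line with no blank lines.
    """
    return [
        (position, current)
        for position, (previous, current) in enumerate(zip(words, words[1:]), start=2)
        if current < previous
    ]

def check_stopword_file(path):
    """Read a stopword file (assumed: one word per line) and report misordered words."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    words = [line.strip() for line in lines if line.strip()]
    return out_of_order(words)

print(out_of_order(["and", "asked", "the"]))  # []
print(out_of_order(["and", "the", "asked"]))  # [(3, 'asked')]
```

An empty result means the words are in order under these assumptions. Run the check on your edited copy and keep the backup from the previous step until IN-SPIRE has loaded the new file successfully.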