Settings: Stopmajors
IN-SPIRE automatically determines which words from your dataset are
best for discriminating one document from another. These words are called
"major terms", and most of the time they are also the best words for describing
the topical content of your documents. The process of determining major terms is
not perfect, however, so IN-SPIRE gives you the ability to prevent a word
from becoming a major term. You can create a customized Stopmajor list
for a particular dataset, and save it for use in subsequent analyses.
Influencing which Words can be Major Terms
One way to stop some words from being major terms is to add them to
the stopword list, but doing so effectively eliminates any reference to
them: they will not appear in the Summary tool and cannot be searched for. The Stopmajor list identifies
terms that you want to be available for Summary and Search, but do not
want considered when IN-SPIRE is determining what major terms are used
for creating the document "signatures".
Adding a term to the Stopmajor list prevents it from influencing the
clustering of the documents and from appearing as a ThemeView Classic peak label,
since it is not part of the document's mathematical signature. However,
because the Stopmajor list affects the complex statistics and term relationships
that are the foundation of the IN-SPIRE text processing, it should be
used sparingly.
Like Stopwords and Punctuation Rules, a Stopmajor list is associated with a dataset.
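The difference between the two lists can be illustrated with a small sketch. This is a toy model written for this page, not IN-SPIRE code: the function name `classify_terms` and the sample words are hypothetical, and the real processing relies on statistics that are not modeled here.

```python
def classify_terms(vocabulary, stopwords, stopmajors):
    """Toy model of the stopword / stopmajor distinction (not IN-SPIRE code)."""
    vocabulary = set(vocabulary)

    # Stopwords are dropped entirely: no Summary entry, no searching.
    searchable = vocabulary - set(stopwords)

    # Stopmajors stay searchable but are never candidates for major terms.
    major_candidates = searchable - set(stopmajors)
    return searchable, major_candidates


searchable, candidates = classify_terms(
    vocabulary=["the", "report", "reactor", "copyright"],
    stopwords=["the"],
    stopmajors=["copyright"],
)
print(sorted(searchable))   # ['copyright', 'reactor', 'report']
print(sorted(candidates))   # ['reactor', 'report']
```

In this sketch "copyright" can still be found by a search, but it cannot become a major term or a peak label.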
Accessing the Stopmajor List
The Stopmajor list is accessible from the Dataset Wizard, which you open from the Dataset Editor.
- From the IN-SPIRE main toolbar window select File > Datasets. The Dataset Editor window opens.
- To edit an existing dataset, click its name to select it and then click Edit. To create a dataset, click New. In either case the Dataset Wizard opens.
If a dataset is open, you will be given the option to close it and continue or cancel.
- If you are editing an existing dataset, click Next until you see the Stopmajor list panel. If this is a new dataset, you must select a dataset name and data type to go on to the Stopmajor panel.
If the Dataset Wizard is already open, use the < Back and Next >
buttons to find the Stopmajor panel.
Adding or Removing Words from the Stopmajor List
- Access the Stopmajor panel.
- Review the Stopmajor list. There are no Stopmajor
words unless you add some.
- To add a word, click Add. The Add Stopmajors window opens.
- Enter one or more terms, separated by spaces, and click Add. Notice that the terms will appear in alphabetical order in the Current Stopmajors panel of the Stopmajors window. When all the terms you want to identify as stopmajors appear there, click Done on the Add Stopmajors window to close it.
- Alphanumeric tokens are words that contain both letters and numbers, for example, "32B". To ensure that no alphanumeric words become major terms, click on the checkbox by Automatically use alphanumeric tokens as stopmajors.
- To remove a word from the Stopmajor list, click to select the word you want to remove, and then click Delete. You can delete several words at a time. See Selecting multiple items for how to select several items from a list.
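The definition of an alphanumeric token given above can be expressed as a short check. This is a minimal sketch of that definition only; the function name `is_alphanumeric_token` is hypothetical, and how IN-SPIRE itself treats tokens containing punctuation (for example "A-10") is not stated here, so the behavior for such tokens is an assumption.

```python
def is_alphanumeric_token(token: str) -> bool:
    """Return True if the token contains at least one letter and one digit."""
    has_letter = any(ch.isalpha() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    return has_letter and has_digit


print(is_alphanumeric_token("32B"))      # True
print(is_alphanumeric_token("reactor"))  # False
print(is_alphanumeric_token("2024"))     # False
```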
Using Words from Another Dataset Stopmajor List
- Access the Stopmajor panel. On the Stopmajor panel, click the Load... button. A file chooser dialog opens to the DatasetRoot\Stopwords\ directory. The Stopmajor files accessible here are those that have been explicitly saved for use by other datasets. See Saving the Current Stopmajor List for Use by Other Datasets.
- Choose a Stopmajor file and click Load. The terms in the file are listed in the panel to the left, with the file name directly above.
- Choose which terms you want to add by selecting (highlighting) them:
- To add all terms in the file to your new Stopmajor list, click in the left hand panel, which contains the list of terms you can add, and press Ctrl-A to select all
terms in that list.
- To add a single word from the file to your new Stopmajor list, click on the word
in the list to select it.
- To select several terms from the list so that you can add them all simultaneously,
see Selecting Items from a List.
You may notice a line like "*** webdata.stop_major ***" in the list
of words. Do not be concerned; this line will never match a word
in your dataset. Do include it in the words you copy to the Current
Stopmajor list in the next step.
- Click to add the selected terms to the Current Stopmajor list.
- The selected terms appear in the Current Stopmajor list.
Saving the Current Stopmajor List for Use by Other Datasets
- On the upper right of the Stopmajor panel, click the Save... button. A file save
dialog opens.
- Enter a name for this stopmajor file.
The extension ".stop_major" will be automatically added to the filename you enter.
- Click Save. The file is saved in the Stopwords directory. To use this saved stopmajor file with any dataset, see Using Words from Another Dataset Stopmajor List (above).
Editing the Default Stopmajor File
The default Stopmajor file is used for all new datasets. To modify this
file, you can either edit it in a text editor, or use the IN-SPIRE Stopmajor
panel, as above, to create the desired Stopmajor file. Using the Stopmajor
panel is the preferred method, as you needn't worry about keeping the
stopmajor file alphabetized; it's taken care of for you. To use the Stopmajor
panel:
- Load the default stopmajor file that is distributed
with IN-SPIRE (00000000.stop_major) into the Stopmajor panel, as above.
- Add all or some of the terms in the file to the Current
Stopmajor list.
- Add and delete terms as appropriate.
- Save the Current Stopmajor list in the C:\Documents and Settings\<username>\INSPIRE\DatasetRoot
folder as "00000000.stop_major".
Be careful of the number of zeroes; there are eight of them.
Alternatively, edit the default Stopmajor file in a text editor:
- In the main INSPIRE\DatasetRoot directory (default is C:\Documents and Settings\<username>\INSPIRE\DatasetRoot), find a file named 00000000.stop_major.
This file contains the default stopmajor list.
- Make a backup copy of 00000000.stop_major.
- Edit 00000000.stop_major in a text editor such as
Notepad. If you edit the file with MS Word, make sure you save it as text.
The stopmajor file is in alphabetical order and must remain so. Corrupting
the stopmajor file will cause problems with the datasets.
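If you edit the file by hand, a small script can make the backup copy and check the ordering for you. The following is a minimal sketch under several assumptions that this page does not confirm: that the file holds one term per line, that it is readable as UTF-8 text, and that "alphabetical" means plain character-by-character string comparison. The function names and the ".bak" suffix are hypothetical; verify the results against a file written by the Stopmajor panel before relying on them.

```python
import shutil
from pathlib import Path


def out_of_order_lines(terms):
    """Return (line_number, term) pairs where a term sorts before its predecessor."""
    problems = []
    for i in range(1, len(terms)):
        if terms[i] < terms[i - 1]:
            problems.append((i + 1, terms[i]))
    return problems


def check_stopmajor_file(path):
    """Back up the file, then report entries that break the assumed ordering."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # backup copy, made before any hand edits

    # Assumption: one term per line; blank lines are ignored.
    text = path.read_text(encoding="utf-8")
    terms = [line.strip() for line in text.splitlines() if line.strip()]
    return out_of_order_lines(terms)
```

An empty result means no entry was found out of order under these assumptions. For example, `check_stopmajor_file("00000000.stop_major")` run from the DatasetRoot directory would first write `00000000.stop_major.bak` next to the original.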