Settings: Punctuation Rules
By default, IN-SPIRE ignores all punctuation. Word boundaries are defined
by white space. All punctuation, such as periods, quotes, and commas,
becomes white space or is deleted. For some analyses, however, this default
behavior does not produce the desired result. Data containing email addresses
or web pages, for example, present a challenge. By
default, IN-SPIRE will parse "someone@somewhere.com" as "someone",
"somewhere" and "com".
IN-SPIRE enables you to customize how it processes punctuation characters.
You can create a Punctuation Rules file for a particular dataset, or save
it for use in subsequent analyses.
Because the application of punctuation rules is the first step in processing
a dataset, keeping or deleting a character can have sweeping effects.
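The default behavior described at the start of this section can be approximated in a few lines of Python. This is only an illustrative sketch, not IN-SPIRE's text engine: the function name default_tokenize is hypothetical, and the sketch turns every ASCII punctuation character into white space, whereas IN-SPIRE may delete some characters instead.

```python
import string


def default_tokenize(text: str) -> list[str]:
    """Replace every ASCII punctuation character with a space, then split on white space."""
    to_space = str.maketrans({ch: " " for ch in string.punctuation})
    return text.translate(to_space).split()


print(default_tokenize("someone@somewhere.com"))
# ['someone', 'somewhere', 'com']
```

Because this step runs before any other processing, every later analysis sees three separate words rather than one address.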
Accessing the Punctuation Rules for a dataset
The Punctuation Rules are accessible from the Dataset Wizard, which you open from the Dataset Editor. To work with punctuation rules:
- From the IN-SPIRE main toolbar window, select File > Datasets... The Dataset Editor window opens.
- Click the name of the dataset of interest to select it and click Edit, or click New. The Dataset Wizard opens. If a dataset is open, you will be given the option to close it and continue, or to cancel.
- If you are editing an existing dataset, click Next until you see the Punctuation Rules
panel (for a plain text dataset, this is Step 7 of 8). If this is a new dataset, you must select a dataset name and data type to go on to the Punctuation Rules panel.
- If the Dataset Wizard is already open, use the <Back and Next> buttons to find the Punctuation Rules panel.
Editing the Punctuation Rules for a dataset
Each line in the panel at the right contains a punctuation "rule" for one character. It lists how the text engine will treat that character as it reads source documents and creates a dataset. The Beginning, Middle, and End columns show what the text engine will replace the character with when it occurs at the beginning, middle, or end of a word.
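One way to picture a rule is as three replacements per character, one for each position in a word. The sketch below models that idea in Python. It is an assumption-laden illustration rather than IN-SPIRE's implementation: the names RULES, DEFAULT_RULE, and apply_rules are hypothetical, "position" is taken to mean the position within a white-space-delimited token, and the default rule is assumed to turn punctuation into white space.

```python
import string

# Hypothetical rule table: character -> (beginning, middle, end) replacement.
# A replacement of " " turns the character into white space, "" deletes it,
# and the character itself keeps it.
RULES = {"@": (" ", "@", " ")}
DEFAULT_RULE = (" ", " ", " ")  # assumed default: punctuation becomes white space


def apply_rules(text, rules=RULES):
    """Split text into words, treating punctuation by its position in each token."""
    words = []
    for token in text.split():
        last = len(token) - 1
        pieces = []
        for i, ch in enumerate(token):
            if ch not in string.punctuation:
                pieces.append(ch)
                continue
            beginning, middle, end = rules.get(ch, DEFAULT_RULE)
            if i == 0:
                pieces.append(beginning)
            elif i == last:
                pieces.append(end)
            else:
                pieces.append(middle)
        # Replacements may have introduced spaces, so split the result again.
        words.extend("".join(pieces).split())
    return words


print(apply_rules("Write to joe@somewhere today."))
# ['Write', 'to', 'joe@somewhere', 'today']
```

In this sketch, a rule that keeps "@" only in the Middle position still discards a leading "@", and a period inside a word continues to split it unless the period has its own Middle rule.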
- To change the rule for a character, click the character in the list to select it and click Edit. The Edit Rule window opens.
- To preserve the "@" character in e-mail addresses, for example, select "@" in the Middle: box. Any "@" that occurs in the middle of a word is then preserved, so joe@somewhere is treated as a single word.
- To save your change and close the Edit Rule window, click Done.
- To save your change without closing the Edit Rule window, click Update. The Current Punctuation list reflects the change, and the Edit Rule window changes to let you add or modify further rules.
- You can now add or modify a rule either by entering the character, following the first three steps above, or by entering the ASCII code for the character.
- To enter an ASCII code, check Enter Numeric Code. The window changes so that you can enter the code; then follow the first three steps above.
- To add or modify another rule, click New. Otherwise, click Done. The Edit Rule window closes, and the changes are reflected in the Current Punctuation list.
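If you need the ASCII code for a character, for example one that is hard to type, you can look it up with a short script. The helper names below are hypothetical, and the sketch assumes the Edit Rule window expects the code in decimal, which this documentation does not state.

```python
def char_to_code(ch: str) -> int:
    """Return the decimal ASCII code of a single character."""
    if len(ch) != 1 or ord(ch) > 127:
        raise ValueError("expected a single ASCII character")
    return ord(ch)


def code_to_char(code: int) -> str:
    """Return the character for a decimal ASCII code (0-127)."""
    if not 0 <= code <= 127:
        raise ValueError("ASCII codes range from 0 to 127")
    return chr(code)


print(char_to_code("@"))   # 64
print(code_to_char(126))   # ~
```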
Adding Punctuation Rules
To add punctuation rules from an existing punctuation rules file to the Current Punctuation list:
- Click Load. A file open dialog opens to the IN-SPIRE punctuation rules directory.
- Choose the punctuation rules file that you want to open from the list, and click Load. The file open dialog closes, and the chosen punctuation list is loaded into the list on the left of the Punctuation Rules panel. The name of the punctuation file is shown immediately above the list.
- Select the punctuation rules you want to add:
- To add a single punctuation rule from the file to your Current Punctuation list, click the rule in the list to select it.
- To add several rules from the list, select them using CTRL-click for non-contiguous rows or SHIFT-click for contiguous rows.
- To add all punctuation rules in the file to your Current Punctuation list, click anywhere in the list box on the left, then press CTRL-A to select all punctuation rules in that list.
- Click to add the selected rows to the Current Punctuation list.
- The selected rules appear in the Current Punctuation list.
You can use punctuation rules from another file as well. To do so, click Load again and repeat the above steps.
Saving the Current Punctuation List
To save the Current Punctuation list for use by other datasets:
- On the upper right of the Punctuation Rules panel, click Save... A file save dialog opens.
- Enter a name for this punctuation rules file. The name of a punctuation rules file must end in ".punc".
- Click Save. The file will be saved and can be loaded later when you need it.
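If you generate or rename punctuation rules files outside IN-SPIRE, a small check such as the following can enforce the ".punc" ending. The function name with_punc_extension is hypothetical, and the sketch assumes the ending must match exactly, in lowercase.

```python
def with_punc_extension(name: str) -> str:
    """Return name unchanged if it ends in '.punc'; otherwise append '.punc'."""
    return name if name.endswith(".punc") else name + ".punc"


print(with_punc_extension("email_rules"))       # email_rules.punc
print(with_punc_extension("email_rules.punc"))  # email_rules.punc
```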
Editing the Default Punctuation Rules File
All new datasets use the default punctuation rules file. You can use
the Punctuation Rules panel to modify this default punctuation rules file.
- Load the default punctuation rules file that is distributed
with IN-SPIRE (00000000.punc) into the Punctuation Rules panel.
- Add all or some of the rules in the file to the Current
Punctuation list.
- Add and delete rules as appropriate.
- Save the Current Punctuation list as "00000000.punc" to the INSPIRE\DatasetRoot\ folder. Be careful with the number of zeroes in the filename; there should be eight of them.
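Because a miscounted zero is easy to miss, you may want to verify the filename before copying the file into place. The following sketch uses a hypothetical helper, check_default_name, and checks only the name itself, not the folder or the file contents.

```python
import re


def check_default_name(name: str) -> str:
    """Report whether name is exactly eight zeroes followed by '.punc'."""
    match = re.fullmatch(r"(0+)\.punc", name)
    if match is None:
        return "not a zero-named .punc file"
    zeroes = len(match.group(1))
    return "ok" if zeroes == 8 else f"expected 8 zeroes, found {zeroes}"


print(check_default_name("00000000.punc"))  # ok
print(check_default_name("0000000.punc"))   # expected 8 zeroes, found 7
```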