By default, IN-SPIRE ignores all punctuation. Word boundaries are defined
by white space. All punctuation, such as periods, quotes, and commas,
becomes white space or is deleted. For some analyses, however, this default
behavior does not produce the desired result. Data containing email addresses
or web pages, for example, present a challenge. By
default, IN-SPIRE will parse "someone@somewhere.com" as "someone",
"somewhere" and "com".
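The default behavior described above can be approximated with a few lines of Python. This is only a sketch, not IN-SPIRE's text engine: the function name `default_tokenize` is hypothetical, and the sketch turns every ASCII punctuation character into white space, whereas IN-SPIRE may delete some characters instead.

```python
import string


def default_tokenize(text: str) -> list[str]:
    """Approximate the default rules: punctuation becomes white space,
    and words are whatever remains between runs of white space."""
    # Assumption: every ASCII punctuation character maps to a single space.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.translate(table).split()


print(default_tokenize("someone@somewhere.com"))
# ['someone', 'somewhere', 'com']
```

The email address is split into three unrelated words, which is why a custom rule for "@" is often worth adding.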
IN-SPIRE enables you to customize how it processes punctuation characters.
You can create a Punctuation Rules file for a particular dataset, or save
it for use in subsequent analyses.
Caution: Because
the application of punctuation rules is the first step in processing a
dataset, keeping a character or deleting a character can have sweeping
effects.
Accessing the Punctuation Rules for a dataset
The Punctuation Rules are accessible from the Dataset Wizard, which you
open from the Dataset Editor.
To work with punctuation rules
From the IN-SPIRE main toolbar window, select File
> Datasets... The Dataset Editor window opens.
To edit an existing dataset, click its name to select it and then
click Edit. To create a dataset, click New. The Dataset
Wizard opens. Note: If a dataset is open, you will be given the
option to close it and continue, or cancel.
If you are editing an existing dataset, click Next until you see the Punctuation Rules
panel (for an ASCII dataset, this is Step 7 of 8). If this is a new
dataset, you must select a dataset name and data type to go on to
the Punctuation Rules panel.
If the Dataset Wizard is already open, use the
< Back and Next > buttons to find the Punctuation Rules panel.
Editing the Punctuation Rules for a dataset
Each line in the panel at the right contains a punctuation "rule"
for one character. It lists how the text engine will treat that character
as it reads source documents and creates a dataset. The Begin, Middle,
and End columns show what the text engine will replace the character
with when it occurs at the beginning, middle, or end of a word.
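One way to picture the Begin, Middle, and End columns is as a lookup table keyed by character. The sketch below is a simplified model under stated assumptions, not IN-SPIRE's implementation: the `RULES` table and `apply_rules` function are hypothetical, and a character's position is judged relative to the white-space-delimited chunk it appears in.

```python
# Hypothetical model: character -> (begin, middle, end) replacements.
# " " turns the character into white space, "" deletes it,
# and the character itself preserves it.
RULES = {
    "@": (" ", "@", " "),  # keep "@" only when it is inside a word
    ".": (" ", " ", " "),  # periods always become white space
}


def apply_rules(text: str, rules: dict[str, tuple[str, str, str]]) -> list[str]:
    """Apply positional punctuation rules, then split into words."""
    words = []
    for chunk in text.split():
        last = len(chunk) - 1
        out = []
        for i, ch in enumerate(chunk):
            if ch not in rules:
                out.append(ch)
                continue
            begin, middle, end = rules[ch]
            if i == 0:
                out.append(begin)
            elif i == last:
                out.append(end)
            else:
                out.append(middle)
        # Replacements may have introduced white space, so split again.
        words.extend("".join(out).split())
    return words


print(apply_rules("joe@somewhere.com", RULES))
# ['joe@somewhere', 'com']
```

With "@" preserved in the middle position, `joe@somewhere` survives as one word, while a leading or trailing "@" is still discarded.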
To change the rule for a character, click on the
character in the list to select it and click Edit.
The Edit Rule window opens.
To preserve the "@" character in e-mail
addresses, select the "@" from the Middle: box, so that any
"@" that occurs in the middle of a word will be preserved, resulting
in joe@somewhere being considered all one word.
To save your change and close the Edit Rule window,
click Done.
To save your change without closing the Edit Rule
window, click Update.
The Current Punctuation list reflects the change, and the Edit Rule window
updates so that you can continue editing.
You can now add or modify rules in either of two ways:
Enter the character, following steps 1-3 above;
or
Enter the ASCII code for the character.
Check the Enter Numeric Code box so that the window accepts
a numeric code, then follow steps 1-3 above.
To add/modify another rule, click New.
Otherwise click Done.
The Edit Rule window closes and changes are reflected in the Current
Punctuation list.
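If you plan to enter a rule by numeric code, a short Python snippet can look the code up for you. The helper name `ascii_code` is hypothetical, and it is an assumption that the Enter Numeric Code field expects the decimal value; check the field's behavior before relying on it.

```python
def ascii_code(ch: str) -> int:
    """Return the decimal ASCII code of a single character."""
    if len(ch) != 1 or ord(ch) > 127:
        raise ValueError(f"{ch!r} is not a single ASCII character")
    return ord(ch)


for ch in "@_-":
    print(repr(ch), ascii_code(ch))
# '@' 64
# '_' 95
# '-' 45
```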
Adding Punctuation Rules
To add punctuation rules from an existing punctuation rules file to
the Current Punctuation list
Click Load.
A file open dialog opens to the IN-SPIRE punctuation rules directory.
Choose the punctuation rules file in the list that
you want to open, and click Load.
The file open dialog closes, and the chosen punctuation list is loaded
into the list on the left of the Punctuation Rules panel. The name of
the punctuation file is shown immediately above the list.
Select the punctuation rules you want to add:
To add a single punctuation rule from the file
to your Current Punctuation list, click on the rule in the list to select
it.
To add several rules in the list, select them
using Ctrl-click for non-contiguous rows, Shift-click for contiguous rows.
To add all punctuation rules in the file to your
Current Punctuation list, click anywhere in the list box on the left,
then press Ctrl-A
to select all punctuation rules in that list.
Click the add button to copy the selected rows to the
Current Punctuation list.
The selected rules appear in the Current Punctuation
list.
You can use punctuation rules from another file as well. To do so, click
Load again and repeat the above steps.
Saving the Current Punctuation List
To save the Current Punctuation list for use by other datasets
On the upper right of the Punctuation Rules panel,
click Save... A file
save dialog opens.
Enter a name for this punctuation rules file. Caution: The
name of a punctuation rules file must end in ".punc".
Click Save.
The file will be saved and can be loaded
later when you need it.
Editing the Default Punctuation Rules File
All new datasets use the default punctuation rules file. You can use
the Punctuation Rules panel to modify this default punctuation rules file.
Load the default punctuation rules file that is distributed
with IN-SPIRE (00000000.punc) into the Punctuation Rules panel.
Add all or some of the rules in the file to the Current
Punctuation list.
Add and delete rules as appropriate.
Save the Current Punctuation list as "00000000.punc" to the
INSPIRE\DatasetRoot\ folder.
Caution: Be
careful of the number of zeroes in the filename; there should be eight
of them.
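Because a miscounted zero is easy to miss by eye, a small check such as the following can confirm a file name before you save. The function names are hypothetical, and the comparison is deliberately exact (case-sensitive); this is a sketch and not part of IN-SPIRE.

```python
RULES_EXTENSION = ".punc"
DEFAULT_RULES_NAME = "0" * 8 + RULES_EXTENSION  # eight zeroes, then ".punc"


def is_valid_rules_filename(name: str) -> bool:
    """A punctuation rules file name must end in ".punc"."""
    return name.endswith(RULES_EXTENSION) and len(name) > len(RULES_EXTENSION)


def is_default_rules_filename(name: str) -> bool:
    """The default rules file is named with exactly eight zeroes."""
    return name == DEFAULT_RULES_NAME


print(is_default_rules_filename("00000000.punc"))  # True
print(is_default_rules_filename("0000000.punc"))   # False (only seven zeroes)
```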