Settings: Punctuation Rules

By default, IN-SPIRE ignores all punctuation. Word boundaries are defined by white space. All punctuation, such as periods, quotes, and commas, becomes white space or is deleted. For some analyses, however, this default behavior does not produce the desired result. Data containing email addresses or web pages, for example, present a challenge. By default, IN-SPIRE will parse "someone@somewhere.com" as "someone", "somewhere" and "com".
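The default behavior can be approximated with a short Python sketch. It is an illustration only, not IN-SPIRE's text engine: the function name `default_tokenize` is hypothetical, Python's `string.punctuation` stands in for IN-SPIRE's punctuation set, and every punctuation character is replaced with a space, whereas IN-SPIRE may delete some characters instead.

```python
import string


def default_tokenize(text: str) -> list[str]:
    """Approximate the default: punctuation becomes white space, then split."""
    # Assumption: string.punctuation stands in for IN-SPIRE's real punctuation
    # set, and every punctuation character is turned into a single space.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.translate(table).split()


print(default_tokenize("someone@somewhere.com"))
# ['someone', 'somewhere', 'com']
```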

IN-SPIRE enables you to customize how it processes punctuation characters. You can create a Punctuation Rules file for a particular dataset, or save it for use in subsequent analyses.

Because the application of punctuation rules is the first step in processing a dataset, keeping a character or deleting a character can have sweeping effects.

Accessing the Punctuation Rules for a dataset

The Punctuation Rules are accessible from the Dataset Wizard, which you open from the Dataset Editor. To work with punctuation rules:

  1. From the IN-SPIRE main toolbar window, select File > Datasets... The Dataset Editor window opens.
  2. Click the name of the dataset of interest to select it and click Edit, or click New to create a dataset. The Dataset Wizard opens.
    If a dataset is open, you will be given the option to close it and continue, or cancel.
  3. If you are editing an existing dataset, click Next until you see the Punctuation Rules panel (for a plain text dataset, this is Step 7 of 8). If this is a new dataset, you must select a dataset name and data type to go on to the Punctuation Rules panel.
    [Figure: Dataset Punctuation Rules window]
  4. If the Dataset Wizard is already open, use the < Back and Next > buttons to find the Punctuation Rules panel.

Editing the Punctuation Rules for a dataset

Each line in the panel at the right contains a punctuation "rule" for one character. It lists how the text engine will treat that character as it reads source documents and creates a dataset. The columns Beginning, Middle, and End show what the text engine will replace the character with when it occurs at the beginning, middle, or end of a word.
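One way to picture such a rule is as three replacement strings keyed by position. The following Python sketch is an assumption-laden model, not IN-SPIRE's implementation: the `Rule` class and `apply_rules` function are hypothetical, a "word" is taken to be a white-space-delimited token, and only the first and last characters of a token count as its beginning and end.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Replacement for one character by position in a word (hypothetical model)."""
    beginning: str = " "
    middle: str = " "
    end: str = " "


def apply_rules(text: str, rules: dict[str, Rule]) -> list[str]:
    """Apply position-dependent rules to each white-space-delimited token."""
    words = []
    for token in text.split():
        last = len(token) - 1
        out = []
        for i, ch in enumerate(token):
            rule = rules.get(ch)
            if rule is None:
                out.append(ch)              # no rule: keep the character
            elif i == 0:
                out.append(rule.beginning)  # first character of the token
            elif i == last:
                out.append(rule.end)        # last character of the token
            else:
                out.append(rule.middle)     # anywhere in between
        # Replacements may introduce spaces, so split the result again.
        words.extend("".join(out).split())
    return words


# Keep "@" only in the middle of a word; "." and "," always become a space.
rules = {"@": Rule(middle="@"), ".": Rule(), ",": Rule()}
print(apply_rules("Write to joe@somewhere, not @joe.", rules))
# ['Write', 'to', 'joe@somewhere', 'not', 'joe']
```

In this sketch, leaving the "." rule at its default would still split "joe@somewhere.com" into "joe@somewhere" and "com"; keeping the whole address as one word would also require preserving "." in the Middle position.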

  1. To change the rule for a character, click on the character in the list to select it and click Edit. The Edit Rule window opens.
    [Figure: Punctuation Rules Edit window]
  1. To preserve the "@" character in email addresses, for example, select "@" in the Middle: box. Any "@" that occurs in the middle of a word is then preserved, so joe@somewhere is treated as a single word.
  2. To save your change and close the Edit Rule window, click Done.
  3. To save your change without closing the Edit Rule window, click Update. The Current Punctuation list reflects the change and the Edit Rule window becomes:
    [Figure: Edit Rule window]
  4. You can now add or modify rules by either:
    1. Entering the character, following steps 1-3 above; or
    2. Entering the ASCII code for the character.
  5. If you check the Enter Numeric Code box, the window changes to:
    [Figure: Enter Numeric Code window]
    Follow steps 1-3 above.
  6. To add or modify another rule, click New. Otherwise, click Done. The Edit Rule window closes and the changes are reflected in the Current Punctuation list.
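If you need to enter a character by its ASCII code, the code can be looked up with Python's built-in `ord` function. The helper name `ascii_code` below is hypothetical, and the sketch assumes the Enter Numeric Code field expects a decimal value, which this page does not state.

```python
def ascii_code(ch: str) -> int:
    """Return the decimal ASCII code of a single character."""
    code = ord(ch)  # raises TypeError unless ch is exactly one character
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code


for ch in ["@", ".", "-", "_"]:
    print(repr(ch), ascii_code(ch))
# '@' 64
# '.' 46
# '-' 45
# '_' 95
```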

Adding Punctuation Rules

To add punctuation rules from an existing punctuation rules file to the Current Punctuation list:

  1. Click Load. A file open dialog opens to the IN-SPIRE punctuation rules directory.
  2. Choose the punctuation rules file in the list that you want to open, and click Load. The file open dialog closes, and the chosen punctuation list is loaded into the list on the left of the Punctuation Rules panel. The name of the punctuation file is shown immediately above the list.
    [Figure: Punctuation Rules window - Adding Rules]
  3. Select the punctuation rules you want to add.
  4. Click the right arrow button to add the selected rows to the Current Punctuation list.
  5. The selected rules appear in the Current Punctuation list.

You can also use punctuation rules from another file. To do so, click Load again and repeat the steps above.

Saving the Current Punctuation List

To save the Current Punctuation list for use by other datasets:

  1. On the upper right of the Punctuation Rules panel, click Save... A file save dialog opens.
  2. Enter a name for this punctuation rules file.
    Caution: The name of a punctuation rules file must end in ".punc".
  3. Click Save. The file will be saved and can be loaded later when you need it.
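If you generate or rename punctuation rules files with a script, a small check can guard against a missing extension. This is a minimal sketch: `with_punc_suffix` is a hypothetical helper, not part of IN-SPIRE, and it assumes the extension must be lowercase, which this page does not confirm.

```python
def with_punc_suffix(name: str) -> str:
    """Return name unchanged if it ends in ".punc", else append the extension."""
    # Assumption: the required extension is exactly ".punc" in lowercase.
    return name if name.endswith(".punc") else name + ".punc"


print(with_punc_suffix("email_rules"))       # email_rules.punc
print(with_punc_suffix("email_rules.punc"))  # email_rules.punc
```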

Editing the Default Punctuation Rules File

All new datasets use the default punctuation rules file. You can use the Punctuation Rules panel to modify this default punctuation rules file.

  1. Load the default punctuation rules file that is distributed with IN-SPIRE (00000000.punc) into the Punctuation Rules panel.
  2. Add all or some of the rules in the file to the Current Punctuation list.
  3. Add and delete rules as appropriate.
  4. Save the Current Punctuation list as "00000000.punc" to the INSPIRE\DatasetRoot\ folder.

    Caution: Be careful of the number of zeroes in the filename; there should be eight of them.
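Because a miscounted zero is easy to overlook, the filename can be checked programmatically before saving. The function `is_default_rules_name` is a hypothetical helper written for this illustration; it only compares the name against eight zeroes followed by ".punc".

```python
DEFAULT_RULES_NAME = "0" * 8 + ".punc"  # eight zeroes, then the extension


def is_default_rules_name(filename: str) -> bool:
    """Return True only for the exact default punctuation rules filename."""
    return filename == DEFAULT_RULES_NAME


print(is_default_rules_name("00000000.punc"))  # True
print(is_default_rules_name("0000000.punc"))   # False (only seven zeroes)
```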