Datasets:  Fields

Fields are sections of dataset documents whose boundaries are marked by delimiters. You define (describe) those delimiters so that IN-SPIRE can recognize the fields. A delimiter can be almost anything that enables IN-SPIRE to find the chunk of text consistently.  If no fields are defined, all the text in a document is lumped into a single field, which IN-SPIRE then uses for clustering.

When to Define Fields

Defining fields is optional. However, if sections of your dataset's documents are labeled so that they can be defined as fields, defining them lets you refine your analysis. Define fields when you want to do any of the following actions.

 

What IN-SPIRE Needs to Know About Fields

 

It is not necessary to define a field for every labeled part of the document. If you do not define fields, IN-SPIRE will lump together the labeled parts. Depending on the structure of the data, this can introduce "noise," obscuring the analysis.
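The effect of lumping labeled parts together can be illustrated outside IN-SPIRE with a small Python sketch. Everything in it is hypothetical: the sample documents, the SOURCE, TITLE, and BODY labels, and the helper functions are made up for illustration and are not part of IN-SPIRE. It only assumes that undefined labeled parts end up in the same text as the body.

```python
from collections import Counter

# Hypothetical documents; labels and layout are invented for this sketch.
DOCS = [
    "SOURCE : Daily Wire\nTITLE : River levels rise\n"
    "BODY : Heavy rain raised river levels overnight.",
    "SOURCE : Daily Wire\nTITLE : Council vote\n"
    "BODY : The council approved the new budget.",
]

def words(text):
    """Lowercase the words in text and drop surrounding punctuation."""
    cleaned = (w.strip(".,:").lower() for w in text.split())
    return [w for w in cleaned if w]

def body_only(doc):
    """Return only the text that follows the 'BODY :' label on its line."""
    for line in doc.splitlines():
        if line.startswith("BODY :"):
            return line[len("BODY :"):].strip()
    return ""

# Without fields: every labeled part is lumped into one text.
lumped = Counter(w for d in DOCS for w in words(d))
# With a field defined for the body: only the body text is counted.
fielded = Counter(w for d in DOCS for w in words(body_only(d)))

print(lumped["daily"], fielded["daily"])
```

In the lumped counts, the repeated source name and the label words appear in every document, which is the kind of "noise" described above. In the fielded counts they do not appear at all.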

 

Note:  For certain types of datasets, such as ASCII and XML, you can define fields during the initial dataset creation.  For other datasets, such as Google Harvest, you can define fields once the initial dataset has been created.

Defining Dataset Fields

You can define fields either while you are creating a new dataset or by editing an existing dataset.  The following steps edit an existing dataset.

 

  1. From the IN-SPIRE File menu select Datasets.  The Dataset Editor will display.

  2. From the Dataset Editor window, select a dataset from the list and click the Edit button.  The Dataset Wizard will display.

  3. Click Next> to navigate through the steps until you reach the Format Fields step.  For the example shown, a Google Harvest dataset is being edited and Format Fields appears at Step 5 of the Dataset Wizard.  The step number can differ depending on the type of dataset you are creating or editing.

  4. From the Format Fields step, select a Field Name from the list and click the Add button. The Field Properties window will display.

  5. From the Field Properties window, edit the field.  You can use the Source Preview pane at the bottom of the Field Properties window to look at the dataset documents while you create a field.

  6. Choose a descriptive name for the Field Name. When a document is displayed in the Document Viewer, this field will be labeled with the field name.

  7. Enter the Field Delimiter text that always marks this field in the dataset documents. In the above example, the Field Delimiter is TITLE :. Enter it exactly as it appears. Spaces and punctuation are significant; in this example, the colon after the word 'title' is part of the delimiter.

  8. To ensure that the Field Delimiter you enter exactly matches what is in your documents, copy the text from the Source Preview pane.  To copy text quickly to the Field Name, Field Delimiter, or End Delimiter fields, highlight the text you want to use in the Source Preview pane.

  9. Right-click on the selected text.  A drop-down menu will display.

  10. From the drop-down menu, select the fields to which you want to copy the text.

  11. If you have not already selected a delimiter Rule for your field, select one from the list.

  12. Select a Field Type.  Most fields will be Regular Fields.  

  13. Select Field Options.  
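Because the Field Delimiter must match the documents character for character, it can help to check a candidate delimiter against sample documents before entering it. The following Python sketch does this outside IN-SPIRE. The sample documents and the `count_matches` helper are hypothetical and are not an IN-SPIRE feature.

```python
def count_matches(docs, delimiter):
    """Return (documents containing the exact delimiter, total documents)."""
    hits = sum(1 for doc in docs if delimiter in doc)
    return hits, len(docs)

# Hypothetical sample documents. The third one has no space before the colon.
DOCS = [
    "TITLE : First report\nBody text.",
    "TITLE : Second report\nBody text.",
    "TITLE: Third report\nBody text.",
]

for candidate in ("TITLE :", "TITLE:"):
    hits, total = count_matches(DOCS, candidate)
    print(f"{candidate!r} found in {hits} of {total} documents")
```

A delimiter that is found in only some of the documents usually means that the spacing or punctuation varies in the data, so the field would be missed in the remaining documents.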

 

Selecting the Delimiter Rule

The delimiter Rule tells IN-SPIRE where one field ends and the next begins.  Use the following guidelines when selecting the delimiter Rule.

 

If the field ends here:                       Choose this:

At the end of the line.                       EOL

At the next line.                             Next Line

At the beginning of the next found field.     Next Field

    Caution:  Use this option carefully.  If there are no more fields to be found, the value of the field will be the text of the rest of the document. If the data is garbled or if the document format is incorrect, it is possible for IN-SPIRE not to "see" a document's main body text. As a consequence, its content will not be used for the analysis.

Marked by a delimiter.                        End Delimiter

    If the end delimiter is not found, then the field is not found.

    Caution:  Don't reuse End Delimiters. If the same character string is used to mark the ends of two different fields, your document may not process correctly.
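The behavior of the rules can be sketched in Python. This is only an illustration of the descriptions above, not IN-SPIRE's implementation. The function names, parameters, and sample document are hypothetical. The Next Line rule is left out because the guidelines do not say exactly which text it captures.

```python
from collections import Counter

def extract_field(text, delimiter, rule, end_delimiter=None, other_delimiters=()):
    """Return the field value, or None if the field is not found."""
    start = text.find(delimiter)
    if start == -1:
        return None
    rest = text[start + len(delimiter):]

    if rule == "EOL":
        # The field ends at the end of the line holding the delimiter.
        return rest.split("\n", 1)[0].strip()

    if rule == "Next Field":
        # The field ends where the next defined field begins. If no other
        # field follows, the value is the rest of the document.
        found = [p for p in (rest.find(d) for d in other_delimiters) if p != -1]
        end = min(found) if found else len(rest)
        return rest[:end].strip()

    if rule == "End Delimiter":
        if end_delimiter is None:
            raise ValueError("End Delimiter rule needs an end_delimiter")
        end = rest.find(end_delimiter)
        # If the end delimiter is not found, the field is not found.
        return None if end == -1 else rest[:end].strip()

    raise ValueError(f"unsupported rule: {rule}")

def reused_end_delimiters(end_delimiters):
    """Return the end delimiters that are used for more than one field."""
    counts = Counter(end_delimiters)
    return sorted(d for d, n in counts.items() if n > 1)

# Hypothetical document for illustration.
DOC = (
    "TITLE : Flood warning\n"
    "DATE : 1 May\n"
    "TEXT : Rivers are rising.\n"
    "Crews are on standby.\n"
    "<END>"
)

print(extract_field(DOC, "TITLE :", "EOL"))
print(extract_field(DOC, "TEXT :", "Next Field",
                    other_delimiters=("TITLE :", "DATE :")))
print(extract_field(DOC, "TEXT :", "End Delimiter", end_delimiter="<END>"))
```

In this sketch, the Next Field call on TEXT : finds no later field, so its value runs to the end of the document and includes the trailing <END> marker. This is the situation the first caution describes. The `reused_end_delimiters` helper is a simple way to spot the problem named in the second caution before you define the fields.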

 

 

 

7/18/05