Settings: Fields

A field is a section of a document whose boundaries are marked by delimiters. Examples of fields are title, source, time/date, country, region, etc. Delimiters that are consistent across documents enable IN-SPIRE to find the field. Defining fields is optional, but if no fields have been defined, all the text in a document is lumped into a single field, which is then used by IN-SPIRE for clustering.

When to Define Fields

Define fields when you want to refine your analysis by:

  • Clustering on a particular field rather than all the document text.
  • Using the Time Tool, which requires a date field.
  • Labeling documents in the Galaxy visualization and the Document Viewer using a Title field rather than IN-SPIRE-assigned document numbers.
  • Searching a particular field and not all document text.
  • Having IN-SPIRE automatically create groups from the contents of a field.

What IN-SPIRE Needs to Know About Fields

  • What text always marks the beginning of each document?
  • What text always labels each of the parts of the document?
  • How can you tell where one part stops and the next begins?
  • Where is the "body" of the document? The body includes all text from the end of the last-defined field to the beginning of the next document delimiter.

It is not necessary to define a field for every labeled part of the document. If you do not define fields, IN-SPIRE will lump together the labeled parts. Depending on the structure of the data, this can introduce "noise," obscuring the analysis.

For certain types of datasets, such as Text and XML, you can define fields during the initial dataset creation. For other datasets, such as Google Harvest, you can define fields only after the initial dataset has been created.
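
The questions listed in this section can be made concrete with a small sketch. The Python below is an illustration only, not IN-SPIRE's implementation: the sample text, the DOCID : document delimiter, and the function names are all hypothetical. It splits raw text into documents at a document delimiter, pulls out single-line labeled fields, and treats everything after the last defined field as the body.

```python
# Hypothetical sample: two documents, each starting with "DOCID :".
# The labels and layout are invented for illustration only.
RAW = """DOCID : 001
TITLE : Flood warning issued
DATE : 2004-03-15
River levels rose overnight.
Residents were advised to move.
DOCID : 002
TITLE : Harvest report
DATE : 2004-03-16
Yields were above average.
"""

DOC_DELIMITER = "DOCID :"  # text that marks the beginning of each document
FIELD_LABELS = {"Title": "TITLE :", "Date": "DATE :"}  # field name -> label text


def split_documents(raw, doc_delimiter):
    """Split raw text into documents; each one begins at doc_delimiter."""
    parts = raw.split(doc_delimiter)
    # parts[0] is whatever precedes the first delimiter (empty here).
    return [doc_delimiter + part for part in parts[1:]]


def parse_document(doc, field_labels):
    """Extract single-line fields; text after the last one is the body."""
    lines = doc.splitlines()
    fields = {}
    last_field_line = 0
    for i, line in enumerate(lines):
        for name, label in field_labels.items():
            if line.startswith(label):
                fields[name] = line[len(label):].strip()
                last_field_line = i
    # Body: from the end of the last defined field to the end of the document.
    fields["Body"] = "\n".join(lines[last_field_line + 1:]).strip()
    return fields


docs = [parse_document(d, FIELD_LABELS) for d in split_documents(RAW, DOC_DELIMITER)]
```

If FIELD_LABELS were empty, every labeled line would fall into the body, which is the "lumping" effect described above.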

Defining Dataset Fields

To define fields, start from a dataset: either create a new dataset or edit an existing one. The steps below edit an existing dataset.

  1. From the IN-SPIRE File menu select Datasets. The Dataset Editor opens.
  2. From the Dataset Editor window, select a dataset from the list, and click the Edit button. The Dataset Wizard opens.
  3. Click Next> to navigate through the steps until you reach the Format Fields step. For the example shown, a Google Harvest dataset is being edited and Format Fields appears as Step 5 of the Dataset Wizard. The step number will be different depending upon what type of dataset you are creating or editing.
  4. To create a new field, click Add. The Field Properties window opens.
    Format Fields Window
  5. Edit the field. You can use the Source Preview pane at the bottom of the Field Properties window to look at the dataset documents while you create a field.
  6. Choose a descriptive name for the Field Name, which will be the label used for the field in the Document Viewer.
  7. Enter the Field Delimiter text that always marks this field in all documents. In the above example, the Field Delimiter is TITLE :. Enter it exactly as it appears (spaces and punctuation are important). In this example, there is a space after "TITLE" followed by a colon.
    To ensure that the Field Delimiter you enter is exactly what is found in your documents, copy field text from the Source Preview window.
    • To quickly copy text in the Source Preview to Field Name, Field Delimiter, or End Delimiter fields, in the Source Preview area, highlight the text you want to use.
    • Right-click on the selected text.
      Set Field Drop-Down Menu
    • From the right-click menu, choose the field(s) to which you want to copy the text.
  8. If you have not already selected a Delimiter Rule for your field, select one from the list.
    Dataset Field Rule List
  9. Select a Field Type. Most fields will be Regular Fields.
    • To label documents in the Galaxy with a title rather than with document numbers assigned by IN-SPIRE, choose Title Field. Titles will be visible in the Galaxy and in other tools like the Document Viewer.
    • To use the Time Tool, you must define a date field. Select Date/Time Field, then select a Time/Date Format from the table.
      Field Properties Date and Time Field Selections
  10. Select Field Options.
    • To have IN-SPIRE use the terms in the field to calculate topical content and clustering, choose Include in Computation. If the field is not used for computation, the terms in that field will be ignored when calculating the topical characteristics of your documents, but the terms will be available for the Search and Summary tools. Usually you will want to use all of your non-date fields for computation.
    • Select Case Sensitive Delimiter to distinguish this field delimiter from others with the same spelling but different capitalization.
    • Select Categorical Field if the field is a categorical field (has a limited number of discrete values) and you want IN-SPIRE to create a group for each value it finds as well as a column for the field in the Document Viewer.
      If the field can contain only a single value, select Single text value. If a field in your dataset can contain multiple values, select from the list the way the values are separated.
      Choices for how multiple values simultaneously present in a field will appear in the dataset.
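
To illustrate the Categorical Field option in step 10, here is a minimal sketch of how one group per distinct value could be built, including the case where a field holds several separated values. It is an assumption-laden illustration, not IN-SPIRE's code: the function names, the Country field, and the semicolon separator are hypothetical.

```python
from collections import defaultdict


def split_values(field_text, separator=None):
    """Return the values held in a categorical field.

    separator=None models the single-value case; otherwise the text is
    split on the separator and surrounding whitespace is dropped.
    """
    if separator is None:
        value = field_text.strip()
        return [value] if value else []
    return [v.strip() for v in field_text.split(separator) if v.strip()]


def build_groups(documents, field_name, separator=None):
    """Map each distinct field value to the ids of documents containing it."""
    groups = defaultdict(list)
    for doc_id, fields in documents.items():
        for value in split_values(fields.get(field_name, ""), separator):
            groups[value].append(doc_id)
    return dict(groups)


# Hypothetical documents with a semicolon-separated "Country" field.
documents = {
    1: {"Country": "France; Germany"},
    2: {"Country": "Germany"},
    3: {"Country": "Japan;France"},
}

groups = build_groups(documents, "Country", separator=";")
```

Choosing the wrong separator setting changes the groups: treated as a single value, "France; Germany" would become one group of its own instead of contributing to both France and Germany.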

Selecting the Delimiter Rule

To tell IN-SPIRE where one field ends and the next begins, use the following guidelines when selecting the Delimiter Rule.

  • If the field ends at the end of the line, choose EOL.
  • If the field ends at the next line, choose Next Line.
  • If the field ends at the beginning of the next found field, choose Next Field. Use this option carefully. If there are no more fields to be found, the value of the field will be the text of the rest of the document. If the data is garbled or if the document format is incorrect, it is possible for IN-SPIRE not to "see" a document's main body text. As a consequence, its content will not be used for the analysis.
  • If the field ends at a delimiter, choose End Delimiter. If the end delimiter is not found, then the field is not found. Do not reuse End Delimiters. If the same character string is used to mark the ends of two different fields, your document may not process correctly.
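
The four rules can be sketched in Python to show how they differ on the same text. This is a hedged illustration, not IN-SPIRE's parser: extract_field, its parameters, and the sample document are hypothetical, and the reading of Next Line (the value continues through the end of the line after the delimiter's line) is an assumption.

```python
def extract_field(text, delimiter, rule, end_delimiter=None, other_delimiters=()):
    """Return the field value, or None when the field is not found."""
    start = text.find(delimiter)  # exact match: spaces and punctuation count
    if start == -1:
        return None
    rest = text[start + len(delimiter):]

    if rule == "EOL":
        end = rest.find("\n")
    elif rule == "Next Line":
        first = rest.find("\n")
        # Assumed reading: continue through the end of the following line.
        end = rest.find("\n", first + 1) if first != -1 else -1
    elif rule == "Next Field":
        hits = [i for i in (rest.find(d) for d in other_delimiters) if i != -1]
        end = min(hits) if hits else -1  # no later field: take the rest
    elif rule == "End Delimiter":
        if not end_delimiter:
            raise ValueError("End Delimiter rule needs an end_delimiter")
        end = rest.find(end_delimiter)
        if end == -1:
            return None  # end delimiter missing: field not found
    else:
        raise ValueError(f"unknown rule: {rule}")

    value = rest if end == -1 else rest[:end]
    return value.strip()


# Hypothetical document; labels and the "##" end marker are invented.
DOC = (
    "TITLE : Flood warning issued\n"
    "SUMMARY : River levels rose\n"
    "overnight in the valley.\n"
    "More rain is expected.\n"
    "NOTES : checked by desk ##\n"
    "SOURCE : Regional wire\n"
)

title = extract_field(DOC, "TITLE :", "EOL")
two_lines = extract_field(DOC, "SUMMARY :", "Next Line")
summary = extract_field(DOC, "SUMMARY :", "Next Field",
                        other_delimiters=("TITLE :", "NOTES :", "SOURCE :"))
notes = extract_field(DOC, "NOTES :", "End Delimiter", end_delimiter="##")
missing = extract_field(DOC, "TITLE :", "End Delimiter", end_delimiter="@@")
mistyped = extract_field(DOC, "TITLE:", "EOL")  # no space before the colon
```

In this sketch, missing and mistyped both come back as None: the first because the end delimiter never appears, the second because the delimiter was not entered exactly as it appears in the document.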