Welcome
Data Sets
Overview
Creating New
--ASCII Text
--XML
--Google Harvest
--Web Harvest
Settings
--Fields
--Stopwords
--Stopmajors
--Punctuation Rules
Editing
Merging
Exporting
Importing
Subsetting
Visualizations
Galaxy
--Basics
--Outliers
ThemeView
Settings
Tools
Document Viewer
Gist
Groups
--Basics
--Evidence Panel
Major Terms
Queries
Print
Probe
Time Slicer
About version 2.2
Overview
Known issues
|
Settings: Fields
When to define Fields
Fields are delimited sections of data set documents, which have been
defined (described) so that IN-SPIRE can recognize them. If no fields
are defined, all the text in a document is lumped into a single field,
which is used for clustering.
On the other hand, if sections of your data set's documents are labeled
so they can be defined as fields, it will be possible for you to refine
your analysis. Define fields when you want to:
- Cluster on a particular field rather than all the text
- Use the Time Slicer, which requires a date field.
- Label documents in the Galaxy visualization and the Document Viewer
using a Title field rather than with IN-SPIRE-assigned document
numbers.
- Query only a particular field.
What IN-SPIRE needs to know about Fields
- What text always marks the beginning of each document?
- What text always labels each of the parts of the document?
- How can you tell where one part stops and the next begins?
- Where is the "body" of the document? The body includes all
text from the end of the last-defined field to the beginning of the
next document delimiter.
It's not necessary to define a field for every labeled part of the document.
However, if you do not, IN-SPIRE will lump together the labeled parts
which have not been defined as fields. Depending on the structure of the
data, this can introduce "noise," obscuring the analysis.
Field Properties: Step-by-Step
![field properties window](images/dse_field_properties.jpg)
Use the Source Preview at the bottom of the Field Properties window to
help you perform these steps:
1. Field Name
Choose a descriptive name for the field. When a document will be displayed
in the Document Viewer, this field will be labeled with the Field Name
you enter here.
2. Field Delimiter
Enter the text which always labels this field in the data set documents.
In the example, the Field Delimiter is "TITLE :". Enter it exactly
as it appears (spaces and punctuation are important). To insure that the
Field Delimiter you enter is exactly what is found in your documents,
follow the steps below to copy text from the Source Preview window.
- To quickly copy text in the Source Preview to Field Name,
Field Delimiter and/or End Delimiter fields
(you can save several cut-and-paste operations by following
these steps):
- a. In the Source Preview area, highlight the text you want to use.
- b. Right-click on the text, bringing up a popup menu.
- c. Choose the fields to which you want to copy the text. The text
is copied to the fields you have chosen.
3. Rule
How can you tell when this field ends and the next begins?
If the field ends here: |
Choose this: |
At the end of the line |
EOL |
At the next line |
Next Line |
At the beginning of the next found field
If there are no more fields found, the value of the field will
be the text of the rest of the document. If your data is malformed
or your format incorrect, it is possible for IN-SPIRE not to "see"
a document's main body text. As a consequence, its content won't
be used for the analysis.
Use this option with caution. |
Next Field |
Marked by a delimiter
If the end delimiter is not found, then the field is not found.
Don't reuse
End Delimiters. If the same character string is used to mark the
ends of two different fields, your document may not process correctly.
|
End Delimiter |
4. Field Type
Most fields will be Regular Fields.
Choose Title Field to label documents in the Galaxy, rather than
using IN-SPIRE-assigned document numbers. Titles can be visible in the
Galaxy and some tools as well.
Choose Date/Time Field to use this field for the Time Slicer.
Choose a Time/Date Format in the box immediately below.
5. Field Options
Choose Include in Computation to use the terms in this field
when calculating the statistics that determine topical content and organize
your documents into clusters. Usually you will want to use all of your
non-date fields for computation, unless a particular field contains irrelevant
data or data that is of little interest for the analysis. If the field
is not used for computation, the terms in that field will be ignored when
calculating the topical characteristics of your documents, but the terms
will be available for query and gisting.
Choose Case sensitive delimiters to distinguish this field delimiter
from others with the same spelling but different capitalization.
|