If IN-SPIRE is running, select File > Datasets... from the main IN-SPIRE menu bar. Alternatively, on startup, the Dataset Editor window opens after the splash screen has appeared.
To create a new dataset from two existing datasets, see Merging Two Existing Datasets.
To create a new dataset from source files, click New on the Dataset Editor window. The Dataset Wizard window opens.
This screen has two tabbed panels, Datasets and Saved Settings, with Datasets shown on top. Choose from the items on the list, which describe either the form of the data that you already have (ASCII dataset or XML dataset) or the Internet search that will retrieve the data you want to use for this dataset (for example, FBIS Portal Harvest, Google Harvest, and Web Harvest). Click Next.
Follow the appropriate link below for specific instructions on each type.
Settings for FBIS Portal Harvest
To use a previously saved processing configuration, click the Saved Settings tab and choose one of the saved settings listed in the window.
Note: Settings are listed under the Saved Settings tab only if you have previously saved some.
To accept the default processing options and start processing, go to Start Processing. To set your own processing options, click Next and go to Optional Settings.
Caution: If you start processing a text dataset at this point without defining any fields, all the fields in your data will be lumped together for the purposes of the analysis. If this is not what you want, click Next and go to step 7 under Optional Settings.
Fields. If you are creating an ASCII or XML dataset, you can specify Fields. The default behavior for IN-SPIRE is for all the text in a document to be used for the analysis. If you have structured text documents with delimited sections, IN-SPIRE can be configured to analyze some document sections and not others. Click Add. The Field Properties window opens. See Fields for more information. There are special considerations when Using XML Data.
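The idea of analyzing some document sections and not others can be sketched in a few lines of Python. This is only an illustration of the concept, not IN-SPIRE's implementation: the tag layout (an upper-case name followed by a colon), the field names, and the functions `split_fields` and `text_for_analysis` are all assumptions made for the example.

```python
import re

# Hypothetical layout: each field starts on its own line with an upper-case
# tag followed by a colon. This is an assumed format, not IN-SPIRE's.
FIELD_TAG = re.compile(r"^([A-Z]+):\s*", re.MULTILINE)


def split_fields(document):
    """Split a delimited document into a {field name: field text} dict."""
    parts = FIELD_TAG.split(document)
    # parts[0] is any text before the first tag; after that the list
    # alternates between a tag name and the text that follows it.
    names, texts = parts[1::2], parts[2::2]
    return {name: text.strip() for name, text in zip(names, texts)}


def text_for_analysis(document, analyzed_fields):
    """Join only the fields selected for analysis, in document order."""
    fields = split_fields(document)
    return " ".join(
        text for name, text in fields.items() if name in analyzed_fields
    )


doc = (
    "TITLE: Uranium shipment report\n"
    "SOURCE: Example Wire\n"
    "BODY: Enriched uranium was shipped on Tuesday."
)

# Analyze the title and body, but leave the source line out.
print(text_for_analysis(doc, {"TITLE", "BODY"}))
```

With no fields defined, the equivalent of this sketch would pass the whole document through unchanged, which matches the default behavior described above.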
Stopwords: These are words which the text engine ignores during text processing; they will be eliminated from the dataset. This means they will not appear in any of the tools such as Gist, and they will not be found when you query the dataset.
To edit the stopword list, click Next until the Stopword List window opens.
To delete a word from this list, click on it to select it, then click Delete.
To edit a word, click Edit.
To add words to the list, click Add. For more information, see Stopwords.
Stopmajor list: Terms that you want to be available for gisting and querying, but do not want considered as major terms when determining document relatedness. For more information, see Stopmajor Words.
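The difference between the two lists can be shown with a small Python sketch. It is a conceptual illustration only, assuming a simple lower-casing tokenizer and made-up word lists; the names `STOPWORDS`, `STOPMAJOR`, `searchable_terms`, and `major_term_candidates` are hypothetical and do not come from IN-SPIRE.

```python
import re

# Hypothetical lists for illustration; the real lists are edited in the wizard.
STOPWORDS = {"the", "of", "and", "a", "in"}
STOPMAJOR = {"report", "said"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def searchable_terms(text):
    """Terms that remain in the dataset: every word except stopwords."""
    return [t for t in tokenize(text) if t not in STOPWORDS]


def major_term_candidates(text):
    """Searchable terms that may also count toward document relatedness."""
    return [t for t in searchable_terms(text) if t not in STOPMAJOR]


sample = "The report said the shipment of uranium arrived in the port"

print(searchable_terms(sample))       # stopwords removed, stopmajor words kept
print(major_term_candidates(sample))  # stopwords and stopmajor words removed
```

In this sketch a stopword such as "the" disappears entirely, while a stopmajor word such as "report" stays searchable but is excluded from the terms that drive relatedness.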
Saved Phrases: Your data may contain phrases that are important to your analysis. You can identify these phrases to the IN-SPIRE text engine on the Saved Phrases screen. For example, you might identify “United States”, “enriched uranium”, or “united arab emirates” as phrases. Once defined, these phrases are treated as “major terms”, appear in peak labels, and are used for clustering documents in the Galaxy, which provides a powerful way of viewing your data. If you do not identify any phrases, the text engine will use single words to determine the similarity between documents. You can save the list of phrases to a file for re-use.
To create a new set of phrases, click Add. The Add Phrase window will display.
Enter a phrase and click Add. Your phrase will display on the Current Phrases list. The window will remain open.
Keep adding as many phrases as you need. Click Done to exit the Add Phrase window and return to the Dataset Wizard.
Click Save to save your current phrases.
If you have phrases saved from an earlier session, you can click Load to select the file of saved phrases. Select the saved phrases file from the Open window and click Load. The saved phrases will display in the left-hand column, with the entire list selected. To add the saved phrases to the Current Phrases column, click the arrow button. You can also pick and choose phrases from the list and add them, single-clicking or using Ctrl-click to select only those phrases you need.
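To see why defining phrases changes the analysis, the following Python sketch treats each saved phrase as a single term instead of separate words. It is a minimal illustration under stated assumptions, not the IN-SPIRE text engine: the tokenizer, the longest-match rule, and the function `tokenize_with_phrases` are all hypothetical.

```python
import re

# Example phrases taken from the text above.
SAVED_PHRASES = ["United States", "enriched uranium", "united arab emirates"]


def tokenize_with_phrases(text, phrases):
    """Split text into terms, keeping each saved phrase as one term."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Compare longer phrases first so the longest match wins; skip blanks.
    phrase_tuples = sorted(
        (tuple(p.lower().split()) for p in phrases if p.strip()),
        key=len,
        reverse=True,
    )
    terms = []
    i = 0
    while i < len(words):
        for phrase in phrase_tuples:
            if tuple(words[i:i + len(phrase)]) == phrase:
                terms.append(" ".join(phrase))
                i += len(phrase)
                break
        else:
            # No saved phrase starts here, so keep the single word.
            terms.append(words[i])
            i += 1
    return terms


print(tokenize_with_phrases("The United States tracks enriched uranium.", SAVED_PHRASES))
print(tokenize_with_phrases("The United States tracks enriched uranium.", []))
```

With an empty phrase list, the same sentence falls back to single words, which corresponds to the behavior described when no phrases are identified.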
Entity Extraction and Language Detector: The dictionary defines an entity as “something that has separate and distinct existence and objective or conceptual reality.” For IN-SPIRE, entities will be words (“Iraq”) or phrases (“Palestine Liberation Organization”). An entity extractor is software which attempts to identify entities in the data so they can be used in analysis. Entities can also be longer phrases such as “in the White House” or “going to Iraq”. Entity Extraction can significantly increase processing time. If you know which entities you are most interested in, you may want to enter them on the Saved Phrases screen instead of using an entity extractor.
Select an entity extractor from the list. Entity extraction provides text analysis technology that automatically identifies and extracts key entities.
Select a language detector to automatically determine the language of harvested documents.
Note: Your version of IN-SPIRE may display different entity extractors or language detectors from those in the example.
Punctuation Rules. You can specify how punctuation characters in the text are treated by the text engine. Click Next until the Punctuation window opens. Click on the punctuation character of interest, then click Edit. For more information, see Punctuation Rules.
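As a rough illustration of what per-character punctuation rules can mean for a text engine, the Python sketch below applies one of three treatments to each punctuation character. The treatments ("keep", "remove", "separator"), the `RULES` table, and the function `apply_punctuation_rules` are assumptions invented for this example; the options IN-SPIRE actually offers are described under Punctuation Rules.

```python
# Hypothetical rule table: one treatment per punctuation character.
RULES = {
    "-": "keep",        # keep the character inside the word
    "'": "remove",      # delete the character, joining its neighbors
    ".": "separator",   # treat the character as a word break
    ",": "separator",
}


def apply_punctuation_rules(text, rules, default="separator"):
    """Return the words in text after applying a rule to each punctuation mark."""
    out = []
    for ch in text:
        if ch.isalnum() or ch.isspace():
            out.append(ch)
            continue
        action = rules.get(ch, default)
        if action == "keep":
            out.append(ch)
        elif action == "separator":
            out.append(" ")
        elif action == "remove":
            pass  # drop the character entirely
        else:
            raise ValueError(f"unknown punctuation rule: {action!r}")
    return "".join(out).split()


print(apply_punctuation_rules("State-owned plants don't report, officials said.", RULES))
```

Changing the rule for a single character, for example treating the hyphen as a separator, changes which terms the rest of the analysis sees.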
Save settings check box: This feature allows you to save your processing settings so they can be used in subsequent visualizations. They will be available through the "Saved Settings" tab in the Dataset Editor. The saved settings include the document collection, formats, stopwords, and stopmajor settings. Select the Save These Dataset Settings... check box, and edit the name of the setting, if you wish.
Click Finish to start processing. The Processing dialog opens, informing you that the dataset is being processed. Click OK. The dataset appears in the list of datasets in the Dataset Editor window. You can monitor the status of the dataset as it is processed by clicking the Refresh button at the top of the Dataset Editor window.