Creating New Datasets

Dataset Editor Window

Accessing the Dataset Editor

If IN-SPIRE is running, select File > Datasets... from the main IN-SPIRE menu bar. Alternatively, at startup, the Dataset Editor window opens after the splash screen appears.

Creating a Dataset

Basic Steps

  1. To create a new dataset from two existing datasets, see Merging Two Existing Datasets.
    To create a new dataset from source files, click New in the Dataset Editor window. The Dataset Wizard window opens.

  2. Notice that there are two tabbed panels on this screen: Datasets and Saved Settings. Datasets will be uppermost. Choose from the items on the list, which describe either the form of the data that you already have (ASCII dataset or XML dataset) or the Internet search that will retrieve the data you want to use for this dataset (for example, FBIS Portal Harvest, Google Harvest, and Web Harvest). Click Next.

  3. Follow the appropriate link below for specific instructions on each type.

 

Settings for ASCII Dataset

Settings for FBIS Portal Harvest

Settings for Google Harvest

Settings for Web Harvest

Settings for XML Dataset

 

  4. To use a previously saved processing configuration, click the Saved Settings tab and choose one of the saved settings listed in the window.
    Note: You must have previously saved settings for any to appear under the Saved Settings tab.

  5. To accept the default processing options and start processing, go to Start Processing. To set your own processing options, click Next and go to Optional Settings.
    Caution: If you start processing a text dataset at this point without defining any fields, all the fields in your data will be combined and analyzed as a single block of text. If this is not what you want, click Next and go to Fields (step 1) under Optional Settings.
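
The caution above can be illustrated with a small Python sketch. This is not IN-SPIRE code and does not reflect its file formats. The document structure (a dictionary of field names and text) and the function name text_for_analysis are hypothetical, chosen only to show the difference between defining fields and leaving them undefined.

```python
# Hypothetical illustration only; not IN-SPIRE code or its data format.

def text_for_analysis(document, analyzed_fields=None):
    """Return the text that an analysis would see for one document.

    document: dict mapping field name -> field text (a made-up structure).
    analyzed_fields: names of the fields to include; None means that
    no fields were defined.
    """
    if analyzed_fields is None:
        # No fields defined: every section is combined into one block of text.
        return " ".join(document.values())
    # Fields defined: only the listed sections contribute to the analysis.
    return " ".join(document[name] for name in analyzed_fields if name in document)


doc = {
    "title": "Uranium enrichment report",
    "author": "J. Smith",
    "body": "The facility expanded its centrifuge capacity.",
}

print(text_for_analysis(doc))                     # author name is included
print(text_for_analysis(doc, ["title", "body"]))  # author name is left out
```

In the first call the author name becomes part of the analyzed text, which is the kind of unintended mixing the caution describes. In the second call only the title and body are analyzed.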

Optional Settings

  1. Fields: If you are creating an ASCII or XML dataset, you can specify fields. By default, IN-SPIRE uses all the text in a document for the analysis. If you have structured text documents with delimited sections, IN-SPIRE can be configured to analyze some document sections and not others. To define a field, click Add. The Field Properties window opens. See Fields for more information. There are special considerations when Using XML Data.

  2. Stopwords: These are words that the text engine ignores during text processing; they are eliminated from the dataset. This means they will not appear in any of the tools, such as Gist, and they will not be found when you query the dataset.

    1. To edit the stopword list, click Next until the Stopword List window opens.

    2. To delete a word from this list, click on it to select it, then click Delete.

    3. To edit a word, click on it to select it, then click Edit.

    4. To add words to the list, click Add. For more information, see Stopwords.

  3. Stopmajor list: These are terms that you want to be available for gisting and querying, but do not want considered as major terms when determining document relatedness. For more information, see Stopmajor Words.

  4. Saved Phrases: Your data may contain phrases that are important to your analysis. You can identify these phrases to the IN-SPIRE text engine on the Saved Phrases screen. For example, you might identify “United States”, “enriched uranium”, or “United Arab Emirates” as phrases. Once defined, these phrases are treated as “major terms”, appear in peak labels, and are used for clustering documents in the Galaxy. If you do not identify any phrases, the text engine uses only single words to determine the similarity between documents. You can save the list of phrases to a file for re-use.

    1. To create a new set of phrases, click Add. The Add Phrase window opens.

    2. Enter a phrase and click Add. Your phrase appears in the Current Phrases list. The window remains open.

    3. Keep adding as many phrases as you need. Click Done to exit the Add Phrase window and return to the Dataset Wizard.

    4. Click Save to save your current phrases.

    5. If you have phrases saved from an earlier session, click Load, select the saved phrases file in the Open window, and click Load. The saved phrases appear in the left-hand column with the entire list selected. To add the saved phrases to the Current Phrases column, click the arrow button. To add only some of the phrases, single-click a phrase or use Ctrl-click to select just the phrases you need before clicking the arrow button.

  5. Entity Extraction and Language Detector: The dictionary defines an entity as “something that has separate and distinct existence and objective or conceptual reality.” For IN-SPIRE, entities are words (“Iraq”) or phrases (“Palestine Liberation Organization”), including longer phrases such as “in the White House” or “going to Iraq”. An entity extractor is software that attempts to identify entities in the data so they can be used in analysis. Entity extraction can significantly increase processing time. If you know which entities you are most interested in, you may want to enter them on the Saved Phrases screen instead of using an entity extractor.

  6. Punctuation Rules: You can specify how punctuation characters in the text are treated by the text engine. Click Next until the Punctuation window opens. Click on the punctuation character of interest, then click Edit. For more information, see Punctuation Rules.

  7. Save settings check box: This feature allows you to save your processing settings so they can be used in subsequent visualizations. They will be available through the "Saved Settings" tab in the Dataset Editor. The saved settings include the document collection, formats, stopwords, and stopmajor settings. Select the Save These Dataset Settings... check box and, if you wish, edit the name of the setting.
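
The roles of the stopword list, the stopmajor list, and saved phrases described above can be sketched in a few lines of Python. This is a conceptual illustration under simplifying assumptions, not the IN-SPIRE text engine. The function names (tokenize, index_terms), the sample sentence, and the rule that every remaining term is a candidate major term are all hypothetical.

```python
# Conceptual sketch only; it does not reproduce IN-SPIRE's text engine.
import re


def tokenize(text, phrases):
    """Lowercase the text and split it into terms, keeping saved phrases whole."""
    text = text.lower()
    # Try longer phrases first so a long phrase is not split by a shorter one.
    ordered = sorted((p.lower() for p in phrases), key=len, reverse=True)
    alternatives = [re.escape(p) for p in ordered] + [r"[a-z0-9]+"]
    pattern = r"\b(?:" + "|".join(alternatives) + r")\b"
    return re.findall(pattern, text)


def index_terms(text, stopwords, stopmajors, phrases):
    """Return (searchable terms, candidate major terms) for one document."""
    # Stopwords are dropped entirely: they can be neither queried nor shown.
    terms = [t for t in tokenize(text, phrases) if t not in stopwords]
    searchable = set(terms)
    # Stopmajor terms stay searchable but are excluded from the major terms.
    major_candidates = searchable - set(stopmajors)
    return searchable, major_candidates


sentence = "The United States exported enriched uranium to the United Arab Emirates"
saved_phrases = ["United States", "enriched uranium", "United Arab Emirates"]

searchable, major = index_terms(
    sentence,
    stopwords={"the", "to"},
    stopmajors={"exported"},
    phrases=saved_phrases,
)
print(sorted(searchable))
print(sorted(major))
```

In this sketch "the" and "to" disappear from both sets, "exported" remains searchable but is not a candidate major term, and each saved phrase is kept as a single term rather than being split into separate words.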

Start Processing

  1. Click Finish to start processing. The Processing dialog opens, informing you that the dataset is being processed. Click OK. The dataset appears in the list of datasets in the Dataset Editor window. You can monitor the status of the dataset as it is processed by clicking the Refresh button at the top of the Dataset Editor window.