Creating New Datasets

If IN-SPIRE is running, choose File > Projects... from the IN-SPIRE main menu, or click the Projects button. The Projects window also appears automatically when IN-SPIRE starts up; it is the first window you see after the splash screen.

Projects window with Home Project showing

Creating a Dataset

You can create a single dataset by combining two others; see Merging Two Existing Datasets.

Usually, however, you will create a dataset from files that you already have or by harvesting documents from the Web. With the exception of Word, RTF, PDF datasets, a dataset can contain only one type of data. The "type of dataset" is either the type of data that you already have (Text files; Word, RTF, or PDF documents; HTML files; XML documents; or Outlook email messages) or the Internet search that will retrieve the data you want to use (e.g., Google, Web, or Blog Harvest).

Basic Steps 1-3

  1. To create a new dataset, from the IN-SPIRE main menu choose File > Projects... to open the Projects window.
  2. On the Projects window, click New. The Dataset Wizard window opens.
    Types of datasets
  3. Choose the type of dataset you want to create and click Next, or (if you have saved a settings template that you want to use) choose one of the items in the Saved Dataset Settings list.

    The next steps depend on what you selected in Step 3.

  1. To accept the default processing options and start processing, go to Start Processing. To set your own processing options, click Next and go to Optional Settings.
    Warning: If you start processing a text dataset at this point without defining any fields, all the fields in your data will be lumped together for the purposes of the analysis. If this is not what you want, click Next and go to Fields under Optional Settings.

Optional Settings

  1. Fields. If you are creating a Text or XML dataset, you can specify Fields. By default, IN-SPIRE uses all the text in a document for the analysis. However, if you have structured text documents with delimited sections, IN-SPIRE can be configured to analyze some document sections and not others. Click Add. The Field Properties window opens. See Fields for more information. Special considerations apply to XML datasets; see Using XML Data.
  2. Stopwords: These are words which the text engine ignores during text processing; they will be eliminated from the dataset. This means they will not appear in any of the tools such as Summary, and they will not be found when you search the dataset.
    1. To edit the stopword list, click Next until the Stopword List window opens.
    2. To delete a word from this list, click on it to select it, then click Delete.
    3. To edit a word, click Edit.
    4. To add words to the list, click Add. For more information, see Stopwords.
  3. Stopmajor list: These are terms that you want to be available for Summary and searching, but do not want considered as major terms when determining document relatedness. For more information, see Stopmajor Words.
  4. Saved Phrases: Your data may contain phrases that are important to your analysis. You can identify these phrases to the IN-SPIRE text engine on the Saved Phrases screen. For example, you might identify “United States”, “enriched uranium”, or “United Arab Emirates” as phrases. Once defined, these phrases become “major terms”, appear in peak labels, and are used for clustering documents in the Galaxy, which can be a powerful way of viewing your data. If you do not identify any phrases, the text engine uses single words to determine the similarity between documents. You can save the list of phrases to a file for re-use.
    Dataset Wizard Saved Phrases Window
    1. To create a new set of phrases, click Add. The Add Phrase window will open.
    2. Enter a phrase and click Add. Your phrase will appear on the Current Phrases list. The window will remain open.
    3. Keep adding as many phrases as you need. Click Done to exit the Add Phrase window and return to the Dataset Wizard.
    4. Click Save to save your current phrases.
    5. If you have phrases saved from an earlier session, click Load, select the saved phrases file in the Open window, and click Load. The saved phrases are displayed in the left-hand column, with the entire list selected. To add the saved phrases to the Current Phrases column, click the arrow button. To add only the phrases you need, select them by single-clicking or CTRL-clicking before you click the arrow button.
  5. Entity Extraction and Language Detector: The dictionary defines an entity as "something that has separate and distinct existence and objective or conceptual reality." For IN-SPIRE, entities are words ("Iraq") or phrases ("Palestine Liberation Organization"); they can also be longer phrases such as "in the White House" or "going to Iraq". An entity extractor is software that attempts to identify entities in the data so they can be used in analysis. Entity extraction can significantly increase processing time.
    If your dataset is large, allow substantial time for the entity extractor to finish. In the Dataset list, the status of the dataset will remain "Scanning" during this time. If you know which entities you are most interested in, you may want to enter them on the Saved Phrases screen instead of using an entity extractor.
    • Select an entity extractor from the list. Entity extraction provides text analysis technology that automatically identifies and extracts key entities.
    • Select a language detector to automatically determine the language of harvested documents.
  6. Punctuation Rules. You can specify how punctuation characters in the text are treated by the text engine. Click Next until the Punctuation window opens. Click on the punctuation character of interest, then click Edit. For more information, see Punctuation Rules.
  7. Save settings check box: This feature allows you to save your processing settings so they can be used in subsequent visualizations. They will be available through the "Saved Settings" tab in the Dataset Editor. The saved settings include the document collection, formats, stopwords, and stopmajor settings. Select the Save These Dataset Settings check box and, if you wish, edit the name of the setting.
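
The effect of the stopword, stopmajor, and saved phrase settings described above can be pictured with a small Python sketch. This is an illustration only, not IN-SPIRE's text engine: the function names (tokenize, index_terms), the underscore-joined phrase tokens, and the sample word lists are all hypothetical, and the tokenizer is deliberately simplistic (lowercase ASCII letters and digits only).

```python
import re


def tokenize(text, phrases):
    """Lowercase the text and split it into terms, keeping saved phrases whole."""
    lowered = text.lower()
    # Replace each saved phrase with one underscore-joined token, longest
    # phrase first, so that it survives the word split below.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        lowered = re.sub(pattern, phrase.lower().replace(" ", "_"), lowered)
    return re.findall(r"[a-z0-9_]+", lowered)


def index_terms(text, stopwords, stopmajors, phrases):
    """Return (searchable_terms, major_terms) for one document.

    Stopwords are dropped entirely, so they cannot be searched for.
    Stopmajor terms stay searchable but are not used as major terms.
    """
    tokens = tokenize(text, phrases)
    searchable = [t for t in tokens if t not in stopwords]
    major = [t for t in searchable if t not in stopmajors]
    return searchable, major


# Hypothetical sample settings and document.
stopwords = {"the", "on", "was"}
stopmajors = {"said", "report"}
phrases = ["United States", "enriched uranium"]

searchable, major = index_terms(
    "The United States said the report on enriched uranium was final.",
    stopwords,
    stopmajors,
    phrases,
)
print(searchable)
print(major)
```

In this sketch, "the" disappears from both lists, "said" and "report" remain searchable but are excluded from the major terms, and each saved phrase is carried through as a single term instead of as separate words.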

Start Processing

  1. Click Finish to start processing. The Processing dialog opens, informing you that the dataset is being processed. Click OK. The dataset appears in the list of datasets in the Projects window. You can monitor the status of the dataset as it is processed by choosing Refresh from the More menu of the Projects window.