Datasets: XML Data

Follow steps 1 and 2 of the Basic Steps listed in Creating a New Dataset, and select XML Dataset. The Dataset Wizard XML window will display.

Enter a Dataset Name.
Enter the Document Delimiter. The delimiter is the name inside the XML tag that starts each record, entered without the angle brackets or a slash. The opening tag will typically be within the first three lines of your XML file(s). A corresponding closing tag, with a "/" in front of the name, will be at or near the end of the XML record. For example, if <DOC> occurs at the beginning of the XML file and </DOC> occurs near the end, the IN-SPIRE document delimiter is DOC.
Click Add... to add XML files (one or several at a time), or click Add From Folder... to add an entire folder of files. Browse to the folder you want to add, click on it to select it, and click Add. The files appear in the Source files list. To remove a file from the Source files list, click on it, then click Remove.
Associate source files with XSLT files. XSLT files are used to transform XML tagged data into some other format for viewing. To associate XSLT files:
1. Click on one or more source files that you wish to associate an XSLT file with. Ctrl-A will select all source files in the list.
2. Click on the XSLT button. A file open dialog opens in the IN-SPIRE sources folder.
3. Navigate to your XSLT file, if it is not in the sources folder.
4. Click Set. The XSLT file name appears next to the source file with which you have associated it.
Click Next>. The Format Fields window will display.
To add fields, click on the Add... button. The Field Properties window will display.
From the Source Preview pane, highlight a Field Delimiter and right-click.
From the drop-down menu, select Set as Field Name and Delimiter or one of the other delimiter choices. The Field Name and Delimiter fields will be filled with your selection.

Note that the delimiters should not include the angle brackets from the XML tag. For example, for the field <DOCUMENT_ID>, DOCUMENT_ID is the field delimiter. See Defining Fields for more information on how to define fields.
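
To make the relationship between tags and delimiters concrete, the following is a minimal Python sketch that lists the bare tag names in a record. It is not part of IN-SPIRE. The sample record, its TITLE and BODY fields, and the function name list_delimiters are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; TITLE and BODY are invented for illustration.
SAMPLE = """<DOC>
  <DOCUMENT_ID>A-001</DOCUMENT_ID>
  <TITLE>Sample title</TITLE>
  <BODY>Sample body text.</BODY>
</DOC>"""


def list_delimiters(xml_text):
    """Return (document delimiter, field delimiters) as bare tag names."""
    root = ET.fromstring(xml_text)
    # ElementTree reports tag names without angle brackets or slashes,
    # which is the form described above for delimiters.
    return root.tag, [child.tag for child in root]


doc_delimiter, field_delimiters = list_delimiters(SAMPLE)
print(doc_delimiter)     # DOC
print(field_delimiters)  # ['DOCUMENT_ID', 'TITLE', 'BODY']
```

The names printed are what you would type into the wizard: DOC as the document delimiter, and DOCUMENT_ID, TITLE, or BODY as field delimiters.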

Caution: Make sure at least one of the XML fields that are marked "Computational" actually exists in most of your documents, and that it contains significant textual content. If it does not, the IN-SPIRE text engine will not find enough data to create a visualization and will fail.
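
One way to check this before processing is to measure how many of your records actually carry text in the field you plan to mark "Computational". The sketch below is an illustration in Python, not an IN-SPIRE tool. It assumes each record is available as a separate XML string, and the function name field_coverage and the min_chars threshold of 50 characters are arbitrary assumptions, since this page does not define how much text counts as significant.

```python
import xml.etree.ElementTree as ET


def field_coverage(records, field, min_chars=50):
    """Return the fraction of records whose `field` has at least
    `min_chars` characters of text (threshold is an assumption)."""
    if not records:
        return 0.0
    hits = 0
    for xml_text in records:
        root = ET.fromstring(xml_text)
        # Take the first element with this tag name at any depth.
        element = next(root.iter(field), None)
        # itertext() also collects text held in nested tags.
        text = "".join(element.itertext()).strip() if element is not None else ""
        if len(text) >= min_chars:
            hits += 1
    return hits / len(records)


# Hypothetical records: two with a long BODY, one short, one missing.
records = [
    "<DOC><BODY>" + "word " * 20 + "</BODY></DOC>",
    "<DOC><BODY>short</BODY></DOC>",
    "<DOC><TITLE>no body here</TITLE></DOC>",
    "<DOC><BODY>" + "text " * 30 + "</BODY></DOC>",
]
print(field_coverage(records, "BODY"))  # 0.5
```

A low fraction suggests choosing a different field, or an additional one, for the text engine to work from.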

To accept the default settings on the following screens and begin processing immediately, click Finish and go to Start Processing. Otherwise, go to Optional Settings.

Start Processing

The Processing dialog opens, informing you that the dataset is being processed. Click OK. The dataset appears in the list of datasets in the Dataset Editor window. You can monitor its status as it is processed by clicking the Refresh button at the top of the Dataset Editor window.

Special Considerations for Processing XML Documents

XML/DTD Files

DTD files are not currently supported. If your XML files reference DTDs, the XML files will still be processed, but the DTD references and DTD files will be ignored during processing.

Note: Your XML source files must be well-formed. For the rules that govern whether XML is well-formed, see any XML reference book.
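
A quick way to confirm that a file is well-formed before adding it is to run it through any XML parser. The following is a minimal sketch using Python's standard library. It is not an IN-SPIRE feature, and the function name check_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column of the first problem.
        return str(err)
    return None


good = "<DOC><DOCUMENT_ID>A-001</DOCUMENT_ID></DOC>"
bad = "<DOC><DOCUMENT_ID>A-001</DOC>"  # DOCUMENT_ID is never closed

print(check_well_formed(good))  # None
print(check_well_formed(bad))   # a "mismatched tag" message with its position
```

To check a file on disk, read its contents and pass them to the function. Passing this check only shows that a standard parser accepts the file; it does not by itself guarantee that IN-SPIRE will process it.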

Unsupported Entity Markers

The entity markers [ and ] are not supported at this time.

7/18/05