U.S. Department of Energy
IN-SPIRE™ Visual Document Analysis

FAQ: How does IN-SPIRE™ create a visualization with my documents?

In brief, IN-SPIRE™ creates mathematical representations of the documents, which are then organized into clusters and visualized into "maps" that can be interrogated for analysis.

More specifically, IN-SPIRE™ performs the following steps:

  1. The text engine scans through the document collection and automatically determines the distinguishing words or "topics" within the collection, based upon statistical measurements of word distribution, frequency, and co-occurrence with other words. Distinguishing words are those that help describe how each document in the dataset is different from any other document. For example, the word "and" would not be considered a distinguishing word, because it is expected to occur frequently in every document. In a dataset where every document mentions Iraq, "Iraq" wouldn't be a distinguishing word either.
  2. The text engine uses these distinguishing words to create a mathematical signature for each document in the collection. Then it does a rough similarity comparison of all the signatures to create cluster groupings.
  3. IN-SPIRE™ compares the clusters against each other for similarity, and arranges them in high-dimensional space (about 200 axes) so that similar clusters are located close together. The clusters can be thought of as a mass of bubbles, but in a space of roughly 200 dimensions instead of just 3.
  4. That high-dimensional arrangement of clusters is then flattened down to a comprehensible two dimensions, while trying to preserve a picture in which similar clusters are located close to each other and dissimilar clusters are located far apart. Finally, the documents are added to the picture by arranging each one within the invisible "bubble" of its respective cluster. All of this information is then mapped onto the Galaxy and ThemeView™ visualizations, which convey the document and topical relationships of your information.
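
For readers who find code easier to follow than prose, the four steps above can be sketched in a few dozen lines of Python. This is only an illustration under stated assumptions, not IN-SPIRE™'s actual implementation: TF-IDF weighting stands in for the unspecified statistical measures in step 1, a greedy cosine-similarity pass stands in for the rough clustering in step 2, and a power-iteration PCA stands in for the flattening in step 4. The axes here are simply the distinguishing words rather than about 200 derived axes, and every function name, the threshold, and the radius are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    words = (w.strip('.,;:!?"()').lower() for w in text.split())
    return [w for w in words if w]

def distinguishing_words(docs):
    """Step 1 (assumed TF-IDF): drop words that occur in every document."""
    n = len(docs)
    doc_freq = Counter(w for d in docs for w in set(tokenize(d)))
    return {w: math.log(n / c) for w, c in doc_freq.items() if c < n}

def signature(doc, idf, vocab):
    """Step 2a: a unit-length weighted word vector for one document."""
    tf = Counter(tokenize(doc))
    vec = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rough_clusters(sigs, threshold=0.2):
    """Step 2b (assumed greedy pass): join the first cluster whose seed
    document is similar enough, otherwise start a new cluster."""
    clusters = []
    for i, s in enumerate(sigs):
        for members in clusters:
            if dot(s, sigs[members[0]]) >= threshold:  # cosine of unit vectors
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def top_direction(rows, iters=200):
    """Dominant direction of the rows, found by power iteration."""
    dim = len(rows[0])
    v = [float(j + 1) for j in range(dim)]  # fixed, non-uniform start vector
    norm = math.sqrt(dot(v, v)) or 1.0
    v = [x / norm for x in v]
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # no variance left along any direction
            break
        v = [x / norm for x in w]
    return v

def flatten_to_2d(points):
    """Steps 3-4 (assumed PCA): project cluster centroids onto two axes."""
    mean = centroid(points)
    rows = [[x - m for x, m in zip(p, mean)] for p in points]
    v1 = top_direction(rows)
    xs = [dot(r, v1) for r in rows]
    # Remove the first direction, then find the next strongest one.
    residual = [[a - x * b for a, b in zip(r, v1)] for r, x in zip(rows, xs)]
    v2 = top_direction(residual)
    ys = [dot(r, v2) for r in residual]
    return list(zip(xs, ys))

def place_documents(clusters, centers, radius=0.05):
    """Step 4, last part: spread each document around its cluster's center."""
    positions = {}
    for members, (cx, cy) in zip(clusters, centers):
        for k, doc in enumerate(members):
            angle = 2 * math.pi * k / len(members)
            positions[doc] = (cx + radius * math.cos(angle),
                              cy + radius * math.sin(angle))
    return positions

docs = [
    "Iraq and oil exports rise",
    "Iraq and oil exports fall",
    "Iraq and election results delayed",
    "Iraq and election results certified",
]
idf = distinguishing_words(docs)   # "iraq" and "and" are dropped
vocab = sorted(idf)
sigs = [signature(d, idf, vocab) for d in docs]
clusters = rough_clusters(sigs)
centers = flatten_to_2d([centroid([sigs[i] for i in c]) for c in clusters])
positions = place_documents(clusters, centers)
print(clusters)  # [[0, 1], [2, 3]]
```

In this toy collection every document mentions "Iraq", so, as in the example from step 1, it carries no weight; the oil documents and the election documents end up in separate clusters on opposite sides of the 2-D map.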
