Normalized compression distance for visual analysis of document collections

Research output: Contribution to journal › Article › peer-review

Abstract

In a world flooded by text from various sources, it is of strategic importance to find ways to map the information present in written documents into a form that helps users locate and associate important information within a particular text data set. Content-based maps can support extremely useful explorations of text data sets. This paper proposes and evaluates the use of Kolmogorov complexity approximations as a means to detect similarity between general textual documents, in order to support mapping and visualization techniques for corpora exploration. The calculation of this similarity measure requires no intermediate representation of a corpus (such as a vector representation) and therefore no pre-processing or parametrization steps. That makes it very attractive for a wider range of exploratory applications compared to conventional measures that need vector-based text representations. The visual layout used here is based on fast distance multi-dimensional projections. It is shown that the similarity measure and the resulting maps present very good precision and that the approach can be used successfully for visual analysis of automatically generated text maps.
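
As an illustration of the similarity measure named in the title, the following is a minimal sketch of the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. The abstract does not say which compressor the paper uses, so zlib is an assumption here, and the function names are hypothetical. The resulting pairwise matrix is the kind of input a distance-based projection would consume.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Approximate Kolmogorov complexity by compressed length (zlib is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(docs: list[str]) -> list[list[float]]:
    """Pairwise NCD over raw documents; no vector representation is built."""
    encoded = [d.encode("utf-8") for d in docs]
    n = len(encoded)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # NCD is only approximately symmetric in practice, so one
            # value is computed per pair and mirrored across the diagonal.
            dist[i][j] = dist[j][i] = ncd(encoded[i], encoded[j])
    return dist


if __name__ == "__main__":
    docs = [
        "Compression-based similarity compares documents by how well they compress together.",
        "Documents are compared by compression-based similarity, using how well they compress jointly.",
        "The recipe calls for two eggs, a cup of flour, and a pinch of salt.",
    ]
    for row in distance_matrix(docs):
        print(" ".join(f"{v:.3f}" for v in row))
```

Values near 0 indicate that one document adds little new information to the other, and values near 1 indicate unrelated content. With real compressors the measure is noisy for very short texts, so this sketch is better suited to full documents than to single sentences.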

Original language: English
Pages (from-to): 327-337
Number of pages: 11
Journal: Computers and Graphics
Volume: 31
Issue number: 3
Publication status: Published - Jun 2007
Externally published: Yes

Keywords

  • Document visualization
  • Kolmogorov complexity
  • Multi-dimensional projection
  • Normalized compression distance
  • Text collection visualization
