Friday, September 13, 2013

Visualizing the semantic relationships within a text

In the last few posts, I had some images that were supposed to give a rough idea of the semantic relationships within a text.  Over the last few days I have been rewriting my lexical analysis tools in C# (the old ones were in C++ and Tcl), and that has given me a chance to play with some new ways of looking at the data.

When I depict this data in two dimensions, I'm really showing something like the shadow of a multi-dimensional object, and I have to catch that object at the right angle to get a meaningful shadow.  In the following images, I have imagined the tokens in the text as automata that are bound by certain rules but otherwise behave randomly, in an attempt to produce an organic image of the data.

For example, each point of this image represents a word in the lexicon of the King James version of the book of Genesis.  The words all started at random positions in space, but were made to iteratively organize themselves so that their physical distances from other words roughly reflected the semantic distance between them.  After running for an hour, the process produced the following image:


There are clearly some groups forming within the universe of words.  Since this process is iterative, the longer it runs the more ordered the data should become (up to a point), but I haven't had the patience to run it for more than an hour.
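
For readers who want to experiment, here is a minimal Python sketch of this kind of self-organizing layout. It is not the C# tool described in this post. It assumes the semantic distances are already available as a square matrix `dist` (how they are computed is not covered here), and the update rule, which nudges a randomly chosen pair of words toward their target separation, is just one plausible choice. The names `organize`, `stress`, `steps` and `rate` are made up for illustration.

```python
import math
import random

def organize(dist, steps=5000, rate=0.05, seed=0):
    """Start words at random 2-D positions, then repeatedly nudge random pairs
    so that their distance on the page approaches dist[i][j]."""
    rng = random.Random(seed)
    n = len(dist)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        dx = pos[j][0] - pos[i][0]
        dy = pos[j][1] - pos[i][1]
        d = math.hypot(dx, dy) or 1e-9  # avoid dividing by zero
        # Positive when the pair is too far apart, negative when too close.
        err = (d - dist[i][j]) / d
        pos[i][0] += rate * err * dx
        pos[i][1] += rate * err * dy
        pos[j][0] -= rate * err * dx
        pos[j][1] -= rate * err * dy
    return pos

def stress(pos, dist):
    """Sum of squared differences between layout distances and target distances."""
    total = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            total += (math.dist(pos[i], pos[j]) - dist[i][j]) ** 2
    return total
```

Calling `stress` on the layout before and after running `organize` gives a rough number for how ordered the picture has become, which is one way to decide when to stop iterating.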

The same process applied to the Voynich manuscript produces something smoother, as you can see below.  However, the VM has a much broader vocabulary, so I estimate I would need about nine hours of processing to arrive at the same degree of evolution as I have in the image above.


Reflecting on that, I thought I would try a new algorithm where words are attracted to each other in proportion to their similarity.  The result was not interesting enough to show here, since the words rapidly collapsed into a small number of points.  I am certain that this data would be interesting to analyze, but it is not interesting to look at.
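
The collapse is easy to reproduce on a toy example. The Python sketch below assumes a similarity matrix `sim` with values between 0 and 1 and uses a simple rule in which each word moves toward every other word by an amount proportional to their similarity. The rule and the names `attract` and `count_clusters` are assumptions made for illustration, not the algorithm actually used for the post.

```python
import math
import random

def attract(sim, steps=200, rate=0.1, seed=0):
    """Move every word toward the others in proportion to similarity.
    Assumes rate times the largest row sum of sim stays below 1,
    otherwise the positions overshoot and oscillate."""
    rng = random.Random(seed)
    n = len(sim)
    pos = [(rng.random(), rng.random()) for _ in range(n)]
    for _ in range(steps):
        new_pos = []
        for i in range(n):
            fx = sum(sim[i][j] * (pos[j][0] - pos[i][0]) for j in range(n) if j != i)
            fy = sum(sim[i][j] * (pos[j][1] - pos[i][1]) for j in range(n) if j != i)
            new_pos.append((pos[i][0] + rate * fx, pos[i][1] + rate * fy))
        pos = new_pos
    return pos

def count_clusters(pos, tol=1e-6):
    """Count distinct points, treating points closer than tol as one."""
    reps = []
    for p in pos:
        if not any(math.dist(p, r) < tol for r in reps):
            reps.append(p)
    return len(reps)
```

With two pairs of mutually similar words that have nothing in common across pairs, the four starting points end up as just two points, which is the same kind of collapse in miniature.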

Thinking about the fractal nature of this collapse, I thought I would use similarity data to model something like the growth of a plant.  In the following image, I have taken the lexicon of the King James Genesis again and used it to grow a vine.  I started with one arbitrarily chosen token, then added the others one at a time, joining each to whatever existing point was semantically nearest.  For each new token, I tried up to ten random angles of growth, looking for one whose line did not cross an existing line; if all ten failed, I allowed the new line to overlap the others.

I am quite pleased with the result here, since it shows what I have always intuitively felt about the data--that there are semantic clusters of words in the lexicon.
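
The vine-growing procedure can be sketched in a few lines of Python. This is an illustration under several assumptions rather than the code behind the image: semantic distances are taken as a given matrix `dist`, tokens are added in index order starting from token 0, every segment has the same length, and only proper crossings count as overlaps (segments that merely touch at a shared endpoint or lie on the same line are ignored). The names `grow_vine` and `crosses` are made up for this example.

```python
import math
import random

def _side(a, b, c):
    """Signed area telling which side of the line a-b the point c lies on."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def crosses(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 properly cross each other."""
    return (_side(q1, q2, p1) * _side(q1, q2, p2) < 0 and
            _side(p1, p2, q1) * _side(p1, p2, q2) < 0)

def grow_vine(dist, length=1.0, tries=10, seed=0):
    """Attach each token to its semantically nearest placed token,
    choosing a random growth angle that avoids existing segments if possible."""
    rng = random.Random(seed)
    pos = {0: (0.0, 0.0)}  # arbitrary starting token
    edges = []             # (parent, child) pairs
    for child in range(1, len(dist)):
        parent = min(pos, key=lambda p: dist[child][p])
        px, py = pos[parent]
        for _ in range(tries):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            tip = (px + length * math.cos(angle), py + length * math.sin(angle))
            if not any(crosses(pos[parent], tip, pos[a], pos[b]) for a, b in edges):
                break
        # If every try crossed something, the last candidate is kept anyway.
        pos[child] = tip
        edges.append((parent, child))
    return pos, edges
```

The returned `edges` list is enough to draw the vine with any plotting tool, one line segment per pair.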


The same approach applied to the Voynich manuscript produces a denser image, due to the greater breadth of the vocabulary:


But how does this compare to random data?  To answer this question, I processed a random permutation of the Voynich manuscript, so the text would have the same word frequencies and lexicon as the Voynich manuscript, but any semantic context would be destroyed.  Here is the result:


Intuitively, I feel that the random data is spread more evenly than the Voynich data, but to get beyond intuition I need a metric I can use to measure it.
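
For anyone who wants to build the same kind of control text, a random permutation is simple to produce. The Python sketch below assumes the text has already been split into a list of word tokens; the function name `shuffle_tokens` is made up for illustration.

```python
import random
from collections import Counter

def shuffle_tokens(tokens, seed=None):
    """Return a randomly permuted copy of the token list.
    Word frequencies and the lexicon are unchanged; only the order is lost."""
    rng = random.Random(seed)
    shuffled = list(tokens)  # copy, so the original order is preserved
    rng.shuffle(shuffled)
    return shuffled
```

Comparing `Counter(tokens)` with `Counter(shuffle_tokens(tokens))` is a quick check that the frequencies really are identical.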

Good night.
