Wednesday, August 20, 2014

Word breaks in the Rohonc Codex

I have developed some really cool tools that work on the word level of a text, and identify things like the level of semantic content in words (to differentiate function words from content words), semantic similarity, and so forth.

Unfortunately, I can't use those tools on the Rohonc Codex yet because the transcription is on the level of the glyph, not the word.

But the codex gives us a few tiny clues to word breaks, and I think I have figured out how to leverage those to split the text into words.

The main clue is the hyphen, which looks like a double line. A hyphen at the end of a line tells us where a word break probably is not, and the absence of a hyphen tells us where a word break probably is. This allows us to build a small table of likely word heads and tails, which can then be used to find word breaks inside a line: a break is likely wherever a common tail is immediately followed by a common head.
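To make the idea concrete, here is a minimal sketch of how such a table could be built from a line-by-line transcription. It is an illustration under assumptions, not the code actually used: the function name `build_tables`, the use of plain strings for lines, and the "=" character standing in for the double-line hyphen are all hypothetical.

```python
from collections import Counter

def build_tables(lines, n=3, hyphen="="):
    """Count likely word heads and tails from line edges.

    ASSUMPTION: each line is a string of glyph codes, and a trailing
    `hyphen` marker means the word continues on the next line.
    """
    heads, tails = Counter(), Counter()
    prev_hyphenated = False  # the start of the text counts as a word start
    for line in lines:
        hyphenated = line.endswith(hyphen)
        body = line[:-len(hyphen)] if hyphenated else line
        if len(body) >= n:
            # A line start is a word start unless the previous line was hyphenated.
            if not prev_hyphenated:
                heads[body[:n]] += 1
            # A line end is a word end unless this line is hyphenated.
            if not hyphenated:
                tails[body[-n:]] += 1
        prev_hyphenated = hyphenated
    return heads, tails
```

Note that a hyphenated line removes two observations at once: its own ending is not a tail, and the beginning of the following line is not a head.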

As a proof-of-concept, I took an English text with 47,975 words, removed all of the spaces and punctuation, then divided the text into lines of approximately 80 characters, breaking lines at word boundaries. I then measured the frequency of every segment that started or ended a line of text, and wrote a scoring function that looks like this:
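The preparation step can be sketched roughly as follows. This is a hedged reconstruction of the setup described above, not the original script: the names `wrap_without_spaces` and `edge_counts` are hypothetical, and lowercasing the text is an extra assumption that the description does not mention.

```python
import re
from collections import Counter

def wrap_without_spaces(text, width=80):
    """Strip punctuation and spaces, then wrap at word boundaries."""
    # ASSUMPTION: lowercase everything and keep only the letters a-z.
    words = re.sub(r"[^a-z\s]", "", text.lower()).split()
    lines, current = [], ""
    for word in words:
        # Start a new line if adding this word would exceed the width.
        if current and len(current) + len(word) > width:
            lines.append(current)
            current = ""
        current += word
    if current:
        lines.append(current)
    return lines

def edge_counts(lines, n=3):
    """Count the n-character strings that start and end each line."""
    heads = Counter(line[:n] for line in lines if len(line) >= n)
    tails = Counter(line[-n:] for line in lines if len(line) >= n)
    return heads, tails
```

Because every line breaks at a word boundary, every line start is a true word head and every line end is a true word tail, which is the situation that unhyphenated lines in the codex are assumed to approximate.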


In this equation, b(x, y) is the "break score" of the point between strings x and y, t_x is the number of times that string x is found at the end of a line, and h_y is the number of times that string y is found at the start of a line.

If the break score crosses a certain threshold, then I insert a word break between strings x and y.
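A sketch of the thresholding step might look like the following. Two things here are assumptions and should be read as such: the scoring formula is a stand-in (a simple product of the tail count and the head count, chosen only for illustration and not taken from the post), and "crosses the threshold" is interpreted as greater than or equal to. The names `break_score` and `segment` are hypothetical.

```python
def break_score(x, y, tails, heads):
    """Score the point between strings x and y.

    ASSUMPTION: the product of the two counts is only a placeholder
    for the real scoring function.
    """
    return tails.get(x, 0) * heads.get(y, 0)

def segment(text, tails, heads, n=3, threshold=10):
    """Insert a word break wherever the break score reaches the threshold."""
    pieces, start = [], 0
    # Slide over every point that has n characters on both sides.
    for i in range(n, len(text) - n + 1):
        x, y = text[i - n:i], text[i:i + n]
        if break_score(x, y, tails, heads) >= threshold:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces
```

With toy counts such as `tails = {"cat": 5}` and `heads = {"sat": 4}`, the string "thecatsat" would be split between "thecat" and "sat", because that is the only point where both the preceding and following strings have been seen at line edges.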

There are a few parameters that govern how you actually implement this. The first is the length of the strings x and y, and the second is the threshold to use. Experimenting with my English text, I found the optimal string length to be 3, and the optimal threshold to be 10. With these parameters, I was able to correctly identify 78% of word breaks in the English text, with only one incorrect word break for every 10 correct word breaks.
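In more standard terms, the 78% figure is a recall, and one incorrect break for every 10 correct ones corresponds to a precision of 10/11, or about 91%. A small sketch of how both numbers could be measured against the known word boundaries follows; the function names are hypothetical, and it assumes the true and predicted segmentations cover the same underlying string.

```python
def break_positions(words):
    """Return the character offsets at which word breaks occur."""
    positions, offset = set(), 0
    for word in words[:-1]:  # no break after the final word
        offset += len(word)
        positions.add(offset)
    return positions

def evaluate(true_words, predicted_words):
    """Compute (recall, precision) of predicted breaks against true breaks."""
    truth = break_positions(true_words)
    predicted = break_positions(predicted_words)
    correct = len(truth & predicted)
    recall = correct / len(truth) if truth else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision
```

Running `evaluate` over a grid of string lengths and thresholds is one way to search for the best parameters, since a higher threshold generally trades recall for precision.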

Before I apply this to the Rohonc Codex, I think I will put some more effort into cleaning up the transcription. For the word-breaking algorithm to work reliably, I will need to identify lines that are damaged and may be missing glyphs at either end, since those lines would add false heads and tails to the table.

In the meantime, I am wondering what it would look like if we apply this to the Voynich manuscript. It would be interesting if the apparent word breaks in the VM don't strongly correlate with the word breaks identified by this approach.
