Friday, March 21, 2014

Starting the final phase of Rohonc transcription (I think)

I think I've finally settled on a workable process for machine-assisted transcription of the Rohonc Codex.

I tried several approaches before landing on the current one. One approach was to analyze each page in a top-down way, first identifying the areas that contained text, then splitting those into lines, and splitting the lines into glyphs. The other approach was bottom-up: First, identify glyphs, then identify lines.

No single automated process was able to correctly split the pages up 100% of the time, so I have adopted a top-down automated approach with manual overrides. I have now broken all of the pages down to the line level, and I have some code that picks out glyphs from a line with great accuracy.
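As a rough illustration of the top-down step with manual overrides (not the code actually used for this project), the sketch below splits a page into lines by looking for runs of rows that contain dark pixels. It assumes the page has already been reduced to a list of rows of 0 (light) and 1 (dark); the names `split_into_lines`, `apply_overrides`, and `min_dark` are hypothetical.

```python
def split_into_lines(bitmap, min_dark=1):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    Assumption: bitmap is a list of rows, each a list of 0 (light) / 1 (dark).
    A row counts as text if it holds at least min_dark dark pixels.
    """
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        is_text = sum(row) >= min_dark
        if is_text and start is None:
            start = y                      # a new line begins here
        elif not is_text and start is not None:
            lines.append((start, y))       # the line ended on the previous row
            start = None
    if start is not None:                  # line runs to the bottom of the page
        lines.append((start, len(bitmap)))
    return lines


def apply_overrides(lines, overrides):
    """Replace automatic bounds with manual ones.

    overrides maps a line index to a hand-corrected (top, bottom) pair.
    """
    return [overrides.get(i, bounds) for i, bounds in enumerate(lines)]


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
auto_lines = split_into_lines(page)
fixed_lines = apply_overrides(auto_lines, {1: (3, 6)})
```

A real page would need something more forgiving than a strict empty-row test, since handwritten lines drift and touch each other, which is exactly where the manual overrides come in.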

Now comes the fun part: writing (and training) the glyph-recognition algorithm.

I've decided to use the "mark" as my fundamental unit of text. A mark is a single, contiguous, dark shape on the page within the bounds of a text line. Many Rohonc glyphs consist of a single mark, but many others consist of a core mark with one or more satellite marks. Most satellite marks are single dots above or to the left of the core mark, but some are dashes, and some are haloes that surround a core mark.
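The definition of a mark corresponds to what image processing calls a connected component. A minimal sketch of extracting marks from a line, assuming the line is a list of rows of 0/1 values and that diagonal neighbors count as contiguous (both are my assumptions here, and `find_marks` is a hypothetical name):

```python
from collections import deque


def find_marks(bitmap):
    """Return a list of marks, each a sorted list of (row, col) dark pixels.

    Assumption: pixels touching at a corner belong to the same mark
    (8-connectivity).
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    marks = []
    for r in range(height):
        for c in range(width):
            if not bitmap[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited dark pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            marks.append(sorted(pixels))
    return marks


line = [
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
marks = find_marks(line)  # two separate marks in this tiny example
```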

My glyph-recognition algorithm will start out by finding the best match between the marks on a new line and any that have been previously identified. This will be followed by a manual intervention step where I can correct any incorrect automated matches, or reject a mark as being non-text. When the line has been completely treated, constellations of marks will be matched to known glyphs.

I have wrestled with several different approaches to matching marks. One approach is to simply overlay one mark upon another, count the dark points that differ between them, and calculate the ratio between that number and the total number of points.
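A minimal sketch of the overlay comparison, under two assumptions of mine: the marks are aligned by the top-left corners of their bounding boxes, and the "total number of points" means the dark points found in either mark. Each mark is a collection of (row, col) pixels; `normalize` and `overlay_mismatch` are hypothetical names.

```python
def normalize(pixels):
    """Shift a non-empty mark so its bounding box starts at (0, 0)."""
    min_r = min(r for r, _ in pixels)
    min_c = min(c for _, c in pixels)
    return {(r - min_r, c - min_c) for r, c in pixels}


def overlay_mismatch(mark_a, mark_b):
    """Return the fraction of dark points that differ, from 0.0 to 1.0.

    0.0 means the marks are identical after alignment; 1.0 means they
    share no dark points at all.
    """
    a = normalize(mark_a)
    b = normalize(mark_b)
    differing = len(a ^ b)   # dark in one mark but not the other
    total = len(a | b)       # dark in at least one mark (assumed denominator)
    return differing / total


mark_a = [(5, 5), (5, 6), (6, 5)]
mark_b = [(0, 0), (0, 1), (1, 1)]
score = overlay_mismatch(mark_a, mark_b)  # 2 differing points out of 4
```

Top-left alignment is fragile for handwriting; trying a few small offsets and keeping the lowest score would be one way to soften that.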

Another approach that I am toying with is to use a two-dimensional version of Levenshtein distance. One way to do that would be to treat each row and column of the mark bitmaps as a string, calculate the individual Levenshtein distances, and sum them up to come up with a total distance.
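Here is one way the row-and-column idea might look, as a sketch rather than a settled design. It assumes rectangular 0/1 bitmaps, pairs rows and columns by index, and compares any unpaired row or column against an empty string; those choices and the name `distance_2d` are mine.

```python
from itertools import zip_longest


def levenshtein(s, t):
    """Classic edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rows(bitmap):
    """Each row of the bitmap as a string of '0' and '1'."""
    return ["".join(str(v) for v in row) for row in bitmap]


def columns(bitmap):
    """Each column of the bitmap as a string of '0' and '1'."""
    return ["".join(str(v) for v in col) for col in zip(*bitmap)]


def distance_2d(a, b):
    """Sum of row-wise and column-wise Levenshtein distances."""
    row_total = sum(levenshtein(x, y)
                    for x, y in zip_longest(rows(a), rows(b), fillvalue=""))
    col_total = sum(levenshtein(x, y)
                    for x, y in zip_longest(columns(a), columns(b), fillvalue=""))
    return row_total + col_total


a = [[1, 1],
     [0, 1]]
b = [[1, 0],
     [0, 1]]
total = distance_2d(a, b)  # one differing row plus one differing column
```

Because the total grows with the size of the bitmaps, it is not directly comparable to a ratio such as the overlay score.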

But some calibration would be needed to make an apples-to-apples comparison between different match scores.

Sunday, March 9, 2014

Interesting features of Rohonc script and character recognition

I've been working on code that can scan the images of the Rohonc Codex and help me transcribe it. Hopefully I will be able to complete the transcription relatively quickly with the assistance of some code that can recognize and categorize graphemes (and remember what wacky name I decided to give each character).

In the process, I have unearthed a wealth of interesting detail and challenges.

Regarding the grapheme recognition process, the challenges are many. The script is hand-written, the lines are irregular, and the scanned pages are not necessarily square with the edges of the images. My approach is to identify individual marks, place them in a network together with other similar marks, and differentiate them based on the local density of the area of the network in which they appear. Then, I think I can recognize constellations of marks as graphemes, and start training the program to do the transcription.

It is clear to me at this point that this is going to be more of a computer-assisted transcription project than a pure computer transcription, but even so the work should go much more quickly with the aid of a machine whose eyes never tire.

One of the challenges I have had to overcome is distinguishing between stray dots on the page and the dots that are intended to be part of a grapheme. Unless I am mistaken, it appears that the dots that accompany a grapheme always appear above or to the left of the main shape of the grapheme. I suspect this is related to the right-to-left direction of the text.
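The above-or-left observation suggests a simple positional test for telling intended dots from stray ones. The sketch below is only an illustration under my own assumptions: marks are lists of (row, col) pixels with rows increasing downward, the test works on bounding boxes, and `max_gap` is a hypothetical tolerance in pixels.

```python
def bounding_box(pixels):
    """Return (top, left, bottom, right) of a non-empty mark, inclusive."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)


def is_satellite(dot, core, max_gap=3):
    """True if dot sits just above or just left of core.

    Rows increase downward, so "above" means a smaller row number.
    max_gap is an assumed tolerance, not a measured value.
    """
    d_top, d_left, d_bottom, d_right = bounding_box(dot)
    c_top, c_left, c_bottom, c_right = bounding_box(core)
    above = (d_bottom < c_top
             and c_top - d_bottom <= max_gap
             and d_right >= c_left - max_gap   # allow a little overhang left
             and d_left <= c_right)
    left = (d_right < c_left
            and c_left - d_right <= max_gap
            and d_bottom >= c_top - max_gap    # allow a little overhang above
            and d_top <= c_bottom)
    return above or left


core = [(5, 5), (9, 9)]           # corners are enough to fix the bounding box
dot_above = [(3, 7)]
dot_left = [(7, 3)]
dot_below = [(11, 7)]
dot_right = [(7, 11)]
```

With these inputs, `dot_above` and `dot_left` pass the test and `dot_below` and `dot_right` are treated as stray.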

In categorizing the graphemes, I am running into a problem I have wrestled with for years, ever since I first started thinking about ways to automate the recognition of patterns. I call it the "cloud-within-the-network" problem, and I need to find out what the proper answer to it is.

The "cloud-within-the-network" problem works like this: Suppose you have some dense networks, and you loosely connect them to each other in a larger network. How do you computationally recognize the existence of the dense networks within the larger loose network?
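To make the problem concrete, here is one common heuristic, offered as a sketch with known weaknesses rather than as the right answer: drop every edge whose endpoints share too few neighbors, then treat the remaining connected pieces as the dense networks. The function name `dense_clusters` and the threshold `min_shared` are hypothetical.

```python
def dense_clusters(edges, min_shared=1):
    """Group nodes into clusters after removing weakly supported edges.

    edges is a list of (node, node) pairs in an undirected network.
    An edge survives only if its endpoints have at least min_shared
    neighbors in common.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Keep only the edges that are backed by shared neighbors.
    strong = {node: set() for node in neighbors}
    for a, b in edges:
        if len(neighbors[a] & neighbors[b]) >= min_shared:
            strong[a].add(b)
            strong[b].add(a)

    # Connected components over the surviving edges.
    clusters = []
    seen = set()
    for node in neighbors:
        if node in seen:
            continue
        seen.add(node)
        stack = [node]
        component = set()
        while stack:
            current = stack.pop()
            component.add(current)
            for other in strong[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(component)
    return clusters


# Two triangles joined by a single loose edge between C and X.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("X", "Y"), ("Y", "Z"), ("X", "Z"),
         ("C", "X")]
clusters = dense_clusters(edges)
```

The weakness is typical of the problem: the result depends on the threshold, and two dense networks joined by a few loose edges that happen to share a neighbor will be merged.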

It seems like it should be relatively simple, but every solution I think up seems to have a problem with it. In the case of this transcription project, I have a workaround, but some day I would like to find the right solution.

Monday, March 3, 2014

Rohonc Transcriber (stage 1)

In a recent post, I said I would write a program to transcribe the Rohonc Codex.

Tonight I did the first part. I wrote some code to identify lines of text and graphemes. The image below shows a page of text, with the first-pass graphemes marked by green rectangles.


This is just a first pass. Some of these rectangles enclose multiple graphemes, and they will need to be split apart.

Next, I think I'll build a database of all of the grapheme images, then compare them to each other to identify image families.

Sunday, March 2, 2014

One possible explanation for the reference to "John 22"

In a recent post, I noted a problem with an apparent scriptural reference to the book of John chapter 22. The book of John only has 21 chapters according to the modern system of division, and in the Byzantine manuscripts that I have been able to find, there are only 18 or 19 kephalaia.

However, in the section of the RC in question, the number 22 is written with the number 2 followed by the number 20, as though it were the Roman numeral IIXX. I had thought this meant "two and twenty", but another explanation could be that this is a language where the number 18 is expressed as "two from twenty".

The only language I can think of at the moment where this is done is Latin: duodēvīgintī, but there may be others. Indeed, it turns out the Romans sometimes wrote 18 as IIXX, which lends a little weight to this hypothesis.

For this explanation to work, we would have to assume that the references are to a copy of the New Testament that is divided according to an earlier system like the Byzantine kephalaia.

If so, then this passage may offer us a good crib. I may look into that this evening.

Transcription of the Rohonc Codex

It's time for a free and open transcription of the Rohonc Codex.

Unfortunately, I don't have the time to do it by hand, and we do not yet have a universal crowd-sourcing platform where I can set it up. So I'll write a program to scan the images and do an initial pass at automating the transcription.

When it is done, I will release it for free.

Edit: I did get a long way on the transcription, but life intervened so it is not yet complete. The latest revision of the transcription is here: http://quint.us/Roho/.