Monday, April 21, 2014

Rohoncian looks like a case-marking language with prepositions

The computer-assisted transcription of the Rohonc codex is about 61% done. So far, I'm resisting the urge to do any kind of statistical analysis because the results will almost certainly be skewed by the fact that my glyph recognition algorithm has a harder time with some glyphs than others.

However, I came across something today that I thought was interesting. Throughout the codex, there is a sequence of four glyphs that is usually written with a halo over it, as follows:

Co D Co D

Starting with this post, I'll use my provisional transcription system, so I don't have to talk about "the glyph that looks like a triangle", and so forth. So this sequence is transcribed as Co D Co D.

I think the odds are good that this is a noun, or a proper noun, but of course it is not clear what it is. It could be an epithet of God, the Holy Spirit, or even an abstract noun like Grace, but for the sake of convenience I'll call it "the holy noun". At first blush, it looks like a reduplicated stem, but there are a few cases where it appears to be inflected.

In an earlier post, I suggested that the simple line bending to the left was a preposition like "on". I now transcribe this glyph as L (for "left"), and in the following sequence you can see two cases where the holy noun is preceded by L, and at the same time the ending of the sequence changes:

L C D C I Ix Hk C D C I

It may or may not be important, but in this case the glyph Co is replaced by C. But since the only difference between the two is a small loop at the top, it is possible that they are allographs of the same grapheme.

More important is the suffix. It looks like the (tentatively) nominative suffix D becomes I when the noun is prefixed by the preposition L.

Compare also the following:

O Co D C D C

It seems fairly clear that this is the same holy noun (marked with a halo, as always), but it has an added C suffix after the D. This may or may not be related to the O that precedes the whole word.

If these sequences really do show prepositions and case-marking, then it narrows the field of candidate languages quite a bit. We would be looking for a language with prepositions and at least three cases. Here is the rough paradigm of a noun in -D:

Possible Nominative: stem + D
Oblique A: stem + I
Oblique B: stem + D C
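
The paradigm above can be sketched as a small suffix-splitting routine. This is only an illustration under the assumptions made in this post: that Co and C are allographs of one grapheme, and that the stem of the holy noun is C D C. The names `ALLOGRAPHS`, `STEM`, and `split_holy_noun` are made up for this example.

```python
# Assumption from the post: Co and C may be allographs of the same grapheme.
ALLOGRAPHS = {"Co": "C"}

# Assumption: the stem of the "holy noun" is C D C once allographs are merged.
STEM = ("C", "D", "C")


def normalize(glyphs):
    """Map each glyph to its assumed base grapheme."""
    return tuple(ALLOGRAPHS.get(g, g) for g in glyphs)


def split_holy_noun(transcription):
    """Return (stem, suffix) if the glyphs start with the assumed stem, else None."""
    glyphs = normalize(transcription.split())
    if glyphs[:len(STEM)] != STEM:
        return None
    return STEM, glyphs[len(STEM):]


# The three attested forms discussed above, with their tentative labels.
forms = {
    "Possible Nominative": "Co D Co D",
    "Oblique A": "C D C I",
    "Oblique B": "Co D C D C",
}

for label, form in forms.items():
    stem, suffix = split_holy_noun(form)
    print(f"{label}: stem={' '.join(stem)} suffix={' '.join(suffix)}")
```

If the allograph assumption is wrong, the table in `ALLOGRAPHS` can simply be emptied, and the first and third forms will then no longer share a stem with the second.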

Thursday, April 10, 2014

Dividing work between man and machine

This project to write a text recognizer for the Rohonc codex has been really rewarding. The quality of the text is so poor that the solutions have to be really clever, and that is what makes it so fun. I've learned a massive amount about image manipulation and text recognition, and I have a whole slew of projects I want to undertake when I'm done with this one.

I've spent about four hours training my glyph recognition algorithm, and I think it identifies glyphs correctly about 80-90% of the time. Right now, I can process a single line of text in under a second, but it takes me about 15-30 seconds to manually verify the transcription and fix any errors that crop up. That seems pretty fast, but when you multiply it out by 4285 lines, it comes to about 25 hours of manual work. I need to pare that down, because it'll take me forever to scrape together 25 hours of free time.
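
The back-of-the-envelope estimate can be written out as follows. The line count and per-line times are the figures quoted above; the function name `manual_hours` is just for illustration.

```python
LINES = 4285  # number of lines quoted above


def manual_hours(seconds_per_line, lines=LINES):
    """Hours of manual verification at a given pace per line."""
    return lines * seconds_per_line / 3600


low = manual_hours(15)   # fastest quoted pace
high = manual_hours(30)  # slowest quoted pace
print(f"{low:.1f} to {high:.1f} hours")

# Pace per line that corresponds to a 25-hour total.
pace_for_25_hours = 25 * 3600 / LINES
print(f"{pace_for_25_hours:.1f} seconds per line")
```

At 15 to 30 seconds per line the total falls between roughly 18 and 36 hours, so the 25-hour figure corresponds to about 21 seconds per line.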

A lot of this project has involved dividing labor between me and the machine, making the most of what the machine can do without my intervention, and making the best use of my feedback on good and bad matches. The code has changed constantly but has stayed stable. It is organized around a powerful set of core functionality, with the simplest and most ergonomic user interface for each task.