Thursday, December 11, 2014

The Three Tablets

Page 14 shows three tablets with writing on them, and two figures standing among them.


Delia Huegel takes this to be Moses, Aaron, and the tables of the law, but she notes the odd fact that there are three tables instead of two.

The text on the second and third tablets begins with the formulae I I CQ IGV and I I I CQ IGV, which probably mean something like "the second X" and "the third X". Since the first tablet has Q1C in the place of the number "one", I expect Q1C represents a word that is semantically equivalent to "first", but etymologically unrelated to the number "one" (e.g. English first, Latin primus).

Since the tables are numbered one, two and three, I don't think they represent the tables of the law, but rather a set of three things or ideas. The obvious candidate is the holy trinity.

The usual order of the three persons (or hypostases) of the trinity is established in Matthew 28:19 as the Father, the Son and the Holy Spirit, in which case the third tablet would represent the Holy Spirit.

The text of the third tablet contains the glyph RT, which I have previously read as "saint". In a number of languages, the word for "saint" is just a nominal form of the adjective "holy", as it is in Latin: Spiritus Sanctus. This suggests that the word either before or after RT on the third tablet might be "spirit", forming the phrase "holy spirit".

The glyphs of the first and third tablets could be nearly the same, though in a different order, up until the final phrase:

First: Q1C CQ IGV [?] K1A1A I I RAA O X2 O C I RAA C F O R CO [?]
Third: I I I CQ IGV O IGDA O O X2 O K1A1A I I RAA C I RAA RT CUNSAR I IX O O

The word C.F.O.R.CO is interesting because one of my earlier algorithms (about which I did not write) identified it as a likely alphabetic word. Is it possible to read C.F.O.R.CO as "father", and RT CUNSAR.I.IX.O.O as "holy spirit"? If so, how do I test that reading?

Tuesday, December 9, 2014

Heresy? Apocryphy?

I've been running a bunch of simulations to see how the Rohonc data match against alphabetic data from Old Hungarian, Old Albanian, Latin and Old Church Slavonic (in Glagolitic and Cyrillic). My plan was to take predictions based on this data and test them against the names of the evangelists, to see which prediction best explained those names.

There are a number of scenarios where D could be read as t, so my reading of XDC.D as Matthew could work (e.g. as ma-t, mat-t or something similar). However, nothing in my simulations seems to let me read CO.IH.D as Luke.

I agonized over this for a while, then went back to the page where I tried to relate scriptural references to images. I realized that the only strong evidence for reading CO.IH.D as Luke was the triple scriptural reference, and that reading was based on the assumption that the references were to canonical gospels. But if that assumption was wrong, then there was really very little reason to read CO.IH.D as Luke.

Indeed, an image accompanying a reference to CO.IH.D chapter 6 is problematic, since it features an angel (or perhaps a winged Christ) appearing to a man lying on the ground, and I could not match that to anything in the sixth chapter of Luke.

So what would it mean if CO.IH.D is not Luke?

Looking at the images corresponding to CO.IH.D, each one involves a figure with a striped turban and a beard, twice with wings, usually with one other person, though sometimes alone outside a city. Chronologically (according to chapter) the images can be arranged as follows:

Chapter 1

Chapter 6

Chapter 7

Chapter 8

Chapter 9

Chapter 11

Chapter 17

Elsewhere (e.g. in the image of Jesus entering Jerusalem) the figure with the striped turban and pointed beard is Christ, so whatever CO.IH.D is, it would seem to contain a gospel-like narrative of the life or ministry of Christ.

There are any number of candidates among known apocrypha, but I suppose I should start with anything that mentions angels in the first and sixth chapters, or else mentions Christ appearing in the form of an angel.

Friday, November 28, 2014

Old Church Slavonic, Revisited

After a circuitous route of reading, I decided to revisit Old Church Slavonic. It started when I was looking at some of the contested inscriptions in the Basarabi Cave Complex, where I saw the following sequence:
With the bar over it, it looks like either an abbreviation or a nomen sacrum. It made me think of the Rohonc word I tentatively read as "Christ":


From there, I started reading about Glagolitic, and I realized that I should probably look at letter frequencies in Old Church Slavonic as it was written in both Glagolitic and Cyrillic, so I reanalyzed OCS using the Codex Marianus (Glagolitic) and Codex Suprasliensis (Cyrillic).

Codex Marianus
           Initial   Final    All
izhe       9.1%      14.8%    8.3%
jest       5.0%      16.1%    8.0%
big jer    0.0%      23.4%    7.7%
on         5.3%      8.1%     6.5%
az         2.1%      7.9%     6.2%
tverdo     3.4%      0.0%     5.9%

Codex Suprasliensis
           Initial   Final    All
izhe       11.3%     18.1%    8.7%
az         2.4%      10.6%    7.4%
on         4.1%      9.1%     7.3%
jest       1.2%      11.4%    6.3%
big jer    0.1%      15.4%    5.8%
tverdo     4.1%      1.0%     6.8%

One of the interesting things about Glagolitic is that some of the forms of these letters resemble the forms of the most common Rohonc letters. But I can only steal 10 minutes away today, so I'll have to get back to that in another post.

Wednesday, November 12, 2014

Comparison with Old Church Slavonic and Koine Greek

In this post, I'll complete my initial attempt to compare the initial and final frequencies of the three most common Rohonc glyphs with the three most common letters in some candidate languages.

First, Old Church Slavonic:

           Initial   Final    All
И          6.8%      12.5%    7.8%
Є          4.1%      13.5%    7.4%
Ъ          0.0%      19.91%   7.2%

Old Church Slavonic differs from Rohoncian in that the most common letter is more frequent as a final than as an initial, and there is a wide disparity between the frequency of the second letter as initial and final.

Second, Koine Greek:

           Initial   Final    All
α          12.8%     6.2%     11.0%
ε          15.4%     6.9%     10.1%
ο          10.1%     5.9%     10.1%

Koine Greek differs from Rohoncian in that the third most common letter is more common as an initial than as a final (though if the iota ended up in third place, it would fit well, with initial and final frequencies of 2.7% and 8.6%, respectively).

Suppose we give each of these languages a location in six-dimensional space, indicated by the relative frequencies of the top three symbols as initials and finals...what would their distances from each other be in this space? And which would be closest to Rohoncian?

Interestingly, the two languages that are closest to each other by this measurement are Rohoncian and Latin, with a distance of 0.115. Next closest to Rohoncian is Old Hungarian, with a distance of 0.130. The languages that are most distant from each other are Koine Greek and Old Albanian, with a distance of 0.293.
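The distance measurement can be sketched in a few lines. This is only an illustration: the post does not say which distance was used, so plain Euclidean distance is assumed here, and the vectors are the rounded percentages from the tables in these posts (initial and final frequency for each of the top three symbols). The names PROFILES and distance are made up for the example. With these rounded inputs the Koine Greek to Old Albanian distance comes out at about 0.294, close to the figure quoted above; other pairs may not reproduce the quoted figures exactly.

```python
import math
from itertools import combinations

# (initial, final) percentages for the three most common symbols,
# copied from the rounded tables in these posts.
PROFILES = {
    "Rohoncian":    (12.9, 4.9, 5.0, 6.4, 1.5, 12.8),
    "Latin":        (15.6, 11.6, 9.0, 7.8, 4.9, 19.7),
    "Koine Greek":  (12.8, 6.2, 15.4, 6.9, 10.1, 5.9),
    "Old Albanian": (16.4, 23.0, 3.2, 6.7, 2.3, 24.8),
}


def distance(a, b):
    """Euclidean distance between two profiles, with percentages as fractions.

    ASSUMPTION: the original measurement may have used another metric.
    """
    return math.sqrt(sum(((x - y) / 100.0) ** 2 for x, y in zip(a, b)))


# Print every pair, closest first.
pairs = sorted(
    combinations(PROFILES, 2),
    key=lambda pair: distance(PROFILES[pair[0]], PROFILES[pair[1]]),
)
for left, right in pairs:
    print(f"{left} / {right}: {distance(PROFILES[left], PROFILES[right]):.3f}")
```

Further languages can be added to PROFILES as long as the six values are kept in the same order.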

Overall, I am weakly inclined to think that Rohoncian is some kind of Latin or Hungarian. Not only does this particular measurement favor these two languages, but there are graphical similarities between the Rohonc C, I and the Latin e, i.

(In case you are wondering, I also looked at Voynichese, just for the fun of it. It differs significantly from all of the other languages I have looked at, in that the second and third most common letters occur infrequently as initials or finals.)

Tuesday, November 11, 2014

Comparison with Old Hungarian

So far I've compared the relative frequencies of initials and finals for the three most common glyphs in Rohoncian with Latin and Old Albanian. Today I'll do Old Hungarian.

For Old Hungarian, I used the four gospels from the Hussite Bible, and I counted long and short vowels together. The top three letters break down as follows:

           Initial   Final    All
e, é       16.5%     6.1%     16.6%
a, á       10.1%     8.5%     10.7%
t          6.1%      13.6%    8.0%

As in Latin and Rohoncian, the most common letter in Old Hungarian is more frequent as an initial than as a final, while the third most common letter is more frequent as a final than as an initial. As in Latin, these two letters are e and t, respectively.

You might wonder why, if there are around 1,000 glyphs in Rohonc, I am comparing it statistically to alphabets instead of syllabaries or ideographic systems. The reason is that the most frequent glyphs in Rohonc are roughly as frequent as letters ought to be. Among Latin syllables, for example, the most common syllable in the Vulgate version of Genesis is et, but it only accounts for 3.29% of syllables. The most frequent Rohonc glyph is C, and it accounts for 12.9% of glyphs, putting it in the same ballpark as the most frequent letters of alphabetic systems.


Monday, November 10, 2014

Comparison with Old Albanian

A couple of days ago I looked at the relative frequency of the three most common Rohonc symbols as initials and finals, and compared that to the relative frequency of the three most common Latin letters.

Today I'll do the same with Old Albanian. My sample text for Old Albanian is Gjon Buzuku's Meshari, the three most common letters of which are e, i and h:

           Initial   Final    All
e          16.4%     23.0%    19.9%
i          3.2%      6.7%     8.7%
h          2.3%      24.8%    8.3%

In some respects, Old Albanian fits better than Latin. Latin initial i is far more common than Rohonc initial I (9% > 5%), while Old Albanian i, like Rohonc I, is much more frequent as a final than as an initial. However, Albanian e occurs more frequently as a final than as an initial.

In order for this to work, the names of two of the evangelists would have to end in h. In fact, the names of two of the evangelists do end in h in the Meshari:


Maξeh: Matthew
March: Mark

Furthermore, Luke is also written with an h: Lucha.

Saturday, November 8, 2014

Relative frequency of initials and finals

In a previous post, I argued that we could use the presence or absence of hyphens at the end of a line to generate some basic statistics about word initials and word finals. At the time I was thinking of using this information to divide the text into words, but over the last few busy months I have been thinking about another use for this data.

In most (or all?) languages, the frequency ranking of initials differs somewhat from the ranking of finals and medials. For example, in Latin, the letter t occurs nearly four times more often at the end of a word than at the beginning, whereas u occurs about 3.5 times more frequently as an initial than a final.

Rohoncian is no different from known languages in that respect. For example, the glyph D occurs 8.5 times more frequently as a final than as an initial. If Rohoncian is a known real language, then the difference between frequency ranking in initial, medial and final positions could be used to help narrow it down.

For example, using the three most common glyphs in Rohoncian, we could construct a kind of litmus test. The relative frequencies of those glyphs are:

           Initial   Final    All
C          12.9%     4.9%     10.2%
I          5.0%      6.4%     9.7%
D          1.5%      12.8%    7.8%
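For a language with known word boundaries, a table like this can be produced with a short script. The following is a rough sketch, not the code actually used for these posts: the function name positional_table and the sample words are invented for illustration. Words may be strings of letters or lists of glyph names.

```python
from collections import Counter


def positional_table(words, top=3):
    """Return (symbol, initial share, final share, overall share) rows
    for the `top` most common symbols overall."""
    words = [w for w in words if len(w) > 0]
    initial = Counter(w[0] for w in words)
    final = Counter(w[-1] for w in words)
    overall = Counter(sym for w in words for sym in w)
    n_words = len(words)
    n_symbols = sum(overall.values())
    rows = []
    for sym, count in overall.most_common(top):
        rows.append((
            sym,
            initial[sym] / n_words,   # share of word-initial positions
            final[sym] / n_words,     # share of word-final positions
            count / n_symbols,        # share of all symbols
        ))
    return rows


# Toy input; a real run would use a full text split into words.
sample = ["et", "est", "sunt", "in", "et"]
for sym, ini, fin, share in positional_table(sample):
    print(f"{sym:<4} {ini:6.1%} {fin:6.1%} {share:6.1%}")
```

For the codex itself the word boundaries are not known, so the "words" would have to be replaced by the line-start and unhyphenated line-end evidence described above.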

If we wanted to test the theory that Rohoncian is Latin and those three glyphs are alphabetic, then we would match them up to the most common three Latin letters:

           Initial   Final    All
e          15.6%     11.6%    12.9%
i          9.0%      7.8%     11.0%
t          4.9%      19.7%    8.6%

In broad terms, this correspondence seems to work out well. C shares in common with e that both are ranked first in overall frequency and somewhat more frequent as initials. Similarly, D and t share the third position and are significantly more frequent as finals than initials.

The main problem with this, as far as earlier proposals go, is that two of the evangelists have names ending in D (i.e. CO IH D and XDC D). However, this is already a problem because it seems to work best to read those names as Luke (or Mark) and Matthew, and it is not clear what those names share in common that would lead them to be written with the same final.

On the positive side, I had previously proposed reading the word K O A D CX as "nights". If this is the word noctes, then the D falls in the right place, and CX could be read es. (The glyph CX looks like C, but with a dot).

Part of me wonders what we would get if we looked at initials, medials and finals in the Voynich manuscript. But that carcass has been picked over by smarter minds than mine, and yielded almost nothing.

Thursday, November 6, 2014

Quick (Tironian) Note

I've been very busy for a few months, and some of my projects have languished, including working on the Rohonc codex. (How do you prioritize a project that may not succeed?)

However, I happened to see a page from the Old Irish Book of Leinster that got me thinking about this again. I don't have much time (it's my lunch break) but I thought I could write a quick note about it.

The thing that caught my eye was the use of Tironian notes for their phonetic values. An example is the name "Conchobar", written (among other ways) as follows:


The first glyph in this name looks like a backwards C, but it is none other than the Tironian note for con:

Except, instead of representing the Latin morpheme con, it represents only the phonetic value. The same note is used in the name Conall. The RC does not look like a text that is written fully in Tironian notation, but it would be interesting to try to transcribe a sample of the Rohonc codex as though it were a subset of Tironian notation and see what it sounds like.

If you've ever wondered what a text written completely in Tironian notation looks like, here is a piece from the psalms, given at the end of a 9th century work titled Commentarii notarum tironianarum:


Saturday, September 20, 2014

The Magic Password Box

In a previous post I mentioned that an advantage of secret languages is that encryption happens within the human mind, beyond the reach of keyloggers, malware and packet sniffers.

The security of the human mind is also acknowledged in one of the most common exhortations regarding passwords: Don't write it down! Memorize it!

But memorizing passwords becomes increasingly difficult as we need more of them, and they must be more complex, and each must be different from the others. Many people have started to keep passwords in text documents, while more security-conscious people are starting to use applications like Password Safe. But what if someone gets access to your document, or there is an unrecognized vulnerability with your password storage application?

I'm in the same boat as everyone else. Once upon a time I used to generate passwords by looking around, concatenating two unrelated nouns representing things in my environment, and changing some of the letters to numbers and punctuation marks. Eventually I wrote an application to store my passwords in an encrypted text file, and that gave me the freedom to start generating passwords randomly.

Currently I probably only remember a tenth or fewer of my passwords. If my encrypted text file were lost, my passwords would be lost along with it.

Now I am experimenting with an idea that I call the Magic Password Box. The principle is relatively simple, but the effect on the security of my passwords is profound. Here is how it works:

  • I create one long password, over 100 alphabetic characters, in the form of a nonsense limerick
  • For each environment in which I need to use a password, I create a short mnemonic, like "gmail", "amazon" or "creditcard"
  • For passwords that do not require frequent updates, I compute the password as a function of SHA256(limerick + mnemonic)
  • For passwords that require updates every 90 days, I compute the password as a function of SHA256(limerick + mnemonic + quarter + year)
I'm already using a random number generator to create passwords, so replacing that with a hash isn't a huge change for me. The big change is this: I never need to store a password again, and all of my passwords can now rely on the security of my memory. If my laptop is struck by lightning, I can still get my passwords. (Perhaps I need a backup in case my brain fails to reproduce the limerick, though!)

There are some mundane considerations around how to write the code for the password calculator so that (for example) it won't leak my root password, and it can generate passwords that conform to different password policies. But there are also some interesting possibilities, such as having the calculator send the password to the system clipboard so I never even see it or type it, hiding it from prying eyes and keyloggers.
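A minimal sketch of the calculator might look like the following. The post only says the password is "a function of" the SHA256 hash, so the choice of that function here (URL-safe base64 of the digest, truncated) is an assumption, as are the name derive_password, the period string format, and the default length. A real version would also need to deal with site-specific password policies.

```python
import base64
import hashlib


def derive_password(limerick, mnemonic, period="", length=16):
    """Derive a password from the memorized limerick and a site mnemonic.

    `period` is empty for passwords that never rotate, or something like
    "Q3-2014" (hypothetical format) for passwords that rotate quarterly.
    ASSUMPTION: base64-encoding and truncating the digest stands in for
    the unspecified "function of SHA256(...)".
    """
    material = (limerick + mnemonic + period).encode("utf-8")
    digest = hashlib.sha256(material).digest()
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    return encoded[:length]


limerick = "there once was a nonsense limerick"  # placeholder, not a real secret
print(derive_password(limerick, "gmail"))
print(derive_password(limerick, "creditcard", period="Q3-2014"))
```

Because the output depends only on the inputs, the same limerick and mnemonic always yield the same password, and nothing needs to be stored.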

Thursday, September 18, 2014

Change of focus on the Rohonc project

The Rohonc transcription has reached a point where I think it is "good enough for now". I have identified about 93% of the glyphs (including 100% of glyphs in the first 50 pages), corrected some problems with the order of lines, and identified those lines that are damaged either at the head or the tail.

If a solution is possible, I doubt it would rely on the 7% of glyphs that I haven't identified yet. Now I'm going to change my focus to word breaks, since I think the data is good enough to apply the formula I mentioned in my last post for identifying word breaks.

In the meantime, I have started a few other projects, so hopefully I will soon be able to post about some other interesting stuff in addition to the Rohonc Codex.

Wednesday, August 20, 2014

Word breaks in the Rohonc Codex

I have developed some really cool tools that work on the word level of a text, and identify things like the level of semantic content in words (to differentiate function words from content words), semantic similarity, and so forth.

Unfortunately, I can't use those tools on the Rohonc Codex yet because the transcription is on the level of the glyph, not the word.

But the codex gives us a few tiny clues to word breaks, and I think I have figured out how to leverage those to split the text into words.

The main clue is the hyphen, which looks like a double line. The hyphen at the end of a line tells us where a word break probably is not, and the absence of a hyphen tells us where a word break probably is. This allows us to build a small table of likely word heads and tails, which might be used to identify word breaks (where a break may be between a likely tail and head).

As a proof-of-concept, I took an English text with 47,975 words, removed all of the spaces and punctuation, then divided the text into lines of approximately 80 characters, breaking lines at word boundaries. I then measured the frequency of every segment that started or ended a line of text, and wrote a scoring function that looks like this:


In this equation, b(x, y) is the "break score" of a point between strings x and y, the variable t_x represents the number of times that string x is found at the end (tail) of a line, and h_y is the number of times that string y is found at the head of a line.

If the break score crosses a certain threshold, then I insert a word break between strings x and y.

There are a few parameters that govern how you actually implement this. The first is the length of the strings x and y, and the second is the threshold to use. Experimenting with my English text, I found the optimal string length to be 3, and the optimal threshold to be 10. With these parameters, I was able to correctly identify 78% of word breaks in the English text, with only one incorrect word break for every 10 correct word breaks.
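The overall procedure can be sketched as follows. The scoring formula itself is not reproduced in the text above, so the break_score function below simply multiplies the tail count by the head count; that product is a stand-in assumption, not the original formula, and the threshold of 10 was tuned for the original formula rather than this one. The function names are likewise invented for the example.

```python
from collections import Counter


def line_edge_counts(lines, n=3):
    """Count the n-symbol strings that start (heads) and end (tails) lines.

    Lines are assumed to break at word boundaries; for the codex, lines
    ending in a hyphen would have to be left out of the tail counts.
    """
    heads, tails = Counter(), Counter()
    for line in lines:
        if len(line) >= n:
            heads[line[:n]] += 1
            tails[line[-n:]] += 1
    return heads, tails


def break_score(x, y, tails, heads):
    # ASSUMPTION: placeholder score t_x * h_y, not the original formula.
    return tails[x] * heads[y]


def insert_breaks(text, heads, tails, n=3, threshold=10):
    """Insert a space wherever the score between the n symbols before a
    point and the n symbols after it exceeds the threshold."""
    out = []
    for i, symbol in enumerate(text):
        if i >= n and i + n <= len(text):
            if break_score(text[i - n:i], text[i:i + n], tails, heads) > threshold:
                out.append(" ")
        out.append(symbol)
    return "".join(out)


# Toy data: ten short "lines" that each begin and end at a word boundary.
heads, tails = line_edge_counts(["thecat", "thedog"] * 5)
print(insert_breaks("thecatthedog", heads, tails))
```

The two parameters discussed above correspond to n (the string length) and threshold.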

Before I apply this to the Rohonc Codex, I think I will put some more effort into cleaning up the transcription. In order for the word-breaking algorithm to work efficiently, I will need to identify lines that are damaged and possibly missing glyphs on one or the other end.

In the meantime, I am wondering what it would look like if we apply this to the Voynich manuscript. It would be interesting if the apparent word breaks in the VM don't strongly correlate with the word breaks identified by this approach.

Tuesday, August 12, 2014

The mystery of the triple scriptural reference

I've added a new page to the sidebar of this blog, where I list 26 images that occur with the episodic formula, together with the book and chapter given in the accompanying formula. Perhaps among these 26 passages we can get some good cribs.

In matching up images to gospels, XDC.D makes the most sense if we read it as Matthew, HF.HS reads best as John, and CO.IH.D works best as Luke.

I feel fairly good about those three readings now, but page 170 of the codex presents us with an interesting mystery. We have an image accompanied by the episodic formula and three scriptural references:


The three references are XDC.D=Matthew chapter 5, CO.IH.D=Luke chapter 8, and CO.I.XCAB chapter 9.

Matthew 5:15 and Luke 8:16 share in common the Parable of the Lamp under a Bushel. In Matthew, this ends with the exhortation, "Let your light so shine before men, that they may see your good works and give glory to your Father who is in heaven." I argue that the picture shows Jesus carrying a light, and the dots represent the illumination of the light.

But then what is CO.I.XCAB chapter 9? It wouldn't be Mark, because this parable is told in the 4th chapter of Mark, not the 9th. (In fact, Mark is strangely absent as a source in the episodic formulae).

Compounding the mystery is that the name of the third source begins with the same symbol as the name of Luke, and I can't find any other evangelists (canonical or otherwise) whose names start with L.

Thursday, August 7, 2014

VSO word order and the word "and"

From the text on page 23, it's pretty clear that the IX glyph represents the number 11. (There is a repeated formula starting with the number 4 and running through, and where 11 should occur, there is the glyph IX).

However, there are also numerous cases where the number IX doesn't make sense as 11. For example, in the possible crib for forty days and forty nights, we have IX before the number forty, which led me to suggest that IX might stand for a preposition.

As I was hunting for saints' names, I noticed the caption for the picture on page 89:

Caption from page 89: HK CUTB1C A CO D IX H HH

In this caption, I have already suggested CUTB1C = Christ, and A CO D is apparently a saint based on the common occurrence of RT A CO D. That leaves the sequence of IX H HH, of which H HH is a sequence that commonly occurs independent of IX.

To me, the placement of CUTB1C over the head of Christ (identified by the striped turban) and A CO D and H HH over the other two heads suggests that the latter are the names of the other two people in the picture.

In that case, reading IX as "and" could make some sense, both here and in "40 days and 40 nights".

That leaves HK, which could be a transitive verb for which Christ is the subject and the other two people are the object, indicating a VSO word order. The verb could be something like "appears to", giving a caption like Christ appears to X and Y.

We have HK also in the following caption on page 127:


Caption from page 127: HK QX XD CQ B1CU // RT CO C IX C ADD

The picture shows an angel making generally the same gesture as the Christ from page 89, but this time towards a figure on a bed. The caption begins with HK and ends with the name of a saint, CO C IX C ADD, who is presumably the figure on the bed.

Looking at the Latin and Greek texts of Matthew 2:19, where the angel of the Lord appears to Saint Joseph, suggests another possible reading for HK as "behold":

Latin: ecce apparuit angelus Domini in somnis Ioseph
Greek:  ιδου αγγελος κυριου κατ οναρ φαινεται τω ιωσηφ
English: Behold, the angel of the Lord appeared to Joseph in a dream

The name of the saint contains IX, but it is not clear whether this is morphemic or phonemic. Certainly RT CO C never appears without IX C ADD, and certainly only one individual is depicted on the bed, so presumably IX is part of the saint's name.

If we knew the name of the saint, we might get the name of the angel, and thereby get a crib. Delia Huegel notes of this image that, if we knew the text, we might get the name of the angel and the figure on the bed. Therein lies the trick.

Wednesday, August 6, 2014

Saints' Names

Doing a quick pass through the current (imperfect) transcription, I come up with the following rough list of 26 strings that follow the glyph RT (which I think may mean saint).

The evangelists are identified by the fact that they occur in the episodic formula. It is interesting that the most common possible saint name, XI D, is not apparently an evangelist (though it may not refer to a saint at all).

Two of the possible evangelists have names starting CO, which suggests Mark and Matthew. If this is phonetic, then I wonder if the Holy Noun is a nomen sacrum for Maria Mater [Dei].

On the other hand, a phonetic reading of CO=ma militates against the idea that CO, CX and C are variants of the same glyph, since the apparent name for Thomas begins with CX.

Possible Saint           Frequency  Notes
XI D                     47
HF HS                    29         John the Evangelist
CO IH D                  24         Luke the Evangelist
XDC D                    13         Matthew the Evangelist
A CO D                   11
CO CE XC                 11
XDGBA E XDGB E           10
D                        8
CUN1XX                   7
U                        6
CO C IX [C] ADD          5
CX I CX [I] or CX I QX   5
CO I [I] WD              4          Author of a scripture with at least 15 chapters, perhaps as many as 35 chapters.
CX F O                   4          Thomas the Apostle
XWX                      3
N1CO                     3
KBA C1Q D                3
CUNSA I IX CUNSAR I I    3          possibly something like I Peter and II Peter
CUNSAR I I               2          possibly something like II Peter
V I V IX                 2
Q C1A                    2
A C IT                   1
C C C D                  1
C D K                    1
C T D                    1
E XMA                    1
CO I XCAB                1          Author of a scripture with at least nine chapters.

Tuesday, August 5, 2014

Numerical Oddities and Saint Thomas

The Rohonc numerals are more complicated than they seem at first.

The small numbers seem to be relatively clear, if a little strange. Page 23 gives us the numbers 4 through 11, and by hunting around throughout the text we can fill in the numbers going back to 1:

1: Q1C (top of tablet I, page 14)
2: I I
3: I I I
4: I I I I
5: I I I I I
6: CY
7: CY I or I I I I I I I
8: CY I I or I I I I I I I I
9: LT
10: T
11: IX

This system makes a certain kind of sense, but then what are we to make of numbers like the following?

163.L9: I I I I I I I T IX CY CY
21.L4: IX I I I I I I T
59.R10: I I T T I I I I I
59.R11: I I I T T T CY

The only sense I can make of these is that I I I I I I I T, for example, is meant to be read 70 (i.e. seven tens). In the case of 59.R10-11, I am inclined to think that the I I T T and I I I T T T are intended to write 20 and 30, with the T T and T T T being redundant, since I can't find any case of T T or T T T without a prefix of I I or I I I.
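To make the proposed reading of the tens concrete, here is a small sketch. It covers only the hypothesis stated in these posts: a run of I glyphs followed by one or more T glyphs is read as that many tens (extra T glyphs being redundant), and a run of T glyphs with no prefix is read as one ten per T, as in T T T T for forty. The function name read_tens is invented, and the sketch makes no attempt to handle units or mixed sequences such as IX or CY.

```python
def read_tens(glyphs):
    """Read a glyph string such as "I I T T" as a multiple of ten.

    ASSUMPTION: follows the tentative reading in the post; this is not an
    established decipherment of Rohonc numerals.
    """
    tokens = glyphs.split()
    ones = 0
    while ones < len(tokens) and tokens[ones] == "I":
        ones += 1
    tens = tokens[ones:]
    if not tens or any(tok != "T" for tok in tens):
        raise ValueError(f"not a plain I.../T... sequence: {glyphs!r}")
    # With an I prefix the T glyphs only mark "tens"; without one, count them.
    multiplier = ones if ones else len(tens)
    return multiplier * 10


for example in ["I I I I I I I T", "I I T T", "I I I T T T", "T T T T"]:
    print(example, "=>", read_tens(example))
```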

This reading would solve the mystery of John 22, where the picture apparently of Doubting Thomas on page 64 is accompanied by an apparent scriptural reference to chapter I I T T in the book of HF HS. Since the story of Doubting Thomas appears in chapter 20 of the book of John, reading I I T T as 20 saves us having to resort to Byzantine kephalaia to explain this reference.

This would mean HF HS refers to John. If Delia Huegel is right, and the image on page 64 depicts Doubting Thomas, then we might assume Thomas would be mentioned in the text accompanying the image. Assuming RT means saint, the most likely candidate for Saint Thomas is the name mentioned twice in this section: RT CX F O.

64.R8 *Saint Thomas

64.R10 *Saint Thomas

It is interesting that this name contains a glyph that looks like a rotated T, followed by a glyph that looks like an O, as though it contained the abbreviation TO for Thomas.

Note that the word I am tentatively reading as night(s) also contains the O glyph, and that the word for night contains a rounded vowel in a wide variety of Romance languages.

Tuesday, July 29, 2014

A Possible Crib: "Forty days and forty nights"

There are two similar passages in the Rohonc codex where the number 40 is written twice in succession, as in the following case on page 5 line R11:

B1CU CURE B1CU CO D CO D IX T T T T D T T T T K1OA A D CX

And also the following, starting on page 120 L9 and ending on 121 R1:


B1CU IX XB B1CU C D CO D IX T T T T D
IX T T T T X2 K O A D CX CURJX XB B1CU C XB B1CU

The two passages are nearly parallel, but not entirely so. In the first one, the glyphs K and O were written so closely together that my transcription code took them to be a ligature, K1OA. In the second, they are clearly written separately. The second version also repeats IX before the second instance of the number 40, and appears to have a double dot after the last T of the sequence representing 40.

I googled the phrase "quadraginta * quadraginta", to get a rough idea of cases in Latin texts where the number 40 is repeated twice in close succession. (I used Latin in order to select texts in the right semantic domain and era, not because I have decided the language of this text is Latin.) As I had guessed, the most common phrase was quadraginta diebus et quadraginta noctibus, "for forty days and forty nights". This is the period of time for which it rained in Noah's flood, and the period of time for which Jesus fasted in the desert.

If this is a crib, then I suggest the following:

IX: a preposition like "for". Its second appearance on page 121 is completely natural, making only the difference between "for forty days and forty nights" and "for forty days and for forty nights".

T T T T: the number "forty"

D: "days" (or an abbreviation therefor)

K O A: "night" (perhaps plural, perhaps inflected)

It is possible that the word "nights" should include all of the glyphs K O A D CX, but I propose minimally K O A on the basis of page 13 L2:


XU CC D C XVA CV QO ? ? ? XDAS N IX I XVOA N

The end of this line contains the sequence IX I XVOA, where XVOA (a relatively common glyph) looks very much like a ligature of K O A. If so, then this sequence could read "for one night".

Some languages use the singular noun after a number greater than one, while others use the plural. It is possible that D CX contains a plural marker, but an analysis of numbers throughout the text should be done before we say that.

The XVOA glyph appears 65 times in my current transcription, overwhelmingly in the sequence C XVOA C. It may be fruitful to hunt for languages where a relatively common word contains within it the sounds of the word for "night".

Tachygraphic systems

I've been reading articles on Tironian notation and what is known about Greek tachygraphy. The most careful discussion of the topic I've come across yet is the article by F. W. G. Foat, "On Old Greek Tachygraphy", in the 1901 Journal of Hellenic Studies.

Foat points out that different shorthand systems are designed to accomplish different goals: tachygraphic systems are designed for quick writing; stenographic systems are meant to preserve space. Either type of system may emphasize clarity or secrecy to some degree.

Tironian notation is simply amazing. Tironian notae encode a modicum of phonetic information--as much as is needed to distinguish a less common word from a more common one--but I would guess that the average nota encodes less phonetic information than the average Chinese character, arguably making the notae tironianae more ideographic than Chinese.

Many of the classical tachygraphic systems seem to encode syllables. In my last post I had said that there were too many symbols in the Rohonc script for it to be a syllabary, but I was thinking of syllabaries that are based on (C)V syllables. For comparison with Latin, where syllables are more complex, I took the book of Genesis from the Vulgata, divided all of the words syllabically, and counted the unique syllables. There were 1139--roughly the same as the number of unique Rohonc symbols. It is not impossible that the Rohonc script could be (partly) syllabic.

Thursday, July 24, 2014

Comparison with Tironian notes

The Rohonc script appears to have on the order of one thousand glyphs. This is far more than one would expect in an alphabet or syllabary, but the most common glyphs are too common to represent morphemes.

It seems there is a relatively small number of core glyphs, some apparent ideograms, and several strategies to extend basic glyphs into more complex ones. These strategies include the addition of new lines and dots (C -> CE); the rotation or reversal of existing glyphs (C -> Q, XD -> XDA); and the use of ligatures (B + CU -> B1CU).

Similar strategies are used in some abugida systems like Ge'ez and Kharosthi, but the historical and geographic context of the Rohonc codex excludes any connection to these systems and their relatives.

I am not aware of any abugida system used in Europe around the 16th century. However, there was a system of scribal shorthand in use up until the 16th century called Tironian notes. Numerous extensions of this system were apparently developed with 1100, 4000, 5000 and 14000 notes.

Like the Rohonc script, Tironian notation has a relatively small number of core marks, extended to more complex marks using similar strategies to the Rohonc script. Indeed, an astonishing number of Tironian notae are similar to or even identical to Rohonc glyphs.

That is not to say that the two systems are the same. The most obvious difference is that the Rohonc script is written right-to-left. In addition, I can't find Tironian equivalents for some of the most common Rohonc glyphs, and vice-versa.

However, among the writing systems that might have influenced the Rohonc script, Tironian notation has many features that make it a good candidate for further investigation.

Parallel Passages

There are a number of parallel passages in the Rohonc Codex. I came across one last night, and decided to compare the two versions of the passage side-by-side. I was able to identify several glyphs that my current transcription treats as different which should apparently be the same.

I also found an instance of alternate spelling, which may eventually provide some insight into the phonology of some of these glyphs. Compare the following from page 1 R4:

D CX D IX W CO D C3Q L C I CX D CX D C1FR CURE KB

and the following overlapping passage from page 124 R6:

C3Q L I1G CX D CX D C1FR XB KBAD O I CX CUNW

It appears that the word written L C I in the first line is written L I1G in the second line. It is possible that the two words are synonyms, or that the two glyph sequences are similar in sound. Perhaps I1G represents a palatalized form of C in the presence of I (or something like that).
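A side-by-side comparison like this can be automated. The following is a minimal sketch, assuming each passage is a space-separated string of glyph names; it uses Python's standard difflib module, and the function name `compare_passages` is hypothetical. It is not the tool that produced the comparison above.

```python
from difflib import SequenceMatcher

def compare_passages(a, b):
    """Return (shared runs, variant pairs) for two glyph sequences."""
    first, second = a.split(), b.split()
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    shared, variants = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Glyph runs that appear identically in both passages.
            shared.append(first[i1:i2])
        elif tag == "replace":
            # Places where one passage has different glyphs from the other.
            variants.append((first[i1:i2], second[j1:j2]))
    return shared, variants

page_1 = "D CX D IX W CO D C3Q L C I CX D CX D C1FR CURE KB"
page_124 = "C3Q L I1G CX D CX D C1FR XB KBAD O I CX CUNW"

shared, variants = compare_passages(page_1, page_124)
for run in shared:
    print("shared: ", " ".join(run))
for left, right in variants:
    print("variant:", " ".join(left), "<->", " ".join(right))
```

The variant pairs are candidates for alternate spellings or transcription inconsistencies; each one still has to be checked against the page images by eye.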

Tuesday, July 22, 2014

Zipf's Law in the Rohonc Codex

I've added a column showing frequencies to the catalog of glyphs that accompanies my in-process transcription of the Rohonc codex.

Of course, the first thing one wants to do with glyph frequencies is to see if Rohoncian obeys Zipf's Law. At first blush, it would seem not, because we have the following distribution for the top ten glyphs:


Glyph   Frequency   Frequency * Rank
C       4952        4952
I       4902        9804
D       3816        11448
CO      2983        11932
N       2588        12940
O       2572        15432
H       1657        11599
IX      1538        12304
CX      1402        12618
CX1Q    899         8990

However, this distribution supports something I have suspected already: CO, C and CX are probably the same glyph. I separated them in my transcription because I decided it would be easier to merge glyphs later. But, I suspected that they might be the same because they are apparently interchangeable in the Holy Noun.

If CO, C and CX are merged, then the distribution appears as follows:

Glyph       Frequency   Frequency * Rank
C, CO, CX   9337        9337
I           4902        9804
D           3816        11448
N           2588        10352
O           2572        12860
H           1657        9942
IX          1538        10766
CX1Q        899         7192

It's still not perfect, but it is much closer to a normal Zipf distribution.
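The arithmetic behind these tables is easy to reproduce. Here is a minimal sketch using the ten frequencies listed above; the helper names `zipf_table` and `merge_glyphs` are made up for this example. Under an ideal Zipf distribution, frequency times rank would be roughly constant.

```python
# Frequencies of the ten most common glyphs, as listed in the first table.
counts = {
    "C": 4952, "I": 4902, "D": 3816, "CO": 2983, "N": 2588,
    "O": 2572, "H": 1657, "IX": 1538, "CX": 1402, "CX1Q": 899,
}

def zipf_table(freqs):
    """Rank glyphs by frequency and return (glyph, frequency, frequency * rank)."""
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return [(glyph, freq, freq * rank)
            for rank, (glyph, freq) in enumerate(ranked, start=1)]

def merge_glyphs(freqs, group, label):
    """Treat every glyph in `group` as one glyph called `label`."""
    merged = {g: f for g, f in freqs.items() if g not in group}
    merged[label] = sum(freqs[g] for g in group)
    return merged

for row in zipf_table(merge_glyphs(counts, {"C", "CO", "CX"}, "C, CO, CX")):
    print("%-10s %6d %8d" % row)
```

Merging is kept as a separate step so that other candidate groupings can be tried without touching the underlying counts.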

Monday, July 21, 2014

Rohonc Transcription online (as-is)

My transcription is still only 90% complete. I estimate about 20 more hours of work would be required to finish the remaining 10%, but it is hard for me to find that kind of time.

So rather than let the perfect be the enemy of the good, I have put the transcription online as-is, with the hope that I can continue to refine and update it as I go along.

I have thrown together a website: http://quint.us/Roho. This site is ugly as sin, because I wrote it with an emphasis on function over form. There may well be bugs, but hopefully they are rare. The site provides four basic pieces of functionality:

Download: Download my current revision of the transcription.

Search: Search the transcription.

Browse: Browse the page images and transcription.

Glyphs: View the catalog of glyphs.

As I refine the transcription system and the transcription, I will also work on the site to make it less ugly and more functional. Every time I update the transcription, I will increment the minor revision number. Every time I update the transcription system, I will increment the major revision number.

My system of transcription represents glyphs as strings of capital letters and numbers. The primary goal of the transcription system was to uniquely identify apparently unique glyphs. The secondary goal was to make transcriptions for similar glyphs similar in form. (For example, glyphs whose transcriptions begin with C have shapes that begin with the same semicircular stroke).

When I am less tired, I'll write up a better description of the transcription system.

Happy hunting. I welcome any kind of feedback on the site. Please leave a comment or email me at rst140720@quint.us if you have a suggestion.

Monday, May 12, 2014

Rohonc Transcription 90% done

The last few weeks have been busy for me, both at work and at home. But in my spare time I have managed to push my computer-assisted transcription of the Rohonc codex to 90% completion.

There will be a lot of manual work ahead for the remaining 10%. There are a lot of oddball graphemes, smudges, damaged lines and so forth to dig through. When I am done, I will go back and revise my transcription system, because I have noticed some opportunities for improvement.

Luckily everything is in a huge...thing...that is like a database, but built specifically for this task. When I need to change the transcription, I can do so with a minimum of fuss.

Saturday, May 3, 2014

Meanwhile, in another universe...

I'm still working on the Rohonc transcription, but I thought I would post something amusing and light-hearted for a change.

I've always been bothered by the fact that time seems to only go in one direction, and to be orthogonal to the three spatial dimensions. Somehow it seems...arbitrary.

So imagine a universe where it worked differently: Imagine a four-dimensional universe where time moves outward from a central point, which I'll call the Origin. So, instead of the spatial universe being a three-dimensional slice moving through a four-dimensional space-time, instead it is more like the surface of an expanding hypersphere.

How would light move in a universe like this? If we require that the speed of light be constant in this universe, then the path of light must always be at a constant angle of deflection from a line radiating from the Origin. If the speed of light and the passage of time are constant, then light spirals away from the Origin, always bending at a constant angle of deflection.

What would it be like inside this universe? First, on a small scale, time would appear to be linear, the same way that the Earth appears to be flat, and gravity appears to go in only one direction. Second, on a larger scale, the universe would constantly be expanding.

Now, suppose light is deflected as a result of some influence exerted by the Origin, and that influence decreases the further from the Origin we get. (Maybe inversely proportional to the cube of the distance from the Origin). Since we require that the speed of light be constant, the actual rate of passage of time relative to distance from the origin decreases the farther out we get. Since the size of the universe is proportional to the cube of the distance from the Origin, but time passes increasingly slowly, we would perceive this as an accelerated expansion of the universe.

Chores call, so that's the end of this post.

Monday, April 21, 2014

Rohoncian looks like a case-marking language with prepositions

The computer-assisted transcription of the Rohonc codex is about 61% done. So far, I'm resisting the urge to do any kind of statistical analysis because the results will almost certainly be skewed by the fact that my glyph recognition algorithm has a harder time with some glyphs than others.

However, I came across something today that I thought was interesting. Throughout the codex, there is a sequence of four glyphs that is usually written with a halo over it, as follows:

Co D Co D

Starting from this post, I'll use my provisional transcription system, so I don't have to talk about "the glyph that looks like a triangle", and so forth.  So this sequence is transcribed as Co D Co D.

I think the odds are good that this is a noun, or a proper noun, but of course it is not clear what it is. It could be an epithet of God, the Holy Spirit, or even an abstract noun like Grace, but for the sake of convenience I'll call it "the holy noun". At first blush, it looks like a reduplicated stem, but there are a few cases where it appears to be inflected.

In an earlier post, I suggested that the simple line bending to the left was a preposition like "on". I now transcribe this glyph as L (for "left"), and in the following sequence you can see two cases where the holy noun is preceded by L, and at the same time the ending of the sequence changes:

L C D C I Ix Hk C D C I

It may or may not be important, but in this case the glyph Co is replaced by C. But since the only difference between the two is a small loop at the top, it is possible that they are allographs of the same grapheme.

More important is the suffix. It looks like the (tentatively) nominative suffix D becomes I when the noun is prefixed by the preposition L.

Compare also the following:

O Co D C D C

It seems fairly clear that this is the same holy noun (marked with a halo, as always), but it has an added C suffix after the D. This may or may not be related to the O that precedes the whole word.

If these sequences really do show prepositions and case-marking, then it narrows the field of candidate languages quite a bit. We would be looking for a language with prepositions and at least three cases. Here is the rough paradigm of a noun in -D:

Possible Nominative: stem + D
Oblique A: stem + I
Oblique B: stem + D C

Thursday, April 10, 2014

Dividing work between man and machine

This project to write a text recognizer for the Rohonc codex has been really rewarding. The quality of the text is so poor that the solutions have to be really clever, and that is what makes it so fun. I've learned a massive amount about image manipulation and text recognition, and I have a whole slew of projects I want to undertake when I'm done with this one.

I've probably spent four hours training my glyph recognition algorithm, and I think it probably identifies glyphs correctly about 80-90% of the time. Right now, I can process a single line of text in under a second, but it takes me about 15-30 seconds to manually verify the transcription and fix any errors that crop up. That seems pretty fast, but when you multiply it out by 4285 lines, it comes to about 25 hours of manual work. I need to pare that down, because it'll take me forever to scrape together 25 hours of free time.
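As a quick check on that estimate, here is the multiplication spelled out, using only the figures above (4285 lines, 15 to 30 seconds of manual checking per line). The function name `review_hours` is just for illustration.

```python
LINES = 4285  # number of text lines to verify, from the paragraph above

def review_hours(seconds_per_line, lines=LINES):
    """Hours of manual checking at a given pace per line."""
    return seconds_per_line * lines / 3600

low, high = review_hours(15), review_hours(30)
print(f"{low:.1f} to {high:.1f} hours")
```

The range works out to roughly 18 to 36 hours, so "about 25 hours" corresponds to a pace of around 21 seconds per line.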

A lot of this project has involved dividing labor between me and the machine, making the most of what the machine can do without my intervention, and making the best use of my feedback on good and bad matches. The code has been very fluid but very stable, basically organized around building a powerful set of core functionality, but using the simplest and most ergonomic user interface for each task.

Friday, March 21, 2014

Starting the final phase of Rohonc transcription (I think)

I think I've finally settled on a workable process for machine-assisted transcription of the Rohonc Codex.

I tried several approaches before landing on the current one. One approach was to analyze each page in a top-down way, first identifying the areas that contained text, then splitting those into lines, and splitting the lines into glyphs. The other approach was bottom-up: First, identify glyphs, then identify lines.

No single automated process was able to correctly split the pages up 100% of the time, so I have adopted a top-down automated approach with manual overrides. I have now broken all of the pages down to the line level, and I have some code that picks out glyphs from a line with great accuracy.
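To give a flavor of the top-down step, here is a minimal sketch of one common way to split a page into lines: look for bands of rows that contain ink, separated by blank rows. This is an assumed stand-in, not the code actually used; it assumes the page is already a binary bitmap (a list of rows, 1 for dark), and the name `split_into_lines` and the `min_dark` threshold are hypothetical.

```python
def split_into_lines(bitmap, min_dark=1):
    """Return (top, bottom) row ranges for each band of rows containing ink.

    `bottom` is exclusive, so bitmap[top:bottom] is the line image.
    """
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = sum(row) >= min_dark
        if has_ink and start is None:
            start = y                     # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))      # the current line has ended
            start = None
    if start is not None:
        lines.append((start, len(bitmap)))
    return lines

page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
]
print(split_into_lines(page))
```

On skewed or irregular hand-written lines a simple row test like this will merge or split lines incorrectly, which is exactly where manual overrides come in.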

Now comes the fun part: writing (and training) the glyph-recognition algorithm.

I've decided to use the "mark" as my fundamental unit of text. A mark is a single, contiguous, dark shape on the page within the bounds of a text line. Many Rohonc glyphs consist of a single mark, but many consist of a core mark with one or more satellite marks. Most satellite marks are single dots above or to the left of the core mark, but some marks are dashes, and some are haloes that surround a core mark.
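Finding marks in this sense amounts to finding connected dark regions. A minimal sketch, assuming the text line is a binary bitmap (a list of rows, 1 for dark) and that diagonal neighbors count as touching; the function name `find_marks` is made up for this example.

```python
from collections import deque

def find_marks(bitmap):
    """Return a list of marks, each a set of (row, col) dark points."""
    height = len(bitmap)
    width = len(bitmap[0]) if bitmap else 0
    seen = set()
    marks = []
    for r in range(height):
        for c in range(width):
            if not bitmap[r][c] or (r, c) in seen:
                continue
            # Flood-fill outward from this dark point to collect one mark.
            mark = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                mark.add((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            marks.append(mark)
    return marks

line = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
print([len(mark) for mark in find_marks(line)])
```

In the example, the isolated point in the first row would be a satellite dot, and the three-point shape a core mark.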

My glyph-recognition algorithm will start out by finding the best match between the glyphs on a new line and any that have been previously identified. This will be followed by a manual intervention step where I can correct any incorrect automated matches, or reject a mark as being non-text. When the line has been completely treated, constellations of marks will be matched to known glyphs.

I have wrestled with several different approaches to matching marks. One approach is to simply overlay one mark upon another and determine the total number of dark points that are different, and calculate the ratio between that number and the full number of points.
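The overlay comparison can be sketched in a few lines. This version assumes both marks have already been scaled to bitmaps of the same size, and it reads "the full number of points" as every point in the bitmap, which is an interpretation rather than something stated above. The name `mismatch_ratio` is hypothetical.

```python
def mismatch_ratio(a, b):
    """Fraction of points at which two same-sized bitmaps disagree."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    total = sum(len(row) for row in a)
    differing = sum(pa != pb
                    for ra, rb in zip(a, b)
                    for pa, pb in zip(ra, rb))
    return differing / total

mark_a = [[1, 0],
          [1, 1]]
mark_b = [[1, 1],
          [0, 1]]
print(mismatch_ratio(mark_a, mark_b))
```

A ratio of 0 means the two marks overlay perfectly, and larger values mean a worse match.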

Another approach that I am toying with is to use a two-dimensional version of Levenshtein distance. One way to do that would be to treat each row and column of the mark bitmaps as a string, calculate the individual Levenshtein distances, and sum them up to come up with a total distance.
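A minimal sketch of that row-and-column idea, assuming both bitmaps have the same number of rows and columns so that rows and columns can be paired off one-to-one. The names `levenshtein` and `grid_distance` are made up for this example.

```python
def levenshtein(s, t):
    """Classic edit distance between two sequences."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def grid_distance(a, b):
    """Sum of row-wise and column-wise edit distances of two bitmaps."""
    rows = sum(levenshtein(ra, rb) for ra, rb in zip(a, b))
    cols = sum(levenshtein(ca, cb) for ca, cb in zip(zip(*a), zip(*b)))
    return rows + cols

mark_a = [[1, 1, 0],
          [0, 1, 0]]
mark_b = [[0, 1, 1],
          [0, 1, 0]]
print(grid_distance(mark_a, mark_b))
```

Unlike a plain overlay, edit distance lets a stroke slide by a point or two within a row or column at the cost of one insertion and one deletion, rather than counting as a difference at every point the stroke touches.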

But some calibration would be needed to make an apples-to-apples comparison between different match scores.

Sunday, March 9, 2014

Interesting features of Rohonc script and character recognition

I've been working on code that can scan the images of the Rohonc Codex and help me transcribe it. Hopefully I will be able to complete the transcription relatively quickly with the assistance of some code that can recognize and categorize graphemes (and remember what wacky name I decided to give each character).

In the process, I have unearthed a wealth of interesting detail and challenges.

Regarding the grapheme recognition process, the challenges are many. The script is hand-written, the lines are irregular, and the scanned pages are not necessarily orthogonal to the images. My approach is to identify individual marks, place them in a network together with other similar marks, and differentiate them based on the local density of the area of the network in which they appear. Then, I think I can recognize constellations of marks as graphemes, and start training the program to do the transcription.

It is clear to me at this point that this is going to be more of a computer-assisted transcription project than a pure computer transcription, but even so the work should go much more quickly with the aid of a machine whose eyes never tire.

One of the challenges I have had to overcome is distinguishing between stray dots on the page and the dots that are intended to be part of a grapheme. Unless I am mistaken, it appears that the dots that accompany a grapheme always appear above or to the left of the main shape of the grapheme. I suspect this is related to the right-to-left direction of the text.

In categorizing the graphemes, I am running into a problem I have wrestled with for years, ever since I first started thinking about ways to automate the recognition of patterns. I call it the "cloud-within-the-network" problem, and I need to find out what the proper answer to it is.

The "cloud-within-the-network" problem works like this: Suppose you have some dense networks, and you loosely connect them to each other in a larger network. How do you computationally recognize the existence of the dense networks within the larger loose network?

It seems like it should be relatively simple, but every solution I think up seems to have a problem with it. In the case of this transcription project, I have a workaround, but some day I would like to find the right solution.
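For illustration only, here is a sketch of one simple heuristic for this kind of problem; it is not the workaround mentioned above, and it is not offered as the right answer. It assumes that two nodes in the same dense cloud share most of their neighbors, while the loose links between clouds join nodes with few neighbors in common. Edges with low neighborhood overlap are dropped, and whatever stays connected is reported as a cloud. The name `dense_clusters` and the `min_overlap` threshold are hypothetical.

```python
from collections import defaultdict

def dense_clusters(edges, min_overlap=0.3):
    """Group nodes by dropping edges whose endpoints share few neighbors."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Keep only the edges whose endpoints have enough neighbors in common.
    strong = defaultdict(set)
    for u, v in edges:
        shared = neighbors[u] & neighbors[v]
        others = (neighbors[u] | neighbors[v]) - {u, v}
        overlap = len(shared) / len(others) if others else 0.0
        if overlap >= min_overlap:
            strong[u].add(v)
            strong[v].add(u)

    # Connected components over the remaining strong edges.
    clusters, seen = [], set()
    for node in neighbors:
        if node in seen:
            continue
        stack, cluster = [node], set()
        seen.add(node)
        while stack:
            current = stack.pop()
            cluster.add(current)
            for other in strong[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(cluster)
    return clusters

edges = [("a", "b"), ("b", "c"), ("a", "c"),   # one dense triangle
         ("d", "e"), ("e", "f"), ("d", "f"),   # another dense triangle
         ("c", "d")]                           # a loose link between them
print(dense_clusters(edges))
```

The weak point is the threshold: too low and loosely linked clouds merge, too high and a genuine cloud falls apart, so this kind of approach has its own problems.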

Monday, March 3, 2014

Rohonc Transcriber (stage 1)

In a recent post, I said I would write a program to transcribe the Rohonc Codex.

Tonight I did the first part. I wrote some code to identify lines of text and graphemes. The image below shows a page of text, with the first-pass graphemes marked by green rectangles.


This is just a first pass. Some of these rectangles enclose multiple graphemes, and they will need to be split apart.

Next, I think I'll build a database of all of the grapheme images, then compare them to each other to identify image families.

Sunday, March 2, 2014

One possible explanation for the reference to "John 22"

In a recent post, I noted a problem with an apparent scriptural reference to the book of John chapter 22. The book of John only has 21 chapters according to the modern system of division, and in the Byzantine manuscripts that I have been able to find, there are only 18 or 19 kephalaia.

However, in the section of the RC in question, the number 22 is written with the number 2 followed by the number 20, as though it were the Roman numeral IIXX.  I had thought this meant "two and twenty", but another explanation could be that this is a language where the number 18 is expressed as "two from twenty".

The only language I can think of at the moment where this is done is Latin: duodēvīgintī, but there may be others. Indeed, it turns out the Romans sometimes wrote 18 as IIXX, which lends a little weight to this hypothesis.

For this explanation to work, we would have to assume that the references are to a copy of the New Testament that is divided according to an earlier system like the Byzantine kephalaia.

If so, then this passage may offer us a good crib. I may look into that this evening.

Transcription of the Rohonc Codex

It's time for a free and open transcription of the Rohonc Codex.

Unfortunately, I don't have the time to do it by hand, and we do not yet have a universal crowd-sourcing platform where I can set it up. So I'll write a program to scan the images and do an initial pass at automating the transcription.

When it is done, I will release it for free.

Edit: I did get a long way on the transcription, but life intervened so it is not yet complete. The latest revision of the transcription is here: http://quint.us/Roho/.

Friday, February 28, 2014

Chapter divisions, and questions about a solution

Last night, I went to bed wondering about a problem in one of the images in my last post. We have (apparently) a picture of Doubting Thomas, but the accompanying scriptural reference is (tentatively) John 22:


The problem is that the story of Doubting Thomas occurs in John 20, not 22. In fact, John only has 21 chapters.

I started to hunt about for some map that would show the relationship between modern chapters, Byzantine kephalaia, and whatever other system of capitulation might be out there, but no luck. As far as I can tell from looking at manuscripts, the Byzantine book of John was divided into 19 kephalaia.

One thing led to another, and the other thing led me to the Wikipedia article on the Rohoncz codex, where I saw that many of the things I've been writing about in this series of posts have apparently been discussed by several Hungarian researchers since 2010, notably Tokai, Király and Láng.

This is great progress.  When I first looked at the RC, the predominant theories regarding it were highly implausible. I created a Yahoo group in June 2005 and posted some ideas about numbers and episodes, but became frustrated with the fact that I couldn't reliably embed images in messages. I posted sporadically after that time, but eventually put it aside, to return to it only this year.

If this has already been solved (as the Wikipedia article suggests), then I think I need to go find another mystery to work on.

Luckily I have a stack of them.

However, the Wikipedia article suggests a couple of things about the solution with which I disagree, so I will consider the RC to be partially solved until I see the full solution.

Thursday, February 27, 2014

Scriptural references in the Rohonc Codex

In my last post, I briefly mentioned a formula used to introduce chapters or episodes. In this post, I will explore the idea that part of that formula contains a scriptural reference to the source of the episode.

The following image shows the basic layout of the episodic formula:


the episodic formula

The text in red is boilerplate, generally found in most instances of the episodic formula. The text in blue represents a small set (three?) of possible non-numeric values, and the text in green is a number.

I propose that these formulae contain a scriptural reference, with the text in blue being the name of a book (usually one of the gospels) and the numbers in green being a specific chapter of the book.

Delia Huegel identifies the following as a depiction of Doubting Thomas:


The episodic formula that accompanies this picture contains the number 22. Interestingly, it looks like it is meant to be read "two and twenty", since the lower-order 2 comes before the higher-order 20. However, there are a number of strange things that happen with the numbers in these formulae, so they will definitely bear further examination.

The three main (or perhaps only) "books" mentioned in the episodic formulae are these:




Note that each of these begins with a crossed character, like the character for "nine" in reverse. Following my theory that the crossed line indicates a ligature with t, I suggest that this character represents some cognate of the word saint, which is common (I think) to all of the candidate languages.

If these are the names of three of the gospels, one possibility would be that the last two are Luke and Mark (in some order) because they both end with the same triangle character, and the first is John, because it does not share an initial with Mark (and therefore is not Matthew).

If so, then I might need to scrap the theory I put forward in my last post suggesting that the triangle and circle represented the word for "day".

More bits from the Rohonc Codex

In this post, I'll propose a few more scattered readings for Rohonc words.

The following depiction of the crucifixion is accompanied by text in which there are two characters that look like the cross itself. I have highlighted one of them, which is accompanied by a prefixed character that curves over it:


In this context, it might make sense to say that the cross-shaped character is an ideogram for the cross itself, and the curved character is the preposition "on".

Note that the preposition "on" looks like an uncrossed version of the number nine. One possible way to interpret this is to say that the preposition here is Albanian në, "on", and by crossing the line it is changed to nëntë, "nine". For this gloss, and others in this post, I'll indicate how the phrase would read in Modern Albanian, Romanian, Hungarian and Croatian, for comparison.

A: në kryq
R: pe cruce
H: a kereszten
C: na križu
on the cross

We also find the "on" character in the text accompanying a depiction of the resurrection. (Credit for identifying this scene goes to Delia Huegel, who also identified many other images in the codex).


In this case, the "on" is followed by the number "three". Since we know that Jesus rose on the third day, it would make sense that this could mean "on the third day" or "in three days":

A: në tre ditë
R: în trei zile
H: három nap alatt
C: u tri dana
in three days

A: në ditën e tretë
R: în a treia zi
H: a harmadik napon
C: trećeg dana
on the third day

If we can read the triangle and circle as "day", then that may work with the formula shown in the following image. This formula occurs at the beginning of many episodes in the text:


In this case, we could read the opening formula as beginning "one day...", an expression that is not uncommon at the beginning of a story.

This last image holds a wealth of possibility. It shows a meeting between Jesus and someone whose name or title is given above his head. This name or title shares two characters with the opening formula, including the first character for the word "day". If the underlying language is Albanian or South Slavic, we would expect the person to have a d near the end of his name. If it is Romanian, we would expect a z, and if Hungarian an n.