Wednesday, August 20, 2014

Word breaks in the Rohonc Codex

I have developed some really cool tools that work on the word level of a text, and identify things like the level of semantic content in words (to differentiate function words from content words), semantic similarity, and so forth.

Unfortunately, I can't use those tools on the Rohonc Codex yet because the transcription is on the level of the glyph, not the word.

But the codex gives us a few tiny clues to word breaks, and I think I have figured out how to leverage those to split the text into words.

The main clue is the hyphen, which looks like a double line. The hyphen at the end of a line tells us where a word break probably is not, and the absence of a hyphen tells us where a word break probably is. This allows us to build a small table of likely word heads and tails, which can then be used to identify word breaks elsewhere in the text: a break is likely wherever a common tail is followed by a common head.
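Here is a rough sketch of how such a table might be built. It assumes each line is a plain string and that the double-line hyphen is transcribed as a trailing "=" character; that marker, the function name, and the sample lines are all made up for illustration and are not part of the actual transcription.

```python
from collections import Counter

HYPHEN = "="  # hypothetical marker for the double-line hyphen


def edge_tables(lines, n):
    """Count n-character line heads and tails that sit at likely word breaks."""
    heads, tails = Counter(), Counter()
    prev_hyphenated = False  # assume the text starts at the start of a word
    for line in lines:
        hyphenated = line.endswith(HYPHEN)
        body = line[:-len(HYPHEN)] if hyphenated else line
        if len(body) >= n:
            # A line start is a word start unless the previous line was hyphenated.
            if not prev_hyphenated:
                heads[body[:n]] += 1
            # A line end is a word end unless this line ends in a hyphen.
            if not hyphenated:
                tails[body[-n:]] += 1
        prev_hyphenated = hyphenated
    return heads, tails


# Toy example: the second line is hyphenated, so its end is not counted as a
# tail and the start of the third line is not counted as a head.
heads, tails = edge_tables(["thecat", "sato=", "nthemat"], n=3)
```

Damaged lines would need to be dropped before counting, since a line with missing glyphs at either end would contribute a false head or tail.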

As a proof-of-concept, I took an English text with 47,975 words, removed all of the spaces and punctuation, then divided the text into lines of approximately 80 characters, breaking lines at word boundaries. I then measured the frequency of every segment that started or ended a line of text, and wrote a scoring function that looks like this:

In this equation, b(x, y) is the "break score" of a point between strings x and y, t_x is the number of times that string x is found at the end of a line, and h_y is the number of times that string y is found at the head of a line.

If the break score crosses a certain threshold, then I insert a word break between strings x and y.
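The thresholding step can be sketched roughly as follows. Note that the scoring function in this sketch is a stand-in: it simply multiplies the tail count by the head count, which is an assumption made for illustration and not necessarily the formula used above. The function names, the toy counts, and the parameter values in the example are likewise illustrative.

```python
from collections import Counter


def break_score(x, y, tails, heads):
    # ASSUMPTION: a simple product of the two counts, used here only as a
    # placeholder for the real scoring function.
    return tails[x] * heads[y]


def segment(text, heads, tails, n, threshold):
    """Insert a break wherever the score between two n-grams reaches threshold."""
    words, start = [], 0
    for i in range(n, len(text) - n + 1):
        x, y = text[i - n:i], text[i:i + n]
        if break_score(x, y, tails, heads) >= threshold:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


# Toy counts standing in for the table of line heads and tails.
heads = Counter({"the": 8})
tails = Counter({"cat": 4, "dog": 4})
print(segment("thecatthedog", heads, tails, n=3, threshold=10))
```

With these toy counts the only point that scores above zero is between "cat" and "the", so that is the only break inserted.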

There are a few parameters that govern how you actually implement this. The first is the length of the strings x and y, and the second is the threshold to use. Experimenting with my English text, I found the optimal string length to be 3, and the optimal threshold to be 10. With these parameters, I was able to correctly identify 78% of word breaks in the English text, with only one incorrect word break for every 10 correct word breaks.
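In the usual terminology, those two figures are recall (78% of the true breaks were found) and precision (10 correct out of every 11 proposed breaks, or about 91%). A minimal sketch of how they could be measured against a text whose real word breaks are known is below; the function names and sample words are only illustrative.

```python
def break_positions(words):
    """Character offsets of the internal word breaks in the unspaced text."""
    positions, offset = set(), 0
    for word in words[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def precision_recall(predicted, actual):
    """Precision and recall of predicted break offsets against the true ones."""
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


actual = break_positions(["the", "cat", "sat", "down"])
predicted = break_positions(["the", "catsat", "down"])
precision, recall = precision_recall(predicted, actual)
```

In this toy case both proposed breaks are correct, but one of the three real breaks is missed.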

Before I apply this to the Rohonc Codex, I think I will put some more effort into cleaning up the transcription. For the word-breaking algorithm to work reliably, I will need to identify lines that are damaged and possibly missing glyphs on one end or the other, since those lines would add false heads and tails to the table.

In the meantime, I am wondering what it would look like if we applied this to the Voynich manuscript. It would be interesting if the apparent word breaks in the VM didn't strongly correlate with the word breaks identified by this approach.

Tuesday, August 12, 2014

The mystery of the triple scriptural reference

I've added a new page to the sidebar of this blog, where I list 26 images that occur with the episodic formula, together with the book and chapter given in the accompanying formula. Perhaps among these 26 passages we can get some good cribs.

In matching up images to gospels, XDC.D makes the most sense if we read it as Matthew, HF.HS reads best as John, and CO.IH.D works best as Luke.

I feel fairly good about those three readings now, but page 170 of the codex presents us with an interesting mystery. We have an image accompanied by the episodic formula and three scriptural references:

The three references are XDC.D=Matthew chapter 5, CO.IH.D=Luke chapter 8, and CO.I.XCAB chapter 9.

Matthew 5:15 and Luke 8:16 share the Parable of the Lamp under a Bushel. In Matthew, this ends with the exhortation, "Let your light so shine before men, that they may see your good works and give glory to your Father who is in heaven." I argue that the picture shows Jesus carrying a light, and the dots represent the illumination of the light.

But then what is CO.I.XCAB chapter 9? It wouldn't be Mark, because this parable is told in the 4th chapter of Mark, not the 9th. (In fact, Mark is strangely absent as a source in the episodic formulae.)

Compounding the mystery is that the name of the third source begins with the same symbol as the name of Luke, and I can't find any other evangelists (canonical or otherwise) whose names start with L.

Thursday, August 7, 2014

VSO word order and the word "and"

From the text on page 23, it's pretty clear that the IX glyph represents the number 11. (There is a repeated formula that counts upward from the number 4, and where 11 should occur, there is the glyph IX.)

However, there are also numerous cases where the number IX doesn't make sense as 11. For example, in the possible crib for forty days and forty nights, we have IX before the number forty, which led me to suggest that IX might stand for a preposition.

As I was hunting for saints' names, I noticed the caption for the picture on page 89:

Caption from page 89: HK CUTB1C A CO D IX H HH

In this caption, I have already suggested CUTB1C = Christ, and A CO D is apparently a saint based on the common occurrence of RT A CO D. That leaves the sequence of IX H HH, of which H HH is a sequence that commonly occurs independent of IX.

To me, the placement of CUTB1C over the head of Christ (identified by the striped turban) and A CO D and H HH over the other two heads suggests that the latter are the names of the other two people in the picture.

In that case, reading IX as "and" could make some sense, both here and in "40 days and 40 nights".

That leaves HK, which could be a transitive verb for which Christ is the subject and the other two people are the object, indicating a VSO word order. The verb could be something like "appears to", giving a caption like Christ appears to X and Y.

We have HK also in the following caption on page 127:

Caption from page 127: HK QX XD CQ B1CU // RT CO C IX C ADD

The picture shows an angel making generally the same gesture as the Christ from page 89, but this time towards a figure on a bed. The caption begins with HK and ends with the name of a saint, CO C IX C ADD, who is presumably the figure on the bed.

Looking at the Latin and Greek texts of Matthew 2:19, where the angel of the Lord appears to Saint Joseph, suggests another possible reading for HK as "behold":

Latin: ecce apparuit angelus Domini in somnis Ioseph
Greek:  ιδου αγγελος κυριου κατ οναρ φαινεται τω ιωσηφ
English: Behold, the angel of the Lord appeared to Joseph in a dream

The name of the saint contains IX, but it is not clear whether this is morphemic or phonemic. Certainly RT CO C never appears without IX C ADD, and certainly only one individual is depicted on the bed, so presumably IX is part of the saint's name.

If we knew the name of the saint, we might get the name of the angel, and thereby get a crib. Delia Huegel notes of this image that, if we could read the text, we would know the names of the angel and the figure on the bed. Therein lies the trick.

Wednesday, August 6, 2014

Saints' Names

Doing a quick pass through the current (imperfect) transcription, I come up with the following rough list of 26 strings that follow the glyph RT (which I think may mean saint).

The evangelists are identified by the fact that they occur in the episodic formula. It is interesting that the most common possible saint name, XI D, is apparently not an evangelist (though it may not refer to a saint at all).

Two of the possible evangelists have names starting with CO, which suggests Mark and Matthew. If this is phonetic, then I wonder if the Holy Noun is a nomen sacrum for Maria Mater [Dei].

On the other hand, a phonetic reading of CO=ma militates against the idea that CO, CX and C are variants of the same glyph, since the apparent name for Thomas begins with CX.

Possible Saint            Frequency  Notes
XI D                      47
HF HS                     29         John the Evangelist
CO IH D                   24         Luke the Evangelist
XDC D                     13         Matthew the Evangelist
A CO D                    11
D                         8
U                         6
CX I CX [I] or CX I QX    5
CO I [I] WD               4          Author of a scripture with at least 15 chapters, perhaps as many as 35 chapters.
CX F O                    4          Thomas the Apostle
N1CO                      3
CUNSA I IX CUNSAR I I     3          possibly something like I Peter and II Peter
CUNSAR I I                2          possibly something like II Peter
V I V IX                  2
Q C1A                     2
A C IT                    1
C C C D                   1
C D K                     1
C T D                     1
CO I XCAB                 1          Author of a scripture with at least nine chapters.

Tuesday, August 5, 2014

Numerical Oddities and Saint Thomas

The Rohonc numerals are more complicated than they seem at first.

The small numbers seem to be relatively clear, if a little strange. Page 23 gives us the numbers 4 through 11, and by hunting around throughout the text we can fill in the numbers going back to 1:

1: Q1C (top of tablet I, page 14)
2: I I
3: I I I
4: I I I I
5: I I I I I
6: CY
7: CY I or I I I I I I I
8: CY I I or I I I I I I I I
9: LT
10: T
11: IX

This system makes a certain kind of sense, but then what are we to make of numbers like the following?

163.L9: I I I I I I I T IX CY CY
21.L4: IX I I I I I I T
59.R10: I I T T I I I I I
59.R11: I I I T T T CY

The only sense I can make of these is that I I I I I I I T, for example, is meant to be read 70 (i.e. seven tens). In the case of 59.R10-11, I am inclined to think that the I I T T and I I I T T T are intended to write 20 and 30, with the T T and T T T being redundant, since I can't find any case of T T or T T T without a prefix of I I or I I I.
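To make that reading concrete, here is a small sketch that treats a run of I glyphs followed by one or more T glyphs as "that many tens", and otherwise just adds up the small-number values listed above. The additive treatment of whatever follows the tens is my own assumption for the sketch, and it does not settle what IX is doing in 163.L9 or 21.L4. The function name is made up.

```python
# Values of the single glyphs, taken from the small-number list above.
SMALL = {"Q1C": 1, "I": 1, "CY": 6, "LT": 9, "T": 10, "IX": 11}


def read_numeral(text):
    """Tentative reading: a run of I's followed by T's counts as that many tens."""
    tokens = text.split()
    total, i = 0, 0
    while i < len(tokens):
        ones = 0
        while i < len(tokens) and tokens[i] == "I":
            ones += 1
            i += 1
        if ones:
            tens = 0
            while i < len(tokens) and tokens[i] == "T":
                tens += 1
                i += 1
            # Any trailing T's are treated as redundant markers of "tens".
            total += ones * 10 if tens else ones
            continue
        # ASSUMPTION: every other glyph simply adds its small-number value.
        total += SMALL[tokens[i]]
        i += 1
    return total


print(read_numeral("I I T T"))          # 20
print(read_numeral("I I I I I I I T"))  # 70
print(read_numeral("CY I I"))           # 8
```

Under these assumptions 59.R10 would come out as 25 and 59.R11 as 36, but that depends entirely on the additive guess.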

This reading would solve the mystery of John 22, where the picture apparently of Doubting Thomas on page 64 is accompanied by an apparent scriptural reference to chapter I I T T in the book of HF HS. Since the story of Doubting Thomas appears in chapter 20 of the book of John, reading I I T T as 20 saves us having to resort to Byzantine kephalaia to explain this reference.

This would mean HF HS refers to John. If Delia Huegel is right, and the image on page 64 depicts Doubting Thomas, then we might assume Thomas would be mentioned in the text accompanying the image. Assuming RT means saint, the most likely candidate for Saint Thomas is the name mentioned twice in this section: RT CX F O.

64.R8 *Saint Thomas

64.R10 *Saint Thomas

It is interesting that this name contains a glyph that looks like a rotated T, followed by a glyph that looks like an O, as though it contained the abbreviation TO for Thomas.

Note that the word I am tentatively reading as night(s) also contains the O glyph, and that the word for night contains a rounded vowel in a wide variety of Romance languages.