Thursday, December 26, 2013

Discrete logarithm as a polynomial

I finally had time to retool my "solver" so it could work on polynomials with multiple indeterminates. Much debugging was required, but in the process I discovered a number of ways to make it run more quickly.

Once I had it running correctly, I wanted to start exploring the modular logarithm function. I immediately ran into trouble because modular exponentiation is not a permutation, so for some values there may be multiple valid logarithms, and for other values there may be none. I'll have to figure out how to make that work properly.

In the meantime I managed to put in place some manual workarounds. The solutions have a strange kind of beauty, but I'm not sure how to present them, and I think I should find the proper way to handle the shortcomings I noted above.

How long would it take to calculate the logarithm (mod N) as a polynomial? Looking at sample solutions for N=3 through N=11, I think the calculation could be done in roughly 2N^3 operations (multiplications and additions--provided unlimited memory to cache precalculated values). Given that the brute force approach would only require a maximum of about 2N multiplications and comparisons, I think it's fair to say that the discrete logarithm probably cannot be calculated efficiently by evaluating any kind of polynomial expression--it doesn't even beat brute force.
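
For comparison, the brute force baseline is nothing more than a loop like the following sketch (the numbers in Main are toy values, and the function returns -1 when no logarithm exists):

using System;

class DiscreteLogBruteForce
{
    // Find x such that g^x = y (mod n) by stepping through the powers of g:
    // one multiplication and one comparison per step, at most about 2N operations.
    // Returns -1 if no logarithm exists (the exponential map is not a permutation,
    // so some targets are unreachable).
    static int DiscreteLog(int g, int y, int n)
    {
        int power = 1 % n;
        for (int x = 0; x < n; ++x)
        {
            if (power == y) return x;
            power = (power * g) % n;
        }
        return -1;
    }

    static void Main()
    {
        // Toy example: 2^x = 5 (mod 11) has the solution x = 4, since 2^4 = 16 = 5 (mod 11).
        Console.WriteLine(DiscreteLog(2, 5, 11));
    }
}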

Monday, December 23, 2013

Multiplicative inverse

I was thinking about the conclusions I came to in my last post, and it occurred to me that I could build a little problem solver. You could feed part of a truth table into it, and it would generate the simplest polynomial that would explain the truth table.

I actually managed to write the thing, and it works. I have never written anything like this before, so it was an interesting experiment.

Of course, to write it, I had to write a function to calculate the multiplicative inverse of a number in [1..N-1] (mod N), so naturally the first thing I wanted to do with it when I was done was to see what polynomial generates the multiplicative inverse.

My solver only works for prime values of N, and the result was interesting:

x^-1 = x (mod 3)

x^-1 = x^3 (mod 5)
x^-1 = x^5 (mod 7)
x^-1 = x^9 (mod 11)

The pattern here, of course, is that x^-1 = x^(N-2) (mod N). The Wikipedia article on modular multiplicative inverse mentions that this precise formula works, but only for prime values of N.
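
A quick way to check the pattern for these small primes, using nothing but repeated squaring:

using System;
using System.Numerics;

class InverseByFermat
{
    // Check the pattern above: for prime N and every x in [1..N-1],
    // x * x^(N-2) should be congruent to 1 (mod N).
    static void Main()
    {
        int[] primes = { 3, 5, 7, 11 };
        foreach (int n in primes)
        {
            bool allOk = true;
            for (int x = 1; x < n; ++x)
            {
                BigInteger inverse = BigInteger.ModPow(x, n - 2, n);
                if ((x * inverse) % n != 1) allOk = false;
            }
            string verdict = allOk ? "x^(N-2) inverts every x" : "mismatch";
            Console.WriteLine($"N = {n}: {verdict}");
        }
    }
}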

My current solver only works for polynomials of a single indeterminate, but I ought to be able to write one that works for multiple indeterminates. If I get the time to do it, the first thing I'm going to look at is the inverses of the exponent function.

Sunday, December 22, 2013

Solution to the polynomial question

In answer to the question that was bugging me in my last post, it turns out that every function operating on values [0..N-1] can be represented as a polynomial. The proof is that you can generate the polynomial from the function's truth table. (I'm not sure if that's a proof that a real mathematician would accept, but it works for me.)

The basic building block for this is the following expression, which returns a value of f(a) for x = a, and a value of 0 for all other values of x:

f(a) * ∏_{i≠a} (x - i) * (a - i)^-1   (mod N)

To build the polynomial for a function that takes only one argument, you sum up these building blocks over all possible values of a. So, you can say:

f(x) ≡ Σ_{a=0..N-1} f(a) * ∏_{i≠a} (x - i) * (a - i)^-1   (mod N)

From there, it is just a hop, skip and a jump to functions with multiple arguments. For f(x,y), you can do this hideous thing:

f(x, y) ≡ Σ_{i=0..N-1} Σ_{h=0..N-1} f(i, h) * ∏_{j≠i} (x - j) * (i - j)^-1 * ∏_{j≠h} (y - j) * (h - j)^-1   (mod N)

Not necessarily practical, but it does prove the point that every possible function (mod N) can be represented as a polynomial.
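
For prime N, the single-argument version of the construction fits in a few lines of code. The following is a rough sketch (not my actual solver) that turns a full truth table into polynomial coefficients, using (a - i)^(N-2) for the modular inverse, which is exactly why prime N is required:

using System;

class TruthTableToPolynomial
{
    // Given the full truth table f(0..N-1) for prime N, build coefficients c[0..N-1]
    // of a polynomial with f(x) = sum over k of c[k] * x^k (mod N).
    static int[] BuildPolynomial(int[] table, int n)
    {
        var coeffs = new int[n];
        for (int a = 0; a < n; ++a)
        {
            // Start from the constant polynomial f(a), then multiply in
            // (x - i) * (a - i)^-1 for every i != a.
            var term = new int[n];
            term[0] = table[a];
            for (int i = 0; i < n; ++i)
            {
                if (i == a) continue;
                int inv = ModPow(Mod(a - i, n), n - 2, n);     // (a - i)^-1, prime N only
                term = MultiplyByLinear(term, Mod(-i, n), inv, n);
            }
            for (int k = 0; k < n; ++k) coeffs[k] = (coeffs[k] + term[k]) % n;
        }
        return coeffs;
    }

    // Multiply polynomial p by inv * (x + constant) (mod n). The result still fits
    // in an array of length n because each basis polynomial has degree at most n - 1.
    static int[] MultiplyByLinear(int[] p, int constant, int inv, int n)
    {
        var result = new int[p.Length];
        for (int k = 0; k < p.Length; ++k)
        {
            if (p[k] == 0) continue;
            result[k] = (result[k] + p[k] * constant % n * inv) % n;                 // constant part
            if (k + 1 < p.Length) result[k + 1] = (result[k + 1] + p[k] * inv) % n;  // x part
        }
        return result;
    }

    static int Mod(int v, int n) => ((v % n) + n) % n;

    static int ModPow(int b, int e, int n)
    {
        int r = 1;
        for (b %= n; e > 0; e >>= 1, b = b * b % n)
            if ((e & 1) == 1) r = r * b % n;
        return r;
    }

    static void Main()
    {
        // Example: the multiplicative inverse table mod 5 (with 0 mapped to 0)
        // should come back as x^3, i.e. coefficients [0, 0, 0, 1, 0].
        var table = new[] { 0, 1, 3, 2, 4 };   // 1^-1=1, 2^-1=3, 3^-1=2, 4^-1=4
        Console.WriteLine(string.Join(", ", BuildPolynomial(table, 5)));
    }
}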

(Note added 12/23/2013--I just realized this only works if N is prime, otherwise we don't have a modular multiplicative inverse for all (a-i), (i-j), (h-j). That suggests that for non-prime values of N there might be some functions that couldn't be expressed as polynomials, which is interesting.)

Saturday, December 21, 2013

Polynomial Puzzle

Think about a set of integer values [0..N-1]. The total number of functions that can take two values from this set and return a third value from the same set is N^(N^2) (i.e. the number of distinct truth tables that can be built for those values). This is the same as the number of polynomials of the following form (where each coefficient k_f,ij is also drawn from [0..N-1]):

f(a, b) ≡ Σ_{i=0..N-1} Σ_{j=0..N-1} k_f,ij * a^i * b^j   (mod N)

So the question bugging me is this:  Can every function that takes a pair of values in [0..N-1] and returns a value in [0..N-1] be represented by one of these polynomials?

I think it can, but I'm having a hard time proving it to myself.

If so, this gives us another (perhaps more meaningful) way to represent functions as numbers.  Any function f(a, b) like the one above could be encoded as a number as:



In this system, addition would always be encoded as N+1, and multiplication would be encoded as NN+1.

For what it's worth.

Sunday, December 15, 2013

A few bits from the Rohonc Codex

Here, without any particular system of organization, are some suggested decipherments from the Rohonc Codex. (Note--I have updated this post with better images and added more discussion. As usual, the original was written late at night and I was very tired.)

I feel like the Rohonc Codex is a more "honest" text than the Voynich. For one thing, the scribe made mistakes in the RC, and struck through them or drew pictures over them. In the VM it seems there are no mistakes like this. The existence of a mistake implies a difference between correct and incorrect text, which lends weight to the idea that the text is not a hoax (since in a hoax there is no difference between correct and incorrect text).

In addition, the RC has apparent cultural features. It seems that all major figures depicted in the text can be identified based on their headgear, suggesting a tradition of stylized iconography. Christ can always be identified by his beard and striped turban, for example.

In some images, the figures are given names, which is where my suggested decipherments begin. Here is Jesus:



Jesus

This name is frequently followed, in the text, by another formula, which I think is probably Christ. For example, in the caption above an image of the Magi bearing gifts:


Jesus Christ

(Note 2/25/2014: Delia Huegel reads these characters the same way. See her post on Divine Designators.)

In one image, Jesus meets a seated figure wearing a crown who is mentioned frequently in the surrounding text. I think he may be Pontius Pilate:

Jesus and Pilate?

Pontius Pilate?

The reason I think this is Pilate is that they are surrounded by Roman soldiers (with pointed caps, as I mentioned in my first post), and the Pilate figure is seated on a throne with a crown. This individual is mentioned very frequently in some sections of the text, which makes me wonder if those portions aren't drawn from the apocryphal Acts of Pilate.

There are several depictions of the crucifixion, some of which show a board above Christ's head. In the images that I have, the board is difficult to read, but not impossible. Here are the best two:

INRI

Though the images here are not great, something bearing a strong resemblance to this formula can be found on a page where Pilate's name also appears frequently:

INRI in context with Pilate's name

What is interesting about the passage above is that it contains nearly the same text repeated twice, with apparent numbers before each repetition.

Is there a dialogue between Jesus and Pilate in the Acts of Pilate or Gospel of Nicodemus (or somewhere else) where Pilate asks Jesus more than once whether he is the King of the Jews? If so, that could be the underlying text here.

Tuesday, December 3, 2013

Figured it out

This morning (or last night...it was dark, I was sleepy) I figured out the answer to the question that started this whole function/number thing.

For a set of values N, I wanted to find out how many functions existed with the following properties:

f(f(a, b), c) = f(f(a, c), b)

∃ g : g(a, f(a, b)) = b

where a, b, c ← N

Now I have an answer, but maybe not enough time to explain it. The number of such functions is:

(|N|)! * (|N|-1)! * (|N|-2)! * ... * (1)! / (|N|-1)

So, for example, I determined experimentally that there were 96 such functions for |N| = 4, and this works out as 4! * 3! * 2! * 1! / 3 = 96.

In the process of working this out, I realized that these functions have the interesting property that there is always a value i such that f(x, i) = x. I also worked out (for the heck of it) a way to derive the truth table for the nth function. I ought to write it down somewhere, or I will almost certainly forget it, but I don't have time at the moment.

The reason I was interested in this question is that this type of function could be used in establishing a shared secret for cryptography. For example, I could publish a function f, a value a, and a value f(a, b), then you could publish f(a, c).  I could take your value and generate f(f(a, c), b), while you could take mine and generate the same value as f(f(a, b), c). The resulting value would be our shared secret, provided the inverse function g is difficult to discover.
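
Modular exponentiation is the familiar instance of such a function (this is just Diffie-Hellman). A minimal sketch with made-up toy numbers (p, the generator, and both private exponents are only for illustration; the inverse g from above corresponds to taking a discrete logarithm, which is believed to be hard):

using System;
using System.Numerics;

class SharedSecretSketch
{
    static void Main()
    {
        // Toy parameters, far too small for real use. With p prime and gen a
        // generator mod p, f(a, b) = a^b mod p has both properties above.
        BigInteger p = 101;
        BigInteger gen = 2;
        BigInteger b = 29;   // my private value
        BigInteger c = 53;   // your private value

        BigInteger mine = BigInteger.ModPow(gen, b, p);    // I publish f(gen, b)
        BigInteger yours = BigInteger.ModPow(gen, c, p);   // you publish f(gen, c)

        BigInteger sharedForMe = BigInteger.ModPow(yours, b, p);    // f(f(gen, c), b)
        BigInteger sharedForYou = BigInteger.ModPow(mine, c, p);    // f(f(gen, b), c)

        Console.WriteLine(sharedForMe == sharedForYou);    // True: the same shared secret
    }
}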

Monday, December 2, 2013

Quickly: How to turn a number into a function

In these recent posts, I have been talking about hunting for functions with certain properties. I decided to try an exhaustive search of a particular function space to find the functions I was looking for, by identifying a 1:1 relationship between numbers and functions, then iterating through all of the numbers and testing the corresponding function.

In brief, I treated the number as an efficient encoding of the truth table for the function. By "efficient", I mean I used an encoding system in which every number in the range represented a valid and distinct truth table, and all possible functions were represented in the range. Since I'm talking about functions that operate on discrete, finite sets of values, the number of truth tables is finite.

First, I used a system I came up with a while ago to represent permutations as numbers. I originally devised this as a way to encode every possible shuffled state of a deck of cards, but it really works for any permutation. Encode the permutation fa as follows:

Rp(fa) = N! * (fa:0(0)/N! + fa:1(1)/(N-1)! + fa:2(2)/(N-2)! + ... + fa:N-1(N-1)/1!)

Where fa:n returns the zero-based index of fa(n) within the ordered set of values returned by fa(x) where x >= n.

So, for example, the permutation fa(x) = x + 2 (modulo 4) can be represented as the number 10.  Here's how the encoding works:

fa:0 returns the zero-based index of fa(0) in the set [0, 1, 2, 3].  That is, it returns 2, since fa(0) = 2, and 2 is found at position 2 in the set.

fa:1 returns the zero-based index of fa(1) in the set [0, 1, 3].  That is, it returns 2, since fa(1) = 3, and 3 is found at position 2 in the set.

fa:2 returns the zero-based index of fa(2) in the set [0, 1].  That is, it returns 0, since fa(2) = 0, and 0 is found at position 0 in the set.

fa:3 returns the zero-based index of fa(3) in the set [1].  That is, it returns 0, since fa(3) = 1, and 1 is found at position 0 in the set.

So:

Rp(λx → x + 2) = 4! * (2/4! + 2/3! + 0/2! + 0/1!) = 10

In my last post, I treated the function f(a, b) as selecting permutation fa from an indexed set, and applying it to value b. The N permutations required for distinct values of a can all be encoded together in a single number, as follows:

Rf(f) = Rp(f0)*(N!)^0 + Rp(f1)*(N!)^1 + ... + Rp(fN-1)*(N!)^(N-1)

So the function f(x, y) = x + y (modulo 4) can be encoded as follows, given that Rp(f0) = 0, Rp(f1) = 17, Rp(f2) = 10, Rp(f3) = 3.

Rf(λx, y → x + y) = 0*(4!)^0 + 17*(4!)^1 + 10*(4!)^2 + 3*(4!)^3 = 0 + 408 + 5760 + 41472 = 47640

For my function-space exploration, I simply reversed this process for each number to generate the truth table, and I used the truth table to test for the properties I was looking for.  However, for the fun of it, here is how you would apply the function to a set of values.  Given the function above, you would apply it to the values 2 and 3 as follows:

Rp(f2) = floor(Rf(f) / (4!)^2) % 4! = floor(47640 / 576) % 24 = 10

f2:0(0) = floor(4! Rp(f2) / 4!) % 4 = 2, therefore f2(0) = 2 (i.e. zero-indexed value 2 from [0, 1, 2, 3])
f2:1(1) = floor(3! Rp(f2) / 4!) % 3 =  2, therefore f2(1) = 3 (i.e. zero-indexed value 2 from [0, 1, 3])
f2:2(2) = floor(2! Rp(f2) / 4!) % 2 =  0, therefore f2(2) = 0 (i.e. zero-indexed value 0 from [0, 1])
f2:3(3) = floor(1! Rp(f2) / 4!) % 1 =  0, therefore f2(3) = 1 (i.e. zero-indexed value 0 from [1])

So 2 + 3 (modulo 4) = 1.

The looooooong way.
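
For what it's worth, here is a compact sketch of the permutation encode/decode round trip under the conventions above (the function-level number Rf would then just stack these values in base N!, as in the formula above):

using System;
using System.Collections.Generic;
using System.Linq;

class PermutationNumbering
{
    // Encode a permutation f of [0..N-1] as a number, following Rp above:
    // the digit for position k is the zero-based index of f(k) among the values
    // not yet used, and the weight for position k is N!/(N-k)!.
    static long EncodePermutation(int[] f)
    {
        int n = f.Length;
        var remaining = Enumerable.Range(0, n).ToList();   // unused values, kept in sorted order
        long result = 0;
        long weight = 1;                                    // N!/(N-0)! = 1
        for (int k = 0; k < n; ++k)
        {
            int digit = remaining.IndexOf(f[k]);
            result += digit * weight;
            remaining.RemoveAt(digit);
            weight *= n - k;                                // next weight: N!/(N-(k+1))!
        }
        return result;
    }

    // Reverse the process: peel the digits off (largest weight first), then use
    // each digit to pick the next value from the shrinking sorted list.
    static int[] DecodePermutation(long r, int n)
    {
        var weights = new long[n];
        weights[0] = 1;
        for (int k = 1; k < n; ++k) weights[k] = weights[k - 1] * (n - k + 1);

        var digits = new int[n];
        for (int k = n - 1; k >= 0; --k)
        {
            digits[k] = (int)(r / weights[k]);
            r %= weights[k];
        }

        var remaining = Enumerable.Range(0, n).ToList();
        var f = new int[n];
        for (int k = 0; k < n; ++k)
        {
            f[k] = remaining[digits[k]];
            remaining.RemoveAt(digits[k]);
        }
        return f;
    }

    static void Main()
    {
        // The worked example above: f(x) = x + 2 (mod 4) is [2, 3, 0, 1] and encodes to 10.
        var f2 = new[] { 2, 3, 0, 1 };
        long code = EncodePermutation(f2);
        Console.WriteLine(code);                                           // 10
        Console.WriteLine(string.Join(", ", DecodePermutation(code, 4)));  // 2, 3, 0, 1
    }
}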

Sunday, December 1, 2013

Representing functions with numbers

In my last hurried post, I mentioned that I carried out a brute-force exploration of the function-space I was looking at, looking for functions with certain properties. I'm sure the technique I used is well-known, and there are probably better ways to do it, but I thought I would describe what I did anyway.

The functions I was looking at were invertible, so if there was a function f (a, b), then there had to exist another function g (a, c) such that g (a, f (a, b)) = b. In this case, all values are drawn from the set N.

Because of the invertibility requirement, we can say that any function fa (b) = f (a, b) is a permutation of |N| values, so the number of distinct functions fa is |N|!.  If we think of f (a, b) as selecting a function fa from a table of values using the index a and applying that function to the value b, then the number of distinct tables of |N| functions drawn from a set of size |N|! is (|N|!)^|N|.

That means for |N| = 4, for example, I needed to check (4!)^4 = 331776 functions for those that had the additional commutative property I was looking for.

In order to check all of the functions exhaustively, I came up with a 1:1 relationship between every number in [0..331775] and every function in the set. Then I iterated through each number, generated the function from the number, and tested the function for the property I was looking for.
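
A rough sketch of that kind of search (enumerating the permutation tables directly, rather than going through a number encoding) might look like this:

using System;
using System.Collections.Generic;
using System.Linq;

class FunctionSpaceSearch
{
    // Enumerate every f(a, b) built from a table of |N| permutations (which makes
    // the inverse g exist by construction) and count the ones satisfying
    // f(f(a, b), c) = f(f(a, c), b) for all a, b, c.
    static void Main()
    {
        int n = 4;                                                    // |N|
        var perms = Permutations(Enumerable.Range(0, n).ToArray()).ToList();
        var choice = new int[n];                                      // which permutation plays the role of fa
        long count = 0;

        while (true)
        {
            if (IsCommutative(choice, perms, n)) ++count;

            // Advance 'choice' like an odometer in base |N|!.
            int pos = 0;
            while (pos < n && ++choice[pos] == perms.Count) { choice[pos] = 0; ++pos; }
            if (pos == n) break;
        }

        Console.WriteLine(count);   // 96 for |N| = 4, matching the count in the Nov 30 post
    }

    static bool IsCommutative(int[] choice, List<int[]> perms, int n)
    {
        // f(a, b) = perms[choice[a]][b]
        for (int a = 0; a < n; ++a)
            for (int b = 0; b < n; ++b)
                for (int c = 0; c < n; ++c)
                    if (perms[choice[perms[choice[a]][b]]][c] != perms[choice[perms[choice[a]][c]]][b])
                        return false;
        return true;
    }

    static IEnumerable<int[]> Permutations(int[] items)
    {
        if (items.Length <= 1) { yield return items; yield break; }
        for (int i = 0; i < items.Length; ++i)
        {
            var rest = items.Take(i).Concat(items.Skip(i + 1)).ToArray();
            foreach (var tail in Permutations(rest))
                yield return new[] { items[i] }.Concat(tail).ToArray();
        }
    }
}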

But parenting duties call again, so I'll have to finish this later.

Saturday, November 30, 2013

Functions with certain properties

I've been trying to figure out the general answer to this question. It seems like I should be able to sort it out, but I've just got a mental block.

For a set of values N, how many functions f exist, such that the following are true? (Forgive me if I make mistakes with the notation...I'm not a mathematician.)

f (f (a, b), c) = f (f (a, c), b)


∃ g : g (a, f (a, b)) = b

where a, b, c ← N

I've been able to determine the answer through brute-force exploration of the function space for the following sizes of N:

For |N| = 2, there are 2 functions with these properties.
For |N| = 3, there are 6 functions with these properties.
For |N| = 4, there are 96 functions with these properties.

It seems like it ought to be easy to say, for a given size |N|, how many of these functions there are. But I can't see it yet.

Note 12/3/2013: I figured it out.

Monday, November 25, 2013

Structure of political events

I've off and on been trying to find ways to create computational models of political events in order to predict their outcomes. I've started with an intuitive sense that events evolve in a structured way, and if we can predict the probability of the smaller steps in that evolution, we can predict the total outcome of the event.

For example, consider the question of whether Afghanistan will approve the bilateral security agreement offered by the United States. The Loya Jirga (assembly of elders) has just approved the agreement, and urged Karzai to sign it before the end of the year. In order for Karzai to sign it, the following internal events must occur:

E1: The Afghan National Assembly must approve the BSA. This will no doubt be influenced by the fact that the Loya Jirga has already approved it, since the Afghan constitution declares the Loya Jirga to be the highest expression of the will of the Afghan people. However, there is some question of whether the Loya Jirga was properly assembled according to the constitution, and some feeling that Karzai convened the Loya Jirga because he thought he could influence its decision more than he could influence the National Assembly directly. Let's say the odds of this are something like 0.9.

E2: Karzai must choose to sign the BSA before the end of the year. Currently he is calling for it to be signed next year.

Intuitively, the outcome of the total event should be something like:

P_E1 * P_E2

Where P_E is the probability that event E will occur.

I do not know specifically why Karzai is holding out on signing the BSA this year. In the absence of knowledge, it is tempting to assign a 0.5 probability to PE2, giving final odds of 0.9 * 0.5 = 0.45. However, my recent experience is showing that there is another way to look at these questions.

In the most general description, event E2 is an irreversible state change. Once Karzai chooses to approve the BSA, he cannot (in a practical sense) reverse the approval and return to his original state. He could make his decision on any day, so PE2 should really decay towards zero according to a formula like the following:

P_E2 = 1 - (1 - p_E2)^d

Where p_E2 represents the odds on any given day that Karzai will change his mind, and d is the number of days remaining before the end of the year.

If delaying the BSA is good in itself for Karzai, then he will never change his mind, so we could say that pE2 is zero and so likewise is PE2. That might be the case if Karzai believes he could be charged with treason, assassinated, or otherwise subject to persecution/prosecution for signing the BSA.

On the other hand, Karzai may be threatening to delay signature of the BSA in order to extract some concession from the United States. In that case, for him to fail to sign the BSA by the end of the year signifies the failure of his gambit. The odds of his signature in that case are calculated very differently, so we should probably think about this as two separate events:

E2a: Karzai signs the BSA even though he believes he could be persecuted as a result.

E2b: Karzai signs the BSA after receiving a concession from the United States.

In order to calculate E2 from these two values, we need to decide the odds that either or both are true. Let's say there are 0.1 odds that Karzai fears serious persecution and 0.9 odds that he is trying to wring out a concession:

P_E2 = 0.1 * P_E2a + 0.9 * P_E2b

As the end of the year approaches, the value of Karzai's signature drops, so the value of what he expects in return should decrease.  Meanwhile, the value of what the US offers should gradually increase until it meets the value of what Karzai has to offer. If both players know each other well, then they have already calculated that they will reach an agreement before the end of the year, and the only question is whether they are wrong. In that case, PE2b should depend on the odds that the two players have correctly estimated each other's thresholds for negotiation.

P_E2b = 1 - (1 - C_E2b)^d

Where C_E2b represents the odds that the two parties have correctly assessed each other's positions and can reach an agreement on any given day.

So, if we estimate 0.9 odds that the National Assembly will concur with the decision of the Loya Jirga, and 0.8 odds that the two players have correctly estimated each other's positions, then the total likelihood of a timely signature is:

P_Etotal = P_E1 * (0.1 * P_E2a + 0.9 * P_E2b)
P_Etotal = 0.9 * (0.1 * 0.0 + 0.9 * (1 - (1 - 0.8)^d))

Under that model, the odds of Karzai signing the BSA hover at about 0.81 till the last three days of December, when they suddenly plummet to zero.
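
Here is a quick sketch that just runs those numbers for each day of December (the exact day-counting convention is an arbitrary choice):

using System;

class BsaSigningOdds
{
    // Reproduce the numbers above: P_total for each day of December, where d is
    // the number of days remaining before the end of the year.
    static void Main()
    {
        double pE1 = 0.9;        // National Assembly concurs with the Loya Jirga
        double pFear = 0.1;      // Karzai fears persecution and never signs
        double pBargain = 0.9;   // Karzai is bargaining for a concession
        double cE2b = 0.8;       // both sides have read each other correctly

        for (int day = 1; day <= 31; ++day)
        {
            int d = 31 - day;                            // days remaining
            double pE2b = 1 - Math.Pow(1 - cE2b, d);
            double pTotal = pE1 * (pFear * 0.0 + pBargain * pE2b);
            Console.WriteLine($"Dec {day,2}: {pTotal:F3}");
        }
    }
}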



One of the things that bedevils me here, though, is the unknown intermediate steps. If I have time, I think I would like to see what kind of behavior emerges if I simulate situations where there are thousands of dependent steps and thousands of possible routes to a particular outcome. Do complex networks of interrelated events conform to a different set of rules en masse?

(Note added 12/17/2013: The Karzai example here does not work if, for example Karzai is engaged in negotiations with Iran. The reason is that the settled state of negotiations with the US gives Karzai the opportunity to develop other options and choose between them. So maybe we should expect that a less-than-satisfactory settled negotiation will usually stall while other options are developed.)

Sunday, November 10, 2013

A chartreuse analogy

This is a quick follow-up on the last post, where I sketched an idea for a functional programming language where the lambda was implicit.

First, let me repudiate the whole idea of the unresolve metafunction. There's really no need for it, and it just makes things messy. I'm going to go back and cross that bit out.

Second, it might have occurred to you that, if the lambda is implicit in the way I described, then there is no distinction between values and expressions that take no parameters. At first, that seems like a problem, but I don't think it is. In a pure functional context, what is really the difference?

Of course, even the purest functional language must eventually sully itself by making contact with this sordid and messy world, and there you might have a problem, since your parameterless functions need to be evaluated in order for any side-effects to occur.

Which brings me to an idea about how to separate the pure (and provable) side of a program from the impure, messy side: Rather than trying to incorporate both worlds in one language, let them be separate languages.

Functional languages are bad at being imperative, and imperative languages are bad at being functional. So strictly separate the two by writing the core code in a functional language, and accessing it from the real world via an imperative language. Naturally, you want to be able to compile the two together, and you want the languages to have common features, but let it be clear when you are coding in one or the other, and let the compiler enforce the purity of the functional side.

You could think of the functional (sub-)language as being like the eremitical side of a Carthusian monastery, and the imperative (sub-)language as being like the cenobitic side. In a Carthusian monastery you have the hermits, whose lives are dedicated to prayer and contemplation, and the lay monks, who serve the hermits and maintain the monastery. The two are strictly separated and work in different domains, but are integrated with each other in a symbiotic way.


Saturday, November 9, 2013

Implicit lambdas and thusness: a silly idea

When my brother and I were out camping with our dad, we used to sit around the fire at night and ask all kinds of "what if" questions.

Tonight, I've gotten the kids to sleep, gotten the dishes washed, and done a little amateur skimming of the Romani internet. (Who couldn't love a language that calls a black widow spider yalakráno, and a pear tree ambrolin?)

Now I am tired, and unable to do anything except ask "what if?"

What if you had a functional programming language where the lambda was implicit, so any expression with an unresolved term was a lambda expression, and the unresolved terms were the parameters?

Obviously, you would need a couple of things to make this work. First, you would need to do away with a lot of syntactic sugar, because your parser needs to be able to operate without necessarily knowing anything about the referents of the symbols it is operating on. Second, you would need to have some way to recognize which input values to a lambda expression match which parameters.

But both of those are just conventional matters, not real theoretical difficulties. Let's say, for the sake of this post, that we have a syntax like Lisp, and parameters are ordered as they appear in the source code.

Now for the second bit: What if you were able to reference a function from within itself using a special term, similar to the way you can use the variable this, self or me in object-oriented languages?

For the sake of this post, let's call this special term thus.

If you had a language with both of these features, you could write a lambda expression for the factorial function as follows:

(if (eq x 1) 1 (multiply x (thus (subtract x 1))))

We assume that if, eq, multiply, subtract, the constant 1, and the special term thus are all known, so the only unresolved term is x, which is the sole parameter to the expression.

Now, what if you could take a lambda expression in this language and override some of the known terms, producing functions with similar structure but different behavior? So, maybe have a metafunction called unresolve that returns a term of an expression from the known space to the parameter space. So, if you wanted a function like factorial that took the parameter x and also some function to use instead of multiply, you could say:

(unresolve factorial multiply)

Then you could substitute a different function, like divide:


((unresolve factorial multiply) divide)

These are the sorts of vapid ideas a tired mind produces.

Friday, November 8, 2013

FST via RDBMS

I was stuck in an airport a couple days ago with nothing to do, so I started a little project to pass the time. The challenge is this: Using a conventional database and a conventional programming language, what is the fastest program I can write that can take an inflected word in some given language and identify the root and the inflection(s)?

There are two ends of the spectrum in solutions. On one end, you have a "fat" solution involving a massive database containing every possible inflected form of every word in the language. On the other end, you have a "lean" solution with a smaller database of stems and inflectional information, and some complex code to use the two in combination.

The fat solution has the advantage of using mature indexing technology to efficiently search the massive database for the subject term.  Supposing you have a binary tree index, the cost of executing a query with the fat solution might be:

T_fat = Q + N * log(s*i)

where Q is the fixed cost of opening a query to the database, N is the incremental cost of checking one node in the binary tree, and log(s*i) is the depth of the binary tree, based on the number of stems (s) and inflections (i).

However, for agglutinative languages this is a nightmare. Consider the standard Uyghur expression used when meeting someone new:

tonušqanliqimizdin xošalmen

The stem of the first word is ton, and the rest is all inflection (uš - qan - liq - imiz - din). A database containing all inflected forms of Uyghur verbs would be huge.

The leanest solution searches the stems and inflections separately. Since the inflections are a fixed set, it makes sense to factor them out first, then do a simple search of the stems.  The speed of the lean solution is the time required to query for all of the inflectional data (N*i), the time required to parse all possible inflections (P), and the time required to query the roots (N * log s):

T_lean = (Q + N*i) + P + (N * log s)

which can be written as

T_lean = (N*i) + P + T_fat - (N * log i)

In order for the lean solution to beat the fat solution, this has to be true:

N*i + P < N * log i

That simply isn't doable, because P and N are positive, and i > log i.  In order to get the lean solution to work, we've got to pre-fetch the inflectional data. In fact, we could write a routine to use the inflectional data to build a parser, in which case we're down to this:

T_lean = P + T_fat - (N * log i)

And now all we need is to make this work:

P < N * log i

The fastest parser would probably be a finite state transducer. It seems to me you would end up with the FST traversing about the same number of states as the nodes traversed by the index search, but the FST has the advantage of running in memory, while the index search would have to deal with some of the necessary evils of relational database management (like query preparation, I/O buffering, and so forth) that are hiding in the value N.

Going back to the challenge requirement that this has to be implemented in a conventional programming language, the best marriage between conventional programming and an FST is probably found in regular expression processing, so that would probably be the tool to use for the parser.
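
As a toy illustration of the suffix-factoring step (the suffix list below is just the five morphemes from the Uyghur example above, not a real description of the language, and the stem lookup is only a comment), the regex engine can already do the parsing part:

using System;
using System.Text.RegularExpressions;

class LeanLookupSketch
{
    // Use the .NET regex engine as a stand-in for an FST: peel known suffixes
    // off the end of the word, leaving the stem to be looked up in the database.
    static void Main()
    {
        var pattern = new Regex("^(?<stem>.+?)(?<sfx>uš|qan|liq|imiz|din)*$");

        var match = pattern.Match("tonušqanliqimizdin");
        Console.WriteLine("stem: " + match.Groups["stem"].Value);    // ton
        foreach (Capture c in match.Groups["sfx"].Captures)
            Console.WriteLine("suffix: " + c.Value);                 // uš, qan, liq, imiz, din

        // In the lean design, only the stem is now looked up in the database:
        // a search over s entries instead of s*i inflected forms.
    }
}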

Tuesday, October 29, 2013

The genome of a narrative

One of my hobbies is political forecasting. It's an interesting pursuit with many fascinating challenges, and one of them is the challenge of getting good information from unreliable sources.

The internet can be seen as a vast set of assertions of varying validity, produced and consumed by the "mindspace" of the networked world. Some forecasters try to get good information by aggregating many different assertions, on the principle that the process of aggregation will reduce the influence of errors. That is a good way to reduce noise, but it doesn't help when there are widespread misconceptions.


People don't like to change their minds, so very often the first idea they take up is the one they will stick with in the long run. That means that ideas which travel quickly can occupy territory in the mindspace ahead of ideas that travel more slowly. An idea travels quickly if it is easily passed on, so all it needs is to be simple, easy to explain, and make sense. Slower, more complex ideas lose the race.


I have also found that ideas carrying a strong emotional payload can effectively defend their territory in the mindspace against competitors. For example, Stars and Stripes recently published an article about a false story alleging that Obama wants to emasculate the US Marines by asking them to wear female covers. In this case, the falsehood triggers stronger emotions than the truth, so it gains and holds ground.


The end result is that the viability of an idea on the internet is not necessarily correlated with its truth, and a false idea may easily replicate enough to influence the results of aggregation.


To address this, I have adopted an approach that is similar to the narrative analysis used in the study of Folkloristics.  I try to identify the main narratives relating to a subject and trace the genealogy of each back to its original source (if possible). Then I attempt to explain why the original source released the narrative into the wild.


(The original version of this post had an example of a narrative here, but I took it out because it made the post too long.  Now I wish I had it back, because it was interesting.)

I am interested in the question of how (or whether) computational linguistics and other tools can be used to trace the genealogy of narratives on the internet. Among other things, I imagine this could lead to identifying large currents of thought--channels by which ideas spread from a small number of sources to a large audience.

Tuesday, October 1, 2013

I guess this is Voynich month

I hate to get stuck on things, but I feel like I have to exhaust this idea that the Voynich gallows letters could be stressed vowels.  I'll call this the GAV (gallows-as-vowels) hypothesis.

I was in a hurry with my last few posts, and I'm not someone who is extremely familiar with the VM, so I overlooked something that should really be incorporated into the proposal.  First, the gallows letters have an alternate form called "platform gallows" where they appear in ligature with another grapheme represented in EVA as <ch>.  Since it is not uncommon in Latinate scripts to form ligatures of vowels (like œ, æ) and vowels are by far the most common carriers of suprasegmental markers (like á, ä, ā, â) it seems reasonable to count <ch> and the apparent variant <sh> with the vowels somehow.

So the GAV thesis is this:
The [main] underlying language of the VM is a language with a bias towards stress on the first syllable. The words of the text are abbreviated by writing primarily the consonants, and excluding most unstressed vowels. The vowels that are retained are represented by gallows letters, by <ch> and <sh>, and by the ligatures of <ch> with the gallows letters.
This system of abbreviation would introduce a certain amount of collision, where the abbreviated forms of different words would be written the same.  However, many ambiguous forms could be understood from context, and there could also be a mechanism to avoid collisions for frequent or important words.

Interestingly, if this is really how it works, it suggests that the alphabet was deliberately designed to mislead the uninitiated by making the vowels look like consonants by giving them large, imposing shapes, while some consonants were given shapes more like vowels.

So, where do we go from here?  I think we go to syllable structure.

It has long been known that Voynich words adhere to some kind of structure.  Under the GAV thesis, much of this structure will correspond directly to the structure of the stressed syllables. If we look at the multiliteral clusters that appear before the *vowel, there is a set of common permissible clusters in <q, qo, qol, o, ol, l>, together with a set of largely impermissible clusters <*oq, *oql, *lq, *ql, *lo, *loq>.

If the language were English, the permissible clusters could represent (for example) [s, st, str, t, tr, r], while the impermissible ones could represent [*ts, *tsr, *rs, *sr, *rt, *rts].  Other solutions are possible, both in English and in other languages.

So, which languages should we look at? The manuscript is supposed to have been sent to Athanasius Kircher from Prague, so we would probably start there.  The languages of Bohemia included Bohemian (Czech), which stresses the first syllable of words; and German, which tends to stress the first syllable.  Hungary is not far away, but while Hungarian stresses the first syllable, it doesn't tend to have as many word-initial consonant clusters as would be required under the GAV theory.

Lastly, we should probably throw in English just because Johannes Marci told Kircher that Ferdinand III's Bohemian tutor thought the book had come from Roger Bacon.

So, given that <ol> is a common word, as well as being a member of the permissible consonant cluster series above, we should probably assume it is unstressed and written without its vowel.  Among these languages, where can we find a permissible consonant cluster series like <q, qo, qol, o, ol, l> where the term <ol> in the series could plausibly be a common word?

Tuesday, September 24, 2013

What would it mean if the Voynich gallows letters are vowels?

In my last post, I made a proposal about vowels, and inadvertently revealed that I am not a real Voynichologist, because I referred to [p,f,t,k] as "tall, loopy letters" instead of using their conventional name: gallows letters.

First, consider the distribution of the gallows letters within words.  The following graph shows the relative frequency of gallows letters compared to other letters in the first ten letters of Voynich words:


Gallows letters are heavily represented at the beginnings of words, and scarcely represented later in words.  If they represent vowels, then what does this imply?  Perhaps the language has syllable stress, and the stress normally falls on the first syllable of a word, and only the stressed vowel is written.
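
For anyone who wants to reproduce the graph, the underlying count is simple. This sketch assumes a plain EVA transcription file (the file name is a placeholder) with words separated by '.' or whitespace, and treats EVA p, f, t and k as the gallows letters:

using System;
using System.Collections.Generic;
using System.IO;

class GallowsPositions
{
    // Relative frequency of EVA gallows letters (p, f, t, k) at each of the
    // first ten letter positions of Voynich words.
    static void Main()
    {
        var gallows = new HashSet<char> { 'p', 'f', 't', 'k' };
        var words = File.ReadAllText("voynich_eva.txt")
                        .Split(new[] { '.', ' ', '\n', '\r', '\t' },
                               StringSplitOptions.RemoveEmptyEntries);

        var gallowsCount = new int[10];
        var totalCount = new int[10];

        foreach (var w in words)
            for (int pos = 0; pos < Math.Min(10, w.Length); ++pos)
            {
                ++totalCount[pos];
                if (gallows.Contains(w[pos])) ++gallowsCount[pos];
            }

        for (int pos = 0; pos < 10; ++pos)
        {
            double share = totalCount[pos] == 0 ? 0 : (double)gallowsCount[pos] / totalCount[pos];
            Console.WriteLine($"position {pos + 1}: {share:P1}");
        }
    }
}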

That could not be the whole picture, however.  In English, a language with a bias towards stressing the first syllable, the relative frequency of the vowels compared to other phonemes in the first ten phonemes looks like this (data from the Carnegie Mellon pronunciation database):


As you can see, stressed vowels are slightly over parity with other phonemes as the second phoneme of a word in English.  To explain the gallows letters as stressed vowels, we would additionally have to assume that some particular stressed vowel (or vowels) is implied in the first syllable when no vowel is written, and this would have to apply to about 4/5 of the words that are stressed on the first syllable.

Monday, September 23, 2013

I couldn't resist: The Voynich Vowels

OK, I didn't mean to get started down this route because I'm a little overworked, but I ran my contextual distance analysis on the Voynich graphemes (in EVA transcription).  Based on that, here is my best guess at the Voynich vowels.

The following image shows a map of the similarity of the graphemes in the Voynich Manuscript.  Like the King James Genesis, we have a small island of five graphemes which occur with the word boundary (#).  I will call these the *vowels (note the asterisk, which is my caveat that this is all hypothetical).


The *vowels, in EVA transcription, are the letters z, t, k, f and p.  The Voynich graphemes represented here are:


There is an obvious graphemic similarity here:  The tall, loopy letters are all *vowels, as is the z, which might be called a small, loopy letter.

Interestingly, the closest contextual similarities here are between the graphemes that look the most similar:

sim(t, k) = 0.88
sim(f, p) = 0.87

By contrast, the strongest similarity between any two vowels in Latin is sim(e, i) = 0.77, while in English it is sim(a, o) = 0.59.

If the *vowels are indeed the Voynich vowels, then I propose that we are looking at a script with only three vowels, and the EVA [t,k] and [f,p] are merely graphemic variants of each other.

Is it possible that the underlying orthography is based on something like the way Hebrew is used to write Yiddish, using the [א,ו,י] to represent vowels?

In fact, I'll go a step further, just so I can claim credit if this turns out to be true:  Given my earlier feeling that this is a polyglot or macaronic text, is it possible that the underlying text is in Hebrew and Yiddish (or another language of the Jewish diaspora, like Ladino)?

How would you use semantic similarities for decipherment?

A couple weeks ago, I started taking an online course in cryptography taught by Dan Boneh from Stanford on Coursera.  (It's excellent, by the way.)  I've been a little slow to post anything new, but I wanted to incompletely finish a thought.

Why should we care about mapping semantic relationships in a text?  Well, I think it could be a valuable tool for deciphering an unknown language.

First, what I've been calling "semantic similarity" is really contextual similarity, but it happens to coincide with semantic similarity in content words (content-bearing nouns, verbs, adjectives, adverbs). In function words, it's really functional similarity.  In phonemes (which I don't think I have mentioned before) it correlates with phonemic similarity.

Here's an example of a map generated from the alphabetic letters in the book of Genesis in King James English:


The most obvious feature of this map is that all of the vowels (and the word-boundary #) are off on their own, separated from all of the consonants.  If we had an alphabetic text in an unknown script in which vowels and consonants were written, we could probably identify the vowels.

There are finer distinctions that I may get into one day when I have more time, but the main point here is that contextual similarities of this type measure something valuable from a decipherment perspective. I believe it should be possible to get the general shape of the phonology of an alphabetic writing system, to separate out prepositions or postpositions (if they exist), and to eventually distinguish parts of speech.

Friday, September 13, 2013

Visualizing the semantic relationships within a text

In the last few posts, I had some images that were supposed to give a rough idea of the semantic relationships within a text.  Over the last few days I have been rewriting my lexical analysis tools in C# (the old ones were in C++ and Tcl), and that has given me a chance to play with some new ways of looking at the data.

When I depict this data in two dimensions, I'm really showing something like the shadow of a multi-dimensional object, and I have to catch that object at the right angle to get a meaningful shadow.  In the following images, I have imagined the tokens in the text as automata that are bound by certain rules, but otherwise behaving randomly, in an attempt to produce an organic image of the data.

For example, each point of this image represents a word in the lexicon of the King James version of the book of Genesis.  The words all started at random positions in space, but were made to iteratively organize themselves so that their physical distances from other words roughly reflected the semantic distance between them.  After running for an hour, the process produced the following image:


There are clearly some groups forming within the universe of words.  Since this process is iterative, the longer it runs the more ordered the data should become (up to a point)...but I haven't had the patience to run it for more than an hour.
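
The organizing step itself is nothing fancy. Something like the following sketch is all that is needed (the step size and the similarity-to-distance mapping are arbitrary choices, and the similarity matrix is assumed to be precomputed):

using System;

class SemanticLayoutSketch
{
    // One iteration scheme for the self-organizing layout described above:
    // every pair of words is pulled toward (or pushed away from) a target
    // distance derived from their similarity. 'sim' is a symmetric matrix
    // with values in [0, 1].
    static double[][] Layout(double[,] sim, int iterations, Random rng)
    {
        int n = sim.GetLength(0);
        var pos = new double[n][];
        for (int i = 0; i < n; ++i)
            pos[i] = new[] { rng.NextDouble(), rng.NextDouble() };

        for (int iter = 0; iter < iterations; ++iter)
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j)
                {
                    if (i == j) continue;
                    double dx = pos[j][0] - pos[i][0];
                    double dy = pos[j][1] - pos[i][1];
                    double dist = Math.Sqrt(dx * dx + dy * dy) + 1e-9;
                    double target = 1.0 - sim[i, j];        // similar words want to be close
                    double step = 0.01 * (dist - target);   // positive step moves i toward j
                    pos[i][0] += step * dx / dist;
                    pos[i][1] += step * dy / dist;
                }

        return pos;
    }
}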

The same process applied to the Voynich manuscript produces something smoother, as you can see below.  However, the VM has a much broader vocabulary, so I estimate I would need about nine hours of processing to arrive at the same degree of evolution as I have in the image above.


Reflecting on that, I thought I would try a new algorithm where words are attracted to each other in proportion to their similarity.  The result was not interesting enough to show here, since the words rapidly collapsed into a small number of points.  I am certain that this data would be interesting to analyze, but it is not interesting to look at.

Thinking about the fractal nature of this collapse, I thought I would use similarity data to model something like the growth of a plant.  In the following image, I have taken the lexicon of the King James Genesis again and used it to grow a vine.  I started with one arbitrarily chosen token, then added the others by joining them to whatever existing point was semantically nearest.  Each time, I tried up to ten times to find a random angle of growth that did not overlap another existing line, before finally allowing one line to overlap others.

I am quite pleased with the result here, since it shows what I have always intuitively felt about the data--that there are semantic clusters of words in the lexicon.


The same approach applied to the Voynich manuscript produces a denser image, due to the greater breadth of the vocabulary:


But how does this compare to random data?  To answer this question, I processed a random permutation of the Voynich manuscript, so the text would have the same word frequency and lexicon as the Voynich manuscript, but any semantic context would be destroyed.  Here is the result:


Intuitively, I feel that the random data is spread more evenly than the Voynich data, but to get beyond intuition I need a metric I can use to measure it.

Good night.

Monday, September 9, 2013

Problem with the problem with the Voynich Manuscript

There was a problem with the algorithm that produced the images I posted last night.  (These things happen when you are tired.)  The images for the English, Vulgar Latin and Wampanoag texts were basically correct, but the image for the Voynich Manuscript had a problem.

Part of the algorithm that produces the images is supposed to sort the lexicon in a way that will group similar words together, creating the islands and bands of brightness that the images show.  However, the sorting algorithm did not adequately handle the case where there were subsets of the lexicon with absolutely no similarity to each other.  As a result, the Voynich image only shows one corner of the full lexicon (about 10%).

The full image is much darker, like a starry sky, implying a text that is less meaningful.  That lends a little weight to the theory that the text may be (as Gordon Rugg has suggested) a meaningless hoax. However, I thought I should also test the possibility that it could be polyglot, since my samples so far have all been monoglot.  With a polyglot text, my sorting algorithm would be hard pressed to group similar words together, because the lexicon would contain subsets of unrelated words.

After fixing my sorting algorithm, I produced a similarity map from the polyglot medieval collection Carmina Burana, then compared it to a similar-sized monoglot English text, and a similar-sized piece of the Voynich Manuscript.  Here are the results.

First, the English text:


And now, the Carmina Burana.  The image is actually much larger, because the lexicon is larger. Here, you can see the "starry sky" effect.


And last, the Voynich Manuscript.  Again, the "starry sky" effect.


Of the images I have produced so far, the Voynich Manuscript looks most like the Carmina Burana.  However, I have the following caveats:

  • Carmina Burana is much smaller than the VM.  To do a better comparison, I will need a medieval polyglot/macaronic text around 250 kb in size.  So far I have not found one.
  • I should attempt to produce the same type of text using Cardan grilles, to test Gordon Rugg's hypothesis.
Lastly, I want to add a note about the narrow circumstances under which I think Gordon Rugg may be right, but generally why I don't think his theory will turn out to be correct.

I don't think the VM was generated using Cardan grilles because the work required to generate a document the size of the VM would be significant, and there would be an easier way to do it.  It would be much easier to invent a secret alphabet and simply babble along in a vernacular language. The Rugg hypothesis needs to address the question of why the extra effort would have been worth it to the con artist that generated the work.

The narrow circumstance under which this would make sense, to me, is if the VM were generated by someone using something like Cardan grilles as an oracular device, believing that he was thereby receiving a divine message.

However, if the VM is a hoax, I think we will one day decipher it, and we will find that it says something like "the king is a sucker. I can't wait till I'm done with this thing. I'm so sick of vellum...."

Good night.


Saturday, September 7, 2013

Meaning in the Voynich Manuscript

It turns out there is an established algorithm for measuring semantic distance between words in a text that is somewhat similar to the algorithm I have developed.  It is called Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI), and there is a paper about it on the University of Ottawa website.

However, my algorithm is a little bit simpler than SOC-PMI, and it treats context words differently based on whether they occur before or after the subject word.  I like mine better.

Using my algorithm (which I may someday put online, if I ever have time to tidy up the code), I have generated a series of images.  These images show the semantic relationship between words in the Voynich manuscript and several other texts that are approximately the size of the Voynich manuscript.  In these images, each point along the x and y axes represents a word in the lexicon of the text, and the brightness at any point (x, y) represents the semantic similarity between the words x and y.  There is a bright line at x = y because words are completely similar to themselves, and clusters of brightness represent clusters of words with common meanings.

First, here is a control text created from 2000 words arranged in random order.  The text is meaningless, so the image is basically blank except the line x = y.


The following image is generated from a text in Vulgar Latin.  Because it is an inflected language, and my algorithm doesn't recognize inflected forms as belonging to the same word, there are many small islands of similarity.


The next image is taken from an English text.  Due to the more analytic nature of English, the islands of similarity are larger.


The next image is taken from a text in Wampanoag.  I expected this text to be more like the Latin text, but I think the source text is actually smaller in terms of the number of words (though similar in terms of the number of graphemes).


Now, for the cool one.  This is the Voynich Manuscript.


I wonder if this shows a text written in two languages.  The light gray square in the upper left corner, together with the two gray bands in the same rows and columns, represents a subset of words that exist in their own context, a context not shared with the rest of the text.  One possible explanation would be that the "common" language of the text contains within it segments of text in another "minority" language.

About 25% of the lexicon of the text belongs to the minority language, and the remaining 75% belongs to the common language.  The minority language doesn't contain any significant islands of brightness, so it may be noise of some kind--either something like cryptographic nulls, or perhaps something like litanies of names.  If I have time, I'll try to split the manuscript into two separate texts, one in each language, and see what the analysis looks like at that point.

Good night.

A Problem with the Voynich Manuscript

ere yamji bi ilan hūntahan sain nure be omire jakade... ("This evening, when I drank three cups of good wine...")

There is a problem with the Voynich manuscript, and the problem is on the level of the apparent words, not the graphemes.

I mentioned in a post a while back that I have an algorithm that roughly measures the contextual similarity between words in a large text.  Over the years I have gathered samples of the text of the book of Genesis because it is approximately the same size as the Voynich manuscript, and it has been translated into many languages.

Since I have recently been working on another (unrelated) cipher, I have developed a set of tools for measuring solutions, and I thought I could use them in combination with my semantic tools to look at the VM.

What I am finding right now is interesting.  I'm pulling the top 14 words from sample texts and creating a 14x14 grid of similarities between them.  The most frequent words tend to be more functional and less content-bearing, so I am really looking for groups like prepositions, pronouns, and so on.

In highly inflected languages, the top 14 words are not especially similar to each other.  The reason for this is that prepositions in inflected languages tend to go with different cases, and inflected nouns are treated as entirely different words by my algorithm.  So, for example, in Latin all of the top 14 words have a similarity score of lower than 0.2.  In Wampanoag, the majority are under 0.1.  In less inflected languages, like Middle English, Early Modern English, Italian and Czech, scores more commonly fall in the range 0.2-0.3, with a few in the 0.3-0.4 range.

The scores from the Voynich manuscript, however, are disconcertingly high.  All of the top 14 words are very similar to each other, falling in the 0.3-0.5 range.  This suggests that the most common words in the Voynich manuscript are all used in roughly similar contexts.  They average just a little under the numbers that I get for a completely random text.
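
For anyone who wants to try the same shape of experiment, here is a rough stand-in for the measurement (it is not my algorithm, which treats context before and after the word differently; the file name and window size are placeholders, and cosine similarity of plain co-occurrence vectors is used in place of my scoring):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class TopWordSimilarityGrid
{
    // Take the 14 most frequent words, give each a context vector (counts of the
    // words seen within a small window around it), and print the 14x14 grid of
    // cosine similarities.
    static void Main()
    {
        string[] tokens = File.ReadAllText("sample.txt")
            .ToLowerInvariant()
            .Split(new[] { ' ', '\n', '\r', '\t', '.', ',', ';', ':' },
                   StringSplitOptions.RemoveEmptyEntries);

        string[] top = tokens.GroupBy(t => t)
                             .OrderByDescending(g => g.Count())
                             .Take(14)
                             .Select(g => g.Key)
                             .ToArray();

        var vectors = top.ToDictionary(w => w, w => new Dictionary<string, double>());
        const int window = 2;
        for (int i = 0; i < tokens.Length; ++i)
        {
            if (!vectors.ContainsKey(tokens[i])) continue;
            for (int k = Math.Max(0, i - window); k <= Math.Min(tokens.Length - 1, i + window); ++k)
            {
                if (k == i) continue;
                var ctx = vectors[tokens[i]];
                ctx.TryGetValue(tokens[k], out double seen);
                ctx[tokens[k]] = seen + 1;
            }
        }

        foreach (var a in top)
        {
            foreach (var b in top) Console.Write($"{Cosine(vectors[a], vectors[b]):F2} ");
            Console.WriteLine();
        }
    }

    static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = a.Where(kv => b.ContainsKey(kv.Key)).Sum(kv => kv.Value * b[kv.Key]);
        double na = Math.Sqrt(a.Values.Sum(v => v * v));
        double nb = Math.Sqrt(b.Values.Sum(v => v * v));
        return na == 0 || nb == 0 ? 0 : dot / (na * nb);
    }
}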

Something is going on here that makes the words look as though they are in random order.

Thursday, September 5, 2013

Carcharias (a cipher)

Over the last few years, I've been studying modern cipher algorithms.  I've written implementations of a number of cipher algorithms as a way of understanding how they work, and I'm impressed by their economy.

But I also wonder whether there might be advantages to ciphers with huge keys.  In this post, I'll describe a big, ugly cipher I'll call Carcharias, which uses massive keys and lots of memory. Since Carcharias is a big fish, I'll talk in terms of bytes instead of bits.

The Key

Carcharias is a Feistel cipher with a block size of 512 bytes and a key size of 16,777,216 bytes.  The key is huge, right?  But you could store hundreds of them on a modern thumb drive without any problem, and it will fit in a modern processor's memory quite easily.  As with any other cipher, the ideal key is as random as possible, but the question of how to generate a good random key is outside the scope of this post.

The Round Function

Since Carcharias is a Feistel cipher, it processes the block in two halves (256 bytes) using a round function.  In the following pseudocode for the round function, the key is treated as a three-dimensional array of bytes (256 x 256 x 256), and the subkey, input and output are treated as arrays of bytes.  This isn't the most efficient implementation...it's just meant to be illustrative.

for (i = 0; i < 0x100; ++i) {
  output[i] = 0;
  for (j = 0; j < 0x100; ++j) {
    output[i] ^= key[i][j][input[j] ^ subkey[j]];
  }
}

Mathematically, it doesn't get much simpler than this.  You've got a couple XOR operations and some pointer math, but every bit of the output is dependent on every bit of the input.  You throw that into a Feistel network and you have pretty good confusion and diffusion.
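
To make that structure concrete, here is a sketch of how the round function drops into a Feistel network. The round count and the way subkeys are pulled out of the big key are not things I have settled, so both choices below are placeholders; decryption would run the same rounds in reverse order.

using System;

class CarchariasSketch
{
    static byte[] Encrypt(byte[] block, byte[,,] key, int rounds)
    {
        var left = new byte[256];
        var right = new byte[256];
        Array.Copy(block, 0, left, 0, 256);
        Array.Copy(block, 256, right, 0, 256);

        for (int round = 0; round < rounds; ++round)
        {
            // Assumed subkey schedule: one 256-byte slice of the key per round.
            var subkey = new byte[256];
            for (int j = 0; j < 256; ++j) subkey[j] = key[round % 256, j, 0];

            var f = RoundFunction(right, subkey, key);
            for (int i = 0; i < 256; ++i) f[i] ^= left[i];

            left = right;                  // standard Feistel swap
            right = f;
        }

        var output = new byte[512];
        Array.Copy(left, 0, output, 0, 256);
        Array.Copy(right, 0, output, 256, 256);
        return output;
    }

    // The round function from the pseudocode above, with the key as a
    // multidimensional array.
    static byte[] RoundFunction(byte[] input, byte[] subkey, byte[,,] key)
    {
        var output = new byte[256];
        for (int i = 0; i < 256; ++i)
            for (int j = 0; j < 256; ++j)
                output[i] ^= key[i, j, input[j] ^ subkey[j]];
        return output;
    }
}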

Incidentally, in a few years I think processors will probably carry enough memory that you could implement a 4 GB key, treated as a four-dimensional array, in which case Carcharias could be replaced by Megalodon, using this line instead:

output[i] ^= key[i][j][input[j]][subkey[j]];

It's big. It's ugly. It's brutish.

Advantages

Provided the key is very random and the key transmission secure, I think Carcharias and Megalodon can only be attacked by brute force.  The brute force attack must use a large amount of memory, which might make it harder to farm out to a bunch of processors running in parallel.

If I have time, I'll write an implementation of Carcharias and post it on github.

Note (12/2/2013): This has some potential weaknesses if the key (which is essentially a huge S-box) is too close to a linear function. Also, it's really overkill, and there is no need for something like this in the world :)

Tuesday, September 3, 2013

Theory about pre-human language

I have a theory about the course of the evolution of human language that is too long to put into a blog post.  I have given it its own page, called Pre-Human Language.

Sunday, September 1, 2013

The Ants and the Cricket

I took a shot at translating one of Aesop's fables into the Grande Ronde dialect of Chinuk Wawa, as documented in Henry Zenk's excellent dictionary.  Here it is. (If any Wawa speakers happen to come across this, I am happy to hear feedback and corrections).

The original was "The Ants and the Grasshopper".  I made it "The Ants and the Cricket", since a cricket is called a "useless bird" (kʰəltəs-kələkələ), and the cricket sings, while the grasshopper doesn't.

t’siqʰwaʔ pi kʰəltəs-kələkələ
The ants and the cricket

kʰul-iliʔi ixt san t’siqʰwaʔ ɬas-munk-tlay məkʰmək ikta wam-iliʔi ɬas-munk-iskam.
One day in the winter, the ants were drying food which they had gathered in the summer.

kʰəltəs-kələkələ ya-chaku miməlust-ulu.
Cricket became extremely hungry.

ya-ɬatwa kʰapa t’siqʰwaʔ ɬas-haws, ya-aləksh məkʰmək.
He went to the ants' house, and he begged for food.

uk t’siqʰwaʔ ɬas-wawa, "pus-ikta wam-iliʔi wik ma-munk-iskam məkʰmək?"
The ants said to him, "Why did you not gather food in the summer?"

kʰəltəs-kələkələ ya-wawa, "wik na-t’uʔan lili pus na-munk-iskam məkʰmək.
The cricket said, "I didn't have time to gather food.

kʰanawi wam-iliʔi na-shati shati."
The whole summer long I sang and sang."

ɬas-munk-hihi-yaka, wawa, "pus kʰanawi wam-iliʔi ma-hihi pi shati, aɬqi kʰanawi kʰul-iliʔi ma-tanis pi ulu."
They laughed at him, and said, "If you laugh and sing all summer, then you dance and hunger all winter."

Friday, August 30, 2013

New York Spring Morning

A gilt and rusted nest of steel,
stone and brick and concrete,
haunted by its own electric soul,
awaits the tangled light of early spring.

Throughout the restless night,
heavy boats have cut across the bay
leaving scars upon its cold
and gray and gleaming skin.

But every sin committed
on the landscape by this city
suddenly appears to be forgiven
with the rising of the sun.

“Live again, and work.”
A million clocks have sprung.
The ragged threads of countless dreams
that clung to countless minds
are brushed away.

In unison they rise,
they wash,
they work.

The moon alone is left to take its subtle time,
to wander narrow strips of sky,
to climb the steel lattices and gaze away
across the moving clamor of the day.

Monday, August 19, 2013

Recursive call optimization

In a previous post, I wrote the map function in F#, compiled it, and disassembled it, and expressed my disappointment that it was recursive.  What I had hoped for was something like tail call optimization.

Often, in my imperative programming career, I have had to choose between implementing a process in a recursive way or a looping way.  In one language that I work with, the number of nested blocks allowed is limited, so I must be very careful about how I build recursive solutions.  I usually view the loop implementation as inferior because it lacks the beauty of the recursive one, but usually it is more efficient.  Since functional programming relies extensively on recursion, however, it is often optimal to turn recursive function calls into loops at compile time.

Functions require memory to process calculations.  For convenience, I'll refer to the units of memory used as variables, even if (strictly speaking) they are not variable.  The effective scope of a variable begins when it is assigned a value, and it ends at the last point that the value is referenced.  Since some control branches of a function may not reference a variable, the scope may cover parts of the expression tree, rather than a linear range of lines of code.
If no variables are in scope at the time of a recursive call, then the recursive call can be implemented as a pair of loops.  The first loop of the pair represents the descent into the recursive call, while the second loop represents the return.
Since most variables are out of scope by the end of a function, a recursive call at the end of the function can usually be implemented as a single loop (since nothing is done on return), which is called tail call optimization.  However, the general rule allows recursive calls to be implemented as a pair of loops anywhere, provided the recursive call falls outside the scope of all other variables.

Here is a contrived example of a recursive C# method that can be optimized in this way:

public int TestFunction(int n) {
  // Variable r is defined here, but not in scope yet,
  // because its initial value will not be used
  int r;

  // Start of scope for t
  int t = n * n + n;

  // Start of scope for j
  int j = 2 * t - 13;

  // Recursion should stop somewhere
  if (t < 100000 && j < 100000) {
    // Scope of t and j ends here on this execution branch.
    // Scope of r starts here
    r = TestFunction(t + j) - 7;
  } else {
    // Scope of t and j ends here on this execution branch.
    // Scope of r starts here
    r = t - j;
  }

  // Scope of r ends here
  // Scope of k starts here
  int k = 3 * r;

  // Scope of k ends here
  return k;
}

The optimized implementation of this recursive function is as follows:

public int TestFunction(int ntop) {
  int n;
  int t;
  int j;
  int r;
  int k;
  int loopCount = 0;

  // Descent loop
  n = ntop;
  while(true) {
    t = n * n + n;
    j = 2 * t - 13;
    if (t < 100000 && j < 100000) {
      n = t + j;
      ++loopCount;
    } else {
      break;
    }
  }

  // Bottom of call processing
  r = t - j;
  k = 3 * r;

  // Return loop
  while (loopCount > 0) {
    r = k;
    r -= 7;
    k = 3 * r;
    --loopCount;
  }

  return k;
}