Tuesday, October 29, 2013

The genome of a narrative

One of my hobbies is political forecasting. It's an interesting pursuit with many fascinating challenges, and one of them is the challenge of getting good information from unreliable sources.

The internet can be seen as a vast set of assertions of varying validity, produced and consumed by the "mindspace" of the networked world. Some forecasters try to get good information by aggregating many different assertions, on the principle that the process of aggregation will reduce the influence of errors. That is a good way to reduce noise, but it doesn't help when there are widespread misconceptions.
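
The difference between noise and a widespread misconception can be shown with a toy simulation. This is only a sketch under assumptions of my own: each source reports the true value plus independent Gaussian noise, and a misconception is modeled as one shared offset added to every report. The function name, the parameter names, and all the numbers are made up for illustration.

```python
import random
from statistics import mean

def aggregate(true_value, n_sources, noise_sd, shared_bias, seed=0):
    """Average n_sources reports of true_value.

    Each report carries independent noise (assumed Gaussian) plus a
    shared_bias common to all sources (the assumed 'misconception').
    """
    rng = random.Random(seed)
    reports = [
        true_value + shared_bias + rng.gauss(0, noise_sd)
        for _ in range(n_sources)
    ]
    return mean(reports)

TRUE_VALUE = 50.0

# Independent errors only: the average lands close to the true value.
noisy_only = aggregate(TRUE_VALUE, 10_000, noise_sd=10.0, shared_bias=0.0)

# Same noise plus a shared error of +5: the average stays about 5 too high,
# no matter how many sources are aggregated.
misconceived = aggregate(TRUE_VALUE, 10_000, noise_sd=10.0, shared_bias=5.0)

print(round(noisy_only, 1), round(misconceived, 1))
```

Adding more sources shrinks the independent noise, but the shared offset passes through the average untouched.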

People don't like to change their minds, so very often the first idea they take up is the one they will stick with in the long run. That means that ideas which travel quickly can occupy territory in the mindspace ahead of ideas that travel more slowly. An idea travels quickly if it is easily passed on, so all it needs is to be simple, to be easy to explain, and to make sense. Slower, more complex ideas lose the race.

I have also found that ideas carrying a strong emotional payload can effectively defend their territory in the mindspace against competitors. For example, Stars and Stripes recently published an article about a false story alleging that Obama wants to emasculate the US Marines by asking them to wear female covers. In this case, the falsehood triggers stronger emotions than the truth, so it gains and holds ground.

The end result is that the viability of an idea on the internet is not necessarily correlated with its truth, and a false idea may easily replicate enough to influence the results of aggregation.

To address this, I have adopted an approach similar to the narrative analysis used in folkloristics. I try to identify the main narratives relating to a subject and trace the genealogy of each back to its original source (if possible). Then I attempt to explain why the original source released the narrative into the wild.

(The original version of this post had an example of a narrative here, but I took it out because it made the post too long.  Now I wish I had it back, because it was interesting.)

I am interested in the question of how (or whether) computational linguistics and other tools can be used to trace the genealogy of narratives on the internet. Among other things, I imagine this could lead to identifying large currents of thought--channels by which ideas spread from a small number of sources to a large audience.

Tuesday, October 1, 2013

I guess this is Voynich month

I hate to get stuck on things, but I feel like I have to exhaust this idea that the Voynich gallows letters could be stressed vowels.  I'll call this the GAV (gallows-as-vowels) hypothesis.

I was in a hurry with my last few posts, and I'm not someone who is extremely familiar with the VM, so I overlooked something that should really be incorporated into the proposal. The gallows letters have an alternate form called "platform gallows," where they appear in ligature with another grapheme represented in EVA as <ch>. Since it is not uncommon in Latin-based scripts to form ligatures of vowels (like œ, æ), and vowels are by far the most common carriers of suprasegmental markers (like á, ä, ā, â), it seems reasonable to count <ch> and the apparent variant <sh> with the vowels somehow.

So the GAV thesis is this:
The [main] underlying language of the VM is a language with a bias towards stress on the first syllable. The words of the text are abbreviated by writing primarily the consonants, and excluding most unstressed vowels. The vowels that are retained are represented by gallows letters, by <ch> and <sh>, and by the ligatures of <ch> with the gallows letters.
This system of abbreviation would introduce a certain amount of collision, where the abbreviated forms of different words would be written the same.  However, many ambiguous forms could be understood from context, and there could also be a mechanism to avoid collisions for frequent or important words.
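
The collision problem is easy to see with a toy abbreviator. This is only a sketch with loud assumptions: it uses English words in ordinary Latin letters rather than Voynich glyphs, it treats the first vowel letter of a word as the stressed vowel, and it counts only a, e, i, o, u as vowels. The function names and the word list are invented for illustration.

```python
from collections import defaultdict

VOWELS = set("aeiou")  # simplification: 'y' is treated as a consonant

def abbreviate(word):
    """Keep every consonant and only the first vowel.

    The first vowel stands in for the stressed vowel, assuming
    stress falls on the first syllable.
    """
    out = []
    seen_vowel = False
    for ch in word.lower():
        if ch in VOWELS:
            if not seen_vowel:
                out.append(ch)
                seen_vowel = True
        else:
            out.append(ch)
    return "".join(out)

def collisions(words):
    """Group words by abbreviated form and return groups of two or more."""
    groups = defaultdict(set)
    for w in words:
        groups[abbreviate(w)].add(w)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

sample = ["baron", "barn", "solid", "sold", "garden", "water"]
result = collisions(sample)
print(result)  # {'barn': ['barn', 'baron'], 'sold': ['sold', 'solid']}
```

In this sample, "baron" and "barn" collapse to the same written form, as do "solid" and "sold", while "garden" and "water" stay distinct. Context would usually be enough to tell such pairs apart.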

Interestingly, if this is really how it works, it suggests that the alphabet was deliberately designed to mislead the uninitiated: the vowels were given large, imposing shapes that make them look like consonants, while some consonants were given shapes more like vowels.

So, where do we go from here?  I think we go to syllable structure.

It has long been known that Voynich words adhere to some kind of structure.  Under the GAV thesis, much of this structure will correspond directly to the structure of the stressed syllables. If we look at the multiliteral clusters that appear before the *vowel, there is a set of common permissible clusters in <q, qo, qol, o, ol, l>, together with a set of largely impermissible clusters <*oq, *oql, *lq, *ql, *lo, *loq>.

If the language were English, the permissible clusters could represent (for example) [s, st, str, t, tr, r], while the impermissible ones could represent [*ts, *tsr, *rs, *sr, *rt, *rts].  Other solutions are possible, both in English and in other languages.
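
One way to read the two lists above is that the permissible clusters are exactly the contiguous pieces of the template <qol>, taken in order, while the impermissible ones are everything that reorders or skips a slot. The sketch below checks that reading and applies the example English mapping (q to s, o to t, l to r). The variable names are mine, and the mapping is only the illustration given above, not a proposed decipherment.

```python
TEMPLATE = "qol"
PERMITTED = ["q", "qo", "qol", "o", "ol", "l"]
FORBIDDEN = ["oq", "oql", "lq", "ql", "lo", "loq"]

# Illustrative mapping from the English example only.
TO_ENGLISH = str.maketrans({"q": "s", "o": "t", "l": "r"})

def contiguous_substrings(s):
    """Return every non-empty contiguous substring of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

allowed = contiguous_substrings(TEMPLATE)

# Every permitted cluster is a contiguous piece of the template ...
print(allowed == set(PERMITTED))                 # True
# ... and no forbidden cluster is.
print(any(c in allowed for c in FORBIDDEN))      # False

print([c.translate(TO_ENGLISH) for c in PERMITTED])
# ['s', 'st', 'str', 't', 'tr', 'r']
print([c.translate(TO_ENGLISH) for c in FORBIDDEN])
# ['ts', 'tsr', 'rs', 'sr', 'rt', 'rts']
```

Note that <ql> keeps the template order but skips the middle slot, and it is still listed as impermissible, just as "sr" is not a normal word-initial cluster in English.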

So, which languages should we look at? The manuscript is supposed to have been sent to Athanasius Kircher from Prague, so we would probably start there. The languages of Bohemia included Bohemian (Czech), which stresses the first syllable of words, and German, which tends to stress the first syllable. Hungary is not far away, but while Hungarian stresses the first syllable, it doesn't tend to have as many word-initial consonant clusters as would be required under the GAV theory.

Lastly, we should probably throw in English just because Johannes Marcus Marci told Kircher that Ferdinand III's Bohemian tutor thought the book had come from Roger Bacon.

So, given that <ol> is a common word, as well as being a member of the permissible consonant cluster series above, we should probably assume it is unstressed and written without its vowel.  Among these languages, where can we find a permissible consonant cluster series like <q, qo, qol, o, ol, l> where the term <ol> in the series could plausibly be a common word?