Friday, January 31, 2014

A possible candidate for the saha-islanders

In my last post I remarked on the word saha, apparently meaning "island" in the names of some islands just north of the mouth of the Tumen river.  As far as I can tell, this is not a word in a known language of the area.

In this post, I will propose two possible candidate ethnic groups for the speakers of that language. The following text is a Manchu-language description of groups of people called Fiyaka and Kiyakara who brought tribute to the Qing:
fiyaka, fiyaka sunggari ulai dergi ergi ten i bade bi, mederi tun i jakarame son son i tembi. nimaha butame gurgu buthašame banjimbi. hahasi hehesi gemu indahūn sukū be etuku arafi etumbi. juwari forgon de nimaha sukū i arambi. banin doksin becunure de amuran. tucire dosire de kemuni jeyengge agūra gaifi yabumbi. aniyadari seke jafambi.
kiyakara, kiyakara huncun mederi, jai fucin yose i jergi birai biturame son son i tembi. haha hehe gemu ferten de muheren etufi jurhun isire menggun teišun i araha niyalma be miyamigan obume tuhebumbi. hahasi buhū i sukū be mahala arambi. bosoi etuku etumbi. bethe niohušulembi. hehesi funiyehe be tuhebume, sifikū sifirakū, adasun de halai hacin i šeoleme wangnambi. boo ūlen jahūdai weihu be gemu alan i weilembi. ese asu baitalame bahanarakū. nimaha šakarame gurgušeme banjimbi. banitai heolen sula iktambume asaraha hacin akū. ceni ba i ici gisurere be kiyakaratambi sembi. aniyadari seke jafambi.
"The Fiyaka. The Fiyaka are in the high places on the east side of the Sunggari river. They are scattered along the island(s) of the sea. They make their living fishing and hunting. Men and women all make and wear clothing of dog skins. In the summer they make them from fish skins. By nature they are cruel and they love to fight. When they are out and about they walk carrying bladed spears. Every year they bring sable as tribute.
"The Kiyakara. The Kiyakara are scattered along the Hunchun Sea, and along such rivers as Fucin and Yose. The men and women all wear rings in their noses, and hang figurines as ornaments made of silver and copper as much as an inch long. The men make hats from deerskin, wear cloth, and go barefoot. The women let their hair down, and do not wear hairpins. They embroider their lapels with all kinds of different designs. They make every kind of house and boat from birch bark. They do not know how to use nets. They make their living by spearing fish and hunting. They are naturally lazy and idle, and do not customarily accumulate and set aside stores. The speech of their land is called kiyakaratambi. Every year they bring sable as tribute."
The Fiyaka and Kiyakara occupied similar territories, as far as I can tell. Both groups are now said to have been Tungusic, but I think the evidence for this is based largely on the fact that Tungusic speakers are known to have lived in the areas they inhabited. The same section of the tribute records also mentions Nanai (heje) and Udeghe (nadan hala), as well as non-Tungusic Ainu (guye or kuye) and Nivkh (kilen).

Thursday, January 30, 2014

In what language does saha mean "island"?

This evening, I have been poring over Danville's beautiful 18th century map of Korea and Manchuria, looking at the tangled area where Korean toponyms give way to Manchu/Jurchen (and other non-Korean) toponyms.

I started out hunting for two places called (in Manchu) ehe kuren and gūnaka kuren, which I expect to find on Danville's map with a spelling like *eghe couren, *counaca couren. I have not found them yet, but I have found something else that is interesting.

All along the Korean coast, small islands are given names ending in tao, which no doubt corresponds to Chinese 島, "island" (Mandarin dǎo).

Some Korean islands with names ending in "tao"

Along the Japanese coast, of course, we have small islands whose names end in sima, corresponding to the Japanese word shima, "island".

Some Japanese islands with names ending in "sima"

But at the mouth of the Tumen river, there is something I did not expect. Some of the islands have names ending in toun, which must be Manchu tun, "island", and some have names in loun, which I am certain is a scribal error for toun. But others have names in saba and saha, which I don't recognize.

A mix of islands with names ending in "toun/loun" and "saba/saha"

I expect saha is a scribal error for saba, because I see the same error in the name of one of the tributaries of the Tumen, called Cahari at one point and Cabari at another. It is more likely that the correct form is saba because intervocalic -h- is rare in these place names, being usually represented by -kh- or -gh-. But in what language does saba mean island?

Note 3 February 2014: It turns out I was wrong. The word is saha, as I note in a later post. I've corrected the remainder of this post.

The major language families of the area (broadly speaking) are Tungusic, Japonic, Korean, Nivkh, Ainu, Mongolic and Chinese. Of all of these, the only thing I have found so far that looks similar is Korean seom 섬, "island", which really looks closer to Japanese shima, as far as I know.

It looks like there is a somewhat large island called mama saha, and a smaller one called sarbatchou saha. Given similar island names throughout the world, it would not surprise me if these are "mother island" and "daughter island", so maybe that is a place to start.

In Manchu, sargan means "woman" or "wife", and sargan jui means "daughter". However, Tsintsius lists this as a Manchu morpheme only, not attested in other Tungusic languages, so it falls into what I call the "enigmatic vocabulary" of Manchu: common words with no known origin.

I have often felt that the enigmatic vocabulary of Manchu provides evidence of close contact between the ancestors of the Jurchens and speakers of an otherwise unattested language. Under this theory, words like sargan would have been loaned into Manchu from the other language, which then disappeared. If sarbatchou saha really does mean "daughter island", perhaps it comes from the same source language, or a close relative.

A close study of the place names in Danville's map may lead to more interesting discoveries.

Saturday, January 25, 2014

A code for comparison to the Voynich

In an earlier post, I proposed something I called the GAV hypothesis, suggesting that the Voynich manuscript could be written in a shorthand in which all vowels except the stressed vowel in a word are normally dropped, and the stressed vowels are represented by the gallows letters.

As it happens, when visiting a used book store about 15 years ago, I picked up a manual for a secret society that is written in a similar code. The manual, printed in 1895, is written in an abbreviated form of English, and it is clear that the authors hoped that it would serve as a reminder to those who already knew its contents, but be impenetrable to the uninitiated. The owner of the bookstore told me that an old man would often come in and tease him about the book, saying he would never understand it.

Some parts of the manual are very obscure, apparently containing more critical secrets than others. At the moment, though, I am less interested in the content of the manual than in the statistical properties of the code. This form of abbreviation serves to increase the entropy of the text, in a way that would be similar to compression.

Here is a sample of my 1895 text:
Ths wd is cmpsd % fo Hebrw chrcts, crspdng in our lngg t J, H, V, H, @ cn nt b prncd wthot +| aid % thr sds % +| trigl, tt bng an mblm % De. Th Syric, Chldc, @ Egyptn wds, tkn as ons, is thrfr cld +| G O ℞ ₳ W. Thr is no gp in this °.
Which reads as:
This word is composed of four Hebrew characters, corresponding in our language to J, H, V, H, and can not be pronounced without the aid of three sounds of the triglot, that being an emblem of the (De=Deity?). The Syriac, Chaldaic and Egyptian words, taken as (ons=?), is therefore called the G O R A W. There is no (gp=?) in this degree. [Or maybe: "There is no gap in this circle"?]
This is fairly clear, but just imagine how hard it would be to work with if it were masked by a simple substitution cipher. Consider also the following passage, which is nearly impenetrable, even in plain text:
H P- (Xplns +| d-g, pnl-§, grn hlg-§, @ +| §s gvn at +| vls.)
Perhaps to be read something like:
High Priest- (Explains the ..., ..., ..., and the ...s given at the ...)
If I have time, I'll try to do a statistical analysis of the text for comparison to the VM. But I am thinking there are a couple of things here that could be really instructive. First, it is possible that two people could abbreviate the same language in different ways, creating "dialects" in the text. Second, the language of the code could change depending on the importance of the content.
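
A first step in that kind of comparison could be as simple as measuring single-character entropy. The sketch below is only an illustration, not the analysis promised above: it compares the opening clause of the sample in its expanded and abbreviated forms, and the function name `char_entropy` is made up for this example.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character, from single-character frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Opening clause of the 1895 sample, expanded and abbreviated.
plain = "This word is composed of four Hebrew characters"
coded = "Ths wd is cmpsd % fo Hebrw chrcts"

for label, text in (("plain", plain), ("coded", coded)):
    print(f"{label}: {len(text)} chars, {char_entropy(text):.3f} bits/char")
```

A clause this short says very little on its own. A real comparison would need the whole manual, and probably entropy measured over pairs of characters as well.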

Sunday, January 19, 2014

Secret sign

I was walking down a steep mountain path in Sichuan with a local guide, paying rapt attention as he told me stories about the area. There were graves in the hillside, he said, and as a troublesome young man he once looked into them, and was terrified by the corpses. Another time he lost his favorite horse, who slipped on the path and fell to his death in a deep ravine. That complex in the valley was a prison, where he had spent some time.

Let's get together in the village later in the day, he said finally. But let's lose the Chinese guy. I don't trust him.

Indeed, I had been curious about the soft, overweight Chinese man in our party. He did not seem physically suited to a three-day horse ride, and he seemed to prefer reading stories on his cell phone to enjoying the dramatic scenery of the Sichuan mountains. Why was he there? My guide seemed to find it suspicious.

If anyone had been listening in, they would have been completely unaware of our conversation on the matter. This is because my guide was deaf, and we were communicating in Chinese Sign Language, of which I had managed to learn a fair amount over the prior three days.

In a previous post, I mentioned some qualities of a good secret language. Here, let me extol the virtues of sign language as an effective means of secret communication in the 21st century.

A secret language is, roughly speaking, a substitution cipher that operates on the level of morphology and grammar. Experience teaches us that unknown languages are difficult to decipher, so as long as the "key" remains a secret, the language remains relatively secure. The "key", in this case, is the combination of lexicon and grammar.

As cryptographic systems, secret languages are terrible. The key is difficult to transmit, and once it is broken, a new key must be laboriously created and transmitted. However, the great saving grace of secret languages in the 21st century is that encryption can take place entirely within the only device that remains free of malware: the human brain.

In order to remain secure, however, encryption must remain within the human brain. One of the significant weaknesses of secret languages in the era of the surveillance state is that users may be tempted to store or transmit the lexicon and grammar in an electronic form that may be intercepted and compromised. Another weakness is that keywords in the secret language may be distinctive enough that secret messages may be easily identified and used for traffic analysis.

A secret sign language is more secure on both of these counts. First, the key is actually difficult to store in writing, and is most naturally communicated person-to-person. Second, the easiest way to transmit a message over the internet is by video, which requires much more extensive and complex analysis even to pick out the existence of the secret communication.

Today's surveillance states have vast means at their disposal, and can easily out-spend and out-compute most of their adversaries. For the time being, however, there are a few faculties of the human mind that remain out of the reach of conventional computation. A secret sign language takes advantage of many of these capabilities, at a relatively cheap cost.

Thursday, January 16, 2014

Markets that outsmart themselves

I believe financial markets outsmart themselves. I'm not talking about the Efficient Market Hypothesis, but something more meta. This idea probably has a proper name, but I don't know what it is.

Here's the gist of it: There is money to be made predicting financial markets, so people are motivated to predict them. However, once someone acts on a prediction, they alter the market, causing it to become an additional degree more complex. The end result is that the market, which had previously fit a model, no longer fits any model.

Being the geek that I am, of course I have always wanted to simulate this behavior. Tonight I finally sat down and did it.

In my simulation, I have a pool of market actors, each of which has its own algorithm for predicting a market with a single stock. Every day the actors use their models to predict the future price of the stock, and buy and sell shares according to their predictions.

Learning behavior is simulated in two steps. First, random, sporadic mutations occur in the pool of models. If the resulting model is unfit to survive in the market, the other models will naturally eat its lunch. In addition, a certain number of actors are randomly selected to copy the most successful algorithm, so winning algorithms prosper.
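
The two learning steps can be sketched roughly as follows. This is not the code behind the graphs in this post. The trading rule, the price rule, the mutation rate, and every name here (`simulate`, `predict`, the actor fields) are assumptions chosen to keep the example short, and buyers are not matched against sellers.

```python
import random


def predict(coeffs, x):
    """Evaluate the polynomial c_0 + c_1*x + c_2*x**2 + ..."""
    return sum(c * x**i for i, c in enumerate(coeffs))


def simulate(n_actors=200, n_days=200, seed=0):
    rng = random.Random(seed)
    # Each actor has a polynomial model (1 to 6 terms), cash, and shares.
    actors = [
        {"coeffs": [rng.uniform(-1, 1) for _ in range(rng.randint(1, 6))],
         "cash": 1000.0, "shares": 10}
        for _ in range(n_actors)
    ]
    prices = [100.0, 100.0]
    for _ in range(n_days):
        price = prices[-1]
        last_return = price / prices[-2] - 1.0
        buyers = sellers = 0
        for a in actors:
            # Assumed model: the polynomial maps yesterday's return
            # to an expected return for tomorrow.
            expected = predict(a["coeffs"], last_return)
            if expected > 0 and a["cash"] >= price:
                a["cash"] -= price
                a["shares"] += 1
                buyers += 1
            elif expected < 0 and a["shares"] > 0:
                a["cash"] += price
                a["shares"] -= 1
                sellers += 1
        # Assumed price rule: net demand moves the price by at most 10% a day.
        prices.append(price * (1.0 + 0.1 * (buyers - sellers) / n_actors))
        # Step 1: random, sporadic mutations in the pool of models.
        for a in actors:
            if rng.random() < 0.02:
                c = a["coeffs"]
                kind = rng.choice(("perturb", "grow", "shrink"))
                if kind == "grow" and len(c) < 6:
                    c.append(rng.uniform(-1, 1))
                elif kind == "shrink" and len(c) > 1:
                    c.pop()
                else:
                    c[rng.randrange(len(c))] += rng.gauss(0, 0.5)
        # Step 2: a few randomly chosen actors copy the richest actor's model.
        best = max(actors, key=lambda a: a["cash"] + a["shares"] * prices[-1])
        for a in rng.sample(actors, 5):
            a["coeffs"] = list(best["coeffs"])
    return prices


prices = simulate()
print(prices[-5:])
```

Nothing in the sketch kills off unfit models directly. As described above, a bad mutation simply loses money to the other actors, and the copying step then spreads whichever model is currently richest.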

Here is a sample of the resulting stock price history. In this case, I had 200 market actors, and the graph shows days 100-200 of a simulation.


Interestingly, while the stock price remains generally in a consistent range, there are periods of irrational exuberance, including a five-day period in which the stock price leaps to 1000, then returns to normal.


In the lead-up to the period of irrational exuberance, the winning algorithms grew progressively simpler, going from six-term polynomials down to a single-term polynomial. If this were to occur in the real world, it would be something like a single wealthy individual getting a crazy idea, and everyone following suit until the market crashes.

Note that I don't assume the stock has any kind of real value here. My market actors are trying to outguess each other. Each actor in the market is trying to predict
how the aggregate of actors in the market will predict
how the aggregate of actors in the market will predict
how the aggregate of actors in the market will predict
...
how the market will behave.

Tuesday, January 14, 2014

The Black Fund

Sometimes I like to think of plots for the great novel I'll never write. Here's one:

Imagine that a powerful non-state actor (maybe a criminal organization) brings together a team of sympathetic people with trading and hacking skills. The team is organized into a strategic planning group, an information collection group, and a market manipulation group.

The strategic planners identify corporate targets and direct a campaign of illegal information gathering to get an insider's view of the corporation's performance and finances. They then gradually take long or short positions in markets all throughout the world, through a complex system of accounts that conceals the fact that a single actor is behind it all.

Having taken position in the market, they may then direct a market manipulation campaign, leaking confidential information obtained from within the company, or disseminating misinformation through hacked social media accounts like Twitter or LinkedIn. Perhaps they will even find ways to distribute misinformation by hacking into news agencies and sending stories out over the wire, or hacking into corporate websites and email accounts to distribute false or misleading news.

Their goal would be to trigger a sudden market movement, during which they could cash in on their positions, then begin slowly unwinding them and closing accounts. In the mass of buy and sell orders that follow the market movement, it may be difficult to see the pattern of accounts that profit from the manipulation campaign.

The more money they make, the more complex their campaigns become, and the more investment they get from unsavory sources. Pretty soon, they're managing billions of dollars in assets, and enriching the kind of people we don't really want to see enriched.

I'll call it The Black Fund. That's probably the tenth novel I'll never write.

[Update 12/1/2014] A NY Times article about hackers with a financial background.

Monday, January 6, 2014

New Chinese information security terminology

The subject of information security is always interesting to me because it involves emergent behavior in complex systems and requires experimental research. In fact, I recently downloaded and have been playing with some vulnerability analysis tools. (I'm only working on my own network, no intention to engage in malicious behavior, etc.)

I'm also interested in Chinese software and technological innovations. This afternoon I decided to put these two things together and see what I could find about information security in Chinese. This brought me to a Chinese website describing how hackers operate, including screenshots of some exploitation tools that appear to be Chinese innovations.

One of these is called The Struts2 Ultimate Loophole Exploitation Utility. It takes advantage of weaknesses in the Apache Struts2 framework to execute code on the server. The title of the window in the screenshot includes not only the utility name, but also the names of two of the developers and a phone number.

The names of the developers were unique enough that I was able to find their Weibo accounts, as well as their accounts on a Chinese social site for those interested in IT security. The site lists an ungodly number of software vulnerabilities--19 added today alone. It seems this "white hat" site rewards users for reporting vulnerabilities, which are then passed on to manufacturers. Clever!

Reading through these, I've started to update my "Chinese Programming Terminology" page with new vocabulary related to information security. I've also found a Chinese translation of the manual for the Metasploit pen-testing utility--another goldmine for this type of stuff.

Sunday, January 5, 2014

Complexity of functions and their inverses

I'm interested in the question of trapdoor functions. I want to get a general idea of how common it is to have a function that is easy to calculate, whose inverse is difficult to calculate.

Before I go any further, I want to point out that in this blog post I'm only talking about calculating functions as polynomials. It's possible that a function that is difficult to calculate as a polynomial (using multiplication and addition as primitives) could be easy to calculate using other operators (like XOR, SHIFT, or exotic operators we've never used before). In another post I'll outline the parameters that I believe cover every possible set of primitives for modular calculations.

But for now, I'm just talking about traditional modular polynomials. For a set of invertible functions with inputs and outputs in Z_P, what is the relationship between the complexity of the function and the complexity of its inverse?

For the purpose of this short post, I've looked at P=5 and invertible functions f(x,y) with an inverse g(x,z) such that g(x, f(x, y)) = y. Note that I am not looking for the commutative property that f(f(x, y), z) = f(f(x, z), y).
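
To make the definition concrete: for each fixed x, the map from y to f(x, y) has to be a permutation of 0..P-1, and g simply reverses that permutation. A minimal sketch, with the function stored as a lookup table and with names (`random_invertible`, `invert`) that are made up for this example:

```python
import random

P = 5


def random_invertible(rng, p=P):
    """Return a table f[x][y] in which every row f[x] is a permutation of 0..p-1."""
    return [rng.sample(range(p), p) for _ in range(p)]


def invert(f):
    """Return the table g[x][z] such that g[x][f[x][y]] == y."""
    p = len(f)
    g = [[0] * p for _ in range(p)]
    for x in range(p):
        for y in range(p):
            g[x][f[x][y]] = y
    return g


rng = random.Random(0)
f = random_invertible(rng)
g = invert(f)

# The defining property: g(x, f(x, y)) = y for every x and y.
assert all(g[x][f[x][y]] == y for x in range(P) for y in range(P))
```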

I'm measuring the complexity of each function as an estimate of the processing time for its polynomial: the number of terms minus one, plus the sum of all of the exponents. (That is, counting the number of additions and multiplications required.) So, for example, 3x^4y^2 + 17y would have a complexity of (2 terms - 1) + 4 + 2 + 1 = 8, reflecting the 7 multiplications and one addition.
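
As a sketch, the measure can be written as a small function. The representation of a polynomial as a list of (coefficient, x exponent, y exponent) triples is an assumption made for this example.

```python
def complexity(terms):
    """Number of nonzero terms minus one, plus the sum of all exponents.

    terms is a list of (coefficient, x_exponent, y_exponent) triples.
    """
    live = [(c, i, j) for c, i, j in terms if c != 0]
    if not live:
        return 0
    return (len(live) - 1) + sum(i + j for _, i, j in live)


# The worked example: two terms, with exponents 4 and 2 on the first and 1 on the second.
print(complexity([(3, 4, 2), (17, 0, 1)]))  # 8
```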

In Z_5 there are a total of 5^(5^2) = 298,023,223,876,953,125 functions that take two parameters. Of these, (5!)^5 = 24,883,200,000 are invertible as I described. All I have is a little laptop and a few hours here and there, so I can't even conceive of exploring this entire function space exhaustively.
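
Both counts are easy to check with exact integer arithmetic. Python integers are arbitrary precision, so nothing is rounded the way it would be in a floating-point calculation. A function of two parameters is a table with P^2 cells and P choices per cell, and an invertible one has an independent permutation of 0..P-1 in each of its P rows.

```python
from math import factorial

P = 5

total_functions = P ** (P ** 2)           # P choices for each of the P*P table cells
invertible_functions = factorial(P) ** P  # one permutation of 0..P-1 for each value of x

print(f"{total_functions:,}")       # 298,023,223,876,953,125
print(f"{invertible_functions:,}")  # 24,883,200,000
```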

For this short post, I took 1000 random samples from the function space for testing. It turns out that the complexity of every function was exactly the same as the complexity of its inverse. 799 of the samples had a complexity of 124, and 199 of the samples had a complexity of 123. Two samples had a complexity of 115.

So, when treated as polynomials, a function and its inverse seem to require the same processing time.

Nifty functions

About a month ago, I worked out how many functions had the following properties:


Since then, I've figured out how to write out pretty equations in Google Docs. (Can you tell?) The number of such functions, expressed prettily, is:


This type of function probably has a proper name, but I don't know what it is. For the purpose of this blog post, I'll call these "nifty functions". A month ago, I thought I had figured out how to generate the truth table for the nth nifty function, but it turns out I was wrong. Painfully wrong.

A couple weeks ago, I wrote some code that will take a truth table and solve for the simplest modular polynomial that would generate the truth table. I thought it would be neat to see what the nifty functions looked like as polynomials.
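
This is not the code mentioned above, but one standard way to go from a truth table to a modular polynomial is interpolation with indicator polynomials. For a prime p, the polynomial 1 - (t - a)^(p-1) is 1 when t = a and 0 otherwise (mod p), so summing table[a][b] times the indicators for x = a and y = b reproduces the table. The result has degree below p in each variable. Whether that counts as "simplest" in the sense meant here is an assumption, and the names `interpolate` and `evaluate` are made up for this sketch.

```python
def interpolate(table, p):
    """Return coeffs[j][k], the coefficient of x**j * y**k (mod p), for table[x][y].

    p must be prime and table must be p by p.
    """
    def u(k, a):
        # Coefficient of t**k in the indicator polynomial 1 - (t - a)**(p - 1) mod p.
        return ((1 if k == 0 else 0) - pow(a, p - 1 - k, p)) % p

    return [
        [sum(table[a][b] * u(j, a) * u(k, b)
             for a in range(p) for b in range(p)) % p
         for k in range(p)]
        for j in range(p)
    ]


def evaluate(coeffs, x, y, p):
    """Evaluate the polynomial with the given coefficient grid at (x, y) mod p."""
    return sum(c * pow(x, j, p) * pow(y, k, p)
               for j, row in enumerate(coeffs)
               for k, c in enumerate(row)) % p


# Example: recover x*y + 2*y + 1 (mod 5) from its truth table.
p = 5
table = [[(x * y + 2 * y + 1) % p for y in range(p)] for x in range(p)]
coeffs = interpolate(table, p)
assert coeffs[0][0] == 1 and coeffs[0][1] == 2 and coeffs[1][1] == 1
```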

That's where I realized I was painfully wrong about how easy it would be to generate the truth table for the nth nifty function. While it appears I was right about the number of these functions, generating the truth table turned out to be far more complicated than I thought. However, I have emerged tired but victorious. I can now list out polynomials that represent the nifty functions for a given (mod P).

Naturally, quite a bit of processing is required relative to P, so I am only exploring small primes at the moment. I'm interested in two questions:

1. If you have a polynomial expression that is nifty in (mod P_a), is it also nifty in (mod P_b)?
2. Are there nifty functions where f() can be processed in polynomial time, but its inverse g() cannot be?

Given the tools I have developed so far, I should be able to test these questions experimentally for small P. In the meantime, I leave you with one of the most complex nifty polynomials for (mod 5).