
Our language detection algorithm

While developing dcde.ru, we faced an interesting algorithmic problem.

We can never know for sure which representation of a given term is correct - the one submitted by the user, or the one presented to them after processing.

For example, if the user typed in “руки вверх” and it was converted from Cyrillic to Latin as “herb ddth[”, which of the two is a meaningful phrase in Russian or English?

A naive approach would be to use a dictionary of known words, but this presents three significant problems:

  1. the dictionary would most probably not contain all possible words and their inflections without growing to an extreme size,

  2. regardless of the above, the dictionary would probably be needlessly large,

  3. when a given word exists in both dictionaries, there is no simple answer as to which one should win.

We needed an approach that would be free of the above deficiencies and that would capture the specific characteristics of the given written language as a whole, without being restricted to specific cases.

After considering various universal approaches (neural networks are the first thing that springs to mind), for this task we’ve decided to use a data structure based on the frequency of occurrence of consecutive letters in words of the given language.

This technique takes advantage of the fact that various languages (at least the ones not written using logograms) have certain statistical signatures with respect to which characters (phonograms) occur after which other characters. For example, lots of English words contain the letter “i” followed by the letter “n” - that is, the “in” sequence is very frequent. The “gk” sequence, on the other hand, is very rare.

So, by parsing a long enough wordlist or text in a given language, you can build a table that can aid you in recognizing how likely any given text is to be written in that language.

Going further, if you take longer substrings (of 3, 4 or more characters), you get much more specific sequences that are even more characteristic of the given language - but at some point you effectively start building a dictionary of full words, and the size of your table explodes uncontrollably.

So for our needs we’ve experimentally determined that we should build 4 separate occurrence tables, corresponding to substring lengths from 2 to 5.

More specifically, each table is a map with substrings as keys and their occurrence counts as values. We parse a wordlist or a long text in the given language and count the occurrences of all substrings of the given length in each word. After this, we should get something like the following example (for 3-letter substrings):

…
"lvi": 20,
"mys": 15,
…
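
As a rough illustration (a sketch, not the code we actually run), counting all substrings of a fixed length over a wordlist could look like this in Python:

from collections import Counter

def count_substrings(words, n):
    """Count every substring of length n across a list of words."""
    counts = Counter()
    for word in words:
        word = word.lower()
        # slide a window of size n over the word
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts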

Then, all those maps are nested in another map, which maps substring lengths (integers) to these maps as its values, e.g.:

2: {map of two-letter substring occurrences},
3: {map of three-letter substring occurrences},
4: {map of four-letter substring occurrences},
5: {map of five-letter substring occurrences}

Separately, we record the sum of all counts across all the maps for the given language - this will be used to normalize the results so that they can be compared between different languages (whose maps are likely the result of processing bodies of text of different lengths, and therefore have radically different sizes and counts of characteristic substrings).

This way, we’ve constructed a fingerprint of the given language (consisting of occurrence maps for substrings sized 2-5, and the overall sum of all counts for later normalization).
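
Putting it together, a minimal sketch of such a fingerprint (reusing the hypothetical count_substrings helper above) could be:

def build_fingerprint(words, min_len=2, max_len=5):
    """Build a language fingerprint: one occurrence map per substring
    length from min_len to max_len, plus the sum of all counts that is
    later used for normalization."""
    maps = {n: count_substrings(words, n) for n in range(min_len, max_len + 1)}
    total = sum(sum(m.values()) for m in maps.values())
    return {"maps": maps, "total": total}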

Now suppose we want to compute the likelihood that the given word or text fragment is in the given language (and compare that to the likelihood for the same word, but a different language).

We iterate over all substrings of all the supported lengths (2-5 in our case) present in that word, and for each substring we look up its count in the given language’s fingerprint maps. We add that count to the overall score for the word. If a substring doesn’t occur in the map at all, it means we’ve found a sequence of characters that should never occur in the given language - in that case, we penalize it heavily by subtracting points from the overall score for the word.

A careful reader would probably notice that with this scheme, languages with larger overall counts recorded in their tables would be unjustly privileged - simply because their tables were constructed from larger amounts of text.

Because of that, the overall score needs to be normalized - so we simply divide it by the aforementioned sum of all counts for the given language, which is an attribute of the language’s fingerprint.

We also divide the value by the length of the analyzed text - so that texts of different lengths can still be meaningfully assessed.

The resulting value should accurately describe the likelihood that the analyzed word or text is in the given language. It isn’t useful in and of itself, but once you compute it against several language fingerprints and compare the results, you can decide which of those languages the text is most likely in.
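
A simplified scoring routine along these lines might look as follows; the penalty constant is an arbitrary placeholder, not a value taken from our implementation:

def score(text, fingerprint, penalty=1000):
    """Score how well `text` matches a language fingerprint: known
    substrings add their counts, unknown ones subtract a penalty, and
    the result is normalized by the fingerprint's total count and by
    the text length."""
    text = text.lower()
    raw = 0
    for n, counts in fingerprint["maps"].items():
        for i in range(len(text) - n + 1):
            sub = text[i:i + n]
            if sub in counts:
                raw += counts[sub]
            else:
                raw -= penalty  # a sequence this language never produces
    return raw / fingerprint["total"] / len(text)

def detect(text, fingerprints):
    """Pick the language whose fingerprint yields the highest score."""
    return max(fingerprints, key=lambda lang: score(text, fingerprints[lang]))

Given fingerprints for English and Russian, such a routine would score both representations of the user’s input against both languages and keep the better-matching one.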

Through empirical testing we’ve determined that, to reliably distinguish Russian from English, the described algorithm needs to be refined - namely, the computation of the score for the analyzed text. The score should be weighted by the size of the analyzed substring: longer substrings are far more characteristic, so they should contribute more to the overall score. They should also impose a larger penalty if they never occur in the given language.

The formula that is the most effective according to our experiments is:

where:

n = substring size (constant during the depicted pass)

cᵢ = occurrence count of the given substring

b = number of bad substrings (that shouldn’t occur in the language at all and therefore aren’t present in maps)

S = sum of all occurrences in language maps

L = length of the analyzed text

This formula can very likely be improved upon, since it was put together quickly by trial and error; rigorous mathematical analysis would probably yield a much more effective one.
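
As one possible reading of the description above - counts and penalties weighted by the substring size n, with the total divided by S and L - a sketch could look like this; the linear weighting and the penalty constant are assumptions, not the exact formula we use:

def refined_score(text, fingerprint, penalty=1000):
    """A size-weighted variant of the scoring above: each matched
    count c_i is weighted by its substring size n, each bad substring
    subtracts a penalty that also grows with n, and the total is
    divided by the language sum S and the text length L."""
    text = text.lower()
    raw = 0.0
    for n, counts in fingerprint["maps"].items():
        for i in range(len(text) - n + 1):
            sub = text[i:i + n]
            if sub in counts:
                raw += n * counts[sub]  # longer substrings contribute more
            else:
                raw -= n * penalty      # and are penalized more when absent
    return raw / (fingerprint["total"] * len(text))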

This is, of course, a simplified description of the process that omits mundane and/or obvious processing steps, such as filtering out invalid characters (which may make the compared words differ in length, underscoring the importance of proper normalization). YMMV.
