Jabberwocky: Language Detection and Gibberish

Written Language Detection

An algorithm that guesses the language of a text can be very useful. Say you have a multilingual RSS feed and want text-to-speech software to pronounce the sentences.

You need to know whether the sentence is in English or French before you can handle it correctly.

You could try looking the words up in dictionaries, but that is not very reliable: it works poorly for sentences with foreign names, slight spelling mistakes or made-up words like band names.

Trigram method

The best methods are statistical, such as analysing the trigrams of the sentence.

First you need a large training corpus, say five big novels for each language you want to recognize. You extract all the trigrams (groups of three letters or fewer) and build a frequency dictionary. For instance, English is well known for having a lot of the, of, and, it, is, i, etc., whereas French has a lot of le, la, les, du…
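To make the extraction step concrete, here is a minimal sketch in Python. The function name trigram_counts and the cleaning rules (lowercase everything, keep letters only, pad with spaces) are my own assumptions rather than a fixed recipe. Padding with spaces is what lets short words such as "of" or "i" show up as trigrams like " of" and " i ".

```python
from collections import Counter

def trigram_counts(text: str) -> Counter:
    """Count overlapping three-character groups in a text (hypothetical helper)."""
    # Assumption: lowercase the text and turn every non-letter into a space.
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    # Collapse runs of spaces and pad both ends, so word boundaries
    # become part of the trigrams (" th", "he ", " i ").
    padded = " " + " ".join(cleaned.split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

counts = trigram_counts("The cat and the hat")
print(counts.most_common(5))
```

On a real corpus you would call this once per language on the full text of the novels and keep the resulting Counter as that language's frequency dictionary.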

Then you extract the same data from your sentence and guess the most probable language, based on the frequency of the trigrams only.
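The guessing step could look like the sketch below. Everything in it is an assumption made for illustration: the two one-sentence SAMPLES stand in for the real corpora (far too small for serious use), and the scoring is a sum of log-probabilities with add-one smoothing, which is one common choice among several.

```python
import math
from collections import Counter

def trigrams(text):
    # Same assumed cleaning as before: lowercase, letters only, space-padded.
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    padded = " " + " ".join(cleaned.split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def score(sentence, model, vocab_size):
    # Log-probability of the sentence's trigrams under one language model.
    # Add-one smoothing keeps unseen trigrams from producing log(0).
    total = sum(model.values())
    return sum(math.log((model[t] + 1) / (total + vocab_size))
               for t in trigrams(sentence))

def guess_language(sentence, models):
    # Vocabulary = every trigram seen in any language, plus one "unseen" slot.
    vocab_size = len(set().union(*models.values())) + 1
    return max(models, key=lambda lang: score(sentence, models[lang], vocab_size))

# Placeholder corpora: real use needs whole novels, not one sentence each.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then it is "
               "gone into the woods of the north",
    "french": "le renard brun saute par dessus le chien paresseux et puis "
              "il est parti dans les bois du nord",
}
models = {lang: Counter(trigrams(text)) for lang, text in SAMPLES.items()}

print(guess_language("the dog is in the woods", models))
print(guess_language("le chien est dans les bois", models))
```

Only trigram frequencies are used here: there is no dictionary lookup, which is why the approach tolerates names, typos and invented words.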

Gibberish Detection

The statistical method also works on sentences and words that merely look English or French.

Let’s take Jabberwocky, the famous nonsense poem written by Lewis Carroll in 1871.

Jabberwocky

’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

He took his vorpal sword in hand:
Long time the manxome foe he sought—
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One, two! One, two! and through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

“And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!”
He chortled in his joy.

’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

Such an algorithm will detect that the poem looks like English, even though it contains unknown words such as frabjous, vorpal or borogoves.

It really mimics our human ability to recognize those same words as fake English rather than fake French or fake Japanese.

Of course, things can go wrong if you pretend to speak a language in front of a native speaker, as a very funny scene in The IT Crowd shows.

Gibberish Generation

Gibberish generation can be used for fun or for art, for example. See Skwerl by Brian and Karl, an interesting video about what English sounds like to non-English speakers.

If you want half-gibberish, as in this video or in the Jabberwocky poem, where you can still follow the structure and recognize some words, a method using Markov chains can work.

Here is what I got:

“Norfolks and Somebody wish to speaks took at Road, and hemmed to paper–a day, in God known the crime wandest cart as I would, but at timent to set to deathere, too glad troubles who was by Anne Cathy’s lord, if I must inquillings friend of them in tea who overpower.”
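A Markov chain of this kind can be sketched as follows. This is not the code that produced the sample above; it is a minimal character-level version, and the order of 4, the helper names and the tiny CORPUS (a few lines of the poem instead of whole novels) are all assumptions. Each step looks at the last four characters and picks the next one among the characters that followed those four in the corpus.

```python
import random
from collections import defaultdict

ORDER = 4  # assumed: number of characters the chain remembers

# Placeholder corpus: a few lines of the poem; real use needs far more text.
CORPUS = (
    "twas brillig, and the slithy toves did gyre and gimble in the wabe; "
    "all mimsy were the borogoves, and the mome raths outgrabe. "
    "beware the jabberwock, my son! the jaws that bite, the claws that catch! "
    "beware the jubjub bird, and shun the frumious bandersnatch!"
)

def build_chain(text, order):
    # Map every run of `order` characters to the characters that followed it.
    # Repeats are kept in the list, so frequent followers are drawn more often.
    chain = defaultdict(list)
    for i in range(len(text) - order):
        chain[text[i:i + order]].append(text[i + order])
    return chain

def generate(chain, order, length, start, rng):
    out = start[:order]
    while len(out) < length:
        followers = chain.get(out[-order:])
        if not followers:  # dead end: this state only occurs at the very end
            break
        out += rng.choice(followers)
    return out

chain = build_chain(CORPUS, ORDER)
print(generate(chain, ORDER, 80, CORPUS, random.Random(0)))
```

A lower order gives wilder text, while a higher order copies longer stretches of the corpus verbatim.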

If you want full gibberish, you can just reverse the trigram method: pick trigrams at weighted random, according to their frequencies, and you get a series of English-sounding syllables such as “theatinted”.
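As a rough sketch of that reversal, assuming a frequency dictionary built from letters inside words only, and again a few lines of the poem as a stand-in corpus: the standard library's random.choices draws trigrams in proportion to their counts, and the draws are simply glued together. The helper names are hypothetical.

```python
import random
from collections import Counter

# Placeholder corpus; a real frequency dictionary would come from whole novels.
CORPUS = (
    "twas brillig and the slithy toves did gyre and gimble in the wabe "
    "all mimsy were the borogoves and the mome raths outgrabe "
    "he took his vorpal sword in hand long time the manxome foe he sought"
)

def word_trigrams(text):
    # Count three-letter groups inside each word, ignoring non-letters.
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(c for c in word if c.isalpha())
        counts.update(letters[i:i + 3] for i in range(len(letters) - 2))
    return counts

def gibberish_word(counts, n_trigrams, rng):
    # Weighted draw: a trigram seen ten times is ten times as likely
    # to be picked as one seen once.
    population = list(counts)
    weights = list(counts.values())
    return "".join(rng.choices(population, weights=weights, k=n_trigrams))

counts = word_trigrams(CORPUS)
print(gibberish_word(counts, 3, random.Random(1)))
```

Because each trigram is drawn independently, the joins between them can be awkward; the Markov chain approach trades some of that randomness for smoother transitions.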
