I wrote a cute little program to generate poetry in toki pona, the 125-word language that was published by Sonja Lang in 2001.
Here's one of its sick rhymes:
jan sama pi jan,
ike li pona ala,
tawa mi la tan.
tenpo ni e kala.
And here's an attempted translation by u/jedgrei from Reddit. I say attempted because this particular poem doesn't fully make sense.
Because I dislike the enemy of the sibling, now [fish?].
Before I explain how it all works, I should share some relevant facts about toki pona:
Now for an EXPLANATION. The first thing I needed was a bunch of toki pona text (a "corpus"), which I gathered by downloading comments from r/tokipona using a script that I happened to have lying around. Then I created a big ol' Markov chain based on that corpus, using another script that I happened to have lying around.
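Roughly, building such a chain could look like the Python sketch below. It's an illustration only, not the actual script: the `build_chain` name, the `start`/`end` marker nodes and the three-sentence corpus are all made up, and it assumes the comments have already been split into sentences.

```python
from collections import defaultdict

START, END = "start", "end"  # hypothetical marker nodes for sentence boundaries


def build_chain(sentences):
    """Count how often each word is directly followed by each other word."""
    chain = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = [START] + sentence.lower().split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current][following] += 1  # edge weight = number of times seen
    return chain


corpus = ["toki pona", "toki awen", "pona awen"]  # made-up stand-in corpus
chain = build_chain(corpus)
print(dict(chain[START]))
```

Each key is a node, and each inner dictionary holds the weighted edges leaving that node.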
So to generate text with this Markov chain graph thing, you randomly traverse the graph, beginning at the node called start, which represents the start of a sentence. The probability of going from one node to another is weighted by the number of times the second word appears right after the first in the corpus. That weight is stored in the edge between the nodes. In this example, the word "toki" appeared at the start of 11 sentences and "pona" at the start of 2, which means there's a probability of 11/(11+2) that you go from start to toki. Similarly, there's a probability of 2/(11+2) that you go from start to pona. Don't panic, the probabilities add up to 1.
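The arithmetic is just "divide each edge weight by the total leaving the node". A tiny Python sketch, where `transition_probabilities` is a made-up helper name and the counts are the ones from the example:

```python
def transition_probabilities(edges):
    """Turn the raw edge counts leaving one node into probabilities."""
    total = sum(edges.values())
    return {word: count / total for word, count in edges.items()}


# Counts from the example: 11 sentences start with "toki", 2 with "pona".
from_start = {"toki": 11, "pona": 2}
probs = transition_probabilities(from_start)
print(probs)
```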
Here's a path we might follow through the Markov chain, resulting in the sentence "toki pona awen". It depends on which random numbers we pull from our hat, though. We could also end up with "toki awen" or "pona awen".
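Here's roughly what that random walk could look like in Python. It's a sketch rather than the real script: `generate_sentence` is a made-up name, and only the weights 11 and 2 come from the example, the rest are invented so that the graph produces the three sentences mentioned above.

```python
import random

# Hypothetical graph: only the 11 and the 2 come from the text.
chain = {
    "start": {"toki": 11, "pona": 2},
    "toki": {"pona": 5, "awen": 3},
    "pona": {"awen": 4},
    "awen": {"end": 7},
}


def generate_sentence(chain, rng=random):
    """Walk from "start" to "end", picking each next word by edge weight."""
    words = []
    node = "start"
    while True:
        candidates = list(chain[node])
        weights = [chain[node][word] for word in candidates]
        node = rng.choices(candidates, weights=weights, k=1)[0]
        if node == "end":
            return " ".join(words)
        words.append(node)


print(generate_sentence(chain))
```

`random.choices` does the weighted pick, so heavier edges get followed more often without the weights needing to be normalised first.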
That's fine for normal text generation, but when you're generating poems, there are extra constraints on the output, such as the number of syllables per line. Or maybe the final word on a line should rhyme with a word from another line.
Here's how the algorithm works with the addition of constraints. Let's say we're generating a line in a poem using the Markov chain from before. We've followed the path start → toki → awen. But we need the last word to rhyme with "sona", which "awen" doesn't.
We retrace our steps back to toki.
Then we try picking pona as the next node, giving: start → toki → pona.
We can stop here, because "pona" happens to rhyme with "sona". We'd also have to backtrack if we exceeded the number of syllables allowed on the line.
That's pretty much how the poem generation works. A concise description would be: weighted random search through a graph, with backtracking.
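To make that concrete, here's a Python sketch of a weighted random search with backtracking. Treat everything in it as an assumption rather than the generator's real code: the function names, the graph weights (apart from 11 and 2), and especially `rhyme_key`, which is a guess at a rule consistent with the poems in this post (the consonant plus vowel for words ending in a vowel, the vowel plus "n" otherwise). It also assumes one vowel per syllable, a target of at least one syllable, and lines that never cross a sentence end.

```python
import random

VOWELS = "aeiou"

# Hypothetical graph: only the 11 and the 2 come from the text.
chain = {
    "start": {"toki": 11, "pona": 2},
    "toki": {"pona": 5, "awen": 3},
    "pona": {"awen": 4},
    "awen": {"end": 7},
}


def count_syllables(word):
    # Assumes every syllable contains exactly one vowel.
    return sum(ch in VOWELS for ch in word)


def rhyme_key(word):
    # Guessed rule: "pona" -> "na", "awen" -> "en". Assumes the word has a vowel.
    last_vowel = max(i for i, ch in enumerate(word) if ch in VOWELS)
    if last_vowel == len(word) - 1 and last_vowel > 0:
        return word[last_vowel - 1:]
    return word[last_vowel:]


def weighted_shuffle(edges, rng):
    """Return the neighbours in random order, heavier edges tending to come first."""
    remaining = dict(edges)
    order = []
    while remaining:
        words = list(remaining)
        pick = rng.choices(words, weights=[remaining[w] for w in words], k=1)[0]
        order.append(pick)
        del remaining[pick]
    return order


def generate_line(chain, syllables, rhyme_with, rng=random, node="start", line=()):
    """Depth-first search for a line with the right length and final rhyme."""
    used = sum(count_syllables(word) for word in line)
    if used == syllables:
        # The line is full: keep it only if the last word rhymes.
        if rhyme_key(line[-1]) == rhyme_key(rhyme_with):
            return list(line)
        return None  # dead end, so the caller backtracks
    for word in weighted_shuffle(chain.get(node, {}), rng):
        if word == "end" or used + count_syllables(word) > syllables:
            continue  # skip sentence ends and words that overflow the line
        result = generate_line(chain, syllables, rhyme_with, rng, word, line + (word,))
        if result is not None:
            return result
    return None  # every option from this node failed


print(generate_line(chain, syllables=4, rhyme_with="sona"))
```

With four syllables and a rhyme on "sona", any path ending in "awen" gets rejected, so the search always backs up and lands on toki → pona, just like the walkthrough.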
Maybe, rather than throwing this poem generator on the trash heap of all the useless programs I've written, it'll be the basis for a Twitter or Mastodon bot. Stay tuned (update, May 9th 2021: it's here, after a couple of weeks stuck in Twitter's spam dragnet). In the meantime, here's the code, and below are a few more poems I've generated.
A limerick (rhyming scheme aabba).
a a mama mije ala pali wan,
wan wan seli e pilin pona pi jan.
lon li kalama en,
nasin ni li len,
noka nasa mute li pona e pan.
A Shakespearean sonnet (abab cdcd efef gg).
pona taso soweli sina ken,
ante la sijelo pi sitelen,
pona mute li nimi wan wan en,
nimi wan wan wan wan taso mi ken.
mani lili tan mi li kon li sin,
lon e kulupu tomo suli kin,
la sina lon pi sona pona kin,
la ona li pona e ma lukin.
luka en pilin telo seli sin,
e oko lili pi luka wan wan,
la nanpa lili ni la ona kin,
la lipu mute la mi ali jan.
pona ala tan ni li kepeken,
nasin sewi li suli li awen.
And a haiku (abc).
pilin ala ken,
ala mute o lukin,
e kon mun en tan.
Here are all the groups of rhyming words, according to my definition of a rhyme: two words rhyme when their last syllables are the same, regardless of whether they're stressed syllables.
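Grouping words by rhyme could be sketched in Python like this. The `rhyme_key` rule is a guess that fits the pairs used in the poems above (jan/tan, ala/kala, en/len), not necessarily the exact definition behind the groups: it keeps the consonant plus vowel for words ending in a vowel, and the vowel plus "n" otherwise. The function names and the sample word list are made up too.

```python
from collections import defaultdict

VOWELS = "aeiou"


def rhyme_key(word):
    # Guessed rule: "kala" -> "la", "jan" -> "an". Assumes the word has a vowel.
    last_vowel = max(i for i, ch in enumerate(word) if ch in VOWELS)
    if last_vowel == len(word) - 1 and last_vowel > 0:
        return word[last_vowel - 1:]
    return word[last_vowel:]


def rhyme_groups(words):
    """Bucket words by rhyme key and drop words with no rhyme buddy."""
    groups = defaultdict(list)
    for word in words:
        groups[rhyme_key(word)].append(word)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


sample = ["jan", "tan", "pan", "ala", "kala", "pona", "sona", "awen", "len", "mi"]
print(rhyme_groups(sample))
```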
There are 61 rhyming words, split between 21 groups. That leaves 64 words without a rhyme buddy :(
It would be a waste not to do anything with the text data from r/tokipona, so here are some quick plots.
Here's a word cloud. The bigger a word is, the more common it is on r/tokipona.
Here's Zipf's law in action, in the form of a cumulative frequency distribution. It basically shows that the top 10 words make up about 50% of word occurrences, while the top 50 words make up almost 90%. This pattern shows up in all human languages, I think.
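The number behind each point on that curve is simply "what share of all word occurrences do the top N words cover?". A small Python sketch, with a made-up `cumulative_share` helper and a made-up ten-word token list standing in for the real corpus:

```python
from collections import Counter


def cumulative_share(words, top_n):
    """Fraction of all word occurrences covered by the top_n most common words."""
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / sum(counts.values())


tokens = "mi li e li mi li pona li e toki".split()  # made-up stand-in for the corpus
print(cumulative_share(tokens, 1))
print(cumulative_share(tokens, 2))
```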
Aaaand here's a table of all the words, ordered by frequency. BUT WAIT. Something just occurred to me. A lot of people write in English on r/tokipona, so the word "a" probably looks more common here than it is in actual toki pona.
I'd be happy to hear from you at firstname.lastname@example.org.