A yegg stole my zebu
19 Oct 2008
A while ago I received one of those annoying chain emails which ask you to take the last word in a list, change one letter to find a valid word not already in the list, add it to the end and forward the whole thing to ten friends or you’ll die in a horrible car crash.
It got me wondering about words in English. Are most words connected to each other in this way? If not, what do the distinct groups look like? (In graph parlance these are connected components; I'll loosely call them cliques.) Do two-letter words, three-letter words or four-letter words have more cliques? I decided I’d find out.
Preliminaries
For simplicity, I limited the edit operation to changing one letter of the word at a time (i.e. no deletions or insertions). You can think of the dictionary as an undirected graph, with each node being a word and each edge an edit operation that lets you travel between adjacent nodes.
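As a rough illustration of that graph model, here is a minimal sketch. It is not the script used for the results in this post; the names `one_letter_apart?` and `build_graph` are made up for the example, and bucketing words by a wildcard pattern is just one convenient way to find neighbours.

```ruby
# True when two words have the same length and differ in exactly one position.
def one_letter_apart?(a, b)
  return false unless a.length == b.length
  a.chars.zip(b.chars).count { |x, y| x != y } == 1
end

# Build an undirected adjacency list: word => array of neighbouring words.
# Words are bucketed by wildcard pattern (e.g. "c_t"), so only words that
# share a pattern get linked and we avoid comparing every pair of words.
def build_graph(words)
  words = words.uniq
  buckets = Hash.new { |hash, key| hash[key] = [] }
  words.each do |word|
    word.length.times do |i|
      pattern = word.dup
      pattern[i] = "_"
      buckets[pattern] << word
    end
  end

  graph = {}
  words.each { |word| graph[word] = [] }
  buckets.each_value do |group|
    # Two distinct words in the same bucket differ only at the wildcard position.
    group.combination(2) do |a, b|
      graph[a] << b
      graph[b] << a
    end
  end
  graph
end

graph = build_graph(%w[cat cot cog dog nth])
graph.each { |word, neighbours| puts "#{word}: #{neighbours.join(', ')}" }
```

In this toy list, cat reaches dog by way of cot and cog, while nth has no neighbours at all.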
My data source is the 2of12inf.txt file from the 12 Dicts package from wordlists.sourceforge.net. It uses American spellings and seems to be a fairly decent list of words (which is to say my 10 minutes of hunting didn’t provide me with anything better). It contains:
- 2-letter words: 62
- 3-letter words: 642
- 4-letter words: 2546
- 5-letter words: 5122
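A tally like the one above could be produced along these lines. This is only a sketch: `count_by_length` is a hypothetical helper, and the cleaning rules (one entry per line, trailing non-letter marks dropped, anything still containing non-letters skipped) are assumptions about the file format rather than a description of it, so the counts it gives on the real file may not match the figures above exactly.

```ruby
# Tally words by length for the lengths of interest.
# Assumptions: one entry per line; a trailing non-letter annotation, if any,
# is dropped; entries that still contain anything but a-z are skipped.
def count_by_length(lines, lengths = 2..5)
  words = lines.map { |line| line.strip.downcase.sub(/[^a-z]+\z/, "") }
               .select { |word| word.match?(/\A[a-z]+\z/) }
               .uniq
  counts = words.group_by(&:length).transform_values(&:size)
  lengths.map { |n| [n, counts.fetch(n, 0)] }.to_h
end

# A tiny in-memory sample standing in for the real word list.
sample = ["at\n", "cat\n", "cot\n", "dogs%\n", "zebu\n", "can't\n", "\n"]
tally = count_by_length(sample)
tally.each { |n, count| puts "- #{n}-letter words: #{count}" }
```

To run it against the real list, pass in the lines of the file instead of `sample`, for example with `File.readlines` and whatever path the file was saved to.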
Results
A little Ruby script yielded the following results.
All 2-letter words belong to the same clique.
Of the 3-letter words, 631 (of 642) belong to the same clique; the remaining 11 are each entirely disconnected from every other word. They are:
- nth, ism, urn, ebb, obi, qua, ova, use, ugh, gnu, aha
The 4-letter words are more interesting. 2415 of the 2546 belong to the same clique and the other 131 are divided up into 97 other cliques. Many of the 131 are common words. The 18 cliques with more than one word:
- ache, achy, acme, acne, acre, ashy
- info, into, onto, undo, unto
- high, nigh, sigh, sign
- afar, agar, ajar
- also, alto, auto
- eddy, edge, edgy
- icon, ikon, iron
- idle, idly, isle
- opal, oral, oval
- used, user, uses
- bevy, levy
- crud, crux
- demo, memo
- hadj, hajj
- idol, idyl
- ogle, ogre
- orzo, ouzo
- thou, thru
The remaining 79 words are stranded on their own:
- adze, agog, ague, ahoy, alga, ammo, amok, anal, ankh, apse, aqua, aura, avow, awol, ayah, bozo, ciao, ditz, ebbs, echo, ecru, egad, emus, ends, envy, epee, epic, espy, euro, evil, exam, expo, guru, hymn, ibex, iffy, imam, iota, isms, jato, judo, kiwi, liar, luau, lynx, meow, myna, nevi, nova, obey, oboe, odor, ohms, okay, okra, once, onyx, orgy, ovum, rely, rhea, rhos, semi, sexy, stye, sync, tofu, tuft, ugly, ulna, upon, urge, uric, urns, void, yegg, yeti, yuan, zebu
The 5-letter words leave the most room for sizeable groups outside the main one. Of the 5122 words, 3935 fall into a common clique. The next 5 cliques each have at least 10 members:
- reset, resew, resow, renew, beset, besot, besom, bosom, begot, begat, began, begun, begum, begin, vegan, bigot, bight, wight, tight, sight, sighs, signs, highs, right, night, might, light, fight, eight, beret, beget (31)
- round, wound, would, world, mould, moult, mount, fount, count, court, could, sound, pound, mound, hound, found, bound (17)
- acnes, acres, acmes, aches, ached, acted, anted, antes, antis, antic, attic, ashes, ashen, aspen, asses, asset, apses (17)
- comic, conic, cynic, tonic, toxic, toxin, topic, tunic, runic, sonic, ionic, colic (12)
- overs, overt, avert, alert, ovens, opens, omens, evens, event, avers (10)
The next 31 cliques have between 4 and 9 members, 34 have 3 members each, 108 have 2 members each, and 613 words are stranded on their own.
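For anyone who wants to reproduce this kind of grouping, here is a minimal sketch using a breadth-first search over the one-letter-change graph. It is an illustration under my own assumptions, not the script behind the numbers above; `components` is a made-up name, and the sample words are taken from the 4-letter results.

```ruby
# Group words into connected components, where two words are linked when
# they have the same length and differ in exactly one letter position.
def components(words)
  words = words.uniq

  # All wildcard patterns for a word, e.g. "cat" => ["_at", "c_t", "ca_"].
  patterns = lambda do |word|
    (0...word.length).map { |i| word[0...i] + "_" + word[(i + 1)..-1] }
  end

  # Bucket words by pattern so the neighbours of a word are easy to look up.
  buckets = Hash.new { |hash, key| hash[key] = [] }
  words.each { |word| patterns.call(word).each { |pat| buckets[pat] << word } }

  seen = {}
  groups = []
  words.each do |start|
    next if seen[start]
    seen[start] = true
    queue = [start]
    group = []
    # Breadth-first search: collect everything reachable from `start`.
    until queue.empty?
      word = queue.shift
      group << word
      patterns.call(word).each do |pat|
        buckets[pat].each do |neighbour|
          next if seen[neighbour]
          seen[neighbour] = true
          queue << neighbour
        end
      end
    end
    groups << group
  end

  # Largest groups first.
  groups.sort_by { |group| -group.size }
end

groups = components(%w[ache achy acme acne acre ashy bevy levy yegg zebu])
groups.each { |group| puts "#{group.size}: #{group.sort.join(', ')}" }
```

On this sample the six ache-style words form one group, bevy and levy pair up, and yegg and zebu are each left stranded, which matches the 4-letter results listed earlier.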
The joys of a Sunday evening well spent. I also learnt that nth is a valid word with no vowels, a zebu is a breed of cattle, and a yegg is a burglar or safecracker.