English Language Tables

jonruiter · Dec 13, 2016

Long time lurker, first time poster:

Anyone out there compiled a C/V/CV/CVF language chart for generating pseudo English words?

I realize in asking this, that if no one has, it will be expected of me. :CoW:

Thanks.

Jon Ruiter.

flykiller · Dec 13, 2016

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

'Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!'

He took his vorpal sword in hand:
Long time the manxome foe he sought --
So rested he by the Tumtum tree,
And stood a while in thought.

And, as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One two! One two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

'And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
Oh frabjous day! Callooh! Callay!'
He chortled in his joy.

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

Enoki · Dec 13, 2016

http://crr.ugent.be/programs-data/wuggy

Wintertree · Dec 14, 2016

I've done rather a lot of that. There's a big problem with using letter-based tables: only certain combinations of letters are valid in English (or any other language, of course; just different ones) and only in certain places in a word. There are some combinations that are limited to those exact letters: "-ough" for example. It's all over the place -- rough, cough, through -- but it's only those letters; it's not "-oufh" or "-oegh", though it's sometimes "-augh" such as laugh. You can start analyzing words in terms of phonemes, but that becomes a major hairball very, very quickly.

My solution has been syllable-based systems. Get a list of words, cut them into syllables -- beginning, middle, and end -- and then recombine the syllables. While you won't produce as wide a variety of words as you would with a letter-based system, you'll get a higher yield of plausible-sounding ones.
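A minimal sketch of that cut-and-recombine idea, assuming Python and a tiny hand-syllabified word list. The sample words and the names `build_pools` and `make_word` are made up for illustration; this is not NameGen or TableMaster code.

```python
import random

# Hypothetical hand-syllabified sample; a real table would come from a
# much longer word list split by hand or by a hyphenation tool.
SPLIT_WORDS = ["lan-guage", "fa-mil-iar", "sound-ing", "ta-ble", "gen-er-ate"]

def build_pools(split_words):
    """Sort syllables into beginning / middle / end pools by position."""
    pools = {"begin": set(), "middle": set(), "end": set()}
    for word in split_words:
        parts = word.split("-")
        pools["begin"].add(parts[0])
        pools["end"].add(parts[-1])
        pools["middle"].update(parts[1:-1])
    # Sorted lists give reproducible output for a given seed.
    return {slot: sorted(values) for slot, values in pools.items()}

def make_word(pools, rng, middles=1):
    """Glue one beginning, `middles` middle syllables, and one end."""
    chosen = [rng.choice(pools["begin"])]
    chosen += [rng.choice(pools["middle"]) for _ in range(middles)]
    chosen.append(rng.choice(pools["end"]))
    return "".join(chosen)

pools = build_pools(SPLIT_WORDS)
rng = random.Random(2016)
print([make_word(pools, rng, middles=rng.randint(0, 1)) for _ in range(5)])
```

With only five source words the output is repetitive; the variety comes from feeding it a few hundred words.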

The first ancestor of TableMaster was a program called NameGen that did that sort of thing in suitable languages for my D&D world. (I didn't happen to be running Traveller at the time) I'm probably dating myself badly when I note that it was written on a Sinclair QL! It was the first random-generation program I wrote based on data tables, which is why I count it as an ancestor of TableMaster. It could do either syllable or letter combinations, but I found that I got far better results with the syllables when I was trying for words that fit a pattern, like how the names in a particular country should sound. The code (and the QL) was lost decades ago, but I learned a lot about the structure of words when I was designing the data lists for NameGen.

I'd suggest starting with something like one of those lists of the 1000 most common English words. That will give you a good source for your beginnings, middles, and ends, and then you can recombine them freely.

As an example, when I was writing the ancient Egyptian names table for TableMaster, I started with a list of known ancient Egyptians (mostly pharaohs). Males, since there are a lot more of them. I went through the list putting tabs in between the syllables, because my text editor of choice is a thing called EditPlus that lets me select things by columns, so once I was through with all my syllable-slicing, I could easily put the different categories (they weren't exactly beginning/middle/end, but close enough for discussion) in separate groups. After I had my syllables, I sorted each group alphabetically and removed duplicates -- that cut them down a lot, of course. Finally I turned each into a table (I wrote myself a little utility a while ago that turns lists into TBL scripts), put in a few options for word structure -- that is, which choices of syllables should go where for various lengths and types -- and started generating batches of about 100 names at a time. Each time, I'd spot the ones that looked terrible, figure out what part of the name didn't really work well, and just delete that syllable from the tables. After I went through that cycle a few times, I'd thrown out about half the syllables I started with, but got some pretty decent output:

Nebetu
Ikerebek
Amensemo
Wosersekhkhaf
Amunnredut
Surpetkonekh
Haremhsithi
Nessudjhut-nakht
Khusenietnes
Nehetefsenb

Not necessarily great names, but certainly there are some good ones in there.

Do the same thing with English words and you should be able to turn out an unending supply of legitimate-sounding but bogus words.
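The sort, de-duplicate, generate-a-batch, throw-out-what-fails cycle described above could look roughly like this in Python. This is only a sketch under assumptions: the syllables are invented placeholders rather than the actual TableMaster data, and `dedupe`, `batch`, and `drop` are hypothetical helper names.

```python
import random

def dedupe(syllables):
    """Sort alphabetically and drop duplicates."""
    return sorted(set(syllables))

def batch(tables, pattern, rng, size=100):
    """Generate `size` names; `pattern` lists which table fills each slot."""
    return ["".join(rng.choice(tables[slot]) for slot in pattern).capitalize()
            for _ in range(size)]

def drop(tables, slot, syllable):
    """Remove a syllable that keeps producing bad-looking names."""
    tables[slot] = [s for s in tables[slot] if s != syllable]

# Invented syllables for illustration only.
tables = {
    "first": dedupe(["neb", "a", "men", "wo", "neb", "khu"]),
    "mid": dedupe(["ser", "en", "se", "ser"]),
    "last": dedupe(["tu", "mo", "khaf", "nes", "tu"]),
}
rng = random.Random(1)
names = batch(tables, ["first", "mid", "last"], rng, size=10)
drop(tables, "last", "khaf")   # suppose the "-khaf" endings looked wrong
print(names)
```

The judging step stays manual: a person reads each batch and decides which syllable to drop before generating the next one.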

flykiller · Dec 14, 2016

legitimate-sounding but bogus words.

this works if one is not familiar with the language. how can this work if one is familiar with the language?

lang-uage
fa-mil-i-ar
soun-ding

->

lang-mil-ding?
fa-guag-iar?

reading these one does not think, "oh, an english word with which I am not familiar", rather one thinks, "uh, what ... ?"

Wintertree · Dec 15, 2016

Part of the problem with building words for any language you know is that while you don't consciously know the rules, like "-ough" can't be "-iugh", you know them internally. You've been looking at properly-formed English words all your life. A syllable-based system is somewhat better than a letter-based one for that, in that you have at least legal syllables to combine together, but any system that doesn't know as many rules for the language as you do is going to be a problem. Therefore, you pretty much have to be a linguist to write one, and I, at least, most emphatically am not.

Some languages, such as Korean and Cherokee, can be very well represented syllabically; they have quite distinct syllables and combine them in fairly predictable ways. English, unfortunately, not only isn't one of those but when it comes to analysis, it's a total hairball. More particularly, it's a creole -- that is, a language that developed from a combination of other languages. English is a Germanic language (Anglo-Saxon) combined with a Romance language (Norman French), with various other random bits thrown in from other languages. Its spelling reflects this. English isn't really the hardest language in the world to learn -- in fact, basic English on the "me Tarzan, you Jane" level is extremely easy, as befits its origins as a trade language (from my reading, English actually originated with wool merchants needing to talk to wine merchants in cross-Channel trade). It's when you start getting past that basic level that the hairier aspects come in, and most of those are the result of that combination of languages that produced English. Which, relevant to our problem here, means that there's no one simple system for what is a "good" English word and what isn't, because we have sets of words following multiple patterns based on their language of origin, and neither pattern can be applied to the other. Doing it based on phonemes doesn't even work, because the words have to look right, and English spelling, using a very ill-fitting alphabet, is badly disconnected from English phonemes. Consider, as a simple example, that the letter C sounds like either S or K. And that's a really simple example; it gets a lot worse than that.

Going with the examples used, it's not quite as bad as it looks. First, there's the matter of splitting them into syllables. Remember that we usually split between two consonants, that suffixes like "-ing" are distinct syllables, and things like "ua" and "ia" are diphthongs, representations of some of those way-more-than-5 vowels with the 5 letters we have to work with.

lan-guage
fa-mil-iar
sound-ing

So with those examples we'd get things like:

laniar
faguage
laniaring

Perhaps not great words, but at least usable.

Obviously, if a real linguist (which I'm not) was writing a complex program (which I'm not) to construct English words, they could come up with something a lot better than that. Though part of the problem in that case is quite the opposite of what this gives us, namely the more the system corresponds to English structures, the more actual English words it will produce that will need to be weeded out.

Another problem with syllable-based systems is word length. Obviously, they only work for words of two or more syllables, and ... well, look at this sentence. 17 words, and only 4 of them have two or more syllables. While we might make "obtences" or "senvious" out of them, there's not much we can do about those monosyllabic words. That is one definite limitation of this type of system.

One option is a hybrid system: you start with rules for how to construct a syllable, plus a certain number of prefab syllables (mostly affixes of various sorts), then build syllables and glue them together. A couple of the TableMaster tables for generating strange alien words do that, using letter frequency tables to build the syllables. Of course, I have an advantage there because nobody knows what the words are supposed to look like to begin with!
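A rough Python sketch of such a hybrid, assuming a single onset + vowel + coda syllable rule and a short list of prefab suffixes. The weights are toy values, not measured English letter frequencies, and none of this is taken from the TableMaster tables.

```python
import random

# Assumed toy weights, not measured English letter frequencies.
ONSETS = {"b": 3, "d": 4, "l": 4, "m": 3, "n": 6, "r": 6, "s": 6, "t": 9, "": 2}
VOWELS = {"a": 8, "e": 12, "i": 7, "o": 7, "u": 3}
CODAS = {"": 10, "n": 5, "r": 4, "s": 4, "t": 4}
SUFFIXES = ["ing", "ment", "ly", "er"]   # prefab syllables, mostly affixes

def weighted(table, rng):
    """Pick one key from a {item: weight} frequency table."""
    items = list(table)
    return rng.choices(items, weights=[table[i] for i in items], k=1)[0]

def syllable(rng):
    """Rule: optional onset consonant + vowel + optional coda consonant."""
    return weighted(ONSETS, rng) + weighted(VOWELS, rng) + weighted(CODAS, rng)

def word(rng, syllables=2, suffix_chance=0.5):
    """Build syllables by rule, then sometimes glue on a prefab suffix."""
    out = "".join(syllable(rng) for _ in range(syllables))
    if rng.random() < suffix_chance:
        out += rng.choice(SUFFIXES)
    return out

rng = random.Random(7)
print([word(rng) for _ in range(8)])
```

Setting `syllables=1` and `suffix_chance=0` also yields one-syllable words, which the pure cut-and-recombine approach cannot produce.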

This leads to an important question: what is the intended purpose of this? If it's just to write a scrap of a headline, that would involve one type of emphasis, while producing the equivalent of 'Jabberwocky' would need a very different one.

flykiller · Dec 15, 2016

an excellent post. thank you.

English is a Germanic language (Anglo-Saxon) combined with a Romance language (Norman French), with various other random bits thrown in from other languages.

actually considerably more than that. it's german with quite a bit of celtic (whiskey) and norse (sick), a big influx of norman french/latin (pork, castle), followed by a huge influx of french (all the -sion/tion words), followed by a vast infusion of ancient greek/roman (kilometer, microscope), with more words added all the time as they become handy (tsunami, kiosk, gulag, wa, gringo, blitz, pizza). heh, notice how the word "tortilla" is adopted whole, spelling and pronunciation intact.

Condottiere · Dec 15, 2016

Not to worry.

With what seems our current pop culture, educational system, and technical innovations, all English words will soon be able to be expressed as emojis.

flykiller · Dec 15, 2016

all English words will soon be able to be expressed as emojis.

like a mcdonald's cash register keyboard. "push the happy meal symbol, then the big mac symbol ...."

kilemall · Dec 16, 2016

flykiller said:
like a mcdonald's cash register keyboard. "push the happy meal symbol, then the big mac symbol ...."

Which gets us right back to hieroglyphics or the Chinese/Japanese forms.

Or these for that matter.

:rofl:

o:

Wintertree · Dec 16, 2016

flykiller said:
an excellent post. thank you.

actually considerably more than that. it's german with quite a bit of celtic (whiskey) and norse (sick), a big influx of norman french/latin (pork, castle), followed by a huge influx of french (all the -sion/tion words), followed by a vast infusion of ancient greek/roman (kilometer, microscope), with more words added all the time as they become handy (tsunami, kiosk, gulag, wa, gringo, blitz, pizza). heh, notice how the word "tortilla" is adopted whole, spelling and pronunciation intact.

I was simplifying.

To quote James Nicoll, "We don't just borrow words; on occasion, English has pursued other languages down alleyways to beat them unconscious and rifle their pockets for new vocabulary."

And then we try to write it all down with an alphabet that was invented (or at least popularized) by the Phoenicians, munged around a bit and given some vowels by the Greeks, borrowed by the Etruscans, and looted by the Romans, which is unsuitable for representing the phonemes in all the languages spoken by people on that list except Phoenician, using spelling which has remained the same while pronunciation has shifted, even in the dialect it originally represented.

The names of the foods on my dinner plate tonight came from French (twice), Anglo-Saxon, Spanish, and Nahuatl. I had a salad, beef, string beans, potatoes, and a chocolate brownie.

And this is why trying to simulate English programmatically is such a screaming nightmare. The pattern for how either letters or syllables should be assembled into words varies depending on both the original language of the word and also how long ago it was adopted (or seized) by English. It further complicates things that the two primary source languages, even if we ignore all the others, are from quite different language families (Germanic and Romance) which have very different structures.

The short version is that there is no really good way to do this because of what English is. My solution would be to generate a page full of words and pick the ones that sound good. Admittedly I'm biased here, given that I created TableMaster and all -- as they say, when you have a hammer, everything looks like a nail. Beyond that, I wouldn't even know where to start. English is just too weird.

Wintertree · Dec 16, 2016

kilemall said:
Which gets us right back to hieroglyphics or the Chinese/Japanese forms.

Or these for that matter.

:rofl:o:

Actually, it's not quite that simple. Contrary to common belief, neither hieroglyphs nor hanzi/kanji are actually ideograms.

Hanzi (Chinese characters) in many cases act as, essentially, rebuses. The sub-characters (radicals) that make them up represent specific sounds, with additional elements to indicate meaning -- the latter necessary because Chinese is a tonal language. How you say something is important. If you've seen the movie "My Cousin Vinny" (if you haven't, you should), remember the one guy's response to the accusation against him: "I shot the clerk?" said in a tone of disbelief. Part of the plot of the movie involves the difference between "I shot the clerk?" and "I shot the clerk!" Chinese -- or, more correctly, the group of closely-related languages that we collectively call "Chinese" -- is all like that. And instead of the difference between "What, you think I shot the clerk?" and "Yeah, I shot the clerk!" it can be the difference between (just making this up) "mother" and "hamster" or "father" and "elderberry". I don't know a lot about hanzi, mind you; just slightly more than average. You could memorize them and learn their meanings, but it would be much like memorizing "mother" and "hamster" as symbols without knowing how they're pronounced.

Egyptian hieroglyphs (btw, "hieroglyphic" is the adjective) are different -- and much weirder. They include an alphabet (or at least an abjad), biliteral and triliteral signs, and determinatives. Also the names of gods. They often combine several of these in the same word, so sorting one out can be an exercise in redundancy.

There are something like 800 distinct symbols in Egyptian hieroglyphs. As you can imagine, that's not nearly enough for an ideographic or logographic system. Knowing what we know now about writing systems, etc., which is considerably more than what it was in Champollion's day, that would tell us right there that it's a mixed script: too many characters for an abjad, alphabet, abugida, or syllabary, not enough for anything even approaching an ideographic script.

I'll leave out the biliteral and triliteral signs for now, and stick with just the abjad and the determinatives for my explanation; they're complicated enough.

The thing that makes it hard is that the Egyptian writing system did not account for vowels, much like, say, Hebrew. Or ancient Phoenician, which is part of the reason why we have five symbols to represent somewhere between 14 and 21 (depending on dialect) vowel sounds in English. When you have a language that only records consonants, you can have a real problem getting across the right word. Imagine if English did that: the word "bt" might mean boot, beat, boat, bat (either baseball or flying), but, etc. Sometimes you might be able to pick it out of context, but what about "sw bt" ... I saw a boot? I saw a bat? I saw a boat? The way the Egyptians handled that was with what is called a determinative -- a symbol added at the end to categorize the word. So if you read "bt" with a little foot at the end, you'd know it was "boot" because it has to do with feet.
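The "bt" thought experiment is easy to reproduce in Python. This is only an English analogy under assumptions: vowels are stripped by letter, and the determinative labels ("foot", "water", and so on) are invented categories, not real Egyptian signs.

```python
from collections import defaultdict

VOWELS = set("aeiou")

def skeleton(word):
    """Drop the vowels, leaving only the consonants an abjad would record."""
    return "".join(ch for ch in word if ch not in VOWELS)

# Determinative labels are invented for this English analogy.
LEXICON = {"boot": "foot", "boat": "water", "bat": "animal", "beat": "action"}

by_skeleton = defaultdict(list)
for word, determinative in LEXICON.items():
    by_skeleton[skeleton(word)].append((word, determinative))

def read(consonants, determinative):
    """Resolve a consonant-only spelling using its category sign."""
    return [w for w, d in by_skeleton.get(consonants, []) if d == determinative]

print(by_skeleton["bt"])      # all four words collapse to "bt"
print(read("bt", "foot"))     # ['boot']
```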

I'm studying hieroglyphs and, necessarily, Middle Egyptian at the moment. (so far, I can write really profound things like "the man is in this place" ... it's slow going) "Man" is written as, essentially, "z" -- we actually don't know if it was pronounced "zu" or "az" or "uza", but conventionally a short 'e' is written in when transliterating to English: "ze". I haven't encountered them yet, but I'm expecting there will be other words written as "z" (which is, by the way, a symbol that looks like a line with two lumps in the middle) which would have been obvious in spoken Egyptian -- "zo" and "uza" are obviously different words -- but not in writing, so whichever one meant "man" was indicated by a determinative that is a stylized picture of a man. Which, when I draw it, looks a lot like a frog with back problems. *sigh* The word for "woman" is "zt" -- "t" is the feminine ending; yeah, Middle Egyptian has grammatical gender, though it's on a par with, say, Spanish, not German, so at least there's that. "Zt" could be "zet" (as it's transliterated), or "uzut", or "zeta", or something else, so again, there's a determinative to show what category it falls into -- words having to do with women, in this case. Thankfully, it's somewhat easier to draw. One useful determinative is a rolled-up papyrus scroll (a narrow vertical rectangle with a line in the middle for the string) -- it indicates abstract concept. So the English word "wnt" with a pair of feet (the determinative for words related to movement) would be "went", but with the abstract-concept determinative, it would be "want." That gets used a lot, because when you think about it, an awful lot of words in any language are for abstract concepts. That's one reason why there is no known truly ideographic script: how do you draw "want" or "in the future" or "understood"? In hieroglyphs, you don't; you write the consonants of whatever the word for them is, and add the papyrus scroll symbol.

tl;dr: Egyptian hieroglyphs are not "picture writing" -- they're a representation of the consonant and quasi-vowel sounds of the spoken language, sometimes with ideograms (known as determinatives) to identify the category of the word in order to disambiguate homonyms, of which, since it's an abjad instead of an alphabet, there are necessarily many.

jonruiter · Dec 17, 2016

See?

Original post had that can of worms...

Thanks for all the information and back-and-forth. Right now my generator is just using hex locations for names, and the 'hey, look at the population in 0424!' is getting kinda old.

I will proceed with the 'most common word syllabification' method to generate the 'kinda-based on english' system and world names.

Once I figure out the mapping of tidally locked planets.

Wintertree · Dec 17, 2016

*tosses another worm in the can*

If you're creating names, there's one other thing to take into account: Names, especially geographic names, are often conserved when the dominant language in an area changes. The invaders might put their own names on settlements (though not necessarily) but for things like rivers, they ask someone "what do you call this?" and that sticks.

Take, for example, the city of Schenectady, NY. The Dutch, who were the first Europeans to settle there, didn't name it that. Neither did the English who followed. That was the Mohawk, in whose language it meant something like "beyond the pines." No English-speaker would have named something "Schenectady"; it's full of letter combinations that don't normally occur in English. The English-speakers inherited the name, and it has vexed generations of schoolchildren on spelling tests ever since.

So in an area where the dominant language has changed repeatedly -- England, let's say -- some names are going to reflect a substrate that is not present, or at least not visible, in the current language.

Further, new settlements -- those established by occupiers or those built in unsettled places, which is more common in Traveller than in real life -- get all sorts of names. Some are meaningful in the language used, like "Bayport", but others are in other languages (Latin and Greek are big ones), or named for the places the settlers came from (Boston, MA is named for Boston, UK), or named for a prominent individual, a sponsor or relative thereof, the person who originally owned the land, or somebody's mother. (Aiken, where I live, is named for a former governor)

Take the Schenectady, NY area (yes, there's someone from there commenting on this post): I've mentioned Schenectady, itself, being Mohawk. Schenectady County is adjacent to Rensselaer County, named for the Dutch merchant who first purchased the land (from other Europeans; the locals weren't consulted): van Rensselaer. Towns in the area include Burnt Hills, with the obvious English meaning; Alplaus, from the Dutch "Aal Plaats", eel place; Troy, right out of Homer; Albany, named for the Duke of Albany (which is apparently a region of Scotland); speaking of Scotland, there's Scotia. So basically, you've got a mix of native names and two separate waves of immigrants. In other parts of the US, some of those immigrants were French or Spanish, giving their respective names to places like Bellefonte or Los Angeles.

So that's something you need to consider, too: why would people name it that?

Or you could be a sane person, and just make up a bunch of words. But then what would we have to talk about?

Ishmael · Dec 17, 2016

Here is something that might be useful; an online Markov based word generator. Although I've never used any custom training data, it looks like the ability is there. Adding an english-based training list shouldn't be a problem.

https://www.samcodes.co.uk/project/markov-namegen/
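For anyone wanting to try the Markov approach on their own English word list, here is a minimal character-level sketch in Python. It illustrates the general technique only and makes no claim about how the linked tool works internally; the training words, the order of 2, and the `^`/`$` start and end markers are all assumptions.

```python
import random
from collections import defaultdict

def train(words, order=2):
    """Map each `order`-letter context to the letters seen after it."""
    model = defaultdict(list)
    for word in words:
        padded = "^" * order + word.lower() + "$"   # ^ = start, $ = end
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate(model, rng, order=2, max_len=12):
    """Walk the chain from the start context until the end marker."""
    context, out = "^" * order, ""
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == "$":
            break
        out += nxt
        context = (context + nxt)[-order:]
    return out

# Hypothetical training list; swap in a list of common English words.
TRAINING = ["language", "familiar", "sounding", "syllable", "generate",
            "table", "master", "traveller", "planet", "system"]
model = train(TRAINING)
rng = random.Random(42)
print([generate(model, rng) for _ in range(8)])
```

A higher `order` copies more of the training words verbatim, while a lower one drifts toward the letter-soup problem discussed earlier in the thread.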

Wintertree · Dec 17, 2016

Advance warning: This thread has gotten me talking about generating random names; this is a subject I can natter on about for hours on end, with decades of practice in said nattering, so if you don't like great walls of text about random word/name generation, you're probably going to want to skip this post.

Now that is interesting looking. It's not something I could do in TableMaster -- well, not easily; TBL has grown into a powerful enough language that I wrote a bubblesort in it while I was waiting out the last couple hours of the Kickstarter, so it's not impossible per se, but that's not really what it's for -- but it really pushes my "cool thing with words" button. Admittedly that button is about the size of a pizza.

It's definitely something I need to learn more about, and play with. (contrary to popular belief, I don't actually live/think/breathe TableMaster; I've been known to go for whole minutes thinking about something else!)

Incidentally, in random diseases it came up with "mallpox". That should be a real word! Maybe what you get from being in a shopping mall during the Christmas season? "I'm beat ... I was out on Black Friday and now I've got mallpox."

Ah, here's something I can compare directly: dinosaur names. Here's the first 10 names the procedural name generator created:

Code:

Panosaurus
Regosaurus
Tylus
Edmong
Asylophos
Baryosaurus
Harascopely
Ansaurus
Yangia
Zhuchus

And here's the first 10 from a TableMaster table using the cut-and-reassemble technique:

Code:

Odonosaurus
Hadrosaurus
Claodemus
Topsgnathus
Barothon
Carnovenator
Megalopteryx
Carcharopteryx
Disaurus
Corythdectes

I used a list of dinosaur names and their meanings that I got off Wikipedia or somewhere to create the table. I didn't use syllables, in this case, but words that are elements of real dinosaur names, so they do mean something, whether or not it makes sense. Hmmmm ... carnovenator ... meat hunter ... that does not sound like something I want chasing me!

One big down side of this, though, is variety -- with only 162 first halves and 31 second halves, that's only 5022 possible dinosaur names. That's generally true of anything where the resulting word has to be meaningful. For example, on the way home today after running some errands, I was noticing the names given to housing developments, presumably to appeal to buyers. They're named things like "Pine Crest" ... which would probably be in totally flat terrain where the only pine tree within miles is the one painted on the sign; we know this. Naturally, this makes me want to write a table to generate such names. I'm just wired up that way. That's where the limitation issue comes in: "Pine Crest" is good, but "Pine Blorg" is right out. (even though it might be a better description; some of these places look pretty chintzy) So the only way to produce something like that would be with two lists of words, one with Pine, Sunrise, Breezy, etc., and the other with Crest, Ridge, Lake, and so on. Ten of each gives you 100 possible names. Probably none of them are as bad as Crystal Pointe, which is a real development near here, in an area that's got no crystals because it's all sand for hundreds of feet down (southeastern coastal plain) and is miles from the nearest pond, let alone any body of water large enough for a point(e) to really count. 20 of each would be 400 names. Which, I suppose, is adequate for normal gaming use ... "your contact is having a Christmas party in his house in Pine Crest" ... but I really prefer more variety. The problem is that it's virtually impossible if it has to make sense.
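The arithmetic is just the product of the list sizes, which a few lines of Python can confirm. Pine, Sunrise, Breezy, Crest, Ridge, and Lake come from the post; the remaining entries are invented filler to bring each list to ten.

```python
from itertools import product

FIRST = ["Pine", "Sunrise", "Breezy", "Willow", "Maple",
         "Shady", "Golden", "Stone", "Fox", "Heron"]
SECOND = ["Crest", "Ridge", "Lake", "Hollow", "Glen",
          "Meadows", "Pointe", "Landing", "Woods", "Terrace"]

# Every first word paired with every second word.
names = [f"{a} {b}" for a, b in product(FIRST, SECOND)]
print(len(names))   # 10 * 10 = 100
print(162 * 31)     # 5022, the dinosaur-name count quoted above
```

Doubling both lists quadruples the output, which is why 20 of each gives 400.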

The only thing you can really do in that case is to come up with as many potentially usable words as you can (there's a reason my major reference works are not gaming books but thesauruses, dictionaries, and the wonderful Word Menu) and, if possible, work in some names that can be totally, or at least mostly, randomly generated, in order to add greater variety. For instance, if you're naming planets in Traveller, you might roll up ones with names like "Blorg's World", which then further allows you to name ships things like "Pride of Blorg's World", which is handy for the shipping news in the station's newsfeed. "The Pride of Blorg's World departed today, destination Krangmar."

Looking at the procedurally-generated dinosaur names again, I note that a lot end in "-saurus". I know why mine do -- that's one of the prefab second halves of the words, and has a high probability -- but I'm less sure about the Markov generator. Is it doing this all procedurally, or is it generating a word and adding an ending to it? I definitely need to study this more, because it utterly fascinates me.

I'm tempted, just out of curiosity, to try cutting up its training data and making a random name table out of that, to see how the results vary from what it comes up with. If I didn't have to burn 100 CDs of the TableMaster demo and ship them off to a convention (they're going in VIP goodie bags), I'd probably do it. But I'm late with those CDs already, and I really, really, really should not be making dinosaur name tables instead of, y'know, burning dozens of CDs. I shouldn't even be writing this post, but this whole thread speaks right to my obsession. I'm not fascinated with this kind of thing because I created TableMaster; I created TableMaster because I'm fascinated with this kind of thing. And I'd better go now, or I'm going to keep on about this all day!

Wintertree · Dec 17, 2016

Important note: I'm going to be traveling (family emergency) for the near future; I'll check in when I can, but since that'll be from an iPad in a motel room somewhere, things like actually posting will be iffy at best.

Enoki · Dec 23, 2016

If you just want readable, pronounceable nonsense words, use a Lorem Ipsum generator like this one.

http://www.webpagefx.com/tools/lorem-ipsum-generator/

simonh · Feb 17, 2017

Enoki said:
If you just want readable, pronounceable nonsense words, use a Lorem Ipsum generator like this one.

http://www.webpagefx.com/tools/lorem-ipsum-generator/

Those aren't nonsense words though. They're randomly selected valid Latin words.

In case anyone thinks random words look too alien, check out these perfectly valid English words.

abomasum
anfractuous
bardolatry
barmecide
bilboes
borborygmus
chanticleer
claggy
concinnity
deglutition
draff
funambulist
gnathic
humdudgeon
inspissate
loblolly
moonraker*
pantagruelian
rubiginous
skycap
velleity
zopissa

* For the Ian Fleming fans, this actually just means a person from Wiltshire. Or possibly a type of sail.

Simon Hibbs

maggot-iiss · Feb 17, 2017

Word Generation

Dictionary.

Traveller Word Generator @ https://github.com/MaggotIISS/WordGen

Download folder. Double click WordGen/dist/WordGen.jar
