My journey through the world’s languages

I seem determined to wait until a later and later month to make my first post of each calendar year. In this case, I’m not even posting a fully self-contained essay on a recently-contemplated topic; rather, this post is a culmination of a personal project that has carried on for longer than the past three years. It feels very satisfying to finally be able to put a cap on it now.

[Heads up: this post ends with a full diagram of language relationships that I constructed and is the part of this post that I put the most work into and am most proud of — feel free just to scroll to the bottom and check that out if you don’t feel like reading the full explanation and commentary and general linguistics geek gushing that I’m about to launch into.]

Origin of the project

It’s probably apparent to someone who looks through the content of this blog carefully enough that I have a passion for studying languages and linguistics, though only on an amateur level. The seed of this particular project was my fondness for Kenneth Katzner’s book The Languages of the World, in which each page or two consists of a summary of one of some 200 of the major languages of the world, complete with a text originally published in that language and its translation. This gave me an appreciation of the leisure activity of wandering through the world’s languages one by one, getting a small sample of each. During some months in graduate school, probably back around early 2012, I decided to look at one language a day out of Katzner’s book as bedtime reading, ordering them (as well as I could) according to linguistic genetic affiliation rather than by geography as Katzner did, and reading something about the ones I found more interesting on Wikipedia as a supplement. This came out a little more haphazard than I’d hoped, particularly because I didn’t plot out the structures of each of the language families in a very systematic way before diving into them. (The fact that my copy of the Katzner book is from the 70’s, when many languages went by different names and knowledge of relationships between the languages was somewhat dated, didn’t help either.)

In the end, my reaction was to vow to myself that in a few years’ time I would embark on a similar mission but “do it properly this time” — this is usually my instinct regarding personal intellectual projects, even on the completions of ones that went reasonably well. My usual rule of thumb is to wait four years, but what with various other things distracting me intellectually in 2016, I waited until five years had passed. By this time of course I was very “active on the internet” in a way that I definitely hadn’t been back in 2012; in particular I had not only this blog but also a Tumblr blog where I felt perfectly comfortable posting plots of semi-meticulously constructed trees of languages (there’s something about the casual “jot down your thoughts” vibe of Tumblr that makes me feel okay about posting something extremely geekish that most people who run across it are likely not at all interested in — they can always scroll past it!). The result was a very extensive endeavor that took me all the way to the middle of this year.

What I did

Each stage of the project began with my plotting out a list of languages in a particular family (or plotting several smaller families spoken in the area at a time, as I wound up doing more and more) in such a way as to show their genetic (in the comparative linguistics sense) relationships. This boiled down to presenting each family as a tree whose trunk corresponds to the family itself, with the trunk splitting into branches, the branches often splitting into subbranches, etc., with the individual languages as the leaves. I chose not to use resources like Glottolog to categorize all the languages but stuck with Wikipedia as I knew that was the base resource I wanted to use for researching the languages anyway. (I have a very soft spot for Wikipedia as a source of articles on languages, as I knew Wikipedia primarily as the best such source back when I was a teenager slightly before the use of Wikipedia was even common knowledge!)

I began with the Indo-European family and planned to meander slowly around the world, roughly in an easterly direction, ordering the families mainly according to geography. The memory is coming back to me that I had initially planned to plot out all the languages in all the families for my very first post before embarking on the journey but then decided that Indo-European would do for the first leg of the journey and assumed that as Indo-European is one of the most extensive and prominent language families I would have only several legs following it. Of course, in retrospect I’m laughing my head off at this early expectation, as my initial installment consisting of the Indo-European languages turned out to be only the first of fifteen and took only a small fraction of the time and energy I wound up spending.

So, at each step I would get a family or a set of families laid out, in a tree-structured list complete with links to Wikipedia pages, saved as a Tumblr post. Not all of the links were attached to “leaves” of the tree, i.e. individual languages; I often linked the “trunks” to the Wikipedia pages of the families themselves and to each branch linked the Wikipedia page corresponding to that branch (if there was one) and/or the Wikipedia page of the ancestor language corresponding to that branch. For example, there is a Wikipedia page for Proto-Slavic so I put that into my label for the Slavic branch of Indo-European; I also included the page for Latin in my label for the subbranch of Romance languages.

Now that I had an ordered list of links on a single page, I would go through them one by one for what usually took some weeks, looking over one language (or sometimes an ancestor language, or the description of a family with a hypothesized ancestor) per day. For more minor languages with less information on Wikipedia or elsewhere online, I would often do two a day; for the most major languages with a wealth of detailed phonology/alphabet/grammar information on Wikipedia (e.g. Russian, Arabic, Mandarin, Japanese, Tagalog, to name a few) I would pause and take several days. I do want to point out here that my choices of languages to put on the list and how much attention I paid to each of them was not especially fair in terms of how widely they are spoken: for instance, I zipped rather quickly through the various dialects of Arabic and of Mandarin despite some of them having tens of millions of speakers while making disproportionate numbers of stops for tiny, sometimes nearly extinct, Australian or Oceanic languages, particularly if they had interesting features that I wanted to better expose myself to.

In a number of cases, Wikipedia disappointed me with shoddy writing or incomplete information, or else there was a particular aspect of the grammar that I wanted to study further — in these cases I did some Googling to find papers that I could spend some time looking over (at times, I found these papers directly through the Wikipedia page). I never at any time consulted a book — in particular, to do so, I would have felt compelled to check one out of an English-language library, which I didn’t have much access to during most of the past three years when I was living abroad. If a language I was looking at that day had a page in Katzner’s book, I would read that before bed that night, but this became less and less frequent as the project went on, as I was wandering through languages that were increasingly obscure. Most of the time, near the end of the day, I also made an effort to listen to audio samples of that day’s language. The idea was really to feel like I “went somewhere” and was able to taste a particular language as well as understand something about its structure.

Anyway, I quickly developed the tradition of then reblogging each post that contained a language list to discuss my experience researching those languages and then make bulleted notes of interesting facts about them that came up in the research: specifically, not things that I already knew beforehand, but things that I had newly learned. Later on in the journey, as I got to more “exotic” and obscure families (many of which I had never even heard of, let alone knew anything about), these bullet points became quite extensive; it eventually began to feel like a running joke in my commentary that I would start it off by saying something like, “This time, I’ve once again outdone myself with the size of my report.”

Here are links to the original lists I drew up (sometimes edited between when I first posted them and when I reblogged for the follow-up report) along with their “Facts That I Learned” reblogs:

Over the three years that I was composing and working my way through these lists, my methods evolved and became somewhat more rigorous. I started out much more reliant on Katzner’s classifications, basing my outlines of language families off of his, and then double-checking with Wikipedia to adjust them. This became more and more impractical as the journey went on, especially when I came to the more obscure families that were either not mentioned in Katzner’s (again, rather dated) account or were classified by Katzner in an outdated way (which tended to involve putting groups of languages together in single families when the modern consensus is that there is no demonstrated relationship between them). I think it was when preparing my list of the (monstrously vast) Niger-Congo family that I learned to begin with Wikipedia and meticulously go down each and every branch of a “tree” of links to carefully make sure that I was including all branches/subbranches/languages and that they were arranged to be most compatible with the current consensus. Eventually I discovered that I had sloppily left out whole branches and even one or two semi-major languages from the groups I had done earlier, hence my going back and doing a “clean-up” step as you can see in one of the links above.

Additionally, around the time I was going through Niger-Congo, I discovered the collection of World Language Movies clips of Bible stories narrated in a vast variety of different languages; while I had previously been getting my audio samples from a YouTube series called Wikitongues, I quickly realized that World Language Movies is (in my opinion) a lot clearer and more reliable. In particular, they’ve done audio translations in almost literally every language I could possibly want to listen to, no matter how obscure, as long as it has, say, more than a two-digit number of speakers. Those people are admirably dedicated to spreading their truth very thoroughly over all corners of the globe.

The number of languages I ended up visiting, as seen from a rough count on the trees shown at the bottom of this post, is around 670 or so, but these were only the ones I deemed sufficiently spoken/interesting and with enough accessible information online to include — there are vastly more languages listed online, faithfully endowed with Wikipedia pages which are mostly stubs. I can see from the timestamps on the posts I made when starting this on my other blog that I set off on my journey on June 11th, 2017 and finished on June 27th, 2020. While I took some major breaks along the way (for instance, almost all of last fall), as life sometimes forced me to do, it took me all the way until earlier this summer to get through what I believe are all the known families, branches, and languages of the world. So the project was completed successfully, at least in the sense of following an intended procedure, but did I get what I hoped for out of it?

What I learned

After going through each batch of languages, I wrote up a little commentary on my newly acquired knowledge, so here is a meta-commentary of sorts on what I learned in general from the whole thing.

1. There are a whole hell of a lot of languages spoken in the world.

I mean, a lot.

Back in June of 2017, I seriously thought that I would get done with this within a year or so. After all, the proto-version of this project in 2012 only took me a matter of months. But it turns out, when you don’t stick to the more-spoken and better-known languages indexed by Katzner by default but instead systematically track down all known languages indexed by Wikipedia, the numbers are staggering, particularly among non-European families and particularly among indigenous populations which were colonized by Europeans at a time when they were still broken up into smaller, more isolated communities.

As I’ve said, the rough count among languages in the map shown below is around 670. This is only a fraction of the number of languages which have ever been documented and whose Wikipedia pages I actually visited (even if only for seconds, because many were almost empty). There are relatively few languages whose speakers number in the millions and a much greater number of ones whose speakers range from zero (recently gone extinct) to several hundred. This first properly hit home for me when I was compiling my list of Niger-Congo languages: I wound up spending a full Sunday afternoon in which I got through hundreds upon hundreds upon hundreds of languages before I even got to the main attraction, the Bantu languages, which turn out to be a fairly narrow subset of Niger-Congo. It is incredible to me, not so much that so many distinct languages are and have been spoken, but that humanity has managed to identify and document them and verify that they are in fact distinct languages.

In the end (or at least, starting at the point where I began investigating Niger-Congo, the first and most truly vast family), I made a point of clicking on every single living language I could find on Wikipedia, plus most of the extinct ones. I think it’s possible that I have clicked on every single page that Wikipedia has for each of the world’s languages (and proto-languages, and pages for families and branches), with perhaps some minor exceptions of little-attested extinct languages or very little-spoken languages/dialects in some of the western families. I feel a little proud (at least when I avoid contemplating alternative ways of spending that time) when I wonder how many people in the world can say that they’ve done this.

2. The comparative amounts of information available on languages from different families/areas are highly unbalanced.

It’s natural to imagine that accessibility of detailed information about a language is proportional to how widely spoken it is, but there are a surprising number of irregularities in this. I first came face to face with this phenomenon when I was still going through Indo-European languages and reached the Indic branch and found sudden difficulties in doing research on the languages of this branch (and it was the branch I came in knowing the least about in the first place). At the time I diagnosed this as indicative of something racism-adjacent, and I stand by that, but I think there is a stronger element of American/Western-centrism involved. There is a great deal more information available on many of the American aboriginal languages, as well as Hawaiian (and other Polynesian languages along with it) than there is for many languages in parts of the East that someone with a Western or American cultural background is a little more isolated from. Use of the Roman alphabet, as opposed to very obscure, little-used alphabets, may play a role here too.

3. There are far more separate language families, once considered related but now considered distinct, than I had realized.

Kenneth Katzner, at least back in the 70’s, would have us believe that there is only one Caucasian family with Kartvelian, Northeastern Caucasian, and Northwestern Caucasian as only branches, that Turkic, Mongolic, and Tungusic are only branches of Altaic, that all Australian aboriginal languages are part of the same family, etc. A recurring theme throughout my journey, especially throughout the second half, was that what were once portrayed by the linguistics community as single families are now considered (according to general consensus among experts) to be a number of completely independent families. It seems, at least from my layman’s point of view, that the default has switched from believing that languages in geographic proximity tend to be related, to skepticism that a given group of languages is related to another group of languages; when such a relationship is not blatantly obvious, the burden of proof nowadays seems to be on the one who proposes a relationship.

4. Immense complexity and nuance can be found within languages in different ways in every part of the world.

This was less of a revelation to me than most of these other points, because I had mostly already progressed beyond this biased belief on starting the project, but looking so much more closely at far-from-European languages helped me to realize more profoundly than ever that the European families, home of such intricate and ornate grammars as those of Sanskrit and Russian, do not come anywhere close to holding a monopoly over grammatical complexity. One of the main counterexamples to this that I had understood clearly for many years prior is that of Hawaiian grammar distinguishing between two types of possession in a way that, to the best of my knowledge, no European language does. I had also been aware of the existence of polysynthetic grammars, most of which have their origins in the Americas, and which technically (by definition of polysynthesis) contain more inflection than the grammars of even languages like Sanskrit and Latin. And of course I had known something about the elaborate hierarchy of respectful registers woven into the grammar of Japanese (although I hadn’t understood its full extent). But I had no idea of the amounts of grammatical distinction of nuances that would never have occurred to me in a million years, the elaborate noun class systems, the multitude of dimensions of verbal inflection, the finely precise demonstrative pronoun systems, and the intimidating sound laws (and other rules) regulating combinations of morphemes attached to words, that I would encounter when exploring far beyond my own linguistic backyard.

The degree of subtle complexity humans are able to put into their spoken languages, as well as the great diversity of dimensions along which this complexity can appear, is astonishing, and it knows no geographic, racial, or ethnic bounds. The great diversity of mind-expanding subtleties found among the enormous swaths of rapidly-dying-out language groups definitely adds an extra element of tragedy to the rapid extinction of most of the world’s languages in favor of very few of the most widely-spoken ones.

And it brings me almost directly to my next point…

5. There are so many features of real languages that I had never known of or that I had heard about but never fully understood.

I mentioned just above that I had long known about the fact that multiple levels of respect can be conveyed in Japanese grammar, but I had never understood the full extent of it, namely that there are entire alternate sets of vocabulary used only in conversations meant to reflect certain relationships between the speakers.

I also mentioned that I knew that polysynthesis (inflection so extensive that entire sentences can be conveyed by a single word) was a thing, but I had always been confused about how such a thing could be executed (do you just stick the subject noun and object noun to a verb and arbitrarily call it a single word?). Now I understand and appreciate it so much better that I feel strongly inspired to incorporate it into a conlang.

Those are just two examples of linguistic phenomena that I knew something about, or thought I did, but understand far more deeply now in a way that makes me feel far more satisfied whenever I ponder them.

But there were plenty of other phenomena that I had never heard of before, many of which I would never have come up with even if I spent all of my time conlanging. Examples include

  • vast banks of personal pronouns which are not really pronouns at all but are most often kinship nouns
  • languages with almost no verbs (like 120 or so)
  • languages in which certain nouns (usually body part terms) are ungrammatical without a possessor
  • switch-reference markings on verbs
  • something called focus-marking prevalent in Austronesian languages
  • something called classificatory verbs found in certain Native American languages
  • special “cases” of personal pronouns indicating that they are modified by a number or specifying that they are alone (e.g. “he alone”)
  • and different speech registers requiring different sounds or vocabulary used between people with a particular familial relationship or only when performing a certain activity (this is very tied in with cultural norms in certain places I had been ignorant of).

6. My perception of particular features of languages being common/uncommon on a world scale is now greatly sharpened.

This is perhaps the deepest and most interesting to me of all of my reflections here.

Going in, I had thought I appreciated (as a native English speaker who had done some studying of other Indo-European languages) certain features of English that made it unusual among its neighbors (e.g. lack of grammatical gender, no second-person familiar/formal distinction, our “th” sounds); certain features of Indo-European languages that are fairly unusual elsewhere (e.g. grammatical gender, with Semitic languages being an exception) or not especially common elsewhere (e.g. gendered pronouns, definite articles); and features common to languages in other parts of the world but not near me (e.g. tone, polysynthesis). It turns out that my perspective was still very Indo-European-skewed, and now I feel that my eyes have been opened.

Here are just a few of the particular features whose prevalence I now understand to be much different from what I had thought.

  • It turns out that “f” and “v” sounds are absent from many entire families and geographic areas and seem to be just as rare around the world as English’s “th” sounds.
  • It turns out that noun inflection in Indo-European (several noun cases with intimidatingly complicated rules for inflection) and Uralic languages (well over a dozen noun cases but with simple rules) is almost as complex as it gets, but verbal inflection is far more complex in very many of the families in other parts of the world, even discounting polysynthesis.
  • At the same time, it turns out that verb conjugation for person and number of the subject, which I had thought of as a fairly European or Middle Eastern thing, is extremely common throughout language families all over the world, although many of these conjugate verbs for person and number of the object as well.
  • It turns out that what I had thought was a distinctive trait of both Uralic languages and Turkic languages, namely using the same affixes for person/number verb conjugation as for inflecting nouns to show possession (to the point of actually calling this the “most strikingly narrow of all” of the features common to the two families in my follow-up commentary on them), is so common that it could almost be considered the default for the world as a whole.
  • It turns out that elaborate systems of noun classes are much more than just an African thing; they can be found in indigenous languages in Australia and all over the Americas.
  • It turns out that while on average, the world’s languages are less consonant-heavy than the European ones we tend to study, the distinction for having the most severe consonant clusters goes not to the Caucasian languages but to the Salishan languages over in the northwestern part of North America.
  • It turns out that having no true adjectives but only stative verbs meaning “is-[adjective]” is not just something I dreamed up for conlangs but is reasonably common in the Americas.
  • And it turns out that a complete lack of inflection, while fairly uncommon, can be found much more diversely than just among the Chinese languages (in which Mandarin is often held up as the least inflected language in the world) and Polynesian languages.

7. There are way, way more widespread and distinctive areal features of unrelated languages than I had imagined.

If the last point I brought up is the one I could dive deepest into, this point is the one that slapped me in the face the hardest once I got into sets of smaller families that shared a geographic area.

My general conception of language evolution was, and mostly still is, similar to my conception of biological evolution. That is, languages split off directly from common ancestors and therefore have “genetic” relationships among them which usually take precedence over other influences, such as borrowing words or the spreading of particular features areally (which means that grammatical or phonological properties spread among genetically unrelated languages through their being spoken in the same geographic area). I had always seen areal features as sort of a secondary phenomenon, and the only really concrete examples I knew of were within the Indo-European family: the Balkan Sprachbund and the fact that Germanic and Romance languages have developed the common innovation of using the verb “to have” to form present perfects (not to mention using a similar but non-cognate word for “have”, e.g. German haben, Spanish haber!).

It turns out that areal features account for way, way, way more of the easily visible characteristics common to certain groups of languages, which are spoken in certain parts of the world, than I would have imagined! This was not so apparent when I had studied only a few large, geographically widespread families, but it became abundantly apparent when I started to look at clusters of tiny language families scattered within fairly narrow zones of the world. Examples include

  • African and Asian languages being more tonal in general than other languages
  • Asian languages using a generous bank of kinship nouns in place of personal pronouns
  • Papuan languages having rather vowel-y phonologies with few consonants and consonant clusters, much like their Polynesian neighbors
  • Australian languages having few vowel phonemes, few or no fricative sounds, and no voice distinction in their consonant systems
  • many Australian languages having extremely few verbs
  • languages all the way up and down the Americas being polysynthetic (that’s not even a narrow geographic region!)
  • American languages having the quirk referred to above where certain nouns cannot appear in a non-possessed form
  • North American languages loving uvular stops
  • and certain word orders (e.g. Subject-Object-Verb) and “inclusive/exclusive we” distinction or lack thereof being prevalent in certain areas.

8. I feel profoundly more knowledgeable about the diversity of natural languages as a result of this project.

I set out to actually immerse myself a bit into the languages of this wide variety of language families, as well as one can by staying home in front of one’s laptop. And it worked to my satisfaction. I now feel as though I actually traveled somewhere (even if somewhat on an abstract and intellectual level). I actually feel more worldly now. Getting to tour the vast expanses of varieties of verbal communication that our species has come up with has been a treat which I recommend to anyone with a passion for linguistics similar to mine.

In short, I definitely got what I wanted out of this project.

My diagram of the full forest of languages

I wanted to assemble all the “trees” of languages that I had plotted out in a more aesthetically pleasing way than as they appeared in microblog posts and in a way that clearly showed the relationships and degrees of closeness between them, so after some hours of fiddling around, I figured out how to plot a group of trees (a “forest”) using the LaTeX package tikz-qtree. I didn’t feel the need to include links in the diagram as I had in the rough microblog posts; each and every one of these languages (and all of the language families, and most of the branches) has a page on Wikipedia which was always my starting point for research. The result is below.

But first, here are a few words of explanation about how these plots are displayed. Each of the trees has five “levels” going from left to right. The first, leftmost level shows the “trunks”, which represent each language family. The second level corresponds to main branches within families, the third level to subbranches, and only in the fifth, rightmost level do I put individual languages. The fourth level is used for multiple purposes in this graphic: it may correspond to some sub-subbranches, or it may be where I put older forms of modern languages (e.g. Old English). The evolution of old forms to modern forms is indicated by a double-headed arrow pointing right. There is a sort of fifth-and-a-half level for pidgin/creole languages that developed from other modern languages; a single-headed arrow is used to indicate where a pidgin/creole has sprouted from another language (or languages). (There is also an exception to the usual structure of levels in the Germanic languages section, where there are some more complicated subdivisions that I, perhaps from a place of bias, felt were important to show.)

There should be a very rough sense that going from left to right represents time moving forward, but I don’t claim that this is consistent: in particular, older forms of languages shown in the fourth level did not necessarily exist during the same period.

A dotted line/edge indicates a tenuous, highly disputed relationship. I do not claim consistency in deciding fairly whether a hypothesis that any two particular groups are connected has enough clout in the linguistic community to be shown with a dotted line on the map. There are so many hypothesized relationships between various families that to attempt to show them on the map would create a ridiculous amount of clutter and would ultimately be futile. More in general, there may be some mistakes with this graphic, or aspects of it which prove in the future to be “mistakes” after more research, so I’ve decided to treat this as a “living figure” and feel entitled to quietly edit it whenever and wherever appropriate.
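For readers curious about the mechanics, here is a minimal sketch of how conventions like these can be expressed with tikz-qtree. It is not the source of the actual diagram. The spacing values, the arrow tip `->>` for "old form evolves into modern form", and the `dotted` edge style are assumptions about one reasonable way to draw them, and the "Disputed grouping", "Family A", and "Family B" labels are hypothetical placeholders. The Germanic example is simplified and ignores the special subdivisions mentioned above.

```latex
\documentclass{article}
\usepackage{tikz}
\usepackage{tikz-qtree}

\begin{document}
\begin{tikzpicture}
  % Grow left to right so that each level of the tree becomes a column.
  % (Assumed settings; the distances are arbitrary.)
  \tikzset{grow'=right, level distance=3.2cm}
  \tikzset{every tree node/.style={anchor=base west}}

  % Five levels: family, branch, subbranch, older form, modern language.
  % Labels containing spaces must be wrapped in braces.
  \Tree [.Indo-European
          [.Germanic
            [.{West Germanic}
              [.{Old English}
                \edge[->>]; English ] ] ] ]

  % A second tree in the same picture makes it a "forest".
  % Dotted edges mark a tenuous, disputed relationship (placeholder names).
  \begin{scope}[yshift=-2cm]
    \Tree [.{Disputed grouping}
            \edge[dotted]; {Family A}
            \edge[dotted]; {Family B} ]
  \end{scope}
\end{tikzpicture}
\end{document}
```

Each `\edge[...];` applies only to the edge leading to the node that follows it, so arrows and dotted lines can be mixed freely with ordinary edges within one tree.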

Without further ado, enjoy my map of the spoken languages of the world! (For technical reasons owing to the absurd vertical length of the graphics, I had to upload this as three separate images.)

