From ja-goldsmith@uchicago.edu Tue Oct 31 09:21:45 2000
Date: Sun, 29 Oct 2000 13:58:43 -0800
From: John Goldsmith
To: tony@math.sunysb.edu
Subject: RE: information on information

Just a follow-up note to the previous one, which was terminated on account of getting lunch for the kids. I'm attaching a graph of the same information as in the previous message. What this little bit of analysis suggests is that there is relatively little variation in information per letter.

There are a couple of reasons to not be very interested in these results. The first is that we'd like to see samples from languages with more variety, and I just don't have corpora on hand, but I'll look into it. Another, more interesting reason is one that you will have noticed if you had a chance to look at the paper I pointed you to, which is that information in words is spread over morphemes, and a model which doesn't take that into account (such as a model in which probability is based on the preceding letter) is missing a good deal of the structure of human language.

In your message below, you mention syllables, and while syllables are certainly important in language, I suspect that bringing syllables into our model will have relatively little impact on the results that we get (because the easier models are based on looking at the one or two preceding letters as the conditioning factor; and knowing the syllable boundary will never improve that model in the case where you're looking at two preceding letters; it will occasionally improve the case where you're conditioning on only one preceding letter).

Coming back to the main point, quantifying information is always a matter of going back to Boltzmann's insight that entropy is the log of the number of possible states, and in language, there are always multiple ways of conceiving of the number of states.
If we think about things phonologically (or orthographically), then the state space is the space of sequences of letters, whereas if we think of things in terms of morphemes, the state space (for individual words) consists of a single dimension over which you select one stem, and then a more complex set of alternatives for suffixes (and occasionally the odd prefix). So a large part of the morphological information is bound up in (is quantified by) the entropy of the set of stems of the language.

This gets to the heart of the point about whether one would speak slower or faster in a language in which the information was denser or less dense: the phonological information is by no means the same as the morphological information -- which in turn is not the same as the syntactic information, or the semantic information. About those last two we have less understanding at this point, but about the phonological and morphological we have some grasp.

What do you have in mind to write about? What got you started in this area?

best,
John Goldsmith

-----Original Message-----
From: tony@math.sunysb.edu [mailto:tony@math.sunysb.edu]
Sent: Thursday, October 26, 2000 5:15 PM
To: ja-goldsmith@uchicago.edu; tony@math.sunysb.edu
Subject: information on information

Hello John

I came across your Royaumont article while compiling a list of web resources on information theory to put with my web column this month. I'm planning it to be "The Mathematics of Communication" (to appear in http://www.ams.org/new-in-math which I edit and usually write).

I have been interested in natural languages all my life and thought they could come together with my love of mathematics during those crazy days in the 50s when I worked with Yngve & Co. on MT as an MIT undergrad. I was soon disabused (by Lees himself who took me aside one day -during my summer job at IBM- and said, as I remember it, that there was no real hidden math and that MT was a chimera).
But I did learn about information theory and that stuck with me. They had me cook up an optimal code for Russian. In those days IBM had a contract with the Air Force to provide a hard-wired Russian-English translating device.

What I'm hoping you can do for me, and soon if possible, is let me know if there have been any useful studies of relative information content, say of syllables, across languages. For example, since Mandarin has a relatively small set of possible syllables (even counting tones), compared to English, one might think that the information per syllable must be lower in Mandarin, and that Mandarin speakers could/would speak more quickly and still be understood. Or, would have to speak more quickly to transmit the same information in the same amount of time. My personal axiom is that all spoken languages are equally efficient, but this may be wrong. Does anyone know one way or the other?

I mean efficient in general. Clearly some particular things are more pithily expressed in one language than in another.

Here's the kind of joke we used to tell. "The most interesting thing about any language is the way it resembles Russian."

Tony Phillips

[Part 2, Application/X-MSEXCEL 18KB]
[Unable to print this part]

From ja-goldsmith@uchicago.edu Tue Oct 31 09:22:20 2000
Date: Sun, 29 Oct 2000 09:33:24 -0800
From: John Goldsmith
To: Tony Phillips
Subject: RE: information on information

I ran the following numbers on a few languages, and am enclosing results. I measured the letter-entropy on a corpus, and then the bigram (2-letter sequence) entropy. The difference between these is the conditional entropy, that is, the weighted average of the entropy of a letter given the preceding letter. That should be a reasonable measure of how much information each letter provides. Then I multiplied by the average number of letters per word. Now we need to factor in the average number of words per sentence, but I haven't done that yet.
(KiRundi is a Bantu language, as is Swahili.) If you have trouble with an Excel attachment, let me know.

best,
John

-----Original Message-----
From: Tony Phillips [mailto:tony@math.sunysb.edu]
Sent: Thursday, October 26, 2000 5:40 PM
To: John Goldsmith
Cc: Tony Phillips
Subject: RE: information on information

Wow. Thanks for your speedy answer. I'll look up the paper you mention and I'll be grateful for more if you can send it.

Tony

[Part 2, Application/X-MSEXCEL 18KB]
[Unable to print this part]
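
Note on the calculation described in the 29 Oct 2000 09:33 message: the steps given there (letter entropy, bigram entropy, their difference as the conditional entropy, then multiplication by the average number of letters per word) can be sketched in Python roughly as below. This is only a minimal sketch under stated assumptions, not the procedure actually used for the attached spreadsheet. The names `entropy` and `letter_information`, the lowercasing, and the choice to count the space between words as a symbol are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy, in bits, of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def letter_information(text):
    """Return letter, bigram and conditional entropy for a corpus string."""
    symbols = text.lower()  # assumption: case is ignored
    words = symbols.split()
    if len(symbols) < 2 or not words:
        raise ValueError("need at least two symbols and one word")

    # Count adjacent pairs, and count single symbols over the same positions
    # (every symbol that starts a pair), so that the subtraction below is
    # exactly H(pair) - H(first symbol).
    pairs = Counter(zip(symbols, symbols[1:]))
    firsts = Counter(symbols[:-1])

    h_letter = entropy(firsts)
    h_bigram = entropy(pairs)
    h_conditional = h_bigram - h_letter  # entropy of a symbol given the previous one

    letters_per_word = sum(len(w) for w in words) / len(words)
    return {
        "letter": h_letter,
        "bigram": h_bigram,
        "conditional": h_conditional,
        "letters_per_word": letters_per_word,
        "bits_per_word": h_conditional * letters_per_word,
    }


if __name__ == "__main__":
    # Hypothetical toy input; a real run would read a corpus file instead.
    sample = "the quick brown fox jumps over the lazy dog"
    for name, value in letter_information(sample).items():
        print(f"{name}: {value:.3f}")
```

On a real corpus the figures will depend on how the text is cleaned (punctuation, digits, word boundaries), so values from this sketch should not be expected to match the spreadsheet.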