17 November 2009

K-means vs GMM, sum-product vs max-product

I finished K-means and Gaussian mixture models in class last week or maybe the week before. I've previously discussed the fact that these two are really solving different problems (despite being structurally so similar), but today's post is about something different.

There are two primary differences between the typical presentation of K-means and the typical presentation of GMMs. (I say "typical" because you can modify these algorithms fairly extensively as you see fit.) The first difference is that GMMs just have more parameters. The parameters of K-means are typically the cluster assignments ("z") and the means ("mu"). The parameters of a GMM are typically these (z and mu) as well as the class prior probabilities ("pi") and cluster covariances ("Sigma"). The GMM model is just richer. Of course, you can restrict it so all clusters are isotropic and all prior probabilities are even, in which case you've effectively removed this difference (or you can add these things into K-means). The second difference is that GMMs operate under the regime of "soft assignments," meaning that points aren't wed to clusters: they only prefer (in a probabilistic sense) some clusters to others. This falls out naturally from the EM formalization, where the soft assignments are simply the expectations of the assignments under the current model.

One can get rid of the second difference by running "hard EM" (also called "Viterbi EM" in NLP land), where the expectations are clamped at their most likely value. This leads to something that has much more of a K-means feel.

This "real EM" versus "hard EM" distinction comes up a lot in NLP, where computing exact expectations is often really difficult. (Sometimes you get complex variants, like the "pegging" approaches in the IBM machine translation models, but my understanding from people who run in this circle is that pegging is much ado about nothing.) My general feeling has always been "if you don't have much data, do real EM; if you have tons of data, hard EM is probably okay." (This is purely from a practical perspective.) The idea is that when you have tons and tons of data, you can approximate expectations reasonably well by averaging over many data points. (Yes, this is hand-wavy and it's easy to construct examples where it fails. But it seems to work many times.) Of course, you can get pedantic and say "hard EM sucks: it's maximizing p(x,z) but I really want to maximize p(x)" to which I say: ho hum, who cares, you don't actually care about p(x), you care about some extrinsic evaluation metric which, crossing your fingers, you hope correlates with p(x), but for all I know it correlates better with p(x,z).

Nevertheless, a particular trusted friend has told me he's always remiss when he can't do full EM and has to do hard EM: he's never seen a case where it doesn't help. (Or maybe "rarely" would be more fair.) Of course, this comes at a price: for many models, maximization (search) can be done in polynomial time, but computing expectations can be #P-hard (basically because you have to enumerate -- or count -- over every possible assignment).

Now let's think about approximate inference in graphical models. Let's say I have a graphical model with some nodes I want to maximize over (call them "X") and some nodes I want to marginalize out (call them "Z"). For instance, in GMMs, the X nodes would be the means, covariances and cluster priors; the Z nodes would be the assignments. (Note that this is departing slightly from typical notation for EM.) Suppose I want to do inference in such a model. Here are three things I can do:

  1. Just run max-product. That is, maximize p(X,Z) rather than p(X).
  2. Just run sum-product. That is, compute expectations over X and Z, rather than just over Z.
  3. Run EM, by alternating something like sum-product on Z and something like max-product onX.
Of these, only (3) is really doing the "right thing." Further, let's get away from the notion of p(X) not correlating with some extrinsic evaluation by just measuring ourselves against exact inference. (Yes, this limits us to relatively small models with 10 or 20 binary nodes.)

What do you think happens? Well, first, things vary as a function of the number of X nodes versus Z nodes in the graph.

When most of the nodes are X (maximization) nodes, then max-product does best and EM basically does the same.

Whe most of the nodes are Z (marginalization) nodes, then EM does best and sum-product does almost the same. But max product also does almost the same.

This is an effect that we've been seeing regularly, regardless of what the models look like (chains or lattices), what the potentials look like (high temperature or low temperature) and how you initialize these models (eg., in the chain case, EM converges to different places depending on initialization, while sum- and max-product do not). Max product is just unbeatable.

In a sense, from a practical perspective, this is nice. It says: if you have a mixed model, just run max product and you'll do just as well as if you had done something more complicated (like EM). But it's also frustrating: we should be getting some leverage out of marginalizing over the nodes that we should marginalize over. Especially in the high temperature case, where there is lots of uncertainty in the model, max product should start doing worse and worse (note that when we evaluate, we only measure performance on the "X" nodes -- the "Z" nodes are ignored).

Likening this back to K-means versus GMM, for the case where the models are the same (GMM restricted to not have priors or covariances), the analogy is that as far as the means go, it doesn't matter which one you use. Even if there's lots of uncertainty in the data. Of course, you may get much better assignments from GMM (or you may not, I don't really know). But if all you really care about at the end of the day are the Xs (the means), then our experience with max-product suggests that it just won't matter. At all. Ever.

Part of me finds this hard to believe, and note that I haven't actually run experiments with K-means and GMM, but the results in the graphical model cases are sufficiently strong and reproducible that I'm beginning to trust them. Shouldn't someone have noticed this before, though? For all the effort that's gone into various inference algorithms for graphical models, why haven't we ever noticed that you just can't beat max-product?

(Yes, I'm aware of some theoretical results, eg., the Wainwright result that sum-product + randomized rounding is a provably good approximation to the MAP assignment, but this result actually goes the other way, and contradicts many of our experimental studies where sum-product + rounding just flat out sucks. Maybe there are other results out there that we just haven't been able to dig up.)

07 November 2009

NLP as a study of representations

Ellen Riloff and I run an NLP reading group pretty much every semester. Last semester we covered "old school NLP." We independently came up with lists of what we consider some of the most important ideas (idea = paper) from pre-1990 (most are much earlier) and let students select which to present. There was a lot of overlap between Ellen's list and mine (not surprisingly). If people are interested, I can provide the whole list (just post a comment and I'll dig it up). The whole list of topics is posted as a comment. The topics that were actually selected are here.

I hope the students have found this exercise useful. It gets you thinking about language in a way that papers from the 2000s typically do not. It brings up a bunch of issues that we no longer think about frequently. Like language. (Joking.) (Sort of.)

One thing that's really stuck out for me is how much "old school" NLP comes across essentially as a study of representations. Perhaps this is a result of the fact that AI -- as a field -- was (and, to some degree, still is) enamored with knowledge representation problems. To be more concrete, let's look at a few examples. It's already been a while since I read these last (I had meant to write this post during the spring when things were fresh in my head), so please forgive me if I goof a few things up.

I'll start with one I know well: Mann and Thompson's rhetorical structure theory paper from 1988. This is basically "the" RST paper. I think that when a many people think of RST, they think of it as a list of ways that sentences can be organized into hierarchies. Eg., this sentence provides background for that one, and together they argue in favor of yet a third. But this isn't really where RST begins. It begins by trying to understand the communicative role of text structure. That is, when I write, I am trying to communicate something. Everything that I write (if I'm writing "well") is toward that end. For instance, in this post, I'm trying to communicate that old school NLP views representation as the heart of the issue. This current paragraph is supporting that claim by providing a concrete example, which I am using to try to convince you of my claim.

As a more detailed example, take the "Evidence" relation from RST. M+T have the following characterization of "Evidence." Herein, "N" is the nucleus of the relation, "S" is the satellite (think of these as sentences), "R" is the reader and "W" is the writer:

relation name: Evidence
constraints on N: R might not believe N to a degree satisfactory to W
constraints on S: R believes S or will find it credible
constraints on N+S: R's comprehending S increases R's belief of N
the effect: R's belief of N is increased
locus of effect: N
This is a totally different way from thinking about things than I think we see nowadays. I kind of liken it to how I tell students not to program. If you're implementing something moderately complex (say, forward/backward algorithm), first write down all the math, then start implementing. Don't start implementing first. I think nowadays (and sure, I'm guilty!) we see a lot of implementing without the math. Or rather, with plenty of math, but without a representational model of what it is that we're studying.

The central claim of the RST paper is that one can think of texts as being organized into elementary discourse units, and these are connected into a tree structure by relations like the one above. (Or at least this is my reading of it.) That is, they have laid out a representation of text and claimed that this is how texts get put together.

As a second example (this will be sorter), take Wendy Lehnert's 1982 paper, "Plot units and narrative summarization." Here, the story is about how stories get put together. The most interesting thing about the plot units model to me is that it breaks from how one might naturally think about stories. That is, I would naively think of a story as a series of events. The claim that Lehnert makes is that this is not the right way to think about it. Rather, we should think about stories as sequences of affect states. Effectively, an affect state is how a character is feeling at any time. (This isn't quite right, but it's close enough.) For example, Lehnert presents the following story:
When John tried to start his care this morning, it wouldn't turn over. He asked his neighbor Paul for help. Paul did something to the carburetor and got it going. John thanked Paul and drove to work.
The representation put forward for this story is something like: (1) negative-for-John (the car won't start), which leads to (2) motivation-for-John (to get it started, which leads to (3) positive-for-John (it's started), when then links back and resolves (1). You can also analyze the story from Paul's perspective, and then add links that go between the two characters showing how things interact. The rest of the paper describes how these relations work, and how they can be put together into more complex event sequences (such as "promised request bungled"). Again, a high level representation of how stories work from the perspective of the characters.

So now I, W, hope that you, R, have an increased belief in the title of the post.

Why do I think this is interesting? Because at this point, we know a lot about how to deal with structure in language. From a machine learning perspective, if you give me a structure and some data (and some features!), I will learn something. It can even be unsupervised if it makes you feel better. So in a sense, I think we're getting to a point where we can go back, look at some really hard problems, use the deep linguistic insights from two decades (or more) ago, and start taking a crack at things that are really deep. Of course, features are a big problem; as a very wise man once said to me: "Language is hard. The fact that statistical association mining at the word level made it appear easy for the past decade doesn't alter the basic truth. :-)." We've got many of the ingredients to start making progress, but it's not going to be easy!

06 November 2009

Getting Started In: Bayesian NLP

This isn't so much a post in the "GSI" series, but just two links that recently came out. Kevin Knight and Philip Resnik both just came out with tutorials for Bayesian NLP. They're both excellent, and almost entirely non-redundant. I highly recommend reading both. And I thank Kevin and Philip from the bottom of my heart, since I'd been toying with the idea of writing such a thing (for a few years!) and they've saved me the effort. I'd probably start with Kevin's and then move on to Philip's (which is more technically meaty), but either order is really fine.

Thanks again to both of them. (And if you haven't read Kevin's previous workbook on SMT -- which promises free beer! -- I highly recommend that, too.)