21 July 2009

Non-parametric as memorizing, in exactly the wrong way?

There is a cool view of the whole non-parametric Bayes thing that I think is very instructive. It's easiest to see in the case of the Pitman-Yor language modeling work by Frank Wood and Yee Whye Teh. The view is "memorize what you can, and back off (to a parametric model) when you can't." This is basically the "backoff" view... where NP Bayes comes in is in controlling the whole "what you can" aspect. In other words, if you're doing language modeling, try to memorize two grams; but if you haven't seen enough to be confident in your memorization, back off to one grams; and if you're not confident there, back off to a uniform distribution (which is our parametric model -- the base distribution).
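
To make this concrete, here's a minimal sketch of the backoff idea (the counts and the "confident enough" threshold are made up, and this is plain count thresholding, not the actual hierarchical Pitman-Yor model):

```python
from collections import defaultdict

# Toy backoff: use the bigram estimate if there's enough evidence to
# "memorize" it, otherwise the unigram estimate, otherwise a uniform base
# distribution. All counts and the threshold are made up for illustration.
bigram_counts = defaultdict(int, {("the", "cat"): 12, ("the", "dog"): 7})
unigram_counts = defaultdict(int, {"the": 100, "cat": 15, "dog": 9})
vocab_size = 10000
min_evidence = 5   # hypothetical "confident enough to memorize" threshold

def prob(word, prev):
    bigram_total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    if bigram_counts[(prev, word)] >= min_evidence:
        return bigram_counts[(prev, word)] / bigram_total   # memorized two gram
    if unigram_counts[word] >= min_evidence:
        return unigram_counts[word] / sum(unigram_counts.values())   # back off to one gram
    return 1.0 / vocab_size   # back off to the uniform base distribution

print(prob("cat", "the"))   # uses the memorized two gram
print(prob("emu", "the"))   # falls all the way back to the base distribution
```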

Or, if you think about the state-splitting PCFG work (done both at Berkeley and at Stanford), basically what's going on is that we're memorizing as many differences of tree labels as we can, and then backing off to the "generic" label when memorization fails. Or if we look at Trevor Cohn's NP-Bayes answer to DOP, we see a similar thing: memorize big tree chunks, but if you can't, fall back to a simple CFG (our parametric model).

Now, the weird thing is that this mode of memorization is kind of backwards from how I (as an outsider) typically interpret cognitive models (any cogsci people out there, feel free to correct me!). If you take, for instance, morphology, there's evidence that this is exactly not what humans do. We (at least from a popular-science perspective) basically memorize simple rules and then remember exceptions. That is, we remember that to make the past tense of a verb, we add "-ed" (the sound, not the characters), but for certain verbs we don't: go/went, do/did, etc. You do little studies where you ask people to inflect fake words and they generally follow the rule, not the exceptions (but see * below).
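
Concretely, the rule-plus-exceptions story looks something like this (a toy sketch; the exception list is just a few illustrative entries, not a real lexicon, and I'm ignoring spelling rules entirely):

```python
# Toy "rule plus exceptions" past-tense model: check the memorized exception
# list first, otherwise apply the default -ed rule. The entries are purely
# illustrative.
EXCEPTIONS = {"go": "went", "do": "did", "sing": "sang", "eat": "ate"}

def past_tense(verb):
    return EXCEPTIONS.get(verb, verb + "ed")

print(past_tense("talk"))    # talked  (rule)
print(past_tense("go"))      # went    (memorized exception)
print(past_tense("blick"))   # blicked (a fake word follows the rule)
```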

If NP Bayes had its way on this problem (or at least if the standard models I'm familiar with had their way), they would memorize "talk" -> "talked" and "look" -> "looked" and so on because they're so familiar. Sure, it would still memorize the exceptions, but it would also memorize the really common "rule" cases too... why? Because of course it could fall back to the parametric model, but these are so common that the standard models would really like to take advantage of the rich-get-richer phenomenon on things like DPs, thus saving themselves cost by memorizing a new "cluster" for each common word. (Okay, this is just my gut feeling about what such models would do, but I think it's at least defensible.) Yes, you could turn the DP "alpha" parameter down, but I guess I'm just not convinced this would do the right thing. Maybe I should just implement such a beast but, well, this is a blog post, not a *ACL paper :P.
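
To see the rich-get-richer dynamic I'm worried about, here's a tiny Chinese restaurant process simulation (just the seating dynamics with an arbitrary alpha, not any of the actual published models):

```python
import random

# Chinese restaurant process: each new item joins an existing cluster with
# probability proportional to that cluster's size (rich get richer), or
# starts a new cluster with probability proportional to alpha (falling back
# to the base distribution). Alpha and the seed are arbitrary.
def crp_cluster_sizes(n, alpha, seed=0):
    rng = random.Random(seed)
    sizes = []
    for _ in range(n):
        r = rng.uniform(0, sum(sizes) + alpha)
        for i, size in enumerate(sizes):
            if r < size:
                sizes[i] += 1     # re-use ("memorize more of") a big cluster
                break
            r -= size
        else:
            sizes.append(1)       # new cluster, drawn from the base
    return sizes

# A handful of clusters end up huge: the model happily memorizes the common stuff.
print(sorted(crp_cluster_sizes(1000, alpha=1.0), reverse=True)[:10])
```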

Take as an alternative example the language modeling stuff. Basically what it says is "if you have enough data to substantiate memorizing a 5 gram, you should probably memorize a 5 gram." But why? If you could get the same effect with a 2 or 3 gram, why waste the storage/time?!

I guess you could argue "your prior is wrong," which is probably true for most of these models. In which case I guess the question is "what prior does what I want?" I don't have a problem with rich get richer -- in fact, I think it's good in this case. I also don't have a problem with a logarithmic growth rate in the number of exceptions (though I'd be curious how this holds up empirically -- in general, I'm a big fan of checking if your prior makes sense; for instance, Figure 3 (p16) of my supervised clustering paper). I just don't like the notion of memorizing when you don't have to.
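
As for checking the prior: under a DP with concentration alpha, the expected number of clusters ("memorized exceptions") after n draws is sum_i alpha/(alpha+i), which grows roughly like alpha*log(1 + n/alpha). Here's a quick check of that logarithmic growth (the alpha value is arbitrary):

```python
import math

# Expected number of DP/CRP clusters after n draws, compared with the
# logarithmic approximation alpha*log(1 + n/alpha). Alpha is arbitrary.
def expected_clusters(n, alpha):
    return sum(alpha / (alpha + i) for i in range(n))

alpha = 5.0
for n in (100, 1000, 10000):
    print(n, round(expected_clusters(n, alpha), 1),
          round(alpha * math.log(1 + n / alpha), 1))
```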

(*) I remember back in grad school a linguist from Yale came and gave a talk at USC. Sadly, I can't remember who it was: if anyone wants to help me out, I'd really appreciate it! The basic claim of the talk is that humans actually memorize a lot more than we give them credit for. The argument was in favor of humans basically memorizing all morphology and not backing off to rules like "add -ed." One piece of evidence in favor of this was timing information for asking people to inflect words: the timing seemed to indicate a linear search through a long list of words that could possibly be inflected. I won't say much more about this because I'm probably already misrepresenting it, but it's an interesting idea. And, if true, maybe the NP models are doing exactly what they should be doing!

05 July 2009

Small changes beget good or bad examples?

If you compare vision research with NLP research, there are a lot of interesting parallels. Like we both like linear models. And conditional random fields. And our problems are a lot harder than binary classification. And there are standard data sets that we've been evaluating on for decades and continue to evaluate on (I'm channeling Bob here :P).

But there's one thing that happens, the difference of which is so striking, that I'd like to call it to center stage. It has to do with "messing with our inputs."

I'll spend a bit more time describing the vision approach, since it's probably less familiar to the average reader. Suppose I'm trying to do handwriting recognition to identify digits from zero to nine (aka MNIST). I get, say, 100 labeled zeros, 100 labeled ones, 100 labeled twos and so on. So a total of 1000 data points. I can train any off the shelf classifier based on pixel level features and get some reasonable performance (maybe 80s-90s, depending).

Now, I want to insert knowledge. The knowledge that I want to insert is some notion of invariance. I.e., if I take an image of a zero and translate it left a little bit, it's still a zero. Or up a little bit. Or if I scale it up 10%, it's still a zero. Or down 10%. Or if I rotate it five degrees. Or negative five. All zeros. The same holds for all the other digits.

One way to insert this knowledge is to muck with the learning algorithm. That's too complicated for me: I want something simpler. So what I'll do is take my 100 zeros and 100 ones and so on and just manipulate them a bit. That is, I'll sample a random zero, and apply some small random transformations to it, and call it another labeled example, also a zero. Now I have 100,000 training points. I train my off the shelf classifier based on pixel level features and get 99% accuracy or more. The same trick works for other vision problems (e.g., recognizing animals). (This process is so common that it's actually described in Chris Bishop's new-ish PRML book!)
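
In code, the augmentation trick looks something like this (a sketch using a random placeholder image rather than real MNIST data; assumes numpy and scipy are available):

```python
import numpy as np
from scipy import ndimage

# Take one labeled image and generate several slightly perturbed copies
# (small shifts and rotations), all keeping the same label. The "image"
# here is a random placeholder standing in for a real labeled digit.
rng = np.random.default_rng(0)
image = rng.random((28, 28))
label = 0

augmented = []
for _ in range(10):
    dx, dy = rng.integers(-2, 3, size=2)     # translate by a pixel or two
    angle = rng.uniform(-5, 5)               # rotate by a few degrees
    perturbed = ndimage.shift(image, (dy, dx), mode="nearest")
    perturbed = ndimage.rotate(perturbed, angle, reshape=False, mode="nearest")
    augmented.append((perturbed, label))     # still labeled as a zero

print(len(augmented), "extra training points from one original")
```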

This is what I mean by small changes (to the input) begetting good examples. A slightly transformed zero is still a zero.

Of course, you have to be careful. If you rotate a six by 180 degrees, you get a nine. If you rotate a cat by 180 degrees, you get an unhappy cat. More seriously, if you're brave, you might start looking at a class of transformations called diffeomorphisms, which are fairly popular around here. These are nice because of their nice mathematical properties, but un-nice because they can be slightly too flexible for certain problems.

Now, let's go over to NLP land. Do we ever futz with our inputs?

Sure!

In language modeling, we'll sometimes permute words or replace one word with another to get a negative example. Noah Smith futzed with his inputs in contrastive estimation to produce negative examples by swapping adjacent words, or deleting words.
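
That kind of neighborhood generation looks roughly like this (a sketch of the idea, not Noah's actual implementation):

```python
# Generate "negative" neighbors of a sentence by swapping adjacent words or
# deleting a single word, in the spirit of contrastive estimation.
def neighbors(words):
    negs = []
    for i in range(len(words) - 1):   # swap adjacent words
        negs.append(words[:i] + [words[i + 1], words[i]] + words[i + 2:])
    for i in range(len(words)):       # delete one word
        negs.append(words[:i] + words[i + 1:])
    return negs

for neg in neighbors("the dog bit the man".split()):
    print(" ".join(neg))
```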

In fact, try as I might, I cannot think of a single case in NLP where we make small changes to an input to get another good input: we always do it to get a bad input!

In a sense, this means that one thing that vision people have that we don't have is a notion of semantics preserving transformations. Sure, linguists (especially those from that C-guy) study transformations. And there's a vague sense that work in paraphrasing leads to transformations that maintain semantic equivalence. But the thing is that we really don't know any transformations that preserve semantics. Moreover, some transformations that seem benign (e.g., passivization) actually are not: one of my favorite papers at NAACL this year, by Greene and Resnik, showed that syntactic structure affects sentiment (well, they did, drawing on a lot of psycholinguistics work)!

I don't have a significant point to this story other than it's kind of weird. I mentioned this to some people at ICML and got a reaction that replacing words with synonyms should be fine. I remember doing this in high school, when word processors first started coming with thesauri packed in. The result seemed to be that if I actually knew the word I was plugging in, life was fine... but if not, it was usually a bad replacement. So this seems like something of a mixed bag: depending on how liberal you are with defining "synonym" you might be okay doing this, but you might also not be.
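
If you want to see why it's a mixed bag, here's a quick sketch that pulls candidate "synonyms" out of WordNet (assumes NLTK is installed and can fetch its WordNet data); the lists it returns are liberal enough that some replacements would clearly change the meaning:

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time data download
from nltk.corpus import wordnet as wn

# Collect every lemma from every synset of a word. WordNet's notion of
# "synonym" is fairly liberal, so some candidates are fine replacements
# and others really are not.
def candidate_synonyms(word):
    names = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower():
                names.add(name)
    return sorted(names)

print(candidate_synonyms("good")[:10])   # some fine, some decidedly not
```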