26 August 2006

NLG on NPR

I heard a story on NPR while driving today that referenced a recent Financial Times story about the use of natural language generation technology for producing financial stories. The financial company, Thomson, is now generating some of its stories using NLG. The NLG technology, based on the little bit of information I gathered from the NPR story and from reading the associated article, seems to be straightforward template filling. This is not terribly surprising, given the formulaic nature of financial articles (grab any such article from the WSJ and you'll see things like "The Dow went up ____ percent (___ points) because of ___."). Most of this can be easily gathered from databases.
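Just to illustrate what I mean by template filling, here is a toy sketch in Python (the field names and phrasing are mine, not Thomson's):

  # Toy illustration of template-filled financial prose; the fields and
  # wording are made up for the sake of the example.
  def fill_story(index, pct_change, point_change, reason):
      direction = "went up" if pct_change >= 0 else "went down"
      return (f"The {index} {direction} {abs(pct_change):.1f} percent "
              f"({abs(point_change):.0f} points) because of {reason}.")

  print(fill_story("Dow", 0.8, 92, "strong trading"))
  # The Dow went up 0.8 percent (92 points) because of strong trading.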

One might suspect that the final "___" (the reason) would be hard to come up with automatically, but apparently this is highly standardized as well. According to the NPR story following the one on the NLG technology, the set of reasons that fill that blank is nearly a closed class ("profits" or "middle east" or "terrorism" or "bin Laden videotape" or "strong trading" or ...). And, based on some hearsay evidence, the reasons are largely bogus anyway (this is not surprising: summarizing one billion trades as being driven by a bin Laden videotape is highly suspect).

An amusing comment was made when the journalist at NPR asked the Thomson rep if his job was in danger. The response was that it would be, but only if he believed that his abilities topped out at filling out templates.

Anyway, if anyone knows more about the technology, and if there's anything interesting in it, it would be fun to know. Beyond that, I just thought it was fun that NLP technology made a spot on NPR.

25 August 2006

Doing Named Entity Recognition? Don't optimize for F1

(Guest post by Chris Manning. Thanks Chris!)

Among ML-oriented nlpers, using a simple F1 of precision and recall is the standard way to evaluate Named Entity Recognition. Using F1 seems familiar and comfortable, but I think most nlpers haven't actually thought through the rather different character that the F1 measure takes on when applied to evaluating sequence models. It's not just that it's a type 4 loss (a simple, intuition-driven measure like accuracy): In most cases such measures are reasonable enough for what they are, but using F1 for NER has an under-appreciated dysfunctional character. You wouldn't want to optimize for it!

This post explains what I was really thinking about when I made the comment that Hal referred to previously (fortunately, I didn't try to say all this after the talk!). I agree with Hal: the paper was a very nice piece of work technically. I just think that the authors, Jun Suzuki et al., chose a bad peak to climb.

Everyone is familiar with the F1 measure for simple classification decisions. You draw a 2x2 contingency table of whether something should be yes/no, and whether the system guessed yes/no, and then calculate the harmonic mean of precision and recall. But now think about Named Entity Recognition. You're chugging through text, and every now and again there is an entity, which your system recognizes, misses, or fantasizes. I will use the notation word/GOLD/GUESS throughout, with O denoting the special background class of not-an-entity. So there are stretches of plain text (drove/O/O along/O/O a/O/O narrow/O/O road/O/O). These are the non-coding regions of NER. Then there are sequences (of one or more tokens) where there was an entity and the system guessed it right (in/O/O Palo/LOC/LOC Alto/LOC/LOC ./O/O), where there was an entity but the system missed it (in/O/O Palo/LOC/O Alto/LOC/O ./O/O), and where there wasn't an entity but the system hypothesized one (an/O/O Awful/O/ORG Headache/O/ORG ./O/O).

Things look good up until here: those events map naturally on to the true negatives (tn), true positives (tp), false negatives (fn), and false positives (fp) of the simple classification case. The problem is that there are other events that can happen. A system can notice that there is an entity but give it the wrong label (I/O/O live/O/O in/O/O Palo/LOC/ORG Alto/LOC/ORG ./O/O). A system can notice that there is an entity but get its boundaries wrong (Unless/O/PERS Karl/PERS/PERS Smith/PERS/PERS resigns/O/O). Or it can make both mistakes at once (Unless/O/ORG Karl/PERS/ORG Smith/PERS/ORG resigns/O/O). I'll call these events a labeling error (le), a boundary error (be), and a label-boundary error (lbe).

I started thinking along these lines just as an intuitive, natural way to characterize what happens in NER output, where entities are sparse occurrences in stretches of background text. But you can make it formal (I wrote a Perl script!). Moving along the sequence, the subsequence boundaries are: (i) at the start and end of the document, (ii) anywhere there is a change to or from a word/O/O token from or to a token where either guess or gold is not O, and (iii) anywhere that both the gold standard and the guess change their class assignment simultaneously, regardless of whether they agree. If you chop the sequence into subsequences like that, each one can be assigned to one of the seven classes above.
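To make the procedure concrete, here is a rough Python reconstruction of what the script does (the way labels and boundaries are compared within a segment is how I think about it; the details of the original Perl may differ):

  # Rough reconstruction of the chop-and-classify procedure described above.
  # gold and guess are equal-length lists of per-token labels, with 'O' for
  # the background (not-an-entity) class.
  def segments(gold, guess):
      bounds = [0]
      for i in range(1, len(gold)):
          prev_bg = gold[i-1] == 'O' and guess[i-1] == 'O'   # word/O/O token?
          cur_bg = gold[i] == 'O' and guess[i] == 'O'
          both_change = gold[i] != gold[i-1] and guess[i] != guess[i-1]
          if prev_bg != cur_bg or both_change:               # boundary rules (ii) and (iii)
              bounds.append(i)
      bounds.append(len(gold))
      return [(bounds[j], bounds[j+1]) for j in range(len(bounds) - 1)]

  def classify(gold, guess):
      if all(x == 'O' for x in gold) and all(x == 'O' for x in guess):
          return 'tn'
      if list(gold) == list(guess):
          return 'tp'
      if all(x == 'O' for x in guess):
          return 'fn'
      if all(x == 'O' for x in gold):
          return 'fp'
      same_span = [x != 'O' for x in gold] == [x != 'O' for x in guess]
      same_label = {x for x in gold if x != 'O'} == {x for x in guess if x != 'O'}
      if same_span and not same_label:
          return 'le'
      if same_label and not same_span:
          return 'be'
      return 'lbe'

  gold  = ['O',    'PERS', 'PERS', 'O']    # Unless/O/PERS Karl Smith resigns
  guess = ['PERS', 'PERS', 'PERS', 'O']
  print([classify(gold[s:e], guess[s:e]) for s, e in segments(gold, guess)])
  # ['be', 'tn']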

Now, the thing to notice is that for the first 4 event types, you are either correct or you get 1 demerit, assessed to either precision or recall. In the simple classification case, that's the end of the story and the F1 measure is sensible. But when doing precision and recall over subsequences, there are these other three event types. Each of them is assessed a minimum of 2 demerits, with both precision and recall being hit. Therefore, it is fairly clear that optimizing for F1 in this context will encourage a system to do the following: if I'm moderately uncertain of either the class label or the boundaries of the entity, because a mistake would cost me a minimum of 2 demerits, I'm better off proposing no entity, which will cost me only 1 demerit.
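To make the accounting explicit (d_P and d_R are my notation, not anything from the Suzuki et al. paper): write d_P for the number of demerits assessed to precision (entities proposed that are wrong in some way) and d_R for those assessed to recall (gold entities not recovered exactly). Then, at the entity level, roughly,

  \[
  P = \frac{tp}{tp + d_P}, \qquad
  R = \frac{tp}{tp + d_R}, \qquad
  F_1 = \frac{2PR}{P + R}.
  \]

A missed entity (fn) adds 1 only to d_R, a spurious entity (fp) adds 1 only to d_P, but every le, be, or lbe event adds at least 1 to each, which is exactly the double penalty described above.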

(Two notes:

(i) As I've defined events, the number of demerits for an event in the last three classes is unbounded, though in practice 2 is the most common case. For example, this lbe event would be assessed 4 demerits (3 to precision, and 1 to recall): Smith/ORG/PERS and/ORG/O Newcomb/ORG/PERS and/ORG/O Co./ORG/ORG.

(ii) Despite my title, the problem here isn't with the F measure per se, as Bob Moore emphasized to me at a coffee break during ACL 2006 (thanks!). The problem would occur with any measure that combines precision and recall and which is increasing in both arguments, such as the simple arithmetic mean of precision and recall.)

Observe that this behavior is the opposite of the way things were meant to work: people adopted F1 in IR rather than using accuracy because accuracy gives high scores to a system that returns no documents, which obviously isn't useful. But, here, optimizing for F1 is encouraging a system to not mark entities.

Now let's look at some data. I used this event classification system on the output of my NER system on the CoNLL 2003 shared task English testa data. Here is how many events of each type there were:

tn 5583
tp 4792
fn 118
fp 120
le 472
be 102
lbe 75

Note in particular that over 2/3 of the errors (472 + 102 + 75 = 649 of the 887 total, about 73%) are in those 3 extra categories that are multiply penalized. The ratios of classes vary with the task. For example, in biological NER, you tend to get many more boundary errors. But in my experience it is always the case that lots of the errors are in the last 3 classes.

Moreover, some of the errors in the le and be classes are not that bad, and sometimes even reflect subtle judgement calls and human annotator inconsistency in the gold standard. For instance, in the GENIA data you can find both regulation/O of/O human/DNA interleukin-2/DNA gene/DNA expression and transduction/O to/O the/O human/O IL-2/DNA gene/DNA, where it is unclear whether to include human in the name of the gene. Or in a newswire phrase like the Leeds stadium, it's not always very clear whether Leeds should be tagged ORG as a reference to the football team or LOC as a reference to the city. In almost any imaginable task, you would prefer systems that made these errors to ones that missed such entities entirely. In other words, the F1 measure is punishing more severely mistakes that should be punished less according to reasonable intuitions of task utility.

Has this been noticed before? I think so. The ACE program has a system for giving partial credit. But most ML people react very negatively to a scoring system that you couldn't possibly write on a napkin and which involves various very arbitrary-looking constants.... Do these observations undermine the last decade of work in NER? I don't think so. It turns out that there are lots of measures that are pretty okay provided you do not specifically optimize for them, but that are dysfunctional if you do. A well-known example is the family of traditional readability measures.

p.s. As I finish writing this guest post, it's occurred to me that I think this is the first nlpers post with some actual natural language examples in it. If you're reading this post, I guess that at least shows that such content isn't actively filtered out!

24 August 2006

Machine learning has failed[?.]

When I was visiting Singapore, Lan Man gave a talk at NUS right before mine (for extra fun, try to find her web site without using my link!). The work she presented is, for the purposes of this discussion, essentially a feature weighting technique for document classification.

More specifically, let's consider the standard bag of words model for text classification. Here, we have a feature set that is exactly the size of the vocabulary. The easiest thing to do is to use binary features (word present or not), but other things seem to work better. For instance, the number of times the word appears (term frequency, or tf) often works better. This is not so surprising. The problem with just using tf is that words like "the" appear a lot and they're somehow less "important." So instead, what we often do is tf*idf, where idf is the "inverse document frequency." There are a few variants of idf (that use logs, etc.), but the basic idea (written out below) is to count in how many documents each word appears at least once, and then divide the number of documents by this count. So the idf value is always at least one, and is one only for words that appear in all documents (eg., "the"). What Lan Man was proposing was essentially a different value to use instead of idf (I don't recall the details, but it was similarly an easy-to-compute function of the corpus). The result is essentially that binary does worst, tf does better, tf*idf better still, and tf*her_proposal best (for most data sets).
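Written out, the variant described above is

  \[
  \mathrm{idf}(w) = \frac{N}{\mathrm{df}(w)},
  \qquad
  \mathrm{weight}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w),
  \]

where N is the number of documents in the collection and df(w) is the number of documents in which w appears at least once; the log-based variants replace N/df(w) with its logarithm (possibly smoothed).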

The importance of using tf*idf arose in IR, where people started computing cosine similarities between word vectors. There, downweighting things like "the" was of utmost importance, due to how cosine works. But it makes much less sense in supervised learning. That is, for linear classifiers (which are the norm, and which she used for many of her experiments), it really shouldn't matter if we apply a constant scaling factor to each feature. The learned feature weight itself is what is supposed to determine the importance of a feature. If we learn weights for tf*idf, we could easily convert them into just-as-good weights for tf by multiplying each weight by the corresponding idf value.
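Here is that conversion spelled out as a quick sanity check (a toy sketch with random numbers; nothing here comes from her experiments):

  # Per-feature rescaling is absorbed by the weights of a linear classifier:
  # scores computed on tf*idf features with weights w equal scores computed
  # on plain tf features with weights w * idf.
  import numpy as np

  rng = np.random.default_rng(0)
  tf = rng.poisson(1.0, size=(5, 10)).astype(float)   # 5 documents, 10 words
  idf = rng.uniform(1.0, 20.0, size=10)                # per-word constants
  w_tfidf = rng.normal(size=10)                        # weights "learned" on tf*idf

  scores_tfidf = (tf * idf) @ w_tfidf                  # linear scores on tf*idf
  w_tf = w_tfidf * idf                                 # converted weights for tf
  scores_tf = tf @ w_tf                                # same scores on plain tf

  assert np.allclose(scores_tfidf, scores_tf)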

I can think of a few explanations for this phenomenon, but none is at all satisfying. One is geometric: by changing the scaling, we may be increasing the size of the margin while reducing the maximum norm of the feature vectors, which, for most generalization bounds, would be a good thing. Another is an interplay with regularization: it becomes less "expensive" to have an overall high weight (learned weight times idf) for infrequent, informative words than for irrelevant words. The real problem is that this is a problem at all! It means that even for applications as simple as text categorization, not only is feature engineering an important issue, but feature weighting seems to be as well.

(Incidentally, I'm curious whether other fields -- eg., vision -- have techniques similar to "idf." For instance, one could imagine weighting pixel values by "idf"s of how often a pixel is on in a dataset. I wonder if this helps. If not, it may be because for smallish images, the number of features is small already. I wonder if there is an algorithm design for machine learning that would get around this problem.)
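To make the pixel analogy concrete, the sort of thing I have in mind would be (purely a toy sketch, not an established vision technique):

  # Toy "pixel idf": downweight pixels that are 'on' in nearly every image.
  import numpy as np

  images = (np.random.rand(100, 28, 28) > 0.5).astype(float)  # fake binary images
  df = images.sum(axis=0)                      # in how many images each pixel is on
  idf = np.log((len(images) + 1) / (df + 1))   # smoothed so it stays non-negative
  weighted = images * idf                      # reweighted pixel "features"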

18 August 2006

Change of Notation

Ed Hovy has a particular label he assigns to a large variety of work in NLP: it performs a change of notation. The canonical example is POS tagging, where we are changing from one notation (words) to another notation (POS tags). He also applies this notion to most forms of parsing (though this is a bit more like adding notation), machine translation (one notation is Arabic, the other English), etc. My impression (though it's sometimes hard to tell where Ed really stands on such things) is that solving a task with a change-of-notation technique is suboptimal. I had never fully grasped what, specifically, Ed was trying to say until recently (and I make no promises that I haven't gotten it wrong, but this is my interpretation).

I think what Ed means by change of notation is that what is being done is pure symbol manipulation. That is, we have a bunch of symbols (typically words) and we just push them around and change them, but the symbols don't actually "mean" anything. There's no sense that the symbol "chair" is grounded to an actual chair. Or, as a logician might say, the manipulation is purely syntactic, not semantic. There is no model in which statements are evaluated. My interpretation is that by "notation" Ed simply means "syntax" (the mathematical notion of syntax, not the linguistic one).

On the other hand, it's hard not to say that everything we do, by definition, must be symbol manipulation. Computer science is all about playing with symbols. There may be a sense in which the word "chair" is grounded to an actual chair, but this actual chair, once inside the computer, will again be a symbol!

Despite this, I think that what one can think about is something like "how large is the connection between my model and the real world?" For most applications, the connection is probably limited to the specific form of the training data, which is about as small a connection as possible. Sometimes people attempt to integrate ontological information, either handcrafted (via something like WordNet) or automatically derived. This is, in some sense, attempting to open the "program to world" pipe a bit more. And there is some work that explicitly focuses on learning grounding, but I'm not aware of many actual uses for this yet. Additionally, some applications cannot help but be tied more closely to the real world (eg., robot navigation). But for the majority of tasks, the connection is quite small.

13 August 2006

Human in the Loop Learning

What we, as NLPers, typically do vis-a-vis machine learning is what might be called "Human in the Loop" machine learning. Why? Typically, we develop a model and set of features, train it on some training data and evaluate it on some development data. We then augment/change the model and features, retrain and retest on the dev data. Only once we are satisfied with our dev results do we run it on the test data and report scores in a paper. This is "human in the loop" because in each iteration we're using our human knowledge to, say, add some additional features that will hopefully correct for errors on the development data.

To many machine learning people, this is not ideal: the whole point of machine learning is that the human is not in the loop. One can liken the "adding more features" to something akin to adding new sensors to a robot: when it is deployed on Mars, we cannot easily send a person to duct-tape on a new sensor.

But I'm not sure that this argument really applies here. We are, after all, trying to either learn new things about language or build systems that work. For both of these goals, retwiddling models and adding features is the right thing to do. If we could have an automated system to help us with this, that would be great. But I don't think we should shy away from keeping ourselves in the loop.

Perhaps another reason why machine learning tends to shy away from HITLL is that it goes against the long-standing models of supervised learning, which typically begin by saying something like "Let D be a distribution over pairs of inputs x and outputs y and let L be a labeled training set drawn iid from D." Here, the inputs (and hence the distribution) specify exactly what information we have access to in terms of features. This model essentially takes features for granted, as simply part of the learning problem. This is quite reasonable from the ML perspective, but is somehow lacking if one wants to talk about feature engineering, etc. There may be another, more apt model out there, but everything I can think of reduces trivially to the standard supervised model, unless inappropriate assumptions are made.
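Written out a bit more fully (the hypothesis class H and the loss function l are my notation; the quoted setup only mentions D and the labeled set L), the standard model is something like

  \[
  L = \{(x_i, y_i)\}_{i=1}^{n} \sim D^n,
  \qquad
  \hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i),
  \]

with the hope that the expected loss of \hat{h} under D is small. The representation of x is fixed before learning begins, which is exactly why "add a feature and retrain" isn't an operation this model has any vocabulary for.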

04 August 2006

Future DUC Tasks

The Document Understanding Conference features a yearly summarization competition. For the past few years, the task has been query-focused summarization of clusters of (essentially entirely) news documents. There will be a pilot task next year and based on comments made during DUC 2006, it appears it will be one of the following:

  1. Multidocument, (probably) query-focused summarization of blog posts.
  2. Multidocument summarization of news, with respect to known information.

The idea in (1) is that there are several "novel" aspects one has to deal with.  First, blog posts are out of domain for most parsers, etc., which means we'll get noisy input, though not as noisy as speech.  Second, although the blog posts (the blogs would be from the TREC blog collection) will essentially all focus on news topics (sadly, NLPers is not in the corpus), they are almost certainly more emotionally fueled than vanilla news.  The identification of sentiment and opinion, which are both in vogue these days, will potentially become more useful.

The idea in (2) is that in most real world situations, the user who desires the summary has some background information on the topic.  The idea is that the summarization engine would be handed a collection of 5-10 documents that the user has presumably read, then 5-10 new documents to be summarized.  The novel aspect of this task is, essentially, detecting novelty.

Personally, I think both are potentially interesting, though not without their drawbacks.  The biggest potential problem I see with the blogs idea is that I think we're reentering the phase of not being able to achieve any sort of human agreement without fairly strict guidelines.  It's unclear, for example, how a summary should reflect two opposing viewpoints.  The biggest problem I see with idea (2) is that it is very reminiscent of some TREC-style tasks, like TDT, and I'm not sure it requires doing anything more than normal query-focused summarization with an MMR-style term (sketched below) to account for the "known information."  That's not to say these aren't worth exploring -- I think both are quite interesting -- but, as always, we should be careful.
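For concreteness, the standard MMR criterion picks the next sentence s from a candidate set C, given the query q and the already-selected summary S, by

  \[
  \arg\max_{s \in C \setminus S}
  \Big[ \lambda \, \mathrm{sim}(s, q) - (1 - \lambda) \max_{s' \in S} \mathrm{sim}(s, s') \Big],
  \]

and the "known information" version I have in mind would simply let s' range over sentences from the previously read documents as well as over S, so that anything too close to what the user already knows gets pushed down.  (This particular form is my sketch, not anything DUC has specified.)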