14 August 2009

Classifier performance: alternative metrics of success

I really enjoyed Mark Dredze's talk at EMNLP on multiclass confidence weighted algorithms, where they take their CW binary predictors and extend them in two (basically equivalent) ways to a multiclass/structured setting (warning: I haven't read the paper!). Mark did a great job presenting, as always, and dryly suggested that we should all throw away our perceptrons and our MIRAs and SVMs and just switch to CW permanently. It was a pretty compelling case.

Now, I'm going to pick on basically every "yet another classifier" paper I've read in the past ten years (read: ever). I'm not trying to point fingers, but just trying to better understand why I, personally, haven't yet switched to using these things and continue to use either logistic regression or averaged perceptron for all of my classification needs (aside from the fact that I am rather fond of a particular software package for doing these things -- note, though, that it does support PA and maybe soon CW if I decide to spend 10 minutes implementing it!).

Here's the deal. Let's look at SVM versus logreg. Whether this is actually true or not, I have this gut feeling that logreg is much less sensitive to hyperparameter selection than are SVMs. This is not at all based on any science, and the experience it's based on is somewhat unfair (comparing megam to libSVM, for instance, which use very different optimization methods, and libSVM doesn't do early stopping while megam does). However, I've heard from at least two other people that they have the same internal feelings. In other words, here's a caricature of how I believe logreg and SVM behave:

[Figure: a cartoon plot of accuracy versus the regularization parameter, with logreg roughly flat across the whole range and SVM peaking higher, but only in a narrow band.]
That is, if you really tune the regularizer (lambda) well, then SVMs will win out. But for the majority of the settings, they're either the same or logreg is a bit better.

As a result, what do I do? I use logreg with lambda=1. That's it. No tuning, no nothing.

(Note that, as I said before, I haven't ever run experiments to verify this. I think it would be a moderately interesting thing to try, to see if it really holds up when all else -- e.g., the optimization algorithm, early stopping, implementation, choice of regularizer (L1, L2, KL, etc.), and so on -- is held constant... maybe it's not true. But if it is, then it's an interesting theoretical question: hinge loss and log loss don't look that different, despite the fact that John seems to not like how log loss diverges: why should this be true?)

This is also why I use averaged perceptron: there aren't any hyperparameters to select. It just runs.

What I'd really like to see in future "yet another classifier" papers is an analysis of sensitivity to hyperparameter selection. You could provide graphs and stuff, but these get hard to read. I like numbers. I'd like a single number that I can look at. Here are two concrete proposals for what such a number could be (note: I'm assuming you're also going to provide performance numbers at the best possible selection of hyperparameters from development data or cross validation... I'm talking about something in addition):

  1. Performance at a default setting of the hyperparameter. For instance, SVM-light uses something like average inverse norm of the data vectors as the C parameter. Or you could just use 1, like I do for logreg. In particular, suppose you're testing your algorithm on 20 data sets from UCI. Pick a single regularization parameter (or parameter selection scheme, a la SVM-light) to use for all of them and report results using that value. If this is about the same as the "I carefully tuned" setting, I'm happy. If it's way worse, I'm not so happy.
  2. Performance within a range. Let's say that if I do careful hyperparameter selection then I get an accuracy of X. How large is the range of hyperparameters for which my accuracy is at least X*0.95? I.e., if I'm willing to suffer 5% multiplicative loss, how lazy can I be about hp selection? For this, you'll probably need to grid out your performance and then do empirical integration to approximate this. Of course, you'll need to choose a bounded range for your hp (usually zero will be a lower bound, but you'll have to pick an upper bound, too -- but this is fine: as a practitioner, if you don't give me an upper bound, I'm going to be somewhat unhappy).
Neither of these is totally ideal, but I think they'd be a lot better than the current situation of really having no idea! Maybe there are other proposals out there that I don't know about, or maybe other readers have good ideas. But for me, if you're going to convince me to switch to your algorithm, this is something that I really really want to know.
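
To make proposal 2 concrete, here's a minimal sketch (names and numbers are purely illustrative, not from any real experiment) of turning a grid search into a single robustness number: the fraction of a bounded hyperparameter range on which dev accuracy stays within 5% (multiplicatively) of the best gridded value.

```python
import numpy as np

def hp_robustness(lambdas, dev_accuracy, tol=0.95):
    """Proposal 2 as a single number: the fraction of a bounded
    hyperparameter range on which dev accuracy stays within `tol`
    (multiplicatively) of the best gridded accuracy."""
    lam = np.asarray(lambdas, dtype=float)
    acc = np.asarray(dev_accuracy, dtype=float)
    order = np.argsort(lam)
    lam, acc = lam[order], acc[order]
    good = (acc >= tol * acc.max()).astype(float)   # indicator of "close enough"
    # empirical (trapezoidal) integration of the indicator over the grid
    segments = np.diff(lam)
    covered = np.sum(segments * 0.5 * (good[:-1] + good[1:]))
    return covered / (lam[-1] - lam[0])

# a curve that is flat over most of [0.01, 10] scores near 1.0;
# a sharply peaked one scores near 0.0 (accuracies here are made up)
lam = np.linspace(0.01, 10.0, 100)
acc = 0.85 - 0.02 * np.abs(np.log(lam))
print(hp_robustness(lam, acc))
```

Proposal 1 falls out of the same grid for free: report the accuracy at your chosen default (say, lambda=1) right next to the tuned maximum.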

(As an aside, Mark, if you're reading this, I can imagine the whole CW thing getting a bit confused if you're using feature hashing: have you tried this? Or has someone else?)

03 August 2009

ACS: Machine Translation Papers at EMNLP

[Guest Post by Adam Lopez... thanks, Adam! Hal's comment: you may remember that a while ago I proposed the idea of conference area chairs posting summaries of their areas; well, Adam is the first to take me up on this idea... I still think it's a good idea, so anyone else who wants to do so in the future, let me know!]

Conferences can be exhausting, and back-to-back conferences can be really exhausting, so I want to convince you to pace yourself and save some energy for EMNLP at the end of the week, because we have some really interesting MT papers. I'll focus mainly on oral presentations, because unlike poster sessions, the parallel format of the oral sessions entails a hard choice between mutually exclusive options, and part of my motivation is to help you make that choice. That being said, there are many interesting papers at the poster session, so do take a look at them!

MT is a busy research area, and we have a really diverse set of papers covering the whole spectrum of ideas: from blue sky research on novel models, formalisms, and algorithms, to the hard engineering problems of wringing higher accuracy and speed out of a mature, top-scoring NIST system. I occasionally feel that my colleagues on the far reaches of either side of this spectrum are too dismissive of work on the other side; we need both if we're going to improve translation.

Outside the Box

Before giving you a guided tour through that spectrum, I want to highlight one paper that I found thought-provoking, but hard to classify. Zaidan & Callison-Burch question a basic assumption underlying most machine learning approaches to NLP: that we must optimize on an easily computable approximation to the true loss function. They ask: why not optimize for human judgement? They design a metric that uses judgements on small snippets of a target sentence (defined by a spanning nonterminal in a parse tree of the aligned source sentence) and figure out how many judgements they would need to collect (using Amazon Mechanical Turk) to cover an iteration of MERT, exploiting the fact that these snippets reoccur repeatedly during optimization. How hard is this exactly? I would say, in terms of this scale of loss functions, that their metric is a 2. Yet, it turns out to be cheap and fast to compute. The paper doesn't report results of an actual optimization run, but it's in the works... hopefully you'll learn more at the conference.

Connecting Theory and Practice

A few papers combine deep theoretical insight with convincing empirical results. Hopkins & Langmead improve on cube pruning, a popular approximate search technique for structured models with non-local features (i.e. translation with an integrated language model). They move cube pruning from its ad hoc roots to a firm theoretical basis by constructing a reduction to A* search, connecting it to classical AI search literature. This informs the derivation of new heuristics for a syntax-based translation model, including an admissible heuristic to perform exact cube pruning. It's still globally approximate, but exact for the local prediction problem that cube pruning solves (i.e., what are the n-best state splits of an item, given the n-best input states from previous deductions?). Amazingly, this is only slightly slower than the inexact version and improves the accuracy of a strong baseline on a large-scale Arabic-English task.
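
If you haven't seen cube pruning up close: the local problem it solves is combining the n-best lists of an item's antecedents into an n-best list for the item itself. Here's a minimal sketch (my own, not code from the paper) of that combination for two best-first score lists under a purely additive score, where lazy heap-based enumeration is exact; once a non-local LM score enters, the monotonicity assumption breaks, and that is exactly the gap the paper's A* heuristics address.

```python
import heapq

def kbest_combinations(xs, ys, k, combine=lambda a, b: a + b):
    """Lazily enumerate the k best pairwise combinations of two score
    lists sorted best-first.  Exact as long as `combine` is monotone in
    each argument (true for additive scores, violated by a non-local LM)."""
    if not xs or not ys:
        return []
    heap = [(-combine(xs[0], ys[0]), 0, 0)]
    seen, out = {(0, 0)}, []
    while heap and len(out) < k:
        negscore, i, j = heapq.heappop(heap)
        out.append(-negscore)
        for ni, nj in ((i + 1, j), (i, j + 1)):   # the two grid neighbours
            if ni < len(xs) and nj < len(ys) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-combine(xs[ni], ys[nj]), ni, nj))
    return out

print(kbest_combinations([5.0, 3.0, 1.0], [4.0, 2.0], k=3))  # [9.0, 7.0, 7.0]
```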

Li & Eisner show how to compute a huge number of statistics efficiently over a combinatorially large number of hypotheses represented in a hypergraph. The statistics include expected hypothesis length, feature expectation, entropy, cross-entropy, KL divergence, Bayes risk, variance of hypothesis length, gradient of entropy and Bayes risk, covariance and Hessian matrix. It's beautifully simple: they recast the quantities of interest as semirings and run the inside (or inside-outside) algorithm. As an example application, they perform minimum risk training on a small Chinese-English task, reporting gains in accuracy. For a related paper on minimum risk techniques, see the poster by Pauls et al.
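
For the curious, the semiring trick is compact enough to sketch. A first-order expectation semiring carries pairs (probability, probability-weighted value), and running the ordinary inside algorithm with these pairs yields both the partition function and the desired expectation in one pass. The toy hypergraph below is my own illustration (not code from the paper), computing expected hypothesis length:

```python
# First-order expectation semiring: elements are pairs (p, r).
def splus(a, b):
    return (a[0] + b[0], a[1] + b[1])

def stimes(a, b):
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

ZERO, ONE = (0.0, 0.0), (1.0, 0.0)

def inside(nodes, edges):
    """Inside algorithm over a hypergraph.  `nodes` is a list in
    topological order; `edges[head]` is a list of (tails, p, r)
    hyperedges, where r is the value whose expectation we want
    (e.g. how many target words the edge emits).  Returns
    (Z, sum_d p(d)*r(d)) at the last node, so E[r] = second / first."""
    value = {}
    for n in nodes:
        if not edges.get(n):          # leaf: semiring ONE
            value[n] = ONE
            continue
        acc = ZERO
        for tails, p, r in edges[n]:
            k = (p, p * r)            # lift the edge weight into the semiring
            for t in tails:
                k = stimes(k, value[t])
            acc = splus(acc, k)
        value[n] = acc
    return value[nodes[-1]]

# toy example: two competing derivations of the goal, lengths 2 and 3
nodes = ["a", "b", "goal"]
edges = {"goal": [(["a"], 0.6, 2.0), (["b"], 0.4, 3.0)]}
Z, pr = inside(nodes, edges)
print("expected length =", pr / Z)    # 0.6*2 + 0.4*3 = 2.4
```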

Novel Modeling and Learning Approaches

Tromble & Eisner also connect translation to theory by way of a novel model, framing reordering as an instance of the linear ordering problem: given a matrix of pairwise ordering preferences between all words in a sentence, can we find a permutation that optimizes the global score? This is NP-hard, but they give a reasonable approximation based on ITG, with some clever dynamic programming tricks to make it work. Then they show how to learn the matrix and use it to reorder test sentences prior to translation, improving over the lexicalized reordering model of Moses on German-English.
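
To make the linear ordering problem concrete: given a matrix B where B[i][j] is the benefit of placing word i anywhere before word j, the goal is a permutation maximizing the total satisfied preference. Here's a brute-force sketch of just the objective (my own illustration; the paper's contributions are the ITG-based approximation and learning the matrix):

```python
from itertools import permutations

def lop_score(order, B):
    """Sum of B[i][j] over all pairs where word i ends up before word j."""
    return sum(B[i][j] for pos, i in enumerate(order) for j in order[pos + 1:])

def best_order_bruteforce(B):
    """Exact solver by enumeration -- exponential, purely to make the
    objective concrete; the paper uses an ITG-constrained DP instead."""
    return max(permutations(range(len(B))), key=lambda o: lop_score(o, B))

# toy 3-word preference matrix that wants word 2 early and word 0 late
B = [[0, 1, 0],
     [2, 0, 1],
     [3, 4, 0]]
print(best_order_bruteforce(B))   # (2, 1, 0)
```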

However, most of the new models at EMNLP are syntax-based. In the last few years, syntax-based modeling has focused primarily on variants of synchronous context-free grammar (SCFG). This year there's a lot of work investigating more expressive formalisms.

Two papers model translation with restricted variants of synchronous tree-adjoining grammar (STAG). Carreras & Collins model syntax atop phrase pairs with a parser using sister adjunction (as in their 2008 parser). The model resembles a synchronous version of Markov grammar, which also connects it to recent dependency models of translation (e.g. Shen et al. 2008, Galley et al. 2009, Gimpel & Smith below, and Hassan et al. in the poster session). Decoding is NP-complete, and devising efficient beam search is a key point in the paper. The resulting system outperforms Pharaoh on German-English. DeNeefe & Knight model target-side syntax via synchronous tree insertion grammar (STIG). It's similar to synchronous tree substitution grammar (STSG; previously realized in MT as GHKM) with added left- and right-adjunction operations to model optional arguments. They show how to reuse a lot of the STSG machinery via a grammar transformation from STIG to STSG, and the results improve on a strong Arabic-English baseline.

Gimpel & Smith use a relatively new formalism: quasi-synchronous dependency grammar (QDG). In quasi-synchronous grammar, the generation of a target syntax tree is conditioned on (but not necessarily isomorphic to) a source syntax tree. Formally, each target node can be annotated with any source node. Since in dependency grammar the nodes are words, their QDG model resembles a word-to-word model. Decoding with QDG was not obvious given past work, and is one of several novel contributions of the paper. Another is the idea that all possible biphrases can fire an associated feature, regardless of overlap. Kääriäinen makes this idea central. Instead of reasoning over the latent derivations of a generative model, his model directly optimizes a feature-based representation of the target sentence, where the features consist of any biphrase in the training set (per standard heuristics). This raises some new problems -- such as how to find the target sentence given the optimal feature vector -- which are solved with dynamic programming. The decoder doesn't quite beat Moses when used with a language model, but it's an order of magnitude faster!

Three other papers operate on STSG models, with an emphasis on learning techniques. Cohn & Blunsom reformulate tree-to-string STSG induction as a problem in non-parametric Bayesian inference, extending their TSG model for monolingual parsing, and removing the dependence on heuristics over noisy GIZA++ word alignments. The model produces more compact rules, and outperforms GHKM on a Chinese-English task. This is a hot topic: check out Liu & Gildea's poster for an alternative Bayesian formulation of the same problem and language pair. Galron et al. look at tree-to-tree STSG (from a Data-Oriented Parsing perspective), with an eye towards discriminatively learning STSG rules to optimize for translation accuracy.

Bayesian inference also figures in the model of Chung & Gildea, who aim at bilingually-informed segmentation of a source language. The model is like IBM Model 1, except that the source positions are actually substrings of the source instead of single positions. Reasoning over the substring boundaries makes it resemble an HMM, and they use a sparse prior to avoid overfitting. Tokenizing new text uses the marginal distribution on source language segmentations, and this performs almost as well as a supervised segmenter on Chinese, and better on Korean, in end-to-end translation.

SCFG models aren't completely forgotten: Zhang & Li offer a new twist on reordering in binary-branching SCFG. Given a source parse, we could train a maximum entropy classifier to decide whether any binary production should be inverted; this requires a lot of computation over sparse vectors. They instead represent the features implicitly using a tree convolution kernel, showing nice gains in Chinese-English.

On the algorithmic side, Levenberg & Osborne look at language modeling under the condition that we have unbounded data streams in both source and target language, bounded computation, and the desire to bias our language model towards more recent language use without constantly retraining it. They accomplish this with online perfect hashing (extending previous work) in a succinct data structure that supports deletions, showing that they can draw on recent information in both the source and the target to incrementally update the model while keeping a bounded memory footprint.

Bai et al. focus on the problem of acquiring multiword expressions (i.e. idioms), showing why typical word alignment methods fail, and using a combination of statistical association measures and heuristics to fix the problem, with small gains in Chinese-English.

Decoding

Since SCFG models have become mainstream, there's been a greater emphasis on decoding. Following a recent strand of research on grammar transformations for SCFG, Xiao et al. observe that, in the space of possible transformations, many will pair source yields with huge numbers of target yields, which compete during decoding and thus result in more search errors. The trick is to select a transform that distributes target yields more evenly across source yields. They pose this as an optimization problem and give a greedy algorithm; the resulting grammar is reliably better under a variety of conditions on a Chinese-English task. Meanwhile, Zhang et al. engineer more efficient STSG decoding for the case in which the source is a parse forest and source units are tree fragments. The trick is to encode translation rules in the tree path equivalent of a prefix tree. On Chinese-English this improves decoding speed and ultimately translation accuracy, because the decoder can consider larger fragments much more efficiently. Finally, see Finch & Sumita's comprehensive poster on bidirectional phrase-based decoding for a huge number of language pairs.

Onwards and Upwards

The align/extract/MERT pipeline popularized by Moses and other NIST-style systems is incredibly hard to improve, but several papers manage just that.

Hermjakob's word aligner starts from lexical translation parameters learned by a statistical alignment model. Then, following some fairly general observations on different linguistic classes of words, it uses some well-motivated heuristics to fix a whole bunch of little things that many more principled models ignore: content words (improved via careful manipulation of pointwise mutual information) and function words (improved via constraints from parse structure) are handled differently, and numbers, transliterations, and morphology get careful treatment, giving strong improvements in Arabic-English.

Liu et al. then extract phrases by relaxing standard heuristic constraints. Given a posterior probability for every alignment point, they simply calculate the probability that a phrase would be extracted, and use this as their count in the typical frequency-based estimate. It's efficient and improves Chinese-English.

Three papers incorporate new feature types into strong baseline translation models, following a recent trend. Shen et al. devise some clever local features using source-side context, derivation span length, and dependency modeling to make impressive improvements on an already impressive baseline system in both Chinese-English and Arabic-English. Matsoukas et al. then show how a mixed-genre system can effectively be adapted for a particular target domain, by using a small amount of data to tune weights tied to genre and collection types in the training corpus, again with strong results in Arabic-English. Mauser et al. take their previous triplet lexicon model (a probabilistic feature using an outside source word as additional conditioning context) and move it from a reranking step into the decoding step, with a nice experimental treatment showing improvements in large-scale Chinese-English and Arabic-English.

If you've seen the latest NIST results, you know that system combination gives huge improvements. Check out posters by He & Toutanova, Duan et al., and Feng et al. to learn the latest techniques. Last but not least, if you need a strategy for language pairs with very little parallel data, the poster by Nakov & Ng will interest you.

Thanks

EMNLP was the first time I've been area chair for a conference, and it was really rewarding to work with such great volunteers and to see the great papers that were selected (I should note here that I included two papers not on my track that I'm quite familiar with -- the ones from Edinburgh). It was also very enlightening, but that's another story. Many thanks to Hal for offering this forum to share the results!

01 August 2009

Destination: Singapore

Welcome, everyone, to ACL! It's pretty rare for me to end up conferencing in a country I've been to before, largely because I try to avoid it. When I was here last time, I stayed with Yee Whye, who was here at the time as a postdoc at NUS, and had lived here previously in his youth. As a result, he was an excellent "tour guide." With his help, here's a list of mostly food related stuff that you should definitely try while here (see also the ACL blog):

  • Pepper crab. The easiest to find are the "No Signboard" restaurant chain. Don't wear a nice shirt unless you plan on doing laundry.
  • Chicken rice. This sounds lame. Sure, chicken is kind of tasty. Rice is kind of tasty. But the key is that the rice is cooked in or with melted chicken fat. It's probably the most amazingly simple and delicious dish I've ever had. "Yet Kun" (or something like that) is along Purvis street.
  • Especially for dessert, there's Ah Chew, a Chinese place around Liang Seah street in the Bugis area (lots of other stuff there too).
  • Hotpot is another local specialty: there is very good spicy Szechuan hotpot around Liang Seah street.
  • For real Chinese tea, here. (Funny aside: when I did this, they first asked "have you had tea before?" Clearly the meaning is "have you had real Chinese tea prepared traditionally and tasted akin to a wine tasting?" But I don't think I would ever ask someone "have you had wine before?" But I also can't really think of a better way to ask this!)
  • Good late night snacks can be found at Prata stalls (e.g., Indian roti with curry).
  • The food court at Vivocity, despite being a food court, is very good. You should have some hand-pressed sugar cane juice -- very sweet, but very tasty (goes well with some spicy hotpot).
  • Chinatown has good Chinese dessert (e.g., bean stuff) and frog leg porridge.
Okay, so this list is all food. But frankly, what else are you going to do here? Go to malls? :). There's definitely nice architecture to be seen; I would recommend the Mosque off of Arab street; of course you have to go to the Esplanade (the durian-looking building); etc. You can see a few photos from my last trip here.

Now, I realize that most of the above list is not particularly friendly to my happy cow friends. Here's a list of restaurants that happy cow provides. There are quite a few vegetarian options, probably partially because of the large Muslim population here. There aren't as many vegan places, but certainly enough. For the vegan minded, there is a good blog about being vegan in Singapore (first post is about a recent local talk by Campbell, the author of The China Study, which I recommend everyone at least reads). I can't vouch for the quality of these places, but here's a short list drawn from Living Vegan:
  • Mushroom hotpot at Ling Zhi
  • Fried fake meat udon noodles (though frankly I'm not a big fan of fake meat)
  • Green Pasture cafe; looks like you probably have to be a bit careful here in terms of what you order
  • Yes Natural; seems like it has a bunch of good options
  • Lotus Veg restaurant, seems to have a bunch of dim sum (see here, too)
  • If you must, there's pizza
  • And oh-my-gosh, there's actually veggie chicken rice, though it doesn't seem like it holds up to the same standards as real chicken rice (if it did, that would be impressive)
Okay, you can find more for yourself if you go through their links :).

Enjoy your time here!

Quick update: Totally forgot about coffee. If you need your espresso kick, Highlander coffee (49 Kampong Bahru Road) comes the most recommended, but is a bit of a hike from the conference area. Of course, you could also try the local specialty: burnt crap with condensed milk (lots and lots of discussion especially on page two here).