27 December 2012

Teaching (intro, grad) NLP

I had a post a while back on teaching ML that I still basically stand by.  I've taught intro grad NLP here at UMD twice now, and a sort-of-similar course back at Utah once.  I find these courses really hard to teach.  And not for the oft-bemoaned reason of the CS/linguistics mix -- I think it's possible to deal with that, and certainly it's an issue that's been talked about a lot.

What I find difficult is that NLP (and CL) is a collection of problems, techniques, ideas, frameworks, etc. that really are not tied together in any reasonable way other than the fact that they have to do with NLP.  Even if you manage to answer questions like "what topics are most interesting?", you're still faced with the problem that every time you switch topics, the entire context in which you're discussing them changes.  This is exacerbated by the fact that things like tagging and parsing are hopelessly boring (in comparison to all the cool interesting stuff in NLP these days), and yet so many modern ideas are built on understanding basic dynamic programming for tree structures and things like that.

To make things a bit more concrete, a standard intro NLP class might start with morphology.  Okay, so you have to explain what morphemes are and why they're important.  Now, you'll probably take a finite state approach, so you have to explain transducers.  If you want these things to actually work, you have to explain weighted transducers.  Do you do probabilities?  In that case there's the whole local versus global normalization business, which takes more time.  So now you want to do POS tagging or something.  Fine, you can do that with finite state models too.  But no one actually does this any more (except lingpipe :P).  So you have to explain POS stuff, perhaps how it works in languages other than English, and then you can leave them with HMMs (maybe talking about the Viterbi algorithm) or do lots of ML so you can get to CRFs or structured perceptron or something.  And we're still at POS tagging.  Now you switch to parsing.  Back to square one.  And then you want to do compositional semantics: now there's lots more structure, lots more features, and so on.  But even then things are at least somewhat connected.  But then you talk about lexical semantics, be it distributed representations or WSD or whatever: the problem is new, the techniques are new (do you teach Yarowsky?), the evaluation is new, and so on.
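
For what it's worth, by the local-versus-global normalization business I mean the usual MEMM-versus-CRF style contrast; roughly, in sequence-model form (with theta the weights and f a local feature function, just as illustration), it's the difference between normalizing each local decision and normalizing over whole output sequences:

    \[
    p_{\mathrm{local}}(y \mid x) \;=\; \prod_{t} \frac{\exp\big(\theta \cdot f(y_{t-1}, y_t, x, t)\big)}{\sum_{y'} \exp\big(\theta \cdot f(y_{t-1}, y', x, t)\big)}
    \qquad
    p_{\mathrm{global}}(y \mid x) \;=\; \frac{\exp\big(\sum_{t} \theta \cdot f(y_{t-1}, y_t, x, t)\big)}{\sum_{y'_{1:T}} \exp\big(\sum_{t} \theta \cdot f(y'_{t-1}, y'_t, x, t)\big)}
    \]

The global denominator sums over all label sequences, so you need a dynamic program (the forward algorithm) just to compute it -- which is exactly the sort of thing that eats class time.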

I think it's worth contrasting this with ML.  I find ML remarkably easy to teach (so I'm flipping the classroom this coming Spring for the UG version to make it more exciting) despite the fact that the material is (in many ways) much harder for CS types.  The thing that is nice about ML is that the problem is basically always the same (or at least changes only once, when you switch from supervised to unsupervised).  In that sense, ML tends to be a course about techniques for a relatively fixed problem (or at least fixed problem type).  This makes for significantly less context switching, which makes learning easier (and thereby makes teaching easier).

So the question I wanted to ask is: can we do something similar in NLP?  The crazy idea that I'm sure everyone will say is insane is the following: teach NLP as a course about what you can do with log-linear models.  Here's how I envision it.  You spend the first day talking about NLP and why data is important, ambiguity, etc., just like normal.  You spend the next two days explaining enough about log-linear models that you can treat them as given for the rest of the semester.  Maybe you show how to optimize them by gradient descent or something, but basically enough that anyone who is simultaneously taking ML will get more out of it, while those who are not are fine treating log-linear models as a black box (something like the sketch below).
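
To be concrete about what that black box might look like, here is a minimal sketch of a log-linear (multiclass logistic regression) classifier trained by batch gradient ascent on the log-likelihood (i.e., gradient descent on the negative log-likelihood), in plain numpy.  The example format, label count, and hyperparameters are placeholders for illustration, not something I'd hand out as-is:

    import numpy as np

    def train_loglinear(examples, num_labels, num_features,
                        iterations=100, learning_rate=0.1):
        # examples: list of (feature_vector, gold_label) pairs, where
        # feature_vector is a length-num_features numpy array and
        # gold_label is an int in [0, num_labels)
        theta = np.zeros((num_labels, num_features))
        for _ in range(iterations):
            grad = np.zeros_like(theta)
            for x, y in examples:
                scores = theta.dot(x)                  # one score per label
                probs = np.exp(scores - scores.max())  # softmax, computed stably
                probs /= probs.sum()
                grad[y] += x                           # observed features...
                grad -= np.outer(probs, x)             # ...minus expected features
            theta += learning_rate * grad / len(examples)  # ascend the log-likelihood
        return theta

    def predict(theta, x):
        return int(np.argmax(theta.dot(x)))

The point is that the whole thing fits on a slide or two; everything interesting for the rest of the semester lives in where the features come from and how you decode.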

Now, when you teach different topics, the framework in which you discuss them is the same.  You have a structured problem (which forces you to talk about algorithms like Viterbi or CKY) with interesting ambiguities (which forces you to talk about features).  Then the class essentially becomes a sequence of problems, associated algorithms, and relevant features.  The rest is left as a black box, which can be provided off the shelf for programming projects, so students can focus on the interesting and more NLP-ish problems of algorithms and features.  You could even start with something like sentiment classification (at the document level) to make the beginning gentle.
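
On the algorithms side, the canonical example is Viterbi decoding for sequence labeling: the black box hands you a local score (say, the dot product of the learned weights with features of the previous label, the current label, and the input), and the NLP content lives in the features and the dynamic program.  Here's a sketch, with the score function left abstract since it comes from whatever model the students trained:

    def viterbi(words, labels, score):
        # score(prev_label, curr_label, t) is assumed to come from the
        # black-box model (e.g. a dot product of weights with local
        # features); prev_label is None at position 0.
        if not words:
            return []
        n = len(words)
        # best[t][y]: score of the best labeling of words[:t+1] that ends in y
        best = [{y: score(None, y, 0) for y in labels}]
        back = [{}]
        for t in range(1, n):
            best.append({})
            back.append({})
            for y in labels:
                prev = max(labels, key=lambda p: best[t-1][p] + score(p, y, t))
                best[t][y] = best[t-1][prev] + score(prev, y, t)
                back[t][y] = prev
        # recover the best path by following backpointers from the best final label
        y = max(labels, key=lambda lab: best[n-1][lab])
        path = [y]
        for t in range(n-1, 0, -1):
            y = back[t][y]
            path.append(y)
        return list(reversed(path))

CKY is the same story with trees instead of chains, which is exactly the kind of connection that's hard to make when every topic arrives in its own framework.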

I realize there are some things you couldn't do this way, or that would be very awkward to do this way: anything generative or unsupervised, which often go together.  For instance, word alignment via the IBM models won't fit, and topic models won't fit (though I don't usually do them anyway -- maybe I should).  Probably there are some other things too.

Anyway, I'd be curious to hear what people think of this idea, or whether anyone has tried it.  I know it's biased by my own view of the world, but hey -- that's why I'm a professor (or at least why I assist professors...).

09 December 2012

NIPS stuff...

NIPS is over as of last night.  Overall I thought the program was strong (though I think someone, somewhere, is trying to convince me I need to do deep learning -- or at least that was the topic du jour... or I guess de l'an? this time).  I wasn't as thrilled with the venue (details at the end), but that's life.  Here were some of the highlights for me, of course excluding our own papers :P (see the full paper list here)... note that there will eventually be videos for everything!

  • User-Friendly Tools for Studying Random Matrices
    Joel A Tropp
    This tutorial was awesome.  Joel has apparently given it several times and so it's really well tuned.  The basic result is that if you love your Chernoff bounds and Bernstein inequalities for (sums of) scalars, you can get almost exactly the same results for (sums of) matrices.  Really great talk.  If I ever end up summing random matrices, I'm sure I'll use this stuff!
  • Emergence of Object-Selective Features in Unsupervised Feature Learning
    Adam Coates, Andrej Karpathy, Andrew Y. Ng
    They show that using only unlabeled data that is very heterogeneous, some simple approaches can pull out faces.  I imagine that some of what is going on is that faces are fairly consistent in appearance whereas "other stuff" often is not.  (Though I'm sure my face-recognition colleagues would argue with my "fairly consistent" claim.)
  • Scalable nonconvex inexact proximal splitting
    Suvrit Sra
    I just have to give props to anyone who studies nonconvex optimization.  I need to read this -- I only had a glance at the poster -- but I definitely think it's worth a look.
  • A Bayesian Approach for Policy Learning from Trajectory Preference Queries
    Aaron Wilson, Alan Fern, Prasad Tadepalli
    The problem solved here is imitation learning where your interaction with an expert is showing them two trajectories (that begin at the same state) and asking them which is better.  Something I've been thinking about recently -- very happy to see it work!
  • FastEx: Hash Clustering with Exponential Families
    Amr Ahmed, Sujith Ravi, Shravan M. Narayanamurthy, Alexander J. Smola
    The idea here is to replace the dot product between the parameters and sufficient statistics of an exp fam model with an approximate dot product achieved using locality sensitive hashing.  It takes a bit to figure out exactly how to do this.  Cool idea and nice speedups.
  • Identifiability and Unmixing of Latent Parse Trees
    Daniel Hsu, Sham M. Kakade, Percy Liang
    Short version: spectral learning for unsupervised parsing; the challenge is to get around the fact that different sentences have different structures, and "unmixing" is the method they propose to do this.  Also some identifiability results.
  • Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs
    Shay B. Cohen and Michael Collins
    Another spectral learning paper, this time for doing exact latent variable learning for latent-variable PCFGs.  Fast, and just slightly less good than EM.
  • Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
    Abner Guzman-Rivera, Dhruv Batra, Pushmeet Kohli
    Often we want our models to produce k-best outputs, but for some reason we only train them to produce one-best outputs and then just cross our fingers.  This paper shows that you can train directly to produce a good set of outputs (not necessarily diverse: just that it should contain the truth) and do better.  It's not convex, but the standard training is a good initializer.
  • [EDIT Dec 9, 11:12p PST -- FORGOT ONE!]
    Query Complexity of Derivative-Free Optimization
    Kevin G. Jamieson, Robert D. Nowak, Benjamin Recht
    This paper considers derivative free optimization with two types of oracles.  In one you can compute f(x) for any x with some noise (you're optimizing over x).  In the other, you can only ask whether f(x)>f(y) for two points x and y (again with noise).  It seems that the first is more powerful, but the result of this paper is that you get the same rates with the second!
  • I didn't see it, but Satoru Fujishige's talk Submodularity and Discrete Convexity in the Discrete Machine Learning workshop was supposedly wonderful.  I can't wait for the video.
  • Similarly, I heard that Bill Dolan's talk on Modeling Multilingual Grounded Language in the xLiTe workshop was very good.
  • Ryan Adams's talk on Building Probabilistic Models Around Deterministic Optimization Procedures in the "Perturbations, Optimization and Statistics" workshop (yeah, I couldn't figure out what the heck that meant either) was also very good.  The Perturb-and-MAP stuff and the Randomized Optimum models are high on my reading list, but I haven't gotten to them quite yet.
  • As always, Ryan McDonald and Ivan Titov gave very good talks in xLiTe, on Advances in Cross-Lingual Syntactic Transfer and Inducing Cross-Lingual Semantic Representations of Words, Phrases, Sentences and Events, respectively.
I'm sure there was lots of other stuff that was great and that I missed because I was skiing... er, working hard on NAACL.

Really my only gripe about NIPS this year was the venue.  I normally wouldn't take the time to say this, but since we'll be enjoying this place for the next few years, I figured I'd state what I saw as the major problems, some of which are fixable.  For those who didn't come, we're in Stateline, NV (on the border between CA and NV), in two casinos.  Since we're in NV, there is a subtle note of old cigarette on the nose fairly constantly.  There is also basically nowhere good to eat (especially if you have dietary restrictions) -- I think there are a half dozen places on yelp with 3.5 stars or greater.  My favorite tweet during NIPS was from Jacob Eisenstein, who said: "stateline, nevada / there is nothing but starbucks / the saddest haiku".  Those are the "unfixables" that make me think twice about going to NIPS next year, but of course I'll go.

The things that I think are fixable... there was nowhere to sit.  Presumably this is because the casino wants you to sit only where they can take your money, but I had most of my long discussions either standing or sitting on the ground.  More chairs in hallways would be much appreciated.  There was almost no power in the rooms, which could be solved by some power strips.  The way the rooms divided for tutorials was really awkward: the speaker was way off on one side of the room and the screen was on the other (and too high to point to), so it was basically like watching a video of slides online without ever seeing the presenter.  Not sure if that's fixable, but it seems plausible.  And the walls between the workshop rooms were so thin that often I could hear another workshop's speaker better than I could hear the speaker in the workshop I was attending.  And the internet in my hotel room was virtually unusably slow (though the NIPS-specific internet was great).