27 December 2012

Teaching (intro, grad) NLP

I had a post a while back on teaching ML that I still basically stand by.  I've taught intro grad NLP here at UMD twice now, and a sort-of-similar-course back at Utah once.  I find these courses really hard to teach.  And not for the usually bemoaned reason of the CS/linguistics mix -- I think it's possible to deal with that, and certainly it's an issue that's been talked about a lot.

What I find difficult is that NLP (and CL) is a collection of problems, techniques, ideas, frameworks, etc. that really are not tied together in any reasonable way other than the fact that they have to do with NLP.  Even if you manage to answer questions about "what sort of topics are most interesting?" you're still faced with the problem that every time you switch topics, the entire context in which you're discussing them changes.  This is exacerbated by the problem that things like tagging and parsing are hopelessly boring (in comparison to all the cool interesting stuff in NLP these days), and yet so many modern ideas are based on understanding basic dynamic programming for tree structures and things like that.

To make things a bit more concrete, a standard intro NLP class might start with morphology.  Okay, so you have to explain what morphemes are and why they're important.  Now, you probably will take a finite state approach, so you have to explain transducers.  If you want these things to work, you have to explain weighted transducers.  Do you do probabilities?  In that case there's the whole local vs. global normalization stuff, which takes more time.  So now you want to do POS tagging or something.  Fine, you can do that with finite state models too.  But no one actually does this any more (except lingpipe :P).  So you have to explain POS stuff, perhaps how this works in non-English, and then you can leave them with HMMs (maybe talking about the Viterbi algorithm) or do lots of ML so you can get to CRFs or structured perceptron or something.  And we're still at POS tagging.  Now you switch to parsing.  Back to square one.  And then you want to do compositional semantics: now there's lots more structure, lots more features and so on.  But even then things are at least somewhat connected.  But then you talk about lexical semantics, be it distributed representations or WSD or whatever: the problem is new, the techniques are new (do you teach Yarowsky?), the evaluation is new and so on.

I think it's worth contrasting this with ML.  I find ML remarkably easy to teach (so I'm flipping the classroom this coming Spring for the UG version to make it more exciting) despite the fact that the material is (in many ways) much harder for CS types.  The thing that is nice about ML is that the problem is basically always the same (or at least changes only once, when you switch from supervised to unsupervised).  In that sense, ML tends to be a course about techniques for a relatively fixed problem (or at least fixed problem type).  This makes for significantly less context switching, which makes learning easier (and thereby makes teaching easier).

So the question I wanted to ask is: can we do something similar in NLP?  The crazy idea that I'm sure everyone will say is insane is the following: teach NLP as a course about what you can do with log-linear models.  Here's how I envision it.  You spend the first day talking about NLP and why data is important, ambiguity, etc., just like normal.  You spend the next two days explaining enough about log-linear models that you can treat them as given for the rest of the semester.  Maybe you tell them how to optimize these models by gradient descent or something, but basically enough that anyone who is simultaneously taking ML will get more out of it, while those who are not can treat log-linear models as a black box.
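
Just to make the "two days of log-linear models" concrete, here's roughly the level of detail I have in mind: a bare-bones multiclass log-linear (maxent) classifier trained by gradient descent.  This is only an illustrative sketch (plain numpy, dense features, no regularization, made-up function names), not something I'm claiming I'd hand out verbatim:

    import numpy as np

    def train_loglinear(X, y, num_classes, lr=0.1, epochs=100):
        """Multiclass log-linear model: p(y|x) proportional to exp(w_y . x).
        X is an (n, d) array of feature vectors; y is a length-n array of class ids."""
        n, d = X.shape
        W = np.zeros((num_classes, d))
        for _ in range(epochs):
            scores = X @ W.T                             # (n, k) unnormalized scores
            scores -= scores.max(axis=1, keepdims=True)  # numerical stability
            probs = np.exp(scores)
            probs /= probs.sum(axis=1, keepdims=True)
            probs[np.arange(n), y] -= 1.0                # gradient of the neg. log-likelihood
            W -= lr * (probs.T @ X) / n
        return W

    def predict(W, X):
        return np.argmax(X @ W.T, axis=1)

Everything after day three would then treat train_loglinear and predict as the off-the-shelf black box.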

Now, when you teach different topics, the framework in which you discuss them is the same.  You have a structured problem (which forces you to talk about algorithms like Viterbi or CKY) with interesting ambiguities (which forces you to talk about features).  Then, the class essentially becomes a sequence of problems, associated algorithms and relevant features.  The rest is left as a black box, which can be provided off the shelf for programming projects, and they can focus on the interesting and more NLP-ish problems of algorithms and features.  You could even start with something like sentiment classification (at a document level) to make the beginning gentle.
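
For instance, once the scoring function is a given, POS tagging really is just "define features over (previous tag, current tag, word) and run Viterbi."  Here's a hedged sketch of the kind of thing students would implement; the feature function and the weights dictionary are toy placeholders, not a recommendation:

    def features(prev_tag, tag, words, i):
        # toy feature function; real projects would add many more
        return [f"tag={tag}", f"bigram={prev_tag}_{tag}", f"word+tag={words[i]}_{tag}"]

    def score(weights, prev_tag, tag, words, i):
        return sum(weights.get(f, 0.0) for f in features(prev_tag, tag, words, i))

    def viterbi(words, tags, weights):
        """Highest-scoring tag sequence under a (log-)linear scoring function."""
        n = len(words)
        best = [dict() for _ in range(n)]      # best[i][tag] = (score, previous tag)
        for t in tags:
            best[0][t] = (score(weights, "<s>", t, words, 0), None)
        for i in range(1, n):
            for t in tags:
                best[i][t] = max((best[i-1][p][0] + score(weights, p, t, words, i), p)
                                 for p in tags)
        tag = max(tags, key=lambda t: best[n-1][t][0])
        output = [tag]
        for i in range(n - 1, 0, -1):          # follow backpointers
            tag = best[i][tag][1]
            output.append(tag)
        return list(reversed(output))

The point being that only the feature function and the choice of dynamic program change from problem to problem; the learning machinery stays fixed.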

I realize there are some things you couldn't do this way, or would be very awkward to do this way.  Anything generative or unsupervised, which often go together.  For instance, word alignment via the IBM models won't fit.  Topic models won't fit (though I don't usually do them anyway -- maybe I should).  Probably there are some other things too.

Anyway, I'd be curious to hear what people think of this idea.  I know it's biased by my own view of the world, but hey -- that's why I'm a professor (or at least why I assist professors...).  Or if anyone has tried it.

09 December 2012

NIPS stuff...

NIPS is over as of last night.  Overall I thought the program was strong (though I think someone, somewhere, is trying to convince me I need to do deep learning -- or at least that was the topic du jour... or I guess de l'an? this time).  I wasn't as thrilled with the venue (details at the end) but that's life.  Here were some of the highlights for me, of course excluding our own papers :P (see the full paper list here)... note that there will eventually be videos for everything!

  • User-Friendly Tools for Studying Random Matrices
    Joel A Tropp

    This tutorial was awesome.  Joel has apparently given it several times and so it's really well fine-tuned.  The basic result is that if you love your Chernoff bounds and Bernstein inequalities for (sums of) scalars, you can get almost exactly the same results for (sums of) matrices.  Really great talk.  If I ever end up summing random matrices, I'm sure I'll use this stuff!
  • Emergence of Object-Selective Features in Unsupervised Feature Learning
    Adam Coates, Andrej Karpathy, Andrew Y. Ng
    They show that using only unlabeled data that is very heterogeneous, some simple approaches can pull out faces.  I imagine that some of what is going on is that faces are fairly consistent in appearance whereas "other stuff" often is not.  (Though I'm sure my face-recognition colleagues would argue with my "fairly consistent" claim.)
  • Scalable nonconvex inexact proximal splitting
    Suvrit Sra
    I just have to give props to anyone who studies nonconvex optimization.  I need to read this -- I only had a glance at the poster -- but I definitely think it's worth a look.
  • A Bayesian Approach for Policy Learning from Trajectory Preference Queries
    Aaron Wilson, Alan Fern, Prasad Tadepalli
    The problem solved here is imitation learning where your interaction with an expert is showing them two trajectories (that begin at the same state) and asking them which is better.  Something I've been thinking about recently -- very happy to see it work!
  • FastEx: Hash Clustering with Exponential Families
    Amr Ahmed, Sujith Ravi, Shravan M. Narayanamurthy, Alexander J. Smola
    The idea here is to replace the dot product between the parameters and sufficient statistics of an exp fam model with an approximate dot product achieved using locality sensitive hashing.  It takes a bit to figure out exactly how to do this.  Cool idea and nice speedups.
  • Identifiability and Unmixing of Latent Parse Trees
    Daniel Hsu, Sham M. Kakade, Percy Liang
    Short version: spectral learning for unsupervised parsing; the challenge is to get around the fact that different sentences have different structures, and "unmixing" is the method they propose to do this.  Also some identifiability results.
  • Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs
    Shay B. Cohen and Michael Collins
    Another spectral learning paper, this time for doing exact learning of latent-variable PCFGs.  Fast, and just slightly less good than EM.
  • Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
    Abner Guzman-Rivera, Dhruv Batra, Pushmeet Kohli
    Often we want our models to produce k-best outputs, but for some reason we only train them to produce one-best outputs and then just cross our fingers.  This paper shows that you can train directly to produce a good set of outputs (not necessarily diverse: just that it should contain the truth) and do better.  It's not convex, but the standard training is a good initializer.
  • [EDIT Dec 9, 11:12p PST -- FORGOT ONE!]
    Query Complexity of Derivative-Free Optimization
    Kevin G. Jamieson, Robert D. Nowak, Benjamin Recht
    This paper considers derivative free optimization with two types of oracles.  In one you can compute f(x) for any x with some noise (you're optimizing over x).  In the other, you can only ask whether f(x)>f(y) for two points x and y (again with noise).  It seems that the first is more powerful, but the result of this paper is that you get the same rates with the second!
  • I didn't see it, but Satoru Fujishige's talk Submodularity and Discrete Convexity in the Discrete Machine Learning workshop was supposedly wonderful.  I can't wait for the video.
  • Similarly, I heard that Bill Dolan's talk on Modeling Multilingual Grounded Language in the xLiTe workshop was very good.
  • Ryan Adams's talk on Building Probabilistic Models Around Deterministic Optimization Procedures in the "Perturbations, Optimization and Statistics" workshop (yeah, I couldn't figure out what the heck that meant either) was also very good.  The Perturb-and-MAP stuff and the Randomized Optimum models are high on my reading list, but I haven't gotten to them quite yet.
  • As always, Ryan McDonald and Ivan Titov gave very good talks in xLiTe, on Advances in Cross-Lingual Syntactic Transfer and Inducing Cross-Lingual Semantic Representations of Words, Phrases, Sentences and Events, respectively.
I'm sure there was lots of other stuff that was great and that I missed because I was skiing (er, working hard on NAACL).

Really my only gripe about NIPS this year was the venue.  I normally wouldn't take the time to say this, but since we'll be enjoying this place for the next few years, I figured I'd state what I saw as the major problems, some of which are fixable.  For those who didn't come, we're in Stateline, NV (on the border between CA and NV) in two casinos.  Since we're in NV, there is a subtle note of old cigarette on the nose fairly constantly.  There is also basically nowhere good to eat (especially if you have dietary restrictions) -- I think there are a half dozen places on Yelp with 3.5 stars or greater.  My favorite tweet during NIPS was from Jacob Eisenstein, who said: "stateline, nevada / there is nothing but starbucks / the saddest haiku".  Those are the "unfixables" that make me think twice about going to NIPS next year, but of course I'll go.

The things that I think are fixable... there was nowhere to sit.  Presumably this is because the casino wants you to sit only where they can take your money, but I had most of my long discussions either standing or sitting on the ground.  More chairs in hallways would be much appreciated.  There was almost no power in rooms, which could be solved by some power strips.  The way the rooms were divided for tutorials was really awkward, as the speaker was way over on one side of the room and the screen was on the other (and too high to point to), so it was basically like watching a video of slides online without ever seeing the presenter.  Not sure if that's fixable, but seems plausible.  And the walls between the workshop rooms were so thin that often I could hear another workshop's speaker better than I could hear the speaker in the workshop I was attending.  And the internet in my hotel room was virtually unusably slow (though the NIPS-specific internet was great).

26 September 2012

Sure, you can do that....

I'll warn in advance that this is probably one of the more controversial posts I've written, but realize that my goal is really to raise questions, not necessarily give answers.  It's just more fun to write strong rhetoric :).

Let me write down a simple Markov Chain:

  1. Download some data from the web
  2. Call part of that data the input and part of it the label
  3. Train a classifier on bag of words and get 84% accuracy
  4. Submit a paper to *ACL
  5. Go to 1
Such papers exist in the vision community, too, where you replace "bag of words" with "SIFT features" and "*ACL" with "CVPR/ICCV."  In that community (according to my one informant :P), such papers are called "data porn."  Turns out this is actually a term from journalism, in which one definition is "where journalists look for big, attention grabbing numbers or produce visualisations of data that add no value to the story."

There's a related paper that looks at this issue in one specific setting: predicting political outcomes.  On Arxiv back at the end of April, we got this wonderful, and wonderfully titled paper:
"I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" -- A Balanced Survey on Election Prediction using Twitter Data by Daniel Gayo-Avello
The thing I especially like about this paper is that it's not a complaint (like this blog post!) but rather a thoughtful critique of how one could do this sort of research right.  This includes actually looking at what has been done before (political scientists have been studying this issue for a long time and perhaps we should see what they have to say), thinking about what we can do to make our results more reproducible, and so on.

For me, personally, this goes back to my main reviewing criteria: "what did I learn from this paper?"  The problem is that in the extreme, cartoon version of a data porn paper (my 1-4 list above), the answer is that I learned that machine learning works pretty well, even when asked to do arbitrary tasks.  Well, actually I already knew that.  So I didn't really learn anything.

Now, of course, many data porn-esque papers aren't actually that bad.  There are many things one can do (and people often do do) that make these results interesting:
  • Picking a real problem -- i.e., one that someone else might actually care about.  There's a temptation (that I suffer from, too) of saying "well, people are interested in X, and X' is kind of like X, so let's try to predict X' and call it a day."  For example, in the context of looking at scientific articles, it's popular in many communities to predict future citation counts because we think that might be indicative of something else.  I've certainly been guilty of this.  But where this work can get interesting is if you're able to say "yes, I can collect data for X' and train a model there, but I'll actually evaluate it in terms of X, which is the thing that is actually interesting."
     
  • Once you pick a real problem, there's an additional challenge: other people (perhaps social scientists, political scientists, humanities researchers, etc.) have probably looked at this in lots of different lights before.  That's great!  Teach me what they've learned!  How, qualitatively, do your results compare to their hypotheses?  If they agree, then great.  If they disagree, then explain to me why this would happen: is there something your model can see that they cannot?  What's going on?
  • On the other hand, once you pick a real problem, there's a huge advantage: other people have looked at this and can help you design your model!  Whether you're doing something straightforward like linear classification/regression (with feature engineering) or something more in vogue, like building some complex Bayesian model, you need information sources (preferably beyond bag of words!) and all this past work can give you insights here.  Teach me how to think about the relationship between the input and the output, not just the fact that one exists.
In some sense, these things are obvious.  And of course I'm not saying that it's not okay to define new problems: that's part of what makes the world fun.  But I think it's prudent to be careful.

One attitude is "eh, such papers will die a natural death after people realize what's going on, they won't garner citations, no harm done."  I don't think this is altogether wrong.  Yes, maybe they push out better papers, but there's always going to be that effect, and it's very hard to evaluate "better."

The thing I'm more worried about is the impression that such work gives others of our community.  For instance, I'm sure we've all seen papers published in other venues that do NLP-ish things poorly (Joshua Goodman has his famous example in physics, but there's tons more).  The thing I worry about is that we do ourselves a disservice as a community when we try to claim that we're doing something interesting in other people's spaces without trying to understand and acknowledge what they're doing.

NLP obviously has a lot of potential impact on the world, especially in the social and humanities space, but really anywhere that we want to deal with text.  I'd like to see ourselves set up to succeed there, by working on real problems and making actual scientific contributions, in terms of new knowledge gathered and related to what was previously known.

15 September 2012

Somehow I totally missed NIPS workshops!

I don't know how or when it happened, but at some point the NIPS workshops were posted, papers are due about a week from now, and I completely missed it!  The list of workshops is here:

    http://nips.cc/Conferences/2012/Program/schedule.php?Session=Workshops

Since my job as a blogger is to express my opinion about things you don't want to hear my opinion about, I wish they'd select fewer workshops.  I've always felt that NIPS workshops are significantly better than *ACL workshops because they tend to be workshops and not mini-conferences (where "mini" is a euphemism for non-selective :P).  At NIPS workshops people go, really talk about problems, and it's really the best people and the best work in the area.  And while, yes, it's nice to be supportive of lots of areas, what ends up happening is that people jump between workshops because there are too many that interest them, and then you lose this community feeling.  This is especially troubling when workshops are already competing with skiing :).

Anyway, with that behind me, there are a number that NLP folks might find interesting:

With the deadlines so close I don't imagine anyone's going to be submitting stuff that they just started, but if you have things that already exist, NIPS is fun and it would be fun to see more NLPers there!

13 June 2012

NAACL 2012 Retrospective

Like many people, I spent last week in lovely Montreal (at least lovely for the second half) at NAACL.  Despite the somewhat unfortunate submission date, I thought the program this year was quite good.  Of course I didn't see every talk and haven't read many of the papers yet, but I figured I'd point out what I saw that I liked and other people can comment likewise.

Identifying High-Level Organizational Elements in Argumentative Discourse (Madnani, Heilman, Tetreault, Chodorow).  This is maybe one of the first discourse papers I've seen where I actually believe that they have a chance of solving the problem that they've set out.  Here, the problem is separating the meat (content of an essay) from the shell (the bits of discourse that hold the meat together).  It's a cool problem and their solution seems to work well.  Very nice paper.  (And Nitin's talk was great.)

Risk Training of Approximate CRF-Based NLP Systems (Stoyanov, Eisner).  This paper is basically about training approximate models based on some given loss function.  Reminds me a lot of the Ross et al. CVPR 2011 paper on Message-Passing.  It's a cool idea, and there's software available.  Being me, the thing I wonder the most about is whether you can achieve something similar being completely greedy, and then whether you need to do all this work to get a good decision function.  But that's me -- maybe other people like CRFs :).

MSR SPLAT, a language analysis toolkit (Quirk, Choudhury, Gao, Suzuki, Toutanova, Gamon, Yih, Cherry, Vanderwende).  This is a demo of a system where you send them a sentence and they tell you everything you want to know about it.  Never run your own parser/NER/etc. again.  And, having seen it in action at MSR, it's fast and high quality.

Parsing Time: Learning to Interpret Time Expressions (Angeli, Manning, Jurafsky).  This was a great paper about semantic interpretation via compositional semantics (something sort of like lambda calculus) for time expressions.  I cannot find myself getting super jazzed up about time, but it's a nice constrained problem and their solution is clean.  I'm actually thinking of using something like this (or a subset thereof) as a course project for the intro CL course in the Fall, since I'm always starved for something to do with compositional semantics.

Getting More from Morphology in Multilingual Dependency Parsing (Hohensee, Bender).  If you have morphology, you can do better parsing by modeling things like agreement (really?  people hadn't done this before??).  Caveat: they use gold-standard morphology.  But cool idea still.

Unsupervised Translation Sense Clustering (Bansal, DeNero, Lin).  If you want to build a bilingual dictionary from parallel text, you need to cluster translations into senses.  Here's a way to do it.  Nice improvements from using bilingual contexts, which was good to see.

I also feel like I should raise my glass to the organizers of NLP Idol and congrats to Ray for winning with Robert Wilensky's paper "PAM." (If anyone can find an online version, please comment!) Though I would actually encourage everyone to read all three papers if you haven't already.  They all changed how I was thinking about problems.  Here are the others: Burstiness (Church), Context (Akman), Supertagging (Bangalore, Joshi).

18 February 2012

Making sense of Wikipedia categories

Wikipedia's category hierarchy forms a graph. It's definitely cyclic (Category:Ethology belongs to Category:Behavior, which in turn belongs to Category:Ethology).

At any rate, did you know that "Chicago Stags coaches" are a subcategory of "Natural sciences"?  If you don't believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:

  • Biology
  • Zoology
  • Subfields of zoology
  • Ethology
  • Behavior
  • Human behavior
  • Recreation
  • Games
  • Ball games
  • Basketball
  • Basketball teams
  • Defunct basketball teams
  • Defunct National Basketball Association teams
  • Chicago Stags
  • Chicago Stags coaches
I guess it kind of makes sense.  There are some other fun ones, like "Rhaeto-Romance languages", "American World War I flying aces" and "1911 films". Of course, these are all quite deep in the "hierarchy" (all of those are at depth 15 or higher).

So if you're trying to actually find pages about Natural sciences, maybe it's enough to limit the depth of your breadth-first search down the graph.
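
Concretely, the depth-limited traversal is just a breadth-first search that stops expanding after some number of hops.  A rough sketch, assuming you've already loaded the category graph into a dict mapping each category to its list of subcategories (getting that dict out of the dumps is the annoying part, and "subcats" here is a made-up name):

    from collections import deque

    def categories_within_depth(subcats, root, max_depth):
        """BFS down the category graph; returns {category: depth} for everything
        within max_depth hops of root.  subcats maps a category to its subcategories."""
        seen = {root: 0}
        queue = deque([root])
        while queue:
            cat = queue.popleft()
            if seen[cat] == max_depth:
                continue
            for sub in subcats.get(cat, []):
                if sub not in seen:            # cycles (Ethology/Behavior) end here
                    seen[sub] = seen[cat] + 1
                    queue.append(sub)
        return seen

    # e.g. categories_within_depth(subcats, "Natural sciences", 4)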

This works sort of reasonably: things up to and including depth four are mostly sensible, including topics like "Neurochemistry", "Planktology" and "Chemical elements".  There are a few outliers, like "Earth observation satellites of Israel", which you could certainly argue is not a natural science.

At depth five, things become much more mixed.  On the one hand, you get categories you might like to include, like "Statins", "Hematology", "Lagoons" and "Satellites" (interesting that Satellites is actually deeper than the Israel thing).  But you also get a roughly equal number of weird things, like "Animals in popular culture" and "Human body positions".  It's still not 50/50, but it's getting murky.

At depth six, based on my quick perusal, it's about 50/50.

And although I haven't tried it, I suspect that if you use a starting point other than Natural sciences, the depth at which things get weird is going to be very different.

So I guess the question is how to deal with this.

One thought is to "hope" that editors of Wikipedia pages will list the categories of pages roughly in order of importance, so that you can assume that the first category listed for a page is "the" category for that page.  This would render the structure to be a tree.   For the above example, this would cut the list at "Subfields of zoology" because the first listed category for the Ethology category is "Behavioral sciences", not "Subfields of zoology."

Doing this seems to make life somewhat better; you cut out the Stags coaches, but you still get the "Chicago Stags draft picks" (at depth 17).  The path, if you care, is (Natural sciences -> Physical sciences -> Physics -> Fundamental physics concepts -> Matter -> Structure -> Difference -> Competition -> Competitions -> Sports competitions -> Sports leagues -> Sports leagues by country -> Sports leagues in the United States -> Basketball leagues in the United States -> National Basketball Association -> National Basketball Association draft picks).  Still doesn't feel like Natural sciences to me.  In fairness, at depth 6, life is much better.  You still get "Heating, ventilating, and air conditioning" but many of the weird entries have gone away.

Another idea is the following.  Despite not being a tree or DAG, there is a root to the Wikipedia hierarchy (called Category:Contents).  For each page/category you can compute its minimum depth from that Contents page.  Now, when you consider subpages of Natural sciences, you can limit yourself to pages whose shortest path goes through Natural sciences.  This basically tries to encode the idea that if the shallowest way to reach Biology is through Natural sciences, it's probably a natural science.
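
Implementation-wise this is just two breadth-first searches: one from Contents to get minimum depths, and one down from Natural sciences; a category's shortest path from the root passes through Natural sciences exactly when the two distances add up.  A sketch reusing categories_within_depth from the earlier snippet:

    def shortest_path_through(subcats, root, hub, max_depth=50):
        """Categories whose shortest path from root passes through hub, i.e.
        depth(root, cat) == depth(root, hub) + depth(hub, cat)."""
        depth_root = categories_within_depth(subcats, root, max_depth)
        depth_hub = categories_within_depth(subcats, hub, max_depth)
        hub_depth = depth_root[hub]
        return {cat for cat, d in depth_hub.items()
                if depth_root.get(cat) == hub_depth + d}

    # e.g. shortest_path_through(subcats, "Contents", "Natural sciences")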

This also fails.  For instance, the depth of "Natural sciences" (=5) is the same as the depth of "Natural sciences good articles", so if you start from Natural sciences, you'll actually exclude all the good articles!  Moreover, even if you insist that a shortest path go through Natural sciences, you'll notice that many editors have depth 5, so any page they've edited will be allowed.  Maybe this is a fluke, but "Biology lists" has depth of only 4, which means that anything that can be reached through "Biology lists" would be excluded, something we certainly wouldn't want to do.  There's also the issue that the hierarchy might be much bushier for some high-level topics than others, which makes comparing depths very difficult.

So, that leaves me not really knowing what to do.  Yes, I could compute unigram distributions over the pages in topics and cut when those distributions get too dissimilar, but (a) that's annoying and very computationally expensive, (b) requires you to look at the text of the pages which seems silly, (c) you now just have more hyperparameters to tune.  You could annotate it by hand ("is this a natural science") but that doesn't scale.  You could compute the graph Laplacian and look at flow and use "average path length" rather than shortest paths, but this is a pretty big graph that we're talking about.

Has anyone else tried and succeed at using the Wikipedia category structure?

11 February 2012

De-Authorship attribution

I received the following (slightly edited) question from my colleague Jon Katz a few days ago:

I was thinking about the problem of authorship attribution... Have people thought about the flip side of this problem? Namely, "anonymizing" text so that it would be hard to attribute it to any author?
This is something I've actually wondered about in the context of blogging for a while.  I noticed at some point that my "blogger voice" is very similar to my "reviewer voice" and started worrying that I might be too identifiable as a reviewer.  This might either be due to lexical choice ("bajillion" or "awesome") or due to some more subtle stylistic choices.

There is quite a bit of work on authorship attribution.  I think the first time I heard a talk on this topic was on March 24, 2004, when Shlomo Argamon gave a talk at ISI (no, I don't have an amazing memory, I cheated) on "On Writing, Our Selves: Explorations in Stylistic Text Categorization."  The basic hypothesis of the talk, at least as I remember it, was that if you're trying to do authorship attribution, you should throw out content words and focus on things like POS tag sequences, parse tree structures, and things like that.
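
In code, the "throw out content words" idea might look something like the sketch below: represent each document by function-word counts and POS-tag bigrams rather than by its vocabulary.  I'm using NLTK's tokenizer and tagger purely for illustration, and the tiny function-word list is a stand-in; Argamon's actual feature set was considerably richer:

    from collections import Counter
    import nltk   # assumes nltk plus its tokenizer/tagger models are installed

    FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "it", "is",
                      "was", "for", "on", "with", "as", "but", "not", "or", "by"}

    def style_features(text):
        """Content-blind stylistic features: function-word counts and POS-tag bigrams."""
        tokens = nltk.word_tokenize(text)
        tags = [tag for _, tag in nltk.pos_tag(tokens)]
        feats = Counter(w.lower() for w in tokens if w.lower() in FUNCTION_WORDS)
        feats.update(f"POS:{a}_{b}" for a, b in zip(tags, tags[1:]))
        return feats   # hand these to any off-the-shelf classifier

Train any classifier on these and you have a (crude) attribution system; the point is just that none of the features leak topic.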

There's been a lot of subsequent work in this and related areas.  One very related area is on things like trying to predict demographic information (age, gender, socio-economic status, education level, and, yes, astrological sign) from tweets, blog posts or emails (or other forms).  One of the key distinctions that I think is important in all of this work is whether the original author is intentionally trying to hide information about him or herself.  For instance, someone trying to impersonate Shakespeare, or a child predator pretending to be a different age or gender, or a job applicant trying to sound more educated than they really are.  This latter is a much harder problem because the stupid topically stereotypical features that pop out as being indicative (like men talking about "wives" and "football" and women talking about "husbands" and "yoga"), and the silly features that don't really tell us anything interesting (on Twitter, apparently men tend to put "http://" before URLs more than women -- who knew?), stop helping: these "pretenders" are going to intentionally try to hide that information (now that everyone knows to hide "http://" to trick gender recognizers!).  It also means that falling back on topic as a surrogate for demography should not work as well.  This seems to be a very different problem from trying to identify whether a blog post is written by me or by Jon, which should be 99.9% do-able by just looking at content words.

The reason I bring this all up is because we don't want to anonymize by changing the topic.  The topic needs to stay the same: we just need to cut out additional identifying information.  So, getting back to Jon's question, the most relevant work that I know of is on text steganography (by Ching-Yun Chang and Stephen Clark), where they use the ability to do paraphrasing to encode messages in text.  Aside from the challenge of making the output actually somewhat grammatical, the basic idea is that when you have two ways of saying the same thing (via paraphrases), you can choose the first one to encode a "0" and the second to encode a "1" and then use this to encode a message in seemingly-natural text.
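
The bit-encoding trick is easy to sketch if you pretend the paraphrase table and all the grammaticality issues are solved for you.  The table below is obviously a toy stand-in for a real paraphrase resource, and real systems (like Chang and Clark's) are far more careful about when a substitution is actually safe:

    # toy paraphrase pairs, assumed interchangeable in any context (they aren't, really)
    PAIRS = {"use": ("use", "utilize"), "utilize": ("use", "utilize"),
             "big": ("big", "large"), "large": ("big", "large"),
             "maybe": ("maybe", "perhaps"), "perhaps": ("maybe", "perhaps")}

    def embed_bits(text, bits):
        """Hide a bit string by choosing variant 0 or 1 at each paraphrasable word."""
        bits, out = list(bits), []
        for word in text.split():
            if bits and word.lower() in PAIRS:
                out.append(PAIRS[word.lower()][int(bits.pop(0))])
            else:
                out.append(word)
        if bits:
            raise ValueError("not enough paraphrasable slots for the message")
        return " ".join(out)

    def extract_bits(text):
        """Read off which variant appears at each paraphrasable slot.  A real system
        would also encode the message length so the decoder knows where to stop."""
        return [str(PAIRS[w.lower()].index(w.lower()))
                for w in text.split() if w.lower() in PAIRS]

With exactly two variants per slot you get one bit per paraphrasable phrase, which is why coverage of the paraphrase table matters so much.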

I also remember having a conversation a while ago with a (different) colleague about trying to build a chat system where you could pretend that you're chatting with someone famous (like Obama or Harry Potter or Scooby Doo).  A similar problem is trying to paraphrase my own writing to sound like someone else, but zoinks, that seems hard!  A basic approach would be to build a Scooby Doo language model (SDLM) and then run my blog posts through a paraphrase engine that uses the SDLM for producing the output.  My vague sense is that this would work pretty poorly, primarily because the subtlety of phrase structure selection would be lost on a highly-lexicalized language model.  I imagine you'd get some funny stuff out and it might be amusing to do, but I don't have time to try.

As far as pure anonymization goes, it seems like doing something similar to the steganography approach would work.  Here, what you could do is generate a random sequence of bits, and then "encode" that random sequence using the steganography system.  This would at least remove some identifying information.  But the goal of steganography isn't to change every phrase, just to change enough phrases that you can encode your message.  It also wouldn't solve the problem that perhaps you can identify a bit about an author by the lengths of their sentences.  Or their oscillation between long and short sentences.  This also wouldn't be hidden.

An alternative, human-in-the-loop approach might be simply to have an authorship recognition system running in your word processor, and then any time you type something that enables it to identify you, it could highlight it and you could be tasked with changing it.  I suspect this would be a frustrating, but fairly interesting experience (at least the first time).

p.s., I'm now officially tweeting on @haldaume3.