Modified Mon Mar 27 09:29:19 2017 by Eric.Woldridge.
Challenge Problem #5: Natural Language Processing
Context-free grammars (CFGs) provide a simple model for the structure of language and are widely applied in natural language processing systems. Probabilistic context-free grammars (PCFGs) extend CFGs to define a probability distribution over the sentences and parse trees generated by the grammar by specifying a separate multinomial distribution over the expansion rules of each nonterminal symbol.
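To make this concrete, here is a minimal sketch of a PCFG as one multinomial per nonterminal, with a top-down sampler. The toy grammar and all names are illustrative, not part of the challenge data:

```python
import random

# Hypothetical toy grammar: each nonterminal maps to a list of
# (right-hand side, probability) pairs; probabilities per nonterminal sum to 1.
PCFG = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Det", "N"), 0.7), (("N",), 0.3)],
    "VP":  [(("V", "NP"), 0.6), (("V",), 0.4)],
    "Det": [(("the",), 1.0)],
    "N":   [(("dog",), 0.5), (("cat",), 0.5)],
    "V":   [(("saw",), 1.0)],
}

def sample(symbol="S"):
    """Expand a symbol top-down; each rule choice is an independent draw
    from that nonterminal's multinomial (the PCFG independence assumption)."""
    if symbol not in PCFG:  # terminal: emit the word
        return [symbol]
    rhss, probs = zip(*PCFG[symbol])
    rhs = random.choices(rhss, weights=probs)[0]
    words = []
    for child in rhs:
        words.extend(sample(child))
    return words
```

Note that each call to `random.choices` is independent of every other, which is exactly the limitation discussed next.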
A weakness of both CFGs and PCFGs is that each time a nonterminal is expanded using one of its rules, the choice is made independently of all other choices. This independence prevents PCFGs from capturing many important linguistic regularities; for example, a plain PCFG cannot model the tendency of noun phrases in subject position to expand differently from those in object position. A standard remedy is to annotate each nonterminal with contextual information, refining the grammar's categories.
In this challenge problem, we will investigate an approach pioneered by Matsuzaki et al. in which nonterminal annotations are treated as latent variables. That is, we view the parse trees in the treebank as having missing annotations that must be inferred from the data. Recent work on this model includes Petrov et al. and Cohen et al.
Existing implementations of these models are complex software systems; Petrov's system, for example, comprises 41,240 lines of code (including comments). Probabilistic programming languages hold the promise of making such programs much easier to write. A second opportunity is to demonstrate that different probabilistic model components (finite mixtures, HDP mixtures) can be implemented as modular components and combined flexibly.
This challenge problem will be presented in two stages:
- Stage 1: Fit a standard PCFG to a subset of the WSJ corpus.
- Stage 2: Fit a latent PCFG to a subset of the WSJ corpus. The number of latent annotations will be fixed.
- Stage 2 HDP option: Fit a latent PCFG to a subset of the WSJ corpus. The number of latent annotations will be flexible and modeled via a Hierarchical Dirichlet Process.
Training data will consist of a text corpus (from the WSJ section of OntoNotes 5.0) with associated constituency-based parse trees (i.e., each preterminal node dominates exactly one terminal). The trees will be binarized, a standard transformation intended to simplify the parsing code. A small synthetic data set will be provided for program development and testing.
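Binarization replaces every node with more than two children by a chain of two-child nodes. A minimal right-binarization sketch follows; the `@`-prefixed labels for intermediate nodes are an assumed convention here (real systems such as the Berkeley parser use their own markers):

```python
def binarize(tree):
    """Right-binarize an n-ary tree given as (label, children) tuples,
    where leaves are plain strings. Intermediate nodes get an '@'-prefixed
    label (an illustrative convention, not the challenge's required format)."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize(c) for c in children]
    while len(children) > 2:
        # Fold the rightmost two children under a new intermediate node.
        children = children[:-2] + [("@" + label, children[-2:])]
    return (label, children)
```

For example, `("NP", ["the", "big", "red", "dog"])` becomes `("NP", ["the", ("@NP", ["big", ("@NP", ["red", "dog"])])])`.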
July 2015: Probabilistic Context-Free Grammar
The initial challenge will be to fit a probabilistic context-free grammar (PCFG) to a training set of parse trees (a subset of the Penn WSJ Treebank drawn from the OntoNotes corpus). Your program will output the fitted PCFG in a designated format, which will then be evaluated by loading it into the Berkeley parser and parsing an evaluation set of documents.
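With fully observed trees, the maximum-likelihood PCFG is given by relative frequencies: count each rule and normalize by the total count of its left-hand side. A sketch, assuming trees are represented as `(label, children)` tuples with string leaves (the representation is an assumption, not the challenge's designated format):

```python
from collections import Counter, defaultdict

def extract_rules(tree, counts):
    """Accumulate (lhs, rhs) rule occurrences from one (label, children) tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            extract_rules(c, counts)

def fit_pcfg(treebank):
    """Maximum-likelihood PCFG: each rule's probability is its relative
    frequency among all expansions of its left-hand side."""
    counts = Counter()
    for tree in treebank:
        extract_rules(tree, counts)
    totals = defaultdict(float)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}
```

Smoothing and handling of unknown words are omitted here for brevity, though a real submission would need to address both.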
January 2016: PCFG with Latent Annotations
The PCFG-LA model is an extension of the PCFG in which each nonterminal is assumed to carry a latent annotation. Your program will fit this model to the same Penn Treebank data, but now you will need to learn the latent annotations of the grammar rules as well as their probability parameters. We will specify the number of latent annotations for each nonterminal. As a stretch goal, ambitious teams are encouraged to replace this fixed number of latent annotations with a variable number, modeled using a Hierarchical Dirichlet Process (HDP) prior. In either case, your code will output the fitted PCFG-LA model in a designated format. This will be fed to the Berkeley parser, which will compute the MAP parse trees on the test corpus by marginalizing out the latent annotations.
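To see what must be learned: with k latent annotations, each nonterminal X is split into (X,0), ..., (X,k-1), so a binary rule spawns up to k^3 annotated variants whose probabilities are then typically fit with EM. The splitting step can be sketched as follows; the uniform-plus-jitter initialization is a common symmetry-breaking heuristic, and the representation here is an assumption, not the required format:

```python
import itertools
import random

def split_grammar(pcfg, k=2, noise=0.01, seed=0):
    """Initialize a PCFG-LA from a plain PCFG {(lhs, rhs): prob} by giving
    each nonterminal k latent annotations. Child annotations start near
    uniform, with small random jitter so EM can break symmetry."""
    rng = random.Random(seed)
    nonterminals = {lhs for lhs, _ in pcfg}
    annotated = {}
    for (lhs, rhs), p in pcfg.items():
        for a in range(k):
            # Enumerate annotation choices for each nonterminal child;
            # terminals (symbols never appearing as an lhs) are not split.
            child_opts = [range(k) if sym in nonterminals else [None]
                          for sym in rhs]
            combos = list(itertools.product(*child_opts))
            for combo in combos:
                key = ((lhs, a),
                       tuple(sym if b is None else (sym, b)
                             for sym, b in zip(rhs, combo)))
                annotated[key] = p / len(combos) * (1 + noise * rng.uniform(-1, 1))
    return annotated
```

An HDP prior would replace the fixed `k` with a distribution over the number of annotations per nonterminal; that extension is beyond this sketch.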
The OntoNotes corpus is available at no cost from the Linguistic Data Consortium (LDC) to both member and non-member organizations.
LDC members can access the corpus using their membership credentials.