Models for many natural language tasks benefit from the
flexibility to use overlapping, non-independent features.
For example, the need for labeled data can be drastically
reduced by taking advantage of domain knowledge in
the form of word lists, part-of-speech tags, character n-grams,
and capitalization patterns. While it is difficult to
capture such inter-dependent features with a generative
probabilistic model, conditionally-trained models, such
as conditional maximum entropy models, handle them
well. There has been significant work with such models
for greedy sequence modeling in NLP (Ratnaparkhi,
1996; Borthwick et al., 1998).
Conditional Random Fields (CRFs) (Lafferty et al.,
2001) are undirected graphical models, a special case of
which corresponds to conditionally-trained finite state machines.
While based on the same exponential form as
maximum entropy models, they have efficient procedures
for complete, non-greedy finite-state inference and training.
CRFs have recently shown empirical success in
POS tagging (Lafferty et al., 2001), noun phrase segmentation
(Sha and Pereira, 2003), and Chinese word segmentation
(McCallum and Feng, 2003).
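For concreteness, in the linear-chain case a CRF defines the conditional probability of a state sequence $\mathbf{s} = \langle s_1, \ldots, s_T \rangle$ given an observation sequence $\mathbf{o}$ as (the notation here is illustrative, following Lafferty et al., 2001):

\[
p_\Lambda(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z_{\mathbf{o}}} \exp \left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, \mathbf{o}, t) \right),
\]

where $Z_{\mathbf{o}}$ is a normalization factor over all state sequences, each $f_k$ is an arbitrary feature function over its arguments, and each $\lambda_k$ is a learned weight.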
Given these models’ great flexibility to include a wide
array of features, an important question remains: which
features should be used? For example, in some cases
capturing a particular word tri-gram is important; however,
there is not sufficient memory or computation to include
all word tri-grams. As the number of overlapping atomic
features increases, so do the difficulty and importance of
constructing only the most useful feature combinations.
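As a toy illustration of this blow-up (hypothetical feature names and candidate-generation scheme, not the paper's implementation), conjoining every current feature with every atomic feature multiplies the candidate pool at each round:

```python
from itertools import product

# Hypothetical atomic features over words, POS tags, and orthography.
atomic = ["word=new", "word=york", "pos=NNP", "cap=Xx", "suffix=-ing"]

selected = list(atomic)  # features currently in the model
for round_no in range(1, 4):
    # Conjoin every current feature with every atomic feature.
    candidates = {f"{a} & {b}" for a, b in product(selected, atomic) if a != b}
    print(f"round {round_no}: {len(candidates)} candidate conjunctions")
    # A real inducer would keep only the candidates whose estimated
    # log-likelihood gain is highest; keeping everything, as here,
    # is exactly what becomes infeasible.
    selected = sorted(candidates)
```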
This paper presents a feature induction method for
CRFs. Founded on the principle of constructing only
those feature conjunctions that significantly increase log-likelihood,
the approach builds on that of Della Pietra et
al. (1997), but is altered to work with conditional rather
than joint probabilities, and with a mean-field approximation
and other modifications that improve
efficiency specifically for a sequence model. In comparison
with traditional approaches, automated feature induction
offers both improved accuracy and significant reduction
in feature count; it enables the use of richer, higher-order
Markov models, and offers more freedom to liberally
guess about which atomic features may be relevant.
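The induction criterion can be stated compactly. As a sketch based on Della Pietra et al. (1997), adapted to conditional likelihood (this notation is ours, not necessarily the paper's exact formulation), the gain of a candidate feature $g$ under current parameters $\Lambda$ is

\[
G_\Lambda(g) = \max_{\mu} \left[ \mathcal{L}(\Lambda \cup \{ \mu g \}) - \mathcal{L}(\Lambda) \right],
\]

the improvement in conditional log-likelihood from adding $g$ with its best-fitting weight $\mu$ while holding the existing parameters fixed; only the highest-gain candidates are added to the model.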