Wednesday, October 21, 2015

TreeTagger - a language tagger / stemmer

The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. The TreeTagger has been successfully used to tag German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish and Old French texts, and it can be adapted to other languages if a lexicon and a manually tagged training corpus are available.

Sample output:
word        pos   lemma
The         DT    the
TreeTagger  NP    TreeTagger
is          VBZ   be
easy        JJ    easy
to          TO    to
use         VB    use
.           SENT  .
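To work with output like the sample above in a script, each line can be split into its three fields. The following is only a minimal sketch in Python, assuming one token per line with word, tag and lemma separated by whitespace and no header row; the names TaggedToken and parse_tagged_lines are made up for this illustration and are not part of the TreeTagger distribution.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    """One line of tagger output: surface word, POS tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagged_lines(text: str) -> List[TaggedToken]:
    """Parse word/pos/lemma lines into TaggedToken records.

    Assumption: fields are separated by whitespace (tabs or spaces)
    and tokens contain no internal whitespace. Lines that do not have
    exactly three fields (for example blank lines) are skipped.
    """
    tokens = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        tokens.append(TaggedToken(*fields))
    return tokens


# The sample sentence from this post, without the header row.
sample = """\
The\tDT\tthe
TreeTagger\tNP\tTreeTagger
is\tVBZ\tbe
easy\tJJ\teasy
to\tTO\tto
use\tVB\tuse
.\tSENT\t.
"""

tokens = parse_tagged_lines(sample)
lemmas = [t.lemma for t in tokens]
verbs = [t.word for t in tokens if t.pos.startswith("VB")]
```

Here lemmas collects the lemma column (this is the "stemmer" side of the tool), and verbs picks out the words whose tag starts with VB.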

The TreeTagger can also be used as a chunker for English, German, French, and Spanish.

Executables for Linux, Windows PCs and Intel Macs, as well as parameter files for various languages, can be downloaded via the links below.

This software is freely available for research, education and evaluation.

