Summary
Linguistics
Debian Science Linguistics packages
This metapackage is part of the Debian Pure Blend "Debian Science"
and installs packages related to Linguistics.
The list to the right includes various software projects which are of some interest to the Debian Science Project. Currently, only a few of them are available as Debian packages. It is our goal, however, to include all software in Debian Science which can sensibly add to a high quality Debian Pure Blend.
If you discover a project that looks like a good candidate for Debian Science,
or if you have prepared an unofficial Debian package, please do not hesitate to
send a description of that project to the Debian Science mailing list.
|
Debian Science Linguistics packages
Official Debian packages with high relevance
|
Apertium
Shallow-transfer machine translation engine
|
| Versions of package apertium |
| Release | Version | Architectures |
| squeeze | 3.1.0-1.2 | amd64,armel,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,sparc |
| wheezy | 3.1.0-2 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 3.1.0-2 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 3.2.0 |
| Debtags of package apertium: |
| field | linguistics |
| role | program |
|
License: DFSG free
|
|
An open-source shallow-transfer machine translation
engine, Apertium is initially aimed at related-language pairs.
It uses finite-state transducers for lexical processing,
hidden Markov models for part-of-speech tagging, and
finite-state based chunking for structural transfer.
The system is largely based upon systems already developed by
the Transducens group at the Universitat d'Alacant, such as
interNOSTRUM (Spanish-Catalan, http://www.internostrum.com/welcome.php)
and Traductor Universia (Spanish-Portuguese,
http://traductor.universia.net).
It will be possible to use Apertium to build machine translation
systems for a variety of related-language pairs simply by providing
the linguistic data needed in the right format.
|
|
|
Artha
Handy off-line thesaurus based on WordNet
|
| Versions of package artha |
| Release | Version | Architectures |
| squeeze | 0.9.1-1 | amd64,armel,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,sparc |
| wheezy | 1.0.2-1 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 1.0.2-1 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 1.0.3 |
| Debtags of package artha: |
| field | linguistics |
| interface | x11 |
| role | program |
| uitoolkit | gtk |
| x11 | application |
|
License: DFSG free
|
|
Artha is an off-line English thesaurus with distinct features like:
- hot-key press word look-up (select text on any window and press
a preset hot-key for look-up)
- regular expressions based search (broaden search using wild-cards
like *, ?, etc.)
- passive desktop notifications (of word definitions for
uninterrupted work-flow)
- spelling suggestions (when the exact spelling is vague/not known)
Once launched, it monitors for a preset hot-key combination. When
some text is selected in any window and the hot-key is pressed, it
pops up with the word looked up. Should the user prefer passive
notifications, this can be done by enabling the notifications option.
When the term looked for is vague or not known, the search can be
broadened with regular expressions (*, ?, etc.) in the search string;
when a term is misspelled, spelling suggestions are offered.
For regular expressions based search to work, the wordnet-sense-index
package is required.
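As a rough illustration of the wildcard-style search described above, the following Python sketch matches patterns containing * and ? against a word list. It is not Artha's implementation: the word list and the function name wildcard_lookup are hypothetical, and the matching is delegated to Python's standard fnmatch module.

```python
from fnmatch import fnmatchcase

# Hypothetical word list standing in for the WordNet sense index.
WORDS = ["cat", "cot", "cut", "coat", "court", "dog"]

def wildcard_lookup(pattern, words=WORDS):
    """Return the words matching a pattern (* = any run, ? = one character)."""
    return [w for w in words if fnmatchcase(w, pattern)]

print(wildcard_lookup("c?t"))   # ['cat', 'cot', 'cut']
print(wildcard_lookup("co*t"))  # ['cot', 'coat', 'court']
```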
|
|
|
Dimbl
Distributed Memory Based Learner
|
| Versions of package dimbl |
| Release | Version | Architectures |
| wheezy | 0.11-1 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 0.11-1 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| Debtags of package dimbl: |
| role | program |
|
License: DFSG free
|
|
Dimbl is a wrapper around the k-nearest neighbor classifier in TiMBL, offering
parallel classification on multi-CPU machines. Dimbl splits the original
training set, builds separate TiMBL classifiers per training subset, and
merges their nearest-neighbor sets per classified instance.
Dimbl's features are:
- Wraps neatly around TiMBL, retaining all command line options;
- Knows what to do with your multiple, duo, or quad cores;
- Makes use of the OpenMP specification for parallel programming;
- Can attain superlinear speed gains compared to standard TiMBL.
Dimbl is a product of the ILK Research Group (Tilburg University, The
Netherlands).
If you do scientific research in Natural Language Processing using the
Memory-Based Learning technique, Dimbl will likely be of use to you.
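The split-and-merge idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Dimbl's code: the subsets are processed sequentially rather than in parallel, distance is a plain feature-overlap count, and all function names and the toy data are hypothetical.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions where two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest(shard, query, k):
    """The k nearest (distance, label) pairs within one training subset."""
    scored = [(overlap_distance(features, query), label) for features, label in shard]
    return sorted(scored)[:k]

def classify_sharded(training, query, k=3, n_shards=2):
    # Split the training set into n_shards subsets (round-robin).
    shards = [training[i::n_shards] for i in range(n_shards)]
    # Each subset yields its own nearest-neighbor set; a parallel system
    # would compute these concurrently, here they run one after another.
    candidates = [pair for shard in shards for pair in nearest(shard, query, k)]
    # Merge the per-subset sets, keep the overall k nearest, and vote.
    merged = sorted(candidates)[:k]
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]

# Toy data: (previous word, word) -> coarse word class.
training = [
    (("the", "cat"), "N"), (("the", "dog"), "N"), (("a", "cat"), "N"),
    (("to", "run"), "V"), (("to", "eat"), "V"), (("will", "run"), "V"),
]
print(classify_sharded(training, ("to", "sleep")))  # V
```

Because every subset returns its own k best candidates, the merged result contains the same k nearest neighbors that a single classifier over the full training set would find.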
|
|
|
Frog
tagger and parser for Dutch language
|
| Versions of package frog |
| Release | Version | Architectures |
| wheezy | 0.12.15-3 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 0.12.16-4 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 0.12.17 |
|
License: DFSG free
|
|
Memory-Based Learning (MBL) is a machine-learning method applicable to a wide
range of tasks in Natural Language Processing (NLP).
Frog is a modular system integrating a morphosyntactic tagger, lemmatizer,
morphological analyzer, and dependency parser for the Dutch language. It is
based upon its predecessor TADPOLE (TAgger, Dependency Parser, and
mOrphoLogical analyzEr). Using Memory-Based Learning techniques, Tadpole
tokenizes, tags, lemmatizes, and morphologically segments word tokens in
incoming Dutch UTF-8 text files, and assigns a dependency graph to each
sentence. Tadpole is particularly targeted at the increasing need for fast,
automatic NLP systems applicable to very large (multi-million to billion word)
document collections that are becoming available due to the progressive
digitization of both new and old textual data.
NB: Frog can be considered alpha software, and is in a fair state of flux.
Frog is a product of the ILK Research Group (Tilburg University,
The Netherlands) and the CLiPS Research Centre (University of Antwerp,
Belgium).
If you do scientific research in NLP, Frog will likely be of use to you.
|
|
|
Link-grammar
Carnegie Mellon University's link grammar parser
|
| Versions of package link-grammar |
| Release | Version | Architectures |
| squeeze | 4.6.7-1 | amd64,armel,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,sparc |
| wheezy | 4.7.4-2 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 4.7.4-2 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 4.7.11 |
| Debtags of package link-grammar: |
| field | linguistics |
| interface | commandline |
| role | program |
| use | checking |
| works-with | dictionary |
|
License: DFSG free
|
|
In Sleator, D. and Temperley, D. "Parsing English with a Link Grammar"
(1991), the authors defined a new formal grammatical system called a
"link grammar". A sequence of words is in the language of a link
grammar if there is a way to draw "links" between words in such a way
that the local requirements of each word are satisfied, the links do
not cross, and the words form a connected graph. The authors encoded
English grammar into such a system, and wrote this program to parse
English using this grammar.
link-grammar can be used for linguistic parsing for information
retrieval or extraction from natural language documents. It can also be
used as a grammar checker.
This package contains the user-executable binary.
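Two of the conditions named above, that links do not cross and that the words form a connected graph, are easy to check mechanically. The Python sketch below does only that; it is not the parser's algorithm, it ignores the per-word local requirements, and the function names and the example linkage are hypothetical. Links are given as pairs of word positions.

```python
def links_cross(a, b):
    """True if two links, given as (i, j) word-position pairs, cross."""
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j

def is_connected(n_words, links):
    """True if the links join all n_words positions (n_words >= 1)."""
    adjacency = {w: set() for w in range(n_words)}
    for i, j in links:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for neighbor in adjacency[stack.pop()] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return len(seen) == n_words

def is_valid_linkage(n_words, links):
    """Check the no-crossing and connectivity conditions only."""
    no_crossing = not any(
        links_cross(a, b) for idx, a in enumerate(links) for b in links[idx + 1:]
    )
    return no_crossing and is_connected(n_words, links)

# "the cat chased a mouse" with word positions 0..4 (illustrative links).
print(is_valid_linkage(5, [(0, 1), (1, 2), (2, 4), (3, 4)]))  # True
print(is_valid_linkage(5, [(0, 2), (1, 3), (3, 4)]))          # False
```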
|
|
|
Mbt
memory-based tagger-generator and tagger
|
| Versions of package mbt |
| Release | Version | Architectures |
| wheezy | 3.2.8-1 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 3.2.9-2 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 3.2.10 |
| Debtags of package mbt: |
| field | linguistics |
| role | program |
|
License: DFSG free
|
|
MBT is a memory-based tagger-generator and tagger in one. The tagger-generator
part can generate a sequence tagger on the basis of a training set of tagged
sequences; the tagger part can tag new sequences. MBT can, for instance, be
used to generate part-of-speech taggers or chunkers for natural language
processing. Features:
- Tagger generation: tagged text in, tagger out,
- Optional feedback loop: feed previous tag decision back to input of next
decision,
- Easily customizable feature representation; can incorporate user-provided
features,
- Automatic generation of separate sub-taggers for known words and unknown
words,
- Can make use of full algorithmic parameters of TiMBL.
MBT is a product of the ILK Research Group (Tilburg University, The
Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium).
If you do scientific research in natural language processing, MBT will
likely be of use to you.
|
|
|
Mbtserver
Server extensions for the MBT tagger
|
| Versions of package mbtserver |
| Release | Version | Architectures |
| wheezy | 0.5-2 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 0.6-1 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 0.7 |
|
License: DFSG free
|
|
MbtServer extends Mbt with a server layer, running as a TCP server. Mbt is a
memory-based tagger-generator and tagger for natural language processing.
MbtServer provides the possibility to access a trained tagger from multiple
sessions. It also allows one to run and access different taggers in parallel.
MbtServer is a product of the ILK Research Group (Tilburg University, The
Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium).
If you do scientific research in natural language processing, MbtServer will
likely be of use to you.
|
|
|
Timbl
Tilburg Memory Based Learner
|
| Versions of package timbl |
| Release | Version | Architectures |
| wheezy | 6.4.2-1 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 6.4.3-1 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 6.4.4 |
| Debtags of package timbl: |
| role | program |
|
License: DFSG free
|
|
Memory-Based Learning (MBL) is a machine-learning method applicable to a wide
range of tasks in Natural Language Processing (NLP).
The Tilburg Memory Based Learner, TiMBL, is a tool for NLP research, and for
many other domains where classification tasks are learned from examples. It
is an efficient implementation of a k-nearest neighbor classifier.
TiMBL's features are:
- Fast, decision-tree-based implementation of k-nearest neighbor
classification;
- Implementations of IB1 and IB2, IGTree, TRIBL, and TRIBL2 algorithms;
- Similarity metrics: Overlap, MVDM, Jeffrey Divergence, Dot product, Cosine;
- Feature weighting metrics: information gain, gain ratio, chi squared,
shared variance;
- Distance weighting metrics: inverse, inverse linear, exponential decay;
- Extensive verbosity options to inspect nearest neighbor sets;
- Server functionality and extensive API;
- Fast leave-one-out testing and internal cross-validation;
- Handles user-defined example weighting.
TiMBL is a product of the ILK Research Group (Tilburg University, The
Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium).
If you do scientific research in NLP, TiMBL will likely be of use to you.
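To make two of the listed features more concrete, the Python sketch below computes information gain per feature and uses the result as weights in an overlap distance. It is a textbook formulation under simplifying assumptions, not TiMBL's tree-based implementation; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(instances, labels, feature_index):
    """Reduction in class entropy obtained by knowing one feature's value."""
    by_value = defaultdict(list)
    for features, label in zip(instances, labels):
        by_value[features[feature_index]].append(label)
    remainder = sum(
        len(subset) / len(labels) * entropy(subset) for subset in by_value.values()
    )
    return entropy(labels) - remainder

def weighted_overlap(a, b, weights):
    """Sum of the weights of the features on which two instances differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

instances = [("sunny", "hot"), ("sunny", "cold"), ("rainy", "hot"), ("rainy", "cold")]
labels = ["out", "out", "in", "in"]
weights = [information_gain(instances, labels, i) for i in range(2)]
print(weights)  # [1.0, 0.0]
print(weighted_overlap(("sunny", "hot"), ("rainy", "hot"), weights))  # 1.0
```

In this toy data the first feature fully determines the class and the second carries no information, so a mismatch on the second feature does not increase the distance at all.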
|
|
|
Timblserver
Server extensions for Timbl
|
| Versions of package timblserver |
| Release | Version | Architectures |
| wheezy | 1.4-2 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 1.6-2 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 1.7 |
| Debtags of package timblserver: |
| role | program |
|
License: DFSG free
|
|
timblserver is a TiMBL wrapper; it adds server functionality to TiMBL. It
allows TiMBL to run multiple experiments as a TCP server, optionally via HTTP.
The Tilburg Memory Based Learner, TiMBL, is a tool for Natural Language
Processing research, and for many other domains where classification tasks are
learned from examples.
TimblServer is a product of the ILK Research Group (Tilburg University, The
Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium).
If you do scientific research in NLP, TimblServer will likely be of use to you.
|
|
|
Ucto
Unicode Tokenizer
|
| Versions of package ucto |
| Release | Version | Architectures |
| wheezy | 0.5.2-2 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 0.5.2-2 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 0.5.3 |
| Debtags of package ucto: |
| role | program |
|
License: DFSG free
|
|
Ucto can tokenize UTF-8 encoded text files (i.e. separate words from
punctuation, split sentences, generate n-grams), and offers several other
basic preprocessing steps (change case, count words/characters and reverse
lines) that make your text suited for further processing such as indexing,
part-of-speech tagging, or machine translation.
Ucto is a product of the ILK Research Group, Tilburg University (The
Netherlands).
If you are interested in machine parsing of UTF-8 encoded text files, e.g. to
do scientific research in natural language processing, ucto will likely be of
use to you.
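For readers unfamiliar with these preprocessing steps, the Python sketch below shows a very simple form of tokenization (separating words from punctuation) and n-gram generation. It is a minimal illustration only and makes no claim about how Ucto works internally; the function names are hypothetical and the single regular expression is far cruder than a real tokenizer.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of Unicode word characters; [^\w\s] matches one
    # punctuation or symbol character at a time.
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Hallo, wereld!")
print(tokens)             # ['Hallo', ',', 'wereld', '!']
print(ngrams(tokens, 2))  # [('Hallo', ','), (',', 'wereld'), ('wereld', '!')]
```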
|
|
|
Wordnet
electronic lexical database of English language
|
| Versions of package wordnet |
| Release | Version | Architectures |
| squeeze | 3.0-24 | amd64,armel,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,sparc |
| wheezy | 3.0-29 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 3.0-31 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| Debtags of package wordnet: |
| field | linguistics |
| interface | x11 |
| role | program |
| scope | application |
| uitoolkit | tk |
| use | checking |
| works-with | dictionary |
| x11 | application |
|
License: DFSG free
|
|
WordNet(C) is an on-line lexical reference system whose design is
inspired by current psycholinguistic theories of human lexical
memory. English nouns, verbs, adjectives and adverbs are organized
into synonym sets, each representing one underlying lexical
concept. Different relations link the synonym sets.
WordNet was developed by the Cognitive Science Laboratory at Princeton
University under the direction of Professor George A. Miller (Principal
Investigator).
WordNet is considered to be the most important resource available to
researchers in computational linguistics, text analysis, and many
related areas. Its design is inspired by current psycholinguistic and
computational theories of human lexical memory.
This package contains the binary and manpages of WordNet as well as general manpages.
|
|
Official Debian packages with lower relevance
|
Libfolia1-dev
implementation of the FoLiA document format
|
| Versions of package libfolia1-dev |
| Release | Version | Architectures |
| wheezy | 0.9-2 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 0.9-2 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 0.10 |
| Debtags of package libfolia1-dev: |
| devel | library |
| role | devel-lib |
|
License: DFSG free
|
|
FoLiA is an XML-based format for Linguistic Annotation suitable for
representing written language resources such as corpora.
Its goal is to unify a variety of linguistic annotations in one single rich
format, without committing to any particular standard annotation set.
Instead, it seeks to accommodate any desired system or tagset, and so offer
maximum flexibility. This makes FoLiA language independent.
See http://ilk.uvt.nl/folia/ for more information.
libfolia is a product of the ILK Research Group, Tilburg University (The
Netherlands).
This package provides the FoLiA header files required to compile C++ programs
that use libfolia.
|
|
|
Libmbt0-dev
memory-based tagger-generator and tagger - development
|
| Versions of package libmbt0-dev |
| Release | Version | Architectures |
| wheezy | 3.2.8-1 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 3.2.9-2 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 3.2.10 |
| Debtags of package libmbt0-dev: |
| devel | library |
| role | devel-lib |
|
License: DFSG free
|
|
MBT is a memory-based tagger-generator and tagger in one. The tagger-generator
part can generate a sequence tagger on the basis of a training set of tagged
sequences; the tagger part can tag new sequences. MBT can, for instance, be
used to generate part-of-speech taggers or chunkers for natural language
processing.
MBT is a product of the ILK Research Group (Tilburg University, The
Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium).
If you do scientific research in natural language processing, MBT will
likely be of use to you.
This package provides the header files required to compile C++ programs that
use libmbt.
|
|
|
Libtimbl3-dev
Tilburg Memory Based Learner - development
|
| Versions of package libtimbl3-dev |
| Release | Version | Architectures |
| wheezy | 6.4.2-1 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 6.4.3-1 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 6.4.4 |
| Debtags of package libtimbl3-dev: |
| devel | library |
| role | devel-lib |
|
License: DFSG free
|
|
The Tilburg Memory Based Learner, TiMBL, is a tool for Natural Language
Processing research, and for many other domains where classification tasks are
learned from examples. It is an efficient implementation of a k-nearest
neighbor classifier.
TiMBL is a product of the ILK Research Group (Tilburg University, The
Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium).
This package provides the TiMBL header files required to compile C++ programs
that use TiMBL.
|
|
|
Libtimblserver2-dev
Server extensions for Timbl - development
|
| Versions of package libtimblserver2-dev |
| Release | Version | Architectures |
| wheezy | 1.4-2 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 1.6-2 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 1.7 |
| Debtags of package libtimblserver2-dev: |
| devel | library |
| role | devel-lib |
|
License: DFSG free
|
|
timblserver is a TiMBL wrapper; it adds server functionality to TiMBL. It
allows TiMBL to run multiple experiments as a TCP server, optionally via HTTP.
The Tilburg Memory Based Learner, TiMBL, is a tool for Natural Language
Processing research, and for many other domains where classification tasks are
learned from examples.
TimblServer is a product of the ILK Research Group (Tilburg University, The
Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium).
This package provides the header files required to compile C++ programs that
use timblserver.
|
|
|
Libucto1-dev
Unicode Tokenizer - development
|
| Versions of package libucto1-dev |
| Release | Version | Architectures |
| wheezy | 0.5.2-2 | amd64,armel,armhf,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| sid | 0.5.2-2 | amd64,armel,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mips,mipsel,powerpc,s390,s390x,sparc |
| upstream | 0.5.3 |
| Debtags of package libucto1-dev: |
| devel | library |
| role | devel-lib |
|
License: DFSG free
|
|
Ucto can tokenize UTF-8 encoded text files (i.e. separate words from
punctuation, split sentences, generate n-grams), and offers several other
basic preprocessing steps (change case, count words/characters and reverse
lines) that make your text suited for further processing such as indexing,
part-of-speech tagging, or machine translation.
Ucto is a product of the ILK Research Group, Tilburg University (The
Netherlands).
This package provides the ucto header files required to compile C++ programs
that use ucto.
|
|
Debian packages in experimental
|
Sequitur-g2p
Grapheme to Phoneme conversion tool
|
| Versions of package sequitur-g2p |
| Release | Version | Architectures |
| experimental | 0+r1668-1 | amd64,armhf,hurd-i386,i386,ia64,kfreebsd-amd64,kfreebsd-i386,mipsel,powerpc,s390,s390x,sparc |
|
License: DFSG free
|
|
Sequitur G2P is a data-driven grapheme-to-phoneme converter. It can
be applied to any monotonous sequence translation problem, provided
the source and target alphabets are small (less than 255
symbols). Data-driven means that you need to train it with example
pronunciations. Training takes a pronunciation dictionary and
creates a model file. The model file can then be used to transcribe
words that were not in the dictionary.
|
|
No known packages available
|
Wnsqlbuilder
SQL version of WordNet 3.0
|
License: GPL
Debian package not available
|
|
WordNet SQL Builder is a Java utility that generates an SQL database from
the standard WordNet database as released by the WordNet Project (Princeton
University).
Features
- Support for MySQL and PostgreSQL.
- Complete port (however, orphaned morphological forms are dropped, and
so are VerbNet/XWordNet data that cannot be linked to WordNet entries).
- Incremental build support.
- Retains synset index as primary key, allowing easy reference to the
original WordNet database.
- Includes support for WordNet 3.0
- Includes support for WordNet 2.0 to 2.1, 2.1 to 3.0, 2.0 to 3.0 sense maps
- Includes support for VerbNet 2.3
- Includes support for XWordNet 2.0-1.1
- Ready-to-use database (see wnsqldatabase package in download section) including:
  - WordNet 3.0
  - WordNet 2.0 to 2.1, 2.1 to 3.0, 2.0 to 3.0 sense maps
  - VerbNet 2.3
  - XWordNet 2.0-1.1
  - British National Corpus statistical data (for commonly used words)
|
|