A library for converting plain text corpora into various corpus formats.
Most NLP tools expect their own idiosyncratic input format. This library helps you convert a corpus to the desired format easily.
It is called Forpus because it formats a corpus; Forpus is also a genus of parrots in the family Psittacidae.
This library supports conversion to the following formats:
- JSON
- Document-term matrix
- David Blei’s LDA-C
- Thorsten Joachims’ SVMlight
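
To give a feel for what two of these target formats look like, here is a minimal sketch in plain Python. It does not use the Forpus API; the helper names `build_vocabulary`, `to_lda_c`, and `to_svmlight` are hypothetical and exist only for this illustration. It assumes documents are already tokenized, that LDA-C lines have the form `M id:count ...` with zero-based term IDs, and that SVMlight lines have the form `target id:value ...` with feature IDs starting at 1 in increasing order.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to an integer ID, in order of first appearance."""
    vocabulary = {}
    for tokens in documents:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_lda_c(tokens, vocabulary):
    """Format one document as an LDA-C line: 'M id:count id:count ...'."""
    counts = Counter(vocabulary[token] for token in tokens)
    pairs = " ".join(f"{term_id}:{count}" for term_id, count in sorted(counts.items()))
    # M is the number of distinct terms in the document.
    return f"{len(counts)} {pairs}"


def to_svmlight(tokens, vocabulary, label=0):
    """Format one document as an SVMlight line: 'target id:value ...'.

    The label is a placeholder chosen by the caller (assumption for this sketch).
    """
    # Shift IDs by one because SVMlight feature numbers start at 1.
    counts = Counter(vocabulary[token] + 1 for token in tokens)
    pairs = " ".join(f"{feat_id}:{count}" for feat_id, count in sorted(counts.items()))
    return f"{label} {pairs}"


# Two tiny, already tokenized example documents.
documents = [["the", "parrot", "the"], ["parrot", "green"]]
vocabulary = build_vocabulary(documents)

for tokens in documents:
    print(to_lda_c(tokens, vocabulary))
    print(to_svmlight(tokens, vocabulary, label=1))
```

Both formats store the same term counts; they differ only in the leading field and in where the term numbering starts, which is the kind of detail a conversion library takes care of for you.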
See Getting Started for how to install Forpus.