API and Documentation
class forpus.forpus.Corpus(source, target, fname_pattern='{author}_{title}')

Bases: object

Converts a plain text corpus into an NLP-specific corpus format.

Construct this class if you have a directory of plain text files (.txt) and want to convert the content of those files into an NLP-specific corpus format. In most cases, each NLP tool uses its own idiosyncratic input format. This class helps you convert a corpus to the desired format very easily.

This class does not store the whole corpus in RAM at once, which is useful when handling very large corpora. Documents are streamed from disk lazily, one at a time, and each file is closed before the next one is opened. Have a look at stream_corpus() if you are interested in how this is implemented.

There are plenty of formats available:
- JSON, see to_json()
- Document-term matrix, see to_document_term_matrix()
- Graph, see to_graph()
    - GEXF
    - GML
    - GraphML
    - Pajek
    - SparseGraph6
    - YAML
- David Blei’s LDA-C, see to_ldac()
- Thorsten Joachims’ SVMlight, see to_svmlight()
Once instantiated, you can convert the corpus only once. The concept of this library is to construct one Corpus instance for each target format. For example:
>>> CorpusJSON = Corpus(source='corpus', target='corpus_json')
>>> CorpusJSON.to_json()
>>> CorpusTEI = Corpus(source='corpus', target='corpus_tei')
>>> CorpusTEI.to_tei()
and so on…
This should help you keep an overview and avoid storing all kinds of different corpus formats in the same directory.
Args:
    source (str): The path to the corpus directory. This can be an absolute or a relative path.
    target (str): The path to the output directory. Again, either an absolute or a relative path.
    fname_pattern (str, optional): The pattern of the corpus's filenames. Metadata will be extracted from the filenames based on this pattern. If the pattern is None or does not match the structure, only the basename (without suffix) will be used as metadata. For example, with the pattern {author}_{title}, the filename parsons_social.txt yields parsons as the author and social as the title; a sketch of this extraction follows the attribute list below.
Attributes:
    source (str): The path to the corpus directory. This can be an absolute or a relative path.
    target (str): The path to the output directory. Again, either an absolute or a relative path.
    pattern (str, optional): The pattern of the corpus's filenames. Metadata will be extracted from the filenames based on this pattern. If the pattern is None or does not match the structure, only the basename (without suffix) will be used as metadata. For example, with the pattern {author}_{title}, the filename parsons_social.txt yields parsons as the author and social as the title.
    corpus (iterable): An iterable of (metadata, text) tuples. metadata is a pandas.DataFrame containing metadata extracted from the filename; text is the content of the file as str.
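The pattern-based metadata extraction can be pictured as turning the fname_pattern into a regular expression with one named group per placeholder. The following is only an illustrative sketch of that idea, not the library's actual implementation:

    import os
    import re

    def parse_fname(filepath, pattern='{author}_{title}'):
        # Illustrative helper, not part of forpus: extract metadata from a
        # filename such as 'parsons_social.txt' according to a pattern.
        basename = os.path.splitext(os.path.basename(filepath))[0]
        # Replace each {field} placeholder with a named, non-greedy group.
        regex = re.sub(r'\{(\w+)\}', r'(?P<\1>.+?)', pattern) + '$'
        match = re.match(regex, basename)
        if match is None:
            # Fall back to the bare basename, as described above.
            return {'basename': basename}
        return match.groupdict()

    parse_fname('corpus/parsons_social.txt')
    # {'author': 'parsons', 'title': 'social'}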
stream_corpus()

Streams the text corpus from disk.

This method is used to instantiate the corpus attribute. Each file in the directory source is opened, read, and yielded in turn.

Yields:
    A tuple of (metadata, text). metadata is a pandas DataFrame containing metadata extracted from the filename; text is the content of the file as str.
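A minimal sketch of such a lazy streaming generator might look as follows (an illustration only, assuming a pandas.DataFrame for the metadata as documented above; it is not the library's exact code):

    import glob
    import os
    import pandas as pd

    def stream_corpus(source):
        # Sketch only: yield (metadata, text) lazily, one file at a time.
        for filepath in sorted(glob.glob(os.path.join(source, '*.txt'))):
            basename = os.path.splitext(os.path.basename(filepath))[0]
            metadata = pd.DataFrame([{'basename': basename}])
            with open(filepath, 'r', encoding='utf-8') as f:
                text = f.read()
            # The file is already closed before the pair is yielded.
            yield metadata, text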
to_document_term_matrix(tokenizer, counter, **preprocessing)

Converts the corpus into a document-term matrix.

A document-term matrix (or term-document matrix) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

Args:
    tokenizer (function): A function for tokenization. You could use a simple regex-based function or one from NLTK.
    counter (function): A function which counts the elements of an iterable. There are various schemes for determining the value that each entry in the matrix should take; one such scheme is tf-idf. But you can simply use the Counter provided in the Python standard library.
    **preprocessing (function, optional): One or more functions which take the output of your tokenizer function as input. For example, you could write a function which counts the terms in your corpus and removes the 100 most frequent words.

Returns:
    None, but writes the formatted corpus to disk.
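For example, a minimal call might look like this (a sketch assuming the signature documented above; the target directory name is hypothetical, and the import path is taken from the class name forpus.forpus.Corpus):

>>> import re
>>> from collections import Counter
>>> from forpus.forpus import Corpus
>>> def tokenize(text):
...     return re.findall(r'\w+', text.lower())
>>> dtm_corpus = Corpus(source='corpus', target='corpus_dtm')
>>> dtm_corpus.to_document_term_matrix(tokenizer=tokenize, counter=Counter)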
to_graph(tokenizer, variant='gexf', **preprocessing)

Converts the corpus into a graph.

In mathematics, and more specifically in graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense related. This method creates one node per document (basically the filename), as well as one node per type in the corpus. Each document node has one or more attributes based on the metadata extracted from the filenames. If a type appears in a document, there is an edge between the document node and the type node.

You can serialize the graph to various graph-specific formats (some of them XML-based); see the variant argument below.

Args:
    tokenizer (function): A function for tokenization. You could use a simple regex-based function or one from NLTK.
    variant (str): The format you want to serialize the graph to. Possible values are gexf, gml, graphml, pajek, graph6, and yaml.
    **preprocessing (function, optional): One or more functions which take the output of your tokenizer function as input. For example, you could write a function which counts the terms in your corpus and removes the 100 most frequent words.

Returns:
    None, but writes the formatted corpus to disk.
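For example, to write the graph as GEXF (a sketch assuming the signature above; the target directory name is hypothetical):

>>> import re
>>> from forpus.forpus import Corpus
>>> def tokenize(text):
...     return re.findall(r'\w+', text.lower())
>>> graph_corpus = Corpus(source='corpus', target='corpus_graph')
>>> graph_corpus.to_graph(tokenizer=tokenize, variant='gexf')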
to_json(onefile=True)

Converts the corpus into JSON.

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. For more information on this format, follow this link.

This method converts your plain text corpus to JSON. Besides the content of your documents, metadata will be included in the JSON. Have a look at the basic description of Corpus for proper metadata recognition.

You have two options:

1. If you want to write the whole corpus into one single file, set the parameter onefile to True. Be aware that the whole corpus will then be held in RAM.
2. If onefile is False, one JSON file is written per document.

Args:
    onefile (bool): If True, the whole corpus is written to one file. Otherwise, each document is written to its own file.

Returns:
    None, but writes the formatted corpus to disk.
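For example, to write one JSON file per document (a sketch; the target directory name is hypothetical):

>>> from forpus.forpus import Corpus
>>> json_corpus = Corpus(source='corpus', target='corpus_json')
>>> json_corpus.to_json(onefile=False)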
to_ldac(tokenizer, counter, **preprocessing)

Converts the corpus into the LDA-C format.

In the LDA-C corpus format, each document is succinctly represented as a sparse vector of word counts. Each line is of the form:

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string. These lines will be written to the file corpus.ldac.

The vocabulary, exactly one term per line, will be written to the file corpus.tokens. Furthermore, metadata extracted from the filenames will be written to the file corpus.metadata.

Args:
    tokenizer (function): A function for tokenization. You could use a simple regex-based function or one from NLTK.
    counter (function): A function which counts the elements of an iterable. There are various schemes for determining the value that each entry should take; one such scheme is tf-idf. But you can simply use the Counter provided in the Python standard library.
    **preprocessing (function, optional): One or more functions which take the output of your tokenizer function as input. For example, you could write a function which counts the terms in your corpus and removes the 100 most frequent words.

Returns:
    None, but writes three files to disk.
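To make the line format concrete, here is a small illustration (not forpus code) of how one document's term counts map to an LDA-C line, given a vocabulary as it would appear in corpus.tokens:

    from collections import Counter

    vocabulary = ['social', 'system', 'action']   # one term per line in corpus.tokens
    counts = Counter({'social': 3, 'action': 1})  # term counts for one document
    pairs = ['{}:{}'.format(vocabulary.index(term), count)
             for term, count in counts.items()]
    line = '{} {}'.format(len(pairs), ' '.join(pairs))
    # line == '2 0:3 2:1' -- [M] followed by [term]:[count] pairs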
to_svmlight(tokenizer, counter, classes, **preprocessing)

Converts the corpus into the SVMlight format.

In the SVMlight corpus format, each document is succinctly represented as a sparse vector of word counts. Each line is of the form:

[c] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [c] is the identifier of the instance's class (in the context of topic modeling this is 0 for all instances), and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string. These lines will be written to the file corpus.svmlight.

The vocabulary, exactly one term per line, will be written to the file corpus.tokens. Furthermore, metadata extracted from the filenames will be written to the file corpus.metadata.

Args:
    tokenizer (function): A function for tokenization. You could use a simple regex-based function or one from NLTK.
    counter (function): A function which counts the elements of an iterable. There are various schemes for determining the value that each entry should take; one such scheme is tf-idf. But you can simply use the Counter provided in the Python standard library.
    classes (iterable): An iterable of the classes of the documents. For instance, +1 as the target value marks a positive example, and -1 a negative example.
    **preprocessing (function, optional): One or more functions which take the output of your tokenizer function as input. For example, you could write a function which counts the terms in your corpus and removes the 100 most frequent words.

Returns:
    None, but writes three files to disk.
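For example, for a corpus of three documents that all belong to one class (a sketch assuming the signature above; the target directory name and the number of documents are hypothetical):

>>> import re
>>> from collections import Counter
>>> from forpus.forpus import Corpus
>>> def tokenize(text):
...     return re.findall(r'\w+', text.lower())
>>> svm_corpus = Corpus(source='corpus', target='corpus_svmlight')
>>> svm_corpus.to_svmlight(tokenizer=tokenize, counter=Counter, classes=[0, 0, 0])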