API and Documentation
class forpus.forpus.Corpus(source, target, fname_pattern='{author}_{title}')

Bases: object

Converts a plain text corpus into an NLP-specific corpus format.

Construct this class if you have a directory of plain text files (.txt) and want to convert the content of those files into an NLP-specific corpus format. In most cases, each NLP tool uses its own idiosyncratic input format. This class helps you convert a corpus to the desired format very easily.

This class does not store the whole corpus in RAM at once, which is useful when handling very large corpora. Documents are streamed from disk lazily, one at a time, and each file is closed before the next one is opened. Have a look at stream_corpus() if you are interested in how this is implemented.

There are plenty of formats available:
- JSON, see to_json()
- Document-term matrix, see to_document_term_matrix()
- Graph, see to_graph()
    - GEXF
    - GML
    - GraphML
    - Pajek
    - SparseGraph6
    - YAML
- David Blei’s LDA-C, see to_ldac()
- Thorsten Joachims’ SVMlight, see to_svmlight()
Once instantiated, you can convert the corpus only once. The concept of this library is to construct one Corpus instance for each target format. For example:
>>> CorpusJSON = Corpus(source='corpus', target='corpus_json')
>>> CorpusJSON.to_json()
>>> CorpusTEI = Corpus(source='corpus', target='corpus_tei')
>>> CorpusTEI.to_tei()
and so on…
This should help you keep an overview and avoid storing all kinds of different corpus formats in the same directory.
Args:
    source (str): The path to the corpus directory. This can be an absolute or a relative path.
    target (str): The path to the output directory. Again, either an absolute or a relative path.
    fname_pattern (str, optional): The pattern of the corpus's filenames. Metadata will be extracted from the filenames based on this pattern. If the pattern is None or does not match the structure, only the basename (without suffix) will be used as metadata. For example, with the pattern {author}_{title}, the filename parsons_social.txt yields parsons as the author and social as the title; a sketch of this extraction follows the attribute list below.
Attributes:
    source (str): The path to the corpus directory. This can be an absolute or a relative path.
    target (str): The path to the output directory. Again, either an absolute or a relative path.
    pattern (str, optional): The pattern of the corpus's filenames. Metadata will be extracted from the filenames based on this pattern. If the pattern is None or does not match the structure, only the basename (without suffix) will be used as metadata. For example, with the pattern {author}_{title}, the filename parsons_social.txt yields parsons as the author and social as the title.
    corpus (iterable): An iterable of (metadata, text) tuples. metadata is a pandas.DataFrame containing metadata extracted from the filename; text is the content of the file as str.
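The pattern-based metadata extraction can be pictured as turning the fname_pattern into a regular expression with one named group per placeholder. The following is only an illustrative sketch of that idea, not the library's actual implementation:

    import os
    import re

    def parse_fname(filepath, pattern='{author}_{title}'):
        # Illustrative helper, not part of forpus: extract metadata from a
        # filename such as 'parsons_social.txt' according to a pattern.
        basename = os.path.splitext(os.path.basename(filepath))[0]
        # Replace each {field} placeholder with a named, non-greedy group.
        regex = re.sub(r'\{(\w+)\}', r'(?P<\1>.+?)', pattern) + '$'
        match = re.match(regex, basename)
        if match is None:
            # Fall back to the bare basename, as described above.
            return {'basename': basename}
        return match.groupdict()

    parse_fname('corpus/parsons_social.txt')
    # {'author': 'parsons', 'title': 'social'}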
stream_corpus()

Streams the text corpus from disk.

This method is used to instantiate the corpus attribute. Each file in the directory source is opened, read, and yielded in turn.

Yields:
    A tuple of (metadata, text). metadata is a pandas DataFrame containing metadata extracted from the filename; text is the content of the file as str.
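A minimal sketch of such a lazy streaming generator might look as follows (an illustration only, assuming a pandas.DataFrame for the metadata as documented above; it is not the library's exact code):

    import glob
    import os
    import pandas as pd

    def stream_corpus(source):
        # Sketch only: yield (metadata, text) lazily, one file at a time.
        for filepath in sorted(glob.glob(os.path.join(source, '*.txt'))):
            basename = os.path.splitext(os.path.basename(filepath))[0]
            metadata = pd.DataFrame([{'basename': basename}])
            with open(filepath, 'r', encoding='utf-8') as f:
                text = f.read()
            # The file is already closed before the pair is yielded.
            yield metadata, text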
to_document_term_matrix(tokenizer, counter, **preprocessing)

Converts the corpus into a document-term matrix.

A document-term matrix (or term-document matrix) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

Args:
    tokenizer (function): A function for tokenization. You could use a simple regex-based function or one from NLTK.
    counter (function): A function which counts the elements of an iterable. There are various schemes for determining the value that each entry in the matrix should take; one such scheme is tf-idf. But you can simply use the Counter provided in the Python standard library.
    **preprocessing (function, optional): One or more functions which take the output of your tokenizer function as input. For example, you could write a function which counts the terms in your corpus and removes the 100 most frequent words.

Returns:
    None, but writes the formatted corpus to disk.
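For example, a minimal call might look like this (a sketch assuming the signature documented above; the target directory name is hypothetical, and the import path is taken from the class name forpus.forpus.Corpus):

>>> import re
>>> from collections import Counter
>>> from forpus.forpus import Corpus
>>> def tokenize(text):
...     return re.findall(r'\w+', text.lower())
>>> dtm_corpus = Corpus(source='corpus', target='corpus_dtm')
>>> dtm_corpus.to_document_term_matrix(tokenizer=tokenize, counter=Counter)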
to_graph(tokenizer, variant='gexf', **preprocessing)

Converts the corpus into a graph.

In mathematics, and more specifically in graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense related. This method creates one node per document (basically the filename), as well as one node per type in the corpus. Each document node has one or more attributes based on the metadata extracted from the filenames. If a type appears in a document, there is an edge between the document node and the type node.

You can serialize the graph to various graph-specific formats (some of them XML-based); see the variant argument below.

Args:
    tokenizer (function): A function for tokenization. You could use a simple regex-based function or one from NLTK.
    variant (str): The format you want to serialize the graph to. Possible values are gexf, gml, graphml, pajek, graph6, and yaml.
    **preprocessing (function, optional): One or more functions which take the output of your tokenizer function as input. For example, you could write a function which counts the terms in your corpus and removes the 100 most frequent words.

Returns:
    None, but writes the formatted corpus to disk.
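For example, to write the graph as GEXF (a sketch assuming the signature above; the target directory name is hypothetical):

>>> import re
>>> from forpus.forpus import Corpus
>>> def tokenize(text):
...     return re.findall(r'\w+', text.lower())
>>> graph_corpus = Corpus(source='corpus', target='corpus_graph')
>>> graph_corpus.to_graph(tokenizer=tokenize, variant='gexf')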
to_json(onefile=True)

Converts the corpus into JSON.

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. For more information on this format, follow this link.

This method converts your plain text corpus to JSON. Besides the content of your documents, metadata will be included in the JSON. Have a look at the basic description of Corpus for proper metadata recognition.

You have two options:

1. If you want to write the whole corpus into one single file, set the parameter onefile to True. Be aware that the whole corpus will then be held in RAM.
2. If onefile is False, one JSON file is written per document.

Args:
    onefile (bool): If True, the whole corpus is written to one file. Otherwise, each document is written to its own file.

Returns:
    None, but writes the formatted corpus to disk.
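For example, to write one JSON file per document (a sketch; the target directory name is hypothetical):

>>> from forpus.forpus import Corpus
>>> json_corpus = Corpus(source='corpus', target='corpus_json')
>>> json_corpus.to_json(onefile=False)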
to_ldac(tokenizer, counter, **preprocessing)

Converts the corpus into the LDA-C format.

In the LDA-C corpus format, each document is succinctly represented as a sparse vector of word counts. Each line is of the form:

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string. These lines will be written to the file corpus.ldac.

The vocabulary, exactly one term per line, will be written to the file corpus.tokens. Furthermore, metadata extracted from the filenames will be written to the file corpus.metadata.

Args:
    tokenizer (function): A function for tokenization. You could use a simple regex-based function or one from NLTK.
    counter (function): A function which counts the elements of an iterable. There are various schemes for determining the value that each entry should take; one such scheme is tf-idf. But you can simply use the Counter provided in the Python standard library.
    **preprocessing (function, optional): One or more functions which take the output of your tokenizer function as input. For example, you could write a function which counts the terms in your corpus and removes the 100 most frequent words.

Returns:
    None, but writes three files to disk.
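To make the line format concrete, here is a small illustration (not forpus code) of how one document's term counts map to an LDA-C line, given a vocabulary as it would appear in corpus.tokens:

    from collections import Counter

    vocabulary = ['social', 'system', 'action']   # one term per line in corpus.tokens
    counts = Counter({'social': 3, 'action': 1})  # term counts for one document
    pairs = ['{}:{}'.format(vocabulary.index(term), count)
             for term, count in counts.items()]
    line = '{} {}'.format(len(pairs), ' '.join(pairs))
    # line == '2 0:3 2:1' -- [M] followed by [term]:[count] pairs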
to_svmlight(tokenizer, counter, classes, **preprocessing)

Converts the corpus into the SVMlight format.

In the SVMlight corpus format, each document is succinctly represented as a sparse vector of word counts. Each line is of the form:

[c] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [c] is the identifier of the instance's class (in the context of topic modeling this is 0 for all instances), and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string. These lines will be written to the file corpus.svmlight.

The vocabulary, exactly one term per line, will be written to the file corpus.tokens. Furthermore, metadata extracted from the filenames will be written to the file corpus.metadata.

Args:
    tokenizer (function): A function for tokenization. You could use a simple regex-based function or one from NLTK.
    counter (function): A function which counts the elements of an iterable. There are various schemes for determining the value that each entry should take; one such scheme is tf-idf. But you can simply use the Counter provided in the Python standard library.
    classes (iterable): An iterable of the classes of the documents. For instance, +1 as the target value marks a positive example, and -1 a negative example.
    **preprocessing (function, optional): One or more functions which take the output of your tokenizer function as input. For example, you could write a function which counts the terms in your corpus and removes the 100 most frequent words.

Returns:
    None, but writes three files to disk.
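For example, for a corpus of three documents that all belong to one class (a sketch assuming the signature above; the target directory name and the number of documents are hypothetical):

>>> import re
>>> from collections import Counter
>>> from forpus.forpus import Corpus
>>> def tokenize(text):
...     return re.findall(r'\w+', text.lower())
>>> svm_corpus = Corpus(source='corpus', target='corpus_svmlight')
>>> svm_corpus.to_svmlight(tokenizer=tokenize, counter=Counter, classes=[0, 0, 0])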