segram.nlp.corpus module

class segram.nlp.corpus.Corpus(vocab: Vocab, nlp: Language | None = None, *, count_method: Literal['words', 'lower', 'lemmas'] = 'lemmas', resolve_coref: bool = True)[source]

Bases: Mapping

Corpus class.

token_dist

Token distribution.

count

Whether raw words, lowercased words or lemmas are counted.

resolve_coref

If True, token coreferences are resolved when calculating token text and lemma frequency distributions.
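
A minimal usage sketch (the "en_core_web_sm" model name is an assumption; any spaCy pipeline prepared for segram should work):

>>> import spacy
>>> from segram.nlp.corpus import Corpus
>>> nlp = spacy.load("en_core_web_sm")  # assumed model; use a segram-ready pipeline
>>> corpus = Corpus(nlp.vocab, nlp, count_method="lemmas", resolve_coref=True)
>>> len(corpus)  # Mapping interface; a fresh corpus holds no documents
0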

add_doc(doc: Doc | str) → None[source]

Add document to the corpus.

The method recognizes identical documents and does not add the same one more than once. The identity check is based on segram.nlp.Doc.id().

See also

segram.nlp.Doc.id

persistent document identifier.

segram.nlp.Doc.coredata

data used to generate the identifier.

Raises:

AttributeError – If a language model is not defined under the attribute self.nlp.
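
A hedged sketch (parsing raw strings requires self.nlp to be set; the example sentence is arbitrary):

>>> corpus.add_doc("Mary met John in the park.")
>>> corpus.add_doc("Mary met John in the park.")  # same Doc.id(); not added again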

add_docs(docs: Iterable[Doc | str], *, progress: bool = False, **kwds: Any) → None[source]

Add documents to the corpus.

**kwds are passed to tqdm.tqdm(), with progress toggling the progress bar (it is passed as disable=not progress).
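
For example (total is a standard tqdm.tqdm keyword forwarded through **kwds):

>>> texts = ["First document.", "Second document."]
>>> corpus.add_docs(texts, progress=True, total=len(texts))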

count_tokens(what: Literal['words', 'lower', 'lemmas']) → None[source]

(Re)count tokens.

what specifies the kind of tokens to count. A recount is done only when necessary, i.e. when what differs from the current count_method.
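
For example:

>>> corpus.count_tokens("lower")  # differs from the current method; triggers a recount
>>> corpus.count_tokens("lower")  # same method again; no recount is performed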

copy() → Self[source]

Make a copy.

The language model object is shared rather than copied; document objects are copied.
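
A sketch illustrating the sharing semantics:

>>> clone = corpus.copy()
>>> clone.nlp is corpus.nlp  # the language model is shared, not copied
True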

get_docbin(attrs: Iterable[str] = ('HEAD', 'TAG', 'POS', 'DEP', 'LEMMA', 'MORPH', 'ENT_IOB', 'ENT_TYPE', 'ENT_KB_ID'), user_data: bool = True) → DocBin[source]

Get documents packed as spacy.tokens.DocBin.

Parameters:
  • attrs – Token attributes to serialize.

  • user_data – Whether user data should be stored. Setting this to True requires clearing the cached grammar objects linked to all tokens, spans and docs to allow serialization. This does not affect the functionality of existing documents, but temporarily degrades performance, as the cache must be reconstructed during further use.
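
A hedged sketch (DocBin.to_bytes() is standard spaCy API):

>>> docbin = corpus.get_docbin(user_data=False)  # skip user data; grammar cache stays intact
>>> blob = docbin.to_bytes()  # serialize with the standard DocBin API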

ensure_cpu_vectors() → None[source]

Ensure that word vectors are stored on CPU.

prefer_gpu_vectors(*args: Any, **kwds: Any) → bool[source]

Put word vectors on GPU if possible.

Arguments are passed to segram.utils.misc.prefer_gpu_vectors().
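
For example:

>>> on_gpu = corpus.prefer_gpu_vectors()  # True if the vectors were moved to GPU
>>> corpus.ensure_cpu_vectors()  # move the vectors back to CPU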

classmethod from_texts(nlp: Language, *texts: str, pipe_kws: dict[str, Any] | None = None, progress: bool = False, tqdm_kws: dict[str, Any] | None = None, **kwds: Any) → Self[source]

Construct from texts.

Parameters:
  • nlp – Language model to use to parse texts.

  • *texts – Texts to parse.

  • pipe_kws – Keyword arguments passed to spacy.language.Language.pipe().

  • progress – Whether to display a progress bar.

  • tqdm_kws – Keyword arguments passed to tqdm.tqdm().

  • **kwds – Passed to __init__(). The vocabulary is taken from the language model.
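
A sketch (batch_size is a standard spacy.language.Language.pipe() keyword):

>>> corpus = Corpus.from_texts(
...     nlp,
...     "Mary met John.",
...     "They talked for hours.",
...     pipe_kws={"batch_size": 50},
...     progress=False,
... )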

to_data(*, vocab: bool = True, nlp: bool = False) → dict[str, Any][source]

Dump to data dictionary.

Parameters:
  • vocab – Whether self.vocab should be included in the dumped data.

  • nlp – Whether self.nlp should be included in the dumped data.
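
For example:

>>> data = corpus.to_data()  # vocabulary included, language model omitted (the defaults)
>>> slim = corpus.to_data(vocab=False)  # smaller dump; the vocabulary must be supplied at load time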

classmethod from_data(data: dict[str, Any], **kwds: Any) → Self[source]

Construct from data dictionary.

**kwds are passed to add_docs().
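
A hedged sketch, assuming data was produced by to_data() with the vocabulary included:

>>> restored = Corpus.from_data(data, progress=False)  # extra kwds go to add_docs()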

to_disk(path: str | bytes | PathLike, **kwds: Any) → None[source]

Save to disk.

**kwds are passed to to_data().

classmethod from_disk(path: str | bytes | PathLike, *, vocab: Vocab | bytes | None = None, nlp: Language | bytes | None = None, **kwds: Any) → Self[source]

Construct from disk.

Parameters:
  • vocab – Arbitrary vocabulary to use when initializing the corpus. Useful when the corpus was saved to disk with vocab=False.

  • nlp – Arbitrary language model to use when initializing the corpus. Useful when the corpus was saved to disk with nlp=False.

  • **kwds – Passed to from_data().
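
A round-trip sketch (the file names are hypothetical):

>>> corpus.to_disk("corpus.bin")  # extra kwds are forwarded to to_data()
>>> restored = Corpus.from_disk("corpus.bin")

>>> corpus.to_disk("slim.bin", vocab=False, nlp=False)  # dump without vocabulary or model
>>> restored = Corpus.from_disk("slim.bin", vocab=nlp.vocab, nlp=nlp)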