segram.nlp.corpus module
- class segram.nlp.corpus.Corpus(vocab: Vocab, nlp: Language | None = None, *, count_method: Literal['words', 'lower', 'lemmas'] = 'lemmas', resolve_coref: bool = True)
Bases: Mapping
Corpus class.
- token_dist
Token distribution.
- count
Count raw words, lowercased words or lemmas.
- resolve_coref
If True, then token coreferences are resolved when calculating token text and lemma frequency distributions.
- add_doc(doc: Doc | str) -> None
Add document to the corpus.
The method recognizes identical documents and does not add the same document more than once. The identity check is based on segram.nlp.Doc.id().
See also
segram.nlp.Doc.id: persistent document identifier.
segram.nlp.Doc.coredata: data used to generate the identifier.
- Raises:
AttributeError – If a language model is not defined under the attribute self.nlp.
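The duplicate check described above can be sketched in plain Python. This is only an illustration, not segram's implementation: `ToyCorpus`, `doc_id`, and the SHA-1 hash of the raw text are assumptions standing in for `Corpus`, `segram.nlp.Doc.id`, and `segram.nlp.Doc.coredata`.

```python
import hashlib


def doc_id(text: str) -> str:
    """Hypothetical stand-in for a persistent document identifier."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class ToyCorpus:
    """Hypothetical corpus that stores each distinct document once."""

    def __init__(self) -> None:
        self._docs = {}  # maps identifier -> document text

    def add_doc(self, text: str) -> None:
        # setdefault keeps the first copy and ignores later duplicates.
        self._docs.setdefault(doc_id(text), text)

    def __len__(self) -> int:
        return len(self._docs)


corpus = ToyCorpus()
for text in ["A cat sat.", "A dog ran.", "A cat sat."]:
    corpus.add_doc(text)
```

Adding the third text leaves the toy corpus unchanged because its identifier is already present.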
- add_docs(docs: Iterable[Doc | str], *, progress: bool = False, **kwds: Any) -> None
Add documents to the corpus.
**kwds are passed to tqdm.tqdm(), with progress used to switch the progress bar on or off (i.e. it is used as disable=not progress).
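The relation between progress and the `disable` option of tqdm can be written out as a small helper. `progress_kwargs` is a hypothetical name used only for this sketch; how segram assembles the arguments internally is not shown in this reference.

```python
from typing import Any, Dict


def progress_kwargs(progress: bool = False, **kwds: Any) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for a tqdm-style bar."""
    # The bar stays disabled unless progress=True is requested.
    return {**kwds, "disable": not progress}


bar_options = progress_kwargs(progress=True, desc="Parsing")
```

With the default progress=False the resulting options contain disable=True, so no bar would be drawn.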
- count_tokens(what: Literal['words', 'lower', 'lemmas']) -> None
(Re)count tokens.
what specifies what kind of tokens should be counted. A recount is done only when necessary, i.e. when the call changes the previous count_method.
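The "recount only when necessary" behavior can be illustrated with a toy counter. Everything here is an assumption made for the sketch: `ToyCounter`, its `recounts` attribute, and the representation of tokens as (word, lemma) pairs are not part of segram.

```python
from collections import Counter


class ToyCounter:
    """Hypothetical token counter that recounts only on a method change."""

    def __init__(self, tokens, count_method="lemmas"):
        self.tokens = tokens  # list of (word, lemma) pairs
        self.count_method = count_method
        self.recounts = 0  # number of full counting passes so far
        self.token_dist = self._count(count_method)

    def _count(self, what):
        self.recounts += 1
        if what == "words":
            return Counter(word for word, _ in self.tokens)
        if what == "lower":
            return Counter(word.lower() for word, _ in self.tokens)
        if what == "lemmas":
            return Counter(lemma for _, lemma in self.tokens)
        raise ValueError(f"unknown count method: {what!r}")

    def count_tokens(self, what):
        # Skip the work when the requested method is already in effect.
        if what != self.count_method:
            self.token_dist = self._count(what)
            self.count_method = what


toy = ToyCounter([("Cats", "cat"), ("cats", "cat"), ("ran", "run")])
toy.count_tokens("lemmas")  # same method as before: no recount
toy.count_tokens("lower")   # method changed: recount
```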
- copy() -> Self
Make a copy.
Language model object is passed but not copied. Document objects are copied.
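The copy semantics stated above (shared language model, copied documents) can be sketched as follows. `ToyModel` and `CopyableCorpus` are hypothetical stand-ins, and the use of `copy.deepcopy` for the documents is an assumption of this sketch rather than a description of segram's code.

```python
import copy


class ToyModel:
    """Hypothetical stand-in for a heavy language model object."""


class CopyableCorpus:
    """Hypothetical corpus holding a model reference and a list of docs."""

    def __init__(self, nlp, docs=None):
        self.nlp = nlp
        self.docs = list(docs or [])

    def copy(self):
        # Share the model by reference; deep-copy the documents.
        return CopyableCorpus(self.nlp, copy.deepcopy(self.docs))


original = CopyableCorpus(ToyModel(), [{"text": "A cat sat."}])
duplicate = original.copy()
duplicate.docs[0]["text"] = "changed"  # does not touch the original
```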
- get_docbin(attrs: Iterable[str] = ('HEAD', 'TAG', 'POS', 'DEP', 'LEMMA', 'MORPH', 'ENT_IOB', 'ENT_TYPE', 'ENT_KB_ID'), user_data: bool = True) -> DocBin
Get documents packed as spacy.tokens.DocBin.
- Parameters:
attrs – Token attributes to serialize.
user_data – Should user data be stored. Setting to True requires clearing the cached grammar objects linked to all tokens, spans and docs to allow for serialization. This does not affect the functionality of existing documents, but temporarily degrades performance because the cache must be rebuilt on further use.
- prefer_gpu_vectors(*args: Any, **kwds: Any) -> bool
Put word vectors on GPU if possible.
Arguments are passed to segram.utils.misc.prefer_gpu_vectors().
- classmethod from_texts(nlp: Language, *texts: str, pipe_kws: dict[str, Any] | None = None, progress: bool = False, tqdm_kws: dict[str, Any] | None = None, **kwds: Any) -> Self
Construct from texts.
- Parameters:
nlp – Language model to use to parse texts.
*texts – Texts to parse.
pipe_kws – Keyword arguments passed to spacy.language.Language.pipe().
**kwds – Passed to __init__(). Vocabulary is taken from the language model.
- to_data(*, vocab: bool = True, nlp: bool = False) -> dict[str, Any]
Dump to data dictionary.
- Parameters:
vocab – Should self.vocab be used.
nlp – Should self.nlp be used.
- classmethod from_data(data: dict[str, Any], **kwds: Any) -> Self
Construct from data dictionary.
**kwds are passed to add_docs().
- to_disk(path: str | bytes | PathLike, **kwds: Any) -> None
Save to disk.
**kwds are passed to to_data().
- classmethod from_disk(path: str | bytes | PathLike, *, vocab: Vocab | bytes | None = None, nlp: Language | bytes | None = None, **kwds: Any) -> Self
Construct from disk.
- Parameters:
vocab, nlp – Use vocab and nlp to pass an arbitrary vocabulary and/or language model for initializing the corpus. Useful when a corpus has been saved to disk with vocab=False and/or nlp=False.
**kwds – Passed to from_data().
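The pattern of saving without a vocabulary and supplying one at load time can be sketched with a toy class. `ToyStore`, the JSON file format, the precedence of an explicitly passed vocabulary, and the `ValueError` raised when none is available are all assumptions of this illustration; the actual on-disk format and error handling of `Corpus` are not described in this reference.

```python
import json
import tempfile
from pathlib import Path


class ToyStore:
    """Hypothetical corpus-like object with optional vocabulary storage."""

    def __init__(self, vocab, docs):
        self.vocab = vocab
        self.docs = docs

    def to_data(self, *, vocab=True):
        data = {"docs": self.docs}
        if vocab:
            data["vocab"] = self.vocab
        return data

    def to_disk(self, path, **kwds):
        # Keyword arguments are forwarded to to_data().
        Path(path).write_text(json.dumps(self.to_data(**kwds)), encoding="utf-8")

    @classmethod
    def from_disk(cls, path, *, vocab=None):
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        # Assumption: an explicitly passed vocabulary wins over a stored one.
        if vocab is None:
            vocab = data.get("vocab")
        if vocab is None:
            raise ValueError("no vocabulary stored; pass vocab=...")
        return cls(vocab, data["docs"])


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus.json"
    ToyStore(["cat", "sat"], ["A cat sat."]).to_disk(path, vocab=False)
    restored = ToyStore.from_disk(path, vocab=["cat", "sat"])
```

Omitting the vocabulary keeps the saved file small when the same vocabulary is already available from a loaded language model.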