segram.nlp.corpus module
- class segram.nlp.corpus.Corpus(vocab: Vocab, nlp: Language | None = None, *, count_method: Literal['words', 'lower', 'lemmas'] = 'lemmas', resolve_coref: bool = True)[source]
Bases: Mapping
Corpus class.
- token_dist
Token distribution.
- count
Whether raw words, lowercased words, or lemmas are counted.
- resolve_coref
If True, token coreferences are resolved when calculating token text and lemma frequency distributions.
- add_doc(doc: Doc | str) None [source]
Add a document to the corpus.
The method recognizes identical documents and does not add the same one more than once. The identity check is based on segram.nlp.Doc.id().

See also

segram.nlp.Doc.id
persistent document identifier.
segram.nlp.Doc.coredata
data used to generate the identifier.
- Raises:
AttributeError – If a language model is not defined under the attribute self.nlp.
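The deduplication behavior of add_doc() can be sketched in plain Python. This is a toy stand-in, not the segram implementation: doc_id here simply hashes the raw text, whereas the real identifier is derived from segram.nlp.Doc.coredata.

```python
import hashlib

class MiniCorpus:
    """Toy corpus illustrating identity-based deduplication."""

    def __init__(self):
        self._docs = {}

    @staticmethod
    def doc_id(text):
        # Stand-in for segram.nlp.Doc.id(): a persistent identifier
        # derived from the document's core data.
        return hashlib.sha1(text.encode("utf8")).hexdigest()

    def add_doc(self, text):
        key = self.doc_id(text)
        if key not in self._docs:  # identical documents are added only once
            self._docs[key] = text

corpus = MiniCorpus()
corpus.add_doc("A cat sat on the mat.")
corpus.add_doc("A cat sat on the mat.")  # duplicate; silently skipped
len(corpus._docs)  # 1
```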
- add_docs(docs: Iterable[Doc | str], *, progress: bool = False, **kwds: Any) None [source]
Add documents to the corpus.
**kwds are passed to tqdm.tqdm(), with progress used to switch the progress bar (i.e. it is used as disable=not progress).
- count_tokens(what: Literal['words', 'lower', 'lemmas']) None [source]
(Re)count tokens.
what specifies which kind of tokens should be counted. A recount is done only when necessary, i.e. when the call changes the previous count_method.
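The recount-only-when-necessary behavior can be sketched with a minimal counter. This is a pure-Python illustration, not segram code; lemmatization is faked with str.lower, since real lemmas require a language model.

```python
class TokenCounter:
    """Sketch: recount only when the requested count method changes."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.count_method = None
        self.token_dist = None
        self.recounts = 0  # bookkeeping, to show when recounts happen

    def count_tokens(self, what):
        if what == self.count_method:
            return  # nothing changed; keep the cached distribution
        self.count_method = what
        self.recounts += 1
        # str.lower stands in for lemmatization, which needs a model.
        key = {"words": str, "lower": str.lower, "lemmas": str.lower}[what]
        dist = {}
        for tok in self.tokens:
            k = key(tok)
            dist[k] = dist.get(k, 0) + 1
        self.token_dist = dist

counter = TokenCounter(["The", "the", "cats"])
counter.count_tokens("lower")   # first count
counter.count_tokens("lower")   # no-op: method unchanged
counter.count_tokens("words")   # method changed: triggers a recount
counter.recounts  # 2
```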
- copy() Self [source]
Make a copy.
The language model object is shared, not copied. Document objects are copied.
- get_docbin(attrs: Iterable[str] = ('HEAD', 'TAG', 'POS', 'DEP', 'LEMMA', 'MORPH', 'ENT_IOB', 'ENT_TYPE', 'ENT_KB_ID'), user_data: bool = True) DocBin [source]
Get documents packed as spacy.tokens.DocBin.
- Parameters:
attrs – Token attributes to serialize.
user_data – Should user data be stored. Setting to True requires clearing the cached grammar objects linked to all tokens, spans and docs to allow for serialization. This does not affect any functionality of existing documents, but temporarily affects performance, as the cache must first be reconstructed during further use.
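Why user_data=True forces a cache purge can be illustrated with plain pickle: an object holding an unpicklable cache must drop it before serialization and rebuild it lazily afterwards. This is a sketch under that analogy, not segram code; the lambda below stands in for the cached grammar objects.

```python
import pickle

class Doc:
    """Sketch: serialization requires dropping an unpicklable cache."""

    def __init__(self, text):
        self.text = text
        self.user_data = {"source": "example"}
        self._cache = None

    @property
    def grammar(self):
        if self._cache is None:
            # A lambda stands in for expensive, unpicklable grammar objects.
            self._cache = lambda: self.text.split()
        return self._cache

    def purge_cache(self):
        self._cache = None  # must be cleared before serialization

doc = Doc("a b c")
doc.grammar()              # populates the cache
doc.purge_cache()          # otherwise pickle.dumps(doc) would raise
data = pickle.dumps(doc)   # user_data survives serialization
restored = pickle.loads(data)
restored.grammar()         # cache is rebuilt lazily on first use
```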
- prefer_gpu_vectors(*args: Any, **kwds: Any) bool [source]
Put word vectors on GPU if possible.
Arguments are passed to segram.utils.misc.prefer_gpu_vectors().
- classmethod from_texts(nlp: Language, *texts: str, pipe_kws: dict[str, Any] | None = None, progress: bool = False, tqdm_kws: dict[str, Any] | None = None, **kwds: Any) Self [source]
Construct from texts.
- Parameters:
nlp – Language model used to parse the texts.
*texts – Texts to parse.
pipe_kws – Keyword arguments passed to spacy.language.Language.pipe().
progress – Whether to display a progress bar.
tqdm_kws – Keyword arguments passed to tqdm.tqdm().
**kwds – Passed to __init__(). The vocabulary is taken from the language model.
- to_data(*, vocab: bool = True, nlp: bool = False) dict[str, Any] [source]
Dump to a data dictionary.
- Parameters:
vocab – Should self.vocab be included.
nlp – Should self.nlp be included.
- classmethod from_data(data: dict[str, Any], **kwds: Any) Self [source]
Construct from a data dictionary.
**kwds are passed to add_docs().
- to_disk(path: str | bytes | PathLike, **kwds: Any) None [source]
Save to disk.
**kwds are passed to to_data().
- classmethod from_disk(path: str | bytes | PathLike, *, vocab: Vocab | bytes | None = None, nlp: Language | bytes | None = None, **kwds: Any) Self [source]
Construct from disk.
- Parameters:
vocab – Arbitrary vocabulary to use when initializing the corpus. Useful when a corpus has been saved to disk with vocab=False.
nlp – Arbitrary language model to use when initializing the corpus. Useful when a corpus has been saved to disk with nlp=False.
**kwds – Passed to from_data().
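The four persistence methods form a simple round trip. A minimal JSON-backed sketch of the same to_data/from_data/to_disk/from_disk pattern follows; MiniCorpus is hypothetical and not the segram implementation, which serializes documents, vocabulary and model state rather than plain text.

```python
import json
import tempfile
from pathlib import Path

class MiniCorpus:
    """Sketch of the to_data/from_data/to_disk/from_disk round trip."""

    def __init__(self, docs=None):
        self.docs = list(docs or [])

    def to_data(self):
        # The real method can also embed vocab/nlp state.
        return {"docs": self.docs}

    @classmethod
    def from_data(cls, data):
        return cls(docs=data["docs"])

    def to_disk(self, path):
        # In the real API, extra **kwds are forwarded to to_data().
        Path(path).write_text(json.dumps(self.to_data()))

    @classmethod
    def from_disk(cls, path):
        # In the real API, extra **kwds are forwarded to from_data().
        return cls.from_data(json.loads(Path(path).read_text()))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus.json"
    MiniCorpus(["A cat sat on the mat."]).to_disk(path)
    corpus = MiniCorpus.from_disk(path)
```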