segram.grammar.abc module
Base classes from which ABCs of concrete grammar classes are derived.
Grammar classes provide building blocks for representing complex syntactical relationships within sentences which go beyond simple syntax tree links and can be used to perform various tasks such as component and phrase detection.
- class segram.grammar.abc.Grammar[source]
Bases:
SegramWithDocABC, Container
Abstract base class for grammar classes.
All grammar classes must be defined as slots classes. This is necessary for ensuring a low memory footprint and better computational efficiency. Even classes with no new slots need to declare __slots__ = (). This requirement is checked during class construction. Other class-specific requirements of this sort, as well as their related validation checks, may be implemented on specialized grammar classes using the standard __init_subclass__ interface. This allows abstract base classes further down the inheritance chain to check for more complex requirements as well as apply dynamic class customizations.
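The slots check described above can be pictured with a minimal sketch built on __init_subclass__. This is an illustration only, not segram's actual implementation; the names SlotsBase, Good, Empty and define_bad are hypothetical.

```python
class SlotsBase:
    """Hypothetical base class that rejects subclasses lacking ``__slots__``."""

    __slots__ = ()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # A class body that omits ``__slots__`` has no such key in its own
        # namespace, and its instances silently gain a ``__dict__``.
        if "__slots__" not in cls.__dict__:
            raise TypeError(f"'{cls.__name__}' must define '__slots__'")


class Good(SlotsBase):
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value


class Empty(Good):
    __slots__ = ()  # no new slots, but the declaration is still required


def define_bad():
    """Try to define a subclass without ``__slots__`` and return the error."""
    try:
        class Bad(SlotsBase):
            pass
    except TypeError as exc:
        return exc
    return None
```

Because the check runs when the class statement is executed, a missing declaration fails at import time rather than at first instantiation.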
- class segram.grammar.abc.GrammarElement[source]
Bases:
Grammar, Sequence
Abstract base class for grammar elements.
All grammar classes must be defined as slots classes. This is necessary for ensuring a low memory footprint and better computational efficiency. Even classes with no new slots need to declare __slots__ = (). This requirement is checked during class construction. Other class-specific requirements of this sort, as well as their related validation checks, may be implemented on specialized grammar classes using the standard __init_subclass__ interface. This allows abstract base classes further down the inheritance chain to check for more complex requirements as well as apply dynamic class customizations.
- abstract property idx: int | tuple[int, ...]
Element index.
- abstract property vector: ndarray[tuple[int], floating]
Word vector.
- property text: str
Raw text of element.
- match(_pattern: str | None = None, _flag: RegexFlag = RegexFlag.NOFLAG, _ignore_missing: bool = False, **kwds: Any | Callable[[Any], bool]) Pattern | None[source]
Match element text against a regex pattern using the re.search() function.
- Parameters:
_pattern – Regular expression pattern used for matching. No matching is done when None.
_flag – Regex flag.
_ignore_missing – Should missing fields on self be ignored.
**kwds – Other keyword arguments can be used for testing values of different attributes on self. If callables are passed as values, they are expected to be predicate functions returning boolean values.
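The combination of a regex test with keyword-based attribute tests can be sketched as a standalone function. This is a hedged illustration, not the library's code: match_element and the token object are hypothetical, the function returns a plain boolean rather than the documented return type, and treating a missing attribute as a failed match (unless ignored) is an assumption.

```python
import re
from types import SimpleNamespace


def match_element(obj, pattern=None, flag=0, ignore_missing=False, **kwds):
    """Hypothetical stand-in for a ``match``-style method.

    Returns False as soon as any test fails and True otherwise.
    """
    # Regex test on the raw text; skipped entirely when no pattern is given.
    if pattern is not None and re.search(pattern, obj.text, flag) is None:
        return False
    for name, expected in kwds.items():
        if not hasattr(obj, name):
            if ignore_missing:
                continue
            return False  # assumption: a missing field counts as a mismatch
        value = getattr(obj, name)
        # Callables act as predicates; anything else is compared by equality.
        ok = expected(value) if callable(expected) else value == expected
        if not ok:
            return False
    return True


# Toy object standing in for a grammar element.
token = SimpleNamespace(text="running", pos="VERB", length=7)
```

For example, match_element(token, r"^run", pos="VERB", length=lambda n: n > 5) combines a pattern, an equality test and a predicate in one call.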
- similarity(other: Self) float[source]
Cosine similarity between word vectors.
Warning
Currently only doc-doc comparisons based on average token vectors are implemented.
All methods defined here are designed to ensure that:
- Similarity of a phrase with respect to itself is 1.
- Similarity is symmetric: x ~ y == y ~ x.
In some cases the above may hold only approximately due to accumulated floating-point imprecision.
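Both properties follow directly from the definition of cosine similarity. A minimal pure-Python sketch (the function name, the sample vectors, and the choice to return 0.0 for zero vectors are assumptions made for illustration):

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: similarity with a zero vector is defined as 0
    return dot / (norm_x * norm_y)


u = [0.2, 0.7, 0.1]
v = [0.5, 0.1, 0.9]

# Compare with a tolerance rather than ``==`` because of rounding error.
self_similarity_is_one = math.isclose(cosine_similarity(u, u), 1.0)
is_symmetric = math.isclose(cosine_similarity(u, v), cosine_similarity(v, u))
```

Checking with math.isclose rather than exact equality reflects the caveat about floating-point imprecision.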
- Parameters:
element – Grammar phrase to compare.
spec – Specification against which the phrase is to be compared. Can be another phrase, a string, or an iterable of strings, which should be single words. A single string is split at whitespace and turned into multiple words. Finally, an averaged word vector for all words is computed. Alternatively, a specification can take the form of a dictionary mapping names of phrase parts or components (see segram.grammar.phrases.Phrase.part_names and segram.grammar.phrase.Phrase.component_names) to either strings or iterables of strings convertible to word vectors (as previously) or other phrases. Importantly, phrases can also be compared against segram.grammar.Sent and segram.grammar.Doc objects as long as they consist of a single sentence. See SentSimilarity for details.
method – Method for calculating similarity between phrases:
components
Components are grouped in buckets by type (verbs, nouns, prepositions and descriptions) and averaged vectors are compared between the same types. Finally, a weighted average (with weights defined by the weights parameter) is taken and rescaled with a factor shared / union, where shared is the number of types present in both elements and union is the total number of unique types among both of them. Thus, the final result is akin to a fuzzy Jaccard similarity:
\[J = \frac{|A \cap B|}{|A \cup B|}\]
phrases
As above but based on phrase parts and phrase head components. See segram.grammar.Phrase.part_names for a full list.
both
As above but components and phrases are used together.
average
Simple average vectors calculated over all component head tokens are used. In this case weights are ignored.
recursive
NOTE. Currently not implemented. First, head components are compared between two phrases, and then the same rule is applied recursively to all parts (subjects, direct objects etc.), where for each type elements of the two phrases are matched in pairs to maximize similarity. As previously, weights can be applied to different types and a Jaccard-like rescaling is applied. Additionally, the importance of nested phrases may be discounted using the decay_rate parameter by rescaling each weight with a factor of decay_rate**depth, where depth is calculated relative to the depth of the self.phrase.
weights – Dictionary mapping phrase part or component names to arbitrary weights (which must be positive). The weights do not have to be normalized to sum to one.
decay_rate – Additional parameter used when method="recursive", which controls the rate at which contributions coming from nested subphrases are discounted.
only – Lists of part or component names to selectively use or ignore. Both arguments cannot be used at the same time.
ignore – Lists of part or component names to selectively use or ignore. Both arguments cannot be used at the same time.
- Raises:
RuntimeError –
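The components method described under the method parameter can be sketched roughly as follows. This is not the library's implementation: the function names, the plain-dict input format (type name mapped to a list of vectors), the default weight of 1.0, and averaging over shared types only are all assumptions made for illustration.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def mean_vector(vectors):
    """Element-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def bucket_similarity(a, b, weights=None):
    """Weighted per-type similarity rescaled by ``shared / union``."""
    weights = weights or {}
    # Ignore types that have no vectors at all.
    a = {k: v for k, v in a.items() if v}
    b = {k: v for k, v in b.items() if v}
    shared = a.keys() & b.keys()
    union = a.keys() | b.keys()
    if not shared:
        return 0.0
    total_weight = sum(weights.get(k, 1.0) for k in shared)
    score = sum(
        weights.get(k, 1.0) * cosine(mean_vector(a[k]), mean_vector(b[k]))
        for k in shared
    ) / total_weight
    # Jaccard-like rescaling: penalize types present in only one element.
    return score * len(shared) / len(union)


x = {"verb": [[1.0, 0.0]], "noun": [[0.0, 1.0], [0.0, 3.0]]}
y = {"verb": [[2.0, 0.0]], "prep": [[1.0, 1.0]]}
```

Here x and y share only the verb bucket out of three distinct types, so even a perfect verb match is scaled down by 1/3.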
- class segram.grammar.abc.DocElement(doc: Doc)[source]
Bases:
GrammarElement, Sequence
Document element class.
All grammar classes must be defined as slots classes. This is necessary for ensuring a low memory footprint and better computational efficiency. Even classes with no new slots need to declare __slots__ = (). This requirement is checked during class construction. Other class-specific requirements of this sort, as well as their related validation checks, may be implemented on specialized grammar classes using the standard __init_subclass__ interface. This allows abstract base classes further down the inheritance chain to check for more complex requirements as well as apply dynamic class customizations.
- property idx: int
Fast document id.
It is stable for an instance and allows for hashing, but it is not stable across different objects with the same data, e.g. an element initialized from the same data twice may have different .idx values each time.
- property id: int
Slow persistent document id.
It is always the same for documents based on exactly the same data.
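The difference between a fast, instance-bound identifier and a slow, content-based one can be illustrated with a toy class. This sketch does not show how segram computes either value; FakeDoc, the use of the built-in id() and the SHA-256 digest are all assumptions chosen to mirror the described behavior.

```python
import hashlib


class FakeDoc:
    """Hypothetical document wrapper with two kinds of identifiers."""

    __slots__ = ("text",)

    def __init__(self, text):
        self.text = text

    @property
    def idx(self):
        # Fast: tied to this object, so two objects that are alive at the
        # same time get different values even when their data is identical.
        return id(self)

    @property
    def id(self):
        # Slow: derived from the content, so equal data gives equal ids.
        digest = hashlib.sha256(self.text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")


first = FakeDoc("The cat sat on the mat.")
second = FakeDoc("The cat sat on the mat.")
```

With this setup first.idx differs from second.idx, while first.id equals second.id.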
- property vector: ndarray[tuple[int], floating]
Word vector.
- class segram.grammar.abc.SentElement(sent: Span)[source]
Bases:
GrammarElement
Grammar element based on a sentence span.
All grammar classes must be defined as slots classes. This is necessary for ensuring a low memory footprint and better computational efficiency. Even classes with no new slots need to declare __slots__ = (). This requirement is checked during class construction. Other class-specific requirements of this sort, as well as their related validation checks, may be implemented on specialized grammar classes using the standard __init_subclass__ interface. This allows abstract base classes further down the inheritance chain to check for more complex requirements as well as apply dynamic class customizations.
- property idx: tuple[int, int]
Sentence index equal to (self.start, self.end), allowing for identification/hashing and sorting within the parent document.
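A (start, end) tuple works as both a sort key and a dictionary key because tuples compare lexicographically and are hashable. A small sketch with a hypothetical SentSpan stand-in (not a segram or spaCy class):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentSpan:
    """Hypothetical stand-in for a sentence span with token offsets."""

    start: int
    end: int

    @property
    def idx(self):
        return (self.start, self.end)


spans = [SentSpan(12, 20), SentSpan(0, 5), SentSpan(5, 12)]

# Tuples sort by start first, then by end: document order.
ordered = sorted(spans, key=lambda s: s.idx)

# Tuples are hashable, so they can index a lookup table.
by_idx = {s.idx: s for s in spans}
```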
- property vector: ndarray[tuple[int], floating]
Word vector.
- property is_correct: bool
Indicates whether the sentence has been parsed correctly and has a well-defined root token.
- class segram.grammar.abc.TokenElement(tok: Token)[source]
Bases:
GrammarElement
Grammar element based on a token.
All grammar classes must be defined as slots classes. This is necessary for ensuring a low memory footprint and better computational efficiency. Even classes with no new slots need to declare __slots__ = (). This requirement is checked during class construction. Other class-specific requirements of this sort, as well as their related validation checks, may be implemented on specialized grammar classes using the standard __init_subclass__ interface. This allows abstract base classes further down the inheritance chain to check for more complex requirements as well as apply dynamic class customizations.
- property idx: int
Token index within the parent document.
- property vector: ndarray[tuple[int], floating]
Word vector.