segram.semantic.similarity module
- class segram.semantic.similarity.GrammarSimilarity(element: GrammarElement, spec: Any)[source]
Bases: ABC
Abstract base class for structured similarity scorers.
- abstract get_similarity(element: GrammarElement, spec: Any) → float[source]
Get structured similarity between self.element and self.spec.
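The abstract-scorer pattern described above can be sketched as follows. This is a minimal illustration that does not import segram: SimilarityBase and TokenOverlapSimilarity are hypothetical names, and the token-overlap scoring is a stand-in chosen only to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from typing import Any


class SimilarityBase(ABC):
    """Hypothetical stand-in mirroring the (element, spec) constructor."""

    def __init__(self, element: Any, spec: Any) -> None:
        self.element = element
        self.spec = spec

    @abstractmethod
    def get_similarity(self, element: Any, spec: Any) -> float:
        """Return a similarity score between element and spec."""


class TokenOverlapSimilarity(SimilarityBase):
    """Toy scorer (an assumption): Jaccard overlap of whitespace tokens."""

    def get_similarity(self, element: str, spec: str) -> float:
        a = set(element.lower().split())
        b = set(spec.lower().split())
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)


scorer = TokenOverlapSimilarity("the cat sleeps", "the dog sleeps")
score = scorer.get_similarity(scorer.element, scorer.spec)
```

Because get_similarity is abstract, instantiating SimilarityBase directly raises TypeError; only concrete subclasses can be created.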
- class segram.semantic.similarity.PhraseSimilarity(element: Phrase, spec: Phrase | str | Iterable[str] | dict[str, str | Iterable[str] | Phrase | Sent | Doc], method: Literal['components', 'phrases', 'recursive', 'average'] = 'components', *, weights: dict[str, float | int] | None = None, decay_rate: float = 1, only: str | Iterable[str] = (), ignore: str | Iterable[str] = ())[source]
Bases: GrammarSimilarity
Structured similarity between phrases and sentences.
All methods defined here are designed to ensure that:
Similarity of a phrase with respect to itself is 1.
Similarity is symmetric: x ~ y == y ~ x.
In some cases the above may hold only approximately due to accumulated floating-point imprecision.
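The two properties above can be checked on plain vectors. The sketch below assumes cosine similarity as the underlying vector comparison, which the text does not state explicitly, and uses tolerance-based checks because of the floating-point caveat.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


x = [0.2, 0.7, 0.1]
y = [0.4, 0.1, 0.9]

# Identity: a vector compared with itself scores 1 (up to rounding).
assert math.isclose(cosine(x, x), 1.0)
# Symmetry: argument order does not matter (up to rounding).
assert math.isclose(cosine(x, y), cosine(y, x))
```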
- element
Grammar phrase to compare.
- spec
Specification against which the phrase is to be compared. Can be another phrase, a string, or an iterable of strings, each of which should be a single word. A single string is split on whitespace into multiple words, and an averaged word vector is then computed over all words. Alternatively, a specification can take the form of a dictionary mapping names of phrase parts or components (see segram.grammar.phrases.Phrase.part_names and segram.grammar.phrase.Phrase.component_names) to strings or iterables of strings convertible to word vectors (as above), or to other phrases. Importantly, phrases can also be compared against segram.grammar.Sent and segram.grammar.Doc objects as long as they consist of a single sentence. See SentSimilarity for details.
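The handling of string specifications can be sketched as below. TOY_VECTORS and spec_to_vector are hypothetical names; in practice the vectors would come from a language model, and how out-of-vocabulary words are treated is not covered here.

```python
from typing import Iterable, List, Union

# Hypothetical toy embeddings standing in for real word vectors.
TOY_VECTORS = {
    "big": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "cat": [0.0, 3.0],
}


def spec_to_vector(spec: Union[str, Iterable[str]]) -> List[float]:
    """Split a string on whitespace, then average the word vectors."""
    words = spec.split() if isinstance(spec, str) else list(spec)
    if not words:
        raise ValueError("specification contains no words")
    vectors = [TOY_VECTORS[word] for word in words]
    # Average each dimension across all word vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


from_string = spec_to_vector("big dog")
from_list = spec_to_vector(["big", "cat"])
```

A single string and an iterable of single words go through the same averaging step; only the splitting differs.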
- method
Method for calculating similarity between phrases:
components: Components are grouped in buckets by type (verbs, nouns, prepositions and descriptions) and averaged vectors are compared between the same types. Finally, a weighted average (with weights defined by the weights parameter) is taken and rescaled by a factor shared / union, where shared is the number of types present in both elements and union is the total number of unique types among both of them. Thus, the final result is akin to a fuzzy Jaccard similarity:
\[J = \frac{|A \cap B|}{|A \cup B|}\]
phrases: As above, but based on phrase parts and phrase head components. See segram.grammar.Phrase.part_names for a full list.
both: As above, but components and phrases are used together.
average: Simple average vectors calculated over all component head tokens are used. In this case weights are ignored.
recursive: NOTE. Currently not implemented. First, head components are compared between two phrases, and then the same rule is applied recursively to all parts (subjects, direct objects etc.), where for each type the elements of the two phrases are matched in pairs to maximize similarity. As previously, weights can be applied to different types and a Jaccard-like rescaling is applied. Additionally, the importance of nested phrases may be discounted using the decay_rate parameter by rescaling each weight by a factor of decay_rate**depth, where depth is calculated relative to the depth of self.phrase.
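The components method can be sketched roughly as follows. This is an illustration under assumptions, not the library's implementation: cosine similarity is assumed for the per-type comparison, the weighted average is assumed to run over shared types only, and component_similarity is a hypothetical helper operating on plain dictionaries of already averaged vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def component_similarity(a, b, weights=None):
    """Weighted per-type similarity rescaled by shared / union.

    a, b: dicts mapping a type name to its averaged vector.
    weights: optional dict of positive weights; missing types default to 1.
    """
    shared = sorted(set(a) & set(b))
    union = set(a) | set(b)
    if not shared:
        return 0.0
    weights = weights or {}
    total = sum(weights.get(name, 1.0) for name in shared)
    score = sum(
        weights.get(name, 1.0) * cosine(a[name], b[name]) for name in shared
    ) / total
    # Jaccard-like rescaling: penalize types present in only one element.
    return score * len(shared) / len(union)


p = {"verb": [1.0, 0.0], "noun": [0.0, 1.0]}
q = {"verb": [1.0, 0.0], "noun": [0.0, 1.0], "prep": [1.0, 1.0]}

same = component_similarity(p, p)
partial = component_similarity(p, q)
```

Here p and q agree perfectly on the two shared types, so the only penalty for partial comes from the factor shared / union = 2 / 3.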
- weights
Dictionary mapping phrase part or component names to arbitrary weights (which must be positive). The weights do not have to be normalized to sum to one.
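Weights need no normalization because a weighted average divides by the sum of the weights, so rescaling all weights by a common factor leaves the result unchanged. The sketch below illustrates this with a hypothetical weighted_average helper and made-up per-type scores.

```python
import math


def weighted_average(scores, weights):
    """Weighted average of per-type scores; weights must be positive."""
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total


scores = {"verb": 0.8, "noun": 0.4}  # made-up per-type similarities
raw = weighted_average(scores, {"verb": 3, "noun": 1})
normalized = weighted_average(scores, {"verb": 0.75, "noun": 0.25})
```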
- decay_rate
Additional parameter used when method="recursive", which controls the rate at which contributions from nested subphrases are discounted.
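The discount itself is simple to illustrate. Since the recursive method is described as not yet implemented, the sketch below shows only the decay_rate**depth rescaling of weights; discounted_weights is a hypothetical helper, and the part names and values are made up.

```python
def discounted_weights(weights, depth, decay_rate=1.0):
    """Rescale every weight by decay_rate**depth."""
    factor = decay_rate ** depth
    return {name: value * factor for name, value in weights.items()}


base = {"subj": 2.0, "dobj": 1.0}  # made-up part names and weights

# With decay_rate=0.5, each extra level of nesting halves the weights.
by_depth = [discounted_weights(base, depth, decay_rate=0.5) for depth in range(3)]
```

With the default decay_rate of 1 the factor is always 1, so nested phrases contribute with undiminished weight.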
- only, ignore
Lists of part or component names to selectively use or ignore. The two arguments cannot be used at the same time.
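The mutually exclusive filtering can be sketched as below. select_names is a hypothetical helper, the names are made up, and raising ValueError when both arguments are given is an assumption; the text does not say which exception the library uses.

```python
def select_names(names, only=(), ignore=()):
    """Filter names with either an allow-list or a deny-list, never both."""
    only = [only] if isinstance(only, str) else list(only)
    ignore = [ignore] if isinstance(ignore, str) else list(ignore)
    if only and ignore:
        # Assumed error type; the real library may behave differently.
        raise ValueError("'only' and 'ignore' cannot be used at the same time")
    if only:
        return [name for name in names if name in only]
    return [name for name in names if name not in ignore]


names = ["verb", "noun", "prep", "desc"]  # made-up component names
kept = select_names(names, only=["verb", "noun"])
dropped = select_names(names, ignore="prep")
```

Accepting a bare string as well as an iterable mirrors the str | Iterable[str] annotation in the signature.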
- Raises:
RuntimeError – If word vectors are not available.
- class segram.semantic.similarity.SentSimilarity(element: Phrase, spec: Phrase | str | Iterable[str] | dict[str, str | Iterable[str] | Phrase | Sent | Doc], method: Literal['components', 'phrases', 'recursive', 'average'] = 'components', *, weights: dict[str, float | int] | None = None, decay_rate: float = 1, only: str | Iterable[str] = (), ignore: str | Iterable[str] = ())[source]
Bases: PhraseSimilarity
Structured similarity between sentences and phrases.
All methods defined here are designed to ensure that:
Similarity of a phrase with respect to itself is 1.
Similarity is symmetric: x ~ y == y ~ x.
In some cases the above may hold only approximately due to accumulated floating-point imprecision.
- element
Grammar phrase to compare.
- spec
Specification against which the phrase is to be compared. Can be another phrase, a string, or an iterable of strings, each of which should be a single word. A single string is split on whitespace into multiple words, and an averaged word vector is then computed over all words. Alternatively, a specification can take the form of a dictionary mapping names of phrase parts or components (see segram.grammar.phrases.Phrase.part_names and segram.grammar.phrase.Phrase.component_names) to strings or iterables of strings convertible to word vectors (as above), or to other phrases. Importantly, phrases can also be compared against segram.grammar.Sent and segram.grammar.Doc objects as long as they consist of a single sentence. See SentSimilarity for details.
- method
Method for calculating similarity between phrases:
components: Components are grouped in buckets by type (verbs, nouns, prepositions and descriptions) and averaged vectors are compared between the same types. Finally, a weighted average (with weights defined by the weights parameter) is taken and rescaled by a factor shared / union, where shared is the number of types present in both elements and union is the total number of unique types among both of them. Thus, the final result is akin to a fuzzy Jaccard similarity:
\[J = \frac{|A \cap B|}{|A \cup B|}\]
phrases: As above, but based on phrase parts and phrase head components. See segram.grammar.Phrase.part_names for a full list.
both: As above, but components and phrases are used together.
average: Simple average vectors calculated over all component head tokens are used. In this case weights are ignored.
recursive: NOTE. Currently not implemented. First, head components are compared between two phrases, and then the same rule is applied recursively to all parts (subjects, direct objects etc.), where for each type the elements of the two phrases are matched in pairs to maximize similarity. As previously, weights can be applied to different types and a Jaccard-like rescaling is applied. Additionally, the importance of nested phrases may be discounted using the decay_rate parameter by rescaling each weight by a factor of decay_rate**depth, where depth is calculated relative to the depth of self.phrase.
- weights
Dictionary mapping phrase part or component names to arbitrary weights (which must be positive). The weights do not have to be normalized to sum to one.
- decay_rate
Additional parameter used when method="recursive", which controls the rate at which contributions from nested subphrases are discounted.
- only, ignore
Lists of part or component names to selectively use or ignore. The two arguments cannot be used at the same time.
- Raises:
RuntimeError – If word vectors are not available.
- class segram.semantic.similarity.DocSimilarity(element: GrammarElement, spec: Any)[source]
Bases: GrammarSimilarity
Structured similarity between documents.
Warning
Currently only doc-doc comparisons based on average token vectors are implemented.
All methods defined here are designed to ensure that:
Similarity of a phrase with respect to itself is 1.
Similarity is symmetric: x ~ y == y ~ x.
In some cases the above may hold only approximately due to accumulated floating-point imprecision.
- element
Grammar phrase to compare.
- spec
Specification against which the phrase is to be compared. Can be another phrase, a string, or an iterable of strings, each of which should be a single word. A single string is split on whitespace into multiple words, and an averaged word vector is then computed over all words. Alternatively, a specification can take the form of a dictionary mapping names of phrase parts or components (see segram.grammar.phrases.Phrase.part_names and segram.grammar.phrase.Phrase.component_names) to strings or iterables of strings convertible to word vectors (as above), or to other phrases. Importantly, phrases can also be compared against segram.grammar.Sent and segram.grammar.Doc objects as long as they consist of a single sentence. See SentSimilarity for details.
- method
Method for calculating similarity between phrases:
components: Components are grouped in buckets by type (verbs, nouns, prepositions and descriptions) and averaged vectors are compared between the same types. Finally, a weighted average (with weights defined by the weights parameter) is taken and rescaled by a factor shared / union, where shared is the number of types present in both elements and union is the total number of unique types among both of them. Thus, the final result is akin to a fuzzy Jaccard similarity:
\[J = \frac{|A \cap B|}{|A \cup B|}\]
phrases: As above, but based on phrase parts and phrase head components. See segram.grammar.Phrase.part_names for a full list.
both: As above, but components and phrases are used together.
average: Simple average vectors calculated over all component head tokens are used. In this case weights are ignored.
recursive: NOTE. Currently not implemented. First, head components are compared between two phrases, and then the same rule is applied recursively to all parts (subjects, direct objects etc.), where for each type the elements of the two phrases are matched in pairs to maximize similarity. As previously, weights can be applied to different types and a Jaccard-like rescaling is applied. Additionally, the importance of nested phrases may be discounted using the decay_rate parameter by rescaling each weight by a factor of decay_rate**depth, where depth is calculated relative to the depth of self.phrase.
- weights
Dictionary mapping phrase part or component names to arbitrary weights (which must be positive). The weights do not have to be normalized to sum to one.
- decay_rate
Additional parameter used when method="recursive", which controls the rate at which contributions from nested subphrases are discounted.
- only, ignore
Lists of part or component names to selectively use or ignore. The two arguments cannot be used at the same time.
- Raises:
RuntimeError – If word vectors are not available.