segram.semantic.similarity module

class segram.semantic.similarity.GrammarSimilarity(element: GrammarElement, spec: Any)

Bases: ABC

Abstract base class for structured similarity scorers.

abstract get_similarity(element: GrammarElement, spec: Any) -> float

Get structured similarity between self.element and self.spec.

class segram.semantic.similarity.PhraseSimilarity(element: Phrase, spec: Phrase | str | Iterable[str] | dict[str, str | Iterable[str] | Phrase | Sent | Doc], method: Literal['components', 'phrases', 'recursive', 'average'] = 'components', *, weights: dict[str, float | int] | None = None, decay_rate: float = 1, only: str | Iterable[str] = (), ignore: str | Iterable[str] = ())

Bases: GrammarSimilarity

Structured similarity between phrases and sentences.

All methods defined here are designed to ensure that:

  • Similarity of a phrase with respect to itself is 1.

  • Similarity x ~ y == y ~ x.

In some cases the above may hold only approximately due to accumulated floating-point error.

element

Grammar phrase to compare.

spec

Specification against which the phrase is to be compared. Can be another phrase, a string, or an iterable of strings, each of which should be a single word. A single string is split at whitespace into multiple words. In both cases, an averaged word vector is then computed over all words. Alternatively, a specification can take the form of a dictionary mapping names of phrase parts or components (see segram.grammar.phrases.Phrase.part_names and segram.grammar.phrase.Phrase.component_names) to strings or iterables of strings convertible to word vectors (as above), or to other phrases. Importantly, phrases can also be compared against segram.grammar.Sent and segram.grammar.Doc objects as long as they consist of a single sentence. See SentSimilarity for details.
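The string handling described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper names spec_to_words and mean_vector and the TOY_VECTORS lookup table are hypothetical, and real word vectors would come from the underlying language model.

```python
from collections.abc import Iterable
from typing import Union

# Hypothetical toy embeddings, used only for illustration.
TOY_VECTORS = {
    "big": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "barks": [1.0, 1.0],
}


def spec_to_words(spec: Union[str, Iterable[str]]) -> list[str]:
    """Turn a string or an iterable of single-word strings into a word list."""
    if isinstance(spec, str):
        return spec.split()  # a single string is split at whitespace
    return list(spec)  # an iterable is assumed to hold single words already


def mean_vector(words: list[str]) -> list[float]:
    """Average the toy vectors of all words, dimension by dimension."""
    vectors = [TOY_VECTORS[word] for word in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


words = spec_to_words("big dog")
vector = mean_vector(words)
```

Under these assumptions, the string "big dog" and the list ["big", "dog"] produce the same averaged vector.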

method

Method for calculating similarity between phrases:

components

Components are grouped into buckets by type (verbs, nouns, prepositions and descriptions), and averaged vectors are compared between buckets of the same type. Finally, a weighted average (with weights defined by the weights parameter) is taken and rescaled by a factor of shared / union, where shared is the number of types present in both elements and union is the total number of unique types across both of them. Thus, the final result is akin to a fuzzy Jaccard similarity:

\[J = \frac{|A \cap B|}{|A \cup B|}\]

phrases

As above but based on phrase parts and phrase head components. See segram.grammar.Phrase.part_names for a full list.
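The bucketed scoring shared by the components and phrases methods can be sketched as follows. This is one plausible reading of the description rather than the library's code: the function names are hypothetical, each element is represented as a plain dictionary from a type name to an already averaged vector, missing weights default to 1, and weights are normalized over the shared types only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def bucketed_similarity(x, y, weights=None):
    """Weighted average of per-type cosines, rescaled by shared / union."""
    shared = x.keys() & y.keys()  # types present in both elements
    union = x.keys() | y.keys()   # unique types among both elements
    if not shared:
        return 0.0
    weights = weights or {}
    w = {name: weights.get(name, 1.0) for name in shared}
    average = sum(w[n] * cosine(x[n], y[n]) for n in shared) / sum(w.values())
    return average * len(shared) / len(union)


x = {"verb": [1.0, 0.0], "noun": [0.0, 1.0]}
y = {"verb": [1.0, 0.0], "prep": [1.0, 1.0]}
score = bucketed_similarity(x, y)
```

In this toy case only the verb bucket is shared out of three unique types, so a perfect match on verbs is scaled down to one third. Comparing x with itself gives 1, and swapping the arguments does not change the result.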

both

As above but components and phrases are used together.

average

Simple average vectors calculated over all component head tokens are used. In this case weights are ignored.

recursive

NOTE. Currently not implemented. First, head components are compared between the two phrases, and then the same rule is applied recursively to all parts (subjects, direct objects, etc.), where for each type the elements of the two phrases are matched in pairs to maximize similarity. As previously, weights can be applied to different types and a Jaccard-like rescaling is applied. Additionally, the importance of nested phrases may be discounted using the decay_rate parameter, which rescales each weight by a factor of decay_rate**depth, where depth is calculated relative to the depth of self.phrase.
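Since the recursive method is not implemented, the following only illustrates the discounting rule stated above, weight * decay_rate**depth. The function name discounted_weight is hypothetical.

```python
def discounted_weight(weight: float, depth: int, decay_rate: float = 1.0) -> float:
    """Rescale a weight by decay_rate**depth (depth 0 is the top-level phrase)."""
    return weight * decay_rate ** depth


# With decay_rate=0.5 each nesting level halves the contribution.
halved = [discounted_weight(1.0, depth, decay_rate=0.5) for depth in range(4)]

# With the default decay_rate=1 nested phrases are not discounted at all.
flat = [discounted_weight(1.0, depth) for depth in range(4)]
```

A decay_rate below 1 therefore makes deeper subphrases matter less, while the default of 1 treats all depths equally.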

weights

Dictionary mapping phrase part or component names to arbitrary weights (which must be positive). The weights do not have to be normalized to sum to one.

decay_rate

Additional parameter used when method="recursive", which controls the rate at which contributions coming from nested subphrases are discounted.

only, ignore

Lists of part or component names to selectively use or ignore. The two arguments cannot be used at the same time.
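The mutually exclusive behaviour of only and ignore could look roughly like the sketch below. The helper select_names, the example name list and the choice of ValueError are assumptions made for illustration; the exception actually raised by the library is not stated here.

```python
def select_names(names, only=(), ignore=()):
    """Filter part or component names using either 'only' or 'ignore'."""
    # Both arguments accept a single string or an iterable of strings.
    only = (only,) if isinstance(only, str) else tuple(only)
    ignore = (ignore,) if isinstance(ignore, str) else tuple(ignore)
    if only and ignore:
        # Assumed error type; the two arguments cannot be combined.
        raise ValueError("'only' and 'ignore' cannot be used at the same time")
    if only:
        return [name for name in names if name in only]
    return [name for name in names if name not in ignore]


# Illustrative names only.
NAMES = ["verb", "noun", "prep", "desc"]
kept = select_names(NAMES, only="verb")
rest = select_names(NAMES, ignore=["prep", "desc"])
```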

Raises:

RuntimeError – If word vectors are not available.

get_similarity(element: Phrase, spec: dict[str, str | Iterable[str] | Phrase | Sent | Doc]) -> float

Structured similarity between self.phrase and self.spec.

class segram.semantic.similarity.SentSimilarity(element: Phrase, spec: Phrase | str | Iterable[str] | dict[str, str | Iterable[str] | Phrase | Sent | Doc], method: Literal['components', 'phrases', 'recursive', 'average'] = 'components', *, weights: dict[str, float | int] | None = None, decay_rate: float = 1, only: str | Iterable[str] = (), ignore: str | Iterable[str] = ())

Bases: PhraseSimilarity

Structured similarity between sentences and phrases.

All methods defined here are designed to ensure that:

  • Similarity of a phrase with respect to itself is 1.

  • Similarity x ~ y == y ~ x.

In some cases the above may hold only approximately due to accumulated floating-point error.

element

Grammar phrase to compare.

spec

Specification against which the phrase is to be compared. Can be another phrase, a string, or an iterable of strings, each of which should be a single word. A single string is split at whitespace into multiple words. In both cases, an averaged word vector is then computed over all words. Alternatively, a specification can take the form of a dictionary mapping names of phrase parts or components (see segram.grammar.phrases.Phrase.part_names and segram.grammar.phrase.Phrase.component_names) to strings or iterables of strings convertible to word vectors (as above), or to other phrases. Importantly, phrases can also be compared against segram.grammar.Sent and segram.grammar.Doc objects as long as they consist of a single sentence. See SentSimilarity for details.

method

Method for calculating similarity between phrases:

components

Components are grouped into buckets by type (verbs, nouns, prepositions and descriptions), and averaged vectors are compared between buckets of the same type. Finally, a weighted average (with weights defined by the weights parameter) is taken and rescaled by a factor of shared / union, where shared is the number of types present in both elements and union is the total number of unique types across both of them. Thus, the final result is akin to a fuzzy Jaccard similarity:

\[J = \frac{|A \cap B|}{|A \cup B|}\]

phrases

As above but based on phrase parts and phrase head components. See segram.grammar.Phrase.part_names for a full list.

both

As above but components and phrases are used together.

average

Simple average vectors calculated over all component head tokens are used. In this case weights are ignored.

recursive

NOTE. Currently not implemented. First, head components are compared between the two phrases, and then the same rule is applied recursively to all parts (subjects, direct objects, etc.), where for each type the elements of the two phrases are matched in pairs to maximize similarity. As previously, weights can be applied to different types and a Jaccard-like rescaling is applied. Additionally, the importance of nested phrases may be discounted using the decay_rate parameter, which rescales each weight by a factor of decay_rate**depth, where depth is calculated relative to the depth of self.phrase.

weights

Dictionary mapping phrase part or component names to arbitrary weights (which must be positive). The weights do not have to be normalized to sum to one.

decay_rate

Additional parameter used when method="recursive", which controls the rate at which contributions coming from nested subphrases are discounted.

only, ignore

Lists of part or component names to selectively use or ignore. The two arguments cannot be used at the same time.

Raises:

RuntimeError – If word vectors are not available.

get_similarity(element: Sent, spec: dict[str, str | Iterable[str] | Phrase | Sent | Doc]) -> float

Structured similarity between self.phrase and self.spec.

class segram.semantic.similarity.DocSimilarity(element: GrammarElement, spec: Any)

Bases: GrammarSimilarity

Structured similarity between documents.

Warning

Currently only doc-doc comparisons based on average token vectors are implemented.
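A comparison based on average token vectors can be pictured with the sketch below. It assumes cosine similarity between the two mean vectors, which is a common choice but is not confirmed here as the library's exact formula; the function names and the toy documents (lists of token vectors) are hypothetical.

```python
import math


def mean_token_vector(doc):
    """Average a list of equal-length token vectors, dimension by dimension."""
    return [sum(dim) / len(doc) for dim in zip(*doc)]


def average_vector_similarity(doc_a, doc_b):
    """Cosine similarity between the mean token vectors of two documents."""
    u, v = mean_token_vector(doc_a), mean_token_vector(doc_b)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Each toy document is a list of 2-dimensional token vectors.
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 1.0], [3.0, 1.0]]
similarity = average_vector_similarity(doc_a, doc_b)
```

Such a score is symmetric in its arguments and equals 1 (up to floating-point error) when a document is compared with itself, in line with the two guarantees listed for this module.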

All methods defined here are designed to ensure that:

  • Similarity of a phrase with respect to itself is 1.

  • Similarity x ~ y == y ~ x.

In some cases the above may hold only approximately due to accumulated floating-point error.

element

Grammar phrase to compare.

spec

Specification against which the phrase is to be compared. Can be another phrase, a string, or an iterable of strings, each of which should be a single word. A single string is split at whitespace into multiple words. In both cases, an averaged word vector is then computed over all words. Alternatively, a specification can take the form of a dictionary mapping names of phrase parts or components (see segram.grammar.phrases.Phrase.part_names and segram.grammar.phrase.Phrase.component_names) to strings or iterables of strings convertible to word vectors (as above), or to other phrases. Importantly, phrases can also be compared against segram.grammar.Sent and segram.grammar.Doc objects as long as they consist of a single sentence. See SentSimilarity for details.

method

Method for calculating similarity between phrases:

components

Components are grouped into buckets by type (verbs, nouns, prepositions and descriptions), and averaged vectors are compared between buckets of the same type. Finally, a weighted average (with weights defined by the weights parameter) is taken and rescaled by a factor of shared / union, where shared is the number of types present in both elements and union is the total number of unique types across both of them. Thus, the final result is akin to a fuzzy Jaccard similarity:

\[J = \frac{|A \cap B|}{|A \cup B|}\]

phrases

As above but based on phrase parts and phrase head components. See segram.grammar.Phrase.part_names for a full list.

both

As above but components and phrases are used together.

average

Simple average vectors calculated over all component head tokens are used. In this case weights are ignored.

recursive

NOTE. Currently not implemented. First, head components are compared between the two phrases, and then the same rule is applied recursively to all parts (subjects, direct objects, etc.), where for each type the elements of the two phrases are matched in pairs to maximize similarity. As previously, weights can be applied to different types and a Jaccard-like rescaling is applied. Additionally, the importance of nested phrases may be discounted using the decay_rate parameter, which rescales each weight by a factor of decay_rate**depth, where depth is calculated relative to the depth of self.phrase.

weights

Dictionary mapping phrase part or component names to arbitrary weights (which must be positive). The weights do not have to be normalized to sum to one.

decay_rate

Additional parameter used when method="recursive", which controls the rate at which contributions coming from nested subphrases are discounted.

only, ignore

Lists of part or component names to selectively use or ignore. The two arguments cannot be used at the same time.

Raises:

RuntimeError – If word vectors are not available.

get_similarity(element: Doc, spec: Doc) -> float

Get structured similarity between self.element and self.spec.