How to include context around a span into span categorization (label). #12580

DerDiego13 · 2023-04-27T10:59:57Z

DerDiego13
Apr 27, 2023

Hi,

Referring to this question: https://support.prodi.gy/t/does-the-spancat-function-include-context-for-the-classification/6510

I was under the (false) impression that during the "Pooling" step, a window encoder would take into account context around the span to predict its label. Let's take the following two sentences as an example:

'Mike found a postcard worth USD 2.'
'Mike found USD 2.'

In both cases, I want the string 'USD 2' as my span. However, in the first case I want it to be the span.label_ 'Value' and in the second I want to use the span.label_ 'Item'. In reality it will be more complex cases, so a simple rule set will likely not work.

Would it be a valid option to multiply the "token.vector" with the "token.dep" to account for this? Are there any better options - a Transformer model was already rightfully suggested in the forum linked above, however, I would prefer working on CPU?

Answered by adrianeboyd

Apr 28, 2023

The spancat component uses the context-sensitive tensors for the first and last tokens in the span, either from the tok2vec component or from the transformer component. If tok2vec is a separate pipeline component, you can inspect this in doc.tensor and see that the tensor depends on the surrounding context:

import spacy
nlp = spacy.load("en_core_web_sm")
print(nlp("I found USD 2")[2].tensor)
print(nlp("I found a postcard worth USD 2")[5].tensor)

The amount of context is defined by window_size in the tok2vec.encode config as described here: https://spacy.io/api/architectures#HashEmbedCNN

I realize this is a toy example, but it does sound like the model will struggle to make this distinctio…

View full answer

adrianeboyd · 2023-04-28T06:38:21Z

adrianeboyd
Apr 28, 2023

The spancat component uses the context-sensitive tensors for the first and last tokens in the span, either from the tok2vec component or from the transformer component. If tok2vec is a separate pipeline component, you can inspect this in doc.tensor and see that the tensor depends on the surrounding context:

import spacy
nlp = spacy.load("en_core_web_sm")
print(nlp("I found USD 2")[2].tensor)
print(nlp("I found a postcard worth USD 2")[5].tensor)

The amount of context is defined by window_size in the tok2vec.encode config as described here: https://spacy.io/api/architectures#HashEmbedCNN

I realize this is a toy example, but it does sound like the model will struggle to make this distinction just based on the word-level context, so trying out adding token.dep sounds like a reasonable idea. If you use the DEP feature as an additional attribute for tok2vec (this isn't an option for transformer), you can add it to the tok2vec configuration. This requires a few extra manual config settings so that the predicted parses are visible to spancat during training, see this project for an example: https://github.com/explosion/projects/tree/v3/pipelines/tagger_parser_predicted_annotations

You also have to be careful to get the config settings correct if you have more than one tok2vec component or the parser requires it, so it can be easier to replace the listeners if you're using the parser from a pipeline like en_core_web_sm. So to use the parser from en_core_web_sm:

[components.parser]
source = "en_core_web_sm"
replace_listeners = ["model.tok2vec"]

...

[training]
frozen_components = ["parser"]
annotating_components = ["parser"]

If you want to add vectors, enable use_static_vectors in the tok2vec config and add the model containing vectors under paths.vectors. A pipeline only supports one set of vectors, so if you use the parser from en_core_web_md, you'd also want to include the vectors from en_core_web_md or the parser won't work correctly.

1 reply

DerDiego13 May 4, 2023
Author

Thank you so much. That both clarifies and helps a lot.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to include context around a span into span categorization (label). #12580

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

How to include context around a span into span categorization (label). #12580

DerDiego13 Apr 27, 2023

Replies: 1 comment · 1 reply

adrianeboyd Apr 28, 2023

DerDiego13 May 4, 2023 Author

DerDiego13
Apr 27, 2023

Replies: 1 comment 1 reply

adrianeboyd
Apr 28, 2023

DerDiego13 May 4, 2023
Author