How to include context around a span into span categorization (label). #12580
Hi, referring to this question: https://support.prodi.gy/t/does-the-spancat-function-include-context-for-the-classification/6510. I was under the (false) impression that during the "pooling" step, a window encoder would take the context around the span into account to predict its label.

Let's take the following two sentences as an example: 'Mike found a postcard worth USD 2.' In both cases I want the string 'USD 2' as my span. However, in the first case I want its `span.label_` to be 'Value', and in the second I want to use the `span.label_` 'Item'. In reality the cases will be more complex, so a simple rule set will likely not work.

Would it be a valid option to multiply the `token.vector` with the `token.dep` to account for this? Are there any better options? A transformer model was already rightly suggested in the forum thread linked above; however, I would prefer to work on CPU.
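For concreteness, here is a minimal sketch of how such examples could be annotated for `spancat` training, with the same surface string 'USD 2' getting a different label depending on its sentence. The second sentence and the output path are made up for illustration:

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# First sentence (from the question): "USD 2" is the value of the postcard.
doc1 = nlp("Mike found a postcard worth USD 2.")
doc1.spans["sc"] = [Span(doc1, 5, 7, label="Value")]  # tokens "USD 2"

# Hypothetical second sentence: here "USD 2" is the thing that was found.
doc2 = nlp("Mike found USD 2 on the street.")
doc2.spans["sc"] = [Span(doc2, 2, 4, label="Item")]   # tokens "USD 2"

# Serialize as training data; "sc" is spancat's default spans_key.
DocBin(docs=[doc1, doc2]).to_disk("./train.spacy")
```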
-
The `spancat` component uses the context-sensitive tensors for the first and last tokens in the span, either from the `tok2vec` component or from the `transformer` component. If `tok2vec` is a separate pipeline component, you can inspect this in `doc.tensor` and see that the tensor depends on the surrounding context:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp("I found USD 2")[2].tensor)
print(nlp("I found a postcard worth USD 2")[5].tensor) The amount of context is defined by I realize this is a toy example, but it does sound like the model will struggle to make this distinction just based on the word-level context, so trying out adding You also have to be careful to get the config settings correct if you have more than one [components.parser]
source = "en_core_web_sm"
replace_listeners = ["model.tok2vec"]
...
[training]
frozen_components = ["parser"]
annotating_components = ["parser"] If you want to add vectors, enable |