How to retrieve bidirectional relation between the tensor representation and a given span #13048
igormorgado started this conversation in Help: Best practices
Replies: 1 comment
Here is a very nice tutorial that shows how to align The spacy |
I know that transformer models use byte pair encoding, so it isn't always possible to have a single tensor (aka vector) representing a single word. Taking that into account, I would like to know if it is possible to:
a. Given a `Span` (or the `doc` indexes for `start` and `end` of the span), retrieve the list of tensors related to those elements/tokens; and
b. Given a tensor representation, find the `Span` or the indexes related to it in the `Doc`.
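The two lookups above can be pictured with a minimal sketch in plain Python, so it runs without the model. It assumes the alignment is available as a flat list of wordpiece row indices plus a per-token count, which is how a ragged array is typically laid out. The toy values and the helper names `span_to_rows` and `row_to_tokens` are hypothetical and not part of spaCy.

```python
from itertools import accumulate

# Toy stand-ins (hypothetical values, not produced by a real pipeline).
# rows: the transformer output flattened to one vector per wordpiece,
# i.e. (batch * wordpieces, width) instead of (batch, wordpieces, width).
rows = [[float(i)] * 4 for i in range(6)]

# Ragged alignment for 3 tokens, stored as flat data plus per-token lengths:
# token 0 -> row 1, token 1 -> rows 2 and 3, token 2 -> row 4.
# Rows 0 and 5 stand for special wordpieces that belong to no token.
align_data = [1, 2, 3, 4]
align_lengths = [1, 2, 1]

# offsets[i] is where token i's entries start inside align_data.
offsets = [0, *accumulate(align_lengths)]


def span_to_rows(start, end):
    """Wordpiece rows aligned to the tokens start..end-1."""
    return align_data[offsets[start]:offsets[end]]


def row_to_tokens(row):
    """Indices of the tokens whose alignment contains this wordpiece row."""
    return [
        tok
        for tok in range(len(align_lengths))
        if row in align_data[offsets[tok]:offsets[tok + 1]]
    ]


# Direction (a): vectors for the span covering tokens 1 and 2.
span_vectors = [rows[r] for r in span_to_rows(1, 3)]
# Direction (b): tokens behind wordpiece row 3.
tokens_for_row = row_to_tokens(3)
```

With the real objects, the analogous steps would presumably be slicing `doc._.trf_data.align[span.start:span.end]` for the row indices and reshaping `trfdata.tensors[0]` to two dimensions for the rows. Treat that as an assumption to check against the installed spacy-transformers version.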
So far I have been playing with the `TransformerData` object generated by the `en_core_web_trf` pipeline, and could not find a clear way to do that.

Given the following preamble:

I could find the following data structure equalities:
Also, the tensors produced from the document are stored at `trfdata.tensors[0]`, with dimensions corresponding to `(batch_id, tokens, tensor_representation)`.

I tried to reconstruct the text from `align.dataXd` and `align.lengths`, without success. One of the attempts was this one:

The text seems to repeat sometimes, and I could not understand how the `Ragged` object works. The output looks like this:

But I expected this (from `doc.text`):

Last but not least, I noticed that `np.array(trfdata.wordpieces.strings).shape` matches `trfdata.tensors[0].shape`, so I think I can start to find a bidirectional relation from here. I just need to find how to detect the relation between `wordpieces` and their related `Spans` in the `Doc`.

Just to end, my questions are:
- `wordpieces.strings` and `tensors` with the text in the doc.
- `trfdata.tensors[0]`
- `Ragged` object, its indexes and lengths.

Best regards...
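A possible reading of the repeated text and of the ragged layout, offered as an assumption rather than a confirmed diagnosis: if the pipeline splits the input into overlapping windows, wordpieces in the overlap appear once per window, so reading every row in order repeats them. A ragged array would then store one variable-length row per token as a flat `data` list plus `lengths`. The sketch below uses plain Python and toy values; the window size, the stride, the `<pad>` marker, and the helper `ragged_row` are all hypothetical.

```python
from itertools import accumulate

# Toy wordpieces split into two overlapping windows (hypothetical values;
# window=4 and stride=2 are chosen only to keep the example small).
pieces = ["The", "qu", "ick", "fox", "jumps"]
window, stride = 4, 2
spans = [pieces[i:i + window] for i in range(0, len(pieces) - stride, stride)]
# Pad every window to the same length, as a batched tensor would require.
spans = [s + ["<pad>"] * (window - len(s)) for s in spans]

# Reading all rows in order repeats the pieces shared by both windows.
flat = [p for span in spans for p in span]

# Ragged alignment: one variable-length row per token, stored flat.
# Tokens: "The", "quick", "fox", "jumps".
align_data = [0, 1, 2, 4, 3, 5, 6]   # indices into flat
align_lengths = [1, 3, 2, 1]         # number of entries per token
offsets = [0, *accumulate(align_lengths)]


def ragged_row(i):
    """Entries of row i: the slice of data selected by the lengths."""
    return align_data[offsets[i]:offsets[i + 1]]


# Grouping by token shows which wordpieces (including repeats) belong together.
per_token = [[flat[r] for r in ragged_row(i)] for i in range(len(align_lengths))]
```

Here `flat` contains "ick" and "fox" twice, while `per_token` groups every occurrence under the token it belongs to. Iterating over the alignment rows rather than over the raw wordpiece strings is therefore one way to avoid the duplication.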