How to retrieve bidirectional relation between the tensor representation and a given span #13048

igormorgado · 2023-10-05T22:28:55Z

I know that transformer models use byte-pair encoding, so a single word is not always represented by a single tensor (i.e. vector). Taking that into account, I would like to know whether it is possible to:

a. Given a Span (or the Doc indexes for the start and end of the span), retrieve the list of tensors related to those elements/tokens; and

b. Given a tensor representation, find the Span, or the indexes in the Doc, related to it.

So far I have been playing with the TransformerData object generated by the en_core_web_trf pipeline, and could not find a clear way to do that.

Given the following preamble:

import spacy
import numpy as np
nlp = spacy.load("en_core_web_trf")
doc = nlp(SOME_TEXT)
trfdata = doc._.trf_data

I could find the following data structure equalities:

# Text given as input to the transformer model
np.all(np.array(trfdata.wordpieces.strings) == np.array(trfdata.tokens['input_texts']))
# True

# Alignment data (don't know how it works)
np.all(trfdata.align.dataXd == trfdata.align.data[:,0])
# True

# Tensors representing each input token.
np.all(trfdata.model_output['last_hidden_state'] == trfdata.tensors[0])
# True

Also, the tensors produced from the document are stored at trfdata.tensors[0], with dimensions corresponding to (batch_id, tokens, tensor_representation).

I tried to reconstruct the text from align.dataXd and align.lengths, without success. One of the attempts was this one:

aligns = trfdata.align.dataXd
lengths = trfdata.align.lengths
for pos,length in zip(aligns,lengths):
    print(doc[pos-1:pos+length-1], end=" ")

The text sometimes repeats, and I could not understand how the Ragged object works. The output looks like this:

Samanta was was in a dark coat : long , silky, , and tight on womanly hips hips . My doorway [...]

But I expected this (from doc.text):

Samanta was in a dark coat: long, silky, and tight on womanly hips. My doorway [...]

Last but not least, I noticed that np.array(trfdata.wordpieces.strings).shape matches trfdata.tensors[0].shape, so I think I can start looking for a bidirectional relation from here. I just need to find out how to relate the wordpieces to their Spans in the Doc.

To sum up, my questions are:

How do I relate wordpieces.strings and the tensors to the text in the doc?
How do I do the opposite: relate a token/span in the doc to a list of entries in trfdata.tensors[0]?
How should I handle the Ragged object, its indexes and lengths?
Of all those data structures that (at least from my perspective) look redundant (probably just different names pointing to the same object in memory), which ones should be used?
What are the best approaches to these?

Best regards...

adrianeboyd · 2023-10-06T08:26:12Z

Here is a very nice tutorial that shows how to align trf_data with spaCy tokens and spans:

https://applied-language-technology.mooc.fi/html/notebooks/part_iii/05_embeddings_continued.html#contextual-word-embeddings-from-transformers

The spaCy Doc is processed internally in overlapping strided spans (to allow for docs that are longer than the transformer model's maximum length), so some tokens will be aligned to more than one span in the transformer output.
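
As a rough illustration of how the alignment can be read in both directions, here is a minimal sketch on toy NumPy arrays. It assumes the layout described in the spacy-transformers documentation: align.lengths[i] is the number of wordpieces aligned to token i, and align.dataXd holds the matching indices into the wordpiece rows after flattening tensors[0] to shape (-1, width). The helper names span_to_rows and row_to_tokens and all toy values are made up for this example; they are not part of spaCy.

```python
import numpy as np


def span_to_rows(lengths, flat_idx, start, end):
    """Flat wordpiece rows aligned to the tokens doc[start:end]."""
    lengths = np.asarray(lengths, dtype=int)
    flat_idx = np.asarray(flat_idx, dtype=int).reshape(-1)
    # bounds[i]:bounds[i + 1] is the slice of flat_idx that belongs to token i.
    bounds = np.concatenate(([0], np.cumsum(lengths)))
    # np.unique sorts the rows and drops any duplicates.
    return np.unique(flat_idx[bounds[start]:bounds[end]])


def row_to_tokens(lengths, flat_idx, row):
    """Token indices whose alignment contains the given flat wordpiece row."""
    lengths = np.asarray(lengths, dtype=int)
    flat_idx = np.asarray(flat_idx, dtype=int).reshape(-1)
    # owner[k] is the token that alignment entry k belongs to.
    owner = np.repeat(np.arange(len(lengths)), lengths)
    return np.unique(owner[flat_idx == row])


# Toy stand-ins (assumed values): 4 tokens, 2 strided spans of 4 wordpieces, width 3.
lengths = np.array([1, 2, 2, 0])      # token 3 has no wordpiece (e.g. whitespace)
flat_idx = np.array([1, 2, 5, 6, 7])  # token 1 -> rows 2 and 5, one per strided span
tensors0 = np.arange(24, dtype=float).reshape(2, 4, 3)
flat = tensors0.reshape(-1, tensors0.shape[-1])

rows = span_to_rows(lengths, flat_idx, 1, 3)  # rows for tokens 1 and 2
span_vector = flat[rows].mean(axis=0)         # one pooled vector for the span
print(rows.tolist())                          # [2, 5, 6, 7]
print(span_vector.tolist())                   # [15.0, 16.0, 17.0]
print(row_to_tokens(lengths, flat_idx, 5).tolist())  # [1]
print(row_to_tokens(lengths, flat_idx, 0).tolist())  # [] (special token, unaligned)
```

With a real pipeline, lengths and flat_idx would presumably come from doc._.trf_data.align.lengths and doc._.trf_data.align.dataXd, and flat from trfdata.tensors[0] reshaped as above. Under this reading, lengths has one entry per token while dataXd has one entry per aligned wordpiece, so zipping the two element-wise pairs unrelated values. Mean pooling is only one possible way to combine the rows of a span.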
