
Filter duplicate vectors when pruning vectors #5397

Open
adrianeboyd opened this issue May 4, 2020 · 2 comments
Labels
bug (Bugs and behaviour differing from documentation), feat / vectors (Feature: Word vectors and similarity)

Comments

@adrianeboyd
Contributor

How to reproduce the behaviour

When prioritizing vectors to keep, Vocab.prune_vectors doesn't handle existing duplicates in key2row well. Because it sorts/prioritizes by the values from key2row, which may contain duplicate row indices (e.g. after a previous prune), prune_vectors may keep multiple copies of the same vector.
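The failure mode can be illustrated with plain NumPy (illustrative values only, not taken from the issue): after an earlier prune, several keys share a row, so the priority-sorted row indices contain duplicates, and truncating to the first `nr_row` entries copies the same vector more than once while discarding a distinct one.

```python
import numpy

# Hypothetical vector table: 4 distinct 2-d vectors
data = numpy.arange(8, dtype="f").reshape(4, 2)
# Row indices sorted by priority, as prune_vectors derives them from
# key2row; after an earlier prune, two keys share row 1
indices = numpy.asarray([0, 1, 1, 2])

nr_row = 3
keep = data[indices[:nr_row]]  # keeps row 1 twice, tosses row 2

print((keep[1] == keep[2]).all())  # -> True: the same vector kept twice
```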

Fix:

  • remove duplicate vectors from indices and adjust keys accordingly
  • preserve the relevant duplicate key2row keys (this is not compatible with the keys truncation from #5348 (Fix most_similar for vectors with unused rows), which is overly simple; I think you have to re-add the duplicate rows to vectors after initialization with Vectors.add(row=))
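The first bullet can be sketched with plain NumPy (illustrative arrays only): `numpy.unique(..., return_index=True)` locates the first occurrence of each row, so duplicates can be masked out before truncating to nr_row, and any masked-out duplicate keys would then be re-added afterwards.

```python
import numpy

# Hypothetical priority-sorted rows and keys, with row 1 duplicated
indices = numpy.asarray([0, 1, 1, 2], dtype="uint64")
keys = numpy.asarray([100, 101, 102, 103], dtype="uint64")

# Positions of the first occurrence of each unique row, sorted back
# into priority order
_, first_idx = numpy.unique(indices, return_index=True)
first_idx.sort()

nr_row = 2
keep_mask = numpy.zeros(indices.size, dtype=bool)
keep_mask[first_idx[:nr_row]] = True

print(indices[keep_mask])  # no duplicate rows among the kept entries
print(keys[keep_mask])
# Key 102 (a duplicate of row 1) would then be re-added with
# Vectors.add(key, row=...) once the new table is built.
```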
@adrianeboyd added the bug (Bugs and behaviour differing from documentation) and feat / vectors (Feature: Word vectors and similarity) labels on May 4, 2020
@0x0badc0de

Just faced this issue today when I tried to prune a previously pruned vocabulary. I wonder if the fix could consist of forming a mask over indices/keys that picks the first nr_row unique rows, then using the inverted mask to pick up the rest. Not deeply tested, just a raw idea.

It could be implemented like the following:

def prune_vectors(self, nr_row, batch_size=1024):
    """Reduce the current vector table to `nr_row` unique entries. Words
    mapped to the discarded vectors will be remapped to the closest vector
    among those remaining.
    For example, suppose the original table had vectors for the words:
    ['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to
    two rows, we would discard the vectors for 'feline' and 'reclined'.
    These words would then be remapped to the closest remaining vector
    -- so "feline" would have the same vector as "cat", and "reclined"
    would have the same vector as "sat".
    The similarities are judged by cosine. The original vectors may
    be large, so the cosines are calculated in minibatches, to reduce
    memory usage.
    nr_row (int): The number of rows to keep in the vector table.
    batch_size (int): Batch of vectors for calculating the similarities.
        Larger batch sizes might be faster, while temporarily requiring
        more memory.
    RETURNS (dict): A dictionary keyed by removed words mapped to
        `(string, score)` tuples, where `string` is the entry the removed
        word was mapped to, and `score` the similarity score between the
        two words.
    DOCS: https://spacy.io/api/vocab#prune_vectors
    """
    ops = get_current_ops()
    xp = get_array_module(self.vectors.data)
    # Make sure all vectors are in the vocab
    for orth in self.vectors:
        self[orth]
    # Make prob negative so it sorts by rank ascending
    # (key2row contains the rank)
    priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
                for lex in self if lex.orth in self.vectors.key2row]
    priority.sort()
    indices = xp.asarray([i for (prob, i, key) in priority], dtype="uint64")
    keys = xp.asarray([key for (prob, i, key) in priority], dtype="uint64")

    _, unique_indices_idx = xp.unique(indices, return_index=True)
    # unique_indices_idx points to the first occurrences of unique values;
    # its order matches the order of the sorted unique values of indices.
    # Later we need to pick the first nr_row unique values from `indices`,
    # so sort `unique_indices_idx` back into priority order.
    unique_indices_idx.sort()

    # A mask that picks the first nr_row unique rows from indices/keys.
    keep_mask = xp.zeros(indices.size, dtype=bool)
    keep_mask[unique_indices_idx[:nr_row]] = True

    keep = xp.ascontiguousarray(self.vectors.data[indices[keep_mask]])
    toss = xp.ascontiguousarray(self.vectors.data[indices[~keep_mask]])
    self.vectors = Vectors(data=keep, keys=keys[keep_mask], name=self.vectors.name)
    syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size)
    syn_keys = ops.to_numpy(syn_keys)
    remap = {}
    for i, key in enumerate(ops.to_numpy(keys[~keep_mask])):
        self.vectors.add(key, row=syn_rows[i][0])
        word = self.strings[key]
        synonym = self.strings[syn_keys[i][0]]
        score = scores[i][0]
        remap[word] = (synonym, score)
    return remap

@adrianeboyd
Contributor Author

Sorry this didn't work as expected, and thanks for the suggestion! This issue is kind of low priority on our end right now, but we'll try to come back to it when we find a bit of time.
