
Filter duplicate vectors when pruning vectors #5397

Open
adrianeboyd opened this issue May 4, 2020 · 2 comments
Labels
bug (Bugs and behaviour differing from documentation), feat / vectors (Feature: Word vectors and similarity)

Comments

@adrianeboyd
Contributor

How to reproduce the behaviour

When prioritizing vectors to keep, Vocab.prune_vectors doesn't handle existing duplicates in key2row well. Because it sorts/prioritizes by the values from key2row, which may contain duplicate row indices (e.g. after a previous prune), prune_vectors may keep multiple copies of the same vector.
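The failure mode can be illustrated with plain NumPy (illustrative values only, not taken from the issue): after an earlier prune, several keys share a row, so the priority-sorted row indices contain duplicates, and truncating to the first `nr_row` entries copies the same vector more than once while discarding a distinct one.

```python
import numpy

# Hypothetical vector table: 4 distinct 2-d vectors
data = numpy.arange(8, dtype="f").reshape(4, 2)
# Row indices sorted by priority, as prune_vectors derives them from
# key2row; after an earlier prune, two keys share row 1
indices = numpy.asarray([0, 1, 1, 2])

nr_row = 3
keep = data[indices[:nr_row]]  # keeps row 1 twice, tosses row 2

print((keep[1] == keep[2]).all())  # -> True: the same vector kept twice
```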

Fix:

  • remove duplicate vectors from indices and adjust keys accordingly
  • preserve the relevant duplicate key2row keys (this is not compatible with the keys truncation from #5348 (Fix most_similar for vectors with unused rows), which is overly simple; I think you have to re-add the duplicate rows to vectors after initialization with Vectors.add(row=))
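The first bullet can be sketched with plain NumPy (illustrative arrays only): `numpy.unique(..., return_index=True)` locates the first occurrence of each row, so duplicates can be masked out before truncating to nr_row, and any masked-out duplicate keys would then be re-added afterwards.

```python
import numpy

# Hypothetical priority-sorted rows and keys, with row 1 duplicated
indices = numpy.asarray([0, 1, 1, 2], dtype="uint64")
keys = numpy.asarray([100, 101, 102, 103], dtype="uint64")

# Positions of the first occurrence of each unique row, sorted back
# into priority order
_, first_idx = numpy.unique(indices, return_index=True)
first_idx.sort()

nr_row = 2
keep_mask = numpy.zeros(indices.size, dtype=bool)
keep_mask[first_idx[:nr_row]] = True

print(indices[keep_mask])  # no duplicate rows among the kept entries
print(keys[keep_mask])
# Key 102 (a duplicate of row 1) would then be re-added with
# Vectors.add(key, row=...) once the new table is built.
```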
@adrianeboyd added the bug (Bugs and behaviour differing from documentation) and feat / vectors (Feature: Word vectors and similarity) labels on May 4, 2020
@0x0badc0de

Just faced this issue today when I tried to prune a previously pruned vocabulary. I wonder if the fix could consist of forming a mask over indices/keys that picks the first nr_row unique rows, then using the inverted mask to pick up the rest. Not deeply tested, just a raw idea.

It could be implemented like the following:

def prune_vectors(self, nr_row, batch_size=1024):
    """Reduce the current vector table to `nr_row` unique entries. Words
    mapped to the discarded vectors will be remapped to the closest vector
    among those remaining.
    For example, suppose the original table had vectors for the words:
    ['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to
    two rows, we would discard the vectors for 'feline' and 'reclined'.
    These words would then be remapped to the closest remaining vector
    -- so "feline" would have the same vector as "cat", and "reclined"
    would have the same vector as "sat".
    The similarities are judged by cosine. The original vectors may
    be large, so the cosines are calculated in minibatches, to reduce
    memory usage.
    nr_row (int): The number of rows to keep in the vector table.
    batch_size (int): Batch of vectors for calculating the similarities.
        Larger batch sizes might be faster, while temporarily requiring
        more memory.
    RETURNS (dict): A dictionary keyed by removed words mapped to
        `(string, score)` tuples, where `string` is the entry the removed
        word was mapped to, and `score` the similarity score between the
        two words.
    DOCS: https://spacy.io/api/vocab#prune_vectors
    """
    ops = get_current_ops()
    xp = get_array_module(self.vectors.data)
    # Make sure all vectors are in the vocab
    for orth in self.vectors:
        self[orth]
    # Make prob negative so it sorts by rank ascending
    # (key2row contains the rank)
    priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
                for lex in self if lex.orth in self.vectors.key2row]
    priority.sort()
    indices = xp.asarray([i for (prob, i, key) in priority], dtype="uint64")
    keys = xp.asarray([key for (prob, i, key) in priority], dtype="uint64")

    _, unique_indices_idx = xp.unique(indices, return_index=True)
    # unique_indices_idx points to the first occurrences of unique values;
    # its order matches the order of the sorted unique values of indices.
    # Later we need to pick the first nr_row unique values from `indices`,
    # so sort `unique_indices_idx` back into priority order.
    unique_indices_idx.sort()

    # A mask that picks the first nr_row unique rows from indices/keys.
    keep_mask = xp.zeros(indices.size, dtype=bool)
    keep_mask[unique_indices_idx[:nr_row]] = True

    keep = xp.ascontiguousarray(self.vectors.data[indices[keep_mask]])
    toss = xp.ascontiguousarray(self.vectors.data[indices[~keep_mask]])
    self.vectors = Vectors(data=keep, keys=keys[keep_mask], name=self.vectors.name)
    syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size)
    syn_keys = ops.to_numpy(syn_keys)
    remap = {}
    for i, key in enumerate(ops.to_numpy(keys[~keep_mask])):
        self.vectors.add(key, row=syn_rows[i][0])
        word = self.strings[key]
        synonym = self.strings[syn_keys[i][0]]
        score = scores[i][0]
        remap[word] = (synonym, score)
    return remap

@adrianeboyd
Contributor Author

Sorry this didn't work as expected, and thanks for the suggestion! This issue is kind of low priority on our end right now, but we'll try to come back to it when we find a bit of time.
