-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Auto cleanup memmap scores #1086
base: main
Are you sure you want to change the base?
Conversation
@@ -175,7 +174,6 @@ def partition( | |||
clusters = self.cluster(pair_scores, threshold) | |||
clusters = self._add_singletons(data, clusters) | |||
clusters = list(clusters) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have auto-cleanup, perhaps we don't want to cast this to a list? I think then this entire method would be out-of-core, right?
Same for casting to list in join()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's right. i think we we should put in a warning for a few releases though because some folks are likely depending on it being a list instead of a generator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to do what Gazeteer.search() does with the generator: bool
parameter?
def score(.... generator: bool = False):
...
if not generator:
warn("this deprecated")
return list(scores)
else:
yield from scores
dedupe/core.py
Outdated
@@ -176,9 +177,28 @@ def scoreDuplicates( | |||
else: | |||
scored_pairs = numpy.array([], dtype=dtype) | |||
|
|||
# See https://docs.python.org/3/library/weakref.html#comparing-finalizers-with-del-methods | |||
scored_pairs.remove = weakref.finalize(scored_pairs, _cleanup_scores, scored_pairs) | |||
scored_pairs.removed = property(_is_removed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK if including the removed
property is good, probably could live without it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i understand correctly, it's never going to be anything except false in reachable code? because it would only set to true when scored_paris is garbage collected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone could call pairs.remove() manually to do the finalization, and then it would be True. But you're correct, we as the library rely on garbage collection so we never take advantage of the .removed property. We could remove it, always easier to add later than to have to continue supporting it forever...
a61dab1
to
815cde5
Compare
It wasn't stated anywhere in the docs that you have to cleanup the possibly-a-tempfile, so people might not have been doing it. And, I thought this was better than the alternatives from https://stackoverflow.com/questions/865115/how-do-i-correctly-clean-up-a-python-object
Now you can #type: ignore[specific-code]
815cde5
to
133d744
Compare
@@ -176,9 +177,29 @@ def scoreDuplicates( | |||
else: | |||
scored_pairs = numpy.array([], dtype=dtype) | |||
|
|||
# Monkeypatch in these extra methods and attributes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really cool!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah seriously! The first time I've used it too.
wow! i've long wanted to do something like this but del was always so finicky. a super nice approach. can we get a test for this: something like
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made a request for a test in the thread
Yes the test seems important, I can look into it |
Users don't need to worry about cleaning up the memmpa file now