convert accuracy metrics to float #1893
Conversation
I wonder if this bug is also related? #1822 (comment)
Yeah, so fundamentally this is because in Spark a literal number is a Decimal, not a Double. See also here. So whenever we have a literal number in the SQL, we have to wrap it in a cast: splink/splink/comparison_level.py, line 557 in 663491b
Then we have the dialect translation step, which translates e.g. cast(0.12345 AS float8) into Spark-compatible SQL. We need to do that because the float8 type doesn't exist in Spark:
spark.sql('select 0.1 as a').toPandas().iloc[0,0]
spark.sql('select 0.12345678D as a').toPandas().iloc[0,0]
spark.sql('select cast(0.12345678 as float8) as a').toPandas().iloc[0,0]
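To make that behaviour concrete, here is a minimal runnable sketch (assuming a local PySpark 3.x session; the exact exception raised for the unsupported float8 cast may vary between Spark versions):

```python
# Minimal sketch of the literal-typing behaviour discussed above.
# Assumes a local PySpark 3.x session; exception classes may differ by version.
from decimal import Decimal
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# A bare numeric literal is parsed as DECIMAL, so pandas receives a Python Decimal.
a = spark.sql("select 0.1 as a").toPandas().iloc[0, 0]
print(type(a), isinstance(a, Decimal))  # <class 'decimal.Decimal'> True

# The D suffix makes the literal a DOUBLE, which round-trips as a Python float.
b = spark.sql("select 0.12345678D as a").toPandas().iloc[0, 0]
print(type(b))  # <class 'float'>

# float8 is not a Spark SQL type, so this cast does not parse.
try:
    spark.sql("select cast(0.12345678 as float8) as a").collect()
except Exception as exc:
    print(type(exc).__name__)
```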
You can also see here (gamma_name) that you can't use cast(0.23489134 as double), as it truncates; it has to be the 0.23498234D syntax:
spark.sql('select cast(0.349807234523750923475089457470478952340895740 as double) as a').toPandas().iloc[0,0]
spark.sql('select 0.349807234523750923475089457470478952340895740D as a').toPandas().iloc[0,0]
Minor change in a comment on the code, but yeah, I think this is the right way to fix this.
splink/accuracy.py (outdated diff)
5.0*TP/(5*TP + 4*FN + FP) as f2,
1.25*TP/(1.25*TP + 0.25*FN + FP) as f0_5,
4.0*TP*TN/((4.0*TP*TN) + ((TP + TN)*(FP + FN))) as p4,
cast(2.0*TP/(2*TP + FN + FP) as float) as f1,
I think all of these should be float8 for consistency (see comments on PR).
See also #694.
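To illustrate the suggestion, here is a sketch of how those metric expressions could look with explicit float8 casts. This is a hypothetical rewrite for illustration only, not the change as committed, and it assumes splink's dialect translation rewrites float8 for backends that lack it:

```python
# Hypothetical rewrite of the accuracy metric expressions with explicit
# float8 casts, per the review suggestion above. Illustrative only; the
# variable name accuracy_metrics_sql is not from splink itself.
accuracy_metrics_sql = """
    cast(5.0*TP/(5*TP + 4*FN + FP) as float8) as f2,
    cast(1.25*TP/(1.25*TP + 0.25*FN + FP) as float8) as f0_5,
    cast(4.0*TP*TN/((4.0*TP*TN) + ((TP + TN)*(FP + FN))) as float8) as p4,
    cast(2.0*TP/(2*TP + FN + FP) as float8) as f1,
"""
```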
I've just seen this exact problem as well. We're using the Postgres backend, and I had to convert all the … For reference, this is how the table is created as it currently stands:
P.S. I'll be doing a PR to alter line 200 in cbd5e8f.
Thank you, that sounds good @cinnq346. When you do, please could you add additional casts around the 1.0 and 0.0 to make it 'spark proof', as follows:
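(The exact snippet isn't shown above; purely as an illustrative guess at the kind of wrapping being requested:)

```python
# Illustrative guess only: the actual snippet from the thread is not
# preserved above. The request is to cast the bare 1.0 / 0.0 literals
# explicitly so Spark parses them as doubles rather than DECIMALs.
# The column aliases here are placeholders, not names from splink.
bare_literals = "1.0 as some_metric, 0.0 as another_metric"
spark_proof = (
    "cast(1.0 as float8) as some_metric, "
    "cast(0.0 as float8) as another_metric"
)
```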
Thanks for this table, it's really useful. The additional casts have been added here. Feel free to tag me into your follow-up PR alongside Robin.
Type of PR
Give a brief description for the solution you have provided
Our production QA pipelines have been encountering the following error:
TypeError: Object of type Decimal is not JSON serializable
when attempting to serialise the Altair chart spec, for any charts containing the following SQL. I've narrowed the error down to the following lines:
which, when run in our Glue jobs (Spark 3.1.1), produce values that are cast to Python's Decimal type (from decimal import Decimal).
To resolve this, I've added some explicit float casts, which seems to ensure that Spark doesn't convert the values to Decimals; this branch has been used to check that this resolves the issue.
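For context, a minimal sketch of the failure mode and why a float cast sidesteps it (standalone Python, not splink's own code; the chart spec is ultimately serialised with json.dumps, which is where Decimal values fail):

```python
# Minimal sketch of the failure mode (standalone, not splink's own code).
# json.dumps cannot handle decimal.Decimal, which is exactly the
# "Object of type Decimal is not JSON serializable" error above.
import json
from decimal import Decimal

row = {"f1": Decimal("0.9231"), "precision": Decimal("0.8876")}  # example values

try:
    json.dumps(row)
except TypeError as exc:
    print(exc)  # Object of type Decimal is not JSON serializable

# Casting to float upstream (as this PR does in the SQL) avoids the issue.
print(json.dumps({k: float(v) for k, v in row.items()}))
```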
I haven't looked into why this is happening, so there are no tests in this PR. However, if you would like a more detailed explanation exploring what's going on, I'm happy to try and get a reproducible example.