Issue while calculating ROC - Decimal type not json serialisable #2043
Comments
Likely same root cause as #1893
This doesn't seem to be a problem in Splink 4. Closing. Reprex of working code below.
Working in Splink 4:
%load_ext autoreload
%autoreload 2
import logging
import time
from pyspark.context import SparkConf, SparkContext
from pyspark.sql import SparkSession
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, SparkAPI, block_on, splink_datasets
from splink.backends.spark import similarity_jar_location
path = similarity_jar_location()
df_pandas = splink_datasets.fake_1000
# df_pandas.iteritems = df_pandas.items
conf = SparkConf()
conf.set("spark.jars", path)
conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")
conf.set("spark.default.parallelism", "12")
sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)
print(spark)
display(spark)
df = spark.createDataFrame(df_pandas)
db_api = SparkAPI(
    spark_session=spark,
    break_lineage_method="parquet",
    num_partitions_on_repartition=6,
)
blocking_rules = [
    block_on("surname"),
    block_on("first_name"),
    block_on("dob"),
]
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name").configure(
            term_frequency_adjustments=True
        ),
        cl.JaroWinklerAtThresholds("surname").configure(
            term_frequency_adjustments=True
        ),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.LevenshteinAtThresholds("email"),
    ],
    retain_intermediate_calculation_columns=True,
    additional_columns_to_retain=["cluster"],
)
linker = Linker(df, settings, db_api)
import logging
logging.basicConfig(format="%(message)s")
logging.getLogger("splink").setLevel(10)
start = time.time()
df = linker.inference.predict(threshold_match_weight=2)
df.as_pandas_dataframe()
end_time = time.time()
elapsed_time = end_time - start
print(f"Elapsed time: {elapsed_time:.2f} seconds")
linker.evaluation.accuracy_analysis_from_labels_column("cluster")
from splink.internals.datasets import splink_dataset_labels
lab = splink_dataset_labels.fake_1000_labels
lab_sdf = linker.table_management.register_labels_table(lab, "labels")
linker.evaluation.accuracy_analysis_from_labels_table(lab_sdf)
Discussed in #2042
Originally posted by shivam221098 March 11, 2024
When I try to get the ROC curve using the roc_chart_from_labels_column method, I get this error. I don't know how to solve it, as none of my own methods are used in the code. I just created a model using
After the above, I just save the model into a dictionary, and when I apply a method, I get this error.
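For context, the message in the issue title is what Python's standard json module raises when it is handed a decimal.Decimal value, and Spark returns DecimalType columns to Python as Decimal objects. Whether that is exactly what happens inside the ROC calculation is an assumption here, not something confirmed in this thread. The following is a minimal sketch of the failure and one generic workaround. The record and its field names are hypothetical and are not Splink's internal data structures.

```python
import json
from decimal import Decimal

# Hypothetical chart record; field names are illustrative, not Splink's.
record = {"truth_threshold": Decimal("2.5"), "tp_rate": 0.9}

# The standard encoder has no rule for Decimal, so this raises TypeError.
try:
    json.dumps(record)
    error_message = None
except TypeError as exc:
    error_message = str(exc)


def decimal_default(obj):
    """Fallback for json.dumps: convert Decimal to float, reject anything else."""
    if isinstance(obj, Decimal):
        return float(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Passing the fallback via `default` lets the same record serialise.
serialised = json.dumps(record, default=decimal_default)
print(error_message)
print(serialised)
```

If the Decimal values do come from a DecimalType column in the input data, casting that column to a double before passing the DataFrame to the linker is another option that avoids the issue at the source.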