
Issue while calculating ROC - Decimal type not json serialisable #2043

Closed

RobinL opened this issue Mar 11, 2024 · Discussed in #2042 · 2 comments

RobinL (Member) commented Mar 11, 2024

Discussed in #2042

Originally posted by shivam221098 March 11, 2024
When I try to get the ROC curve using the roc_chart_from_labels_column method, I get this error. I don't know how to solve it, as none of my own methods are involved in the code. I just created a model using:

# Splink 3 import, added for completeness (assumed from context)
from splink.spark.linker import SparkLinker

spark_df = spark.read.csv("gs://temp_csv.csv", header=True)
linker = SparkLinker(spark_df, settings, spark=spark)

# probability two random records match
linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.8)

linker.estimate_u_using_random_sampling(max_pairs=1e7)  # u estimation

linker.estimate_m_from_label_column("column_label")

After the above, I just save the model into a dictionary, and when I call the method, I get this error.

[Screenshot of the traceback: Decimal type is not JSON serializable]
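For reference, the failing call is the one named above; a minimal sketch (only roc_chart_from_labels_column and the label column name come from this report, the rest is assumed):

# Splink 3 ROC chart from a ground-truth label column; this is the call that
# reportedly fails with the Decimal JSON serialization error
linker.roc_chart_from_labels_column("column_label")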

RobinL (Member, Author) commented Mar 11, 2024

Likely same root cause as #1893

RobinL (Member, Author) commented Aug 17, 2024

This doesn't seem to be a problem in Splink 4. Closing. A reprex of working code is below.

Working in Splink 4
%load_ext autoreload
%autoreload 2
import logging
import time

from pyspark.context import SparkConf, SparkContext
from pyspark.sql import SparkSession

import splink.comparison_library as cl
from splink import Linker, SettingsCreator, SparkAPI, block_on, splink_datasets
from splink.backends.spark import similarity_jar_location




path = similarity_jar_location()
df_pandas = splink_datasets.fake_1000


# df_pandas.iteritems = df_pandas.items

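# Spark config: point spark.jars at the Splink similarity UDF jar and set memory/partitioning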
conf = SparkConf()
conf.set("spark.jars", path)
conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")
conf.set("spark.default.parallelism", "12")

sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)
print(spark)
display(spark)

df = spark.createDataFrame(df_pandas)


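# Spark backend (DB API) for Splink 4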
db_api = SparkAPI(
    spark_session=spark,
    break_lineage_method="parquet",
    num_partitions_on_repartition=6,
)



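# Blocking rules used to generate candidate record pairs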
blocking_rules = [
    block_on("surname"),
    block_on("first_name"),
    block_on("dob"),
]

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name").configure(
            term_frequency_adjustments=True
        ),
        cl.JaroWinklerAtThresholds("surname").configure(
            term_frequency_adjustments=True
        ),
        cl.DateOfBirthComparison("dob", input_is_string=True),


        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.LevenshteinAtThresholds("email")
    ],
    retain_intermediate_calculation_columns=True,
    additional_columns_to_retain=["cluster"]
)

linker = Linker(df, settings, db_api)

logging.basicConfig(format="%(message)s")
logging.getLogger("splink").setLevel(10)


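# Predict above a match weight threshold and time the run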
start = time.time()
df = linker.inference.predict(threshold_match_weight=2)
df.as_pandas_dataframe()
end_time = time.time()
elapsed_time = end_time - start
print(f"Elapsed time: {elapsed_time:.2f} seconds")


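# Accuracy analysis from the ground-truth "cluster" column, then from a registered labels table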
linker.evaluation.accuracy_analysis_from_labels_column("cluster")

from splink.internals.datasets import splink_dataset_labels
lab = splink_dataset_labels.fake_1000_labels
lab_sdf = linker.table_management.register_labels_table(lab, "labels")
linker.evaluation.accuracy_analysis_from_labels_table(lab_sdf)

RobinL closed this as completed Aug 17, 2024