
Issue while calculating ROC - Decimal type not json serialisable #2043

Closed

RobinL opened this issue Mar 11, 2024 · Discussed in #2042 · 2 comments

RobinL (Member) commented Mar 11, 2024

Discussed in #2042

Originally posted by shivam221098 March 11, 2024
When I try to get the ROC curve using the roc_chart_from_labels_column method, I get this error. I don't know how to solve it, as none of my own methods are involved in the code. I just created a model using:

# Splink 3 import, added for completeness (assumed from context)
from splink.spark.linker import SparkLinker

spark_df = spark.read.csv("gs://temp_csv.csv", header=True)
linker = SparkLinker(spark_df, settings, spark=spark)

# probability two random records match
linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.8)

linker.estimate_u_using_random_sampling(max_pairs=1e7)  # u estimation

linker.estimate_m_from_label_column("column_label")

After the above, I just save the model into a dictionary, and when I call the method, I get this error.

[Screenshot of the traceback: Decimal type is not JSON serializable]
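For reference, the failing call is the one named above; a minimal sketch (only roc_chart_from_labels_column and the label column name come from this report, the rest is assumed):

# Splink 3 ROC chart from a ground-truth label column; this is the call that
# reportedly fails with the Decimal JSON serialization error
linker.roc_chart_from_labels_column("column_label")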

RobinL (Member, Author) commented Mar 11, 2024

Likely same root cause as #1893

RobinL (Member, Author) commented Aug 17, 2024

This doesn't seem to be a problem in Splink 4. Closing. A reprex of working code is below.

Working in Splink 4
%load_ext autoreload
%autoreload 2
import logging
import time

from pyspark.context import SparkConf, SparkContext
from pyspark.sql import SparkSession

import splink.comparison_library as cl
from splink import Linker, SettingsCreator, SparkAPI, block_on, splink_datasets
from splink.backends.spark import similarity_jar_location




path = similarity_jar_location()
df_pandas = splink_datasets.fake_1000


# df_pandas.iteritems = df_pandas.items

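# Spark config: point spark.jars at the Splink similarity UDF jar and set memory/partitioning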
conf = SparkConf()
conf.set("spark.jars", path)
conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")
conf.set("spark.default.parallelism", "12")

sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)
print(spark)
display(spark)

df = spark.createDataFrame(df_pandas)


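# Spark backend (DB API) for Splink 4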
db_api = SparkAPI(
    spark_session=spark,
    break_lineage_method="parquet",
    num_partitions_on_repartition=6,
)



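# Blocking rules used to generate candidate record pairs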
blocking_rules = [
    block_on("surname"),
    block_on("first_name"),
    block_on("dob"),
]

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name").configure(
            term_frequency_adjustments=True
        ),
        cl.JaroWinklerAtThresholds("surname").configure(
            term_frequency_adjustments=True
        ),
        cl.DateOfBirthComparison("dob", input_is_string=True),


        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.LevenshteinAtThresholds("email")
    ],
    retain_intermediate_calculation_columns=True,
    additional_columns_to_retain=["cluster"]
)

linker = Linker(df, settings, db_api)

logging.basicConfig(format="%(message)s")
logging.getLogger("splink").setLevel(10)


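# Predict above a match weight threshold and time the run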
start = time.time()
df = linker.inference.predict(threshold_match_weight=2)
df.as_pandas_dataframe()
end_time = time.time()
elapsed_time = end_time - start
print(f"Elapsed time: {elapsed_time:.2f} seconds")


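# Accuracy analysis from the ground-truth "cluster" column, then from a registered labels table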
linker.evaluation.accuracy_analysis_from_labels_column("cluster")

from splink.internals.datasets import splink_dataset_labels
lab = splink_dataset_labels.fake_1000_labels
lab_sdf = linker.table_management.register_labels_table(lab, "labels")
linker.evaluation.accuracy_analysis_from_labels_table(lab_sdf)

RobinL closed this as completed Aug 17, 2024