Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/763 improve type annotations for DataFrameModel.validate #1905

Merged
merged 10 commits into from
Feb 17, 2025

Conversation

m-richards
Copy link
Collaborator

This aims to close #763, but is currently a partial WIP step towards that.

I did come across that #1450 already exists, but it doesn't deal with different dataframe libraries, so it seemed easier to start from scratch, and focus specifically on the validate method to get some initial feedback too.

In the current state there's a few issues to resolve:

  • typing imports depending on optional deps, which won't necessarily be available
  • just testing with pandas and polars, would need to deal with all the other dataframe libraries supported (and pyspark seems to actually overload the validate method as well)
  • the pandera mypy throws and error (or at least it does for the pre-commit hooks):
    pandera\api\dataframe\model.py:298: error: Overloaded function signature 2 will never be matched: signature 1's parameter type(s) are the same or broader [misc]
    which I gather is because without a stub library (and perhaps due to some mypy configuration) the revealed type of both pd.DataFrame and pl.LazyFrame is Any, and so the overloads are in conflict with one another.

Another possible approach would be to overload validate in all the dataframe library implementations, purely to update type annotations (avoiding the need to deal with optional imports) but this might not be worth the mess it creats.

I'd welcome any feedback, it's been quite a while since I last submitted a PR to pandera, didn't think it would be such a long gap.

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
Copy link

codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.36%. Comparing base (812b2a8) to head (7ca2b86).
Report is 192 commits behind head on main.

Files with missing lines Patch % Lines
pandera/api/polars/model.py 94.44% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1905      +/-   ##
==========================================
- Coverage   94.28%   93.36%   -0.92%     
==========================================
  Files          91      121      +30     
  Lines        7013     9362    +2349     
==========================================
+ Hits         6612     8741    +2129     
- Misses        401      621     +220     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
inplace: bool = False,
) -> DataFrame[Self]:
"""%(validate_doc)s"""
return cast(
Copy link
Collaborator Author

@m-richards m-richards Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could potentially also all be in an if typing.TYPE_CHECKING block if that's cleaner than providing an implementation the same as the parent. I also don't know if it's preferable to use super() instead?

Edit: I also briefly looked at seeing if an overload on recieving a GeoDataFrame is possible. I had a solution working in my standalone test case, but it was pretty ugly with conditional overloads, and pre-commit mypy was not very happy with that.

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

@classmethod
@overload
def validate(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With mypy in pre-commit, this gives:
pandera\api\polars\model.py:130: error: Overloaded function signature 2 will never be matched: signature 1's parameter type(s) are the same or broader [misc]

but that doesn't seem to be the case. I have this as a test case:

import typing
from typing import reveal_type

import pandera as pda
import pandera.polars as pla

import pandas as pd
import polars as pl

from pandera.typing import Series
from pandera.typing.geopandas import GeoSeries
import geopandas as gpd


class SchemaPandas(pda.DataFrameModel):
    col1: Series[int]
    col2: Series[int]

class SchemaGeoPandas(pda.DataFrameModel):
    col1: Series[int]
    col2: GeoSeries


class SchemaPolars(pla.DataFrameModel):
    col1: Series[int]
    col2: Series[int]

pandas_df = pd.DataFrame({"col1": [1, 2, 3], "col2": [1, 2, 3]})
gdf = gpd.GeoDataFrame(pandas_df.assign(col2=gpd.GeoSeries.from_xy(x=[1,1,1], y=[2,2,2]))).set_geometry("col2")

polars_df = pl.from_pandas(pandas_df)
lazyframe = polars_df.lazy()

reveal_type(pandas_df)

print("pd.DataFrame")
result = SchemaPandas.validate(pandas_df)
# test operations for methods accepting pandas dataframes
result.to_csv("test")
pd.concat([result, result])
reveal_type(result)
print("gpd.GeoDataFrame")
reveal_type(SchemaGeoPandas.validate(gdf))

print("pl.DataFrame")
result2 = SchemaPolars.validate(polars_df)

reveal_type(result2)
print("pl.LazyFrame")
reveal_type(SchemaPolars.validate(lazyframe))
if typing.TYPE_CHECKING:
    # this should fail mypy, I don't want it crashing my test script though
    SchemaPolars.validate(pandas_df) # should fail

when then shows

fork\tester.py:34: note: Revealed type is "pandas.core.frame.DataFrame"
fork\tester.py:41: note: Revealed type is "pandera.typing.pandas.DataFrame[tester.SchemaPandas]"
fork\tester.py:43: note: Revealed type is "pandera.typing.pandas.DataFrame[tester.SchemaGeoPandas]"
fork\tester.py:48: note: Revealed type is "pandera.typing.polars.DataFrame[tester.SchemaPolars]"
fork\tester.py:50: note: Revealed type is "pandera.typing.polars.LazyFrame[tester.SchemaPolars]"
fork\tester.py:53: error: No overload variant of "validate" of "DataFrameModel" matches argument type "DataFrame"  [call-overload]

under mypy. I thought at first this might be because pl.LazyFrame and pl.DataFrame are resolving to Any in the pre-commit environment, but I've tried adding polars there explicitly and it still shows the same.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cosmicBboy I had another look at this and I think it's ready for another look if you have a chance. I worked out how to resolve the

pandera\api\polars\model.py:130: error: Overloaded function signature 2 will never be matched: signature 1's parameter type(s) are the same or broader [misc]
I had above. This was happening for two reasons. Polars wasn't installed in the mypy environment, so pl.DataFrame and pl.LazyFrame were resolving to Any. The second reason was the global mypi config setting follow_imports=skip. I swapped this to be a per module follow_imports = skip, and then relaxed the required cases so that this import would get treated properly.

I did however notice that adding polars to the mypy environment resulted in a lot of new errors so I've temporarily supressed some errors with more rules in the mypy.ini. I was hoping I would just be able to fix them, but they probably need a bit of input (particularly around how the PolarsData namedtuple works), so I've left that over in #1911 and kept only the minimal mypy changes here.

I guess question, are you happy with mypy changes in this PR, or would be better to revert to the previous state and I just type:ignore the error I quoted up above?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with this approach @m-richards. It does add a boilerplate method for each dataframe model type, but I think it's acceptable in order to make the type linters happy.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did however notice that adding polars to the mypy environment resulted in a lot of new errors so I've temporarily supressed some errors with more rules in the mypy.ini. I was hoping I would just be able to fix them, but they probably need a bit of input (particularly around how the PolarsData namedtuple works), so I've left that over in #1911 and kept only the minimal mypy changes here.

Thanks for opening up #1911! Let me know if you need any additional help address the mypy errors: I'm okay with type: ignore'ing them and slowly chipping away at the errors.

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
@m-richards m-richards marked this pull request as ready for review February 15, 2025 03:52
@cosmicBboy cosmicBboy merged commit 7186507 into unionai-oss:main Feb 17, 2025
144 of 146 checks passed
m-richards added a commit to m-richards/pandera that referenced this pull request Feb 23, 2025
…nai-oss#1905)

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* trial type annotations

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* changes in individual api files

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* pl.dataframe working in local test

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* older python union compat

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* try polars in the mypy env on ci

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* translate toplevel mypy skip into module specific skips

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* mypy passes

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* missing line continuation

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* python 3.8

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

---------

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
m-richards added a commit to m-richards/pandera that referenced this pull request Feb 23, 2025
…nai-oss#1905)

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* trial type annotations

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* changes in individual api files

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* pl.dataframe working in local test

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* older python union compat

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* try polars in the mypy env on ci

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* translate toplevel mypy skip into module specific skips

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* mypy passes

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* missing line continuation

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* python 3.8

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

---------

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
m-richards added a commit to m-richards/pandera that referenced this pull request Feb 23, 2025
…nai-oss#1905)

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* trial type annotations

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* changes in individual api files

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* pl.dataframe working in local test

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* older python union compat

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* try polars in the mypy env on ci

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* translate toplevel mypy skip into module specific skips

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* mypy passes

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* missing line continuation

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* python 3.8

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

---------

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
m-richards added a commit to m-richards/pandera that referenced this pull request Feb 23, 2025
…nai-oss#1905)

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* trial type annotations

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* changes in individual api files

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* pl.dataframe working in local test

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* older python union compat

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* try polars in the mypy env on ci

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* translate toplevel mypy skip into module specific skips

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* mypy passes

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* missing line continuation

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

* python 3.8

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>

---------

Signed-off-by: Matt Richards <mrichards7@outlook.com.au>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mypy - SchemaModel.validate does not return a DataFrame
2 participants