-
-
Notifications
You must be signed in to change notification settings - Fork 324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix/763 improve type annotations for DataFrameModel.validate #1905
Changes from all commits
abaa176
3491aea
b4228c6
a6380aa
a8fa84c
2c284e1
60be3a9
7193b4b
2d9bebe
7ca2b86
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,13 @@ | ||
"""Class-based api for polars models.""" | ||
|
||
import inspect | ||
from typing import Dict, List, Tuple, Type | ||
from typing import Dict, List, Tuple, Type, cast, Optional, overload, Union | ||
from typing_extensions import Self | ||
|
||
import pandas as pd | ||
import polars as pl | ||
|
||
from pandera.api.base.schema import BaseSchema | ||
from pandera.api.checks import Check | ||
from pandera.api.dataframe.model import DataFrameModel as _DataFrameModel | ||
from pandera.api.dataframe.model import get_dtype_kwargs | ||
|
@@ -16,7 +18,8 @@ | |
from pandera.engines import polars_engine as pe | ||
from pandera.errors import SchemaInitError | ||
from pandera.typing import AnnotationInfo | ||
from pandera.typing.polars import Series | ||
from pandera.typing.polars import Series, LazyFrame, DataFrame | ||
from pandera.utils import docstring_substitution | ||
|
||
|
||
class DataFrameModel(_DataFrameModel[pl.LazyFrame, DataFrameSchema]): | ||
|
@@ -109,6 +112,53 @@ | |
|
||
return columns | ||
|
||
@classmethod | ||
@overload | ||
def validate( | ||
cls: Type[Self], | ||
check_obj: pl.DataFrame, | ||
head: Optional[int] = None, | ||
tail: Optional[int] = None, | ||
sample: Optional[int] = None, | ||
random_state: Optional[int] = None, | ||
lazy: bool = False, | ||
inplace: bool = False, | ||
) -> DataFrame[Self]: ... | ||
|
||
@classmethod | ||
@overload | ||
def validate( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With mypy in pre-commit, this gives: but that doesn't seem to be the case. I have this as a test case: import typing
from typing import reveal_type
import pandera as pda
import pandera.polars as pla
import pandas as pd
import polars as pl
from pandera.typing import Series
from pandera.typing.geopandas import GeoSeries
import geopandas as gpd
class SchemaPandas(pda.DataFrameModel):
col1: Series[int]
col2: Series[int]
class SchemaGeoPandas(pda.DataFrameModel):
col1: Series[int]
col2: GeoSeries
class SchemaPolars(pla.DataFrameModel):
col1: Series[int]
col2: Series[int]
pandas_df = pd.DataFrame({"col1": [1, 2, 3], "col2": [1, 2, 3]})
gdf = gpd.GeoDataFrame(pandas_df.assign(col2=gpd.GeoSeries.from_xy(x=[1,1,1], y=[2,2,2]))).set_geometry("col2")
polars_df = pl.from_pandas(pandas_df)
lazyframe = polars_df.lazy()
reveal_type(pandas_df)
print("pd.DataFrame")
result = SchemaPandas.validate(pandas_df)
# test operations for methods accepting pandas dataframes
result.to_csv("test")
pd.concat([result, result])
reveal_type(result)
print("gpd.GeoDataFrame")
reveal_type(SchemaGeoPandas.validate(gdf))
print("pl.DataFrame")
result2 = SchemaPolars.validate(polars_df)
reveal_type(result2)
print("pl.LazyFrame")
reveal_type(SchemaPolars.validate(lazyframe))
if typing.TYPE_CHECKING:
# this should fail mypy, I don't want it crashing my test script though
SchemaPolars.validate(pandas_df) # should fail when then shows
under mypy. I thought at first this might be because pl.LazyFrame and pl.DataFrame are resolving to Any in the pre-commit environment, but I've tried adding polars there explicitly and it still shows the same. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hey @cosmicBboy I had another look at this and I think it's ready for another look if you have a chance. I worked out how to resolve the
I did however notice that adding polars to the mypy environment resulted in a lot of new errors so I've temporarily supressed some errors with more rules in the mypy.ini. I was hoping I would just be able to fix them, but they probably need a bit of input (particularly around how the PolarsData namedtuple works), so I've left that over in #1911 and kept only the minimal mypy changes here. I guess question, are you happy with mypy changes in this PR, or would be better to revert to the previous state and I just type:ignore the error I quoted up above? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm okay with this approach @m-richards. It does add a boilerplate method for each dataframe model type, but I think it's acceptable in order to make the type linters happy. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Thanks for opening up #1911! Let me know if you need any additional help address the mypy errors: I'm okay with |
||
cls: Type[Self], | ||
check_obj: pl.LazyFrame, | ||
head: Optional[int] = None, | ||
tail: Optional[int] = None, | ||
sample: Optional[int] = None, | ||
random_state: Optional[int] = None, | ||
lazy: bool = False, | ||
inplace: bool = False, | ||
) -> LazyFrame[Self]: ... | ||
|
||
@classmethod | ||
@docstring_substitution(validate_doc=BaseSchema.validate.__doc__) | ||
def validate( | ||
cls: Type[Self], | ||
check_obj: Union[pl.LazyFrame, pl.DataFrame], | ||
head: Optional[int] = None, | ||
tail: Optional[int] = None, | ||
sample: Optional[int] = None, | ||
random_state: Optional[int] = None, | ||
lazy: bool = False, | ||
inplace: bool = False, | ||
) -> Union[LazyFrame[Self], DataFrame[Self]]: | ||
"""%(validate_doc)s""" | ||
result = cls.to_schema().validate( | ||
check_obj, head, tail, sample, random_state, lazy, inplace | ||
) | ||
if isinstance(check_obj, pl.LazyFrame): | ||
return cast(LazyFrame[Self], result) | ||
else: | ||
return cast(DataFrame[Self], result) | ||
|
||
@classmethod | ||
def to_json_schema(cls): | ||
"""Serialize schema metadata into json-schema format. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could potentially also all be in an if typing.TYPE_CHECKING block if that's cleaner than providing an implementation the same as the parent. I also don't know if it's preferable to use super() instead?
Edit: I also briefly looked at seeing if an overload on recieving a GeoDataFrame is possible. I had a solution working in my standalone test case, but it was pretty ugly with conditional overloads, and pre-commit mypy was not very happy with that.