-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DEPR: Move rarely used I/O connectors to third party modules #28409
Comments
I don't think Stata or SAS are simple wrappers around 3rd party libraries. We're the ones maintaining them right now (primarily with individuals who are experts in that section of the code base, but aren't necessarily pandas maintainers). I'm concerned that moving them out of pandas will make them more likely to be abandoned. As for CI, I assume we would be testing these as downstream dependencies? So we would still be downloading them and their dependencies (just not running the full set of tests). |
Thanks for the feedback, that's good to know. Not sure if any of us is using the connectors to Stata or SAS, but personally I don't see as a negative thing that the connectors are abandoned if we don't maintain it ourselves. I guess the users of those are mainly corporations; if they are not interested in maintaining or funding the maintenance of those, not sure if it's worth that we do it for free. I was thinking that third party connectors would implement and run their own tests. Do we test any downstream dependency currently? |
I am not sure of the goal here. the existing data io connectors don’t require much maintenance by maintainers not is CI an issue at all here (the d/l of the reps is immaterial) I also don’t think that the api is really that big of a deal. So IIIC you want to enable adding in new data io connectors not being in pandas proper (or at least an api to do this)? +1 on that and if that were available then you could easily move lesser used libraries out BUT this seems like the cart before the horse; we can’t move these libraries out w/o an option for external installation; this is not possible for sas/ STATA as they all code in pandas. (gbq is completely external) so if someone wants to move these out and maintain them great |
Thanks for the feedback. And yes, enable adding new data connectors (and eventually move the connectors that make sense out) is exactly what I wanted to propose. Sorry the focus was too much on moving out the connectors, wanted to complement what was already discussed in #26804 (which wasn't well focused either). If people is happy with the idea of being able to extend pandas with new connectors, then I'll open a PR to see how that would look like, and we can better discuss over the code. |
I don't think that's accurate, at least not for Stata.
Yeah (see I'm still a bit unclear on the goal here that would be achieved by moving this out of pandas. I don't think that wanting to add new IO formats (via the connector stuff) is necessarily related. Even if we had a connector API we could choose to maintain these here vs. another repo. So, in my mind, it comes down to the cost of maintenance vs. the benefit to users. I think Stata, SAS, SPSS, at least pass that test. Their maintenance cost is relatively small, especially when people like @kshedden, @bashtage, @cbrnr, etc. are around to fix issues when they do come up. And I think the benefit of having these is decently large. Something like msgpack didn't meet that test. That code was hard to work with, and since it's a wire format we were essentially defining, it couldn't handle the new extension types. The existing file formats like Stata don't really share that problem. |
I'd definitely keep Stata, SAS, and SPSS in. Personally, I know a lot of people who are switching from one of these packages to Python/pandas, and having these readers greatly simplifies life for them. |
IMO these types of connectors, especially for import (but why not export if it is done), are really essential to the success of pandas. They make it possible to move to Pandas in a gradual way and to interface with a large part of the statistics community. Stata has 0 external deps FWIW. |
I also think the survey results are skewed towards a particular type, and so are challenging to interpret as broad guidance. I was not aware of the survey until I saw some results on reddit. |
Thanks all for the feedback. Just to be clear, there is no discussion on removing those connectors from pandas, and I have no doubt they add value to many users (that 3%, if true, represents at least 30,000 users). The question is whether it'd be better to have these connectors in separate repos. And consequently, users would have to add another dependency (e.g That would also allow that the people maintaining those could have write access to those repos. If we allow third-party connectors (sorry I mixed both things here), I think it could make things simpler and more efficient to have some connectors in separate projects. The line of what should be in pandas and what not will always be ambiguous, in the same way it's somehow arbitrary what Python has in the standard library, and what lives in third-party libraries. |
I'm not sure how to square these two
The questions of "where is this developed" and "how do I install this" can be separated. We're able to
Which of those (if any) are you proposing? |
Good point, I'm proposing 3. But probably worth clarifying my priorities:
To me, 1 and 2 are quite important, they will empower users and simplify our code. 3 would be really nice, but not as important. I think 4 makes a lot of sense, but if people who actually maintain those less used connectors prefer to have them in pandas, I think it's not optimal, but it's surely all right. |
the main issue is maintainers (who is going to maintain these) and infrastructure it’s not trivial to build wheels in all platforms
this is already a 3rd party library;we just test
|
Thanks for the info, I totally missed that. I thought it was a wrapper around a generic gbq library, I see now that it's a third-party connector already, and we basically have the documentation. Do you think it's worth testing it in every pandas PR? I guess there is a reason, but seems like adding a nightly build to pandas_gbq that tests with the latest pandas (release candidates, or may be master) should be enough. I think having it well decoupled (moving the tests there) would make things simpler, not only for the CI. |
wouldn’t object to having a nightly build test |
I think the preference is to keep all I/O connectors in the pandas main repo. Closing. |
Related to #26804 (and indirectly related to #15862, #15008 if we standardize I/O connectors as part of this)
The recent survey shows that several of our I/O connectors (Stata, SPSS, SAS and GBQ) are rarely used (from the chart I'd say around 3% of respondents said they used them):
While afaik all them are just wrappers around other libraries, and don't contain much code to maintain, I see few main immediate advantages:
I/O ecosystem
While the number of widely used formats is quite limited, the number of possible I/O connectors is huge. Several new connectors have been requested in the past, like markdown (#11052), avro (#11752), docx (#22518), and there are surely plenty of specialized and internal formats that didn't make sense to propose adding to pandas, but that people would appreciate having and using the same standard API (
pandas.read_*
/DataFrame.to_*
) we have. And there may also be alternative versions of the connectors we provide (see for example this lightning talk by @dutc dicussing about an unsafepandas.read_csv
: https://youtu.be/QkQ5HHEu1b4?t=1554).CI
For the CI, those modules include dependencies that need to be downloaded for every build. We also have files in
pandas/tests/io/data
that eventually we may need to update. For GBQ in particular, we have to store the password for the account we use to test, and those tests fail from time to time for connectivity problems. For the clipboard (not proposing to move to a third-party yet, but is a good candidate too), we require to emulate an X system in our tests (need to check if we finally fixed the tests, but they weren't running for a while because of problems on the CI set up).And of course our test suite is taking some extra time to test all these connectors.
API changes
Being part of our public API, I think we need to commit to the API of those connectors, in the same way as we do with the rest of our API. But, for whatever reason it may make sense to discontinue some of these connectors (e.g. Google releasing a replacement for GBQ...), or make important changes to their parameters, and probably it makes sense to have different release cycles for them.
This also applies to connectors not yet in pandas, but that we may include if we keep the policy of adding to our code base the connectors for all the formats / services we consider relevant.
I know timing is not great, but because of this API commitment, I'd personally remove Stata, SPSS, SAS and GBQ before pandas 1.0.
The text was updated successfully, but these errors were encountered: