Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

columns parameter not working as expected in parse_single_table/parse #14943

Closed
druzm opened this issue Jun 14, 2023 · 14 comments · Fixed by #15959
Closed

columns parameter not working as expected in parse_single_table/parse #14943

druzm opened this issue Jun 14, 2023 · 14 comments · Fixed by #15959

Comments

@druzm
Copy link

druzm commented Jun 14, 2023

Description

When using parse_single_table to read a VOTable in an XML file, the parameter columns doesn't work as expected. The parameter columns is internally passed to the function parse which describes columns as:

columns: sequence of str, optional
    List of field names to include in the output. The default is to include all fields.

I'm expecting parse to have the same issue. I haven't directly tried it, but I've verified that columns is being passed to parse appropriately.

Expected behavior

When passing a list of columns to columns only the columns in the list should be present in the output. However, all the columns are present in the output, but the ones that are not in the list are shown as a masked constant (NaN) or as an empty array depending on the type declared in the file.

I'm expecting to get:

<VOTable length=2>
  id
int64   
-----
    1
    2

But I'm getting:

<VOTable length=2>
  id  function_id coefficients
int64    int16       object   
----- ----------- ------------
    1          --           []
    2          --           []

How to Reproduce

from astropy.io.votable import parse_single_table
xml_file = 'sample.xml'
_usecols = ['id']
votable = parse_single_table(xml_file, columns=_usecols)

The contents of sample.xml are:

<?xml version="1.0" encoding="UTF-8"?>
<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.3 http://www.ivoa.net/xml/VOTable/v1.3" xmlns:spec="http://www.ivoa.net/xml/SpectrumModel/v1.01">
<RESOURCE type="results" utype="u:Type">
<TIMESYS ID="time_frame" refposition="BARYCENTER" timeorigin="24.5" timescale="TCB"/><TABLE>
<FIELD datatype="long" name="id">
<DESCRIPTION>Identifier</DESCRIPTION>
</FIELD>
<FIELD datatype="short" name="function_id">
<DESCRIPTION>Identifier defining the functions</DESCRIPTION>
</FIELD>
<FIELD arraysize="*" datatype="double" name="coefficients">
<DESCRIPTION>Function coefficients</DESCRIPTION>
</FIELD>
<DATA>
<TABLEDATA>
  <TR>
    <TD>1</TD>
    <TD>2</TD>
    <TD>31125.820 3682.205 -1562.585</TD>
  </TR>
  <TR>
    <TD>2</TD>
    <TD>3</TD>
    <TD>3112.820 36822.205 13562.585</TD>
  </TR>
</TABLEDATA>
</DATA>
</TABLE>
</RESOURCE>
</VOTABLE>

Versions

astropy version: '6.0.dev278+g196e3c2d2'

@druzm druzm added the Bug label Jun 14, 2023
@github-actions
Copy link
Contributor

Welcome to Astropy 👋 and thank you for your first issue!

A project member will respond to you as soon as possible; in the meantime, please double-check the guidelines for submitting issues and make sure you've provided the requested details.

GitHub issues in the Astropy repository are used to track bug reports and feature requests; If your issue poses a question about how to use Astropy, please instead raise your question in the Astropy Discourse user forum and close this issue.

If you feel that this issue has not been responded to in a timely manner, please send a message directly to the development mailing list. If the issue is urgent or sensitive in nature (e.g., a security vulnerability) please send an e-mail directly to the private e-mail feedback@astropy.org.

@pllim
Copy link
Member

pllim commented Jun 14, 2023

I cannot tell what was the original intent in the design but this behavior seems intentional in the code. The columns eventually get turned into this:

colnumbers_bits = [i in colnumbers for i in range(len(fields))]

that is used here:

if colnumbers_bits[i]:

that populates this:

row[i] = value
row_mask[i] = mask_value

I don't see the code attempting to drop any columns.

Are you trying to load a larger-than-memory table? If not, maybe consider astropy.table.Table unified I/O interface and then you can manipulate the columns as you could any astropy.table.Table. If unit matters, then consider astropy.table.QTable.

@druzm
Copy link
Author

druzm commented Jun 15, 2023

I see. But actually, instead of dropping the columns, I was expecting the code to simply skip them. I want to avoid reading columns that I know I'm not going to use.

I may have to use this function to read files of a few GB, so I'll see whether the reading time improves when passing a list of only a few columns to parse_single_table and then manually selecting only the columns I need. If this weren't sufficient, I'll try the unified I/O interface you're proposing.

I still find the description in the docstring misleading, though.

@pllim
Copy link
Member

pllim commented Jun 15, 2023

I'll ask PyVO people who use this format a lot. Perhaps they would have better insight.

If you have a proposal on how to improve the docstring, please open a pull request. Thanks!

@tomdonaldson
Copy link
Contributor

I wasn't aware of the columns feature in parsing before, but it sounds useful and I agree it's not behaving as expected (i.e., only the specified columns should exist in the output table).

It does look like the code is skipping parsing and saving the values for those columns, which is about all it could do for read-time efficiency. The individual values still need to be parsed in the XML sense so the reader can get to the next value, but the values are being ignored. However, if it created the output table without the undesired columns, that could free up useful space for output tables (I don't know how much space is saved by not storing any values in those columns).

@druzm
Copy link
Author

druzm commented Jun 16, 2023

I haven't checked in terms of space, but I can confirm that the function runs faster when only passing a few columns. I have an XML file containing a table of 5,000 rows and 26 fields/columns.

_usecols = ['column1', 'column2', 'column3']
table = parse_single_table(xml_file, columns=_usecols)

took 1.35s

whereas

table = parse_single_table(xml_file)

took 26.5s.

@tomdonaldson
Copy link
Contributor

I haven't checked in terms of space, but I can confirm that the function runs faster when only passing a few columns. I have an XML file containing a table of 5,000 rows and 26 fields/columns.

Ah thanks, that's good to know! With that time savings, it's definitely worth doing.

I am a bit concerned with the general slow performance those numbers show, so I'm thinking we should be raising the priority of issue #8191 (votable read slow).

@neutrinoceros
Copy link
Contributor

It seems to be rather consensual that the current behaviour is unexpected. To address @pllim's interrogations, I started studying the TableElement._parse_tabledata method and I think that the current design is just as simple as possible (it's already a function with a lot of complexity) with little regards for "edge" cases such as this one. I think that's also why it's currently pure Python (which, as pointed to in #8191, is the most likely reason why parsing VOTable is slow), but I think the only thing that's missing to improve user experience and performances is man hours: I'm game to get this going if that's desired. (ping @hamogu)

@fxpineau
Copy link

I am actively maintaining and developing a Rust VOTable parser available on crates.io. (@neutrinoceros, you are learning Rust according to your github page ;) ).
The source code is open (MIT license) and available on github.
Although it is far from been as clean, documented and tested as I would like, the code is already in production in Aladin Lite V3, in votable-cli for the convertion of small-tables in various formats, and in other CDS tools.
It supports recent IVOA standards like the parsing of MIVOT.
Maybe it could be used in a way or another to improve performances with respect to a pure python implementation.

Don't hesitate to contact me (or @ManonMarchand which is located in the office next to mine).

@neutrinoceros
Copy link
Contributor

@fxpineau That is awesome, thank you for sharing ! And yes, I'm learning rust, but haven't gone beyond the basics yet, so the project is probably more interesting to me than I am to it as a potential contributor (for now) !
Also, for transparency

  • there's no precedent for adding a dependency on a crates.io rust package to astropy and I suspect it would be a stretch to propose something like this at the moment: astropy's dependencies are required to be available on conda-forge, and I don't think we have the man-power or expertise to maintain rust (yet).
  • I'm temporary staff (my current contract with astropy ends in 2 months), so I cannot commit to medium-to-long term projects.

That said, it's still helpful to know about another implementation that's closer to the metal so we can better set our expectations of what to aim for in terms of performance !

@fxpineau
Copy link

Ok. Just in case, we have some experience of mixed Rust/Python packages with MOCPy which is an astropy affiliated package available in both pypi and conda, see mocpy feedstock recipe here . It is largely based on the Rust moc crate. Same for cds-healpix-python, which is not astropy affiliated, but is both in pypi and conda, and is based on the rust cdshealpix crate.

(We also made the tool get-license-healper on purpose to help writing mixed Python/Rust projects conda recipes.)

Also, I have a permanent position at CDS so those Rust codes are going to be supported on the long term.

@pllim
Copy link
Member

pllim commented Jan 26, 2024

Very interesting! Is IVOA involved in this effort? Any plan to integrate with PyVO if possible?

@fxpineau
Copy link

@pllim I am not sure of what you mean exactly.
The CDS is involved in the IVOA, in making some of its standards and implementations.
Regarding MIVOT in particular, @lmichel (the MIVOT main author) and I are interacting regularly, and he is pushing MIVOT into PyVO, see this PR.

@lmichel
Copy link
Contributor

lmichel commented Jan 29, 2024

For the record: Two years ago, we talked with @tomdonaldson an some other VO people about taking advantage of the MIVOT implementation to promote a new VOTable parser (e.g. RUST) in PyVO. In the end, it seemed to us preferable to remain focused on the implementation of MIVOT staying within the existing software environment.
When I see the review work required for the MIVOT PRs as they are today, I think we made the right choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants