columns parameter not working as expected in parse_single_table/parse #14943

druzm · 2023-06-14T17:47:29Z

Description

When using parse_single_table to read a VOTable in an XML file, the parameter columns doesn't work as expected. The parameter columns is internally passed to the function parse which describes columns as:

columns: sequence of str, optional
    List of field names to include in the output. The default is to include all fields.

I'm expecting parse to have the same issue. I haven't directly tried it, but I've verified that columns is being passed to parse appropriately.

Expected behavior

When passing a list of columns to columns only the columns in the list should be present in the output. However, all the columns are present in the output, but the ones that are not in the list are shown as a masked constant (NaN) or as an empty array depending on the type declared in the file.

I'm expecting to get:

<VOTable length=2>
  id
int64   
-----
    1
    2

But I'm getting:

<VOTable length=2>
  id  function_id coefficients
int64    int16       object   
----- ----------- ------------
    1          --           []
    2          --           []

How to Reproduce

from astropy.io.votable import parse_single_table
xml_file = 'sample.xml'
_usecols = ['id']
votable = parse_single_table(xml_file, columns=_usecols)

The contents of sample.xml are:

<?xml version="1.0" encoding="UTF-8"?>
<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.3 http://www.ivoa.net/xml/VOTable/v1.3" xmlns:spec="http://www.ivoa.net/xml/SpectrumModel/v1.01">
<RESOURCE type="results" utype="u:Type">
<TIMESYS ID="time_frame" refposition="BARYCENTER" timeorigin="24.5" timescale="TCB"/><TABLE>
<FIELD datatype="long" name="id">
<DESCRIPTION>Identifier</DESCRIPTION>
</FIELD>
<FIELD datatype="short" name="function_id">
<DESCRIPTION>Identifier defining the functions</DESCRIPTION>
</FIELD>
<FIELD arraysize="*" datatype="double" name="coefficients">
<DESCRIPTION>Function coefficients</DESCRIPTION>
</FIELD>
<DATA>
<TABLEDATA>
  <TR>
    <TD>1</TD>
    <TD>2</TD>
    <TD>31125.820 3682.205 -1562.585</TD>
  </TR>
  <TR>
    <TD>2</TD>
    <TD>3</TD>
    <TD>3112.820 36822.205 13562.585</TD>
  </TR>
</TABLEDATA>
</DATA>
</TABLE>
</RESOURCE>
</VOTABLE>

Versions

astropy version: '6.0.dev278+g196e3c2d2'

The text was updated successfully, but these errors were encountered:

github-actions · 2023-06-14T17:48:05Z

Welcome to Astropy 👋 and thank you for your first issue!

A project member will respond to you as soon as possible; in the meantime, please double-check the guidelines for submitting issues and make sure you've provided the requested details.

GitHub issues in the Astropy repository are used to track bug reports and feature requests; If your issue poses a question about how to use Astropy, please instead raise your question in the Astropy Discourse user forum and close this issue.

If you feel that this issue has not been responded to in a timely manner, please send a message directly to the development mailing list. If the issue is urgent or sensitive in nature (e.g., a security vulnerability) please send an e-mail directly to the private e-mail feedback@astropy.org.

pllim · 2023-06-14T18:07:52Z

I cannot tell what was the original intent in the design but this behavior seems intentional in the code. The columns eventually get turned into this:

astropy/astropy/io/votable/tree.py

Line 2811 in 196e3c2

colnumbers_bits = [i in colnumbers for i in range(len(fields))]

that is used here:

astropy/astropy/io/votable/tree.py

Line 2832 in 196e3c2

if colnumbers_bits[i]:

that populates this:

astropy/astropy/io/votable/tree.py

Lines 2867 to 2868 in 196e3c2

    
           row[i] = value 
        
           row_mask[i] = mask_value

I don't see the code attempting to drop any columns.

Are you trying to load a larger-than-memory table? If not, maybe consider astropy.table.Table unified I/O interface and then you can manipulate the columns as you could any astropy.table.Table. If unit matters, then consider astropy.table.QTable.

druzm · 2023-06-15T08:24:37Z

I see. But actually, instead of dropping the columns, I was expecting the code to simply skip them. I want to avoid reading columns that I know I'm not going to use.

I may have to use this function to read files of a few GB, so I'll see whether the reading time improves when passing a list of only a few columns to parse_single_table and then manually selecting only the columns I need. If this weren't sufficient, I'll try the unified I/O interface you're proposing.

I still find the description in the docstring misleading, though.

pllim · 2023-06-15T19:24:40Z

I'll ask PyVO people who use this format a lot. Perhaps they would have better insight.

If you have a proposal on how to improve the docstring, please open a pull request. Thanks!

tomdonaldson · 2023-06-15T21:00:20Z

I wasn't aware of the columns feature in parsing before, but it sounds useful and I agree it's not behaving as expected (i.e., only the specified columns should exist in the output table).

It does look like the code is skipping parsing and saving the values for those columns, which is about all it could do for read-time efficiency. The individual values still need to be parsed in the XML sense so the reader can get to the next value, but the values are being ignored. However, if it created the output table without the undesired columns, that could free up useful space for output tables (I don't know how much space is saved by not storing any values in those columns).

druzm · 2023-06-16T09:54:24Z

I haven't checked in terms of space, but I can confirm that the function runs faster when only passing a few columns. I have an XML file containing a table of 5,000 rows and 26 fields/columns.

_usecols = ['column1', 'column2', 'column3']
table = parse_single_table(xml_file, columns=_usecols)

took 1.35s

whereas

table = parse_single_table(xml_file)

took 26.5s.

tomdonaldson · 2023-06-16T10:25:22Z

I haven't checked in terms of space, but I can confirm that the function runs faster when only passing a few columns. I have an XML file containing a table of 5,000 rows and 26 fields/columns.

Ah thanks, that's good to know! With that time savings, it's definitely worth doing.

I am a bit concerned with the general slow performance those numbers show, so I'm thinking we should be raising the priority of issue #8191 (votable read slow).

neutrinoceros · 2024-01-25T13:55:36Z

It seems to be rather consensual that the current behaviour is unexpected. To address @pllim's interrogations, I started studying the TableElement._parse_tabledata method and I think that the current design is just as simple as possible (it's already a function with a lot of complexity) with little regards for "edge" cases such as this one. I think that's also why it's currently pure Python (which, as pointed to in #8191, is the most likely reason why parsing VOTable is slow), but I think the only thing that's missing to improve user experience and performances is man hours: I'm game to get this going if that's desired. (ping @hamogu)

fxpineau · 2024-01-26T11:09:17Z

I am actively maintaining and developing a Rust VOTable parser available on crates.io. (@neutrinoceros, you are learning Rust according to your github page ;) ).
The source code is open (MIT license) and available on github.
Although it is far from been as clean, documented and tested as I would like, the code is already in production in Aladin Lite V3, in votable-cli for the convertion of small-tables in various formats, and in other CDS tools.
It supports recent IVOA standards like the parsing of MIVOT.
Maybe it could be used in a way or another to improve performances with respect to a pure python implementation.

Don't hesitate to contact me (or @ManonMarchand which is located in the office next to mine).

neutrinoceros · 2024-01-26T12:00:33Z

@fxpineau That is awesome, thank you for sharing ! And yes, I'm learning rust, but haven't gone beyond the basics yet, so the project is probably more interesting to me than I am to it as a potential contributor (for now) !
Also, for transparency

there's no precedent for adding a dependency on a crates.io rust package to astropy and I suspect it would be a stretch to propose something like this at the moment: astropy's dependencies are required to be available on conda-forge, and I don't think we have the man-power or expertise to maintain rust (yet).
I'm temporary staff (my current contract with astropy ends in 2 months), so I cannot commit to medium-to-long term projects.

That said, it's still helpful to know about another implementation that's closer to the metal so we can better set our expectations of what to aim for in terms of performance !

fxpineau · 2024-01-26T13:18:39Z

Ok. Just in case, we have some experience of mixed Rust/Python packages with MOCPy which is an astropy affiliated package available in both pypi and conda, see mocpy feedstock recipe here . It is largely based on the Rust moc crate. Same for cds-healpix-python, which is not astropy affiliated, but is both in pypi and conda, and is based on the rust cdshealpix crate.

(We also made the tool get-license-healper on purpose to help writing mixed Python/Rust projects conda recipes.)

Also, I have a permanent position at CDS so those Rust codes are going to be supported on the long term.

pllim · 2024-01-26T15:03:15Z

Very interesting! Is IVOA involved in this effort? Any plan to integrate with PyVO if possible?

fxpineau · 2024-01-29T09:58:04Z

@pllim I am not sure of what you mean exactly.
The CDS is involved in the IVOA, in making some of its standards and implementations.
Regarding MIVOT in particular, @lmichel (the MIVOT main author) and I are interacting regularly, and he is pushing MIVOT into PyVO, see this PR.

lmichel · 2024-01-29T14:20:45Z

For the record: Two years ago, we talked with @tomdonaldson an some other VO people about taking advantage of the MIVOT implementation to promote a new VOTable parser (e.g. RUST) in PyVO. In the end, it seemed to us preferable to remain focused on the implementation of MIVOT staying within the existing software environment.
When I see the review work required for the MIVOT PRs as they are today, I think we made the right choice.

druzm added the Bug label Jun 14, 2023

pllim added the io.votable label Jun 14, 2023

This was referenced Jan 26, 2024

BUG: fix an incorrect error message in io.votable.TableElement.parse #15956

Merged

API/BUG: only output user-requested columns in votable.parse #15959

Merged

MridulS mentioned this issue Feb 18, 2024

Cycle 3 Funding: Günther and Price-Whelan | Research Software Engineer: User and Developer Support astropy/astropy-project#360

Open

pllim closed this as completed in #15959 Apr 18, 2024

pllim mentioned this issue Apr 18, 2024

API: Make fields property immutable #16314

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

columns parameter not working as expected in parse_single_table/parse #14943

columns parameter not working as expected in parse_single_table/parse #14943

druzm commented Jun 14, 2023 •

edited

Loading

github-actions bot commented Jun 14, 2023

pllim commented Jun 14, 2023

druzm commented Jun 15, 2023

pllim commented Jun 15, 2023

tomdonaldson commented Jun 15, 2023

druzm commented Jun 16, 2023

tomdonaldson commented Jun 16, 2023

neutrinoceros commented Jan 25, 2024

fxpineau commented Jan 26, 2024

neutrinoceros commented Jan 26, 2024

fxpineau commented Jan 26, 2024

pllim commented Jan 26, 2024

fxpineau commented Jan 29, 2024

lmichel commented Jan 29, 2024 •

edited

Loading

columns parameter not working as expected in parse_single_table/parse #14943

columns parameter not working as expected in parse_single_table/parse #14943

Comments

druzm commented Jun 14, 2023 • edited Loading

Description

Expected behavior

How to Reproduce

Versions

github-actions bot commented Jun 14, 2023

pllim commented Jun 14, 2023

druzm commented Jun 15, 2023

pllim commented Jun 15, 2023

tomdonaldson commented Jun 15, 2023

druzm commented Jun 16, 2023

tomdonaldson commented Jun 16, 2023

neutrinoceros commented Jan 25, 2024

fxpineau commented Jan 26, 2024

neutrinoceros commented Jan 26, 2024

fxpineau commented Jan 26, 2024

pllim commented Jan 26, 2024

fxpineau commented Jan 29, 2024

lmichel commented Jan 29, 2024 • edited Loading

druzm commented Jun 14, 2023 •

edited

Loading

lmichel commented Jan 29, 2024 •

edited

Loading