Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL magic drops column if all row value is NaN #611

Open
benhyy opened this issue Dec 26, 2019 · 6 comments
Open

SQL magic drops column if all row value is NaN #611

benhyy opened this issue Dec 26, 2019 · 6 comments
Labels
kind:bug An unexpected error or issue with sparkmagic

Comments

@benhyy
Copy link

benhyy commented Dec 26, 2019

If a column has null value in every row/record, %%sql will not drop that entire column.

To reproduce, create a table where a column has only null values, e.g.
%%sql
insert into table
values (1, null),
(2, null),
(3, null)

I have attached screenshots using results from %%sql and spark.sql()

Screen Shot 2019-12-26 at 2.50.52 pm.pdf

Versions:

  • SparkMagic 0.12.0
  • Livy 0.6.0
  • Kernel: Spark

Additional context
I believe the problem comes from the fact that since JSON doesn't pick up null values, when the data got converted into dict and then converted into dataframe, it couldn't have known that there was a missing column:

https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/utils/utils.py#L52
https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/livyclientlib/sqlquery.py#L58

We need a way to pick up the schema before populating all the data.

@itamarst itamarst self-assigned this Jan 14, 2020
@itamarst
Copy link
Contributor

Yeah, issue is Spark's toJSON() omits null values.

@itamarst
Copy link
Contributor

itamarst commented Jan 14, 2020

One solution is to have first line of result be the list of column names. This is annoying to do because sqlquery.py has three different implementations, for Python/Scala/R, but definitely possible.

@itamarst
Copy link
Contributor

itamarst commented Jan 14, 2020

Example of the problem:

Input:

%%spark
d = spark.sql("select * from example")
d.show()
print(d.toJSON().collect())
d.columns

Output:

+---+----+
|  a|   b|
+---+----+
|  1|null|
+---+----+

[u'{"a":1}']
['a', 'b']

@itamarst
Copy link
Contributor

The code that needs fixing is _pyspark_command, _scale_command, and _r_command in sparkmagic/sparkmagic/livyclientlib/sqlquery.py. @benhyy any interest in submitting a PR?

@itamarst itamarst removed their assignment Jan 14, 2020
@itamarst itamarst added the kind:bug An unexpected error or issue with sparkmagic label Jan 14, 2020
@itamarst
Copy link
Contributor

If we ever get #598 in we can drop two of the languages.

As a third alternative, I think I have some memory of SQL support in Livy, too? But that would presumably not less us fix this bug (although perhaps Livy has fixed it).

@ramp818
Copy link

ramp818 commented Jul 15, 2020

Hello is there any more news on this? I'm currently facing the same problem. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug An unexpected error or issue with sparkmagic
Projects
None yet
Development

No branches or pull requests

3 participants