Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reset dec='\0' to detect numbers while dec is undecided #6445

Merged
merged 1 commit into from
Sep 16, 2024

Conversation

MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Aug 30, 2024

Closes #6440.

dec='\0' is reset each time when moving on to the next column:

dec = '\0'; // reset for next parse

The issue here really is isolated to timestamps with milliseconds: within a column, once dec=',' is tried & fails to parse the timestamp (as a real number), dec=',' is left dangling --> the timestamp fails to parse. This only matters for "higher" types than real numbers where double parsing is attempted, i.e. only timestamps.

Because IS_DEC_TYPE() excludes timestamp on master, and hence whether a timestamp parses with , doesn't count to the "race" vs. dec='.', the timestamp also fails to parse with , as the microsecond separator:

fread(text="Datetime\n2023-10-12T06:53:53,123Z", sep="\t") # sep= required otherwise `,` is seen as sep in this simple example
#                    Datetime
#                      <char>
# 1: 2023-10-12T06:53:53,123Z

But this can be nudged to work in files with fields recognized as real numbers on master:

fread(text="x\tDatetime\n1,0\t2023-10-12T06:53:53,123Z", sep="\t")
#        x            Datetime
#    <num>              <POSc>
# 1:     1 2023-10-12 06:53:53

One incidental change of the PR as of now is that ISO timestamps will parse correctly with , as the millisecond separator; @kav2k confirmed this is OK and in fact recognized by the standard.

Copy link

github-actions bot commented Aug 30, 2024

Comparison Plot

Generated via commit a042c90

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 3 minutes and 22 seconds

Time taken to run atime::atime_pkg on the tests: 6 minutes and 48 seconds

@kav2k
Copy link

kav2k commented Aug 30, 2024

Okay, I just looked at the discussions of the standard, and it actually allows the comma separator (which is surprising).

My concern is data that potentially mixes separators, because even preferring , for decimal, software will likely output . for ISO timestamps (data.table itself is an example):

t <- fread(text="x\tDatetime\n1,1\t2023-10-12T06:53:53,123Z", sep="\t")
fwrite(t, file="", dec=",", sep=";")
x;Datetime
1,1;2023-10-12T06:53:53.123Z

So this fix (which, if I understand correctly, is trying to reparse with the other separator if one fails) should be OK?

Here's a sample of my real data that tripped this issue, which was output by data.table's fwrite():

CoilID;AntennaID;Time;TagID;Pen;Side;Position;Location;Coil_Y;Coil_X
1;16403160;2023-10-12T10:30:55.270Z;DA2C6411;1;AKB;Litter central;middle;1;6
3;16403160;2023-10-12T10:30:55.270Z;DA459D86;1;AKB;Litter central;middle;1;4
15;16402963;2023-10-12T10:31:00.900Z;DA459D86;1;AKB;Litter central;right;2;3
14;16402963;2023-10-12T10:31:02.240Z;DA2C6411;1;AKB;Litter central;right;2;1
11;16403160;2023-10-12T10:31:02.650Z;DA2C6411;1;AKB;Litter central;middle;2;6

P.S. Datetimes are weird. Were you aware that ISO 8601 allows fractional values other than seconds? E.g. 10:20,5 is valid and means 10:20:30... Probably not a good idea to support this though. Something more sane like RFC 3339 draft (which fwrite seems to follow) is already supported and that's likely good enough.

@MichaelChirico
Copy link
Member Author

My concern is data that potentially mixes separators, because even preferring , for decimal, software will likely output . for ISO timestamps (data.table itself is an example):

Oh, nice example. I think that's a follow-up issue for fwrite() to support.

fread(), in the auto-detect step, "runs a race" between dec='.' and dec=','; whichever successfully parses more fields is chosen (tie goes to '.'):

dec = linesForDecDot < 0 ? ',' : '.'; // in a tie, dec=. is more likely

I think this is "good enough". We could possibly support dec = <vector> to allow specifying a dec for each column, but I'd rather not go there, and fix fwrite() to avoid producing such mixtures in the first place.

Probably not a good idea to support this though.

Yea, no thanks 😂

TBH it would not be that hard to support (switch to double parsing here, and don't fail for missing some components), but I doubt it's common enough to be worth maintaining:

str_to_i32_core(&ch, &minute);

So this fix (which, if I understand correctly, is trying to reparse with the other separator if one fails)

Some nuance, this fix is only about the "auto-detect" phase of fread() which runs on a sample of lines to decide the column types. Previously (since dec='auto' support was added) there was no re-try on timestamps, and only dec=',' was tried (by mistake).

@MichaelChirico
Copy link
Member Author

My concern is data that potentially mixes separators, because even preferring , for decimal, software will likely output . for ISO timestamps (data.table itself is an example):

From #6454, this point is moot from data.table -- fwrite(dec=',') will write timestamps with , :)

@MichaelChirico
Copy link
Member Author

Checking in @ben-schwen, got time for a review? Would be nice to get this last part of the 1.16.2 patch submitted.

src/fread.c Outdated Show resolved Hide resolved
@ben-schwen
Copy link
Member

ben-schwen commented Sep 16, 2024

fread(text="t\n2023-10-12T06:53:53.123Z", colClasses="nanotime") # works
fread(text="t\n2023-10-12T06:53:53,123Z", colClasses="nanotime", sep=";") # does not

@MichaelChirico
Copy link
Member Author

Got another "refuse to merge unrelated histories" thing here, strange. Reinstated the changes but lost the history again :\

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Sep 16, 2024

fread(text="t\n2023-10-12T06:53:53.123Z", colClasses="nanotime") # works
fread(text="t\n2023-10-12T06:53:53,123Z", colClasses="nanotime", sep=";") # does not

That's an R side issue:

as.nanotime("2023-10-12T06:53:53,123Z")
# Error: Parse error on 2023-10-12T06:53:53,123Z

But I'm reminded of #6107, there may be a way to get it returned to R as POSIXct and then it will work:

as.nanotime(fread(text="t\n2023-10-12T06:53:53.123Z", colClasses="nanotime")$t)
# [1] 2023-10-12T06:53:53.123+00:00

Out of scope here though, filing for follow-up: #6500

@ben-schwen
Copy link
Member

LGTM, ty!

@ben-schwen ben-schwen merged commit 8f505d5 into master Sep 16, 2024
8 checks passed
@ben-schwen ben-schwen deleted the fread-time-ms branch September 16, 2024 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parsing ISO8601 with microseconds in fread() broken in 1.16.0
3 participants