I think this is something people know, but it is not explicitly stated: Can a record have multiple extension-fields of the same type?
Section 5.1 of the 1.1 spec says "WARC named fields of the same type shall not be repeated in the same WARC record (for example, a WARC record shall not have several WARC-Date or several WARC-Target-URI), except as noted (e.g. WARC-Concurrent-To)." However it makes no explicit mention of whether multiple extension-fields of the same type are allowed. It does say "WARC processing software shall ignore fields with unrecognized names" which could mean it is allowed.
I think the answer is yes. But this is not stated anywhere. An example of multiple extension-fields of the same type on the same record that I've found so far is #42, the proposed WARC-Protocol field. That shows examples using 2 fields (for TLS and HTTP), but presumably at some point this will become a named field and have language in the spec like WARC-Concurrent-To does, leaving this question unanswered.
A reason to explicitly discuss multiple extension-fields of the same type is to avoid implementation issues. I suspect most WARC parsing software implements field parsing for extension-fields with a dictionary/hash, keyed on the field name, where duplicate keys are not allowed. Implementations will behave differently (first value wins, last field value wins, etc.). I personally hit this when parsing records with multiple WARC-Protocol fields.
Perhaps it should be explicitly stated, or does the "ignore fields with unrecognized names" clause cover this?
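To make the implementation concern concrete, here is a minimal Python sketch of the three behaviours described above. It is not taken from any existing WARC library: the function names are made up for illustration, the `WARC-Protocol` values are placeholders rather than values copied from #42, and lower-casing the names assumes field names are compared case-insensitively.

```python
from collections import defaultdict


def parse_header_lines(lines):
    """Split 'Name: value' lines into (name, value) pairs, preserving order."""
    pairs = []
    for line in lines:
        name, _, value = line.partition(":")
        pairs.append((name.strip(), value.strip()))
    return pairs


def as_dict_last_wins(pairs):
    """Plain dict: a later field silently overwrites an earlier one."""
    return {name.lower(): value for name, value in pairs}


def as_dict_first_wins(pairs):
    """Plain dict that keeps only the first occurrence of each field."""
    result = {}
    for name, value in pairs:
        result.setdefault(name.lower(), value)
    return result


def as_multimap(pairs):
    """Keep every value, in order, so repeated fields are not lost."""
    result = defaultdict(list)
    for name, value in pairs:
        result[name.lower()].append(value)
    return dict(result)


# Hypothetical header with a repeated extension field (placeholder values).
header = [
    "WARC-Type: response",
    "WARC-Protocol: tls/1.3",
    "WARC-Protocol: http/1.1",
]
pairs = parse_header_lines(header)

print(as_dict_last_wins(pairs)["warc-protocol"])
print(as_dict_first_wins(pairs)["warc-protocol"])
print(as_multimap(pairs)["warc-protocol"])
```

The two dictionary variants each drop one of the values without any error, which is the divergence between implementations I was worried about. Only the list-valued mapping keeps both.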
Agreed, repeated extension-fields should definitely be allowed. You might argue that 'as noted' also applies to extensions. Of course, a parser that doesn't support a particular extension wouldn't know whether a field defined there allows repetitions, and so I do think the 'ignore unrecognised fields' clause sort of covers it. But it'd still be good to fix this in the core specification in my opinion.
The easiest resolution would naturally be to change the quoted paragraph of section 5.1 to talk about defined-fields rather than 'named fields'. But perhaps it's worth considering a renaming of the entire section 5 instead.
Yes. My interpretation and implementation is that "shall not be repeated ... except as noted" is setting up a default so that each named field doesn't need a statement disallowing repetition and the specification of individual fields can override this regardless of whether they're defined by the core format or an extension.
Given there's only one repeatable core field, I think the fact that it's worded broadly as "except as noted (e.g. WARC-Concurrent-To)" rather than "except WARC-Concurrent-To" supports this interpretation.
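As a rough sketch of that interpretation (not my actual implementation, and with hypothetical names throughout): repetition is disallowed by default for any field the parser recognises, each field's definition may opt in to repetition, and unrecognised fields are never checked because they are ignored anyway. Treating `WARC-Protocol` as repeatable is an assumption based on the proposal in #42.

```python
from collections import Counter

# Only core field that the spec notes as repeatable.
CORE_REPEATABLE = frozenset({"warc-concurrent-to"})


def repeated_violations(pairs, known_fields, repeatable=CORE_REPEATABLE):
    """Return recognised field names that are repeated without being allowed to.

    pairs        -- iterable of (name, value) tuples from one record header
    known_fields -- lower-cased names this parser recognises (core or extension)
    repeatable   -- lower-cased names whose definition permits repetition
    """
    counts = Counter(name.lower() for name, _ in pairs)
    return sorted(
        name
        for name, n in counts.items()
        if n > 1 and name in known_fields and name not in repeatable
    )


pairs = [
    ("WARC-Date", "2024-01-09T00:00:00Z"),
    ("WARC-Date", "2024-01-09T00:00:01Z"),
    ("WARC-Protocol", "tls/1.3"),    # placeholder value
    ("WARC-Protocol", "http/1.1"),   # placeholder value
]
core = {"warc-type", "warc-date", "warc-concurrent-to"}

# A parser that does not know WARC-Protocol ignores its repetition.
print(repeated_violations(pairs, core))

# A parser that knows the extension and its "repeatable" note accepts it too.
print(repeated_violations(pairs, core | {"warc-protocol"},
                          CORE_REPEATABLE | {"warc-protocol"}))
```

In both calls only the duplicated `WARC-Date` is reported. The extension field is flagged only if a parser recognises it but its definition does not note it as repeatable.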
ato added a commit that referenced this issue on Jan 9, 2024