-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.format
53 lines (53 loc) · 5.26 KB
/
README.format
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
ELISA_LRLP_CORPUS: root node of the file. Includes language code as attribute
DOCUMENT: signifies a document extent. Includes document id as attribute, expressed as concatenation of genre, provenance, language, index id, and date
GENRE: two-letter indication of document genre
PROVENANCE: three-letter indication of document provenance
SOURCE_LANGUAGE: three-letter indication of document source language
INDEX_ID: six-digit unique identifier of document
DATE: eight-digit date of document, expressed as YYYYMMDD. Field is zeroed if year/month/day is unknown
DIRECTION: One of fromsource, fromtarget, found. Mono = 'fromsource'
SEGMENT: signifies a segment (sentence) and possibly its translation or other attributes. If downloaded twitter, includes 'downloaded="true"' attribute
SOURCE: signifies the source component of the segment. Includes document-local segment id, start and end (inclusive, 0-based) char offsets of 'ORIG_RAW_SOURCE'
FULL_ID_SOURCE: repeat of document id
ORIG_SEG_ID: segment id in initial ltf file (for non-found data) (needed for eval)
ORIG_FILENAME: base name of the file where the data was originally found (needed for eval)
ORIG_RAW_SOURCE: original form of segment. This canonical form is used as a reference by any annotations
ORIG_SOURCE: original form of segment, passed through utf8 cleaning and normalization scripts
MD5_HASH_SOURCE: MD5 digest of the tokenized source. The data may contain duplicate segments, which we did not remove, in order to keep the context intact. This field can be used to efficiently detect duplicates.
LRLP_TOKENIZED_RAW_SOURCE: Tokenization as provided in the original LRLP.
LRLP_TOKENIZED_LC_RAW_SOURCE: Lowercased version of LRLP_TOKENIZED_RAW_SOURCE [TODO: is this extant?]
LRLP_TOKENIZED_SOURCE: Tokenization as provided in the original LRLP, passed through utf8 cleaning and normalization scripts.
LRLP_TOKENIZED_LC_SOURCE: Lowercased version of LRLP_TOKENIZED_SOURCE [TODO: is this extant?]
CDEC_TOKENIZED_SOURCE: Tokenization formed by running ORIG_SOURCE through tokenize-anything script in cdec machine translation software.
CDEC_TOKENIZED_LC_SOURCE: Lowercased version of CDEC_TOKENIZED_SOURCE
AGILE_TOKENIZED_SOURCE: Tokenization formed by running ORIG_SOURCE through agile tokenization script (only available in comparable)
AGILE_TOKENIZED_LC_SOURCE: Lowercased version of AGILE_TOKENIZED_SOURCE
LRLP_MORPH_TOKENIZED_SOURCE: Tokenization based on morphological segmentation, if available. (may not exist if none available)
LRLP_MORPH_SOURCE: Morphological analysis of words, if available. (may not exist if none available)
LRLP_POSTAG_SOURCE: Either coarse POS tags such as "punct", "word", "twitter", "unknown", "url", "number", "email", or "numstring", or full POS tags.
IS_HEADLINE: set to 1 if segment is headline (otherwise not present)
AUTHOR: Name of author if available
POST_DATE_TIME: If available, posting info of the article
CLUSTER_ID: If available, comparable cluster to which this segment belongs
ANNOTATIONS: Denotes annotations are available for this segment
ANNOTATION: a semantic annotation labeled by task: FE (full named entity annotation)/NE (simple named entity annotation)/SSA (simple semantic annotation)/NPC (noun phrase chunking)
HEAD: Extent of the head of the annotation
ENTITY_TYPE: (for NE/FE) LOC/NONE/ORG/PER/TTL
ANNOTATION_KIND: (for FE) HEAD or MENTION
MENTION_TYPE: (for FE) NAM/NOM/None/PRO/TTL
PHRASE_ID: (for FE) phrase this annotation refers to (the phrase is usually another annotation and this is the head)
ENTITY_ID: id of the entity this mention refers to
ROLE: (for SSA) act/state
NPC_TYPE: type of np chunking
TARGET: signifies the target component of the segment. Includes document-local segment id, start and end (inclusive, 0-based) char offsets of 'ORIG_RAW_SOURCE'
FULL_ID_TARGET: target-side document id (usually similar to source-side id)
ORIG_RAW_TARGET: original form of target segment. This canonical form is used as a reference by any annotations
ORIG_TARGET: original form of target segment, passed through utf8 cleaning and normalization scripts
MD5_HASH_TARGET: MD5 digest of the tokenized target. The data may contain duplicate segments, which we did not remove, in order to keep the context intact. This field can be used to efficiently detect duplicates.
LRLP_TOKENIZED_RAW_TARGET: Tokenization as provided in the original LRLP.
LRLP_TOKENIZED_LC_RAW_TARGET: Lowercased version of LRLP_TOKENIZED_RAW_TARGET [TODO: is this extant?]
LRLP_TOKENIZED_TARGET: Tokenization as provided in the original LRLP, passed through utf8 cleaning and normalization scripts
LRLP_TOKENIZED_LC_TARGET: Lowercased version of LRLP_TOKENIZED_TARGET [TODO: is this extant?]
AGILE_TOKENIZED_TARGET: Tokenization formed by running ORIG_TARGET through agile tokenization script
AGILE_TOKENIZED_LC_TARGET: Lowercased version of AGILE_TOKENIZED_TARGET
LRLP_POSTAG_TARGET: Coarse POS tags such as "punct", "word", "twitter", "unknown", "url", "number", "email", or “numstring"