README.format

ELISA_LRLP_CORPUS: root node of the file. Includes language code as attribute
  DOCUMENT: signifies a document extent. Includes document id as attribute, expressed as concatenation of genre, provenance, language, index id, and date
    GENRE: two-letter indication of document genre
    PROVENANCE: three-letter indication of document provenance
    SOURCE_LANGUAGE: three-letter indication of document source language
    INDEX_ID: six-digit unique identifier of document
    DATE: eight-digit date of document, expressed as YYYYMMDD. Field is zeroed if year/month/day is unknown
    DIRECTION: One of fromsource, fromtarget, found. Mono = 'fromsource'
    SEGMENT: signifies a segment (sentence) and possibly its translation or other attributes. If downloaded twitter, includes 'downloaded="true"' attribute
      SOURCE: signifies the source component of the segment. Includes document-local segment id, start and end (inclusive, 0-based) char offsets of 'ORIG_RAW_SOURCE'
        FULL_ID_SOURCE: repeat of document id
        ORIG_SEG_ID: segment id in initial ltf file (for non-found data) (needed for eval)
        ORIG_FILENAME: base name of the file where the data was originally found (needed for eval)
        ORIG_RAW_SOURCE: original form of segment. This canonical form is used as a reference by any annotations
        ORIG_SOURCE: original form of segment, passed through utf8 cleaning and normalization scripts
        MD5_HASH_SOURCE: MD5 digest of the tokenized source. The data may contain duplicate segments, which we did not remove, in order to keep the context intact. This field can be used to efficiently detect duplicates.
        LRLP_TOKENIZED_RAW_SOURCE: Tokenization as provided in the original LRLP.
        LRLP_TOKENIZED_LC_RAW_SOURCE: Lowercased version of LRLP_TOKENIZED_RAW_SOURCE [TODO: is this extant?]
        LRLP_TOKENIZED_SOURCE: Tokenization as provided in the original LRLP, passed through utf8 cleaning and normalization scripts.
        LRLP_TOKENIZED_LC_SOURCE: Lowercased version of LRLP_TOKENIZED_SOURCE [TODO: is this extant?]
        CDEC_TOKENIZED_SOURCE: Tokenization formed by running ORIG_SOURCE through tokenize-anything script in cdec machine translation software.
        CDEC_TOKENIZED_LC_SOURCE: Lowercased version of CDEC_TOKENIZED_SOURCE
        AGILE_TOKENIZED_SOURCE: Tokenization formed by running ORIG_SOURCE through agile tokenization script (only available in comparable)
        AGILE_TOKENIZED_LC_SOURCE: Lowercased version of AGILE_TOKENIZED_SOURCE
        LRLP_MORPH_TOKENIZED_SOURCE: Tokenization based on morphological segmentation, if available. (may not exist if none available)
        LRLP_MORPH_SOURCE: Morphological analysis of words, if available. (may not exist if none available)
        LRLP_POSTAG_SOURCE: Either coarse POS tags such as "punct", "word", "twitter", "unknown", "url", "number", "email", or "numstring", or full POS tags.
        IS_HEADLINE: set to 1 if segment is headline (otherwise not present)
        AUTHOR: Name of author if available
        POST_DATE_TIME: If available, posting info of the article
        CLUSTER_ID: If available, comparable cluster to which this segment belongs
        ANNOTATIONS: Denotes annotations are available for this segment
         ANNOTATION: a semantic annotation labeled by task: FE (full named entity annotation)/NE (simple named entity annotation)/SSA (simple semantic annotation)/NPC (noun phrase chunking)
           HEAD: Extent of the head of the annotation
           ENTITY_TYPE: (for NE/FE) LOC/NONE/ORG/PER/TTL
           ANNOTATION_KIND: (for FE) HEAD or MENTION
           MENTION_TYPE: (for FE) NAM/NOM/None/PRO/TTL
           PHRASE_ID: (for FE) phrase this annotation refers to (the phrase is usually another annotation and this is the head)
           ENTITY_ID: id of the entity this mention refers to
           ROLE: (for SSA) act/state
        	 NPC_TYPE: type of np chunking
      TARGET: signifies the target component of the segment. Includes document-local segment id, start and end (inclusive, 0-based) char offsets of 'ORIG_RAW_SOURCE'
        FULL_ID_TARGET: target-side document id (usually similar to source-side id)
        ORIG_RAW_TARGET: original form of target segment. This canonical form is used as a reference by any annotations
        ORIG_TARGET: original form of target segment, passed through utf8 cleaning and normalization scripts
        MD5_HASH_TARGET: MD5 digest of the tokenized target. The data may contain duplicate segments, which we did not remove, in order to keep the context intact. This field can be used to efficiently detect duplicates.
        LRLP_TOKENIZED_RAW_TARGET: Tokenization as provided in the original LRLP.
        LRLP_TOKENIZED_LC_RAW_TARGET: Lowercased version of LRLP_TOKENIZED_RAW_TARGET [TODO: is this extant?]
        LRLP_TOKENIZED_TARGET: Tokenization as provided in the original LRLP, passed through utf8 cleaning and normalization scripts
        LRLP_TOKENIZED_LC_TARGET: Lowercased version of LRLP_TOKENIZED_TARGET [TODO: is this extant?]
        AGILE_TOKENIZED_TARGET: Tokenization formed by running ORIG_TARGET through agile tokenization script
        AGILE_TOKENIZED_LC_TARGET: Lowercased version of AGILE_TOKENIZED_TARGET
        LRLP_POSTAG_TARGET: Coarse POS tags such as "punct", "word", "twitter", "unknown", "url", "number", "email", or “numstring"