README.parallel

The bilingual corpora are structured with the following fields:

ELISA_BILINGUAL_LRLP_CORPUS: root node of the file. Includes source language code as attribute
  DOCUMENT: signifies a document extent. Includes source-side document id as attribute, expressed as concatenation of genre, provenance, language, index id, and date
    GENRE: two-letter indication of document genre
    PROVENANCE: three-letter indication of document provenance
    SOURCE_LANGUAGE: three-letter indication of document source language
    TARGET_LANGUAGE: three-letter indication of document target language (always ENG)
    INDEX_ID: six-digit unique identifier of document
    DATE: eight-digit date of document, expressed as YYYYMMDD. Field is zeroed if year/month/day is unknown
    DIRECTION: One of fromsource, fromtarget, found
    PARALLEL: signifies a segment (sentence) and its translation.
      SEGMENT_SOURCE: signifies the source component of the segment. Includes document-local segment id, start and end (inclusive, 0-based) char offsets of 'ORIG_RAW_SOURCE'
        FULL_ID_SOURCE: repeat of document id
        ORIG_RAW_SOURCE: original form of source segment. This canonical form is used as a reference by any annotations
        MD5_HASH_SOURCE: MD5 digest of the tokenized source. The data may contain duplicate segments, which we did not remove, in order to keep the context intact. This field can be used to efficiently detect duplicates.
        LRLP_TOKENIZED_SOURCE: Tokenization as provided in the original LRLP.
        CDEC_TOKENIZED_SOURCE: Tokenization formed by running ORIG_RAW_SOURCE through tokenize-anything script in cdec machine translation software.
        CDEC_TOKENIZED_LC_SOURCE: Lowercased version of CDEC_TOKENIZED_SOURCE
        LRLP_MORPH_TOKENIZED_SOURCE: Tokenization based on morphological segmentation, if available.
        LRLP_MORPH_SOURCE: POS tags of morphemes, if available, otherwise 'none'
        LRLP_POSTAG_SOURCE: Either coarse POS tags such as "punct", "word", "twitter", "unknown", "url", "number", "email", or "numstring", or full POS tags.
        IS_HEADLINE: set to 1 if segment is headline (otherwise not present)
        AUTHOR: Name of Author if available (otherwise not present)
        POST_DATE_TIME: If available, posting info of the article (otherwise not present)
        ANNOTATIONS: Denotes annotations are available for this segment
         ANNOTATION: a semantic annotation labeled by task: FE (full named entity annotation)/NE (simple named entity annotation)/SSA (simple semantic annotation)/NPC (noun phrase chunking)
           HEAD: Extent of the head of the annotation
           ENTITY_TYPE: (for NE/FE) LOC/NONE/ORG/PER/TTL
           ANNOTATION_KIND: (for FE) HEAD or MENTION
           MENTION_TYPE: (for FE) NAM/NOM/None/PRO/TTL
           PHRASE_ID: (for FE) phrase this annotation refers to (the phrase is usually another annotation and this is the head)
           ENTITY_ID: id of the entity this mention refers to
           ROLE: (for SSA) act/state
           NPC_TYPE: type of np chunking
      SEGMENT_TARGET: signifies the target component of the segment. Includes document-local segment id, start and end (inclusive, 0-based) char offsets of 'ORIG_RAW_SOURCE'
        FULL_ID_TARGET: target-side document id (usually similar to source-side id)
        ORIG_RAW_TARGET: original form of target segment. This canonical form is used as a reference by any annotations
        MD5_HASH_TARGET: MD5 digest of the tokenized target. The data may contain duplicate segments, which we did not remove, in order to keep the context intact. This field can be used to efficiently detect duplicates.
        LRLP_TOKENIZED_TARGET: Tokenization as provided in the original LRLP.
        AGILE_TOKENIZED_TARGET: Tokenization formed by running ORIG_RAW_TARGET through agile tokenization script
        AGILE_TOKENIZED_LC_TARGET: Lowercased version of AGILE_TOKENIZED_TARGET
        LRLP_MORPH_TOKENIZED_TARGET: Tokenization based on morphological segmentation, if available. (NOTE: should be removed; this never exists!)
        LRLP_MORPH_TARGET: POS tags of morphemes, if available, otherwise 'none' (NOTE: should be removed: this never exists!)
        LRLP_POSTAG_TARGET: Coarse POS tags such as "punct", "word", "twitter", "unknown", "url", "number", "email", or "numstring"