-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.parallel
45 lines (44 loc) · 4.24 KB
/
README.parallel
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
The bilingual corpora are structured with the following fields:
ELISA_BILINGUAL_LRLP_CORPUS: root node of the file. Includes source language code as attribute
DOCUMENT: signifies a document extent. Includes source-side document id as attribute, expressed as concatenation of genre, provenance, language, index id, and date
GENRE: two-letter indication of document genre
PROVENANCE: three-letter indication of document provenance
SOURCE_LANGUAGE: three-letter indication of document source language
TARGET_LANGUAGE: three-letter indication of document target language (always ENG)
INDEX_ID: six-digit unique identifier of document
DATE: eight-digit date of document, expressed as YYYYMMDD. Field is zeroed if year/month/day is unknown
DIRECTION: One of fromsource, fromtarget, found
PARALLEL: signifies a segment (sentence) and its translation.
SEGMENT_SOURCE: signifies the source component of the segment. Includes document-local segment id, start and end (inclusive, 0-based) char offsets of 'ORIG_RAW_SOURCE'
FULL_ID_SOURCE: repeat of document id
ORIG_RAW_SOURCE: original form of source segment. This canonical form is used as a reference by any annotations
MD5_HASH_SOURCE: MD5 digest of the tokenized source. The data may contain duplicate segments, which we did not remove, in order to keep the context intact. This field can be used to efficiently detect duplicates.
LRLP_TOKENIZED_SOURCE: Tokenization as provided in the original LRLP.
CDEC_TOKENIZED_SOURCE: Tokenization formed by running ORIG_RAW_SOURCE through tokenize-anything script in cdec machine translation software.
CDEC_TOKENIZED_LC_SOURCE: Lowercased version of CDEC_TOKENIZED_SOURCE
LRLP_MORPH_TOKENIZED_SOURCE: Tokenization based on morphological segmentation, if available.
LRLP_MORPH_SOURCE: POS tags of morphemes, if available, otherwise 'none'
LRLP_POSTAG_SOURCE: Either coarse POS tags such as "punct", "word", "twitter", "unknown", "url", "number", "email", or "numstring", or full POS tags.
IS_HEADLINE: set to 1 if segment is headline (otherwise not present)
AUTHOR: Name of Author if available (otherwise not present)
POST_DATE_TIME: If available, posting info of the article (otherwise not present)
ANNOTATIONS: Denotes annotations are available for this segment
ANNOTATION: a semantic annotation labeled by task: FE (full named entity annotation)/NE (simple named entity annotation)/SSA (simple semantic annotation)/NPC (noun phrase chunking)
HEAD: Extent of the head of the annotation
ENTITY_TYPE: (for NE/FE) LOC/NONE/ORG/PER/TTL
ANNOTATION_KIND: (for FE) HEAD or MENTION
MENTION_TYPE: (for FE) NAM/NOM/None/PRO/TTL
PHRASE_ID: (for FE) phrase this annotation refers to (the phrase is usually another annotation and this is the head)
ENTITY_ID: id of the entity this mention refers to
ROLE: (for SSA) act/state
NPC_TYPE: type of np chunking
SEGMENT_TARGET: signifies the target component of the segment. Includes document-local segment id, start and end (inclusive, 0-based) char offsets of 'ORIG_RAW_SOURCE'
FULL_ID_TARGET: target-side document id (usually similar to source-side id)
ORIG_RAW_TARGET: original form of target segment. This canonical form is used as a reference by any annotations
MD5_HASH_TARGET: MD5 digest of the tokenized target. The data may contain duplicate segments, which we did not remove, in order to keep the context intact. This field can be used to efficiently detect duplicates.
LRLP_TOKENIZED_TARGET: Tokenization as provided in the original LRLP.
AGILE_TOKENIZED_TARGET: Tokenization formed by running ORIG_RAW_TARGET through agile tokenization script
AGILE_TOKENIZED_LC_TARGET: Lowercased version of AGILE_TOKENIZED_TARGET
LRLP_MORPH_TOKENIZED_TARGET: Tokenization based on morphological segmentation, if available. (NOTE: should be removed; this never exists!)
LRLP_MORPH_TARGET: POS tags of morphemes, if available, otherwise 'none' (NOTE: should be removed: this never exists!)
LRLP_POSTAG_TARGET: Coarse POS tags such as "punct", "word", "twitter", "unknown", "url", "number", "email", or "numstring"