-
Notifications
You must be signed in to change notification settings - Fork 1
panx27/elisatools
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
ELISA_LRLP_CORPUS: root node of the file. Includes language code as attribute DOCUMENT: signifies a document extent. Includes document id as attribute, expressed as concatenation of genre, provenance, language, index id, and date GENRE: two-letter indication of document genre PROVENANCE: three-letter indication of document provenance SOURCE_LANGUAGE: three-letter indication of document source language INDEX_ID: six-digit unique identifier of document DATE: eight-digit date of document, expressed as YYYYMMDD. Field is zeroed if year/month/day is unknown DIRECTION: One of fromsource, fromtarget, found. Mono = 'fromsource' SEGMENT: signifies a segment (sentence) and possibly its translation or other attributes. If downloaded twitter, includes 'downloaded="true"' attribute SOURCE: signifies the source component of the segment. Includes document-local segment id, start and end (inclusive, 0-based) char offsets of 'ORIG_RAW_SOURCE' FULL_ID_SOURCE: repeat of document id ORIG_SEG_ID: segment id in initial ltf file (for non-found data) (needed for eval) ORIG_FILENAME: base name of the file where the data was originally found (needed for eval) ORIG_RAW_SOURCE: original form of segment. This canonical form is used as a reference by any annotations ORIG_SOURCE: original form of segment, passed through utf8 cleaning and normalization scripts MD5_HASH_SOURCE: MD5 digest of the tokenized source. The data may contain duplicate segments, which we did not remove, in order to keep the context intact. This field can be used to efficiently detect duplicates. LRLP_TOKENIZED_RAW_SOURCE: Tokenization as provided in the original LRLP. LRLP_TOKENIZED_LC_RAW_SOURCE: Lowercased version of LRLP_TOKENIZED_RAW_SOURCE [TODO: is this extant?] LRLP_TOKENIZED_SOURCE: Tokenization as provided in the original LRLP, passed through utf8 cleaning and normalization scripts. LRLP_TOKENIZED_LC_SOURCE: Lowercased version of LRLP_TOKENIZED_SOURCE [TODO: is this extant?] CDEC_TOKENIZED_SOURCE: Tokenization formed by running ORIG_SOURCE through tokenize-anything script in cdec machine translation software. CDEC_TOKENIZED_LC_SOURCE: Lowercased version of CDEC_TOKENIZED_SOURCE AGILE_TOKENIZED_SOURCE: Tokenization formed by running ORIG_SOURCE through agile tokenization script (only available in comparable) AGILE_TOKENIZED_LC_SOURCE: Lowercased version of AGILE_TOKENIZED_SOURCE LRLP_MORPH_TOKENIZED_SOURCE: Tokenization based on morphological segmentation, if available. (may not exist if none available) LRLP_MORPH_SOURCE: Morphological analysis of words, if available. (may not exist if none available) LRLP_POSTAG_SOURCE: Either coarse POS tags such as "punct", "word", "twitter", "unknown", "url", "number", "email", or "numstring", or full POS tags. IS_HEADLINE: set to 1 if segment is headline (otherwise not present) AUTHOR: Name of author if available POST_DATE_TIME: If available, posting info of the article CLUSTER_ID: If available, comparable cluster to which this segment belongs ANNOTATIONS: Denotes annotations are available for this segment ANNOTATION: a semantic annotation labeled by task: FE (full named entity annotation)/NE (simple named entity annotation)/SSA (simple semantic annotation)/NPC (noun phrase chunking) HEAD: Extent of the head of the annotation ENTITY_TYPE: (for NE/FE) LOC/NONE/ORG/PER/TTL ANNOTATION_KIND: (for FE) HEAD or MENTION MENTION_TYPE: (for FE) NAM/NOM/None/PRO/TTL PHRASE_ID: (for FE) phrase this annotation refers to (the phrase is usually another annotation and this is the head) ENTITY_ID: id of the entity this mention refers to ROLE: (for SSA) act/state NPC_TYPE: type of np chunking TARGET: signifies the target component of the segment. Includes document-local segment id, start and end (inclusive, 0-based) char offsets of 'ORIG_RAW_SOURCE' FULL_ID_TARGET: target-side document id (usually similar to source-side id) ORIG_RAW_TARGET: original form of target segment. This canonical form is used as a reference by any annotations ORIG_TARGET: original form of target segment, passed through utf8 cleaning and normalization scripts MD5_HASH_TARGET: MD5 digest of the tokenized target. The data may contain duplicate segments, which we did not remove, in order to keep the context intact. This field can be used to efficiently detect duplicates. LRLP_TOKENIZED_RAW_TARGET: Tokenization as provided in the original LRLP. LRLP_TOKENIZED_LC_RAW_TARGET: Lowercased version of LRLP_TOKENIZED_RAW_TARGET [TODO: is this extant?] LRLP_TOKENIZED_TARGET: Tokenization as provided in the original LRLP, passed through utf8 cleaning and normalization scripts LRLP_TOKENIZED_LC_TARGET: Lowercased version of LRLP_TOKENIZED_TARGET [TODO: is this extant?] AGILE_TOKENIZED_TARGET: Tokenization formed by running ORIG_TARGET through agile tokenization script AGILE_TOKENIZED_LC_TARGET: Lowercased version of AGILE_TOKENIZED_TARGET LRLP_POSTAG_TARGET: Coarse POS tags such as "punct", "word", "twitter", "unknown", "url", "number", "email", or “numstring"
About
tools for processing data in elisa
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published