CLARIN Standards
Maarten Janssen edited this page Aug 23, 2020 · 24 revisions
One objective of this set of tools is to provide the backbone for conversion between the various corpus formats recommended by CLARIN, using TEITOK as the intermediate format, and to import several other, non-recommended formats. The scripts provided here can all be used as a drag-and-drop web service at the UFAL Conversion Tool, where they are combined with several third-party tools to convert a wide range of files to and from TEI(TOK). They can also be used directly for NLP tagging in the TEITOK Tagging Tool. Metadata conversion is a separate problem.
Format | Import | Export | Status |
---|---|---|---|
CES | TBD | TBD | Need to find a generic description and example |
CHAT | chat2teitok.pl | TBD | Only legacy transcriptions get converted fully |
DiAML | TBD | TBD | Need to find a generic description and example |
DITA | TBD | TBD | Need to find a generic description and example |
HTML | - | - | Conversion to TEI done by pandoc |
HyTime | TBD | TBD | Need to find a generic description and example |
JATS | TBD | TBD | Need to find a generic description and example (metadata only format?) |
LAF | TBD | TBD | Need to find a generic description and example |
MAF | TBD | TBD | Need to find a generic description and example |
MLIF | TBD | TBD | Need to find a generic description and example |
NLM JATS | TBD | TBD | Need to find a generic description and example (metadata only format?) |
PDF/A | - | - | Conversion to HTML done by pdf2html |
RTF | - | - | Conversion to TEI done by pandoc |
SemRoleML | TBD | TBD | Need to find a generic description and example |
SynAF | TBD | TBD | Need to find a generic description and example |
TEI | - | teitok2p5.pl | Since TEITOK uses TEI, exporting is a light conversion, and importing is not generically possible |
TimeML | TBD | TBD | Need to find a generic description and example |
TMX | tmx2teitok.pl | TBD | Export is difficult since aligned data is typically distributed in TEITOK |
WordSeg | TBD | TBD | Need to find a generic description and example |
XCES | TBD | TBD | Need to find a generic description and example |
XHTML | - | - | Conversion to TEI done by pandoc |
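
For the rows above that delegate to pandoc (HTML, RTF, XHTML), the conversion can be sketched as a thin wrapper around the pandoc command line. This is only an illustration: the helper names `pandoc_tei_command` and `convert_to_tei` and the extension mapping are assumptions made here, and the exact options used by the conversion service are not documented on this page. The pandoc flags themselves (`--from`, `--to tei`, `--standalone`, `--output`) are standard; the `rtf` reader needs a reasonably recent pandoc.

```python
from pathlib import Path
import shutil
import subprocess

# Assumed mapping from file extension to pandoc input format
# (XHTML is read with pandoc's HTML reader).
PANDOC_READERS = {".html": "html", ".htm": "html", ".xhtml": "html", ".rtf": "rtf"}


def pandoc_tei_command(source, target=None):
    """Build the pandoc argument list that converts `source` to TEI."""
    source = Path(source)
    reader = PANDOC_READERS.get(source.suffix.lower())
    if reader is None:
        raise ValueError(f"no pandoc reader assumed for {source.suffix!r}")
    # Default output: same name as the input, with an .xml extension.
    target = Path(target) if target else source.with_suffix(".xml")
    return ["pandoc", "--from", reader, "--to", "tei", "--standalone",
            "--output", str(target), str(source)]


def convert_to_tei(source, target=None):
    """Run pandoc on `source`; raises if pandoc is missing or fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_tei_command(source, target), check=True)
```

Calling `convert_to_tei("article.html")` would write `article.xml` next to the input. The result is generic pandoc TEI; any TEITOK-specific processing afterwards is outside the scope of this sketch.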
Apart from the recommended formats, we attempt to provide import support for various other popular corpus formats:
Format | Import | Export | Status |
---|---|---|---|
Praat | praat2tei.pl | - | Working but not thoroughly tested |
CoNLL-U | conllu2teitok.pl | teitok2conllu.pl | Working but not thoroughly tested |
EXMARaLDA | exb2tei.pl | - | Working but not thoroughly tested |
FoLiA | folia2teitok.pl | - | Working but not thoroughly tested |
hOCR | hocr2teitok.pl | TBD | Working but not thoroughly tested |
PML | pml2tei.pl | TBD | Working but not thoroughly tested |
Brat | brat2teitok.pl | - | Working but not thoroughly tested |
Toolbox | tbt2teitok.pl | - | Working but not thoroughly tested |
Transcriber | trs2teitok.pl | - | Working but not thoroughly tested |