CLARIN Standards

Maarten Janssen edited this page Aug 23, 2020 · 24 revisions

One objective of this set of tools is to provide the backbone for a set of conversion tools that convert between the various corpus formats recommended by CLARIN, and that import several other non-recommended formats, using TEITOK. The scripts provided here can all be used as a drag-and-drop web service at UFAL Conversion Tool, where they are combined with several third-party tools to convert a wide range of files to and from TEI(TOK). They can also be used directly for NLP tagging in the TEITOK Tagging Tool. Metadata conversion is a separate problem.

| Format | Import | Export | Status |
| --- | --- | --- | --- |
| CES | TBD | TBD | Need to find a generic description and example |
| CHAT | chat2teitok.pl | TBD | Only legacy transcriptions get converted fully |
| DiAML | TBD | TBD | Need to find a generic description and example |
| DITA | TBD | TBD | Need to find a generic description and example |
| HTML | - | - | Conversion to TEI done by pandoc |
| HyTime | TBD | TBD | Need to find a generic description and example |
| JATS | TBD | TBD | Need to find a generic description and example (metadata only format?) |
| LAF | TBD | TBD | Need to find a generic description and example |
| MAF | TBD | TBD | Need to find a generic description and example |
| MLIF | TBD | TBD | Need to find a generic description and example |
| NLM JATS | TBD | TBD | Need to find a generic description and example (metadata only format?) |
| PDF/A | - | - | Conversion to HTML done by pdf2html |
| RTF | - | - | Conversion to TEI done by pandoc |
| SemRoleML | TBD | TBD | Need to find a generic description and example |
| SynAF | TBD | TBD | Need to find a generic description and example |
| TEI | - | teitok2p5.pl | Since TEITOK uses TEI, exporting is a light conversion, and importing is not generically possible |
| TimeML | TBD | TBD | Need to find a generic description and example |
| TMX | tmx2teitok.pl | TBD | Export is difficult since aligned data will typically be distributed in TEITOK |
| WordSeg | TBD | TBD | Need to find a generic description and example |
| XCES | TBD | TBD | Need to find a generic description and example |
| XHTML | - | - | Conversion to TEI done by pandoc |

Non-Recommended Standards with import support

Apart from the recommended formats, we attempt to provide import support for various other popular corpus formats.

| Format | Import | Export | Status |
| --- | --- | --- | --- |
| Praat | praat2tei.pl | - | Working but not thoroughly tested |
| CoNLL-U | conllu2teitok.pl | teitok2conllu.pl | Working but not thoroughly tested |
| EXMARaLDA | exb2tei.pl | - | Working but not thoroughly tested |
| FoLiA | folia2teitok.pl | - | Working but not thoroughly tested |
| hOCR | hocr2teitok.pl | TBD | Working but not thoroughly tested |
| PML | pml2tei.pl | TBD | Working but not thoroughly tested |
| Brat | brat2teitok.pl | - | Working but not thoroughly tested |
| Toolbox | tbt2teitok.pl | - | Working but not thoroughly tested |
| Transcriber | trs2teitok.pl | - | Working but not thoroughly tested |
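
The two tables on this page can be captured as a small lookup, for instance to check whether an import or export script is listed for a given format. The following is a minimal Python sketch and not part of the TEITOK tools: the names `CONVERTERS`, `find_script` and `ROUND_TRIP` are hypothetical, the script names are copied from the tables on this page, and cells marked TBD or - are represented as `None`. It does not run the scripts, since their command-line options are not described here.

```python
from typing import Optional

# (import script, export script) per format, copied from the tables on this page.
# None means no script is listed for that direction (TBD or "-").
# Formats with no script in either direction are left out.
CONVERTERS = {
    "CHAT": ("chat2teitok.pl", None),
    "TEI": (None, "teitok2p5.pl"),
    "TMX": ("tmx2teitok.pl", None),
    "Praat": ("praat2tei.pl", None),
    "CoNLL-U": ("conllu2teitok.pl", "teitok2conllu.pl"),
    "EXMARaLDA": ("exb2tei.pl", None),
    "FoLiA": ("folia2teitok.pl", None),
    "hOCR": ("hocr2teitok.pl", None),
    "PML": ("pml2tei.pl", None),
    "Brat": ("brat2teitok.pl", None),
    "Toolbox": ("tbt2teitok.pl", None),
    "Transcriber": ("trs2teitok.pl", None),
}


def find_script(fmt: str, direction: str = "import") -> Optional[str]:
    """Return the listed script name for a format, or None if none is listed."""
    if direction not in ("import", "export"):
        raise ValueError("direction must be 'import' or 'export'")
    # Match format names case-insensitively, e.g. "conll-u" finds "CoNLL-U".
    by_lower = {name.lower(): scripts for name, scripts in CONVERTERS.items()}
    scripts = by_lower.get(fmt.lower())
    if scripts is None:
        return None
    return scripts[0] if direction == "import" else scripts[1]


# Formats that have both an import and an export script listed.
ROUND_TRIP = [name for name, (imp, exp) in CONVERTERS.items() if imp and exp]
```

With the tables as they stand, `find_script("conll-u", "export")` returns `teitok2conllu.pl`, and CoNLL-U is the only format with scripts listed in both directions.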