Example SOLR configuration for German text corpus? #49

johann-petrak · 2020-02-11T15:16:33Z

For somebody not familiar with Solr, it is very hard to get started with this. Would it be possible to add an example configuration for processing a German-language corpus in which each document is just a plain text file?

Also, is there a way to supply a corpus for which the necessary NLP preprocessing (POS tagging, lemmatization, stop word identification) has already been performed by other tools?

jerrygaoLondon · 2020-03-27T18:39:32Z

This is a good idea for making JATE2.0 more adaptable to different languages. As for pre-processed input, I think Solr already supports this: you may want to look into PreAnalyzedField.

"The PreAnalyzedField type provides a way to send to Solr serialized token streams, optionally with independent stored values of a field, and have this information stored and indexed without any additional text processing applied in Solr. This is useful if user wants to submit field content that was already processed by some existing external text processing pipeline (e.g., it has been tokenized, annotated, stemmed, synonyms inserted, etc.), while using all the rich attributes that Lucene’s TokenStream provides (per-token attributes)."
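As a rough illustration of what "serialized token streams" could look like in practice, here is a minimal Python sketch that turns the output of an external German NLP pipeline into the JSON form accepted by Solr's default pre-analyzed parser (`v`, `str`, `tokens`, with per-token `t`, `s`, `e`, `i`, `y`). The token dictionaries, the `is_stop` flag, the choice to index lemmas as terms, and the function name `to_pre_analyzed` are assumptions made for this example; none of it comes from JATE2, and how JATE2 would consume such a field is not covered here.

```python
import json


def to_pre_analyzed(text, tokens):
    """Serialize externally analyzed tokens into Solr's JSON pre-analyzed format.

    `tokens` is a hypothetical list of dicts with the keys
    lemma, pos, start, end and is_stop, as an external tagger might produce.
    """
    out = []
    skipped = 0  # stop words dropped since the last emitted token
    for tok in tokens:
        if tok["is_stop"]:
            skipped += 1
            continue
        out.append({
            "t": tok["lemma"],   # term to index (assumption: the lemma)
            "s": tok["start"],   # start character offset in the original text
            "e": tok["end"],     # end character offset (exclusive)
            "i": 1 + skipped,    # position increment, leaving gaps for stop words
            "y": tok["pos"],     # lexical type, used here to carry the POS tag
        })
        skipped = 0
    # "v" is the format version, "str" the stored (unanalyzed) field value
    return json.dumps({"v": "1", "str": text, "tokens": out}, ensure_ascii=False)


text = "Die Häuser sind alt"
tokens = [
    {"lemma": "die", "pos": "ART", "start": 0, "end": 3, "is_stop": True},
    {"lemma": "haus", "pos": "NN", "start": 4, "end": 10, "is_stop": False},
    {"lemma": "sein", "pos": "VAFIN", "start": 11, "end": 15, "is_stop": True},
    {"lemma": "alt", "pos": "ADJD", "start": 16, "end": 19, "is_stop": False},
]

payload = to_pre_analyzed(text, tokens)
print(payload)
```

The resulting string would be sent as the value of a field whose type is declared as a PreAnalyzedField in the schema, so that Solr indexes the supplied tokens as they are instead of running its own analysis chain on the text.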

We will look into providing an example of mapping pre-analysed fields to the JATE2 Solr schema in the future.
