For somebody not familiar with Solr it is very hard to start using this. Would it be possible to add an example configuration for processing a corpus where each document is just a German text file?
Also, is there a way to provide the corpus in a form where the necessary NLP preprocessing (POS tagging, lemmatization, stop word identification) has already been performed by other tools?
This is a good idea for making JATE 2.0 more adaptable to different languages. It is also something that I think Solr already supports; have a look at PreAnalyzedField:
"The PreAnalyzedField type provides a way to send to Solr serialized token streams, optionally with independent stored values of a field, and have this information stored and indexed without any additional text processing applied in Solr. This is useful if user wants to submit field content that was already processed by some existing external text processing pipeline (e.g., it has been tokenized, annotated, stemmed, synonyms inserted, etc.), while using all the rich attributes that Lucene’s TokenStream provides (per-token attributes)."
We will look into providing an example of mapping pre-analysed fields to the JATE2 Solr schema in the future.
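In the meantime, as a minimal sketch of the schema side (the type and field names are placeholders, not JATE2's actual schema), the declaration would look something like:

```xml
<!-- Field type that accepts serialized token streams; parserImpl="json"
     selects Solr's built-in JsonPreAnalyzedParser. -->
<fieldType name="preanalyzed" class="solr.PreAnalyzedField" parserImpl="json"/>

<!-- Hypothetical content field fed by the external German NLP pipeline. -->
<field name="jate_text" type="preanalyzed" indexed="true" stored="true"/>
```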