Example SOLR configuration for German text corpus? #49

Open
johann-petrak opened this issue Feb 11, 2020 · 1 comment

Comments

@johann-petrak

For somebody not familiar with Solr, it is very hard to start using this. Would it be possible to add an example configuration for processing a German-language corpus in which each document is just a plain text file?

Is there also a way to supply a corpus for which the necessary NLP preprocessing (POS tagging, lemmatization, stop word identification) has already been performed by other tools?

@jerrygaoLondon
Collaborator

jerrygaoLondon commented Mar 27, 2020

Making JATE2.0 more adaptable to different languages is a good idea. I think Solr already supports supplying preprocessed text: take a look at PreAnalyzedField, which the Solr documentation describes as follows.

"The PreAnalyzedField type provides a way to send to Solr serialized token streams, optionally with independent stored values of a field, and have this information stored and indexed without any additional text processing applied in Solr. This is useful if user wants to submit field content that was already processed by some existing external text processing pipeline (e.g., it has been tokenized, annotated, stemmed, synonyms inserted, etc.), while using all the rich attributes that Lucene’s TokenStream provides (per-token attributes)."

We will look into providing an example of mapping pre-analysed fields to the JATE2 Solr schema in the future.
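
Until such an example exists, here is a rough sketch of what the externally preprocessed input could look like. It is not taken from JATE2; it only illustrates the JSON serialization that Solr's PreAnalyzedField accepts by default, where `v` is the format version, `str` the stored text, and each token carries `t` (term), `s`/`e` (character offsets), `i` (position increment) and `y` (token type). The helper name `to_preanalyzed`, the token dictionary layout, the use of the lemma as the indexed term, and putting the POS tag in `y` are all assumptions made for illustration. Whether the JATE2 schema and its term extraction algorithms work with a field fed this way is untested.

```python
import json


def to_preanalyzed(text, tokens):
    """Serialize externally analysed tokens into PreAnalyzedField JSON.

    `tokens` is a hypothetical layout: a list of dicts with the keys
    lemma, pos, start, end and stop, as an external German NLP tool
    might produce them.
    """
    serialized = []
    skipped = 0  # stop words dropped since the last emitted token
    for tok in tokens:
        if tok["stop"]:
            skipped += 1
            continue
        serialized.append({
            "t": tok["lemma"],   # indexed term: the lemma (assumption)
            "s": tok["start"],   # start offset in the original text
            "e": tok["end"],     # end offset in the original text
            "i": skipped + 1,    # position increment, leaving gaps for stop words
            "y": tok["pos"],     # token type: POS tag (assumption)
        })
        skipped = 0
    return json.dumps({"v": "1", "str": text, "tokens": serialized},
                      ensure_ascii=False)


text = "Die Katzen schlafen"
tokens = [
    {"lemma": "die", "pos": "DET", "start": 0, "end": 3, "stop": True},
    {"lemma": "Katze", "pos": "NOUN", "start": 4, "end": 10, "stop": False},
    {"lemma": "schlafen", "pos": "VERB", "start": 11, "end": 19, "stop": False},
]
field_value = to_preanalyzed(text, tokens)
print(field_value)
```

The resulting string would be sent as the value of a field whose type is declared with `solr.PreAnalyzedField` in the schema; the field name and the rest of the schema are left open here.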
