Creating a Morphological Attribute Configuration File
This page walks through the steps needed to create a morphological attribute configuration file for Arethusa. These files configure how morphological attributes for a given annotation format and language are represented by Arethusa in the state it maintains for the tokens as users annotate a document using that format and language.
They also describe how to map the results of one or more morphological parsers to these morphological attributes so that the output of a parser can be automatically applied to tokens loaded in the annotation environment.
(For a simple Arethusa setup in which to test your config file changes, see http://github.com/alpheios-project/arethusa-vagrant.)
Example Arethusa morphological attribute configuration files can be found at (https://github.com/alpheios-project/arethusa-configs/tree/master/configs/arethusa.morph).
The Arethusa configuration file is a JSON object containing the following keys:
postagSchema describes an ordered list that defines the order in which the morphological attributes on a token are combined when they are serialized into a single postag attribute for the resulting annotation. The values in this list must correspond to the keys used to define the morphological attributes in the annotation, as discussed further below. The following example defines the order for the ALDT treebanking format:
"postagSchema" : [
  "pos",
  "pers",
  "num",
  "tense",
  "mood",
  "voice",
  "gend",
  "case",
  "degree"
],
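As a rough illustration of how this ordering is used, the sketch below assembles a postag string by walking the postagSchema list and looking up the postag character of each attribute value (the postag key is described further below). The function name toPostag, the trimmed-down attributes object, the value keys and characters for pos and num, and the "-" filler for unset positions are all assumptions made for this example; this is not Arethusa's actual implementation.

```javascript
// Order of the positions in the serialized postag (taken from the example above).
const postagSchema = [
  "pos", "pers", "num", "tense", "mood", "voice", "gend", "case", "degree"
];

// Assumption: a trimmed-down attributes object. Only the postag characters
// matter here, and the pos/num entries are hypothetical.
const attributes = {
  pos:  { values: { verb: { postag: "v" }, noun: { postag: "n" } } },
  pers: { values: { "1st": { postag: "1" }, "2nd": { postag: "2" }, "3rd": { postag: "3" } } },
  num:  { values: { sg: { postag: "s" }, pl: { postag: "p" } } }
};

// Hypothetical helper: emit one character per schema entry, using a filler
// (assumed to be "-") when the token has no value for that attribute.
function toPostag(tokenAttrs, schema, attrs, filler = "-") {
  return schema.map((key) => {
    const value = tokenAttrs[key];
    const def = attrs[key] && attrs[key].values[value];
    return def ? def.postag : filler;
  }).join("");
}

console.log(toPostag({ pos: "verb", pers: "3rd", num: "sg" }, postagSchema, attributes));
// prints "v3s------"
```

Changing the order of the entries in postagSchema changes the position of each character in the result, which is why the list must stay in sync with the postag layout your treebank format expects.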
The styledThrough attribute supplies the short name of the morphological attribute that is used to style (i.e. color) tokens as they are annotated in the Arethusa interface. For the ALDT treebank format, we use the pos attribute (the part of speech) for this. The value used here must match the name of an attribute key, as discussed further below:
"styledThrough" : "pos"
The attributes key defines an object which describes the representation of morphological attributes.
Each key in the attributes object names an attribute that you want to preserve in the resulting annotation (these are the same keys that are referenced by postagSchema and styledThrough above). It doesn't matter what you use for these keys as long as each is a valid JavaScript string, but to keep the file readable, it's generally a good idea to use something semantically meaningful. It also greatly simplifies things if you use keys that match those of the parser you expect to use, but this is not a requirement: mappings are possible and will inevitably be needed if you want to support more than one parser (more on this in the mappings section below).
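Because postagSchema and styledThrough must both refer to keys defined under attributes, a quick consistency check can catch typos before the file is loaded. The following is a minimal sketch of such a check; the function name findUndefinedKeys and the sample configuration are hypothetical and are not part of Arethusa.

```javascript
// Hypothetical helper: list every key referenced by postagSchema or
// styledThrough that is not defined under attributes.
function findUndefinedKeys(cfg) {
  const defined = new Set(Object.keys(cfg.attributes || {}));
  const problems = [];
  (cfg.postagSchema || []).forEach((key) => {
    if (!defined.has(key)) problems.push("postagSchema: " + key);
  });
  if (!defined.has(cfg.styledThrough)) {
    problems.push("styledThrough: " + cfg.styledThrough);
  }
  return problems;
}

// Sample configuration (an assumption for this example): "num" is listed in
// postagSchema but never defined under attributes.
const sampleConfig = {
  postagSchema: ["pos", "pers", "num"],
  styledThrough: "pos",
  attributes: {
    pos: { long: "Part of Speech", short: "pos", values: {} },
    pers: { long: "Person", short: "pers", values: {} }
  }
};

console.log(findUndefinedKeys(sampleConfig));
// prints [ 'postagSchema: num' ]
```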
Each attribute key defines an object which describes the representation of that attribute and its mapping to the parser output. The keys in this object are:
long - provides a long descriptive name for the attribute. This is displayed in the user interface but not normally preserved in the annotation (although it could be if the Persister chose to do so).
short - provides a short descriptive name for the attribute. This is displayed in the user interface but not normally preserved in the annotation (although it could be if the Persister chose to do so).
values - provides an object which describes the values for this attribute as supported by the treebank format you are configuring.
The values object of an attribute contains a key for each possible value of that attribute. It doesn't matter what you use for these keys as long as each is a valid JavaScript string, but to keep the file readable, it's generally a good idea to use something semantically meaningful.
The object for each value is itself made up of the following keys:
long - provides a long descriptive name for the attribute value. This is displayed in the user interface but not normally preserved in the annotation (although it could be if the Persister chose to do so).
short - provides a short descriptive name for the attribute value. This is displayed in the user interface but not normally preserved in the annotation (although it could be if the Persister chose to do so).
postag - normally a single character that is used to represent the attribute value when it is combined with others in a single postag attribute. The position of this character in the postag is determined by the position of the attribute in the postagSchema list described above.
style - an optional key which defines an object used to style tokens in the user interface when they have this attribute value applied to them. This object can currently contain one key, color, which should be set to a valid CSS color name or RGB value.
rules - an optional key, set on the attribute object itself (alongside long, short and values), which defines a list of rules that should be followed when making the attribute available for use in annotation. Normally these rules describe dependencies between this attribute and other attributes which can apply to the same token. The rules are applied in the order in which they are encountered in the list. This is probably best illustrated by example:
"rules" : [
  {
    "if" : {
      "pos" : "verb",
      "mood" : "*"
    },
    "unless" : {
      "mood" : [ "part", "inf" ]
    }
  }
]
This rules object instructs the system to enable the attribute it is defined for if the value of the pos attribute is "verb" and the mood attribute has any value (the * is a wildcard), UNLESS the value of the mood attribute is "part" or "inf".
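To make the if/unless semantics concrete, here is one possible way to evaluate such a rules list against the attributes already set on a token. This is only a sketch: the function names matches and isEnabled are hypothetical, and it rests on three assumptions that the text above does not confirm, namely that the wildcard requires the attribute to have some value, that an attribute with no rules is always available, and that an attribute is enabled as soon as any one rule in the list passes.

```javascript
// Does a single condition block (an "if" or an "unless") match the token?
// Assumption: every key in the block must match, "*" matches any set value,
// and an array matches if it contains the token's value.
function matches(condition, tokenAttrs) {
  return Object.keys(condition).every((key) => {
    const expected = condition[key];
    const actual = tokenAttrs[key];
    if (actual === undefined) return false;
    if (expected === "*") return true;
    return Array.isArray(expected) ? expected.includes(actual) : expected === actual;
  });
}

// Hypothetical evaluation of a whole rules list, tried in list order.
function isEnabled(rules, tokenAttrs) {
  if (!rules || rules.length === 0) return true; // assumption: no rules, no restriction
  return rules.some((rule) =>
    matches(rule.if || {}, tokenAttrs) &&
    !(rule.unless && matches(rule.unless, tokenAttrs))
  );
}

// The rules list from the example above.
const persRules = [
  { if: { pos: "verb", mood: "*" }, unless: { mood: ["part", "inf"] } }
];

console.log(isEnabled(persRules, { pos: "verb", mood: "ind" }));  // prints true
console.log(isEnabled(persRules, { pos: "verb", mood: "part" })); // prints false
console.log(isEnabled(persRules, { pos: "noun" }));               // prints false
```

The value "ind" is just a placeholder for some mood other than "part" or "inf"; use whatever value keys your own configuration defines.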
The following is a complete example of an attribute object:
"pers" : {
  "long" : "Person",
  "short" : "pers",
  "values" : {
    "1st" : {
      "long" : "first person",
      "short" : "1st",
      "postag" : "1"
    },
    "2nd" : {
      "long" : "second person",
      "short" : "2nd",
      "postag" : "2"
    },
    "3rd" : {
      "long" : "third person",
      "short" : "3rd",
      "postag" : "3"
    }
  },
  "rules" : [
    {
      "if" : {
        "pos" : "verb",
        "mood" : "*"
      },
      "unless" : {
        "mood" : [ "part", "inf" ]
      }
    }
  ]
}
This defines an attribute with the key pers. The descriptive (long) name of this attribute is "Person" and the short name is "pers". It can have the values "1st", "2nd", or "3rd". When serialized in a postag, these values are represented by "1", "2" and "3" respectively. This attribute is only available if the part of speech (pos) of the token is set to "verb" and the mood attribute has a value, and it is not available if the mood attribute is set to either "part" or "inf" (i.e. it is not available for participles or infinitives).
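With attribute values defined, the interplay between styledThrough and the optional style key can also be sketched. The example below looks up the color for a token by following styledThrough to the matching attribute value. The function name tokenColor, the pos values and the colors are hypothetical, and returning null when no style is configured is an assumption of this sketch, not documented Arethusa behavior.

```javascript
// Assumption: a minimal configuration with styled and unstyled pos values.
const styleConfig = {
  styledThrough: "pos",
  attributes: {
    pos: {
      long: "Part of Speech",
      short: "pos",
      values: {
        verb: { long: "verb", short: "verb", postag: "v", style: { color: "red" } },
        noun: { long: "noun", short: "noun", postag: "n", style: { color: "blue" } },
        adv:  { long: "adverb", short: "adv", postag: "d" } // no style key
      }
    }
  }
};

// Hypothetical helper: find the color for a token, or null if the token has
// no value for the styling attribute or that value defines no style.
function tokenColor(tokenAttrs, cfg) {
  const attr = cfg.attributes[cfg.styledThrough];
  const value = attr && attr.values[tokenAttrs[cfg.styledThrough]];
  return value && value.style ? value.style.color : null;
}

console.log(tokenColor({ pos: "verb" }, styleConfig)); // prints red
console.log(tokenColor({ pos: "adv" }, styleConfig));  // prints null
```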
The mappings object defines how to map the results of one or more morphological parsers to the attributes you have just defined for this particular treebanking format. Each key of this object should be the name you have configured for the retriever plugin which provides Arethusa with access to the parser (see Adding-a-new-Morphology-Service-to-Arethusa). You can map both attributes and attribute values.
Each morph retriever object has two keys: attributes and values.
attributes - defines how attributes returned by the morphological retriever are mapped to keys in the attributes object you have just configured. Each key of this object should be the key for the attribute in the morphological parser output, and its value should be the key you have defined for this attribute in this Arethusa configuration file.
values - defines how attribute values returned by the morphological retriever are mapped to attribute values in the attributes object you have just configured. Each key of this object should be the key for the attribute in this Arethusa configuration file (not the morphological parser attribute key), and its value should be an object which maps possible values for that attribute, as reported by the morphological parser, to the value you want to use (i.e. the valid value as per this Arethusa configuration file).
The following example defines a mapping for the BspMorphRetriever, a retriever that talks to a morphological service adhering to the Morphology Service API, which uses the Alpheios Lexicon Schema:
"BspMorphRetriever" : {
  "attributes" : {
    "pofs" : "pos",
    "comp" : "degree"
  },
  "values" : {
    "pos" : {
      "verb\nparticiple": "verb"
    },
    "tense" : {
      "pluperfect" : "plusquamperfect"
    }
  }
}
The Alpheios Lexicon Schema uses "pofs" to name the part of speech attribute, and the Arethusa configuration uses "pos".
The Alpheios Lexicon Schema uses "comp" to name the degree attribute, and the Arethusa configuration uses "degree".
The Alpheios Lexicon Schema reports "verb\nparticiple" as the value of the part of speech (pos) attribute for verbal participles, and in our configuration we want to describe these simply as "verb".
The Alpheios Lexicon Schema reports "pluperfect" as the value of the tense attribute, where our configuration uses "plusquamperfect".
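The mapping above can be read as a two-step translation: first rename the attribute key, then translate the value using the map registered under the renamed (Arethusa) key. The sketch below shows one way this could be applied to a parser result. The function name mapParserOutput and the sample parser output are hypothetical, and passing unmapped keys and values through unchanged is an assumption; this is not the retriever's actual code.

```javascript
// The mapping from the example above, written as a JavaScript object.
const mappings = {
  BspMorphRetriever: {
    attributes: { pofs: "pos", comp: "degree" },
    values: {
      pos: { "verb\nparticiple": "verb" },
      tense: { pluperfect: "plusquamperfect" }
    }
  }
};

// Hypothetical helper: translate parser attributes into Arethusa attributes.
// Assumption: anything without a mapping is passed through unchanged.
function mapParserOutput(parserAttrs, mapping) {
  const result = {};
  Object.keys(parserAttrs).forEach((parserKey) => {
    // Step 1: rename the attribute key (e.g. "pofs" becomes "pos").
    const key = mapping.attributes[parserKey] || parserKey;
    // Step 2: translate the value using the map stored under the Arethusa key.
    const value = parserAttrs[parserKey];
    const valueMap = mapping.values[key] || {};
    result[key] = valueMap[value] || value;
  });
  return result;
}

// Sample parser output (an assumption for this example).
const mapped = mapParserOutput(
  { pofs: "verb\nparticiple", tense: "pluperfect", pers: "3rd" },
  mappings.BspMorphRetriever
);
console.log(mapped);
// prints { pos: 'verb', tense: 'plusquamperfect', pers: '3rd' }
```

Note that the values map is keyed by "pos" rather than "pofs", which is why the key has to be renamed before the value is looked up.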