Input data can be provided as `.jsonl` [example] or `.csv` [example].
The input is a list (or rows) of samples.
Every sample contains a text and an opinion mentioned in it, i.e. a source->target relation.
Each sample contains the following mandatory parameters:
- `id` (type: `uint`) -- sample identifier;
- `label` (type: `int`) -- for training only, denotes a class;
  - value in range `[0, c]`, where `c` corresponds to the number of classes;
- `text` (type: `str` or `list`) -- a string of separated terms, or a list of terms in the case of the `jsonl` format;
Optional parameters:
- `s_ind` (type: `int`) -- index of the source term in the `text` string/list;
- `t_ind` (type: `int`) -- index of the target term in the `text` string/list;
- `opinion_id` (type: `uint`) -- used for grouping opinions, denoting the index in a group;
- `entities` (type: `str`) -- comma-separated term indices that correspond to entities, in order of their appearance in the text;
- `frames` (type: `str`) -- comma-separated term indices that correspond to connotation frames, in order of their appearance in the text; important for sentiment-classification related tasks;
- `frame_connots_uint` (type: `str`) -- comma-separated connotation scores of the frames, each taken from the three-value `int` scale `{-1, 0, 1}`;
- `syn_subjs` (type: `str`) -- comma-separated term indices synonymous to the source `s_ind`;
- `syn_objs` (type: `str`) -- comma-separated term indices synonymous to the target `t_ind`;
- `pos_tags` (type: `str`) -- comma-separated part-of-speech tags; their number must exactly match the number of terms in `text`.
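To make the field list above more concrete, here is a minimal sketch of a single `jsonl` sample together with a small sanity check. The sample content, the part-of-speech tag names, the helper names (`parse_indices`, `check_sample`) and the whitespace splitting of a string `text` are all illustrative assumptions, not part of the project's API.

```python
import json

MANDATORY = ("id", "text")  # `label` is mandatory for training only


def parse_indices(value):
    """Turn a comma-separated string such as '0,3' into [0, 3]."""
    return [int(v) for v in value.split(",") if v.strip()]


def check_sample(sample, training=False):
    """Return a list of problems found in a sample dict (empty if none)."""
    problems = []
    for key in MANDATORY + (("label",) if training else ()):
        if key not in sample:
            problems.append(f"missing mandatory field: {key}")

    text = sample.get("text", [])
    # Assumption: a string `text` is split on whitespace.
    terms = text.split() if isinstance(text, str) else list(text)

    for key in ("s_ind", "t_ind"):
        if key in sample and not 0 <= sample[key] < len(terms):
            problems.append(f"{key} is out of range")

    if "entities" in sample:
        indices = parse_indices(sample["entities"])
        if any(not 0 <= i < len(terms) for i in indices):
            problems.append("entities contains an index out of range")

    if "pos_tags" in sample:
        if len(sample["pos_tags"].split(",")) != len(terms):
            problems.append("pos_tags length differs from the number of terms")
    return problems


# Hypothetical sample: "Alice" is the source term, "Bob" is the target term.
sample = {
    "id": 0,
    "label": 1,
    "text": ["Alice", "strongly", "supports", "Bob"],
    "s_ind": 0,
    "t_ind": 3,
    "entities": "0,3",
    "pos_tags": "NOUN,ADV,VERB,NOUN",
}

line = json.dumps(sample)  # one sample per line in a .jsonl file
print(line)
print(check_sample(json.loads(line), training=True))  # prints []
```

In a `.jsonl` file every line would hold one such JSON object; in `.csv` the same fields would become columns, with `text` stored as a single string.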
We support the `model.txt` format, which provides:
- first row -- shape of the embedding matrix
- every following row -- a word and its vector
Embeddings can be obtained from the NLPL repository
- See how we do this in the following downloading script
Text-based embeddings are then converted into `vocab.txt` and an `embedding.npz` matrix.
[see code implementation]
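As a rough illustration of the conversion described above, the sketch below parses the text format (a shape row followed by one word and its vector per row) and writes the two output files. It is not the project's implementation: the function names, the one-word-per-line layout of `vocab.txt`, and the default array key inside the `.npz` file are assumptions made for this example.

```python
def parse_text_embedding(lines):
    """Parse rows of a text embedding: 'rows dim', then 'word v1 ... v_dim'."""
    it = iter(lines)
    rows, dim = (int(x) for x in next(it).split())

    vocab, vectors = [], []
    for line in it:
        line = line.rstrip()
        if not line:
            continue  # skip blank rows
        # Split from the right so that the last `dim` fields are the vector.
        parts = line.rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError(f"malformed row: {line!r}")
        vocab.append(parts[0])
        vectors.append([float(v) for v in parts[1:]])

    if len(vocab) != rows:
        raise ValueError(f"expected {rows} rows, got {len(vocab)}")
    return vocab, vectors


def save_converted(vocab, vectors, vocab_path="vocab.txt",
                   npz_path="embedding.npz"):
    """Write the vocabulary and the matrix (file layout is an assumption)."""
    import numpy as np  # third-party dependency, needed only for saving

    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    # Positional arrays are stored by np.savez under the key 'arr_0'.
    np.savez(npz_path, np.asarray(vectors, dtype=np.float32))


# Tiny in-memory example instead of a real model.txt file.
rows = ["2 3\n", "cat 0.1 0.2 0.3\n", "dog 1 2 3\n"]
vocab, vectors = parse_text_embedding(rows)
print(vocab)       # prints ['cat', 'dog']
print(vectors[1])  # prints [1.0, 2.0, 3.0]
```

For a real file, the rows could be streamed with `open("model.txt", encoding="utf-8")` and passed directly to `parse_text_embedding`, followed by a call to `save_converted`.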