| Feature | Feature Area | Feature name | Name in extracted dataframe | Function | Description | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Raw sequence length/total number of characters | surface | raw_sequence_length | raw_sequence_length | get_raw_sequence_length | Number of characters in the text (including whitespace) | |
| Number of tokens | surface | n_tokens | n_tokens | get_num_tokens | Number of tokens in the text | |
| Number of sentences | surface | n_sentences | n_sentences | get_num_sentences | Number of sentences in the text | |
| Number of tokens per sentence | surface | tokens_per_sentence | tokens_per_sentence | get_num_tokens_per_sentence | Average number of tokens per sentence: n_tokens / n_sentences | |
| Number of characters | surface | n_characters | n_characters | get_num_characters | Number of characters in the text (excluding whitespace) | |
| Characters per sentence | surface | characters_per_sentence | characters_per_sentence | get_chars_per_sentence | Average number of characters per sentence: n_characters / n_sentences | |
| Raw sequence length per sentence | surface | raw_length_per_sentence | raw_length_per_sentence | get_raw_length_per_sentence | Average raw sequence length (characters including whitespace) per sentence: raw_sequence_length / n_sentences | |
| Average word length | surface | avg_word_length | avg_word_length | get_avg_word_length | Average word length (in characters): n_characters / n_tokens | |
| Number of types | surface | n_types | n_types | get_num_types | Number of types (unique tokens) in the text | |
| Number of long words | surface | n_long_words | n_long_words | get_num_long_words | Number of long words (in terms of characters) | Threshold for what is considered a long word defaults to >6 characters; can be adapted in the config |
| Number of lemmas | surface | n_lemmas | n_lemmas | get_num_lemmas | Number of lemmas in the text | |
| Token frequencies | surface | token_freqs | token_freqs | get_token_freqs | Token frequencies of the types in the text | As this produces a list in a column, writing to file has to be handled accordingly |
| Number of lexical tokens | pos | n_lexical_tokens | n_lexical_tokens | get_num_lexical_tokens | Number of lexical tokens (tokens with upos tag NOUN, VERB, ADJ, or ADV) | |
| POS variability | pos | pos_variability | pos_variability | get_pos_variability | POS variability of the text: (number of unique upos tags in the text) / n_tokens | |
| Number of tokens with upos tag {pos} | pos | n_per_pos | n_{pos} | get_num_per_pos | Number of tokens with a given upos tag in the text. Takes a list of upos tags to extract this feature for | pos_list defaults to all upos tags; if you only need a subset, this can be adapted in the config |
| Lemma token ratio | lexical_richness | lemma_token_ratio | lemma_token_ratio | get_lemma_token_ratio | Lemma token ratio of the text: n_lemmas / n_tokens | |
| Type token ratio | lexical_richness | ttr | ttr | get_ttr | Type token ratio of the text: n_types / n_tokens | |
| Root type token ratio | lexical_richness | rttr | rttr | get_rttr | Root type token ratio of the text: sqrt(n_types / n_tokens) | |
| Corrected type token ratio | lexical_richness | cttr | cttr | get_cttr | Corrected type token ratio of the text: n_types / sqrt(2 * n_tokens) | |
| Herdan's C | lexical_richness | herdan_c | herdan_c | get_herdan_c | Herdan's C of the text: log(n_types) / log(n_tokens) | |
| Summer's type token ratio/index | lexical_richness | summer_index | summer_index | get_summer_index | Summer's index of the text: log(log(n_types)) / log(log(n_tokens)) | |
| Dugast's Uber index | lexical_richness | dugast_u | dugast_u | get_dugast_u | Dugast's Uber index of the text: log(n_types)^2 / (log(n_tokens) - log(n_types)) | |
| Maas' type token ratio/index | lexical_richness | maas_index | maas_index | get_maas_index | Maas' index of the text: (n_tokens - n_types) / log(n_types)^2 | |
| Number of local hapax legomena | lexical_richness | n_hapax_legomena | n_hapax_legomena | get_n_hapax_legomena | Number of hapax legomena (tokens that occur only once) in the text | |
| Number of global token hapax legomena | lexical_richness | n_global_token_hapax_legomena | n_global_token_hapax_legomena | get_n_global_token_hapax_legomena | Number of tokens in the text instance that are hapax legomena in the entire corpus (occur only once across the corpus) | |
| Number of global lemma hapax legomena | lexical_richness | n_global_lemma_hapax_legomena | n_global_lemma_hapax_legomena | get_n_global_lemma_hapax_legomena | Number of lemmas in the text instance that are hapax legomena in the entire corpus (occur only once across the corpus) | |
| Number of hapax dislegomena | lexical_richness | n_hapax_dislegomena | n_hapax_dislegomena | get_n_hapax_dislegomena | Number of hapax dislegomena (tokens that occur once or twice) in the text | |
| Number of global token hapax dislegomena | lexical_richness | n_global_token_hapax_dislegomena | n_global_token_hapax_dislegomena | get_n_global_token_hapax_dislegomena | Number of tokens in the text instance that are hapax dislegomena in the entire corpus (occur once or twice across the corpus) | |
| Number of global lemma hapax dislegomena | lexical_richness | n_global_lemma_hapax_dislegomena | n_global_lemma_hapax_dislegomena | get_n_global_lemma_hapax_dislegomena | Number of lemmas in the text instance that are hapax dislegomena in the entire corpus (occur once or twice across the corpus) | |
| Sichel's S | lexical_richness | sichel_s | sichel_s | get_sichel_s | Sichel's S of the text: n_hapax_dislegomena / n_types | |
| Global Sichel's S | lexical_richness | global_sichel_s | global_sichel_s | get_global_sichel_s | Global Sichel's S of the text: n_global_token_hapax_dislegomena / n_types | |
| Lexical density | lexical_richness | lexical_density | lexical_density | get_lexical_density | Lexical density of the text: n_lexical_tokens / n_tokens | |
| Giroud's index | lexical_richness | giroud_index | giroud_index | get_giroud_index | Giroud's index of the text: n_types / sqrt(n_tokens) | |
| Measure of Textual Lexical Diversity (MTLD) | lexical_richness | mtld | mtld | get_mtld | | For definition, check https://link.springer.com/article/10.3758/BRM.42.2.381 |
| Hypergeometric Distribution Diversity (HD-D) | lexical_richness | hdd | hdd | get_hdd | | For definition, check https://link.springer.com/article/10.3758/BRM.42.2.381 |
| Moving-average type token ratio (MATTR) | lexical_richness | mattr | mattr | get_mattr | Calculates the TTR for a sliding window of n tokens, then takes the average | |
| Mean segmental type token ratio (MSTTR) | lexical_richness | msttr | msttr | get_msttr | Divides the text into n segments, calculates the TTR for all of them, then takes the average | |
| Yule's K | lexical_richness | yule_k | yule_k | get_yule_k | Yule's characteristic constant of vocabulary richness | For definition, check https://quantling.org/~hbaayen/publications/TweedieBaayen1998.pdf |
| Simpson's D | lexical_richness | simpsons_d | simpsons_d | get_simpsons_d | | For definition, check https://quantling.org/~hbaayen/publications/TweedieBaayen1998.pdf |
| Herdan's Vm | lexical_richness | herdan_v | herdan_v | get_herdan_v | | For definition, check https://quantling.org/~hbaayen/publications/TweedieBaayen1998.pdf |
| Number of syllables | readability | n_syllables | n_syllables | get_num_syllables | Number of syllables in the text | Only implemented for spacy backbone |
| Number of monosyllables | readability | n_monosyllables | n_monosyllables | get_num_monosyllables | Number of monosyllables (words with only one syllable) in the text | |
| Number of polysyllables | readability | n_polysyllables | n_polysyllables | get_num_polysyllables | Number of polysyllables (words with three or more syllables) in the text | |
| Flesch reading ease | readability | flesch_reading_ease | flesch_reading_ease | get_flesch_reading_ease | Flesch reading ease score of the text | For reference: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_Reading_Ease |
| Flesch-Kincaid Grade Level | readability | flesch_kincaid_grade | flesch_kincaid_grade | get_flesch_kincaid_grade | Flesch-Kincaid grade level of the text | For reference: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch.E2.80.93Kincaid_Grade_Level |
| Automated Readability Index (ARI) | readability | ari | ari | get_ari | | For reference: https://en.wikipedia.org/wiki/Automated_readability_index |
| Simple Measure of Gobbledygook (SMOG) | readability | smog | smog | get_smog | | For reference: https://en.wikipedia.org/wiki/SMOG |
| Coleman-Liau Index (CLI) | readability | cli | cli | get_cli | | For reference: https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index |
| Gunning-fog Index | readability | gunning_fog | gunning_fog | get_gunning_fog | | For reference: https://en.wikipedia.org/wiki/Gunning_fog_index |
| LIX | readability | lix | lix | get_lix | | For reference: https://en.wikipedia.org/wiki/Lix_(readability_test) |
| RIX | readability | rix | rix | get_rix | | For reference: https://www.jstor.org/stable/40031755 |
| Compressibility | information | compressibility | compressibility | get_compressibility | Compressibility is the ratio of the length of the compressed text to the length of the original text. This is a proxy for the Kolmogorov complexity of the text. | |
| Entropy | information | entropy | entropy | get_entropy | Shannon entropy of the text | |
| Number of named entities | entities | n_entitites | n_entitites | get_num_entities | Number of named entities in the text | |
| Number of named entities of type {ent} | entities | n_per_entity_type | n_{ent} | get_num_per_entity_type | Number of named entities in the text with type {ent}. Takes a list of entity types to extract this feature for | ent_types defaults to all possible entity types; if you only need a subset, this can be adapted in the config |
| Number of hedge words | semantic | n_hedges | n_hedges | get_num_hedges | Number of hedge words in the text (words expressing uncertainty of the speaker) | Requires a hedge lexicon; currently only supported in English |
| Hedges token ratio | semantic | hedges_ratio | hedges_ratio | get_hedges_ratio | Ratio of hedges in the text: n_hedges / n_tokens | |
| Average number of synsets | semantic | avg_n_synsets | avg_n_synsets | get_avg_num_synsets | Average number of WordNet synsets of lexical tokens; proxy for ambiguity/polysemy | |
| Number of words with a low number of synsets per pos | semantic | n_low_synsets_per_pos | n_low_synsets_{pos} | get_low_synsets_{pos} | Number of lexical tokens with a low number of synsets per pos tag | Threshold defaults to 2 |
| Number of words with a high number of synsets per pos | semantic | n_high_synsets_per_pos | n_high_synsets_{pos} | get_high_synsets_{pos} | Number of lexical tokens with a high number of synsets per pos tag | Threshold defaults to 5 |
| Number of words with a low number of synsets | semantic | n_low_synsets | n_low_synsets | get_num_low_synsets | Number of lexical tokens with a low number of synsets | Threshold defaults to 2 |
| Number of words with a high number of synsets | semantic | n_high_synsets | n_high_synsets | get_num_high_synsets | Number of lexical tokens with a high number of synsets | Threshold defaults to 5 |
| Average valence | emotion | avg_valence | avg_valence | get_avg_valence | Average valence of the tokens in the text | For reference: https://saifmohammad.com/WebPages/nrc-vad.html |
| Number of low valence tokens | emotion | n_low_valence | n_low_valence | get_n_low_valence | Number of low valence tokens in the text | Threshold defaults to 0.33; For reference: https://saifmohammad.com/WebPages/nrc-vad.html |
| Number of high valence tokens | emotion | n_high_valence | n_high_valence | get_n_high_valence | Number of high valence tokens in the text | Threshold defaults to 0.66; For reference: https://saifmohammad.com/WebPages/nrc-vad.html |
| Average arousal | emotion | avg_arousal | avg_arousal | get_avg_arousal | Average arousal of the tokens in the text | For reference: https://saifmohammad.com/WebPages/nrc-vad.html |
| Number of low arousal tokens | emotion | n_low_arousal | n_low_arousal | get_n_low_arousal | Number of low arousal tokens in the text | Threshold defaults to 0.33; For reference: https://saifmohammad.com/WebPages/nrc-vad.html |
| Number of high arousal tokens | emotion | n_high_arousal | n_high_arousal | get_n_high_arousal | Number of high arousal tokens in the text | Threshold defaults to 0.66; For reference: https://saifmohammad.com/WebPages/nrc-vad.html |
| Average dominance | emotion | avg_dominance | avg_dominance | get_avg_dominance | Average dominance of the tokens in the text | For reference: https://saifmohammad.com/WebPages/nrc-vad.html |
| Number of low dominance tokens | emotion | n_low_dominance | n_low_dominance | get_n_low_dominance | Number of low dominance tokens in the text | Threshold defaults to 0.33; For reference: https://saifmohammad.com/WebPages/nrc-vad.html |
| Number of high dominance tokens | emotion | n_high_dominance | n_high_dominance | get_n_high_dominance | Number of high dominance tokens in the text | Threshold defaults to 0.66; For reference: https://saifmohammad.com/WebPages/nrc-vad.html |
| Average emotion intensity for {emotion} | emotion | avg_intensity | avg_intensity_{emotion} | get_avg_intensity | Average intensity of an emotion; takes a list of emotions | For reference: https://saifmohammad.com/WebPages/AffectIntensity.htm |
| Number of high intensity tokens for {emotion} | emotion | n_high_intensity | n_high_intensity_{emotion} | get_n_high_intensity | Number of high intensity tokens for a given emotion | Threshold defaults to 0.66; For reference: https://saifmohammad.com/WebPages/AffectIntensity.htm |
| Number of low intensity tokens for {emotion} | emotion | n_low_intensity | n_low_intensity_{emotion} | get_n_low_intensity | Number of low intensity tokens for a given emotion | Threshold defaults to 0.33; For reference: https://saifmohammad.com/WebPages/AffectIntensity.htm |
| Sentiment score | emotion | sentiment_score | sentiment_score | get_sentiment_score | Difference between the number of positive and negative sentiment words in the text: (n_positive_sentiment - n_negative_sentiment) / n_tokens | Values in range (-1, 1), where 0 is neutral, -1 is completely negative sentiment, and 1 is completely positive sentiment; For reference: https://saifmohammad.com/WebDocs/NRC-Emotion-Lexicon.htm |
| Number of negative sentiment tokens | emotion | n_negative_sentiment | n_negative_sentiment | get_n_negative_sentiment | Number of negative sentiment tokens | For reference: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm |
| Number of positive sentiment tokens | emotion | n_positive_sentiment | n_positive_sentiment | get_n_positive_sentiment | Number of positive sentiment tokens | For reference: https://saifmohammad.com/WebDocs/NRC-Emotion-Lexicon.htm |
| Average concreteness | psycholinguistic | avg_concreteness | avg_concreteness | get_avg_concreteness | Average human concreteness ratings of the tokens in the text | For reference: https://link.springer.com/article/10.3758/s13428-013-0403-5 |
| Average standard deviation of concreteness | psycholinguistic | avg_sd_concreteness | avg_sd_concreteness | get_avg_sd_concreteness | Average standard deviation in the human concreteness ratings of the tokens in the text | For reference: https://link.springer.com/article/10.3758/s13428-013-0403-5 |
| Number of low concreteness tokens | psycholinguistic | n_low_concreteness | n_low_concreteness | get_n_low_concreteness | Number of tokens with a low concreteness rating | For reference: https://link.springer.com/article/10.3758/s13428-013-0403-5 |
| Number of high concreteness tokens | psycholinguistic | n_high_concreteness | n_high_concreteness | get_n_high_concreteness | Number of tokens with a high concreteness rating | For reference: https://link.springer.com/article/10.3758/s13428-013-0403-5 |
| Number of tokens with controversial concreteness | psycholinguistic | n_controversial_concreteness | n_controversial_concreteness | get_n_controversial_concreteness | Number of tokens with a high standard deviation in the human concreteness rating | For reference: https://link.springer.com/article/10.3758/s13428-013-0403-5 |
| Average age of acquisition | psycholinguistic | avg_aoa | avg_aoa | get_avg_aoa | Average age of acquisition rating | For reference: https://link.springer.com/article/10.3758/s13428-016-0811-4 |
| Average standard deviation of age of acquisition | psycholinguistic | avg_sd_aoa | avg_sd_aoa | get_avg_sd_aoa | Average standard deviation in the age of acquisition rating | For reference: https://link.springer.com/article/10.3758/s13428-016-0811-4 |
| Number of low age of acquisition tokens | psycholinguistic | n_low_aoa | n_low_aoa | get_n_low_aoa | Number of low age of acquisition tokens | For reference: https://link.springer.com/article/10.3758/s13428-016-0811-4 |
| Number of high age of acquisition tokens | psycholinguistic | n_high_aoa | n_high_aoa | get_n_high_aoa | Number of high age of acquisition tokens | For reference: https://link.springer.com/article/10.3758/s13428-016-0811-4 |
| Number of tokens with controversial age of acquisition | psycholinguistic | n_controversial_aoa | n_controversial_aoa | get_n_controversial_aoa | Number of tokens with a high standard deviation in the age of acquisition rating | For reference: https://link.springer.com/article/10.3758/s13428-016-0811-4 |
| Average prevalence | psycholinguistic | avg_prevalence | avg_prevalence | get_avg_prevalence | Average human prevalence ratings of the tokens in the text | For reference: https://link.springer.com/article/10.3758/s13428-018-1077-9 |
| Number of low prevalence tokens | psycholinguistic | n_low_prevalence | n_low_prevalence | get_n_low_prevalence | Number of low prevalence tokens | For reference: https://link.springer.com/article/10.3758/s13428-018-1077-9 |
| Number of high prevalence tokens | psycholinguistic | n_high_prevalence | n_high_prevalence | get_n_high_prevalence | Number of high prevalence tokens | For reference: https://link.springer.com/article/10.3758/s13428-018-1077-9 |
| Average socialness | psycholinguistic | avg_socialness | avg_socialness | get_avg_socialness | Average human socialness ratings of the tokens in the text | For reference: https://link.springer.com/article/10.3758/s13428-022-01810-x |
| Average standard deviation of socialness | psycholinguistic | avg_sd_socialness | avg_sd_socialness | get_avg_sd_socialness | Average standard deviation in the human socialness ratings of the tokens in the text | For reference: https://link.springer.com/article/10.3758/s13428-022-01810-x |
| Number of high socialness tokens | psycholinguistic | n_high_socialness | n_high_socialness | get_n_high_socialness | Number of high socialness tokens | For reference: https://link.springer.com/article/10.3758/s13428-022-01810-x |
| Number of low socialness tokens | psycholinguistic | n_low_socialness | n_low_socialness | get_n_low_socialness | Number of low socialness tokens | For reference: https://link.springer.com/article/10.3758/s13428-022-01810-x |
| Number of tokens with controversial socialness | psycholinguistic | n_controversial_socialness | n_controversial_socialness | get_n_controversial_socialness | Number of tokens with a high standard deviation in the human socialness rating | For reference: https://link.springer.com/article/10.3758/s13428-022-01810-x |
| Average iconicity | psycholinguistic | avg_iconicity | avg_iconicity | get_avg_iconicity | Average human iconicity ratings of the tokens in the text | For reference: https://link.springer.com/article/10.3758/s13428-023-02112-6 |
| Average standard deviation of iconicity | psycholinguistic | avg_sd_iconicity | avg_sd_iconicity | get_avg_sd_iconicity | Average standard deviation in the human iconicity ratings of the tokens in the text | For reference: https://link.springer.com/article/10.3758/s13428-023-02112-6 |
| Number of high iconicity tokens | psycholinguistic | n_high_iconicity | n_high_iconicity | get_n_high_iconicity | Number of high iconicity tokens | For reference: https://link.springer.com/article/10.3758/s13428-023-02112-6 |
| Number of low iconicity tokens | psycholinguistic | n_low_iconicity | n_low_iconicity | get_n_low_iconicity | Number of low iconicity tokens | For reference: https://link.springer.com/article/10.3758/s13428-023-02112-6 |
| Number of tokens with controversial iconicity | psycholinguistic | n_controversial_iconicity | n_controversial_iconicity | get_n_controversial_iconicity | Number of tokens with a high standard deviation in the human iconicity rating | For reference: https://link.springer.com/article/10.3758/s13428-023-02112-6 |
| Average sensorimotor score for {sensorimotor} | psycholinguistic | avg_sensorimotor | avg_{var}_sensorimotor | get_avg_sensorimotor | Average sensorimotor score for a given sensorimotor dimension | |
| Average standard deviation of sensorimotor score | psycholinguistic | avg_sd_sensorimotor | avg_sd_{var}_sensorimotor | get_avg_sd_sensorimotor | Average standard deviation in the sensorimotor scores of the tokens in the text | |
| Number of tokens with low sensorimotor rating | psycholinguistic | n_low_sensorimotor | n_low_{var}_sensorimotor | get_n_low_sensorimotor | Number of tokens with low sensorimotor rating | |
| Number of tokens with high sensorimotor rating | psycholinguistic | n_high_sensorimotor | n_high_{var}_sensorimotor | get_n_high_sensorimotor | Number of tokens with high sensorimotor rating | |
| Number of tokens with controversial sensorimotor rating | psycholinguistic | n_controversial_sensorimotor | n_controversial_{var}_sensorimotor | get_n_controversial_sensorimotor | Number of tokens with a high standard deviation in the sensorimotor rating | |
| Morphological feature counts | morphological | n_per_morph_feature | n_{pos}_{feature}_{val} | get_morph_feats | Number of tokens with {pos} {feature} {val} (e.g. VERB VerbForm Inf), takes a dictionary of pos and associated features and values for them | Default dictionary is the full set of UD options; adapt if you do not need all of them (may be language-specific) |
| Dependency tree width | dependency | tree_width | tree_width | get_tree_width | Maximum number of siblings of a node at any level | |
| Dependency tree depth | dependency | tree_depth | tree_depth | get_tree_depth | Maximum distance of a token to the root of the dependency tree | |
| Tree branching factor | dependency | tree_branching | tree_branching | get_tree_branching | Average number of children of a token in the dependency tree | |
| Tree ramification factor | dependency | ramification_factor | ramification_factor | get_ramification_factor | Average number of children per level | |
| Number of noun chunks | dependency | n_noun_chunks | n_noun_chunks | get_n_noun_chunks | Number of noun chunks in the dependency tree | |
| Number of dependencies of type {type} | dependency | n_per_dependency_type | n_dependency_{type} | get_n_per_dependency_type | Number of dependency relations of type {type} in the text | |
| {Feature} token ratio | ratios/normalization | {feature}_token_ratio | | get_feature_token_ratio | Normalizes a given feature by the number of tokens: {feature} / n_tokens | |
| {Feature} type ratio | ratios/normalization | {feature}_type_ratio | | get_feature_type_ratio | Normalizes a given feature by the number of types: {feature} / n_types | |
| {Feature} sentence ratio | ratios/normalization | {feature}_sentence_ratio | | get_feature_sentence_ratio | Normalizes a given feature by the number of sentences: {feature} / n_sentences | |
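
Many of the surface, lexical richness, and information features above are plain functions of token, type, and frequency counts. The sketch below is illustrative only: it assumes naive whitespace tokenization, a crude sentence count, and zlib as the compressor, none of which reflect the library's actual implementation, but it shows how several of the documented formulas combine.

```python
import math
import zlib
from collections import Counter

def sketch_features(text: str) -> dict:
    """Illustrative only: whitespace tokenization, not the library's NLP backbone."""
    tokens = text.lower().split()
    freqs = Counter(tokens)

    n_tokens = len(tokens)
    n_types = len(freqs)
    # Crude sentence count based on end punctuation (assumption for this sketch)
    n_sentences = max(text.count(".") + text.count("!") + text.count("?"), 1)

    # Hapax counts as defined in the table
    n_hapax_legomena = sum(1 for c in freqs.values() if c == 1)
    n_hapax_dislegomena = sum(1 for c in freqs.values() if c <= 2)

    return {
        "n_tokens": n_tokens,
        "n_types": n_types,
        "tokens_per_sentence": n_tokens / n_sentences,
        # Type-token ratio family, using the formulas documented above
        "ttr": n_types / n_tokens,
        "rttr": math.sqrt(n_types / n_tokens),
        "cttr": n_types / math.sqrt(2 * n_tokens),
        "herdan_c": math.log(n_types) / math.log(n_tokens),
        "giroud_index": n_types / math.sqrt(n_tokens),
        "n_hapax_legomena": n_hapax_legomena,
        "sichel_s": n_hapax_dislegomena / n_types,
        # Token-level Shannon entropy; the library may define it differently (e.g. over characters)
        "entropy": -sum((c / n_tokens) * math.log2(c / n_tokens) for c in freqs.values()),
        # Compressed length / original length; zlib is just an illustrative compressor choice
        "compressibility": len(zlib.compress(text.encode())) / len(text.encode()),
    }

print(sketch_features("The quick brown fox jumps over the lazy dog. The dog sleeps."))
```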
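The emotion and psycholinguistic features mostly follow one pattern: look each token up in a rating lexicon (NRC VAD, concreteness or age-of-acquisition norms, etc.), average the ratings, and count tokens below or above the low/high thresholds. A toy sketch of that pattern using the documented default thresholds of 0.33 and 0.66; the lexicon here is a made-up stand-in, not the NRC resource, and the counting details may differ from the library's.

```python
from statistics import mean

# Toy valence lexicon standing in for the NRC VAD lexicon (values in [0, 1]).
VALENCE = {"good": 0.9, "bad": 0.1, "day": 0.6, "terrible": 0.05}

def valence_features(tokens, low=0.33, high=0.66):
    # Only tokens covered by the lexicon contribute, as is typical for rating-based features.
    rated = [VALENCE[t] for t in tokens if t in VALENCE]
    return {
        "avg_valence": mean(rated) if rated else None,
        "n_low_valence": sum(1 for v in rated if v < low),
        "n_high_valence": sum(1 for v in rated if v > high),
    }

print(valence_features("it was a good day not a terrible day".split()))
```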
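For the dependency features, a rough sketch of how tree_depth, tree_width, and tree_branching could be derived from a head-index representation of a parsed sentence; the head indices stand in for whatever the parsing backbone returns, and the library's exact definitions may differ.

```python
from collections import Counter

def tree_features(heads):
    """heads[i] is the index of token i's head; the root points to itself."""
    n = len(heads)

    # Depth of each token = number of head-hops up to the root.
    def depth(i):
        d = 0
        while heads[i] != i:
            i = heads[i]
            d += 1
        return d

    depths = [depth(i) for i in range(n)]
    children = Counter(h for i, h in enumerate(heads) if h != i)
    level_sizes = Counter(depths)

    return {
        # Maximum distance of a token to the root (tree_depth)
        "tree_depth": max(depths),
        # Maximum number of nodes at any level, used here as a proxy for the documented
        # "maximum number of siblings of a node at any level" (tree_width)
        "tree_width": max(level_sizes.values()),
        # Average number of children per token (tree_branching)
        "tree_branching": sum(children.values()) / n,
    }

# "She reads old books": She -> reads, reads = root, old -> books, books -> reads
print(tree_features([1, 1, 3, 1]))
```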