REST API
NGRAMS has been built following API-first principles. The goal is to make ngram data as easy to access as possible. The API sends and receives data in UTF-8 encoded JSON format.
There are endpoints which enable the following types of requests:
- Search Request — Send a wildcard query and receive matching ngrams.
- Batch Request — Send multiple raw queries at once and receive matching ngrams.
- Ngram Request — Send an ngram id and receive year-based match count information.
- CorpusInfo Request — Get static information about a corpus.
- TotalCounts Request — Get total match counts by ngram length and year.
The REST API is currently in beta status — expect things to change.
By using the API, you agree to our Terms of Service. In short, they read: NGRAMS can be used free of charge, for both commercial and non-commercial purposes. Use requires attribution.
Base URL: https://api.ngrams.dev
We believe that performance is a feature, so our backend tech stack is fully native. To evaluate the system at this early stage, we do not apply any rate limiting: you can send as many requests as you want, as fast as you can. We will adjust this policy if necessary.
We will block clients based on IP address if we detect abnormal usage.
At the moment, the following corpora are available.
Name | Label | #Ngrams |
---|---|---|
English | eng | 23.6 B |
German | ger | 4.5 B |
Russian | rus | 1.5 B |
TL;DR: Here is how to get started, in FAQ style.
In NGRAMS every ngram has an absolute total match count and a relative total match count. The former is the sum of all year-based absolute match counts. The latter is the absolute total match count divided by the absolute total match count of all ngrams of the same length — you can call this the ngram's probability.
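The definition above can be sketched in a few lines of Python. The function names are hypothetical and not part of the API. The sample numbers are taken from the search response example further down in this document. Both sampled ngrams are 2-grams, so under the definition above they should imply the same denominator.

```python
import math


def relative_total_match_count(abs_total: int, length_total: int) -> float:
    """Absolute total match count divided by the absolute total match
    count of all ngrams of the same length (the ngram's probability)."""
    if length_total <= 0:
        raise ValueError("length_total must be positive")
    return abs_total / length_total


def implied_length_total(abs_total: int, rel_total: float) -> float:
    """Invert the definition to recover the denominator from a response."""
    return abs_total / rel_total


# Two 2-grams from the search response example in this document.
total_a = implied_length_total(608657, 2.899120077549673e-7)
total_b = implied_length_total(598094, 2.848807002403643e-7)

# Both values should agree up to floating-point rounding.
same_denominator = math.isclose(total_a, total_b, rel_tol=1e-6)
```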
1. Send a Search Request.
   `GET {base_url}/{corpus}/search?query=my+awesome+ngram`
2. You will receive a list of matching ngrams because the search is not case-sensitive by default, i.e. the list contains `my awesome ngram`, `MY AWESOME NGRAM`, etc. You can add `&flags=cs` to make the search case-sensitive (only one or none match). Each ngram in the list has a `relTotalMatchCount` property.
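As a minimal sketch of the two steps above, the following Python helpers build the search URL and read the probabilities out of a decoded response. `build_search_url` and `probabilities` are hypothetical names. Joining token texts with a single space is an assumption made here for display only.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.ngrams.dev"  # base URL stated in this document


def build_search_url(corpus, query, flags="", limit=None, start=None):
    """Build a search URL; urlencode percent-encodes the query string."""
    params = {"query": query}
    if flags:
        params["flags"] = flags
    if limit is not None:
        params["limit"] = limit
    if start:
        params["start"] = start
    return f"{BASE_URL}/{corpus}/search?{urlencode(params)}"


def probabilities(search_response):
    """Map each matching ngram's text to its relTotalMatchCount.

    Assumption: token texts are joined with one space for readability.
    """
    return {
        " ".join(token["text"] for token in ngram["tokens"]): ngram["relTotalMatchCount"]
        for ngram in search_response["ngrams"]
    }
```

The URL can be fetched with any HTTP client, for example `urllib.request.urlopen`, and the body decoded with `json.load` before it is passed to `probabilities`.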
I need one probability for all cases of this ngram.
1. Send a Search Request with result set collapsing.
   `GET {base_url}/{corpus}/search?query=my+awesome+ngram&flags=cr`
2. You will get one (or none) matching ngram whose absolute total match count is the sum of all cases, i.e. `my awesome ngram`, `MY AWESOME NGRAM`, etc. The relative total match count is derived as described above. The ngram's text will be in case-folded format. The ngram is also called abstract because it was derived from other ngrams and has no 1:1 correspondence with an ngram in the raw dataset.
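To illustrate what collapsing means, here is a client-side sketch that merges the case variants of an uncollapsed result list. It is not the server's implementation: `collapse_case_variants` is a hypothetical helper, and Python's `str.casefold` is assumed to approximate the case folding applied by the API. Summing the relative counts is valid because all variants have the same length and therefore share one denominator.

```python
def collapse_case_variants(ngrams):
    """Sum the counts of ngrams whose token texts are equal after case folding.

    Returns a dict keyed by the case-folded token texts.
    """
    merged = {}
    for ngram in ngrams:
        # Assumption: str.casefold approximates the API's case folding.
        key = tuple(token["text"].casefold() for token in ngram["tokens"])
        entry = merged.setdefault(
            key, {"absTotalMatchCount": 0, "relTotalMatchCount": 0.0}
        )
        entry["absTotalMatchCount"] += ngram["absTotalMatchCount"]
        entry["relTotalMatchCount"] += ngram["relTotalMatchCount"]
    return merged
```

In practice the `cr` flag does this on the server, so the sketch is mainly useful for understanding the result or for post-processing data that was fetched without the flag.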
Fetching the year-based match counts of an ngram requires two requests, but we will support this use case directly some time in the future. See this feature request.
1. Send a Search Request.
   `GET {base_url}/{corpus}/search?query=my+awesome+ngram`
2. You will receive a list of matching ngrams because the search is not case-sensitive by default, i.e. the list contains `my awesome ngram`, `MY AWESOME NGRAM`, etc. You can add `&flags=cs` to make the search case-sensitive (only one or none match). Each ngram in the list has an `id` property.
3. Send an Ngram Request providing an ngram ID.
   `GET {base_url}/{corpus}/{ngram_id}`
4. You will get the full ngram representation with year-based match counts.
I need the frequencies for all cases of this ngram.
This is not supported at the moment because it would require returning year-based match counts of an abstract ngram. See NgramLite for details. You can compute these stats yourself by fetching the full ngram for each ID returned in step 2 above.
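The do-it-yourself approach described above could look like the following sketch. It takes already fetched full ngram objects (one per case variant) and sums their year-based counts. `merge_year_stats` is a hypothetical helper. Adding up `relMatchCount` values assumes that all variants share the same per-year denominator, which should hold for ngrams of equal length.

```python
from collections import defaultdict


def merge_year_stats(ngrams):
    """Sum year-based match counts over several full Ngram objects."""
    abs_by_year = defaultdict(int)
    rel_by_year = defaultdict(float)
    for ngram in ngrams:
        for stat in ngram["stats"]:
            abs_by_year[stat["year"]] += stat["absMatchCount"]
            # Assumption: same ngram length, hence same yearly denominator.
            rel_by_year[stat["year"]] += stat["relMatchCount"]
    return [
        {
            "year": year,
            "absMatchCount": abs_by_year[year],
            "relMatchCount": rel_by_year[year],
        }
        for year in sorted(abs_by_year)
    ]
```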
A search request allows you to send a single wildcard query and receive a set of matching ngrams. This is basically the same type of request issued when using the search interface on https://ngrams.dev. The returned ngrams are sorted by decreasing total match count. Large sets are sent in chunks making use of pagination.
GET /{corpus}/search
corpus
string
The label of the corpus to search, see corpora.
query
string
The percent-encoded query string.
flags
string
optional
Enable search flags by adding the respective character sequence to the string.
- `cs`: Search is case-sensitive.
- `cr`: Collapse the result set by case-folding and then merging equal ngrams.
- `ep`: Exclude ngrams from the result set where wildcards matched punctuation marks as of Unicode category P, see also here.
- `es`: Exclude ngrams from the result set where wildcards matched sentence boundary tags, namely `_START_` and `_END_`.
- `ri` (raw query): Do not interpret query operators. No need for escape sequences.
- `rt` (raw query): Do not tokenize query terms. Split only on ASCII whitespace.
- `rn` (raw query): Do not normalize query string characters.
- `ra` (raw query): All of the above (`ri` + `rt` + `rn`).
limit
number
optional
default: 100
max: 100
The maximum number of ngrams to return. The limit is applied before ngram filtering takes place, which means that the actual size of the result set could be smaller if any of the flags `cs`, `cr`, `ep`, or `es` is used.
start
string
optional
An opaque token to fetch the next chunk of a result set (pagination). A start token is included in a successful search result if there are possibly more matching ngrams.
The HTTP status code tells whether a request was successful. A code other than 200 is considered failure. The response to a 400 bad request contains body data with error details.
Code | Body | Description |
---|---|---|
200 OK | SearchResponse | The request was successful. |
400 Bad Request | ErrorResponse | The request failed due to client error. |
404 Not Found | no | The corpus is unknown. |
500 Internal Server Error | no | Try again later. |
curl 'https://api.ngrams.dev/eng/search?query=hello+*&flags=cs&limit=3'
# OR
curl -G https://api.ngrams.dev/eng/search \
--data-urlencode query='hello *' \
-d flags=cs \
-d limit=3
200 OK
Response Body
// SearchResponse object,
// 2 instead of 3 ngrams due to post-retrieval case-sensitive filtering.
{
"queryTokens": [
{ "text": "hello", "kind": "TERM" },
{ "text": "*", "kind": "STAR" }
],
"ngrams": [
{
"id": "d975b1edafaf5aa521f6aee0d7efbe06",
"absTotalMatchCount": 608657,
"relTotalMatchCount": 2.899120077549673e-7,
"tokens": [
{ "text": "hello", "kind": "TERM" },
{ "text": ",", "kind": "TERM", "inserted": true }
]
},
{
"id": "983f5221b490f979d836276d3d986ef2",
"absTotalMatchCount": 598094,
"relTotalMatchCount": 2.848807002403643e-7,
"tokens": [
{ "text": "hello", "kind": "TERM" },
{ "text": ".", "kind": "TERM", "inserted": true }
]
}
],
"nextPageToken": "157c30ede3ed098320eadbaf1a807dd17228ef6880f959e08605f29fcbfc14a894ffea81eec4fdb609a3e331772e9bd5",
"nextPageLink": "https://api.ngrams.dev/eng/search?query=hello+%2A&flags=cs&limit=3&start=157c30ede3ed098320eadbaf1a807dd17228ef6880f959e08605f29fcbfc14a894ffea81eec4fdb609a3e331772e9bd5"
}
curl https://api.ngrams.dev/eng/search
400 Bad Request
Response Body
// ErrorResponse object
{ "error": { "code": "MISSING_PARAMETER.QUERY" } }
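A client might map the status codes and the error body above to a single value for programmatic handling. The sketch below is one possible way to do that. `error_code` and the `HTTP_<status>` labels are inventions for this example and not part of the API.

```python
import json
from typing import Optional


def error_code(status: int, body_text: str = "") -> Optional[str]:
    """Return None on success, the API error code for a 400 response,
    or a made-up 'HTTP_<status>' label for responses without a body."""
    if status == 200:
        return None
    if status == 400 and body_text:
        try:
            body = json.loads(body_text)
        except json.JSONDecodeError:
            body = {}
        error = body.get("error", {}) if isinstance(body, dict) else {}
        if "code" in error:
            return error["code"]
    return f"HTTP_{status}"
```

The returned string can then be compared against the `ErrorCode` constants listed later in this document.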
Wildcard queries can generate result sets that contain thousands of ngrams. The API sends these big result sets in chunks called pages. The start of a page is controlled by the `start` parameter. The size of a page is controlled by the `limit` parameter.
Every search response that contains a partial result, i.e. a page, carries a so-called page token (`nextPageToken`). This page token can be used in a follow-up request, as the value of the `start` parameter, to fetch the next page.
If a response has no page token, you have reached the end of the result set.
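The pagination loop can be sketched as a Python generator. To keep the example independent of any HTTP client, it takes a `fetch_page` callable that receives the current `start` token (or `None` for the first page) and returns a decoded SearchResponse. `iter_ngrams`, `fetch_page`, and the `max_pages` safety cap are assumptions of this sketch.

```python
from typing import Callable, Iterator, Optional


def iter_ngrams(
    fetch_page: Callable[[Optional[str]], dict], max_pages: int = 1000
) -> Iterator[dict]:
    """Yield all ngrams of a result set by following nextPageToken."""
    token = None  # no start parameter on the first request
    for _ in range(max_pages):
        page = fetch_page(token)
        yield from page.get("ngrams", [])
        token = page.get("nextPageToken")
        if not token:
            return  # no page token: end of the result set
```

A real `fetch_page` would issue `GET /{corpus}/search` with the same `query`, `flags`, and `limit` on every call and add `start=<token>` when the token is not `None`. Following `nextPageLink` directly is an alternative.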
A batch request allows you to send up to 100 raw queries at once, which saves a lot of HTTP round-trip time compared to single search requests. Queries in a batch request implicitly have the `ri` flag set, which means there is no interpretation of query operators. This type of request is most appropriate in situations where the existence or frequency of multiple ngrams needs to be checked quickly.
Batch requests have no means of pagination, because the list of matching ngrams per query is rather short: it only reflects variants in casing. If, in addition, the `cs` or `cr` flag is enabled, at most one ngram is returned per query.
POST /{corpus}/batch
corpus
string
The label of the corpus to search, see corpora.
BatchRequest
object
The HTTP status code tells whether a request was successful. Invalid request body data causes a request to fail entirely. If the processing of individual queries fails, the batch response will contain corresponding error information for these queries.
Code | Body | Description |
---|---|---|
200 OK | BatchResponse | The request was successful. |
400 Bad Request | ErrorResponse | The request failed due to client error. |
404 Not Found | no | The corpus is unknown. |
500 Internal Server Error | no | Try again later. |
curl https://api.ngrams.dev/eng/batch \
-H 'Content-Type: application/json' \
-d '@path/to/batch.json'
path/to/batch.json
// BatchRequest object
{
"flags": "cs",
"queries": ["The quick brown", "fox jumps over the lazy dog"]
}
200 OK
Response Body
// BatchResponse object
{
"results": [
{
"queryTokens": [
{ "text": "The", "kind": "TERM" },
{ "text": "quick", "kind": "TERM" },
{ "text": "brown", "kind": "TERM" }
],
"ngrams": [
{
"id": "ecaf9b4576d82550a5661c85f515be24",
"absTotalMatchCount": 18248,
"relTotalMatchCount": 9.13534806330214e-9,
"tokens": [
{ "text": "The", "kind": "TERM" },
{ "text": "quick", "kind": "TERM" },
{ "text": "brown", "kind": "TERM" }
]
}
]
},
{
"error": { "code": "INVALID_QUERY.TOO_MANY_TOKENS" },
"queryTokens": [
{ "text": "fox", "kind": "TERM" },
{ "text": "jumps", "kind": "TERM" },
{ "text": "over", "kind": "TERM" },
{ "text": "the", "kind": "TERM" },
{ "text": "lazy", "kind": "TERM" },
{ "text": "dog", "kind": "TERM" }
]
}
]
}
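The request and response shapes above suggest a small client-side helper pair: one function that serializes a BatchRequest and one that lines up each result with its query. Both names are hypothetical. The check against 100 queries mirrors the limit stated in this document.

```python
import json

MAX_BATCH_QUERIES = 100  # upper bound stated in this document


def build_batch_body(queries, flags=""):
    """Serialize a BatchRequest object as UTF-8 encoded JSON bytes."""
    if len(queries) > MAX_BATCH_QUERIES:
        raise ValueError(f"at most {MAX_BATCH_QUERIES} queries per batch")
    body = {"queries": list(queries)}
    if flags:
        body["flags"] = flags
    return json.dumps(body, ensure_ascii=False).encode("utf-8")


def pair_results(queries, batch_response):
    """Return (query, error_code, ngrams) triples; results[i] belongs to queries[i]."""
    paired = []
    for query, result in zip(queries, batch_response["results"]):
        if "error" in result:
            paired.append((query, result["error"]["code"], []))
        else:
            paired.append((query, None, result["ngrams"]))
    return paired
```

The bytes returned by `build_batch_body` would be sent with `POST /{corpus}/batch` and a `Content-Type: application/json` header, as in the curl example above.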
An ngram request allows you to send an ngram ID and receive a full ngram object with year-based match count information. This type of request is used on https://ngrams.dev to fetch the data backing an ngram's histogram view.
GET /{corpus}/{ngram_id}
corpus
string
The label of the corpus to search, see corpora.
ngram_id
string
An ngram ID as returned from a search or batch request. Note that the ID of an abstract ngram is always considered unknown, because such ngrams have no year-based match count information.
The HTTP status code tells whether a request was successful. A code other than 200 is considered failure.
Code | Body | Description |
---|---|---|
200 OK | Ngram | The request was successful. |
404 Not Found | no | The corpus or ID is unknown. |
500 Internal Server Error | no | Try again later. |
curl https://api.ngrams.dev/eng/92c668bc012dc3e387ff0c7e791528db
200 OK
Response Body
// Ngram object
{
"id": "92c668bc012dc3e387ff0c7e791528db",
"absTotalMatchCount": 118987,
"relTotalMatchCount": 5.6675204699428895e-8,
"tokens": [
{ "text": "Hello", "kind": "TERM" },
{ "text": "World", "kind": "TERM" }
],
"stats": [
{
"year": 1880,
"absMatchCount": 52,
"relMatchCount": 1.2108055367130671e-8
},
// There might be gaps for years without any data.
{
"year": 1899,
"absMatchCount": 1,
"relMatchCount": 1.28983869973734e-10
},
{
"year": 1900,
"absMatchCount": 49,
"relMatchCount": 6.053137889244918e-9
},
// Items removed to keep it short.
{
"year": 2017,
"absMatchCount": 5107,
"relMatchCount": 1.765788720318268e-7
},
{
"year": 2018,
"absMatchCount": 4923,
"relMatchCount": 1.7802199706983458e-7
},
{
"year": 2019,
"absMatchCount": 3798,
"relMatchCount": 1.5816449035193755e-7
}
]
}
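Because the `stats` array may skip years, plotting code often wants a dense series. The following sketch fills the gaps with zero. `dense_abs_counts` is a hypothetical helper, and treating a missing year as zero matches is an assumption of this example.

```python
def dense_abs_counts(ngram, first_year=None, last_year=None):
    """Return {year: absMatchCount} for every year in the range,
    using 0 for years that are missing from the stats array."""
    by_year = {stat["year"]: stat["absMatchCount"] for stat in ngram["stats"]}
    if not by_year:
        return {}
    lo = min(by_year) if first_year is None else first_year
    hi = max(by_year) if last_year is None else last_year
    # Assumption: a gap in the data means zero matches for that year.
    return {year: by_year.get(year, 0) for year in range(lo, hi + 1)}
```

Passing the corpus-wide `minYear` and `maxYear` as `first_year` and `last_year` would align the series of different ngrams on the same axis.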
A corpus info request allows you to get static information about a corpus.
GET /{corpus}/info
corpus
string
The label of the corpus to search, see corpora.
The HTTP status code tells whether a request was successful. A code other than 200 is considered failure.
Code | Body | Description |
---|---|---|
200 OK | CorpusInfo | The request was successful. |
404 Not Found | no | The corpus is unknown. |
500 Internal Server Error | no | Try again later. |
curl https://api.ngrams.dev/eng/info
200 OK
Response Body
// CorpusInfo object
{
"name": "English",
"label": "eng",
"stats": [
{
"numNgrams": 76862879,
"minYear": 1470,
"maxYear": 2019,
"minMatchCount": 1,
"maxMatchCount": 1922716631,
"minTotalMatchCount": 40,
"maxTotalMatchCount": 115513165249
},
{
"numNgrams": 1604084580,
"minYear": 1470,
"maxYear": 2019,
"minMatchCount": 1,
"maxMatchCount": 1446928350,
"minTotalMatchCount": 40,
"maxTotalMatchCount": 82544506739
},
{
"numNgrams": 11777289629,
"minYear": 1470,
"maxYear": 2019,
"minMatchCount": 1,
"maxMatchCount": 84854130,
"minTotalMatchCount": 40,
"maxTotalMatchCount": 2907518961
},
{
"numNgrams": 5089891990,
"minYear": 1470,
"maxYear": 2019,
"minMatchCount": 1,
"maxMatchCount": 14391742,
"minTotalMatchCount": 40,
"maxTotalMatchCount": 384260789
},
{
"numNgrams": 5020506742,
"minYear": 1470,
"maxYear": 2019,
"minMatchCount": 1,
"maxMatchCount": 7167265,
"minTotalMatchCount": 40,
"maxTotalMatchCount": 226361873
}
]
}
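As a small worked example, the per-length `numNgrams` values from the response above can be summed to obtain the size of the whole corpus. The helper names are hypothetical. The input is trimmed to the one property the sketch needs.

```python
def total_ngrams(corpus_info):
    """Sum numNgrams over all ngram lengths."""
    return sum(stat["numNgrams"] for stat in corpus_info["stats"])


def ngrams_by_length(corpus_info):
    """Map ngram length (1-based) to numNgrams; stats[0] holds the 1-grams."""
    return {
        length: stat["numNgrams"]
        for length, stat in enumerate(corpus_info["stats"], start=1)
    }


# numNgrams values copied from the English CorpusInfo example above.
english = {
    "stats": [
        {"numNgrams": 76862879},
        {"numNgrams": 1604084580},
        {"numNgrams": 11777289629},
        {"numNgrams": 5089891990},
        {"numNgrams": 5020506742},
    ]
}

print(total_ngrams(english))  # 23568635820
```

The result rounds to the 23.6 B listed for English in the corpora table.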
Get the sum of ngram occurrences by ngram length and year. This data is useful for computing the relative frequencies of ngrams.
GET /{corpus}/total_counts
corpus
string
The label of the corpus to search, see corpora.
The HTTP status code tells whether a request was successful. A code other than 200 is considered failure.
Code | Body | Description |
---|---|---|
200 OK | TotalCounts | The request was successful. |
404 Not Found | no | The corpus is unknown. |
500 Internal Server Error | no | Try again later. |
curl https://api.ngrams.dev/eng/total_counts
200 OK
Response Body
// TotalCounts object
{
"minYear": 1470,
"maxYear": 2019,
"matchCounts": [
[ 984, "…", 22826152232 ],
[ 1019, "…", 24012975299 ],
[ 984, "…", 22826152232 ],
[ 949, "…", 21639329784 ],
[ 914, "…", 20458636150 ]
]
}
A complete list of types (schemas) used in this API.
A container for multiple queries and search flags.
queries
string[]
An array of query strings.
flags
string
optional
Enable search flags by adding the respective character sequence to the string.
- `cs`: Search is case-sensitive.
- `cr`: Collapse the result set by case-folding and then merging equal ngrams.
A container for multiple search results and metadata.
results
(SearchResponse | ErrorResponse)[]
An array of multiple types, aka union. `results[i]` is the outcome of `BatchRequest.queries[i]`. If `results[i].error` exists, the object is an instance of `ErrorResponse`, otherwise it is an instance of `SearchResponse`.
A container for static information about a single corpus.
name
string
The name of the corpus — something like "English", see corpora.
label
string
The label of the corpus — something like "eng", see corpora.
stats
CorpusStat[5]
An array of `CorpusStat` objects sorted by ngram length. `stats[0]` refers to the subset of 1-grams, `stats[1]` refers to the subset of 2-grams, and so on.
A container for statistical data about a corpus or sub-corpus.
numNgrams
number
The number of indexed ngrams. See ngram dataset for details.
minYear
number
The minimum year value associated with an ngram.
maxYear
number
The maximum year value associated with an ngram.
minMatchCount
number
The minimum value of an ngram's year-based match count.
maxMatchCount
number
The maximum value of an ngram's year-based match count.
minTotalMatchCount
number
The minimum value of an ngram's total match count.
maxTotalMatchCount
number
The maximum value of an ngram's total match count.
A type containing information about a failed query or request.
code
ErrorCode
A string indicating the type of error. The values are constants to be used for programmatic error handling.
context
string | object
optional
Provides error-specific context information to be used for advanced programmatic error handling. The exact format is currently work in progress.
An enum that describes the type of an error. Values are string constants.
Value | Description |
---|---|
INVALID_PARAMETER.LIMIT | Something is wrong with the limit parameter. |
INVALID_PARAMETER.START | Something is wrong with the start parameter. |
INVALID_QUERY.BAD_ALTERNATION | The token to the left or right of the / operator is invalid. |
INVALID_QUERY.BAD_COMPLETION | The ~ operator has no prefix in front of it. |
INVALID_QUERY.BAD_TERM_GROUP | There is an opening token without a matching closing token, or vice versa. |
INVALID_QUERY.NO_TERM | The query has no term. At least one term is required. |
INVALID_QUERY.TOO_EXPENSIVE | The query is too expensive to process and was rejected. |
INVALID_QUERY.TOO_MANY_TOKENS | The query has more than 5 tokens after tokenization. |
INVALID_REQUEST_BODY | The JSON data in the request body is malformed or has wrong schema. |
INVALID_UTF8_ENCODING | The query string is not in valid UTF-8 encoding. |
MISSING_PARAMETER.QUERY | The query parameter is missing. |
A container for error information and related data.
error
Error
An error object.
queryTokens
QueryToken[]
optional
A representation of the query after tokenization, which is an array of `QueryToken` objects. This property is only available if query processing has actually taken place. It is not available if a request was rejected at an earlier stage, e.g. due to missing required parameters.
A representation of an ngram with full year-based match count information. The properties listed below are in addition to the properties of `NgramLite`, i.e. `Ngram` extends `NgramLite`.
stats
NgramStat[]
An array of `NgramStat` objects.
A light-weight representation of an ngram with basic metadata.
id
string
An ID that identifies an ngram uniquely within a corpus. The ID can be used to fetch the corresponding `Ngram` object with full year-based match count information. See ngram request for details. Applications that need a unique ngram ID for the whole dataset can do so by prefixing this ID with the label of the associated corpus, i.e. `{label}_{ngram_id}`.
abstract
boolean
optional
Indicates whether the ngram is abstract as a result of applying a filter operation, e.g. result set collapsing. An abstract ngram does not represent an existing ngram from the dataset and hence has no associated year-based match count information. Its ID is nevertheless unique within a corpus. The property is only present if true — absence means false.
absTotalMatchCount
number
The ngram's absolute total match count. See Data Model for details.
relTotalMatchCount
number
The ngram's relative total match count. See Data Model for details.
tokens
NgramToken[1..5]
An array of `NgramToken` objects of length 1 to 5.
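A client might map the NgramLite properties to a typed record. The sketch below is one possible shape. The class, its field names, and `global_id` are illustrative and not prescribed by the API. It shows the "absent means false" rule for `abstract` and the `{label}_{ngram_id}` convention for dataset-wide IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NgramLite:
    id: str
    abstract: bool
    abs_total_match_count: int
    rel_total_match_count: float
    text: str  # token texts joined by a space (an assumption for display)


def parse_ngram_lite(obj: dict) -> NgramLite:
    """Build an NgramLite record from a decoded JSON object."""
    return NgramLite(
        id=obj["id"],
        abstract=obj.get("abstract", False),  # property only present if true
        abs_total_match_count=obj["absTotalMatchCount"],
        rel_total_match_count=obj["relTotalMatchCount"],
        text=" ".join(token["text"] for token in obj["tokens"]),
    )


def global_id(label: str, ngram: NgramLite) -> str:
    """Dataset-wide ID: the ngram ID prefixed with the corpus label."""
    return f"{label}_{ngram.id}"
```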
A representation of an ngram's match count relating to a single year.
year
number
The year the data belongs to. See Data Model for details.
absMatchCount
number
The ngram's absolute match count. See Data Model for details.
relMatchCount
number
The ngram's relative match count. See Data Model for details.
A representation of a single token as part of an ngram. It contains basic information like text and type, as well as metadata about its relation to a query, e.g. if the token has been inserted as a result of wildcard application.
text
string
The token's text in UTF-8 encoding. For tokens that have a part-of-speech suffix in the original raw data, e.g. `example_NOUN`, this suffix has been removed. The POS tag information is available via the `kind` property.
kind
NgramTokenKind
The token's kind lets you distinguish programmatically between text-like tokens, part-of-speech (POS) tagged tokens, and sentence boundary tags. It can be used to append the original POS tag suffix to the text string or for syntax highlighting when displayed.
inserted
boolean
optional
Indicates whether the token was inserted as a result of applying a `*`, `**`, or `*_ADJ` and friends wildcard. The property is only present if true; absence means false.
completed
boolean
optional
Indicates whether the token was completed as a result of applying the `~` operator. The property is only present if true; absence means false.
An enum that describes the type of an ngram token. Values are string constants.
Value | Description |
---|---|
TERM | The token is a regular term. |
TAGGED_AS_ADJ | The token has a POS tag of `ADJ`. |
TAGGED_AS_ADP | The token has a POS tag of `ADP`. |
TAGGED_AS_ADV | The token has a POS tag of `ADV`. |
TAGGED_AS_CONJ | The token has a POS tag of `CONJ`. |
TAGGED_AS_DET | The token has a POS tag of `DET`. |
TAGGED_AS_NOUN | The token has a POS tag of `NOUN`. |
TAGGED_AS_NUM | The token has a POS tag of `NUM`. |
TAGGED_AS_PRON | The token has a POS tag of `PRON`. |
TAGGED_AS_PRT | The token has a POS tag of `PRT`. |
TAGGED_AS_VERB | The token has a POS tag of `VERB`. |
SENTENCE_START | The token is the `_START_` token. |
SENTENCE_END | The token is the `_END_` token. |
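The NgramToken description mentions that `kind` can be used to restore the original POS tag suffix. A minimal sketch of that idea follows. `original_text` is a hypothetical helper that relies only on the `TAGGED_AS_` prefix of the enum values listed above.

```python
TAG_PREFIX = "TAGGED_AS_"


def original_text(token: dict) -> str:
    """Re-append the POS tag suffix that was removed from the token text,
    e.g. text 'example' with kind 'TAGGED_AS_NOUN' becomes 'example_NOUN'."""
    kind = token["kind"]
    if kind.startswith(TAG_PREFIX):
        return f'{token["text"]}_{kind[len(TAG_PREFIX):]}'
    # TERM and sentence boundary tokens are returned unchanged.
    return token["text"]
```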
A representation of a single token as part of a query string.
text
string
The token's text in UTF-8 encoding.
kind
QueryTokenKind
The token's kind tells whether the token has been recognized as a text-like token or as a query operator.
An enum that describes the type of a query token. Values are string constants.
Value | Description |
---|---|
TERM | The token is a regular query term. |
STAR | The token is the `*` wildcard. |
STARSTAR | The token is the `**` wildcard. |
STAR_ADJ | The token is the `*_ADJ` wildcard. |
STAR_ADP | The token is the `*_ADP` wildcard. |
STAR_ADV | The token is the `*_ADV` wildcard. |
STAR_CONJ | The token is the `*_CONJ` wildcard. |
STAR_DET | The token is the `*_DET` wildcard. |
STAR_NOUN | The token is the `*_NOUN` wildcard. |
STAR_NUM | The token is the `*_NUM` wildcard. |
STAR_PRON | The token is the `*_PRON` wildcard. |
STAR_PRT | The token is the `*_PRT` wildcard. |
STAR_VERB | The token is the `*_VERB` wildcard. |
SENTENCE_START | The token is the `_START_` token. |
SENTENCE_END | The token is the `_END_` token. |
SLASH | The token is the `/` operator. |
PREFIX | The token has the `~` operator. |
TERM_GROUP | The token is a term group. |
A representation of the outcome of a successfully processed query.
queryTokens
QueryToken[]
A representation of the query after tokenization, which is an array of `QueryToken` objects.
ngrams
NgramLite[]
A representation of the result set, which is an array of `NgramLite` objects.
nextPageToken
string
optional
An opaque token to be used in a follow-up request to fetch the next chunk of the result set. See pagination for details.
nextPageLink
string
optional
An absolute URL to issue a follow-up request. See pagination for details.
Represents a lookup table for ngram total match counts by ngram length and year.
The table covers the full range from ngram length 1 to 5 combined with first year to last year of ngram occurrence. For combinations where no total match count is available the value is zero.
minYear
int32
The first year of ngram occurrence.
maxYear
int32
The last year of ngram occurrence.
matchCounts
int64[][]
A matrix of ngram total match counts.
- The first subscript in the range `[0, 4]` denotes the ngram length, with 1-gram counts at index 0, 2-gram counts at index 1, and so on.
- The second subscript in the range `[0, (maxYear - minYear)]` denotes the year, with `minYear` at index 0, `minYear + 1` at index 1, and so on.
Example: The total match count of 3-grams for the year 2000 in the English corpus is `matchCounts[2][530]` because `minYear = 1470` and `2000 - minYear = 530`.
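The index arithmetic above translates into a short lookup helper. The sketch also shows how a relative frequency could be computed from it, assuming that a year-based relative match count is the absolute match count divided by the total for the same ngram length and year. Both function names are hypothetical.

```python
def total_match_count(total_counts: dict, ngram_length: int, year: int) -> int:
    """Look up matchCounts[ngram_length - 1][year - minYear]."""
    rows = total_counts["matchCounts"]
    if not 1 <= ngram_length <= len(rows):
        raise ValueError("ngram_length out of range")
    if not total_counts["minYear"] <= year <= total_counts["maxYear"]:
        raise ValueError("year out of range")
    return rows[ngram_length - 1][year - total_counts["minYear"]]


def relative_match_count(abs_match_count, total_counts, ngram_length, year):
    """Assumed definition: absolute count divided by the matching total.
    Returns 0.0 where the table holds zero (no total available)."""
    total = total_match_count(total_counts, ngram_length, year)
    return abs_match_count / total if total else 0.0
```

With the English corpus, `total_match_count(tc, 3, 2000)` reads `matchCounts[2][530]`, as in the example above.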