REST API

NGRAMS has been built following API-first principles. The goal is to make ngram data as easy to access as possible. The API sends and receives data in UTF-8 encoded JSON format.

There are endpoints which enable the following types of requests:

Search Request — Send a wildcard query and receive matching ngrams.
Batch Request — Send multiple raw queries at once and receive matching ngrams.
Ngram Request — Send an ngram id and receive year-based match count information.
CorpusInfo Request — Get static information about a corpus.
TotalCounts Request — Get total match counts by ngram length and year.

The REST API is currently in beta status — expect things to change.

By using the API, you agree to our Terms of Service. In short, they read: NGRAMS can be used free of charge, for both commercial and non-commercial purposes. Use requires attribution.

Base URL

https://api.ngrams.dev

Rate Limits

We believe that performance is a feature, which is why our backend tech stack is fully native. To evaluate the system at this early stage, we do not apply any rate limiting. You can send as many requests as you want, as fast as you can. We will adjust this policy if necessary.

We will block clients based on IP address if we detect abnormal usage.

Corpora

At the moment, the following corpora are available.

Name	Label	#Ngrams
English	`eng`	23.6 B
German	`ger`	4.5 B
Russian	`rus`	1.5 B

Quickstart

TL;DR: Here is how to get started, in FAQ style.

I have an ngram and need its probability.

In NGRAMS every ngram has an absolute total match count and a relative total match count. The former is the sum of all year-based absolute match counts. The latter is the absolute total match count divided by the absolute total match count of all ngrams of the same length — you can call this the ngram's probability.

Send a Search Request.

GET {base_url}/{corpus}/search?query=my+awesome+ngram

You will receive a list of matching ngrams because the search is not case-sensitive by default, i.e. the list contains my awesome ngram, MY AWESOME NGRAM, etc. You can add &flags=cs to make the search case-sensitive (only one or none match). Each ngram in the list has a relTotalMatchCount property.

I need one probability for all cases of this ngram.

Send a Search Request with result set collapsing.

GET {base_url}/{corpus}/search?query=my+awesome+ngram&flags=cr

You will get at most one matching ngram, whose absolute total match count is the sum over all cases, i.e. my awesome ngram, MY AWESOME NGRAM, etc. The relative total match count is derived as described above. The ngram's text will be in case-folded format. The ngram is also called abstract because it was derived from other ngrams and has no 1:1 correspondence with an ngram in the raw dataset.
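To make the effect of result set collapsing more concrete, here is a minimal Python sketch that mimics it on the client side: token texts are case-folded and the absolute total match counts of ngrams that become equal are summed. The function name `collapse_by_case` and the use of `str.casefold` are assumptions for illustration only; the folding rules applied by the server may differ.

```python
from collections import defaultdict


def collapse_by_case(ngrams):
    """Merge NgramLite-like dicts whose token texts are equal after case-folding.

    Returns a dict that maps the case-folded token tuple to the summed
    absTotalMatchCount. Illustrative only, not the server's algorithm.
    """
    totals = defaultdict(int)
    for ngram in ngrams:
        # The merge key is the sequence of case-folded token texts.
        key = tuple(token["text"].casefold() for token in ngram["tokens"])
        totals[key] += ngram["absTotalMatchCount"]
    return dict(totals)
```

In practice the cr flag does this work for you; the sketch is only meant to show what "the sum of all cases" refers to.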

I have an ngram and need its frequencies by year.

This requires two requests but we will support this use case directly some time in the future. See this feature request.

Send a Search Request.

GET {base_url}/{corpus}/search?query=my+awesome+ngram

You will receive a list of matching ngrams because the search is not case-sensitive by default, i.e. the list contains my awesome ngram, MY AWESOME NGRAM, etc. You can add &flags=cs to make the search case-sensitive (only one or none match). Each ngram in the list has an id property.

Send an Ngram Request providing an ngram ID.

GET {base_url}/{corpus}/{ngram_id}

You will get the full ngram representation with year-based match counts.

I need the frequencies for all cases of this ngram.

This is not supported at the moment because it would mean returning year-based match counts of an abstract ngram. See NgramLite for details. You can compute these stats yourself by fetching the full ngram for every ID returned by the Search Request above.
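As a rough sketch of that do-it-yourself approach, the Python function below sums the year-based absolute match counts of several full Ngram objects that have already been fetched. The name `merge_yearly_counts` is hypothetical, and the input is assumed to be a list of dicts shaped like the Ngram type described under Types.

```python
from collections import Counter


def merge_yearly_counts(ngrams):
    """Sum absMatchCount per year over several full Ngram objects.

    Years that are missing in one ngram simply contribute nothing.
    The result maps year to summed count, ordered by year.
    """
    totals = Counter()
    for ngram in ngrams:
        for stat in ngram["stats"]:
            totals[stat["year"]] += stat["absMatchCount"]
    return dict(sorted(totals.items()))
```

Only absolute counts are merged here; how to derive a relative count for the merged series is left open.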

Search Request

A search request allows you to send a single wildcard query and receive a set of matching ngrams. This is basically the same type of request issued when using the search interface on https://ngrams.dev. The returned ngrams are sorted by decreasing total match count. Large sets are sent in chunks making use of pagination.

Endpoint

GET /{corpus}/search

Path Parameters

corpus string

The label of the corpus to search, see corpora.

Query Parameters

query string

The percent-encoded query string.

flags string optional

Enable search flags by adding the respective character sequence to the string.

cs — Search is case-sensitive.
cr — Collapse the result set by case-folding and then merging equal ngrams.
ep — Exclude ngrams from the result set where wildcards matched punctuation marks as of Unicode category P, see also here.
es — Exclude ngrams from the result set where wildcards matched sentence boundary tags, namely _START_ and _END_.
ri — Raw query: Do not interpret query operators. No need for escape sequences.
rt — Raw query: Do not tokenize query terms. Split only on ASCII whitespace.
rn — Raw query: Do not normalize query string characters.
ra — Raw query: all (ri + rt + rn)

limit number optional default: 100 max: 100

The maximum number of ngrams to return. The limit is applied before ngram filtering takes place, which means that the actual size of the result set could be smaller if any of the flags cs, cr, ep, or es is used.

start string optional

An opaque token to fetch the next chunk of a result set (pagination). A start token is included in a successful search result if there are possibly more matching ngrams.
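The parameters above can be assembled into a request URL as in the following Python sketch. The helper name `build_search_url` is an assumption for illustration; percent-encoding is delegated to the standard library's `urllib.parse.urlencode`, and no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.ngrams.dev"


def build_search_url(corpus, query, flags=None, limit=None, start=None):
    """Build a Search Request URL; optional parameters are omitted if unset."""
    params = {"query": query}
    if flags:
        params["flags"] = flags
    if limit is not None:
        params["limit"] = limit
    if start:
        params["start"] = start
    # urlencode percent-encodes the values, e.g. "hello *" becomes "hello+%2A".
    return f"{BASE_URL}/{corpus}/search?{urlencode(params)}"
```

For example, `build_search_url("eng", "hello *", flags="cs", limit=3)` yields `https://api.ngrams.dev/eng/search?query=hello+%2A&flags=cs&limit=3`.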

Response

The HTTP status code tells whether a request was successful. A code other than 200 is considered a failure. The response to a 400 Bad Request contains a body with error details.

Code	Body	Description
`200 OK`	`SearchResponse`	The request was successful.
`400 Bad Request`	`ErrorResponse`	The request failed due to client error.
`404 Not Found`	no	The corpus is unknown.
`500 Internal Server Error`	no	Try again later.

Examples

curl 'https://api.ngrams.dev/eng/search?query=hello+*&flags=cs&limit=3'
# OR
curl -G https://api.ngrams.dev/eng/search \
--data-urlencode query='hello *' \
-d flags=cs \
-d limit=3

200 OK

Response Body

// SearchResponse object,
// 2 instead of 3 ngrams due to post-retrieval case-sensitive filtering.
{
  "queryTokens": [
    { "text": "hello", "kind": "TERM" },
    { "text": "*", "kind": "STAR" }
  ],
  "ngrams": [
    {
      "id": "d975b1edafaf5aa521f6aee0d7efbe06",
      "absTotalMatchCount": 608657,
      "relTotalMatchCount": 2.899120077549673e-7,
      "tokens": [
        { "text": "hello", "kind": "TERM" },
        { "text": ",", "kind": "TERM", "inserted": true }
      ]
    },
    {
      "id": "983f5221b490f979d836276d3d986ef2",
      "absTotalMatchCount": 598094,
      "relTotalMatchCount": 2.848807002403643e-7,
      "tokens": [
        { "text": "hello", "kind": "TERM" },
        { "text": ".", "kind": "TERM", "inserted": true }
      ]
    }
  ],
  "nextPageToken": "157c30ede3ed098320eadbaf1a807dd17228ef6880f959e08605f29fcbfc14a894ffea81eec4fdb609a3e331772e9bd5",
  "nextPageLink": "https://api.ngrams.dev/eng/search?query=hello+%2A&flags=cs&limit=3&start=157c30ede3ed098320eadbaf1a807dd17228ef6880f959e08605f29fcbfc14a894ffea81eec4fdb609a3e331772e9bd5"
}

curl https://api.ngrams.dev/eng/search

400 Bad Request

Response Body

// ErrorResponse object
{ "error": { "code": "MISSING_PARAMETER.QUERY" } }

Pagination

Wildcard queries can generate result sets that contain thousands of ngrams. The API sends these big result sets in chunks called pages. The start of a page is controlled by the start parameter. The size of a page is controlled by the limit parameter.

Every search response that contains a partial result, i.e. a page, includes a so-called page token. This page token can be used as the value of the start parameter in a follow-up request to fetch the next page.

If a response has no page token, you have reached the end of the result set.
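The paging loop described above could look roughly like the following Python sketch. The names `fetch_json` and `iter_ngrams` are hypothetical, and the `fetch` parameter exists only so that the loop can be exercised without network access; the sketch assumes the `nextPageToken` property shown in the example response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.ngrams.dev"


def fetch_json(url):
    """Fetch a URL and decode its JSON body (performs a network call)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_ngrams(corpus, query, flags=None, limit=100, fetch=fetch_json):
    """Yield every ngram of a result set by following page tokens."""
    params = {"query": query, "limit": limit}
    if flags:
        params["flags"] = flags
    while True:
        page = fetch(f"{BASE_URL}/{corpus}/search?{urlencode(params)}")
        yield from page["ngrams"]
        token = page.get("nextPageToken")
        if not token:
            return  # no page token: end of the result set
        params["start"] = token
```

Because the function is a generator, a caller can stop iterating early and no further pages are requested.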

Batch Request

A batch request allows you to send up to 100 raw queries at once, which saves a lot of HTTP round trip time compared to single search requests. Queries in a batch request implicitly have the ri flag set, which means there is no interpretation of query operators. This type of request is most appropriate in situations where the existence or frequency of multiple ngrams needs to be checked quickly.

Batch requests have no means of pagination, because the list of matching ngrams per query is rather short as it only reflects variants in casing. If, in addition, the cs or cr flag is enabled, at most one ngram is returned per query.
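Since a batch is capped at 100 queries, longer query lists have to be split across several requests. The following Python sketch prepares the JSON request bodies for that; the function name `build_batch_bodies` is an assumption, and the body layout follows the BatchRequest type described under Types. Sending the bodies is left out.

```python
import json

MAX_QUERIES_PER_BATCH = 100  # limit stated in the description above


def build_batch_bodies(queries, flags=None):
    """Split queries into BatchRequest JSON bodies of at most 100 queries each."""
    bodies = []
    for i in range(0, len(queries), MAX_QUERIES_PER_BATCH):
        body = {"queries": queries[i:i + MAX_QUERIES_PER_BATCH]}
        if flags:
            body["flags"] = flags
        # ensure_ascii=False keeps non-ASCII query text readable in the JSON.
        bodies.append(json.dumps(body, ensure_ascii=False))
    return bodies
```

Each returned string can be sent as the body of one POST request with a Content-Type of application/json.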

Endpoint

POST /{corpus}/batch

Path Parameters

corpus string

The label of the corpus to search, see corpora.

Request Body

BatchRequest object

Response

The HTTP status code tells whether a request was successful. Invalid request body data causes a request to fail entirely. If the processing of individual queries fails, the batch response will contain corresponding error information for these queries.

Code	Body	Description
`200 OK`	`BatchResponse`	The request was successful.
`400 Bad Request`	`ErrorResponse`	The request failed due to client error.
`404 Not Found`	no	The corpus is unknown.
`500 Internal Server Error`	no	Try again later.

Example

curl https://api.ngrams.dev/eng/batch \
-H 'Content-Type: application/json' \
-d '@path/to/batch.json'

path/to/batch.json

// BatchRequest object
{
  "flags": "cs",
  "queries": ["The quick brown", "fox jumps over the lazy dog"]
}

200 OK

Response Body

// BatchResponse object
{
  "results": [
    {
      "queryTokens": [
        { "text": "The", "kind": "TERM" },
        { "text": "quick", "kind": "TERM" },
        { "text": "brown", "kind": "TERM" }
      ],
      "ngrams": [
        {
          "id": "ecaf9b4576d82550a5661c85f515be24",
          "absTotalMatchCount": 18248,
          "relTotalMatchCount": 9.13534806330214e-9,
          "tokens": [
            { "text": "The", "kind": "TERM" },
            { "text": "quick", "kind": "TERM" },
            { "text": "brown", "kind": "TERM" }
          ]
        }
      ]
    },
    {
      "error": { "code": "INVALID_QUERY.TOO_MANY_TOKENS" },
      "queryTokens": [
        { "text": "fox", "kind": "TERM" },
        { "text": "jumps", "kind": "TERM" },
        { "text": "over", "kind": "TERM" },
        { "text": "the", "kind": "TERM" },
        { "text": "lazy", "kind": "TERM" },
        { "text": "dog", "kind": "TERM" }
      ]
    }
  ]
}

Ngram Request

An ngram request allows you to send an ngram ID and receive a full ngram object with year-based match count information. This type of request is used on https://ngrams.dev to fetch the data backing an ngram's histogram view.

Endpoint

GET /{corpus}/{ngram_id}

Path Parameters

corpus string

The label of the corpus to search, see corpora.

ngram_id string

An ngram ID as returned from a search or batch request. Note that the ID of an abstract ngram is always considered unknown, because such ngrams have no year-based match count information.

Response

The HTTP status code tells whether a request was successful. A code other than 200 is considered a failure.

Code	Body	Description
`200 OK`	`Ngram`	The request was successful.
`404 Not Found`	no	The corpus or ID is unknown.
`500 Internal Server Error`	no	Try again later.

Example

curl https://api.ngrams.dev/eng/92c668bc012dc3e387ff0c7e791528db

200 OK

Response Body

// Ngram object
{
  "id": "92c668bc012dc3e387ff0c7e791528db",
  "absTotalMatchCount": 118987,
  "relTotalMatchCount": 5.6675204699428895e-8,
  "tokens": [
    { "text": "Hello", "kind": "TERM" },
    { "text": "World", "kind": "TERM" }
  ],
  "stats": [
    {
      "year": 1880,
      "absMatchCount": 52,
      "relMatchCount": 1.2108055367130671e-8
    },
    // There might be gaps for years without any data.
    {
      "year": 1899,
      "absMatchCount": 1,
      "relMatchCount": 1.28983869973734e-10
    },
    {
      "year": 1900,
      "absMatchCount": 49,
      "relMatchCount": 6.053137889244918e-9
    },
    // Items removed to keep it short.
    {
      "year": 2017,
      "absMatchCount": 5107,
      "relMatchCount": 1.765788720318268e-7
    },
    {
      "year": 2018,
      "absMatchCount": 4923,
      "relMatchCount": 1.7802199706983458e-7
    },
    {
      "year": 2019,
      "absMatchCount": 3798,
      "relMatchCount": 1.5816449035193755e-7
    }
  ]
}

CorpusInfo Request

A corpus info request allows you to get static information about a corpus.

Endpoint

GET /{corpus}/info

Path Parameters

corpus string

The label of the corpus to search, see corpora.

Response

The HTTP status code tells whether a request was successful. A code other than 200 is considered a failure.

Code	Body	Description
`200 OK`	`CorpusInfo`	The request was successful.
`404 Not Found`	no	The corpus is unknown.
`500 Internal Server Error`	no	Try again later.

Example

curl https://api.ngrams.dev/eng/info

200 OK

Response Body

// CorpusInfo object
{
  "name": "English",
  "label": "eng",
  "stats": [
    {
      "numNgrams": 76862879,
      "minYear": 1470,
      "maxYear": 2019,
      "minMatchCount": 1,
      "maxMatchCount": 1922716631,
      "minTotalMatchCount": 40,
      "maxTotalMatchCount": 115513165249
    },
    {
      "numNgrams": 1604084580,
      "minYear": 1470,
      "maxYear": 2019,
      "minMatchCount": 1,
      "maxMatchCount": 1446928350,
      "minTotalMatchCount": 40,
      "maxTotalMatchCount": 82544506739
    },
    {
      "numNgrams": 11777289629,
      "minYear": 1470,
      "maxYear": 2019,
      "minMatchCount": 1,
      "maxMatchCount": 84854130,
      "minTotalMatchCount": 40,
      "maxTotalMatchCount": 2907518961
    },
    {
      "numNgrams": 5089891990,
      "minYear": 1470,
      "maxYear": 2019,
      "minMatchCount": 1,
      "maxMatchCount": 14391742,
      "minTotalMatchCount": 40,
      "maxTotalMatchCount": 384260789
    },
    {
      "numNgrams": 5020506742,
      "minYear": 1470,
      "maxYear": 2019,
      "minMatchCount": 1,
      "maxMatchCount": 7167265,
      "minTotalMatchCount": 40,
      "maxTotalMatchCount": 226361873
    }
  ]
}

TotalCounts Request

Get the sum of ngram occurrences by ngram length and year. This data is useful for computing the relative frequencies of ngrams.

Endpoint

GET /{corpus}/total_counts

Path Parameters

corpus string

The label of the corpus to search, see corpora.

Response

The HTTP status code tells whether a request was successful. A code other than 200 is considered a failure.

Code	Body	Description
`200 OK`	`TotalCounts`	The request was successful.
`404 Not Found`	no	The corpus is unknown.
`500 Internal Server Error`	no	Try again later.

Example

curl https://api.ngrams.dev/eng/total_counts

200 OK

Response Body

// TotalCounts object
{
  "minYear": 1470,
  "maxYear": 2019,
  "matchCounts": [
    [ 984, "…", 22826152232 ],
    [ 1019, "…", 24012975299 ],
    [ 984, "…", 22826152232 ],
    [ 949, "…", 21639329784 ],
    [ 914, "…", 20458636150 ]
  ]
}

Types

A complete list of types (schemas) used in this API.

BatchRequest

A container for multiple queries and search flags.

Properties

queries string[]

An array of query strings.

flags string optional

Enable search flags by adding the respective character sequence to the string.

cs — Search is case-sensitive.
cr — Collapse the result set by case-folding and then merging equal ngrams.

BatchResponse

A container for multiple search results and metadata.

Properties

results (SearchResponse | ErrorResponse)[]

An array of multiple types, aka union. results[i] is the outcome of BatchRequest.queries[i]. If results[i].error exists, the object is an instance of ErrorResponse, otherwise it is an instance of SearchResponse.
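The rule for telling the two result types apart can be applied as in this Python sketch. The function name `split_batch_results` is hypothetical; it assumes `queries` is the same list that was sent in the BatchRequest, so that positions line up with `results`.

```python
def split_batch_results(queries, batch_response):
    """Pair each query with its outcome and return (matches, errors).

    matches maps a query to its list of NgramLite objects,
    errors maps a query to its error code.
    Note: duplicate query strings would overwrite each other here.
    """
    matches, errors = {}, {}
    for query, result in zip(queries, batch_response["results"]):
        if "error" in result:
            # An error property marks the entry as an ErrorResponse.
            errors[query] = result["error"]["code"]
        else:
            matches[query] = result["ngrams"]
    return matches, errors
```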

CorpusInfo

A container for static information about a single corpus.

Properties

name string

The name of the corpus — something like "English", see corpora.

label string

The label of the corpus — something like "eng", see corpora.

stats CorpusStat[5]

An array of CorpusStat objects sorted by ngram length. stats[0] refers to the subset of 1-grams, stats[1] refers to the subset of 2-grams, and so on.
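As a small illustration of how the stats array can be used, the Python sketch below adds up numNgrams over all five ngram lengths. The function name `total_ngrams` is an assumption for illustration.

```python
def total_ngrams(corpus_info):
    """Return the number of indexed ngrams summed over all ngram lengths."""
    return sum(stat["numNgrams"] for stat in corpus_info["stats"])
```

Applied to the English CorpusInfo example shown earlier, the sum is 23568635820, which agrees with the roughly 23.6 B listed in the corpora table.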

CorpusStat

A container for statistical data about a corpus or sub-corpus.

numNgrams number

The number of indexed ngrams. See ngram dataset for details.

minYear number

The minimum year value associated with an ngram.

maxYear number

The maximum year value associated with an ngram.

minMatchCount number

The minimum value of an ngram's year-based match count.

maxMatchCount number

The maximum value of an ngram's year-based match count.

minTotalMatchCount number

The minimum value of an ngram's total match count.

maxTotalMatchCount number

The maximum value of an ngram's total match count.

Error

A type containing information about a failed query or request.

Properties

code ErrorCode

A string indicating the type of error. The values are constants to be used for programmatic error handling.

context string | object optional

Provides error-specific context information to be used for advanced programmatic error handling. The exact format is currently a work in progress.

ErrorCode

An enum that describes the type of an error. Values are string constants.

Value	Description
`INVALID_PARAMETER.LIMIT`	Something is wrong with the `limit` parameter.
`INVALID_PARAMETER.START`	Something is wrong with the `start` parameter.
`INVALID_QUERY.BAD_ALTERNATION`	The token to the left or right of the `/` operator is invalid.
`INVALID_QUERY.BAD_COMPLETION`	The `~` operator has no prefix in front of it.
`INVALID_QUERY.BAD_TERM_GROUP`	There is an opening token without a matching closing token, or vice versa.
`INVALID_QUERY.NO_TERM`	The query has no term. At least one term is required.
`INVALID_QUERY.TOO_EXPENSIVE`	The query is too expensive to process and was rejected.
`INVALID_QUERY.TOO_MANY_TOKENS`	The query has more than 5 tokens after tokenization.
`INVALID_REQUEST_BODY`	The JSON data in the request body is malformed or has wrong schema.
`INVALID_UTF8_ENCODING`	The query string is not in valid UTF-8 encoding.
`MISSING_PARAMETER.QUERY`	The `query` parameter is missing.
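For programmatic error handling it can be convenient to branch on the part of the code before the dot. The Python sketch below does that; reading the dotted names as a category plus a detail is an interpretation of the table above, not a documented contract, and the function names are hypothetical.

```python
def split_error_code(code):
    """Split an ErrorCode such as 'INVALID_QUERY.NO_TERM' into (category, detail).

    Codes without a dot, e.g. 'INVALID_REQUEST_BODY', get a detail of None.
    """
    category, _, detail = code.partition(".")
    return category, detail or None


def is_query_error(error_response):
    """Return True if an ErrorResponse reports a problem with the query itself."""
    category, _ = split_error_code(error_response["error"]["code"])
    return category == "INVALID_QUERY"
```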

ErrorResponse

A container for error information and related data.

Properties

error Error

An error object.

queryTokens QueryToken[] optional

A representation of the query after tokenization, which is an array of QueryToken objects. This property is only available if query processing has actually taken place. It is not available if a request was rejected at an earlier stage, e.g. due to missing required parameters.

Ngram

A representation of an ngram with full year-based match count information. The properties listed below are in addition to the properties of NgramLite, i.e. Ngram extends NgramLite.

Properties

stats NgramStat[]

An array of NgramStat objects.

NgramLite

A light-weight representation of an ngram with basic metadata.

Properties

id string

An ID that identifies an ngram uniquely within a corpus. The ID can be used to fetch the corresponding Ngram object with full year-based match count information. See ngram request for details. Applications that need a unique ngram ID for the whole dataset can do so by prefixing this ID with the label of the associated corpus, i.e. {label}_{ngram_id}.

abstract boolean optional

Indicates whether the ngram is abstract as a result of applying a filter operation, e.g. result set collapsing. An abstract ngram does not represent an existing ngram from the dataset and hence has no associated year-based match count information. Its ID is nevertheless unique within a corpus. The property is only present if true — absence means false.

absTotalMatchCount number

The ngram's absolute total match count. See Data Model for details.

relTotalMatchCount number

The ngram's relative total match count. See Data Model for details.

tokens NgramToken[1..5]

An array of NgramToken objects of length 1 to 5.

NgramStat

A representation of an ngram's match count relating to a single year.

Properties

year number

The year the data belongs to. See Data Model for details.

absMatchCount number

The ngram's absolute match count. See Data Model for details.

relMatchCount number

The ngram's relative match count. See Data Model for details.

NgramToken

A representation of a single token as part of an ngram. It contains basic information like text and type, as well as metadata about its relation to a query, e.g. if the token has been inserted as a result of wildcard application.

Properties

text string

The token's text in UTF-8 encoding. For tokens that have a part-of-speech suffix in the original raw data, e.g. example_NOUN, this suffix has been removed. The POS tag information is available via the kind property.

kind NgramTokenKind

The token's kind lets you distinguish programmatically between text-like tokens, part-of-speech (POS) tagged tokens, and sentence boundary tags. It can be used to append the original POS tag suffix to the text string or for syntax highlighting when displayed.

inserted boolean optional

Indicates whether the token was inserted as a result of applying a *, **, or *_ADJ and friends wildcard. The property is only present if true — absence means false.

completed boolean optional

Indicates whether the token was completed as a result of applying the ~ operator. The property is only present if true; absence means false.

NgramTokenKind

An enum that describes the type of an ngram token. Values are string constants.

Value	Description
`TERM`	The token is a regular term.
`TAGGED_AS_ADJ`	The token has a POS tag of `ADJ`.
`TAGGED_AS_ADP`	The token has a POS tag of `ADP`.
`TAGGED_AS_ADV`	The token has a POS tag of `ADV`.
`TAGGED_AS_CONJ`	The token has a POS tag of `CONJ`.
`TAGGED_AS_DET`	The token has a POS tag of `DET`.
`TAGGED_AS_NOUN`	The token has a POS tag of `NOUN`.
`TAGGED_AS_NUM`	The token has a POS tag of `NUM`.
`TAGGED_AS_PRON`	The token has a POS tag of `PRON`.
`TAGGED_AS_PRT`	The token has a POS tag of `PRT`.
`TAGGED_AS_VERB`	The token has a POS tag of `VERB`.
`SENTENCE_START`	The token is the `_START_` token.
`SENTENCE_END`	The token is the `_END_` token.
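The kind values above can be used to restore the raw-data notation of a token, such as example_NOUN. The following Python sketch shows one way to do that; the function name `token_with_pos_suffix` is hypothetical, and it assumes that every POS kind follows the TAGGED_AS_ naming pattern listed in the table.

```python
TAG_PREFIX = "TAGGED_AS_"


def token_with_pos_suffix(token):
    """Return the token text, with its POS tag re-appended where one exists."""
    kind = token["kind"]
    if kind.startswith(TAG_PREFIX):
        # TAGGED_AS_NOUN becomes the suffix _NOUN.
        return f'{token["text"]}_{kind[len(TAG_PREFIX):]}'
    # Regular terms and sentence boundary tokens are returned unchanged.
    return token["text"]
```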

QueryToken

A representation of a single token as part of a query string.

Properties

text string

The token's text in UTF-8 encoding.

kind QueryTokenKind

The token's kind tells whether the token was recognized as a text-like token or as a query operator.

QueryTokenKind

An enum that describes the type of a query token. Values are string constants.

Value	Description
`TERM`	The token is a regular query term.
`STAR`	The token is the `*` wildcard.
`STARSTAR`	The token is the `**` wildcard.
`STAR_ADJ`	The token is the `*_ADJ` wildcard.
`STAR_ADP`	The token is the `*_ADP` wildcard.
`STAR_ADV`	The token is the `*_ADV` wildcard.
`STAR_CONJ`	The token is the `*_CONJ` wildcard.
`STAR_DET`	The token is the `*_DET` wildcard.
`STAR_NOUN`	The token is the `*_NOUN` wildcard.
`STAR_NUM`	The token is the `*_NUM` wildcard.
`STAR_PRON`	The token is the `*_PRON` wildcard.
`STAR_PRT`	The token is the `*_PRT` wildcard.
`STAR_VERB`	The token is the `*_VERB` wildcard.
`SENTENCE_START`	The token is the `_START_` token.
`SENTENCE_END`	The token is the `_END_` token.
`SLASH`	The token is the `/` operator.
`PREFIX`	The token has the `~` operator.
`TERM_GROUP`	The token is a term group.

SearchResponse

A representation of the outcome of a successfully processed query.

Properties

queryTokens QueryToken[]

A representation of the query after tokenization, which is an array of QueryToken objects.

ngrams NgramLite[]

A representation of the result set, which is an array of NgramLite objects.

nextPageToken string optional

An opaque token to be used in a follow-up request to fetch the next chunk of the result set. See pagination for details.

nextPageLink string optional

An absolute URL to issue a follow-up request. See pagination for details.

TotalCounts

Represents a lookup table for ngram total match counts by ngram length and year.

The table covers the full range from ngram length 1 to 5 combined with first year to last year of ngram occurrence. For combinations where no total match count is available the value is zero.

Properties

minYear int32

The first year of ngram occurrence.

maxYear int32

The last year of ngram occurrence.

matchCounts int64[][]

A matrix of ngram total match counts.

The first subscript in the range [0, 4] denotes the ngram length, with 1-gram counts at index 0, 2-gram counts at index 1, and so on.
The second subscript in the range [0, (maxYear - minYear)] denotes the year, with minYear at index 0, minYear + 1 at index 1, and so on.

Example: The total match count of 3-grams for the year 2000 in the English corpus is matchCounts[2][530] because minYear = 1470 and 2000 - minYear = 530.
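The index arithmetic can be wrapped in a small lookup helper, as in the Python sketch below. The function names are hypothetical. Dividing an ngram's absMatchCount by the looked-up total appears to reproduce the relMatchCount values in the examples on this page (3798 / 24012975299 is about 1.58e-7 for the 2-gram Hello World in 2019), but that relationship is inferred from the example data and should be treated as an assumption.

```python
def total_match_count(total_counts, length, year):
    """Look up the total match count for ngrams of `length` in `year`."""
    min_year, max_year = total_counts["minYear"], total_counts["maxYear"]
    if not 1 <= length <= 5:
        raise ValueError("length must be between 1 and 5")
    if not min_year <= year <= max_year:
        raise ValueError("year is outside the range covered by the table")
    # First subscript: ngram length minus 1. Second subscript: offset from minYear.
    return total_counts["matchCounts"][length - 1][year - min_year]


def relative_match_count(abs_match_count, total_counts, length, year):
    """Divide an absolute match count by the matching total (0.0 if the total is zero)."""
    total = total_match_count(total_counts, length, year)
    return abs_match_count / total if total else 0.0
```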
