Troubleshooting common errors

Connor Mendenhall edited this page Dec 22, 2021 · 3 revisions

This page describes common errors, common non-errors, and general troubleshooting.

Common errors

Failed deployment

Check the CloudWatch logs to see whether the containers started correctly. Run the diagnostic queries to check whether the data layer is still syncing. Header sync and extract diffs rarely change, so the usual culprit is a change made to the transformation layer containers.

Missing data

To diagnose missing urn or auction data, it's easiest to compare event and storage diff data. For every state-changing event, we should have a corresponding storage diff. Let's use an urn as an example. This query returns the current ink and art storage values, and embeds frob events for this urn. Within each embedded frob event, we also embed the historical ink and art state of the urn, that is, the ink and art values at the block height of the corresponding frob event:

{
  getUrn(ilkIdentifier: "ETH-A", urnIdentifier: "0xbB7497BAaF231B8b7D92e0cFf9BCf4F2018C2d2d") {
    # current `ink` and `art` (data comes from storage diffs)
    ink
    art
    # frob events (data comes from events)
    frobs(first: 3) {
      # there are probably more than 3 frobs, check here
      totalCount
      nodes {
        tx {
          # block height of the event
          blockHeight
        }
        # dink and dart parameters from the event
        dink
        dart
        urn {
          # ink and art state at the time of the event
          ink
          art
        }
      }
    }
  }
}

The above query returns a response like this:

{
  "data": {
    "getUrn": {
      "ink": "280130575429533717604",
      "art": "553661142717366142318575",
      "frobs": {
        "totalCount": 105,
        "nodes": [
          {
            "tx": {
              "blockHeight": "13767861"
            },
            "dink": "0",
            "dart": "47198512683467635276324",
            "urn": {
              "ink": "280130575429533717604",
              "art": "553661142717366142318575"
            }
          },
          {
            "tx": {
              "blockHeight": "13760904"
            },
            "dink": "0",
            "dart": "3776167629227625039464",
            "urn": {
              "ink": "280130575429533717604",
              "art": "506462630033898507042251"
            }
          },
          {
            "tx": {
              "blockHeight": "13760893"
            },
            "dink": "57670312298673291211",
            "dart": "0",
            "urn": {
              "ink": "280130575429533717604",
              "art": "502686462404670882002787"
            }
          }
        ]
      }
    }
  },
  "meta": {
    "graphqlQueryCost": 3
  }
}

These events show one dink (change to ink) and two darts (change to art). We should expect to see corresponding storage diffs for this urn at the block heights of these events. Here's one way to look for them:

{
  art1: allVatUrnArts(first: 100, filter: {storageDiffByDiffId: {blockHeight: {equalTo: "13767861"}}}) {
    nodes {
      art
      rawUrnByUrnId {
        identifier
      }
    }
  }
  art2: allVatUrnArts(first: 100, filter: {storageDiffByDiffId: {blockHeight: {equalTo: "13760904"}}}) {
    nodes {
      art
      rawUrnByUrnId {
        identifier
      }
    }
  }
  ink1: allVatUrnInks(first: 100, filter: {storageDiffByDiffId: {blockHeight: {equalTo: "13760893"}}}) {
    nodes {
      ink
      rawUrnByUrnId {
        identifier
      }
    }
  }
}
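
The block heights used in a query like the one above can be derived mechanically from the getUrn response. The following is a minimal Python sketch, not part of the project's tooling: the function name `expected_diffs` is hypothetical, and it assumes the response has the same shape as the example getUrn response shown earlier. A non-zero dink implies an expected ink diff, and a non-zero dart implies an expected art diff.

```python
from typing import Dict, List, Tuple


def expected_diffs(response: Dict) -> List[Tuple[str, str]]:
    """Return (block_height, field) pairs where a storage diff is expected.

    Hypothetical helper: assumes `response` is a parsed getUrn response
    with embedded frob events, shaped like the example on this page.
    """
    expected = []
    for frob in response["data"]["getUrn"]["frobs"]["nodes"]:
        height = frob["tx"]["blockHeight"]
        # The API returns large integers as strings; Python ints are
        # arbitrary precision, so int() is safe here.
        if int(frob["dink"]) != 0:
            expected.append((height, "ink"))
        if int(frob["dart"]) != 0:
            expected.append((height, "art"))
    return expected
```

For the three frob events in the example response, this yields art diffs at blocks 13767861 and 13760904 and an ink diff at block 13760893, which are the heights queried above.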

We could also look at a lower level, by querying diffs directly and embedding any associated vatUrnArtsByDiffId:

{
  allStorageDiffs(first: 100, filter: {blockHeight: {equalTo: "13767861"}}) {
    nodes {
      address
      storageKey
      storageValue
      vatUrnArtsByDiffId(first: 1) {
        nodes {
          rawUrnByUrnId {
            identifier
          }
        }
      }
    }
  }
}

Here, we expect to see one diff whose identifier matches the urn's address, 0xbB7497BAaF231B8b7D92e0cFf9BCf4F2018C2d2d.
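
When a block contains many diffs, scanning the response by eye is error-prone. As a rough sketch, the check could be scripted as below. The helper name `diffs_for_urn` is hypothetical, and the response shape is an assumption inferred from the allStorageDiffs query above rather than a captured response.

```python
from typing import Dict, List


def diffs_for_urn(response: Dict, urn_address: str) -> List[Dict]:
    """Return storage diff nodes whose embedded urn identifier matches.

    Hypothetical helper: assumes `response` mirrors the shape of the
    allStorageDiffs query with vatUrnArtsByDiffId embedded.
    """
    wanted = urn_address.lower()
    matches = []
    for diff in response["data"]["allStorageDiffs"]["nodes"]:
        for art in diff["vatUrnArtsByDiffId"]["nodes"]:
            urn = art.get("rawUrnByUrnId") or {}
            # Compare case-insensitively, since addresses may or may not
            # be checksummed.
            if (urn.get("identifier") or "").lower() == wanted:
                matches.append(diff)
                break
    return matches
```

An empty result for a block height where a frob event changed art suggests the storage diff is missing.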

In case of missing urn data, we can use the backfill urns script to populate it.

This worked example is for urns, but you can take a similar approach for other types of data, like auctions: look for the event history, match it with the storage history, and compare to see if any data is missing.

In the case of missing non-urn data, you will have to run a storage backfill over the block range of missing data.

Query timeouts

Queries may time out in production, especially large summary queries like getUrnsByIlk and allClips for large collateral types like ETH-A. If a query times out, the GraphQL API will return a message like this:

{
  "errors": [
    {
      "message": "canceling statement due to statement timeout",
      "locations": [
        {
          "line": 2,
          "column": 3
        }
      ],
      "path": [
        "getUrnsByIlk"
      ]
    }
  ],
  "data": {
    "getUrnsByIlk": null
  },
  "meta": {
    "graphqlQueryCost": 158
  }
}

In most cases, these queries are already limited to a maximum page size, and the end user can tune this parameter. However, in some cases we may need to add or change the limit. You can tune page size using the Postgraphile pagination cap parameters (start here), or by adding a max results parameter to the underlying SQL function.
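
On the client side, a timeout can be told apart from other GraphQL errors by its message. The sketch below is illustrative only: the function name `timed_out_paths` is hypothetical, and it matches on the message text from the example response above, which may differ across Postgres versions or configurations.

```python
from typing import Dict, List, Optional

# Message text taken from the example timeout response on this page.
TIMEOUT_MESSAGE = "canceling statement due to statement timeout"


def timed_out_paths(response: Dict) -> List[Optional[str]]:
    """Return the top-level query names that failed with a statement timeout."""
    paths = []
    for error in response.get("errors", []):
        if TIMEOUT_MESSAGE in error.get("message", ""):
            path = error.get("path") or []
            # The first path element is the top-level query field.
            paths.append(path[0] if path else None)
    return paths
```

A caller that sees a query name in the result might retry it with a smaller page size.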

Common non-errors

There are a few noisy log messages that can look concerning, but are not.

Unique constraint violations in log files (header sync, MCD execute). We occasionally receive duplicate headers due to re-orgs, which will log a message like this:

{
    "blockNumber": 13849045,
    "headerHash": "0xa0d85d3fef24a4d0c6e7ed32752140da61bc43dbbbe8b070abc75c0f65b90fab",
    "headerId": 6721269,
    "level": "warning",
    "msg": "error marking header checked: pq: insert or update on table \"checked_headers\" violates foreign key constraint \"checked_headers_header_id_fkey\"",
    "time": "2021-12-21T14:28:48Z"
}

Connection errors in header sync. Header sync occasionally fails to connect to the RPC endpoint, but it retries. These errors are only an issue if the failed connection persists:

{
    "SubCommand": "headerSync",
    "level": "error",
    "msg": "headerSync: ValidateHeaders failed: error creating validation window: Post \"https://geth0.mainnet.makerops.services/rpc\": dial tcp 3.222.28.184:443: connect: connection refused",
    "time": "2021-12-21T11:34:25Z"
}
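
When reading through a large volume of logs, it can help to flag these known-noisy messages so that unfamiliar ones stand out. The following is a hedged sketch, not an existing tool: `is_known_noise` and `BENIGN_PATTERNS` are hypothetical names, and the patterns are simply substrings taken from the two example messages above. It assumes one JSON object per log line.

```python
import json

# Substrings taken from the example log messages on this page.
BENIGN_PATTERNS = (
    "error marking header checked",
    "connect: connection refused",
)


def is_known_noise(log_line: str) -> bool:
    """Return True if a JSON log line matches a known noisy message."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return False  # not a JSON log line; leave it for a human to read
    if not isinstance(entry, dict):
        return False
    msg = str(entry.get("msg", ""))
    return any(pattern in msg for pattern in BENIGN_PATTERNS)
```

Flagging is safer than dropping these lines outright, because connection errors do matter if they persist.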

General troubleshooting

  • First, check if headers and diffs are correctly syncing using the diagnostic queries.
  • Look in Sentry for application errors.
  • Look at CloudWatch logs for each container.
  • Look at ECS logs for each service.