
PLAT-63: Fix examples; Overhaul documentation
This MR updates the format, content, and structure of the Python API client documentation.
victoreram committed Jan 8, 2025
1 parent 083e1fb commit be404e6
Showing 139 changed files with 25,657 additions and 8,604 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -146,3 +146,4 @@ examples/*.txt
examples/*/*.txt

.idea/
.DS_Store
3 changes: 2 additions & 1 deletion .gitlab-ci.yml
@@ -96,10 +96,11 @@ generate-docs:
- echo "update version:"
- echo $UPDATE_VERSION
- bash update_version.sh $UPDATE_VERSION
- export PYTHONPATH=/ && pydoc-markdown -m coinmetrics.api_client > docs/docs/api_client.md
- export PYTHONPATH=/ && pydoc-markdown -m coinmetrics.api_client > docs/docs/reference/api_client.md
- cp -f README.md docs/docs/index.md
- cp -f FlatFilesExport.md docs/docs/FlatFilesExport.md
- cp -f CHANGELOG.md docs/docs/CHANGELOG.md
- cp -f examples/README.md docs/docs/user-guide/examples.md
- cd docs && mkdocs build
- git add --all -- :!api-client-python/
- git status
520 changes: 23 additions & 497 deletions README.md

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Binary file added docs/docs/assets/images/cm-dark-combination.png
520 changes: 23 additions & 497 deletions docs/docs/index.md

Large diffs are not rendered by default.

File renamed without changes.
File renamed without changes.
20 changes: 20 additions & 0 deletions docs/docs/stylesheets/extra.css
@@ -0,0 +1,20 @@
:root {
--md-primary-fg-color: #495070;
--md-primary-fg-color--light: #FFFFFF;
--md-primary-fg-color--dark: #161823;
--md-typeset-a-color: #757CA1;

}
/* a:hover {
text-decoration: underline;
} */
/*
a {
color: #1E2130;
text-decoration: none;
} */

a.custom {
color: var(--primary-color);
text-decoration: underline;
}
File renamed without changes.
111 changes: 111 additions & 0 deletions docs/docs/user-guide/best-practices.md
@@ -0,0 +1,111 @@
# Best Practices

## Parallel Execution
There are times when it is useful to pull in large amounts of data at once. The most effective way to do this with the Coin Metrics API is to split your request into many smaller queries. This functionality is built directly into the API Client to allow for faster data export:

```python
import os
from coinmetrics.api_client import CoinMetricsClient


if __name__ == '__main__':
    client = CoinMetricsClient(os.environ['CM_API_KEY'])
    coinbase_eth_markets = [market['market'] for market in client.catalog_market_candles(exchange="coinbase", base="eth")]
    start_time = "2022-03-01"
    end_time = "2023-05-01"
    client.get_market_candles(
        markets=coinbase_eth_markets,
        start_time=start_time,
        end_time=end_time,
        page_size=1000
    ).parallel().export_to_json_files()
```

This feature splits the request across multiple threads and either stores the results in separate files (in the case of `.parallel().export_to_csv_files()` and `.parallel().export_to_json_files()`)
or combines them into one file or data structure (in the case of `.parallel().to_list()`, `.parallel().to_dataframe()`, and
`.parallel().export_to_json()`). In order to send more requests per second to the Coin Metrics API, this feature uses the
[parallel tasks features in Python's concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html)
package. This means that when using this feature, the API Client will use significantly more resources and may approach
the Coin Metrics rate limits.
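
If the result set is small enough to hold in memory, the same parallel request can instead be collected into a single object. A minimal sketch of the single-output path (the markets and dates are illustrative):

```python
import os
from coinmetrics.api_client import CoinMetricsClient


if __name__ == '__main__':
    client = CoinMetricsClient(os.environ['CM_API_KEY'])
    # Each market is fetched in its own worker, then the pages are joined
    # into one DataFrame. File-based exports scale better for very large pulls.
    df = client.get_market_candles(
        markets=["coinbase-btc-usd-spot", "coinbase-eth-usd-spot"],
        start_time="2023-01-01",
        end_time="2023-02-01",
        page_size=1000
    ).parallel().to_dataframe()
    print(df.head())
```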

In terms of resource usage and speed, these methods are ordered from most performant to least:
* `.export_to_json_files()`
* `.export_to_csv_files()`
* `.to_list()`
* `.export_to_json()`
* `.to_dataframe()`

### Splitting Parameter Queries
The `time_increment` parameter can be used to split a single query into many queries based on time range, which are then
combined afterwards. Consider this example, where we split a year's worth of minute-frequency ReferenceRateUSD data for
several assets into parallel requests by month to speed up the export:
```python
import datetime
import os

from coinmetrics.api_client import CoinMetricsClient
from dateutil.relativedelta import relativedelta

client = CoinMetricsClient(os.environ.get("CM_API_KEY"))
assets = ["btc", "eth", "sol"]

if __name__ == '__main__':
    start_time = datetime.datetime.now()
    client.get_asset_metrics(
        assets=assets,
        metrics="ReferenceRateUSD",
        frequency="1m",
        start_time="2022-03-01",
        end_time="2023-03-01",
        page_size=1000,
        end_inclusive=False).parallel(
        time_increment=relativedelta(months=1)).export_to_csv("btcRRs.csv")
    print(f"Time taken parallel: {datetime.datetime.now() - start_time}")

    start_time = datetime.datetime.now()
    client.get_asset_metrics(
        assets=assets,
        metrics="ReferenceRateUSD",
        frequency="1m",
        start_time="2022-03-01",
        end_time="2023-03-01",
        page_size=1000,
        end_inclusive=False).export_to_csv("btcRRsNormal.csv")
    print(f"Time taken normal: {datetime.datetime.now() - start_time}")
```
Notice that we pass `time_increment=relativedelta(months=1)`, which splits the request up by month in addition to by asset.
This runs a total of 36 separate threads: 12 for each month times 3 for each asset.
The difference in run time is dramatic:
```commandline
Exporting to dataframe type: 100%|██████████| 36/36 [00:00<00:00, 54.62it/s]
Time taken parallel: 0:00:36.654147
Time taken normal: 0:05:20.073826
```

Please note that for short time periods you can pass a `time_increment` as a `datetime.timedelta` to specify increments of up to
several weeks; for larger time frames, use `dateutil.relativedelta.relativedelta` to split requests
up by increments of months or years.
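
As a short sketch of the two options (the asset, dates, and output directories are illustrative):

```python
import os
from datetime import timedelta

from coinmetrics.api_client import CoinMetricsClient
from dateutil.relativedelta import relativedelta


if __name__ == '__main__':
    client = CoinMetricsClient(os.environ['CM_API_KEY'])
    # Short window: timedelta splits one month of data into one request per week.
    client.get_asset_metrics(
        assets="btc",
        metrics="ReferenceRateUSD",
        frequency="1m",
        start_time="2023-01-01",
        end_time="2023-02-01",
        end_inclusive=False
    ).parallel(time_increment=timedelta(weeks=1)).export_to_csv_files("./weekly")
    # Long window: relativedelta splits a full year into one request per month.
    client.get_asset_metrics(
        assets="btc",
        metrics="ReferenceRateUSD",
        frequency="1m",
        start_time="2022-01-01",
        end_time="2023-01-01",
        end_inclusive=False
    ).parallel(time_increment=relativedelta(months=1)).export_to_csv_files("./monthly")
```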


## General Parallelization Guidelines
* If you are using a small `page_size` while trying to export a large amount of data, this will be your biggest bottleneck.
Generally the fastest `page_size` is `1000` to `10000`.
* If you are unsure why an action is taking a long time, running the `CoinMetricsClient` with `verbose=True` or `debug=True`
can give better insight into what is happening under the hood.
* The parallel feature is best used when you are exporting a large amount of data that can be split by query parameters into
many smaller requests. A good example of this is market candles over a long time frame: if you are querying hundreds
of markets and are sure there will be data, using `.parallel().export_to_csv_files("...")` can yield a huge performance
increase. If you are only querying a single market, you will not see a difference.
* The parallel feature is highly configurable; there are several configuration options that may suit advanced
users, such as tweaking the `max_workers` parameter or changing the default `ProcessPoolExecutor` to a `ThreadPoolExecutor` (see the sketch after this list).
* Multithreaded code is inherently more complex; it will be harder to debug issues with long-running queries
when running parallel exports than with normal single-threaded code.
* For that reason, this tool is best suited for exporting historical data rather than supporting a real-time production
system.
* The methods that create a separate file for each thread, `.export_to_csv_files()` and `.export_to_json_files()`, are the
safest and most performant to use. The methods that return a single output (`.export_to_csv()`, `.export_to_json()`, `.to_list()`, and
`.to_dataframe()`) need to join the data from many threads before it can be returned; this may use a lot of memory
if you are accessing data types like market orderbooks or market trades, and could fail altogether.
* If using the `export_to_csv_files()`/`export_to_json_files()` functions, note that by default the files are saved in the directory format `/{endpoint}/{parallelize_on}`.
For example, with `export_to_json_files()`,
`client.get_market_trades("coinbase-eth-btc-spot,coinbase-eth-usdc-spot").parallel("markets")` will create one file per market, e.g. `./market-trades/coinbase-eth-btc-spot.json` and `./market-trades/coinbase-eth-usdc-spot.json`, while
`client.get_asset_metrics('btc,eth', 'ReferenceRateUSD', start_time='2024-01-01', limit_per_asset=1).parallel("assets,metrics", time_increment=timedelta(days=1))`
will create one file per asset/metric combination, e.g. `./asset-metrics/btc/ReferenceRateUSD/start_time=2024-01-01T00-00-00Z.json` and `./asset-metrics/eth/ReferenceRateUSD/start_time=2024-01-01T00-00-00Z.json`.
* If you get the error `BrokenProcessPool`, it [might be because you're missing a main() function](https://stackoverflow.com/questions/15900366/all-example-concurrent-futures-code-is-failing-with-brokenprocesspool).
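
As a minimal sketch of the configuration note above, this assumes `verbose=True` is accepted by the client constructor and `max_workers` by `.parallel()`, as the guidelines suggest; the worker count is illustrative:

```python
import os
from coinmetrics.api_client import CoinMetricsClient


if __name__ == '__main__':
    # verbose=True logs each request, which helps explain slow exports.
    client = CoinMetricsClient(os.environ['CM_API_KEY'], verbose=True)
    # Cap the worker count to stay comfortably within the API rate limits.
    client.get_market_trades(
        "coinbase-eth-btc-spot,coinbase-eth-usdc-spot"
    ).parallel("markets", max_workers=4).export_to_json_files()
```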
