Author: @typerbole, GitHub, stack-stats.com
For the second installment of the Stack Stats series we're going to generate our own HODL Waves charts as popularized by Dhruv Bansal at Unchained Capital. In the previous notebook we used CoinMetrics' community data, which is mostly pre-calculated for us. This time around we're going to get our hands dirty and work directly with raw on-chain data, building familiarity with blockchain data structures that will serve us well for future tasks such as calculating Realized Cap.
This is the second notebook in the Stack Stats repository. See the Stack Stats readme for more about this project and for instructions on how to run the Jupyter notebook. This tutorial assumes a basic familiarity with Python and SQL.
The HODL Wave chart is a relatively simple idea: each UTXO (see here for an explanation of UTXOs) has a "birthday", the date at which it was created in a transaction. We can use this birthday to calculate the current age of a UTXO on a given date and bucket all UTXO into age bands we define. Visualizing the BTC value distribution across these age bands over time demonstrates clear patterns in BTC accumulation from market cycles. As Bansal described in the original blog post:
A common pattern after every rally in Bitcoin’s price is what we have named a “HODL wave.” A HODL wave is created when a large amount of Bitcoin transacts on the way up to and through a local price high, becoming recent BTC (1 day — 1 week old), and then slowly ages into each later band as its new owners HODL.
A HODL wave manifests visually on the chart as a pattern of nested curves caused by each age band becoming suddenly much fatter (taller) at progressively later times from the rally. The image below traces a few of the largest HODL waves.
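Before diving into SQL, the age-banding idea can be sketched in a few lines of Python. The band edges here are illustrative, matching the buckets we'll define later in the query:

```python
from datetime import date

# Illustrative age bands (upper bound in days), mirroring the SQL buckets we build later.
BANDS = [
    (1, "<1d"), (7, "1d-1w"), (28, "1w-1m"), (28 * 3, "1m-3m"),
    (28 * 6, "3m-6m"), (28 * 12, "6m-12m"), (28 * 24, "1y-2y"),
]

def age_band(created: date, as_of: date) -> str:
    """Bucket a UTXO into an age band based on its 'birthday'."""
    age_days = (as_of - created).days
    for upper, label in BANDS:
        if age_days < upper:
            return label
    return ">2y"

# A UTXO created on Jan 1 is 30 days old on Jan 31, landing in the 1m-3m band.
print(age_band(date(2021, 1, 1), date(2021, 1, 31)))  # -> 1m-3m
```

To draw the chart we just repeat this bucketing for every UTXO on every day and stack the per-band totals.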
Here is Unchained Capital's annotated version of the HODL wave chart, which we will recreate:
In an ideal world every good bitcoiner runs a full node and owns an absurd linux machine with 1TB RAM so they can analyze their personally validated copy of the blockchain directly with no third parties comfortably from their citadel.
Unfortunately it's pretty difficult for most of us to analyze the blockchain directly with common hardware (a laptop in my case), as this analysis is memory and CPU intensive. There are various open-source tools available for parsing the blockchain, but they require local storage of the blockchain (currently a few hundred GB) and are not very accessible to beginners.
Lucky for us Google provides BigQuery public datasets containing up-to-date SQL tables for Bitcoin transactions and blocks. These datasets are created and maintained by the folks behind the open source project Blockchain ETL.
Relying on Google for your Bitcoin consensus isn't anyone's idea of cypherpunk, but the data is public and Google's cloud provides fast, efficient and cheap data analysis so here we are. Don't rely on this data for anything other than fun charts.
You will need a Google account with BigQuery enabled to run the queries in this notebook. Google gives you 1TB of free BigQuery usage per month ("The first 1 TB of query data processed per month is free") and the query developed in this notebook processes ~160 GB.
You should be very careful when using BigQuery against the Bitcoin data as it is easy to run many queries, exceed your free usage and incur cloud fees. I am not responsible for your query usage.
If you don't want to go through the effort of pulling your own data with BigQuery, you can still follow along as I explain how to build the query; you'll learn about Bitcoin's data structures either way. I provide the end product as a .csv file so you can create the HODL wave chart without running BigQuery yourself.
Here is the schema for the Transactions table in BigQuery:
Each BTC transaction has one or more inputs (unless it is a coinbase transaction, flagged by is_coinbase) and one or more outputs. These inputs/outputs are uniquely identified by the transaction hash (hash) and their index (inputs.index / outputs.index) within that transaction.
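In other words, a TXO is globally identified by an "outpoint": the pair (transaction hash, output index). A minimal sketch of a UTXO set keyed by outpoints (the hashes and values below are made up):

```python
from collections import namedtuple

# An outpoint (tx hash, output index) uniquely identifies a TXO across the whole chain.
Outpoint = namedtuple("Outpoint", ["tx_hash", "index"])

utxo_set = {
    Outpoint("ab12...", 0): 50_0000_0000,  # value in satoshis (50 BTC)
    Outpoint("ab12...", 1): 1_0000_0000,   # second output of the same transaction
}

# Spending output 0 of transaction ab12... removes that outpoint from the set.
spent = utxo_set.pop(Outpoint("ab12...", 0))
print(spent, len(utxo_set))  # 5000000000 1
```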
To get started we'll want to create a table of transaction outputs (TXO) with the date they were created and the date they were spent (if they were ever spent). This will enable us to know whether a certain TXO existed on any given day, whether it was unspent on that day and how old it was on that day.
To do this we'll need to perform a join of the transaction outputs with the transaction inputs. If a TXO was ever spent, it will be listed as an input into another transaction.
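Before writing it in SQL, the join can be sketched with pandas on toy data (hashes and block numbers are invented): any output that never reappears as a spent input ends up with NULL input-side columns.

```python
import pandas as pd

# Toy outputs: each TXO is identified by (transaction_hash, output_index).
outputs = pd.DataFrame({
    "transaction_hash": ["tx_a", "tx_a", "tx_b"],
    "output_index": [0, 1, 0],
    "created_block_number": [100, 100, 105],
})

# Toy inputs: some later transaction spends output 0 of tx_a; the other two TXO stay unspent.
inputs = pd.DataFrame({
    "spent_transaction_hash": ["tx_a"],
    "spent_output_index": [0],
    "destroyed_block_number": [110],
})

txo = outputs.merge(
    inputs,
    left_on=["transaction_hash", "output_index"],
    right_on=["spent_transaction_hash", "spent_output_index"],
    how="left",  # unspent outputs keep NaN in the input-side columns
)
print(txo[["transaction_hash", "output_index", "destroyed_block_number"]])
```

The SQL below does exactly this, just at the scale of the full chain.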
Let's start by defining sub-queries that contain the raw data we will need for the join. To do this we'll use BigQuery's WITH syntax to define tables and build our query incrementally.
# Start our query using WITH statement so we can define subqueries
QUERY = '''
WITH
'''
# Append a subquery to our cumulative QUERY string
QUERY += '''
-- Outputs subquery: contains relevant information about a given output.
-- A TXO is created when it is an output of a transaction, so this contains
-- metadata about the TXO creation
output AS (
SELECT
transactions.hash AS transaction_hash,
transactions.block_number AS created_block_number,
transactions.block_timestamp AS created_block_ts,
outputs.index AS output_index,
outputs.value AS output_value
FROM
`bigquery-public-data.crypto_bitcoin.transactions` AS transactions,
transactions.outputs AS outputs
),
'''
QUERY += '''
-- Inputs subquery: contains relevant information about a given input.
-- A TXO is consumed when it is the input to a transaction, so this metadata
-- tells us about when a TXO is spent or destroyed
input AS (
SELECT
transactions.hash AS spending_transaction_hash,
inputs.spent_transaction_hash AS spent_transaction_hash,
transactions.block_number AS destroyed_block_number,
transactions.block_timestamp AS destroyed_block_ts,
inputs.spent_output_index,
inputs.value AS input_value
FROM
`bigquery-public-data.crypto_bitcoin.transactions` AS transactions,
transactions.inputs AS inputs
),
'''
Let's check in and see what our query looks like so far.
print(QUERY)
Looking good.
Now we perform a join of the outputs to the inputs to see if/when a given TXO was spent. Outputs in one transaction are linked as inputs to another based on the transaction hash and the output index.
QUERY += '''
-- txo subquery: joins outputs to inputs so that we know when/if a TXO is spent.
txo AS (
SELECT
output.transaction_hash,
output.created_block_number,
DATETIME(output.created_block_ts) AS created_block_ts,
-- Any field from the input table will be NULL if the TXO remains unspent.
input.spending_transaction_hash,
input.spent_transaction_hash,
input.destroyed_block_number,
DATETIME(input.destroyed_block_ts) AS destroyed_block_ts,
output.output_value
FROM
output
-- Use Left Join, as not all outputs will be linked as inputs in future transactions if they remain unspent.
LEFT JOIN
input
ON
-- Join an output to a future input based on the output transaction hash
-- matching the spent transaction hash of the input
output.transaction_hash = input.spent_transaction_hash
-- Also make sure the output index matches within the transaction hash
AND output.output_index = input.spent_output_index
),
'''
Checking in again:
print(QUERY)
Great, looks like we now have a functional table with TXO and relevant metadata. Now we need to create a time series to join against these TXO so we can do daily aggregation.
We could calculate the HODL wave distribution for every block, but there are ~144 blocks per day. We don't need that level of granularity so instead we'll calculate the distribution by taking a snapshot from the last block produced per day. This way we'll have one time series point per day.
To do this, we'll do a simple query against the Blocks table.
Here is the schema for the Blocks table in BigQuery:
QUERY += '''
-- blocks subquery: for each date get the final block for that date
blocks AS (
SELECT
DATE(timestamp) AS date,
-- Get last block per day
MAX(number) AS block_number,
MAX(DATETIME(timestamp)) AS block_ts
FROM
`bigquery-public-data.crypto_bitcoin.blocks`
GROUP BY
date)
'''
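For intuition, here's the same "last block per day" snapshot done in pandas on toy data (block numbers and timestamps are invented):

```python
import pandas as pd

# Toy blocks table: two blocks on Jan 1, one on Jan 2.
blocks = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 00:10", "2021-01-01 23:50", "2021-01-02 12:00",
    ]),
    "number": [1000, 1143, 1210],
})
blocks["date"] = blocks["timestamp"].dt.date

# One snapshot row per day: the last (highest) block number and its timestamp.
daily = blocks.groupby("date").agg(
    block_number=("number", "max"), block_ts=("timestamp", "max"))
print(daily)
```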
Now we have all the subqueries in place to do our final aggregation.
We'll cross join the txo table against our blocks table. For each row in the blocks table, which represents a day in our time series, we look at all the TXO in our set and keep only the TXO that were created on or before that day's final block and were still unspent as of that block.
Then we use the DATETIME_DIFF function to get the TXO age at that given block and bucket each UTXO into one of our HODL wave bands.
The original HODL waves chart is weighted by BTC value, meaning that the relative width of the age bands represents the share of BTC value corresponding to that band. There are some additional weightings that I think could be interesting:

1. BTC value weighting (the original)
2. Flat weighting, where every UTXO counts equally regardless of value
3. Flat weighting with dust filtered out (only UTXO of at least 0.01 BTC)
4. Realized cap weighting, where each UTXO is weighted by the BTC price at the time it was created

In this notebook we will cover 1 through 3.
Realized cap weighting requires us to upload price data into BigQuery and will be explored in the next Stack Stats tutorial on Realized Cap. Stay tuned!
QUERY += '''
-- final data aggregation query: join txo with blocks, keeping only txo
-- that were created and unspent as of that block, then bucket the txo
-- by age and sum the txo value per bucket per that day
SELECT
-- Time series metadata
blocks.date AS date,
blocks.block_number AS block_number,
blocks.block_ts AS block_ts,
-- BTC Value Weighting
-- Total UTXO value on that date
SUM(txo.output_value) AS total_utxo_value,
-- Our HODL Waves buckets, counting value of UTXO
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 1, txo.output_value, 0)) AS utxo_value_under_1d,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 1
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 7,
txo.output_value, 0)) AS utxo_value_1d_1w,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 7
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28,
txo.output_value, 0)) AS utxo_value_1w_1m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 3,
txo.output_value, 0)) AS utxo_value_1m_3m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 3
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 6,
txo.output_value, 0)) AS utxo_value_3m_6m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 6
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12,
txo.output_value, 0)) AS utxo_value_6m_12m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 18,
txo.output_value, 0)) AS utxo_value_12m_18m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 18
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 24,
txo.output_value, 0)) AS utxo_value_18m_24m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 2
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12 * 3,
txo.output_value, 0)) AS utxo_value_2y_3y,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 3
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12 * 5,
txo.output_value, 0)) AS utxo_value_3y_5y,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 5
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12 * 8,
txo.output_value, 0)) AS utxo_value_5y_8y,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 8,
txo.output_value, 0)) AS utxo_value_greater_8y,
-- Flat Weighting
-- Total UTXO count on that date
SUM(1) AS total_utxo_count,
-- Our HODL Waves buckets, counting number of UTXO
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 1, 1, 0)) AS utxo_count_under_1d,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 1
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 7,
1, 0)) AS utxo_count_1d_1w,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 7
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28,
1, 0)) AS utxo_count_1w_1m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 3,
1, 0)) AS utxo_count_1m_3m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 3
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 6,
1, 0)) AS utxo_count_3m_6m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 6
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12,
1, 0)) AS utxo_count_6m_12m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 18,
1, 0)) AS utxo_count_12m_18m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 18
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 24,
1, 0)) AS utxo_count_18m_24m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 2
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12 * 3,
1, 0)) AS utxo_count_2y_3y,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 3
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12 * 5,
1, 0)) AS utxo_count_3y_5y,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 5
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12 * 8,
1, 0)) AS utxo_count_5y_8y,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 8,
1, 0)) AS utxo_count_greater_8y,
-- Flat weighting, filtered
-- Total UTXO count on that date (>= 0.01 BTC)
SUM(IF(txo.output_value / 100000000 >= 0.01, 1, 0)) AS total_utxo_count_filter,
-- Our HODL Waves buckets, counting number of UTXO (>= 0.01 BTC)
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 1
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_under_1d,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 1
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 7
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_1d_1w,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 7
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_1w_1m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 3
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_1m_3m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 3
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 6
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_3m_6m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 6
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_6m_12m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 18
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_12m_18m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 18
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 24
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_18m_24m,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 2
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12 * 3
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_2y_3y,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 3
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12 * 5
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_3y_5y,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 5
AND DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) < 28 * 12 * 8
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_5y_8y,
SUM(IF(DATETIME_DIFF(blocks.block_ts, txo.created_block_ts, DAY) >= 28 * 12 * 8
AND txo.output_value / 100000000 >= 0.01,
1, 0)) AS utxo_count_filter_greater_8y
FROM
blocks
CROSS JOIN
txo
WHERE
-- Only include TXO that were created on or before the given block
blocks.block_number >= txo.created_block_number
-- Only include TXO that were unspent as of the given block
AND (
-- TXO that are spent after the given block are included
blocks.block_number < txo.destroyed_block_number
-- TXO that are never spent are also included
OR txo.destroyed_block_number IS NULL)
GROUP BY
date, block_number, block_ts
ORDER BY
date ASC;
'''
print(QUERY)
That is our full query to get the HODL waves data!
I recommend copy/pasting the query into the BigQuery web UI and running it from there. You can also run the query from this notebook using the pandas BigQuery API (pandas-gbq) if desired.
The query takes 10-60 minutes to run depending on BigQuery availability, and when it's done you'll have a time series with the HODL waves distribution. You can then save the query output as a CSV, which we'll now be loading into the notebook and plotting.
import pandas as pd
import os
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'
import chart_utils
# try:
# os.remove("btc.csv")
# except FileNotFoundError:
# pass
# !wget https://coinmetrics.io/newdata/btc.csv
# Save your own version of the HODL waves query output, or use the version from the repo
waves = pd.read_csv("./data/hodl_waves.csv")
# Load in CoinMetrics BTC data to get daily price
price = pd.read_csv("btc.csv",
usecols=['date', 'PriceUSD', 'SplyCur'])
# Join the price data onto the waves dataframe
waves = waves.merge(price, on='date')
print(waves.columns.values)
waves.head()
As a quick sanity check, let's compare our total_utxo_value to the SplyCur column from CoinMetrics. Both should correspond to the current cumulative stock of BTC on a given date.
# Since our UTXO value columns are denominated in sats, we must divide by 100M to get to BTC.
waves['OurSplyCur'] = waves['total_utxo_value'] / 100000000
waves['SplyCurDelta'] = waves['OurSplyCur'] - waves['SplyCur']
chart_utils.two_axis_chart(
waves, x_series='date', y1_series=['OurSplyCur', 'SplyCur'], y2_series='SplyCurDelta',
title='Our calculated supply vs CoinMetrics',
y1_series_axis_type='linear',
y2_series_axis_type='linear', y2_series_axis_format="{n}")
Our calculated supply matches CoinMetrics' pretty closely. This gives me some confidence that we didn't screw anything up too badly and that we are correctly accounting for UTXO.
But there is a small, consistent delta between our data and theirs, and it seems to jump in a stepwise manner right after a halving. This suggests to me that our date conventions are different. Currently we calculate ~91.2 BTC greater supply at the end of our day than they do. Since 91.2 BTC is ~7 blocks' worth of block rewards, they could just end their "day" a little over an hour before us, excluding the BTC issued in that hour. This wouldn't account for why the delta used to be negative, though...
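As a quick sanity check on that arithmetic (assuming the 12.5 BTC block subsidy in effect at the time of writing):

```python
# Back-of-the-envelope: how many blocks' worth of subsidy is the ~91.2 BTC delta?
delta_btc = 91.2
subsidy_btc = 12.5                        # block subsidy at the time of writing
blocks_worth = delta_btc / subsidy_btc    # ~7.3 blocks
minutes = blocks_worth * 10               # blocks arrive roughly every 10 minutes
print(round(blocks_worth, 1), "blocks, ~", round(minutes), "minutes")
```

So a ~73-minute offset in day boundaries would roughly explain the gap.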
Would love to hear other theories about this.
Now that we have some confidence that our data is sane, all that is left is to plot the beautiful HODL waves. I provide a function in chart_utils.py, hodl_waves_chart(), that takes our waves df as input and makes a nice chart.
chart_utils.hodl_waves_chart(waves, version='value')
This matches the original Unchained Capital version very well. You can clearly see the same wave patterns they identified.
Does it look different if we use the TXO count version, rather than the total value version?
chart_utils.hodl_waves_chart(waves, version='count')
The TXO count chart is much smoother but has large dislocations in May 2011, April 2013 and especially July 2015. The first two dislocations correspond to local price maxima, while the third matches up with a local price minimum. Interesting.
What if we exclude small UTXO (< 0.01 BTC) and only count larger value UTXO?
chart_utils.hodl_waves_chart(waves, version='count_filter')
The big July 2015 dislocation mostly disappears when we ignore low-value outputs. There are also now pretty clear spikes in short-held (< 6 months) UTXO around local price maxima. Let's explore this some more.
# Calculate share of non-dust UTXO held for under 6 months
waves['short_held_txo_pct'] = (
waves['utxo_count_filter_under_1d'] + waves['utxo_count_filter_1d_1w'] + waves['utxo_count_filter_1w_1m']
+ waves['utxo_count_filter_1m_3m'] + waves['utxo_count_filter_3m_6m']
) / waves['total_utxo_count_filter']
chart_utils.two_axis_chart(
waves, x_series='date', y1_series='short_held_txo_pct', y2_series='PriceUSD',
title='Share of non-dust UTXO held for under 6 months',
y1_series_axis_type='linear', y1_series_axis_range=[0, 1], y1_series_axis_format=",.0%",
y2_series_axis_type='log', y2_series_axis_range=[-2, 6])
The share of UTXO >= 0.01 BTC held for under 6 months surges around (and sometimes lags) market peaks and wanes around troughs, with a long-term downward trend.
The pattern that stands out to me is the rapid spikes in share: when the market peaks the share of young UTXO nearly goes vertical. Therefore the slope of this series might be interesting to look at as an indicator.
# Calculate the average daily change over a trailing 14-day window (i.e. the slope)
# of the short-term held UTXO series
waves['short_held_txo_pct_DoD_delta'] = (
waves['short_held_txo_pct'] - waves['short_held_txo_pct'].shift(14)) / 14
chart_utils.two_axis_chart(
waves, x_series='date', y1_series='short_held_txo_pct_DoD_delta', y2_series='PriceUSD',
title='Stack Rate: Daily change in percent of UTXO held for under 6 months',
y1_series_axis_type='linear', y1_series_axis_format=",.2%",
y2_series_axis_type='log', y2_series_axis_range=[-2, 6])
The daily change in share of UTXO held for under 6 months, which I call the Stack Rate, is positive in the period preceding a market top and usually hits a peak of 0.3% per day before correcting during a bear market.
Thanks for following along and I hope you learned something.
If you enjoyed this tutorial you can follow me on Twitter @typerbole, where I will continue to publish tutorials and Bitcoin data science content. Feel free to DM me with feedback or suggestions, or email me at [my twitter handle] at pm.me.