Author: @typerbole, GitHub, stack-stats.com
This is the fourth notebook in the Stack Stats repository. See the Stack Stats readme for more about this project and further instructions on how to run the Jupyter notebook. This tutorial assumes a basic familiarity with Python and SQL.
In this tutorial we are going to pull some data on the fees paid for Bitcoin transactions to be included in Bitcoin blocks. Block space is a scarce resource. From Wikipedia: "The on-chain transaction processing capacity of the bitcoin network is limited by the average block creation time of 10 minutes and the block size limit of 1 megabyte." Because block space is scarce, users typically pay fees to get their transactions included in Bitcoin blocks. Fees are typically denominated in Sats/vByte, and it is this price that helps allocate the scarce resource of Bitcoin block space.
The organic emergence of a block space market is a necessary condition for Bitcoin's security model to scale past the era of large block subsidies, which are decreasing by 50% every four years (AKA the halving). So it's important to track how this market is evolving.
In the first part of this notebook we'll plot some heatmaps to show how the block space price distribution changes over time. In the second part we'll try to fit a model that predicts block space price as a function of transaction volume demand.
Block space is usually priced as:
fee paid / transaction size (vBytes)
For the numerator we can measure the fee paid in either BTC or USD. Both are useful, and each tells a slightly different story.
For the denominator the transaction size is measured in vBytes, which is the size of the transaction with some adjustments made for SegWit. See this Stack Exchange post for an explainer of vBytes.
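To make the arithmetic concrete, here is a toy example of both versions of the calculation. All numbers are made up for illustration.
# Toy example of the block space price calculation (numbers are made up)
fee_sats = 14_000      # total fee paid, in satoshis
virtual_size = 250     # transaction size, in vBytes
price_usd = 30_000     # BTC/USD price on the transaction date
sats_per_vbyte = fee_sats / virtual_size                     # 56.0 sats/vByte
usd_per_vbyte = (fee_sats / 1e8) * price_usd / virtual_size  # ~$0.0168 per vByte
print(sats_per_vbyte, usd_per_vbyte)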
To obtain our fee data we are going to run a simple query against the Bitcoin transactions SQL table on Google BigQuery. I cover how to do this in more detail in tutorial #2, 02_HODLWavesPart1, so I will not repeat the basics here and will move right along to the query. You will also need to import a private table into your BigQuery project with price data from CoinMetrics, as described in tutorial #3, 03_HODLWavesPart2RealizedCap.
You should be very careful when using BigQuery against the Bitcoin data as it is easy to run many queries, exceed your free usage and incur cloud fees. I am not responsible for your query usage.
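One way to protect yourself is to check how much data a query would scan before running it: BigQuery supports dry runs that estimate bytes processed without executing the query. Here is a minimal sketch using the google-cloud-bigquery client (this assumes you have the client library installed and authenticated; the example query is just a placeholder).
# Estimate how many bytes a query would scan, without actually running it
from google.cloud import bigquery

client = bigquery.Client()
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_run_job = client.query(
    "SELECT COUNT(*) FROM `bigquery-public-data.crypto_bitcoin.transactions`",
    job_config=dry_run_config)
print("This query would process {:.2f} GB".format(dry_run_job.total_bytes_processed / 1e9))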
Below is the simple query we will run to get our data. It should be pretty self-explanatory. As usual I will provide the output CSV so you don't have to run the query yourself.
I created fee buckets for transactions in Sats and USD, roughly based on 5-percentile increments of the fee-rate distribution.
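For reference, this is roughly how such bucket edges can be derived from a sample of fee rates before rounding them to friendlier numbers; the sample below is a made-up stand-in, not the real data.
# Illustrative only: derive bucket edges at every 5th percentile of a fee-rate sample
import numpy as np
sample_fee_rates = np.random.lognormal(mean=2.0, sigma=1.5, size=100_000)  # stand-in for real sats/vByte data
edges = np.percentile(sample_fee_rates, np.arange(0, 105, 5))
print(np.round(edges, 1))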
QUERY = '''
-- This is the coinmetrics BTC community data imported into BigQuery, as described in tutorial #3
WITH cm AS (
SELECT
date,
PriceUSD
FROM
-- ** YOU WILL HAVE TO REPLACE THE PROJECT NAME HERE TO REFLECT YOUR OWN BIGQUERY TABLE **
`replace_this_project.bitcoin.cm_btc`),
fees AS
(SELECT
-- Transaction date
DATE(tx.block_timestamp) AS date,
-- Fee in sats/vByte. Rounded to nearest integer and capped at 1k to reduce row size
IF(ROUND(tx.fee / tx.virtual_size, 0) > 1000, 1000, ROUND(tx.fee / tx.virtual_size, 0)) AS sats_per_vbyte,
-- Fee bucket in sats/vByte
CASE
WHEN tx.fee / tx.virtual_size = 0 THEN '0'
WHEN tx.fee / tx.virtual_size < 2 THEN '0-2'
WHEN tx.fee / tx.virtual_size < 4 THEN '2-4'
WHEN tx.fee / tx.virtual_size < 6 THEN '4-6'
WHEN tx.fee / tx.virtual_size < 9 THEN '6-9'
WHEN tx.fee / tx.virtual_size < 12 THEN '9-12'
WHEN tx.fee / tx.virtual_size < 18 THEN '12-18'
WHEN tx.fee / tx.virtual_size < 24 THEN '18-24'
WHEN tx.fee / tx.virtual_size < 30 THEN '24-30'
WHEN tx.fee / tx.virtual_size < 40 THEN '30-40'
WHEN tx.fee / tx.virtual_size < 50 THEN '40-50'
WHEN tx.fee / tx.virtual_size < 60 THEN '50-60'
WHEN tx.fee / tx.virtual_size < 80 THEN '60-80'
WHEN tx.fee / tx.virtual_size < 100 THEN '80-100'
WHEN tx.fee / tx.virtual_size < 130 THEN '100-130'
WHEN tx.fee / tx.virtual_size < 170 THEN '130-170'
WHEN tx.fee / tx.virtual_size < 220 THEN '170-220'
WHEN tx.fee / tx.virtual_size < 350 THEN '220-350'
WHEN tx.fee / tx.virtual_size < 600 THEN '350-600'
WHEN tx.fee / tx.virtual_size >= 600 THEN '600+'
ELSE 'NA' END AS sats_per_vbyte_bucket,
-- Fee in USD/vByte. Rounded to six decimal places and capped at $100 to reduce row size
IF(ROUND((tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size, 6) > 100, 100, ROUND((tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size, 6)) AS usd_per_vbyte,
-- Fee bucket in USD/vByte
CASE
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size = 0 THEN '$0'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.00005 THEN '$0-$0.00005'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.00008 THEN '$0.00005-$0.00008'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.0001 THEN '$0.00008-$0.0001'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.00015 THEN '$0.0001-$0.00015'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.0002 THEN '$0.00015-$0.0002'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.00025 THEN '$0.0002-$0.00025'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.0003 THEN '$0.00025-$0.0003'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.0004 THEN '$0.0003-$0.0004'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.0005 THEN '$0.0004-$0.0005'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.00065 THEN '$0.0005-$0.00065'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.0009 THEN '$0.00065-$0.0009'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.001 THEN '$0.0009-$0.001'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.0015 THEN '$0.001-$0.0015'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.002 THEN '$0.0015-$0.002'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.003 THEN '$0.002-$0.003'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.0045 THEN '$0.003-$0.0045'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.007 THEN '$0.0045-$0.007'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.01 THEN '$0.007-$0.01'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size < 0.15 THEN '$0.01-$0.15'
WHEN (tx.fee * cm.PriceUSD / 100000000) / tx.virtual_size >= 0.15 THEN '$0.15+'
ELSE 'NA' END AS usd_per_vbyte_bucket,
tx.fee AS fee,
tx.fee * cm.PriceUSD / 100000000 AS fee_usd,
tx.virtual_size AS virtual_size
FROM
`bigquery-public-data.crypto_bitcoin.transactions` AS tx
JOIN
cm
ON
DATE(tx.block_timestamp) = cm.date
WHERE
NOT(tx.is_coinbase))
-- Run and save this query for the data in part 1
SELECT
month,
bucket,
bucket_type,
tx_count
FROM
(SELECT
-- Aggregate per month to make data more manageable
DATE_TRUNC(date, MONTH) AS month,
sats_per_vbyte_bucket AS bucket,
'sats' AS bucket_type,
SUM(1) AS tx_count
FROM
fees
GROUP BY
month,
bucket,
bucket_type)
UNION ALL
(SELECT
-- Aggregate per month to make data more manageable
DATE_TRUNC(date, MONTH) AS month,
usd_per_vbyte_bucket AS bucket,
'usd' AS bucket_type,
SUM(1) AS tx_count
FROM
fees
GROUP BY
month,
bucket,
bucket_type)
ORDER BY
month ASC, bucket_type, bucket
;
-- Run and save this query for the data in part 2
-- SELECT
-- date,
-- SUM(fee) AS sum_fees_sats,
-- SUM(fee_usd) AS sum_fees_usd,
-- SUM(virtual_size) AS sum_block_vbytes
-- FROM
-- fees
-- GROUP BY
-- date
-- ORDER BY
-- date ASC;
'''
print(QUERY)
import os
import pandas as pd
import numpy as np
import datetime
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.animation as animation_mpl
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'
import chart_utils
import analysis_utils
Now that we have our data from BigQuery we can import it into our Jupyter notebook and create the plots.
# Download CoinMetrics data if needed
# try:
# os.remove("btc.csv")
# except FileNotFoundError:
# pass
# !wget https://coinmetrics.io/newdata/btc.csv
# Read in the transaction fee data
fee_data = pd.read_csv("data/04_block_space_price_heat.csv")
# Add in some useful date columns for aggregation
fee_data = analysis_utils.get_extra_datetime_cols(fee_data, 'month')
# Get list of unique months
months = fee_data['month_string'].unique()
# Aggregate data by Sats fee bucket
fee_data_sats = fee_data.loc[fee_data['bucket_type'] == 'sats'].reset_index(drop=True). \
groupby(['month_string', 'bucket'], group_keys=False).sum()[['tx_count']]
# Aggregate data by USD fee bucket
fee_data_usd = fee_data.loc[fee_data['bucket_type'] == 'usd'].reset_index(drop=True). \
groupby(['month_string', 'bucket'], group_keys=False).sum()[['tx_count']]
# Read in price and transaction volume data from CoinMetrics
price_data = pd.read_csv(
"btc.csv"
)
# Add in some useful date columns for aggregation
price_data = analysis_utils.get_extra_datetime_cols(price_data, 'date')
# Calculate some rolling averages of transaction volume and transaction count
price_data['transaction_volume_usd'] = price_data['TxTfrValUSD'].rolling(28).mean()
price_data['transaction_volume_btc'] = price_data['TxTfrValNtv'].rolling(28).mean()
price_data['transaction_count'] = price_data['TxCnt'].rolling(28).mean()
# Calculate monthly averages for the price and transaction volume data
price_data_monthly = price_data.groupby(['month_string'], group_keys=False).mean()[
['PriceUSD', 'transaction_volume_btc', 'transaction_volume_usd', 'transaction_count']]
I put together a function in chart_utils.py, block_space_price_heatmap(), that will plot our heatmap charts. See chart_utils.py if you're interested in the plotting details.
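If you don't want to dig into chart_utils.py, here is a simplified sketch of the general idea (not the actual implementation): pivot the aggregated data into a bucket-by-month matrix, normalize each month to transaction shares, and plot it as a colored mesh. The real block_space_price_heatmap() adds proper bucket ordering, the monthly price series, and styling on top of this.
# Simplified sketch of the heatmap idea (illustrative, not the chart_utils implementation)
heat = (fee_data_sats.reset_index()
        .pivot(index='bucket', columns='month_string', values='tx_count')
        .fillna(0))
heat = heat / heat.sum(axis=0)  # share of transactions per bucket within each month
fig, ax = plt.subplots(figsize=(14, 6))
mesh = ax.pcolormesh(heat.values, cmap='viridis')
ax.set_yticks(np.arange(len(heat.index)) + 0.5)
ax.set_yticklabels(heat.index)
ax.set_xticks(np.arange(0, len(heat.columns), 12) + 0.5)
ax.set_xticklabels(heat.columns[::12], rotation=45)
ax.set_ylabel('Fee bucket (sats/vByte)')
fig.colorbar(mesh, label='Share of transactions')
plt.show()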
chart_utils.block_space_price_heatmap(
fee_data_sats, months, price_data_monthly, type='sats', data_source='Bitcoin ETL & CoinMetrics')
You can spot definite trends in the distribution of block space price denominated in Sats over time. I see four eras of fee distribution:
Let's take a look at the same chart with fees denominated in USD.
chart_utils.block_space_price_heatmap(
fee_data_usd, months, price_data_monthly, type='usd', data_source='Bitcoin ETL & CoinMetrics')
This chart, with fees and volume denominated in USD, looks more responsive. This makes sense because USD is still likely the chief unit of account for most Bitcoin businesses and users.
Let's try to quantify this relationship in the next section.
import statsmodels.api as sm
block_space_daily = pd.read_csv("data/04_block_space_daily.csv")
# Join the price data onto the blockspace dataframe
block_space_daily = block_space_daily.merge(price_data[
['date', 'PriceUSD', 'datetime', 'month_string', 'week_string', 'year_string',
'TxTfrValUSD', 'TxTfrValNtv']], on='date')
# Calculate daily mean fee rate
block_space_daily['fees_sats_per_vbyte'] = block_space_daily['sum_fees_sats'] / block_space_daily['sum_block_vbytes']
block_space_daily['fees_usd_per_vbyte'] = block_space_daily['sum_fees_usd'] / block_space_daily['sum_block_vbytes']
# Let's only look at data from 2018 onward, as this is when SegWit usage hit critical mass
block_space_daily_recent = block_space_daily.loc[block_space_daily['date'] >= '2018-01-01']
Let's plot the relationship between daily USD transaction volume and daily mean fee rate (USD / vByte).
fig = plt.figure(
figsize=[12, 6],
clear=True,
tight_layout=True
)
ax = plt.scatter(
block_space_daily_recent['TxTfrValUSD'], block_space_daily_recent['fees_usd_per_vbyte'],
c=block_space_daily_recent['datetime']).axes
ax.set_yscale("log")
ax.set_ylim(0.0001, 1)
ax.set_yticklabels(['${:.4f}'.format(x) for x in ax.get_yticks()])
ax.set_ylabel('Daily Mean Fee Rate (Total $USD Fees / Total Block vByte)', fontsize=13)
ax.set_xscale("log")
ax.set_xlim(1000000000, 1000000000000)
ax.set_xticklabels(['${:.0f}B'.format(x / 1000000000) for x in ax.get_xticks()])
ax.set_xlabel('Daily USD Transaction Volume', fontsize=18)
ax.set_title('Daily Bitcoin USD Transaction Volume vs. Daily Fee Rate', fontsize=18)
ax.text(
1, -0.1,
"By: @typerbole",
transform=ax.transAxes,
horizontalalignment='center',
verticalalignment='center', fontsize=16
)
ax.text(
0, -0.1,
"Data: Bitcoin ETL & CoinMetrics",
transform=ax.transAxes,
horizontalalignment='center',
verticalalignment='center', fontsize=12
)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(pd.to_datetime(cbar.get_ticks()).strftime(date_format='%b %Y'))
plt.show()
That looks like a pretty nice linear relationship when viewed on a log-log scale. As you can see from the different colors, points from around the same time tend to cluster together. The early 2018 points (purple) may be outliers.
Let's build a simple model to capture this relationship.
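Concretely, fitting ordinary least squares to the logged data amounts to assuming a power-law relationship: log(fee_rate) = a + b * log(volume), which is equivalent to fee_rate = e^a * volume^b. The slope b can be read (roughly) as the percentage change in fee rate for a 1% change in transaction volume.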
# Create our data series
X = np.log(block_space_daily_recent['TxTfrValUSD'])
X = sm.add_constant(X)
y = np.log(block_space_daily_recent['fees_usd_per_vbyte'])
# Fit our model
model = sm.OLS(y, X).fit()
# Let's make predictions for the model over a range of transaction volume values so we can plot the line
line_x = np.log(np.arange(1000000000, 1000000000000, 1000000000))
line_x = sm.add_constant(line_x)
line_y = model.get_prediction(line_x, transform=False)
# We'll also get the prediction & confidence intervals
prediction_interval = line_y.summary_frame(alpha=0.05)
line_y_mean = np.exp(prediction_interval['mean'])
# Confidence interval for the mean prediction (mean_ci_*) and prediction interval for individual observations (obs_ci_*)
line_y_lower_ci = np.exp(prediction_interval['mean_ci_lower'])
line_y_upper_ci = np.exp(prediction_interval['mean_ci_upper'])
line_y_lower_pi = np.exp(prediction_interval['obs_ci_lower'])
line_y_upper_pi = np.exp(prediction_interval['obs_ci_upper'])
# Print out the statistics
model.summary()
Let's look at the fit and the residuals.
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(model, 'TxTfrValUSD', fig=fig)