A predictive algorithm for cryptocurrency market making

  1. Introduction
  2. Input parameters, data, and dependencies
  3. Prediction of fair value
  4. Spread estimation
  5. Conclusions
  6. References

Introduction ¶

This document details a procedure to investigate whether the fair value of a cryptocurrency can be predicted from its limit order book data and the corresponding market data (trades).

Hypotheses and assumptions¶

Historical limit order-book (LOB) probability distribution¶

The probability that a limit order at depth $D$ of the order book is executed is assumed to follow a Poisson distribution.
Under this distribution, the probability of a limit order at $D > 10$ being hit is negligible (p = 0.00045). We will therefore consider only the first 10 levels of the historical LOBs, which also has the benefit of speeding up the calculations.
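
As a quick sanity check of this assumption, the tail probability beyond level 10 can be computed from the Poisson survival function. The sketch below is illustrative only: the rate parameter lambda_depth is an assumed value, not the one underlying the figure quoted above.

from scipy.stats import poisson

# Assumed average depth at which incoming market orders execute (illustrative value)
lambda_depth = 3
# Probability that an order deeper than level 10 is hit, P(D > 10)
p_beyond_10 = poisson.sf(10, mu=lambda_depth)
print(f"P(execution at depth > 10) = {p_beyond_10:.5f}")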

Definition of fair value¶

The instantaneous fair value is assumed to be the arithmetic average of the bid and ask price at the top of the LOB (level 0), usually referred to as mid-price.
The predicted fair value is based on the same convention. Since the fair value is the variable we want to predict, it will equivalently be referred to as "target" in the rest of this document.

Frequency of historical data¶

The frequency of the historical data available is not constant, that is, the time delta between $LOB_i$ and $LOB_{i-1}$ varies.
Although it is possible to resample the historical data to obtain a constant time delta, doing so has two undesired effects:

  1. the dataset to analyse grows significantly in size, especially at short resampling intervals (which are desirable to preserve the information present in the original dataset)
  2. the resampled dataset contains many duplicate entries (the shorter the resampling interval, the more duplicates), which make the subsequent calculations numerically unstable

We also assume that real-time data will not arrive at constant intervals. For these reasons resampling is disabled.

Methodology¶

Fair value prediction¶

The target variable we will try to predict is the average percent change of the mid-price $N$ time steps ahead (with $N$ being configurable).
According to the available literature ([2], [3]), there might be a relationship, possibly linear, between the target (the future mid-price) and the independent variables (the prices and volumes at each level of the historical LOBs). We will therefore first search for a linear relationship between the target and the independent variables, and then explore possible non-linear relationships.
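
As a concrete illustration of the target definition, a minimal sketch is shown below. The column names (bid_price_0, ask_price_0) are hypothetical; in this notebook the target is actually built later from parsed_ob['mids'] and utils.calculate_pct_changes.

import pandas as pd

def make_target(df: pd.DataFrame, n_steps: int) -> pd.Series:
    # Mid-price at the top of the book (level 0)
    mid = (df['bid_price_0'] + df['ask_price_0']) / 2.
    # Percent change of the mid-price n_steps ahead, aligned to the current row
    return (mid.shift(-n_steps) / mid - 1.) * 100.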

Spread estimation¶

In order to provide quotes around the predicted fair value we need to estimate the spread. The bid and ask quotes are calculated as:

\begin{align} bid &= fair - spread/2 \tag{1} \\ ask &= fair + spread/2 \tag{2} \end{align}

The spread is calculated as suggested in Avellaneda and Stoikov [1]:

\begin{equation} spread = (T - t) \cdot \gamma \cdot \sigma^2 + \frac{2}{\gamma} \ln\left(1 + \frac{\gamma}{k}\right) \tag{3} \end{equation}
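
A direct implementation of equation (3) could look like the sketch below; the helper used later in this notebook (utils.calculate_spread) is assumed to follow the same formula, so this function is illustrative only. The sigma argument can be a scalar or a rolling volatility series, as in the Spread estimation section below.

import numpy as np

def avellaneda_stoikov_spread(t, T=1.0, gamma=0.1, sigma=0.0, k=1.5):
    # spread = (T - t) * gamma * sigma^2 + (2 / gamma) * ln(1 + gamma / k)
    return (T - t) * gamma * sigma**2 + np.log(1.0 + gamma / k) * 2.0 / gamma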

Input parameters, data, and dependencies ¶

Group all user input parameters

In [ ]:
## Maximum depth of the orderbook to consider. 
## Orders at levels greater than maxdepth will be discarded
## Use -1 to use all levels
maxdepth_parse=10
maxdepth_flatten=10

## Frequency to resample the dataframes (seconds).
## Use -1 to skip resampling
resample_freq = -1 # int(np.floor(pd.Series(ob.index).diff().describe()['25%'].total_seconds()))

## Weight the prices and volumes of each LOB level. See function calculate_weights()
orderbook_depth_weights = 'none'

## The quantity we want to predict
target_side = 'mids'

## Predict the target N steps ahead.
## If it is an array with more than one element, populate the dataframe with a column for each element
## The prediction is for the last element of the array
n_steps_future = [0,5]

## The last date of the dataset we want to analyse
## Leave empty to use the whole dataset
last_date = ''

## Merge trades with order books
merge_ob_trades = False

## Parameters for spread estimation
gamma=0.1
kappa=1.5
std_window = 1000

Load libraries

In [ ]:
import gc
import json
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

import numpy as np
import pandas as pd
pd.set_option('precision',12)
pd.options.plotting.backend = 'matplotlib'
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

import xgboost
from sklearn.utils import class_weight
from sklearn import model_selection
import sklearn.linear_model as linear_model
import sklearn.ensemble as ensemble
import sklearn.feature_selection as feature_selection
import sklearn.metrics as metrics
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing
import statsmodels.api as sm

import dateutil
import importlib

import seaborn as sns
%config InlineBackend.figure_formats = ['png']
import matplotlib.pyplot as plt
ggplot_styles = {
    'axes.edgecolor': 'white',
    'axes.facecolor': 'white',
    'axes.grid': True,
    'axes.grid.which': 'both',
    'axes.spines.left': False,
    'axes.spines.right': False,
    'axes.spines.top': False,
    'axes.spines.bottom': False,
    'grid.color': 'lightgrey',
    'grid.linewidth': '0.5',
    'xtick.color': '555555',
    'xtick.major.bottom': True,
    'xtick.minor.bottom': False,
    'ytick.color': '555555',
    'ytick.major.left': True,
    'ytick.minor.left': False,
    'figure.figsize': [8,6]
}
plt.rcParams.update(ggplot_styles)

import utilities as utils

Load the data from the CSV files and serialize it so it can be retrieved more efficiently. This step is needed only once

In [ ]:
# utils.load_data()
In [ ]:
ob = pd.read_pickle('orderbook.bz2')
trades = pd.read_pickle('trades.bz2')

Remove duplicates

In [ ]:
if merge_ob_trades:
    ob = ob.drop_duplicates('lastUpdated')
    trades = trades.drop_duplicates('timestamp')

Make sure that dates are handled as such (datetime format in Pandas)

In [ ]:
trades['timestamp'] = pd.to_datetime(trades['timestamp'])
ob['lastUpdated'] = pd.to_datetime(ob['lastUpdated'])
trades = trades.set_index('timestamp')
ob = ob.set_index('lastUpdated')

Optionally limit the amount of data to process

In [ ]:
if last_date: ob = ob[ob.index < dateutil.parser.parse(last_date)]

Visual inspection of imported data

In [ ]:
ob
Out[ ]:
asks bids
lastUpdated
2020-07-17 00:00:23.225 [[2.644e-05, 5243], [2.645e-05, 4216], [2.646e... [[2.639e-05, 4216], [2.638e-05, 7000], [2.637e...
2020-07-17 00:00:30.293 [[2.645e-05, 4216], [2.646e-05, 763], [2.647e-... [[2.638e-05, 4989], [2.637e-05, 2427], [2.6360...
2020-07-17 00:00:30.429 [[2.645e-05, 4216], [2.646e-05, 763], [2.647e-... [[2.639e-05, 4216], [2.638e-05, 4989], [2.637e...
2020-07-17 00:01:04.450 [[2.644e-05, 11], [2.646e-05, 4978], [2.647e-0... [[2.639e-05, 4216], [2.638e-05, 6689], [2.637e...
2020-07-17 00:01:10.753 [[2.643e-05, 7], [2.645e-05, 3982], [2.646e-05... [[2.639e-05, 4216], [2.638e-05, 6689], [2.637e...
... ... ...
2020-07-30 23:58:32.329 [[2.7060000000000002e-05, 7396], [2.707e-05, 2... [[2.703e-05, 495], [2.7020000000000002e-05, 28...
2020-07-30 23:58:38.233 [[2.709e-05, 46], [2.7099999999999998e-05, 275... [[2.703e-05, 3251], [2.7020000000000002e-05, 1...
2020-07-30 23:58:44.335 [[2.707e-05, 7395], [2.7099999999999998e-05, 7... [[2.703e-05, 3251], [2.7020000000000002e-05, 1...
2020-07-30 23:58:44.438 [[2.7099999999999998e-05, 54], [2.711e-05, 255... [[2.703e-05, 3251], [2.7020000000000002e-05, 1...
2020-07-30 23:59:01.545 [[2.705e-05, 2758], [2.7060000000000002e-05, 6... [[2.703e-05, 131], [2.7020000000000002e-05, 10...

96858 rows × 2 columns

In [ ]:
trades
Out[ ]:
number id price side amount
timestamp
2020-07-17 00:01:10.647 0 19088200 0.00002643 buy 4.0
2020-07-17 00:01:26.820 1 19088201 0.00002640 sell 58.0
2020-07-17 00:01:26.825 2 19088202 0.00002640 sell 25.0
2020-07-17 00:01:26.841 3 19088203 0.00002640 sell 235.0
2020-07-17 00:01:26.842 4 19088204 0.00002640 sell 36.0
... ... ... ... ... ...
2020-07-30 23:54:05.293 96498 19187367 0.00002707 sell 181.0
2020-07-30 23:57:25.885 96499 19187368 0.00002709 buy 167.0
2020-07-30 23:58:00.690 96500 19187369 0.00002706 sell 21.0
2020-07-30 23:58:00.697 96501 19187370 0.00002705 sell 43.0
2020-07-30 23:58:36.580 96502 19187371 0.00002710 buy 1.0

96503 rows × 5 columns

In [ ]:
trades[['price','amount']].diff().describe()
Out[ ]:
price amount
count 96502.000000000000 96502.000000000000
mean 0.000000000007 -0.000031087439
std 0.000000030751 3320.324964702130
min -0.000000990000 -192662.000000000000
25% -0.000000010000 -261.750000000000
50% 0.000000000000 0.000000000000
75% 0.000000010000 236.000000000000
max 0.000000960000 192641.000000000000
In [ ]:
pd.Series(ob.index).diff().describe()
Out[ ]:
count                        96857
mean     0 days 00:00:12.487670689
std      0 days 00:00:23.259605382
min         0 days 00:00:00.012000
25%         0 days 00:00:00.700000
50%         0 days 00:00:07.103000
75%         0 days 00:00:13.707000
max         0 days 00:12:44.853000
Name: lastUpdated, dtype: object
In [ ]:
if merge_ob_trades:
    ob_backup = ob.copy()
    trades_backup = trades.copy()
    trades = trades.reindex(ob.index, method='bfill')
    ob['market_price'] = trades['price']
    ob['market_volume'] = trades['amount']
    ob['market_side'] = trades['side'].apply(lambda x: 1 if x=='buy' else 0)

Prediction of fair value ¶

Data transformation¶

Parse order books¶

Parse the order books so that each one is stored as a Pandas DataFrame and not as a string

In [ ]:
parsed_ob = utils.parse_all_books(ob, maxdepth_parse=maxdepth_parse)
del(ob); gc.collect()
Out[ ]:
0

Resample dataframes¶

Optionally resample the dataframes to a common frequency

In [ ]:
if resample_freq > 0:
    parsed_ob = parsed_ob.resample(f'{resample_freq}s').bfill()
    #trades = trades.resample(f'{resample_freq}s').bfill()
parsed_ob
Out[ ]:
asks bids
lastUpdated
2020-07-17 00:00:23.225 price volume 0 0.00002644 5243 1 ... price volume 0 0.00002639 4216 1 ...
2020-07-17 00:00:30.293 price volume 0 0.00002645 4216 1 ... price volume 0 0.00002638 4989 1 ...
2020-07-17 00:00:30.429 price volume 0 0.00002645 4216 1 ... price volume 0 0.00002639 4216 1 ...
2020-07-17 00:01:04.450 price volume 0 0.00002644 11 1 ... price volume 0 0.00002639 4216 1 ...
2020-07-17 00:01:10.753 price volume 0 0.00002643 7 1 ... price volume 0 0.00002639 4216 1 ...
... ... ...
2020-07-30 23:58:32.329 price volume 0 0.00002706 7396 1 ... price volume 0 0.00002703 495 1 ...
2020-07-30 23:58:38.233 price volume 0 0.00002709 46 1 ... price volume 0 0.00002703 3251 1 ...
2020-07-30 23:58:44.335 price volume 0 0.00002707 7395 1 ... price volume 0 0.00002703 3251 1 ...
2020-07-30 23:58:44.438 price volume 0 0.00002710 54 1 ... price volume 0 0.00002703 3251 1 ...
2020-07-30 23:59:01.545 price volume 0 0.00002705 2758 1 ... price volume 0 0.00002703 131 1 ...

96858 rows × 2 columns

Visually inspect the data¶

Market price vs time¶

In [ ]:
trades.plot(title='Market price', y='price');

Volume vs LOB depth¶

Calculate and plot the "average" order book

In [ ]:
avg_ob = parsed_ob.sum()/parsed_ob.count()
In [ ]:
side = 'bids'
book = avg_ob[side]

weights = utils.calculate_weights(book)
(book['volume']*weights).plot.bar(title='Side: bid', xlabel='LOB depth', ylabel='Volume');
In [ ]:
side = 'asks'
book = avg_ob[side]
(book['volume']*weights).plot.bar(title='Side: ask', xlabel='LOB depth', ylabel='Volume');
In [ ]:
((avg_ob['asks']-avg_ob['bids'])['volume']*weights).plot.bar(title='Ask minus Bid', xlabel='LOB depth', ylabel='Volume');

Feature selection¶

We now derive new features from the raw data, as they could uncover correlations or patterns that help forecast the fair value. The aim is to find a relationship linking the future fair value $N$ steps ahead to the historical LOB data available:

$fair_{t+N} = f(LOB_{t}, LOB_{t-1}, ...)$

Create new features for each order book¶

We now calculate:

  1. the mid-price and its percent changes (appropriately shifted) in order to be able to use them as target variable
  2. the ask/bid ratio for both price and volume
  3. the imbalance ratio for both price and volume. The imbalance ratio is defined as $(x_{ask} - x_{bid})/x_{mid}$
In [ ]:
parsed_ob['mids'] = (parsed_ob['asks'] + parsed_ob['bids'])/2.
parsed_ob = utils.calculate_pct_changes(parsed_ob, target_side, n_steps_future, keyword='_pct_')

spread = parsed_ob['asks'] - parsed_ob['bids']
parsed_ob['askbid_ratio'] = parsed_ob['asks']/parsed_ob['bids'] - 1
parsed_ob['imbalance_ratio'] = spread/parsed_ob['mids']

Calculate descriptive statistics for each feature¶

Calculate the weighted mean of price and volume across the levels of the LOB.

In [ ]:
for metric in ['askbid_ratio','imbalance_ratio']:
    for col in ['price','volume']:
        parsed_ob[f'{metric}_{col}_weighted_mean'] = parsed_ob[metric].apply( lambda book: 
            sum(book[col] * utils.calculate_weights(book,type=orderbook_depth_weights)) 
            / sum(utils.calculate_weights(book,type=orderbook_depth_weights)) )

Specify targets¶

Specify the targets of the analysis. If our hypothesis is correct we should find a linear relationship between the target and the (historical) features $f_i$:

$target_{t+N} = [c + a_1 \cdot f_1 + a_2 \cdot f_2 + ... + a_n \cdot f_n]_{t} + [c + a_1 \cdot f_1 + a_2 \cdot f_2 + ... + a_n \cdot f_n]_{t-1} + ...$

In [ ]:
targets = [t for t in parsed_ob.columns.values if '_pct_' in t and 'target' not in t]
for t in targets:
    parsed_ob['target_' + t] = parsed_ob[t]

Flatten order books¶

In order to carry out the analysis we need to "flatten" the order books so that each level of the order book is represented by a separate column. This way we can easily transform each feature to create new ones which might help to predict the target.
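
As an illustration, flattening a single side of one parsed book could look like the hypothetical helper below; the notebook itself uses utils.flatten_all_books, whose implementation is not shown here.

def flatten_book(book, side, maxdepth=10):
    # book is a DataFrame with 'price' and 'volume' columns indexed by LOB level
    flat = {}
    for level in range(min(maxdepth, len(book))):
        flat[f'{side}_price_level_{level}'] = book['price'].iloc[level]
        flat[f'{side}_volume_level_{level}'] = book['volume'].iloc[level]
    return flat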

In [ ]:
to_drop = targets + ['mids']
flat_parsed_ob = utils.flatten_all_books(parsed_ob, to_drop=to_drop, 
                                         maxdepth_flatten=maxdepth_flatten)
flat_parsed_ob
Flattening feature: asks
Flattening feature: bids
Flattening feature: askbid_ratio
Flattening feature: imbalance_ratio
Flattening feature: target_mids_pct_0
Flattening feature: target_mids_pct_5
Out[ ]:
lastUpdated askbid_ratio_price_weighted_mean askbid_ratio_volume_weighted_mean imbalance_ratio_price_weighted_mean imbalance_ratio_volume_weighted_mean asks_price_level_0 asks_price_level_1 asks_price_level_2 asks_price_level_3 asks_price_level_4 ... target_mids_pct_5_volume_level_0 target_mids_pct_5_volume_level_1 target_mids_pct_5_volume_level_2 target_mids_pct_5_volume_level_3 target_mids_pct_5_volume_level_4 target_mids_pct_5_volume_level_5 target_mids_pct_5_volume_level_6 target_mids_pct_5_volume_level_7 target_mids_pct_5_volume_level_8 target_mids_pct_5_volume_level_9
1 2020-07-17 00:00:30.293 0.006077951195 95.216377457823 0.006057164490 0.129558427753 0.00002645 0.00002646 0.00002647 0.00002648 0.00002649 ... -56.023900054318 900.564263322884 -67.376214715410 -78.312692598407 214.674130347861 -14.386775427705 3729.701230228471 -96.295870240423 -43.095939541437 332.422171602126
2 2020-07-17 00:00:30.429 0.005696064107 12.132931502401 0.005677517033 0.060698331709 0.00002645 0.00002646 0.00002647 0.00002648 0.00002649 ... -49.928842504744 39.481919332406 329.447446785045 -96.502218159930 1397.647058823530 -30.922277132405 1857.344064386318 -0.412532054856 855.312157721796 -93.554220381497
3 2020-07-17 00:01:04.450 0.005658170966 4.807230932065 0.005639702742 -0.269652793336 0.00002644 0.00002646 0.00002647 0.00002648 0.00002649 ... -69.245327655548 32.827633496186 44.568844870277 -20.173162537896 -49.673763736264 -48.751684744240 1746.981891348089 -26.446649570744 23.254160147916 -90.799882801055
4 2020-07-17 00:01:10.753 0.005278591847 4.644906072570 0.005262190319 -0.110178755809 0.00002643 0.00002645 0.00002646 0.00002647 0.00002648 ... -0.047359696898 5.725798894199 81.350438892640 0.000000000000 -67.269681742044 -30.181590482154 49.924204143507 1051.308900523560 -60.211557834967 -86.477174849268
5 2020-07-17 00:01:18.858 0.004897152943 2.013669986961 0.004882686142 -0.372028741073 0.00002643 0.00002645 0.00002646 0.00002647 0.00002648 ... 7.180570221753 63.979019272993 -29.339504056679 799.381387842926 -90.418124162281 -98.437500000000 33.296964251332 177.090909090909 -42.061745682470 -90.138190954774
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96848 2020-07-30 23:58:25.227 0.007229176642 13.649452908530 0.007199312138 -0.418270221678 0.00002710 0.00002712 0.00002713 0.00002717 0.00002718 ... -8.083867210250 -25.700687466949 129.524232992441 1216.624040920716 -76.126404147636 1161.637931034483 -83.653673432729 742.366554684584 -7.888446215139 -46.207515199841
96849 2020-07-30 23:58:25.328 0.006228613124 10.686884864106 0.006205477910 -0.138614138212 0.00002709 0.00002710 0.00002711 0.00002712 0.00002713 ... 13.886010362694 -66.522349637257 74.476722060129 -6.863442389758 -17.818930041152 0.000000000000 48.823836905384 0.000000000000 0.000000000000 -13.530349123569
96850 2020-07-30 23:58:25.428 0.005709750993 4.638994660619 0.005689828689 0.265893564738 0.00002708 0.00002709 0.00002710 0.00002711 0.00002712 ... 766.938110749186 168.767507002801 -30.515887183149 28.866655732327 -48.277648277648 -79.576008273009 42.849974912193 -3.040293040293 35.144237160456 -72.546223898445
96851 2020-07-30 23:58:26.225 0.006673414751 10.621061081675 0.006647136258 -0.156175782429 0.00002709 0.00002710 0.00002712 0.00002713 0.00002717 ... 14.162348877375 -68.921132693658 32.214830757190 -85.213864306785 0.000000000000 0.000000000000 51.038251366120 0.000000000000 -30.709323421390 -13.548012764723
96852 2020-07-30 23:58:31.628 0.006599422890 0.549903308992 0.006573281075 -0.504127771190 0.00002707 0.00002710 0.00002712 0.00002713 0.00002717 ... -11.162361623616 -65.487009515663 -4.101969872538 94.109904573873 -60.640508495294 4795.555555555556 5.101302460203 -10.931613197854 -63.781583255896 -7.248322147651

96852 rows × 125 columns

In [ ]:
flat_parsed_ob_backup = flat_parsed_ob.copy()
del(parsed_ob); gc.collect()
Out[ ]:
0
In [ ]:
flat_parsed_ob = flat_parsed_ob_backup.copy()

Create additional features¶

The following snippet creates new features from the original ones. The target might be related to one of the transformed features, and this could help improve the predictive power of the model we are going to fit later on.

  • Search for top of the book changes. If the top of the book bid (ask) prices have decreased (increased) an order might have been filled.
  • Calculate rolling statistics for each feature, specifically rolling mean and max/min ratio
  • Create new features by calculating the percentage change of the original features
  • Add lagged features
  • Add features powers
  • Add features logarithms
  • Add binarized features
In [ ]:
keywords = ['lastUpdated','target','rolling','lag','pctchange','traded','pow','loge','log10','bin']
flat_parsed_ob = utils.feature_engineering(flat_parsed_ob, keywords)
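
As an illustration of a few of the transformations listed above (lags, percent changes, rolling statistics), the hypothetical helper below operates on a single column; the suffixes mirror the column names visible in the flattened dataframe, but the actual logic lives in utils.feature_engineering.

def add_basic_features(df, col, lags=(1, 5), pct_windows=(1, 5), rolling_window=5):
    for n in lags:
        df[f'{col}_lag_{n}'] = df[col].shift(n)                     # lagged feature
    for n in pct_windows:
        df[f'{col}_pctchange_{n}'] = df[col].pct_change(n) * 100.   # percent change
    df[f'{col}_rolling_{rolling_window}_mean'] = df[col].rolling(rolling_window).mean()
    return df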

Clean data and check if there are any nulls left

In [ ]:
flat_parsed_ob = flat_parsed_ob.replace([np.inf, -np.inf], np.nan)
flat_parsed_ob = flat_parsed_ob.copy().dropna()
flat_parsed_ob.isnull().sum().sum()
Out[ ]:
0

Check types of columns to make sure subsequent calculations run efficiently

In [ ]:
flat_parsed_ob.dtypes.value_counts()
Out[ ]:
float64           1066
int64               42
datetime64[ns]       1
dtype: int64

Data analysis¶

We start by analysing the whole historical LOB dataset to get an understanding of the properties of the independent variables, in particular those that we expect to help predict the future fair value.

Analysing and fitting the entire dataset, which spans several trading days, might not provide significant insights into the actual forecasting potential of the historical data. Using two-week-old data to train a model intended to make predictions for the next hour will likely lead to poor results. There are in fact thousands of data rows for each trading day, each containing 400 data points (100 ask prices, 100 bid prices, 100 ask volumes, 100 bid volumes). It therefore makes sense to use only data that is, for example, at most one day old. Older data is likely no longer relevant, and therefore not useful for fitting a model aimed at predicting the fair price in the next 30 seconds.

For this reason the analysis of the whole dataset is only broadly indicative. We will later perform a rolling cross-validation of the models to get an understanding of their possible predictive power in a scenario similar to the one in which we would actually use them.

Stationarity and autocorrelation checks¶

The volume imbalance ratio has positive autocorrelation as shown in the charts below. The partial autocorrelation is significant approximately up to lag 15. This indicates that a positive (negative) imbalance is often followed by periods of persistent positive (negative) imbalance. The imbalance effect is therefore not a transient one, and could help in predicting the target.

In [ ]:
results = sm.tsa.stattools.adfuller(flat_parsed_ob['imbalance_ratio_volume_weighted_mean'])
message = "The series is non-stationary"
if results[1] < 0.05:
    for key, value in results[4].items():
        if abs(results[0]) < abs(value):
            message = f'The series is stationary at a significance level of {key}'
            break
print(message)
The series is non-stationary
In [ ]:
results = sm.tsa.stattools.kpss(flat_parsed_ob['imbalance_ratio_volume_weighted_mean'], nlags="auto")
message = "The series is non-stationary"
if results[1] < 0.05:
    for key, value in results[3].items():
        if abs(results[0]) < abs(value):
            message = f'The series is stationary at a significance level of {key}'
            break
print(message)
The series is stationary at a significance level of 2.5%
In [ ]:
_ = sm.graphics.tsa.plot_acf(flat_parsed_ob['imbalance_ratio_volume_weighted_mean'], lags=40)
In [ ]:
_ = sm.graphics.tsa.plot_pacf(flat_parsed_ob['imbalance_ratio_volume_weighted_mean'], lags=40)

Plot target vs features¶

Visually inspect the data to possibly find a relationship between the target and some features.

Target change vs Volume imbalance¶

The graph immediately below shows a weak relationship between the volume imbalance ratio and the instantaneous price change of the target variable. This relationship seems to get slightly stronger for longer prediction windows, as shown in the second chart: future top-of-the-book prices are more likely to be lower when the current ask volume exceeds the bid volume.

In [ ]:
_ = sns.regplot(x='imbalance_ratio_volume_weighted_mean', 
                y=f'target_{target_side}_pct_0_price_level_0',
                data=flat_parsed_ob, line_kws=dict(color='red'));
_.set_xlabel('Volume imbalance ratio (LOB mean)');
_.set_ylabel('Percent price change (one step ahead)');
In [ ]:
_ = sns.regplot(x='imbalance_ratio_volume_weighted_mean_rolling_5_mean', 
                y=f'target_{target_side}_pct_{n_steps_future[-1]}_price_level_0',
                data=flat_parsed_ob, line_kws=dict(color='red'));
_.set_xlabel('Volume imbalance ratio (LOB mean), 5 steps rolling average');
_.set_ylabel(f'Percent price change ({n_steps_future[-1]} step ahead)');

Target change vs Ask bid ratio¶

The chart below shows that the future mid-price change is on average lower when ask volumes are significantly higher than bid volumes.

In [ ]:
_ = sns.regplot(x='askbid_ratio_volume_weighted_mean', 
                y=f'target_{target_side}_pct_5_price_level_0',
                data=flat_parsed_ob, line_kws=dict(color='red'));
_.set_xlabel('Volume ask/bid ratio (LOB mean)');
_.set_ylabel(f'Percent price change ({n_steps_future[-1]} step ahead)');

Spread vs Volume imbalance¶

The chart below shows that for higher values of the bid-ask spread both the volume imbalance ratio and the liquidity are on average smaller (note the fewer data points at higher spreads).

In [ ]:
flat_parsed_ob.plot.scatter(x='imbalance_ratio_volume_weighted_mean', y='imbalance_ratio_price_level_0', 
                            xlabel='Volume imbalance ratio (LOB mean)', ylabel='Bid-ask spread as percentage of mid-price');

Model selection¶

Investigate which models could possibly describe the relationship between the current order book and the future top of the book mid-price.

Prepare data for the analysis¶

Split the data into training and test sets.

In [ ]:
X_train, X_test, y_train, y_test = utils.prepare_data(flat_parsed_ob, 
                                                      targets=[f'target_{target_side}_pct_{n_steps_future[-1]}_price_level_0'],
                                                      filter_fun=None)

Linear regression¶

Check if there are any features that have a linear relationship with the target.

In [ ]:
model = sm.OLS(y_train, X_train)
results = model.fit()
results_summary = results.summary()

results_as_html = results_summary.tables[0].as_html()
pd.read_html(results_as_html, header=0, index_col=0)[0]
Out[ ]:
Dep. Variable: y R-squared: 0.132
Model: OLS Adj. R-squared: 0.123
Method: Least Squares F-statistic: 15.820
Date: Sat, 09 Apr 2022 Prob (F-statistic): 0.000
Time: 17:49:00 Log-Likelihood: 52052.000
No. Observations: 59949 AIC: -103000.000
Df Residuals: 59379 BIC: -97830.000
Df Model: 569 NaN NaN
Covariance Type: nonrobust NaN NaN
In [ ]:
results_as_html = results_summary.tables[1].as_html()
pd.read_html(results_as_html, header=0, index_col=0)[0].sort_values(by='P>|t|')
Out[ ]:
coef std err t P>|t| [0.025 0.975]
const 48.8247 7.321 6.669 0.000 34.475 63.175
imbalance_ratio_volume_level_8_lag_8 0.0045 0.001 5.602 0.000 0.003 0.006
imbalance_ratio_volume_level_8_lag_9 0.0044 0.001 5.533 0.000 0.003 0.006
askbid_ratio_price_level_7_pctchange_61 161.5078 39.609 4.078 0.000 83.875 239.141
imbalance_ratio_volume_level_9_lag_1 0.0218 0.004 5.854 0.000 0.014 0.029
... ... ... ... ... ... ...
askbid_ratio_price_level_2_pctchange_201 0.1747 13.462 0.013 0.990 -26.212 26.561
imbalance_ratio_price_level_5_lag_4 0.9421 83.598 0.011 0.991 -162.909 164.794
imbalance_ratio_price_level_6_lag_3 -0.3350 91.093 -0.004 0.997 -178.877 178.207
askbid_ratio_price_level_3_pctchange_81 -0.0549 19.018 -0.003 0.998 -37.330 37.220
imbalance_ratio_price_level_3_pctchange_81 0.0265 19.064 0.001 0.999 -37.339 37.392

1069 rows × 6 columns

Residual Analysis¶

The residuals should ideally:

  1. have zero mean
  2. be Normally distributed
  3. exhibit no autocorrelation

While condition #1 is approximately satisfied, this is not the case for #2 and #3.

Both the Kolmogorov-Smirnov test and the QQ plot show that condition #2 is not satisfied. The residuals are nonetheless approximately distributed symmetrically around the mean.

The Durbin-Watson test, as well as the ACF and PACF charts, show that the residuals are positively autocorrelated: condition #3 is not satisfied.

In [ ]:
predictions = results.predict(X_test)
prediction_error = y_test - predictions
pd.DataFrame(prediction_error.describe())
Out[ ]:
0
count 23194.000000000000
mean 0.002352179778
std 0.219006116152
min -11.074054726947
25% -0.067882855451
50% 0.000787590469
75% 0.069043484067
max 3.727768332582
In [ ]:
_, pvalue = sm.stats.diagnostic.kstest_normal(prediction_error)
if pvalue < 0.05: print("The residuals are not Normally distributed; p-value: " + str(pvalue))
The residuals are not Normally distributed; p-value: 0.0009999999999998899
In [ ]:
dw = sm.stats.stattools.durbin_watson(prediction_error, axis=0)
if dw < 1: print("The residuals are positively autocorrelated. Durbin-Watson statistic: " + str(dw))
if dw > 3: print("The residuals are negatively autocorrelated. Durbin-Watson statistic: " + str(dw))
The residuals are positively autocorrelated. Durbin-Watson statistic: 0.571941507029963
In [ ]:
prediction_error.plot(kind='hist',bins=1000, title="Prediction error distribution");
In [ ]:
_ = sm.qqplot(prediction_error, marker='.');
In [ ]:
x = range(0,len(y_test))
fig, ax = plt.subplots()
ax.plot(x,y_test,label='actual');
ax.plot(x,predictions,c='r',label='predicted');
legend = ax.legend(loc='upper left')
plt.show();
In [ ]:
_ = sm.graphics.tsa.plot_acf(prediction_error, lags=40)
In [ ]:
_ = sm.graphics.tsa.plot_pacf(prediction_error, lags=40)

Rolling cross validation¶

The cross validation setup and the results using a linear regression model are shown below. The cross validation is configured so that the test dataset (orange) is always about $1/5$ of the training dataset (blue). The R-squared and MSE are evaluated for the train set and the test set. The descriptive statistics are shown in the table below, while the rolling R-squared for both the train set and the test set are shown in the subsequent figure.

The test-set R-squared values obtained from the cross-validation process seem to confirm that the predicted values are positively related to the actual values. This holds for most of the days in the historical dataset.

In [ ]:
X_train, X_test, y_train, y_test = utils.prepare_data(flat_parsed_ob, 
                                                      targets=[f'target_{target_side}_pct_{n_steps_future[-1]}_price_level_0'], 
                                                      filter_fun=None, test_size=0.)
n_days = flat_parsed_ob["lastUpdated"].map(pd.Timestamp.date).unique().size
max_train_size = int(np.floor(len(X_train)/n_days))
test_size = int(max_train_size/5)
n_splits = int(len(X_train)/test_size)-1
kfold = model_selection.TimeSeriesSplit(n_splits=n_splits,
                                        test_size=test_size,
                                        max_train_size=max_train_size
                                        )
fig, ax = plt.subplots()
utils.plot_cv_indices(kfold, X_train, y_train, ax);
In [ ]:
X_train, X_test, y_train, y_test = utils.prepare_data(flat_parsed_ob, 
                                                      targets=[f'target_{target_side}_pct_{n_steps_future[-1]}_price_level_0'], 
                                                      filter_fun=None, test_size=0.)
scoring = ['r2', 'neg_mean_squared_error']
n_days = flat_parsed_ob["lastUpdated"].map(pd.Timestamp.date).unique().size
scaler = preprocessing.RobustScaler()#MinMaxScaler()
selector = feature_selection.SelectKBest(feature_selection.f_regression, k=20)
models = [('Ridge', pipeline.make_pipeline(scaler, selector, linear_model.Ridge(fit_intercept=False)))]
cv_results, fitted = utils.test_models(X_train, y_train, X_test, y_test, mode='cv',
                                       models=models, scoring=scoring, kfold=kfold)
cv_results.describe()
Fitting model: Ridge
Out[ ]:
fit_time score_time test_r2 train_r2 test_neg_mean_squared_error train_neg_mean_squared_error
count 69.000000000000 69.000000000000 69.000000000000 69.000000000000 69.000000000000 69.000000000000
mean 0.333873848984 0.012919432875 0.028822254092 0.070229812273 -0.016471403460 -0.015131589037
std 0.018144384567 0.001662836451 0.059078263947 0.034619643888 0.023563365223 0.013993361944
min 0.223081588745 0.012141942978 -0.131451745510 0.031828395760 -0.175018441188 -0.052642739588
25% 0.333499431610 0.012435436249 -0.006585886034 0.051146566804 -0.019409458100 -0.022903501645
50% 0.335736989975 0.012549638748 0.027211790901 0.057035963721 -0.008815412852 -0.008847181340
75% 0.339479684830 0.012677192688 0.061877578638 0.067794161364 -0.004129877889 -0.004584193482
max 0.360774993896 0.025242328644 0.183978750468 0.177001109424 -0.002426292679 -0.002778878809
In [ ]:
for m in set(cv_results['model']):
    cv_results[cv_results['model']==m][['test_r2','train_r2']].plot(title='Model: ' + m);

Accuracy of the prediction of the linear regression¶

To get a more concrete understanding of the actual capability of the linear regression model to predict future values, we calculate an accuracy score for the linear regression as follows: if the actual value and the predicted value have the same sign the prediction is counted as correct, and incorrect otherwise. The number of correct predictions as a percentage of the total is then used as the score. A score of 0.5 means that the prediction is no better than random, while a score above 0.5 indicates some predictive power.
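
The scorer referenced in the cell below (utils.binarize) is not listed in this document; a minimal sketch consistent with the description above could be:

import numpy as np

def sign_accuracy(y_true, y_pred):
    # Fraction of predictions whose sign agrees with the sign of the actual value
    return np.mean(np.sign(y_true) == np.sign(y_pred))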

As shown in the table and in the figure below, the accuracy is greater than 0.5 for most of the days in the historical dataset.

In [ ]:
X_train, X_test, y_train, y_test = utils.prepare_data(flat_parsed_ob, 
                                                      targets=[f'target_{target_side}_pct_{n_steps_future[-1]}_price_level_0'],
                                                      filter_fun=None, test_size=0)
scoring = metrics.make_scorer(utils.binarize, greater_is_better=True)
n_days = flat_parsed_ob["lastUpdated"].map(pd.Timestamp.date).unique().size
scaler = preprocessing.RobustScaler()
selector = feature_selection.SelectKBest(feature_selection.f_regression, k=30)
models = [('Ridge', pipeline.make_pipeline(scaler, selector, linear_model.Ridge(fit_intercept=False)))]
cv_results, fitted = utils.test_models(X_train, y_train, X_test, y_test, models=models, scoring=scoring, kfold=kfold)
cv_results.describe()
Fitting model: Ridge
Out[ ]:
fit_time score_time test_score train_score
count 69.000000000000 69.000000000000 69.000000000000 69.000000000000
mean 0.335588002550 0.013615846634 0.577432354634 0.591096794227
std 0.020117858132 0.002543659147 0.045609588068 0.028600052101
min 0.216948032379 0.010088682175 0.464594127807 0.512001381454
25% 0.333982229233 0.012809753418 0.548359240069 0.579174581247
50% 0.338042259216 0.012974500656 0.577720207254 0.590398894837
75% 0.343518018723 0.013219118118 0.603626943005 0.607839751338
max 0.360073804855 0.028847694397 0.674438687392 0.657226731135
In [ ]:
for m in set(cv_results['model']):
    cv_results[cv_results['model']==m][['test_score','train_score']].plot(title='Model: ' + m);

Testing of non-linear models¶

We now perform a rolling cross-validation of other non-linear models. Before fitting a model we first scale the data, and then discard the features which are weakly correlated with the target according to the p-values provided by a linear regression. We then perform the rolling cross validation using only the remaining features.

As shown in the table below, the non-linear models perform worse than the linear regression. This suggests that the relationship between the independent variables and the target variable is predominantly linear. The non-linear models are likely overfitting the past data (high R-squared on the train data), and therefore perform poorly on unseen data (low R-squared on the test data).

In [ ]:
X_train, X_test, y_train, y_test = utils.prepare_data(flat_parsed_ob, 
                                                      targets=[f'target_{target_side}_pct_{n_steps_future[-1]}_price_level_0'],
                                                      filter_fun=None, test_size=0)
scoring = ['r2', 'neg_mean_squared_error']
n_days = flat_parsed_ob["lastUpdated"].map(pd.Timestamp.date).unique().size
scaler = preprocessing.MaxAbsScaler()
selector = feature_selection.SelectKBest(feature_selection.f_regression, k=30)
models = [
       ('RF',   pipeline.make_pipeline(scaler, selector, ensemble.RandomForestRegressor(n_jobs=-1))),
       ('GBR', pipeline.make_pipeline(scaler, selector, ensemble.GradientBoostingRegressor())),
       ('XGB', pipeline.make_pipeline(scaler, selector, xgboost.XGBRegressor()))
    ]
cv_results, fitted = utils.test_models(X_train, y_train, X_test, y_test, models=models, scoring=scoring, kfold=kfold)
cv_results.groupby('model').agg([np.mean, np.std]).drop(['fit_time', 'score_time'],axis=1)
Fitting model: RF
Fitting model: GBR
Fitting model: XGB
Out[ ]:
test_r2 train_r2 test_neg_mean_squared_error train_neg_mean_squared_error
mean std mean std mean std mean std
model
GBR -0.035045437279 0.126602822863 0.275395881971 0.095452032435 -0.017260935830 0.023765452648 -0.011328787795 0.010029123550
RF -0.081334531733 0.161601914501 0.860985559525 0.097174713448 -0.017709898251 0.023942569538 -0.002087620088 0.001902349882
XGB -0.228866571154 0.239549250860 0.810262931564 0.122869576106 -0.019637913009 0.024934347772 -0.002763408650 0.002724918205
In [ ]:
for m in set(cv_results['model']):
    cv_results[cv_results['model']==m][['test_r2','train_r2']].plot(title='Model: ' + m);

Logistic regression as benchmark¶

Logistic regression is not suitable for our end goal, as we need to predict a continuous value, not a discrete one. We can nonetheless run a rolling cross-validation using a logistic regression and compare its results to the accuracy of the linear regression's predictions. To do so we transform the target variable into a discrete one by keeping only its sign: positive values are mapped to +1 and negative values to -1.
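
A minimal sketch of this discretisation is shown below; the actual transformation used in the next cell is utils.discretify, whose implementation is not reproduced here, and the treatment of zero changes is an assumption.

import numpy as np
import pandas as pd

def to_sign(y: pd.Series) -> pd.Series:
    # Map the continuous target to +1 (positive change) or -1 (non-positive change).
    # Mapping zeros to -1 is an assumption; the text does not specify their treatment.
    return pd.Series(np.where(y > 0, 1, -1), index=y.index)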

As shown in the table below, the accuracy of the prediction of the logistic regression is comparable to the accuracy of the "binarized" linear regression. This increases our confidence in the validity of the results of the cross validation of the linear regression model.

In [ ]:
X_train, X_test, y_train, y_test = utils.prepare_data(flat_parsed_ob, 
                                                      targets=[f'target_{target_side}_pct_{n_steps_future[-1]}_price_level_0'],
                                                      filter_fun=utils.discretify, add_constant=False, test_size=0)
scoring = ['accuracy']
scaler = preprocessing.MaxAbsScaler()
selector = feature_selection.SelectKBest(feature_selection.f_classif, k=30)
models = [('LogReg', pipeline.make_pipeline(scaler, selector, linear_model.LogisticRegression(max_iter=1000)))]
cv_results, fitted = utils.test_models(X_train, y_train, X_test, y_test, 
                                       type='classification', models=models,
                                       scoring=scoring, kfold=kfold)
cv_results.describe()
Filtering target
Fitting model: LogReg
Out[ ]:
fit_time score_time test_accuracy train_accuracy
count 69.000000000000 69.000000000000 69.000000000000 69.000000000000
mean 0.169621491778 0.011826076369 0.589784986609 0.607050206462
std 0.027328752996 0.000737363203 0.034070475403 0.019191981563
min 0.047887563705 0.011236190796 0.504317789292 0.574339492316
25% 0.153435707092 0.011511325836 0.570811744387 0.594178082192
50% 0.170368194580 0.011587858200 0.589810017271 0.604904161630
75% 0.187912702560 0.011782407761 0.608808290155 0.615783111725
max 0.226779699326 0.015849828720 0.671848013817 0.666551545502
In [ ]:
for m in set(cv_results['model']):
    cv_results[cv_results['model']==m][['test_accuracy','train_accuracy']].plot(title='Model: ' + m);

Spread estimation ¶

The spread is estimated as suggested in Avellaneda and Stoikov [1].

The charts below clearly show that the spread increases with market volatility. The remaining parameters ($\gamma, \kappa$) can be used to tune the amount of risk we are willing to take. The charts below are produced using the parameter values suggested in the original paper.

In [ ]:
std_window=200
prices = trades.drop_duplicates()['price']
std = prices.pct_change().rolling(std_window).std().fillna(0)
prices = prices[std>0]
std = std[std>0]
std.plot(title="Standard deviation");
In [ ]:
spread = utils.calculate_spread(0, T=1, gamma=gamma, sigma=std, k=kappa)
spread.plot(title="Spread");
In [ ]:
spread = ((spread - spread.min())/(spread.max()-spread.min())).fillna(0)
spread = spread*prices
spread.plot(title="Scaled spread");

Visual inspection of the market, bid, and ask prices¶

In [ ]:
ask = prices + spread/2
bid = prices - spread/2
fig, ax = plt.subplots()
ax.plot(prices,linestyle=':',color='lightgrey',label='price');
ax.plot(ask,':r',label='ask');
ax.plot(bid,':g',label='bid');
legend = ax.legend(loc='upper left')
plt.show();
In [ ]:
fig, ax = plt.subplots()
ax.plot(prices[-1000:],linestyle=':',color='lightgrey', label='price');
ax.plot(ask[-1000:],'-r',label='ask');
ax.plot(bid[-1000:],'-g',label='bid');
legend = ax.legend(loc='upper left')
plt.show();

Conclusions ¶

The analysis has confirmed our initial hypothesis of a linear relationship between the target variable (fair price $N$ steps ahead) and the independent variables (historical prices and volumes in the first 10 levels of the order book). The results of the cross validation indicate that such a relationship could be exploited effectively to forecast the fair price and to provide quotes around it.

An example implementation of the entire predictive algorithm described in this document is provided in the module quoter.py attached to this document. The Quoter class contained in the module requires the historical LOB and trades data to fit the model. Once the Quoter is initialized, it is ready to predict future values in an "online" fashion. Since the predictive model is based on lagged features as well as on features' percent changes, the prediction requires a number of past LOBs at least equal to the maximum lag/percent change used when training the model.

Further work¶

Although the results of the analysis are encouraging, additional work is necessary before considering deploying the predictive algorithm to production:

  1. A back-test should be performed to simulate the hypothetical P&L of a strategy using the predictive values calculated as described in this document
  2. Additional analyses of the outliers should be performed before fitting the model
  3. Additional analyses of the residuals should be performed to possibly address their positive autocorrelation and non-Normality
  4. Trades data should be used as additional features
  5. A rolling analysis of the residuals of the linear regression should be performed
  6. Non-linear models could be further investigated using a grid search (with cross validation) to understand if the overfitting problem highlighted by the results in this document could be mitigated
  7. A dimensionality analysis (for example PCA) could be performed to understand if the number of independent variables could be reduced further
  8. A grid search could be performed to possibly find the best forecast window (i.e., the optimal $N$ steps ahead) to employ in production
  9. The spread estimation model parameters should be calibrated according to both the desired risk level and the historical data
  10. The spread estimation model might be improved by using a volatility model (for example EWMA, GARCH, EGARCH) instead of the historical volatility
  11. All the functions used to transform and manipulate the data should be optimized to minimize execution time
  12. Appropriate logging should be put in place to catch unexpected behaviour

References ¶

  1. Avellaneda M., Stoikov S. High frequency trading in a limit order book. Quantitative Finance, Vol. 8, No. 3, April 2008
  2. Chordia T., Roll R., Subrahmanyam A. Order Imbalance, Liquidity, and Market Returns. Journal of Financial Economics, Vol. 64, no. 1, 2002
  3. Cartea A., Donnelly R., Jaimungal S. Enhancing Trading Strategies with Order Book Signals. SSRN, October 2015