Quantile regression#

Standard regression#

In this example we do quantile regression in CVXPY. We start by discussing standard regression. In a regression problem we are given data \((x_i,y_i)\in {\bf R}^n \times {\bf R}\), \(i=1,\ldots, m\), and fit a linear (affine) model

\[\hat y_i = \beta ^Tx_i - v,\]

where \(\beta \in {\bf R}^n\) and \(v \in {\bf R}\).

The residuals are \(r_i = \hat y_i - y_i\). In standard (least-squares) regression we choose \(\beta,v\) to minimize \(\|r\|_2^2 = \sum_i r_i^2\). For this choice of \(\beta,v\) the mean of the optimal residuals is zero.

A simple variant is to add (Tikhonov) regularization, meaning we solve the optimization problem

\[\begin{array}{ll} \mbox{minimize} & \|r\|_2^2 + \lambda \|\beta \|_2^2, \end{array}\]

where \(\lambda>0\).
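
As a point of reference, here is a minimal CVXPY sketch of this regularized problem; the data `X`, `y` and the weight `lam` are illustrative placeholders, not part of the example that follows.

# Minimal sketch of Tikhonov-regularized regression (illustrative data).
import cvxpy as cp
import numpy as np

m, n = 50, 5
X = np.random.randn(m, n)  # rows are the x_i
y = np.random.randn(m)
lam = 0.1

beta = cp.Variable(n)
v = cp.Variable()
r = X @ beta - v - y  # residuals r_i = beta^T x_i - v - y_i
prob = cp.Problem(cp.Minimize(cp.sum_squares(r) + lam * cp.sum_squares(beta)))
prob.solve()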

Quantile regression#

An alternative to the standard least-squares penalty is the tilted \(\ell_1\) penalty: for \(\tau \in (0,1)\),

\[ \phi(u)= \tau (u)_+ + (1-\tau) (u)_- = (1/2)|u| + (\tau-1/2) u. \]
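
A quick numerical check (not part of the original example) that the two expressions for \(\phi\) agree, with \((u)_+ = \max(u,0)\) and \((u)_- = \max(-u,0)\):

# Verify that the two forms of the tilted penalty coincide.
import numpy as np

u = np.linspace(-2, 2, 401)
for tau in (0.1, 0.5, 0.9):
    lhs = tau * np.maximum(u, 0) + (1 - tau) * np.maximum(-u, 0)
    rhs = 0.5 * np.abs(u) + (tau - 0.5) * u
    assert np.allclose(lhs, rhs)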

The plots below show \(\phi(u)\) for \(\tau= 0.5\), \(\tau= 0.1\), and \(\tau= 0.9\).

(Figure: the tilted penalty \(\phi(u)\) for \(\tau = 0.5\), \(\tau = 0.1\), and \(\tau = 0.9\).)

In quantile regression we choose \(\beta,v\) to minimize \(\sum_i \phi(r_i)\). For \(r_i \neq 0\),

\[ \frac{\partial \sum_i \phi(r_i)}{\partial v} = -\tau \left|\{i: r_i>0\} \right| + (1-\tau) \left|\{i: r_i<0\}\right|, \]

which means that (roughly speaking) for optimal \(v\) we have

\[ \tau \left|\{i: r_i>0\} \right| = (1-\tau) \left|\{i: r_i<0\}\right|. \]

Since \(\left|\{i: r_i>0\}\right| + \left|\{i: r_i<0\}\right| \approx m\) (few residuals are exactly zero), we conclude that \(\tau m = \left|\{i: r_i<0\}\right|\); that is, the \(\tau\)-quantile of the optimal residuals is zero. Hence the name quantile regression.
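
The small check below (not part of the original example; the data and dimensions are illustrative) fits a quantile regression on random data and confirms that roughly a \(\tau\) fraction of the residuals comes out negative.

# Rough numerical check of the quantile property (illustrative data).
import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n, tau = 200, 3, 0.3
X = np.random.randn(m, n)
y = X @ np.ones(n) + np.random.randn(m)

beta = cp.Variable(n)
v = cp.Variable()
r = X @ beta - v - y  # residuals r_i = beta^T x_i - v - y_i
phi = 0.5 * cp.abs(r) + (tau - 0.5) * r  # tilted l1 penalty
cp.Problem(cp.Minimize(cp.sum(phi))).solve()

# Fraction of negative residuals; should be close to tau = 0.3.
print(np.mean(r.value < 0))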

Example#

In the following code we apply quantile regression to a time series prediction problem. We fit the time series \(x_t\), \(t=0,1,2, \ldots\) with an auto-regressive (AR) predictor

\[\hat x_{t+1} = \beta^T (x_t,\ldots, x_{t-M})- v,\]

where \(M=10\) is the memory of the predictor, so each prediction uses the \(M+1\) most recent values.

We use quantile regression for \(\tau = 0.1,0.5, 0.9\) to fit three AR models. At each time \(t\), the models give three one-step-ahead predictions:

\[ \hat x_{t+1} ^{0.1}, \qquad \hat x_{t+1} ^{0.5}, \qquad \hat x_{t+1} ^{0.9} \]

We divide the time series into a training set, which we use to fit the AR models, and a test set. We plot \(x_t\) and the predictions \(\hat x_{t+1} ^{0.1}\), \(\hat x_{t+1} ^{0.5}\), \(\hat x_{t+1} ^{0.9}\) for the training set and the test set.

# Generate data for quantile regression.
import numpy as np

np.random.seed(1)
TRAIN_LEN = 400
SKIP_LEN = 100
TEST_LEN = 50
TOTAL_LEN = TRAIN_LEN + SKIP_LEN + TEST_LEN
m = 10
x0 = np.random.randn(m)
x = np.zeros(TOTAL_LEN)
x[:m] = x0
# AR(2) recursion; start at i = m so that x[m] is not left at zero.
for i in range(m, TOTAL_LEN):
    x[i] = 1.8 * x[i - 1] - 0.82 * x[i - 2] + np.random.normal()

x = np.exp(0.05 * x + 0.05 * np.random.normal(size=TOTAL_LEN))
# Form the quantile regression problem.
import cvxpy as cp


w = cp.Variable(m + 1)
v = cp.Variable()
tau = cp.Parameter()
error = 0
for i in range(SKIP_LEN, TRAIN_LEN + SKIP_LEN):
    # Residual of the one-step-ahead prediction of x[i].
    r = x[i] - (w.T @ x[i - m - 1 : i] + v)
    # Tilted l1 penalty: phi(r) = 0.5 |r| + (tau - 0.5) r.
    error += 0.5 * cp.abs(r) + (tau - 0.5) * r
prob = cp.Problem(cp.Minimize(error))
# Solve quantile regression for different values of tau.
tau_vals = [0.9, 0.5, 0.1]
pred = np.zeros((len(tau_vals), TOTAL_LEN))
r_vals = np.zeros((len(tau_vals), TOTAL_LEN))
for k, tau_val in enumerate(tau_vals):
    tau.value = tau_val
    prob.solve()
    pred[k, :m] = x0
    for i in range(SKIP_LEN, TOTAL_LEN):
        pred[k, i] = (x[i - m - 1 : i].T @ w + v).value
        r_vals[k, i] = (x[i] - (x[i - m - 1 : i].T @ w + v)).value
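
The training loop above builds the objective one scalar term at a time. An equivalent vectorized construction, sketched below with the same variables (`x`, `w`, `v`, `tau`), stacks the lagged windows into a matrix; this is typically much faster for CVXPY to process on longer series.

# Equivalent vectorized construction of the training objective (sketch).
rows = range(SKIP_LEN, TRAIN_LEN + SKIP_LEN)
X_train = np.array([x[i - m - 1 : i] for i in rows])  # one lagged window per row
y_train = x[SKIP_LEN : TRAIN_LEN + SKIP_LEN]
r_vec = y_train - (X_train @ w + v)
error_vec = cp.sum(0.5 * cp.abs(r_vec) + (tau - 0.5) * r_vec)
prob_vec = cp.Problem(cp.Minimize(error_vec))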

Below we plot the full time series, the training data with the three AR models, and the test data with the three AR models.

# Generate plots.
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'svg'

# Plot the full time series.
plt.plot(range(0, TRAIN_LEN + TEST_LEN), x[SKIP_LEN:], "black", label=r"$x$")
plt.xlabel(r"$t$", fontsize=16)
plt.ylabel(r"$x_t$", fontsize=16)
plt.title("Full time series")
plt.show()

# Plot the predictions from the quantile regression on the training data.
plt.plot(range(0, TRAIN_LEN), x[SKIP_LEN:-TEST_LEN], "black", label=r"$x$")
colors = ["r", "g", "b"]
for k, tau_val in enumerate(tau_vals):
    plt.plot(
        range(0, TRAIN_LEN),
        pred[k, SKIP_LEN:-TEST_LEN],
        colors[k],
        label=r"$\tau = %.1f$" % tau_val,
    )
plt.xlabel(r"$t$", fontsize=16)
plt.title("Training data")
plt.show()

# Plot the predictions from the quantile regression on the test data.
plt.plot(range(TRAIN_LEN, TRAIN_LEN + TEST_LEN), x[-TEST_LEN:], "black", label=r"$x$")
for k, tau_val in enumerate(tau_vals):
    plt.plot(
        range(TRAIN_LEN, TRAIN_LEN + TEST_LEN),
        pred[k, -TEST_LEN:],
        colors[k],
        label=r"$\tau = %.1f$" % tau_val,
    )
plt.xlabel(r"$t$", fontsize=16)
plt.title("Test data")
plt.show()

Below we plot the empirical CDFs of the residuals for the three AR models on the training data and the test data. Notice that the \(\tau\)-quantile of the optimal residuals is close to zero for each model.
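
As a complementary numeric check (not in the original notebook), the fraction of negative training residuals should be close to each \(\tau\):

# Fraction of negative training residuals for each model.
for k, tau_val in enumerate(tau_vals):
    frac_neg = np.mean(r_vals[k, SKIP_LEN : SKIP_LEN + TRAIN_LEN] < 0)
    print("tau = %.1f: fraction of negative residuals = %.3f" % (tau_val, frac_neg))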

# Plot the empirical CDFs of the residuals.

# Plot the CDF for the training data residuals.
for k, tau_val in enumerate(tau_vals):
    r_sorted = np.sort(r_vals[k, SKIP_LEN : SKIP_LEN + TRAIN_LEN])
    yvals = np.arange(len(r_sorted)) / float(len(r_sorted))
    plt.plot(r_sorted, yvals, colors[k], label=r"$\tau = %.1f$" % tau_val)

# Use a separate name for the plotting grid so the time series x is not overwritten.
grid = np.linspace(-1.0, 1, 100)
for val in [0.1, 0.5, 0.9]:
    plt.plot(grid, len(grid) * [val], "k--")
plt.xlabel("Residual")
plt.ylabel("Cumulative density in training")
plt.xlim([-0.6, 0.6])
plt.legend(loc="upper left")
plt.show()

# Plot the CDF for the test data residuals.
for k, tau_val in enumerate(tau_vals):
    r_sorted = np.sort(r_vals[k, -TEST_LEN:])
    yvals = np.arange(len(r_sorted)) / float(len(r_sorted))
    plt.plot(r_sorted, yvals, colors[k], label=r"$\tau = %.1f$" % tau_val)

grid = np.linspace(-1, 1, 100)
for val in [0.1, 0.5, 0.9]:
    plt.plot(grid, len(grid) * [val], "k--")
plt.xlabel("Residual")
plt.ylabel("Cumulative density in testing")
plt.xlim([-0.6, 0.6])
plt.legend(loc="upper left")
plt.show()