Non-identity DR Scaling

In this note we derive the update equations when using a non-identity scaling. Standard Douglas-Rachford splitting applied to the SCS problem is described in the Algorithm page. Now consider modifying DR splitting to use a diagonal matrix \(R\) instead of \(I\). This is useful because the matrix \(R\) can be selected to provide better convergence in practice. The algorithm then becomes:

\[\begin{split}\begin{align} \tilde u^{k+1} &= (R + \mathcal{Q})^{-1} R w^k \\ u^{k+1} &= (R + N_{\mathcal{C}_+})^{-1} R (2 \tilde u^{k+1} - w^k) \\ w^{k+1} &= w^k + u^{k+1} - \tilde u^{k+1} \\ \end{align}\end{split}\]

which yields \(w^k \rightarrow u^\star + R^{-1} \mathcal{Q}(u^\star)\) where \(0 \in \mathcal{Q}(u^\star) + N_{\mathcal{C}_+}(u^\star)\).

This changes the first two steps of the procedure. The linear projection and the cone projection (explained next).

Cone projection

Note that \(R\) has to be chosen so that the cone projection is preserved. To do this we ensure that the entries of \(R > 0\) corresponding to each cone are constant within that cone (with the box cone being a slight exception to this). This means that R does not affect the cone projection in any way, since for sub-cone \(\mathcal{K}\):

\[\begin{split}\begin{align} &x = (r I + \partial I_{\mathcal{K}})^{-1} (r y) \\ \Rightarrow \quad & 0 \in r(x - y) + \partial I_{\mathcal{K}}(x) \\ \Rightarrow \quad & x = \mbox{argmin}_z || z - y ||^2 \mbox{ s.t. } z \in \mathcal{K}. \end{align}\end{split}\]

In other words, \(R\) is selected such that

\[(R + \partial I_{\mathcal{C}_+})^{-1} R = (I + \partial I_{\mathcal{C}_+})^{-1}\]

Selecting \(R\)

For SCS we take

\[\begin{split}R = \begin{bmatrix} \rho_x I_n & 0 & 0 \\ 0 & \mathrm{diag}(\rho_y) & 0 \\ 0 & 0 & d \end{bmatrix}\end{split}\]

where \(I_n\) is \(n \times n\) identity, \(\rho_x \in \mathbf{R}\), \(\rho_y \in \mathbf{R}^m\) and \(d \in \mathbf{R}\). The \(\rho_y\) term includes the effect of the parameter scale, which is updated heuristically to improve convergence. Basically

\[\rho_y \approx (1/\mathrm{scale}) I\]

with the only difference being that each cone can modify this relationship slightly (but currently only the zero cone does, which sets \(\rho_y = 1 / (1000\ \mathrm{scale})\)). So as scale increases, \(\rho_y\) decreases and vice-versa.

The quantity \(\rho_x\) is determined by the setting value rho_x. The quantity \(\rho_y\) is determined by the setting value scale but is updated adaptively using the techniques described in Dynamic scale updating. Finally, \(d\) is determined by TAU_FACTOR in the code defined in glbopts.h.

Root-plus function

Finally, the root_plus function is modified to be the solution of the following quadratic equation:

\[\tau^2 (d + r^\top R_{-1} r) + \tau (r^\top R_{-1} \mu^k - 2 r^\top R_{-1} p^k - d \eta^k) + p^k R_{-1} (p^k - \mu^k) = 0,\]

where \(R_{-1}\) corresponds to the first \(n+m\) entries of \(R\). Other than when computing \(\kappa\) (which does not affect the algorithm) this is the only place where \(d\) appears, so we have a lot of flexibility in how to choose it and it can even change from iteration to iteration. It is an open question on how best to select this parameter.

Moreau decomposition

The projection onto a cone under a diagonal scaling also satisfies a Moreau-style decomposition identity, as follows:

\[x + R^{-1} \Pi_\mathcal{C^*}^{R^{-1}} ( - R x ) = \Pi_\mathcal{C}^R ( x )\]

where \(\Pi_\mathcal{C}^R\) denotes the projection onto convex cone \(\mathcal{C}\) under the \(R\)-norm, which is defined as

\[\|x\|_R = \sqrt{x^\top R x}.\]

This identity is useful when deriving cone projection routines, though most cones are either invariant to this or we enforce that \(R\) is constant across them. Note that the two components of the decomposition are \(R\)-orthogonal.

Dual vector

In order to get the dual vector \(v^k\) that contains \(s^k\) and \(\kappa^k\) we use:

\[v^{k+1} = R( u^{k+1} + w^k - 2 \tilde u^{k+1} ) \rightarrow \mathcal{Q}(u^\star),\]

and we have

\[\begin{split}\begin{align} v^{k+1} &= R( u^{k+1} + w^k - 2 \tilde u^{k+1} ) \\ &= R( \Pi^R_{\mathcal{C}_+} (2 \tilde u^{k+1} - w^k) + w^k - 2 \tilde u^{k+1}) \\ &= R( R^{-1} \Pi^{R^{-1}}_{\mathcal{C}^*_+} (R(w^k -2 \tilde u^{k+1}))) \\ &= \Pi^{R^{-1}}_{\mathcal{C}^*_+} (R(w^k -2 \tilde u^{k+1})) \\ &\in \mathcal{C}^*_+ \end{align}\end{split}\]

by Moreau, and finally note that \(v^k \perp u^k\) from the fact that the Moreau decomposition is \(R\)-orthogonal.

Dynamic scale updating

The choice of the scale parameter can have a large impact on the performance of the algorithm and the optimal choice is highly problem dependent. SCS can dynamically adjust the scale parameter on the fly via a heuristic procedure that can substantially improve convergence in practice. This procedure is enabled by the adaptive_scale setting. The procedure attempts to balance the convergence rate of the primal residual with the dual residual. Loosely speaking, the scale parameter will be increased if the primal residual is much larger than the dual and decreased if the opposite is true.

Specifically, at iteration \(k\) consider the case where \(l\) iterations have elapsed since the last update of the scale parameter, and denote by \((x, y, \tau) = u^k\) and \((0, s, \kappa) = v^k\), and the relative residuals as

\[\hat r^k_p = \frac{\|Ax + s - b \tau\|}{\max(\|Ax\|, \|s\|, \|b \tau \|)}\]
\[\hat r^k_d = \frac{\|Px + A^\top y + c \tau\|}{\max(\|Px\|, \|A^\top y\|, \|c \tau \|)}\]

where by default we use the \(\ell_\infty\) norm for these quantities, but can be changed using the SCALE_NORM constant in include/glbopts.h. Now consider

\[\beta = \left(\prod_{i=0}^{l-1} \frac{\hat r^{k-i}_p}{\hat r^{k-i}_d}\right)^{1/l}\]

ie, \(\beta\) corresponds to the geometric mean of the ratio of the relative residuals across the last \(l\) iterations. If this number is larger than a constant (eg, 3) or smaller than another constant (eg, 1/3) and if sufficient iterations have passed since the last update (eg, 100, as determined by RESCALING_MIN_ITERS) then an update of the scale parameter is triggered:

\[\mbox{scale}^+ = \sqrt{\beta}\ \mbox{scale}\]

The presence of the square root is to prevent over-shooting the ‘optimal’ scale parameter, which could lead to oscillation.

Note that if the linear system is being solved using a direct method, then updating the scale parameter will require a new factorization of the perturbed matrix, so is somewhat expensive for larger problems and should be done sparingly. Also, since the changing the scale changes the operator we are using in DR splitting we also need to perform a reset of the Anderson acceleration.