.. _scaling: Non-identity DR Scaling ======================= In this note we derive the update equations when using a non-identity scaling. Standard Douglas-Rachford splitting applied to the SCS problem is described in the :ref:`algorithm` page. Now consider modifying DR splitting to use a diagonal matrix :math:`R` instead of :math:`I`. This is useful because the matrix :math:`R` can be selected to provide better convergence in practice. The algorithm then becomes: .. math:: \begin{align} \tilde u^{k+1} &= (R + \mathcal{Q})^{-1} R w^k \\ u^{k+1} &= (R + N_{\mathcal{C}_+})^{-1} R (2 \tilde u^{k+1} - w^k) \\ w^{k+1} &= w^k + u^{k+1} - \tilde u^{k+1} \\ \end{align} which yields :math:`w^k \rightarrow u^\star + R^{-1} \mathcal{Q}(u^\star)` where :math:`0 \in \mathcal{Q}(u^\star) + N_{\mathcal{C}_+}(u^\star)`. This changes the first two steps of the procedure. The :ref:`linear projection ` and the cone projection (explained next). Cone projection --------------- Note that :math:`R` has to be chosen so that the cone projection is preserved. To do this we ensure that the entries of :math:`R > 0` corresponding to each cone are constant within that cone (with the box cone being a slight exception to this). This means that R does not affect the cone projection in any way, since for sub-cone :math:`\mathcal{K}`: .. math:: \begin{align} &x = (r I + \partial I_{\mathcal{K}})^{-1} (r y) \\ \Rightarrow \quad & 0 \in r(x - y) + \partial I_{\mathcal{K}}(x) \\ \Rightarrow \quad & x = \mbox{argmin}_z || z - y ||^2 \mbox{ s.t. } z \in \mathcal{K}. \end{align} In other words, :math:`R` is selected such that .. math:: (R + \partial I_{\mathcal{C}_+})^{-1} R = (I + \partial I_{\mathcal{C}_+})^{-1} Selecting :math:`R` ------------------- For SCS we take .. math:: R = \begin{bmatrix} \rho_x I_n & 0 & 0 \\ 0 & \mathrm{diag}(\rho_y) & 0 \\ 0 & 0 & d \end{bmatrix} where :math:`I_n` is :math:`n \times n` identity, :math:`\rho_x \in \mathbf{R}`, :math:`\rho_y \in \mathbf{R}^m` and :math:`d \in \mathbf{R}`. The :math:`\rho_y` term includes the effect of the parameter :code:`scale`, which is updated heuristically to improve convergence. Basically .. math:: \rho_y \approx (1/\mathrm{scale}) I with the only difference being that each cone can modify this relationship slightly (but currently only the zero cone does, which sets :math:`\rho_y = 1 / (1000\ \mathrm{scale})`). So as scale increases, :math:`\rho_y` decreases and vice-versa. The quantity :math:`\rho_x` is determined by the :ref:`setting ` value :code:`rho_x`. The quantity :math:`\rho_y` is determined by the :ref:`setting ` value :code:`scale` but is updated adaptively using the techniques described in :ref:`Dynamic scale updating `. Finally, :math:`d` is determined by :code:`TAU_FACTOR` in the code defined in :code:`glbopts.h`. Root-plus function ------------------ Finally, the :code:`root_plus` function is modified to be the solution of the following quadratic equation: .. math:: \tau^2 (d + r^\top R_{-1} r) + \tau (r^\top R_{-1} \mu^k - 2 r^\top R_{-1} p^k - d \eta^k) + p^k R_{-1} (p^k - \mu^k) = 0, where :math:`R_{-1}` corresponds to the first :math:`n+m` entries of :math:`R`. Other than when computing :math:`\kappa` (which does not affect the algorithm) this is the *only* place where :math:`d` appears, so we have a lot of flexibility in how to choose it and it can even change from iteration to iteration. It is an open question on how best to select this parameter. Moreau decomposition -------------------- The projection onto a cone under a diagonal scaling also satisfies a Moreau-style decomposition identity, as follows: .. math:: x + R^{-1} \Pi_\mathcal{C^*}^{R^{-1}} ( - R x ) = \Pi_\mathcal{C}^R ( x ) where :math:`\Pi_\mathcal{C}^R` denotes the projection onto convex cone :math:`\mathcal{C}` under the :math:`R`-norm, which is defined as .. math:: \|x\|_R = \sqrt{x^\top R x}. This identity is useful when deriving cone projection routines, though most cones are either invariant to this or we enforce that :math:`R` is constant across them. Note that the two components of the decomposition are :math:`R`-orthogonal. Dual vector ----------- In order to get the dual vector :math:`v^k` that contains :math:`s^k` and :math:`\kappa^k` we use: .. math:: v^{k+1} = R( u^{k+1} + w^k - 2 \tilde u^{k+1} ) \rightarrow \mathcal{Q}(u^\star), and we have .. math:: \begin{align} v^{k+1} &= R( u^{k+1} + w^k - 2 \tilde u^{k+1} ) \\ &= R( \Pi^R_{\mathcal{C}_+} (2 \tilde u^{k+1} - w^k) + w^k - 2 \tilde u^{k+1}) \\ &= R( R^{-1} \Pi^{R^{-1}}_{\mathcal{C}^*_+} (R(w^k -2 \tilde u^{k+1}))) \\ &= \Pi^{R^{-1}}_{\mathcal{C}^*_+} (R(w^k -2 \tilde u^{k+1})) \\ &\in \mathcal{C}^*_+ \end{align} by Moreau, and finally note that :math:`v^k \perp u^k` from the fact that the Moreau decomposition is :math:`R`-orthogonal. .. _updating_scale: Dynamic scale updating ---------------------- The choice of the :code:`scale` parameter can have a large impact on the performance of the algorithm and the optimal choice is highly problem dependent. SCS can dynamically adjust the :code:`scale` parameter on the fly via a heuristic procedure that can substantially improve convergence in practice. This procedure is enabled by the :code:`adaptive_scale` :ref:`setting `. The procedure attempts to balance the convergence rate of the primal residual with the dual residual. Loosely speaking, the :code:`scale` parameter will be increased if the primal residual is much larger than the dual and decreased if the opposite is true. Specifically, at iteration :math:`k` consider the case where :math:`l` iterations have elapsed since the last update of the :code:`scale` parameter, and denote by :math:`(x, y, \tau) = u^k` and :math:`(0, s, \kappa) = v^k`, and the *relative* residuals as .. math:: \hat r^k_p = \frac{\|Ax + s - b \tau\|}{\max(\|Ax\|, \|s\|, \|b \tau \|)} .. math:: \hat r^k_d = \frac{\|Px + A^\top y + c \tau\|}{\max(\|Px\|, \|A^\top y\|, \|c \tau \|)} where by default we use the :math:`\ell_\infty` norm for these quantities, but can be changed using the :code:`SCALE_NORM` constant in :code:`include/glbopts.h`. Now consider .. math:: \beta = \left(\prod_{i=0}^{l-1} \frac{\hat r^{k-i}_p}{\hat r^{k-i}_d}\right)^{1/l} ie, :math:`\beta` corresponds to the geometric mean of the ratio of the relative residuals across the last :math:`l` iterations. If this number is larger than a constant (eg, 3) or smaller than another constant (eg, 1/3) *and* if sufficient iterations have passed since the last update (eg, 100, as determined by :code:`RESCALING_MIN_ITERS`) then an update of the :code:`scale` parameter is triggered: .. math:: \mbox{scale}^+ = \sqrt{\beta}\ \mbox{scale} The presence of the square root is to prevent over-shooting the 'optimal' scale parameter, which could lead to oscillation. Note that if the :ref:`linear system ` is being solved using a direct method, then updating the :code:`scale` parameter will require a new factorization of the perturbed matrix, so is somewhat expensive for larger problems and should be done sparingly. Also, since the changing the :code:`scale` changes the operator we are using in DR splitting we also need to perform a reset of the :ref:`Anderson acceleration `.