Non-identity DR Scaling
In this note we derive the update equations when using a non-identity scaling. Standard Douglas-Rachford splitting applied to the SCS problem is described in the Algorithm page. Now consider modifying DR splitting to use a diagonal matrix \(R\) instead of \(I\). This is useful because the matrix \(R\) can be selected to provide better convergence in practice. The algorithm then becomes:
which yields \(w^k \rightarrow u^\star + R^{-1} \mathcal{Q}(u^\star)\) where \(0 \in \mathcal{Q}(u^\star) + N_{\mathcal{C}_+}(u^\star)\).
This changes the first two steps of the procedure. The linear projection and the cone projection (explained next).
Cone projection
Note that \(R\) has to be chosen so that the cone projection is preserved. To do this we ensure that the entries of \(R > 0\) corresponding to each cone are constant within that cone (with the box cone being a slight exception to this). This means that R does not affect the cone projection in any way, since for sub-cone \(\mathcal{K}\):
In other words, \(R\) is selected such that
Selecting \(R\)
For SCS we take
where \(I_n\) is \(n \times n\) identity, \(\rho_x \in \mathbf{R}\),
\(\rho_y \in \mathbf{R}^m\) and \(d \in \mathbf{R}\). The \(\rho_y\)
term includes the effect of the parameter scale
, which is updated
heuristically to improve convergence. Basically
with the only difference being that each cone can modify this relationship slightly (but currently only the zero cone does, which sets \(\rho_y = 1 / (1000\ \mathrm{scale})\)). So as scale increases, \(\rho_y\) decreases and vice-versa.
The quantity \(\rho_x\) is determined by the setting value
rho_x
. The quantity \(\rho_y\) is determined by the setting value scale
but is updated adaptively using the techniques
described in Dynamic scale updating. Finally, \(d\)
is determined by TAU_FACTOR
in the code defined in glbopts.h
.
Root-plus function
Finally, the root_plus
function is modified to be the solution
of the following quadratic equation:
where \(R_{-1}\) corresponds to the first \(n+m\) entries of \(R\). Other than when computing \(\kappa\) (which does not affect the algorithm) this is the only place where \(d\) appears, so we have a lot of flexibility in how to choose it and it can even change from iteration to iteration. It is an open question on how best to select this parameter.
Moreau decomposition
The projection onto a cone under a diagonal scaling also satisfies a Moreau-style decomposition identity, as follows:
where \(\Pi_\mathcal{C}^R\) denotes the projection onto convex cone \(\mathcal{C}\) under the \(R\)-norm, which is defined as
This identity is useful when deriving cone projection routines, though most cones are either invariant to this or we enforce that \(R\) is constant across them. Note that the two components of the decomposition are \(R\)-orthogonal.
Dual vector
In order to get the dual vector \(v^k\) that contains \(s^k\) and \(\kappa^k\) we use:
and we have
by Moreau, and finally note that \(v^k \perp u^k\) from the fact that the Moreau decomposition is \(R\)-orthogonal.
Dynamic scale updating
The choice of the scale
parameter can have a large impact on the
performance of the algorithm and the optimal choice is highly problem
dependent. SCS can dynamically adjust the scale
parameter
on the fly via a heuristic procedure that can substantially improve convergence
in practice. This procedure is enabled by the adaptive_scale
setting. The procedure attempts to balance the convergence
rate of the primal residual with the dual residual. Loosely speaking, the
scale
parameter will be increased if the primal residual is much larger
than the dual and decreased if the opposite is true.
Specifically, at iteration \(k\) consider the case where \(l\)
iterations have elapsed since the last update of the scale
parameter,
and denote by \((x, y, \tau) = u^k\) and \((0, s, \kappa) = v^k\), and
the relative residuals as
where by default we use the \(\ell_\infty\) norm for these quantities,
but can be changed using the SCALE_NORM
constant in
include/glbopts.h
.
Now consider
ie, \(\beta\) corresponds to the geometric mean of the ratio of the relative
residuals across the last \(l\) iterations. If this number is larger than a
constant (eg, 3) or smaller than another constant (eg, 1/3) and if sufficient
iterations have passed since the last update (eg, 100, as determined by
RESCALING_MIN_ITERS
) then an update of the scale
parameter is
triggered:
The presence of the square root is to prevent over-shooting the ‘optimal’ scale parameter, which could lead to oscillation.
Note that if the linear system is being solved using a
direct method, then updating the scale
parameter will require a new
factorization of the perturbed matrix, so is somewhat expensive for larger
problems and should be done sparingly. Also, since the changing the
scale
changes the operator we are using in DR splitting we also need to
perform a reset of the Anderson acceleration.