Francis Bach: Revisiting scaling laws via the z-transform
In the last few years, we have seen a surge of empirical and theoretical works about “scaling laws”, whose goals are to characterize the performance of learning methods based on various problem parameters (e.g., number of observations and parameters, or amount of compute). From a theoretical point of view, this marks a renewed interest in asymptotic equivalents—something the machine learning community had mostly moved away from (and, let’s be honest, kind of looked down on) in favor of non-asymptotic bounds. The two types of bounds have their own pros and cons, and I value both of them, but on a personal note, after having spent close to twenty years proving (sometimes tedious) non-asymptotic results, I find proving asymptotic results particularly rewarding, especially when you experience that “aha!” moment where the empirical behavior aligns perfectly with the derived expression (see some examples below).
Scaling laws for quadratic optimization require aggregating the effects of all eigenvalues of the Hessian matrix and not considering only the worst one. This was done with Laplace’s method in an earlier post for gradient descent and partially for Nesterov acceleration. This blog post explores similar asymptotic scaling laws and will consider stochastic algorithms, but there is more to it.
I will use a classical tool from applied mathematics, electrical engineering, and computer science: the z-transform (a.k.a. generating function) of a discrete sequence. My goal in this blog post is to present the essentials of the method. For more details, see my recent preprint [1], with lots of potential follow-ups. Note that the z-transform has already been used in the context of stochastic gradient descent for stability analysis [15, 20] and for deriving convergence rates with constant momentum [22].
z-transform method and final value theorem
From Abel’s theorem to the final value theorem. Given a sequence ((b_k)_{k \geqslant 0}), its z-transform is the function (B) defined for a complex argument (z \in \mathbb{C}), as (B(z) = \sum_{k=0}^\infty b_k z^k) (note the change of convention where (z^{-1}) is replaced by (z)). The behavior of the series defined by ((b_k)_{k \geqslant 0}) can be characterized by the behavior of (B(z)) for (z) tending to (1) while remaining in ([0,1)). This is the classical Abel’s theorem: when the series defined by ((b_k)_{k \geqslant 0}) is convergent, then \(\sum\_{k=0}^{+\infty} b\_k = \lim\_{z \to 1^-} B(z).\)
When applied to (b_k = a_k - a_{k-1}) for (k \geqslant 1), with (b_0 = a_0), we have (\sum_{\ell=0}^k b_\ell = a_k), and (B(z) = (1 - z) A(z)) where (A) is the z-transform of ((a_k)_{k \geqslant 0}). We then obtain the final value theorem, which states \(\lim\_{z \to 1^-} \ ( 1 - z) A(z) = \lim\_{k \to +\infty} a\_k.\)
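As a quick sanity check, here is a minimal numerical illustration of the final value theorem (a sketch, with an arbitrary convergent sequence of my choosing and a truncated series standing in for the exact z-transform):

```python
# Minimal numerical check of the final value theorem on the illustrative
# sequence a_k = 1 - rho^(k+1), which converges to 1.
import numpy as np

rho = 0.7
K = 20_000                      # truncation level for the series
k = np.arange(K)
a = 1.0 - rho ** (k + 1)        # a_k -> 1

for z in [0.9, 0.99, 0.999]:
    A_z = np.sum(a * z ** k)    # truncated z-transform A(z)
    print(z, (1 - z) * A_z)     # approaches lim_k a_k = 1 as z -> 1^-
```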
From limits to asymptotic equivalents. In order to obtain equivalents of the sequence ((a_k)_{k \geqslant 0}) (and not simply its limit), we can use two additional properties:
- the z-transform of (((k+1) a_{k+1})_{k \geqslant 0}) is (A’(z)), leading to \(\lim\_{z \to 1^-} \ ( 1 \, – z) A'(z)= \lim\_{k \to +\infty} k a\_k.\)
- To treat non-integer powers, we have the following identity, valid for any real (\nu > 0), \(\lim\_{z \to 1^-} \ ( 1 \,- z)^{\nu} A(z) = \Gamma(\nu) \cdot \lim\_{k \to +\infty} k^{1-\nu} a\_k,\) where (\Gamma) is the Gamma function. A classical example is (a_k = \frac{1}{k!} \nu(\nu+1) \cdots (\nu+k-1) = \frac{ \Gamma(k+\nu)}{\Gamma(\nu)\Gamma(k+1)} \sim \frac{k^{\nu-1}}{\Gamma(\nu)}), for which (A(z) = (1-z)^{-\nu}).
Combining the last two properties (with an extension to higher-order derivatives), when the two limits exist, we have: \(\lim\_{z \to 1^-} \ ( 1 \, – z)^{\nu} A^{(\mu)}(z) = \Gamma(\nu)\cdot \lim\_{k \to +\infty} k^{1+\mu-\nu} a\_k.\) Therefore, the z-transform method consists in deriving asymptotics of the z-transform or its derivatives around (z = 1^-). Before showing why this is a simplification, some words on sufficient conditions for the equivalence to hold, through a sequence of so-called “Tauberian” theorems.
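Before that, here is a small numerical illustration of the identity above (a sketch, using the classical example just mentioned, with a truncated series and an arbitrary value of (\nu)):

```python
# Check of the equivalent through (1 - z)^nu A(z) on the classical example
# a_k = Gamma(k + nu) / (Gamma(nu) Gamma(k + 1)), whose z-transform is
# A(z) = (1 - z)^(-nu), so the transform-side limit below should be 1.
import numpy as np
from scipy.special import gammaln, gamma

nu = 0.5
K = 200_000
k = np.arange(K)
a = np.exp(gammaln(k + nu) - gammaln(nu) - gammaln(k + 1))

# Sequence side: k^(1 - nu) a_k should tend to 1 / Gamma(nu).
print((K - 1) ** (1 - nu) * a[-1], 1 / gamma(nu))

# Transform side: (1 - z)^nu A(z) should tend to Gamma(nu) / Gamma(nu) = 1.
for z in [0.99, 0.999, 0.9999]:
    print(z, (1 - z) ** nu * np.sum(a * z ** k))
```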
Tauberian theory. In order to show that limits and asymptotic equivalents exist, several options are available. When only considering real-valued arguments for the z-transforms, various sufficient conditions have been derived from the work of Hardy and Littlewood [2] (the book by Jacob Korevaar [3] is a great source of detailed results). In a nutshell, oscillating sequences such as (a_k = (-1)^k) are the ones where the limits are not equal (or may not exist), and a classical sufficient condition is that there exists a constant (c) such that for all (k \geqslant 1), (a_k - a_{k-1} \leqslant c / k^{2+\mu - \nu}) (in particular, with (c = 0), being decreasing is a sufficient condition). Alternative frameworks can also be considered using complex analysis, which only depend on properties of the z-transform (A(z)) for complex (z) [4, 5].
Properties of the z-transform. What makes the z-transform so versatile is a suite of properties that link modifications of the sequence ((a_k)_{k \geqslant 0}) to its z-transform (A(z)) (see more details in [1, 6]). The most important ones are:
- Shift: If (A(z)) is the z-transform of the sequence ((a_k)_{k \geqslant 0}), then (z A(z)) is the z-transform of the sequence ((0,a_0,a_1,\dots)), that is, shifted by 1. This implies that sequences that satisfy a linear difference equation with constant coefficients lead to rational z-transforms.
- Multiplication by a polynomial: If (A) is the z-transform of ((a_k)_{k \geqslant 0}), then (z \mapsto z A’(z)) is the (z)-transform of ((k a_k)_{k \geqslant 0}). Thus, when we have a difference equation with rational coefficients (in (k)), this leads to an ordinary differential equation for the z-transform.
Classical use of the z-transform method to derive asymptotic equivalents. The z-transform method is a classical tool, with already many applications, e.g., in combinatorics, analysis of recurrence relations, queuing theory, signal processing, and control (see [6, 7]). In this blog post, I will focus on optimization of quadratic functions.
Gradient descent and spectral dimension
Like in an earlier post, we consider an idealized quadratic optimization problem in infinite dimension, which can be rewritten as the minimization of \(F(\eta) = \frac{1}{2} \langle \eta-\eta\_\ast, H ( \eta-\eta\_\ast)\rangle,\) where (\eta_\ast) is a global optimum of (F) and (H) the positive semi-definite Hessian operator. Gradient descent leads to the recursion \(\eta\_k = \eta\_{k-1} \, – \gamma F'(\eta\_{k-1}) = \eta\_{k-1} \, – \gamma H ( \eta\_{k-1} \, – \eta\_\ast),\) where (\gamma > 0) is a step-size which we choose so that the operator norm of (\gamma H) is less than one (e.g., (\gamma \leqslant 1/L), where (L) is the largest eigenvalue of the operator (H)). With (\theta_k = \eta_k \, – \eta_\ast), this leads to the following linear iteration \(\theta\_k = (I \, – \gamma H) \theta\_{k-1} = ( I \, – \gamma H)^k \delta,\) where (\delta = \theta_0 = \eta_0\, – \eta_\ast).
Thus, given a spectral decomposition (\gamma H = \sum_{i=1}^{+\infty} \lambda_i u_i \otimes u_i) with eigenvalues ((\lambda_i)_{i \geqslant 1}) in ([0,1]) and eigenvectors ((u_i)_{i \geqslant 1}), we get the following performance criterion \(a\_k = \frac{1}{2\gamma} \sum\_{i=1}^{+\infty} \lambda\_i (1 \, – \lambda\_i)^{2k} \langle u\_i,\delta \rangle^2 = \int\_{0}^1 (1 \, – \lambda)^{2k} d\sigma(\lambda),\) where \(d\sigma(\lambda) = \frac{1}{2\gamma} \sum\_{i=1}^{+\infty} \lambda\_i \langle u\_i,\delta \rangle^2 {\rm Dirac}(\lambda|\lambda\_i)\)
is a weighted spectral measure, as defined in [8] in the gossip context (see this earlier post), with a z-transform equal to \(A(z) = \sum\_{k=0}^{+\infty} z^k \int\_0^{1 } (1-\lambda)^{2k} d\sigma(\lambda)
= \int\_0^{1 } \frac{1}{1 - z (1-\lambda)^{2}} d\sigma(\lambda),\) which is a rational function in (z) and (\lambda). Using a partial fraction decomposition in the variable (\lambda), it can thus be rewritten as \(A(z) = \frac{1}{2 \sqrt{z} } \int\_0^{1 } \frac{d\sigma(\lambda) }{ \lambda + z^{-1/2} - 1 } \, – \, \frac{1}{2\sqrt{z}} \int\_0^{1 } \frac{d\sigma(\lambda) }{ \lambda - z^{-1/2} - 1 }.\) We thus get \(A(z) = \frac{1}{2 \sqrt{z} } S( z^{-1/2}-1 ) \, – \, \frac{1}{2 \sqrt{z}}S ( – z^{-1/2}-1) ,\)
where (\displaystyle S(u) = \int_0^1 \frac{d\sigma(\lambda)}{\lambda + u}) is the Stieltjes transform of the measure (\sigma) (see Chapter 8 of [10]), a classical tool in random matrix theory. We see above that what matters is the equivalent of (S(u)) around (u=0) (since the value around (u = -2) is bounded).
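Here is a minimal sketch checking this partial fraction identity on a hypothetical discrete spectral measure (random points and weights of my choosing), comparing the series form of (A(z)) with the two Stieltjes-transform evaluations:

```python
# Check of A(z) = S(z^{-1/2} - 1)/(2 sqrt(z)) - S(-z^{-1/2} - 1)/(2 sqrt(z))
# on a discrete spectral measure with weights w_i at points lam_i.
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.01, 1.0, size=50)   # eigenvalues of gamma * H, in (0, 1]
w = rng.uniform(0.0, 1.0, size=50)      # weights of the spectral measure

def A_series(z):
    # z-transform of a_k = int (1 - lam)^(2k) dsigma(lam), summed in closed form
    return np.sum(w / (1.0 - z * (1.0 - lam) ** 2))

def S(u):
    # Stieltjes transform of the spectral measure
    return np.sum(w / (lam + u))

for z in [0.5, 0.9, 0.99]:
    lhs = A_series(z)
    rhs = (S(z ** -0.5 - 1) - S(-z ** -0.5 - 1)) / (2 * np.sqrt(z))
    print(z, lhs, rhs)                  # the two should coincide
```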
The spectral dimension defined by [8] corresponds to a measure whose behavior around zero has density (c \lambda^{\omega - 1}), or in almost equivalent terms so that (S(u) \sim c \Gamma(\omega) \Gamma(1-\omega) u^{\omega -1}) around (u=0), with a similar expansion for derivatives. This leads directly to an explicit scaling law in ({1}/{k^\omega}), which exactly recovers an earlier post (which was using Laplace’s method) with a different proof.
The spectral dimension depends explicitly on problem parameters:
- For least-squares regression, as described in an earlier post and in many papers, this corresponds to eigenvalues of (H) decaying as (L / i^\alpha) (“capacity” conditions) and (\langle \delta, u_i \rangle \sim \Delta / i^{\beta/2}) (“source” conditions), leading to (\omega = \frac{\beta-1}{\alpha} + 1 ) assumed to be positive, and (c = \frac{L \Delta^2 }{2 \alpha (\gamma L)^{\frac{\beta-1}{\alpha}+1}}). The equivalent is then \(a\_k \sim c \frac{\Gamma(\omega)}{(2k)^{\omega}},\) with nice connections to non-asymptotic bounds for the same problem (see also the numerical sketch after this list).
- For gossip, as shown in [8], we get that (\omega) is the underlying dimension of the set on which local messages are sent.
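Here is the numerical sketch announced above for the least-squares case (illustrative values of (\alpha), (\beta), (L), (\Delta), (\gamma), and a finite truncation of the spectrum), comparing (a_k (2k)^{\omega}) with (c\,\Gamma(\omega)):

```python
# Gradient descent on the idealized quadratic: a_k computed from the spectrum,
# compared to the predicted equivalent c * Gamma(omega) / (2k)^omega.
import numpy as np
from scipy.special import gamma as Gamma

alpha, beta = 2.0, 1.5
L, Delta, step = 1.0, 1.0, 1.0           # "step" plays the role of gamma (gamma * L <= 1)
omega = (beta - 1) / alpha + 1
c = L * Delta ** 2 / (2 * alpha * (step * L) ** omega)

i = np.arange(1, 1_000_001, dtype=float)
lam = step * L / i ** alpha              # eigenvalues of gamma * H
w = Delta ** 2 / i ** beta               # squared coefficients <u_i, delta>^2

for k in [100, 1000, 10000]:
    a_k = np.sum(lam * (1 - lam) ** (2 * k) * w) / (2 * step)
    print(k, a_k * (2 * k) ** omega, c * Gamma(omega))   # should roughly agree for large k
```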
Summary. Linear iterations lead to rational z-transforms along all eigenvectors, and then asymptotics of the Stieltjes transform (based on the spectral dimension) can be used after partial fraction decomposition. For gradient descent (a simple linear iteration with constant coefficients), this can be easily achieved in other ways. However, for time-dependent coefficients (Nesterov acceleration) and for stochastic iterations, it will provide a new direct way that avoids lengthy computations.
Nesterov acceleration vs. heavy-ball
A classical way to accelerate the convergence of the gradient iteration is to use an extrapolation step due to Nesterov [11]. This can be formulated with two sequences (and a first-order recursion) as:
\(\begin{array}{rcl} \eta\_{k+1} & = & \zeta\_{k} \, – \gamma F'(\zeta\_{k}) \\ \zeta\_{k+1} & = & \eta\_{k+1} + ( 1 \, – \tau\_{k+1}) ( \eta\_{k+1} \, – \eta\_k), \end{array}\) and with a single sequence (and a second-order recursion) as \(\eta\_{k+1} = \eta\_{k} + ( 1 \, – \tau\_{k}) ( \eta\_{k}\, – \eta\_{k-1})\, – \gamma F’ \big[ \eta\_{k} + ( 1 \,- \tau\_{k}) ( \eta\_{k} \, – \eta\_{k-1})\big].\)
In the original formulation with convex functions, (\tau_k) is chosen asymptotically proportional to (2\rho/(k+1)) with (\rho = 3/2). For quadratic functions, [9] argues that the choice (\rho = 1) is more natural as it leads to a simple rescaled reformulation, and [12, 13] consider a general (\rho). We now show that the z-transform allows us to analyze all integers (\rho), with a conjecture that is empirically valid for all (\rho >0).
With (\theta_k = \eta_k - \eta_\ast), we obtain the following recursion for quadratic functions \(\theta\_{k+1} = ( I - \gamma H) \bigg[ \theta\_{k} + \Big( 1 - \frac{ 2\rho }{k+2\rho-1} \Big) ( \theta\_{k} - \theta\_{k-1}) \bigg].\) I show in [1] how the performance measure leads to an explicit rational function in ((z,\lambda)) of order (2 \rho), obtained through differentiations of the z-transform (to take into account the rational dependence of (\tau_k) on (k)). For (\rho=1), we can easily obtain a partial fraction decomposition by hand, while for larger integer (\rho), this can be readily done by symbolic computation tools such as Mathematica. Overall, the final equivalents for (a_k = \frac{1}{2} \langle \eta_k - \eta_\ast, H ( \eta_k - \eta_\ast) \rangle) are essentially (up to logarithmic terms) proportional to ({1}/{k^{\min\{2\omega, \omega + \rho\}}}). More precisely, we get totally explicit equivalents:
- (\displaystyle a_k \sim c \frac{\Gamma(2 \rho)^2}{\Gamma(\rho)^2} \cdot \frac{\Gamma(\rho-\omega) \Gamma(\omega)}{\Gamma(4\rho-1-2\omega)} \frac{\Gamma(2\rho-1/2-\omega)}{\Gamma(\rho+1/2-\omega)} \frac{2^{2\rho-1}}{4^\omega } \cdot \frac{1}{k^{ 2\omega}}) for (\omega \in (0,\rho)).
- (\displaystyle a_k \sim c \frac{\Gamma(2 \rho)^2}{\Gamma(\rho)^2} \cdot\frac{1}{2^{2\rho-1}} \cdot \frac{ \log k}{k^{2\rho}}) for (\omega = \rho).
- (\displaystyle a_k \sim c \frac{\Gamma(2 \rho)^2}{\Gamma(\rho)^2} \cdot \frac{1}{2^{2\rho-1}} \Gamma(\omega-\rho)\cdot \frac{1}{k^{\omega+\rho}}) for (\omega > \rho).
And amazingly, they do match in our experiments even beyond integer (\rho), as shown below.
Nesterov acceleration for (\omega = 1) and several (\rho)’s. Note the classical vanishing oscillations for small (\rho). Moreover, the best performance is achieved around (\rho \approx \omega).
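To see how such predictions can be checked, here is a simulation sketch of the Nesterov recursion above in the eigenbasis of (H) (illustrative spectrum and initialization, with the convention (\theta_{-1} = \theta_0) as an assumption on my part), compared with the predicted decay in ({1}/{k^{\min\{2\omega, \omega+\rho\}}}), ignoring logarithmic factors:

```python
# Simulation of the Nesterov recursion for quadratics in the eigenbasis of H,
# compared to the predicted decay exponent min(2*omega, omega + rho).
import numpy as np

alpha, beta, rho = 2.0, 1.5, 1.0
L, Delta, step = 1.0, 1.0, 1.0
omega = (beta - 1) / alpha + 1
exponent = min(2 * omega, omega + rho)

i = np.arange(1, 20_001, dtype=float)
lam = step * L / i ** alpha              # eigenvalues of gamma * H
theta_prev = Delta / i ** (beta / 2)     # convention theta_{-1} = theta_0 = delta
theta = theta_prev.copy()

for k in range(5000):
    momentum = 1 - 2 * rho / (k + 2 * rho - 1)
    theta, theta_prev = (1 - lam) * (theta + momentum * (theta - theta_prev)), theta
    if (k + 1) in [100, 1000, 5000]:
        a_k = np.sum(lam * theta ** 2) / (2 * step)
        print(k + 1, a_k * (k + 1) ** exponent)   # should roughly stabilize (vanishing oscillations)
```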
What about heavy-ball? The heavy-ball method is very similar to Nesterov acceleration, with the following iteration \(\eta\_{k+1} = \eta\_{k} - \gamma F'(\eta\_k) + (1-\tau\_{k+1}) ( \eta\_k - \eta\_{k-1}),\) leading, with the same notation, to \(\theta\_{k+1} = \Big( I- \frac{k}{k+2\rho-1} \gamma H\Big) \theta\_{k} + \Big( 1 - \frac{ 2\rho }{k+2\rho-1} \Big) ( \theta\_{k} - \theta\_{k-1}).\) The associated z-transform is also the integral with respect to the spectral measure of a rational function, leading to an asymptotic equivalent around (z=1^-). Then, one can plot the actual performance and compare to the equivalent, as shown below.
Heavy-ball method with (\rho=1) for two values of the spectral dimension (\omega). Top: performance (a_k) vs. equivalent (\bar{a}_k), bottom: ratio.
What’s happening? The two do not match for (\omega) large enough (right plot above). It turns out that oscillations do not vanish, and the sufficient Tauberian conditions are not satisfied. What can then be shown in terms of asymptotic equivalents remains open.
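For readers who want to reproduce this behavior, here is a minimal sketch of the heavy-ball recursion above (same illustrative spectrum and (\theta_{-1}=\theta_0) convention as in the Nesterov sketch); plotting the recorded values of (a_k) (e.g., on a log-log scale) shows the persistent oscillations:

```python
# Heavy-ball recursion for quadratics in the eigenbasis of H (sketch).
import numpy as np

alpha, beta, rho = 2.0, 3.0, 1.0         # beta chosen so that omega is "large"
L, Delta, step = 1.0, 1.0, 1.0
omega = (beta - 1) / alpha + 1

i = np.arange(1, 20_001, dtype=float)
lam = step * L / i ** alpha
theta_prev = Delta / i ** (beta / 2)
theta = theta_prev.copy()

a = []
for k in range(5000):
    grad_coef = k / (k + 2 * rho - 1)
    momentum = 1 - 2 * rho / (k + 2 * rho - 1)
    theta, theta_prev = (1 - grad_coef * lam) * theta + momentum * (theta - theta_prev), theta
    a.append(np.sum(lam * theta ** 2) / (2 * step))
# e.g., plot np.array(a) on a log-log scale and compare with the Nesterov sketch
# to observe the oscillations that do not vanish.
```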
Stochastic gradient descent (a.k.a. least-mean-squares algorithm)
We now consider minimizing the expected risk for a linear predictor with the square loss: \(F(\eta) = \frac{1}{2}\mathbb{E} \big[ ( y - \langle x, \eta \rangle )^2 \big]\) (keeping in mind that we can use a feature vector). Single-pass stochastic gradient descent (SGD) leads to the iteration \(\eta\_k = \eta\_{k-1} - \gamma ( \langle x\_k,\eta\_{k-1} \rangle - y\_k) x\_k,\) with ((x_k,y_k)) an independent sample from the distribution defining (F). We can write it using a minimizer (\eta_\ast) of (F) as \(\eta\_k -\eta\_\ast =( I - \gamma x\_k \otimes x\_k) (\eta\_{k-1} - \eta\_\ast) + \gamma \varepsilon\_k x\_k,\) where (\varepsilon_k = y_k - \langle x_k, \eta_\ast \rangle) is the optimal residual. We denote (\theta_k = \eta_k - \eta_\ast). In this context, SGD is referred to as the least-mean-squares (LMS) algorithm [14, 15, 16, 17].
The simplest but incomplete analysis uses that (\mathbb{E} [ \varepsilon_k x_k ] =0) (which is the optimality condition characterizing (\eta_\ast)), to get: \(\mathbb{E}[\theta\_k] = ( I - \gamma H) \mathbb{E}[\theta\_{k-1}],\) leading to the classical linear iteration of gradient descent. However, it does not lead to any result on the performance measure (\frac{1}{2} \mathbb{E} [ \langle \theta_k, H \theta_k \rangle] = \mathbb{E} [ F(\eta_k) - F(\eta_\ast) ]).
Traditionally, there are two terms in the analysis: “bias” and “variance” (see a thorough discussion in [27]). For simplicity, let’s assume that (\varepsilon_k) is independent of (x_k), with (\mathbb{E} [ \varepsilon_k]=0) and (\mathbb{E} [ \varepsilon_k^2]=\sigma^2) (not to be confused with the notation for the spectral measure defined above). We then consider the operator \(\Theta\_k = (\eta\_k - \eta\_\ast)\otimes (\eta\_k - \eta\_\ast) = \theta\_k \otimes \theta\_k,\) for which we have \(F(\eta\_k) - F(\eta\_\ast) = \frac{1}{2} \langle \Theta\_k,H \rangle = \frac{1}{2}{\rm tr}( \Theta\_k H),\) and \(\mathbb{E} [ \Theta\_k | \Theta\_{k-1}] = \mathbb{E} \big[ ( I - \gamma x\_k \otimes x\_k) \otimes ( I - \gamma x\_k \otimes x\_k) \big] \Theta\_{k-1} + \gamma^2 \sigma^2 H,\) leading to: \(\mathbb{E} [ \Theta\_k ] = \big( I - \gamma H \otimes I - \gamma I \otimes H + \gamma^2 \mathbb{E} [ x \otimes x \otimes x \otimes x] \big) \mathbb{E} [ \Theta\_{k-1} ] + \gamma^2 \sigma^2 H.\)
We consider the operator (defined on self-adjoint operators) \(T = H \otimes I \, + I \otimes H \, -\, \gamma \mathbb{E} [ x \otimes x \otimes x \otimes x],\) so that \(\tag{1} \mathbb{E} [ \Theta\_k ] = \big( I\, – \gamma T \big) \mathbb{E} [ \Theta\_{k-1} ] \, + \gamma^2 \sigma^2 H,\) leading to, after summation of the geometric series (see an earlier post): \(\mathbb{E} [ \Theta\_k ] = \big( I\, – \gamma T \big)^k \Theta\_{0} \, + \gamma \sigma^2 \big[ I- \big( I- \gamma T \big)^k \big] T^{-1} H .\)
A sufficient condition for convergence is that the operator (I\, – \gamma T) (as an operator on symmetric operators) has eigenvalues in ([-1,1]). In [17], we define and characterize the largest step-size with few assumptions. We consider instead specific models for asymptotic computations.
Simplest model. In order to obtain simpler formulas, several models have been proposed that are compatible with the Hessian operator (H) having eigenvalues ((h_i)_{i \geqslant 1}) and eigenvectors ((u_i)_{i \geqslant 1}). The simplest model [18] is (x = r u_j) with (r \in \mathbb{R}) and (j \in \mathbb{N}^\ast) random and independent, with (\mathbb{E}[r^2] = \sum_{i \geqslant 1} h_i = {\rm tr}(H)), and (\mathbb{P}(j = i) = h_i / {\rm tr}(H)). What makes this model special is that, like in the deterministic case, there are no interactions between eigensubspaces. While this includes one-hot encoded discrete data as recently used in [19], this remains limited compared to the model below (which includes Gaussian data and is often used in random matrix theory).
Simplified high-dimensional model. We consider \(x= \sum\_{i = 1}^{+\infty} h\_i^{1/2} z\_i u\_i,\) where the variables (z_i \in \mathbb{R}), (i \geqslant 1), are independent (and not merely uncorrelated), and have zero mean and unit variance. The only parameter that we will need to characterize the operator (T) is the fourth-order moment, (\mathbb{E} [ z_i^4] = 3 + \kappa), where (\kappa) is the excess kurtosis.
The Gaussian model [15] where all (z_i) are Gaussian corresponds to (\kappa=0), while the Rademacher model where (z_i \in \{-1,1\}) are Rademacher random variables corresponds to the smallest possible value (\kappa=-2). Other models could also be considered, particularly when using kernels [25].
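As a small illustration of this input model (with an illustrative spectrum and finite dimension of my choosing), one can generate Gaussian and Rademacher features and check the second moments and the excess kurtosis:

```python
# Simplified high-dimensional input model: x = sum_i h_i^(1/2) z_i u_i, with
# u_i the canonical basis; kappa is the excess kurtosis of the z_i.
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 20_000
h = 1.0 / np.arange(1, d + 1) ** 2                     # eigenvalues of H

for name, z in [("gaussian", rng.standard_normal((n, d))),
                ("rademacher", rng.integers(0, 2, size=(n, d)) * 2.0 - 1.0)]:
    x = np.sqrt(h) * z                                 # n samples of x, in the eigenbasis of H
    print(name, np.mean(z ** 4) - 3)                   # empirical kappa: ~0 and -2
    print(np.mean(x ** 2, axis=0)[:3], h[:3])          # E[x_i^2] should match h_i
```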
With this simplified model, as already shown in the 1980’s in the Gaussian context [20, 15], in the basis of eigenvectors of (H), the operator (T) has a simple block-diagonal structure: the off-diagonal elements of (\mathbb{E} [ \Theta_k ]) evolve independently (and converge to zero linearly), while the diagonal elements that are needed to compute the performance measure evolve according to a linear iteration with a matrix that is the sum of a diagonal and a rank-one matrix. More precisely, assuming without loss of generality that the operator (H) is diagonal and equal to ({\rm Diag}(h)), Eq. ((1)) projected on diagonal elements becomes \({\rm diag}( \mathbb{E}[ \Theta\_k]) = ( I - \gamma V) \, {\rm diag}( \mathbb{E}[ \Theta\_{k-1}]) + \gamma^2 \sigma^2 h,\) with \(V = {\rm Diag}\big( 2 h - \gamma(\kappa+2) h \circ h\big) - \gamma h \otimes h .\)
The performance (a_k = \mathbb{E} [ F(\eta_k) ] – F(\eta_\ast)) is thus \(a\_k = \frac{1}{2}\langle \delta \circ \delta, ( I \, – \gamma V )^k h \rangle + \gamma \sigma^2 \langle h, \big(I\, – ( I \, – \gamma V )^k \big) V^{-1} h \rangle.\) In order to study convergence, we could look at an eigenvalue decomposition of the matrix (V) using “secular” equations [21]. Instead, we will use the z-transform method, which avoids explicit computations of eigenvalues.
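Before doing so, here is a small sketch that simply builds (V) and iterates the diagonal recursion above (illustrative spectrum, initialization and step-size, with (\sigma^2 = 0) to match the interpolation regime considered next); it provides a numerical reference for the z-transform route:

```python
# Build V = Diag(2h - gamma*(kappa+2)*h*h) - gamma * h h^T and iterate
# diag(E[Theta_k]) to read off a_k (noiseless case, sigma^2 = 0).
import numpy as np

d = 1000
i = np.arange(1, d + 1, dtype=float)
h = 1.0 / i ** 2                        # eigenvalues of H
delta = 1.0 / i ** 0.75                 # eta_0 - eta_* in the eigenbasis
kappa = -2.0                            # Rademacher inputs
gamma = 1.0 / np.sum(h)                 # a step-size below the maximal one

V = np.diag(2 * h - gamma * (kappa + 2) * h ** 2) - gamma * np.outer(h, h)
M = np.eye(d) - gamma * V               # iteration matrix on diagonal elements

diag_Theta = delta ** 2                 # diag(E[Theta_0])
for k in range(1, 1001):
    diag_Theta = M @ diag_Theta
    if k in [10, 100, 1000]:
        print(k, 0.5 * np.dot(h, diag_Theta))   # a_k = E[F(eta_k)] - F(eta_*)
```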
For simplicity, we consider here only the interpolation regime where (\sigma^2=0) (i.e., there is no noise on top of optimal predictions, see [1] for all details). Then, the z-transform is equal to \(A(z) = \sum\_{k=0}^{+\infty} a\_k z^k = \frac{1}{2}\langle \delta \circ \delta, ( (1-z) I \, + z \gamma V )^{-1} h \rangle.\) The matrix ((1-z) I \, + z \gamma V) that needs to be inverted has a “diagonal + rank-one” structure, and can thus be inverted using the matrix inversion lemma, leading to an explicit rational function in (z), that can be directly analyzed around (z = 1) (see [1]).
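Here is a sketch of this computation (same illustrative setting as the previous snippet): the matrix inversion lemma (Sherman–Morrison formula) gives (A(z)) in closed form, which can be checked against a direct linear solve:

```python
# z-transform A(z) = 0.5 * <delta*delta, ((1-z) I + z gamma V)^{-1} h> in the
# interpolation regime, inverted via the "diagonal + rank-one" structure of V.
import numpy as np

d = 1000
i = np.arange(1, d + 1, dtype=float)
h = 1.0 / i ** 2
delta = 1.0 / i ** 0.75
kappa, gamma = -2.0, 1.0 / np.sum(h)

D_V = 2 * h - gamma * (kappa + 2) * h ** 2            # diagonal part of V (rank-one part: -gamma h h^T)

for z in [0.5, 0.9, 0.99]:
    B = (1 - z) + z * gamma * D_V                     # diagonal of (1-z) I + z gamma Diag(D_V)
    c = z * gamma ** 2                                # the rank-one term is -c h h^T
    s = np.dot(h, h / B)
    A_sm = 0.5 * np.dot(delta ** 2, h / B) / (1 - c * s)   # Sherman-Morrison

    V = np.diag(D_V) - gamma * np.outer(h, h)
    A_direct = 0.5 * np.dot(delta ** 2,
                            np.linalg.solve((1 - z) * np.eye(d) + z * gamma * V, h))
    print(z, A_sm, A_direct)                          # the two should coincide
```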
Asymptotic equivalent. Following this post and the gradient descent analysis above, we consider (h_i = {L}/{i^\alpha}) and (\delta_i = {\Delta}/{i^{\beta/2}}). If we assume (\kappa = -2) (Rademacher case, which leads to the simplest formulas), the largest step-size is (\gamma = 2/ {\rm tr}(H)) (which can be significantly smaller than (2 / \lambda_{\max}(H)) when (\alpha) is close to 1), and we get the asymptotic equivalent \(a\_k \sim \frac{L \Delta^2}{2 \alpha} \frac{1}{1-\gamma {\rm tr}(H)/2 } \frac{\Gamma(\omega)}{(2 \gamma L k)^{\omega}},\) with (\omega= (\beta-1)/\alpha + 1), and under the condition (\omega \in (0, 2- 1/\alpha)) (see [1] for a result for all cases). Up to constants, this leads to the same equivalent as for gradient descent. See below for an illustration.
Comparison of asymptotic equivalent and performance of SGD in the interpolation regime for several replications for (\omega=1).
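For completeness, here is a Monte Carlo sketch (illustrative truncation, number of iterations, replications and step-size of my choosing) of single-pass SGD with Rademacher inputs in the interpolation regime, compared with the asymptotic equivalent stated above:

```python
# SGD (LMS) with Rademacher inputs, no noise (interpolation regime),
# compared to the asymptotic equivalent of a_k.
import numpy as np
from scipy.special import gamma as Gamma

alpha, beta = 2.0, 1.5
L, Delta = 1.0, 1.0
omega = (beta - 1) / alpha + 1

d, n_iter, n_rep = 5000, 2000, 10
i = np.arange(1, d + 1, dtype=float)
h = L / i ** alpha                         # eigenvalues of H
delta = Delta / i ** (beta / 2)            # eta_0 - eta_* in the eigenbasis
gamma = 1.0 / np.sum(h)                    # below the maximal step-size 2 / tr(H)

pred = (L * Delta ** 2 / (2 * alpha)) / (1 - gamma * np.sum(h) / 2) \
       * Gamma(omega) / (2 * gamma * L * n_iter) ** omega

rng = np.random.default_rng(0)
a_final = []
for _ in range(n_rep):
    theta = delta.copy()
    for _ in range(n_iter):
        x = np.sqrt(h) * (rng.integers(0, 2, size=d) * 2.0 - 1.0)
        theta -= gamma * np.dot(x, theta) * x    # y_k = <x_k, eta_*>, so the residual is <x_k, theta>
    a_final.append(0.5 * np.dot(h, theta ** 2))
print(np.mean(a_final), pred)                    # same order of magnitude for large n_iter
```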
We recover the results from [22] with a simpler and more direct proof, thereby obtaining asymptotic results that complement the non-asymptotic results from [23]. This is also a subcase of results from [24] with a simpler proof. Various extensions are possible, e.g., using averaging of iterates, or computing estimates of the variance of the performance (and not only its expectation); see [1] for details.
Conclusion
In this blog post, I have shown how the z-transform can be used to derive asymptotic equivalents of sequences that naturally appear in the optimization of quadratic forms, with classical tools such as acceleration and stochastic approximation. In all cases, the z-transform method allows simple computations of asymptotic equivalents (e.g., scaling laws) through the expansion of rational functions.
Since quadratic functions are at the heart of optimization, many recent developments can be considered as candidates for the z-transform method, such as Richardson extrapolation, non-linear iterations for generic convex functions, time-varying coefficients beyond rational functions, mixing momentum techniques and SGD [26], primal-dual algorithms, variance-reduced methods, or sampling techniques. Many situations where the z-transform can prove useful!
Where does the “z” come from? For those questioning the naming of the z-transform, the Wikipedia page has some interesting historical notes. The idea of using generating functions to study sequences dates back at least to de Moivre around 1730, while its first use in discrete engineering systems is due to Witold Hurewicz in 1947, using the letter “z” to make it different from the letter “s” used for the Laplace transform (its continuous-time counterpart). The name z-transform was later coined in 1952 by John R. Ragazzini and Lotfi Zadeh. As confirmed by a personal communication from Lotfi Zadeh to Murat Arcak, it has nothing to do with the first letter of Lotfi Zadeh’s name (“I used the letter z not because of my name but because in the mathematical literature on difference equations, z was the symbol that was used”). However, I didn’t find any formal confirmation that Zorro had nothing to do with it.
Acknowledgements. I thank Adrien Taylor, Ayoub Melliti, Nicolas Flammarion, Baptiste Goujaud, and Raphaël Berthier for insightful discussions related to this work.
References
[1] Francis Bach. On the effectiveness of the z-transform method in quadratic optimization. Technical report, arXiv:2507.03404, 2025.
[2] Godfrey H. Hardy and John E. Littlewood. Abel’s theorem and its converse. Proceedings of the London Mathematical Society, 2(1):205–235, 1920.
[3] Jacob Korevaar. Tauberian Theory: A Century of Developments. Springer, 2004.
[4] Philippe Flajolet and Andrew Odlyzko. Singularity analysis of generating functions. SIAM Journal on Discrete Mathematics, 3(2):216–240, 1990.
[5] Edward A. Bender. Asymptotic methods in enumeration. SIAM Review, 16(4):485–515, 1974.
[6] Eliahu Ibrahim Jury. Theory and Application of the z-Transform Method. Robert E. Krieger Publishing Company, 1964.
[7] Herbert S. Wilf. Generatingfunctionology. CRC Press, 2005.
[8] Raphaël Berthier, Francis Bach, and Pierre Gaillard. Accelerated gossip in networks of given dimension using Jacobi polynomial iterations. SIAM Journal on Mathematics of Data Science, 2(1):24–47, 2020.
[9] Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. Conference on Learning Theory, 2015.
[10] David Vernon Widder. Laplace Transform. Princeton University Press, 1942.
[11] Yurii Nesterov. A method for solving a convex programming problem with rate of convergence (O(1/k^2)). Soviet Mathematics. Doklady, 269(3):543–547, 1983.
[12] Antonin Chambolle and Charles Dossal. On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. Journal of Optimization Theory and Applications, 166:968–982, 2015.
[13] Jean-François Aujol, Charles Dossal, Hippolyte Labarrière, and Aude Rondepierre. Strong convergence of FISTA iterates under Hölderian and quadratic growth conditions. Technical Report 2407.17063, arXiv, 2024.
[14] Neil Bershad. Analysis of the normalized LMS algorithm with Gaussian inputs. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(4):793–806, 1986.
[15] Arie Feuer and Ehud Weinstein. Convergence analysis of LMS filters with uncorrelated Gaussian data. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(1):222–230, 1985.
[16] Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate (O(1/n)). Advances in Neural Information Processing Systems, 2013.
[17] Alexandre Défossez and Francis Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. International Conference on Artificial Intelligence and Statistics, 2015.
[18] Dirk T. M. Slock. On the convergence behavior of the LMS and the normalized LMS algorithms. IEEE Transactions on Signal Processing, 41(9):2811–2825, 1993.
[19] Frederik Kunstner and Francis Bach. Scaling laws for gradient descent and sign descent for linear bigram models under Zipf’s law. Technical report, arXiv:2505.19227, 2025.
[20] Larry L. Horowitz and Kenneth D. Senne. Performance advantage of complex LMS for controlling narrow-band adaptive arrays. IEEE Transactions on Circuits and Systems, 28(6):562–576, 1981.
[21] Gene H. Golub. Some modified matrix eigenvalue problems. SIAM Review, 15(2):318–334, 1973.
[22] Maksim Velikanov, Denis Kuznedelev, and Dmitry Yarotsky. A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta. International Conference on Learning Representations, 2023.
[23] Raphaël Berthier, Francis Bach, and Pierre Gaillard. Tight nonparametric convergence rates for stochastic gradient descent under the noiseless linear model. Advances in Neural Information Processing Systems, 2020.
[24] Elliot Paquette, Courtney Paquette, Lechao Xiao, and Jeffrey Pennington. 4+3 phases of compute-optimal neural scaling laws. Advances in Neural Information Processing Systems, 2024.
[25] Aditya Varre and Nicolas Flammarion. Accelerated SGD for non-strongly-convex least squares. In Conference on Learning Theory, 2022.
[26] Damien Ferbach, Katie Everett, Gauthier Gidel, Elliot Paquette, and Courtney Paquette. Dimension-adapted momentum outscales SGD. Technical Report 2505.16098, arXiv, 2025.
[27] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Journal of Machine Learning Research, 18(101):1–51, 2017.
By Francis Bach