The Learning Theory Alliance Blog: Testing Assumptions of Learning Algorithms
Today’s technical post is by Arsen Vasilyan. It focuses on the exciting new framework of “testable learning,” which he introduced with Rubinfeld in a 2023 paper. There has been a flurry of work since then, so this is a good chance to catch up in case you’re behind!
1. The Goal: Learning with Noisy Labels
1.1 Introduction
Many learning algorithms simply stop working when some of the training examples are mislabeled. For example, suppose we are using a linear program to learn a linear classifier. Then, a single mislabeled data-point can make the linear program infeasible, and the algorithm fails to output a solution. Today, we will discuss how to design learning algorithms that are provably robust to such noise.
For noise-resistant learning algorithms, a key consideration emerges: what can these algorithms assume about how the data-points are distributed? We will discuss three theoretical frameworks addressing this question:
- Framework 1 — the distribution-free framework — makes no assumption on how data-points are generated. For every distribution over datapoints, the algorithm has to run efficiently and be robust to noise. This framework is very well-understood, but is computationally intractable. Even the simplest concept classes lead to computationally hard learning tasks in this framework.
- Framework 2 — the distribution-specific framework — assumes that data-points come from some well-behaved distribution, such as the Gaussian distribution. Over the last 30 years, the distribution-specific framework has led to a wealth of provably efficient algorithms. However, these algorithms fail to be noise-resistant when their assumptions do not hold.
- Framework 3 — the testable framework — relies on an assumption on the data-point distribution only for the run-time of an algorithm, but not for the algorithm’s robustness to noise. Even if data-points do not come from some well-behaved distribution, the algorithm should be robust to noise. However, in this case, the algorithm might take a long time to run.
This blog post will mostly focus on the third framework, which was recently introduced in [RV ’23] in the context of learning with noise, and further explored in [GKK ’23, DKKLZ ’23, GKSV ’23, GKSV ’24, GSSV ’24, DKLZ’24, STW ’24, GKSV ’25]. In brief, although Framework 3 is more stringent than Framework 2, there is a toolkit of techniques that we can use to upgrade many Framework 2 algorithms into Framework 3.
1.2 FAQ on the Testable Framework (Framework 3)
We get a lot of questions about Framework 3, and so we begin with the following FAQs:
- Q: Why is the framework called testable?

  A: All current algorithms in Framework 3 follow the following general recipe for upgrading Framework 2 algorithms into Framework 3 — the procedure suggests the name testable learning. See Section 2.2 for more about this recipe.
  - a) Run a Framework 2 algorithm and obtain a hypothesis $h$.
  - b) Run a tester that is guaranteed to output “accept” whenever the assumption holds.
  - c) If the tester accepted, return the hypothesis $h$.
  - d) Otherwise, the assumption must have been violated, so we can run a computationally expensive algorithm (see Section 1.3 for more about those).
- Q: To implement this tester, can’t we just check that the distribution is “close” to a Gaussian using tools from distribution testing?

  A: For many notions of “closeness” this is statistically impossible: for example, one cannot test whether an unknown distribution is Gaussian or far from Gaussian in total variation distance.1 For other notions of closeness, such as the earth-mover distance, these tools require a number of samples and a run-time that are exponential in the dimension. In learning theory, we are looking for much faster run-times (see [RV ’23] for more discussion).
- Q: What if the assumption is violated only slightly? I am concerned that this will cause a Framework 3 algorithm to run slowly as a result.

  A: Many algorithms in Framework 3 can be modified to run fast whenever the assumption is only approximately satisfied in total variation distance [GKSV ’25]. This setting is called Tolerant Testable Learning, and is discussed more in Section 3.2.
- Q: The algorithm is guaranteed to run fast for only one specific distribution. What if I want it to run fast for an entire family of distributions?

  A: This setting, called universal testable learning, has also been studied. [GKSV ’23] studies it for the family of log-concave distributions, and it turns out to be intimately connected with Sum-of-Squares relaxations [KS17].
- Q: Framework 3 is defined to work with learning under label noise (a.k.a. agnostic learning). What about (noise-free) PAC learning?

  A: Frameworks 2 and 3 are essentially equivalent in the noise-free setting. In this setting, one can always tell whether the learning algorithm successfully produced a good hypothesis: just estimate its prediction error by drawing a fresh set of examples. If the prediction error is large, this must be because the assumption is violated, so Framework 3 does not require us to run fast. In this case, we can produce a classifier by running a computationally expensive algorithm (see Section 1.3 for more about those).
- Q: What is the sample complexity of this framework? Is it related to VC dimension?

  A: [GKK ’23] have shown that the sample complexity of this model is characterized by the Rademacher complexity of the hypothesis class. However, in this blog post we will focus on the run-time rather than the sample complexity.
We will now introduce the three frameworks in more detail, survey more of the known results and highlight general ideas for designing such algorithms.
1.3 Three Frameworks for Learning with Noisy Labels
Let us be more formal now. $\mathcal{C}$ will denote our hypothesis class, containing binary-valued functions over the domain $\mathbb{R}^{d}$. We get independent examples $x_{1},\dots,x_{N}$ from a distribution $D_{\text{Examples}}$, and each $x_{i}$ is labeled by some unknown binary-valued function2 $f_{\text{noisy labels}}$. To model arbitrary label noise, we place no assumption on $f_{\text{noisy labels}}$, and our goal is to get as close as possible to the smallest error among all functions $f$ from the class $\mathcal{C}$:
$$\text{opt}_{\mathcal{C}}:=\min_{f\in\mathcal{C}}\underbrace{\Pr_{x\sim D_{\text{Examples}}}[f(x)\neq f_{\text{noisy labels}}(x)]}_{\text{Prediction error of }f}.$$
For instance, when $\mathcal{C}$ is the class of linear classifiers, $\text{opt}_{\mathcal{C}}$ denotes the smallest classification error achievable by any linear classifier.
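To make the definition concrete, here is a small numerical sketch (all names and parameters are illustrative): we estimate $\text{opt}_{\mathcal{C}}$ for linear classifiers by brute force over a finite pool of candidate directions, which only upper-bounds the true minimum over the infinite class.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 4000

# Ground-truth linear classifier; then ~5% of labels are flipped,
# so f_noisy_labels disagrees with the best linear classifier on ~5% of points.
w_true = rng.standard_normal(d)
X = rng.standard_normal((N, d))
y = np.sign(X @ w_true)
y[rng.random(N) < 0.05] *= -1

def prediction_error(w):
    """Empirical version of Pr_{x ~ D_Examples}[sign(w.x) != f_noisy_labels(x)]."""
    return float(np.mean(np.sign(X @ w) != y))

# Brute-force proxy for opt_C: minimize over a finite pool of candidates
# (the true opt_C minimizes over ALL linear classifiers, so this only upper-bounds it).
candidates = [w_true] + [rng.standard_normal(d) for _ in range(200)]
opt_estimate = min(prediction_error(w) for w in candidates)
assert opt_estimate <= 0.07  # w_true itself errs only on the ~5% flipped points
```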
Framework 1 [Distribution-free Framework] For any distribution $D_{\text{Examples}}$ and function $f_{\text{noisy labels}}$, the learning algorithm should run in time $T$ and with high probability3 output a classifier $h$ with
$$\overbrace{\Pr_{x\sim D_{\text{Examples}}}[h(x)\neq f_{\text{noisy labels}}(x)]}^{\text{Prediction error of }h}\leq\underbrace{\overbrace{\text{opt}_{\mathcal{C}}}^{\text{Optimal prediction error in }\mathcal{C}}+\varepsilon}_{\text{Weaker guarantees, such as }O(\text{opt}_{\mathcal{C}})+\varepsilon\text{, also considered.}}$$
This task, also known as agnostic learning, can be solved sample-efficiently. According to a central theorem of VC theory, $O(\text{VC}(\mathcal{C})/\varepsilon^{2})$ samples suffice (where $\text{VC}(\mathcal{C})$ is the VC dimension of $\mathcal{C}$). For example, if $\mathcal{C}$ is the class of linear classifiers, then $\text{VC}(\mathcal{C})=d+1$, so $O(d/\varepsilon^{2})$ samples are enough.
However, when considering the run-time, Framework 1 is revealed to be intractable. For example, no $\text{poly}(d/\varepsilon)$-time algorithm is known in Framework 1 when $\mathcal{C}$ is taken to be the class of linear classifiers (which is arguably the most basic hypothesis class). Furthermore, even the task of deciding whether $\text{opt}_{\mathcal{C}}\leq\varepsilon$ or $\text{opt}_{\mathcal{C}}\geq\frac{1}{2}-\varepsilon$ is known to be NP-hard [GR ’06, FGKP ’06] for linear classifiers and is believed to require super-polynomial time. In summary, these Framework 1 tasks take only $\text{poly}(d/\varepsilon)$ samples, but the run-time required is believed to be super-polynomial in $d$.
These intractability results hold for certain carefully-constructed worst-case choices of $D_{\text{Examples}}$. What happens if $D_{\text{Examples}}$ is some specific commonly-occurring probability distribution? For instance, we may want to find a good linear classifier when the data-points come from a Gaussian distribution or some other well-behaved distribution $D_{\text{Assumption}}$. This has motivated research in the following framework:
Framework 2 [Distribution-Specific Framework] For any distribution $D_{\text{Examples}}$ and function $f_{\text{noisy labels}}$, if $D_{\text{Examples}}=D_{\text{Assumption}}$, then the learning algorithm should run in time $T$ and with high probability output a classifier $h$ with
$$\overbrace{\Pr_{x\sim D_{\text{Examples}}}[h(x)\neq f_{\text{noisy labels}}(x)]}^{\text{Prediction error of }h}\leq\underbrace{\overbrace{\text{opt}_{\mathcal{C}}}^{\text{Optimal prediction error in }\mathcal{C}}+\varepsilon}_{\text{Weaker guarantees, such as }O(\text{opt}_{\mathcal{C}})+\varepsilon\text{, also considered.}}$$
This framework is also often referred to as distribution-specific agnostic learning. For many natural choices of the distribution $D_{\text{Assumption}}$, the aforementioned hardness results for Framework 1 can be sidestepped in Framework 2. When the class $\mathcal{C}$ is the class of linear classifiers and the distribution $D_{\text{Assumption}}$ is a Gaussian distribution, the algorithm of [KKMS ’08, DGJSV ’10] can achieve an error of $\text{opt}_{\mathcal{C}}+\varepsilon$ with run-time $d^{\tilde{O}(1/\varepsilon^{2})}$, which is a lot better than the super-polynomial run-time in Framework 1. This run-time is believed to be optimal [DKZ ’20, GGK ’20, DKPZ ’21, DKR ’23], but can be sped up to $\text{poly}(d/\varepsilon)$ if one is happy with a slightly larger error4 of $O(\text{opt}_{\mathcal{C}})+\varepsilon$ [ABL ’14, DKTZ ’22].
Yet, Framework 2 relies on the potentially unverifiable assumption that the distribution $D_{\text{Examples}}$ equals $D_{\text{Assumption}}$. Note that it is impossible to verify whether $D_{\text{Examples}}$ is exactly equal to the Gaussian distribution. Unfortunately, when $D_{\text{Examples}}\neq D_{\text{Assumption}}$, the $\text{opt}_{\mathcal{C}}+\varepsilon$ error guarantee of an algorithm in Framework 2 becomes null and void. Remember how a linear-programming approach can completely fail to find a linear classifier given even a single mislabeled example? In the same way, Framework 2 also permits an algorithm to fail completely when $D_{\text{Examples}}\neq D_{\text{Assumption}}$.
This limitation is addressed by the following more challenging framework, in which the algorithm must always output a near-optimal classifier (with high probability, as in Framework 1) but must run quickly only when the assumption holds (as in Framework 2):
Framework 3 [Testable Framework] For any distribution $D_{\text{Examples}}$ and function $f_{\text{noisy labels}}$, the learning algorithm should with high probability output a classifier $h$ with
$$\overbrace{\Pr_{x\sim D_{\text{Examples}}}[h(x)\neq f_{\text{noisy labels}}(x)]}^{\text{Prediction error of }h}\leq\underbrace{\overbrace{\text{opt}_{\mathcal{C}}}^{\text{Optimal prediction error in }\mathcal{C}}+\varepsilon}_{\text{Weaker guarantees, such as }O(\text{opt}_{\mathcal{C}})+\varepsilon\text{, also considered.}}\ \ \ \ \ (1)$$
Furthermore, if $D_{\text{Examples}}=D_{\text{Assumption}}$, then with high probability the learning algorithm should run5 in time $T$.
This framework is also often referred to as testable learning (for reasons explained in Section 2.2). When a Framework 3 algorithm produces a classifier $h$, the user can be confident that $h$ is nearly-optimal (Equation 1) whether or not $D_{\text{Examples}}=D_{\text{Assumption}}$, but the fast run-time is only guaranteed when $D_{\text{Examples}}=D_{\text{Assumption}}$.
Even though Framework 3 requires more from the learning algorithm than Framework 2, a growing body of research demonstrates that Framework 3 is often as computationally tractable as Framework 2. This research direction was initiated in [RV ’23] which mainly considered linear classifiers under the Gaussian distribution, and was subsequently extended to many more hypothesis classes and distributions [GKK ’23, DKKLZ ’23, GKSV ’23, GKSV ’24, GSSV ’24, DKLZ’24, STW ’24, GKSV ’25]. In this blog post, we will survey this research direction and explore some of the techniques used to design such algorithms.
2. How to Modify Framework 2 Algorithms to Work in Framework 3.
In this section we will discuss a general recipe for converting Framework 2 algorithms into Framework 3 by augmenting them with an appropriate tester. As a case study, we will focus on one of the most widely-studied Framework 2 algorithms: the Low-Degree Algorithm. We will see how it can be upgraded into Framework 3 using the Moment-Matching Tester. Later, in Section 3, we will see how a similar recipe can be used for other algorithms as well.
2.1 What is an Example of a Framework 2 Algorithm? The Low-Degree Algorithm.
We begin by examining the following Framework 2 algorithm.
Degree-$k$ Low-Degree Algorithm [KKMS ’08]:
- Given: access to independent Gaussian examples labeled by an unknown function $f_{\text{noisy labels}}$, and a parameter $k$.
- Output: hypothesis $h$.
- 1. $S\leftarrow\{N$ Gaussian examples labeled by the function $f_{\text{noisy labels}}\}$.
- 2. Compute
$$p^{*}\leftarrow\underset{\text{degree-}k\text{ polynomial }p}{\text{argmin}}\ \mathbb{E}_{(x_{i},f_{\text{noisy labels}}(x_{i}))\sim S}\left[\left|p(x_{i})-f_{\text{noisy labels}}(x_{i})\right|\right]$$
by solving a linear program.
- 3. Output the hypothesis $h$ mapping $x$ in $\mathbb{R}^{d}$ to $\text{sign}(p^{*}(x))$.
It can be shown [KKMS ’08, DGJSV ’10] that, for $k=\tilde{O}(1/\varepsilon^{2})$, the resulting classifier $h$ achieves prediction error $O(\text{opt}_{\mathcal{C}})+\varepsilon$ when $\mathcal{C}$ is the class of linear classifiers. Furthermore, with a slightly more sophisticated version of step 3, the prediction error improves to $\text{opt}_{\mathcal{C}}+\varepsilon$ [KKMS ’08, DGJSV ’10]. The run-time is $d^{O(k)}$.
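As an illustration, here is a minimal sketch of the Low-Degree Algorithm in Python. For simplicity it substitutes least-squares regression for the L1-minimizing linear program of [KKMS ’08], so it is only a rough stand-in for the actual algorithm; the helper names and parameters are our own.

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_features(X, k):
    """All monomials of total degree <= k in the columns of X (constant term included)."""
    N, d = X.shape
    cols = [np.ones(N)]
    for deg in range(1, k + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

def low_degree_algorithm(X, y, k):
    """Fit a degree-k polynomial p* to the +/-1 labels and return h(x) = sign(p*(x)).
    NOTE: we use least squares here instead of the L1 linear program of [KKMS '08]."""
    coeffs, *_ = np.linalg.lstsq(monomial_features(X, k), y, rcond=None)
    return lambda Xnew: np.sign(monomial_features(Xnew, k) @ coeffs)

# Sanity check: Gaussian examples labeled by a halfspace, with 5% random label noise.
rng = np.random.default_rng(1)
d, N = 4, 5000
w_star = rng.standard_normal(d)
X = rng.standard_normal((N, d))
y = np.sign(X @ w_star)
y[rng.random(N) < 0.05] *= -1

h = low_degree_algorithm(X, y, k=2)
Xtest = rng.standard_normal((2000, d))
test_error = float(np.mean(h(Xtest) != np.sign(Xtest @ w_star)))
assert test_error < 0.15  # should land near opt ~ 0.05 on clean test labels
```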
This approach also yields a classifier with error $\text{opt}_{\mathcal{C}}+\varepsilon$ for many other hypothesis classes $\mathcal{C}$, such as ANDs of linear classifiers or indicators of convex sets in $\mathbb{R}^{d}$ [KOS ’08]. The same approach can even be used when the class $\mathcal{C}$ is constant-depth circuits and the distribution $D_{\text{Assumption}}$ is uniform over binary bit-strings [KKMS ’08]. To achieve all these results, we only need to run the algorithm above with a larger value of the degree parameter $k$. The run-time $d^{O(k)}$ will increase, but as long as we insist on an error bound of $\text{opt}_{\mathcal{C}}+\varepsilon$ in Framework 2, this approach is known to be optimal for many hypothesis classes $\mathcal{C}$ [DFTWW ’15, DKZ ’20, GGK ’20, DKPZ ’21, DKR ’23]. (Additionally, in order to achieve error $\text{opt}_{\mathcal{C}}+\varepsilon$ rather than $O(\text{opt}_{\mathcal{C}})+\varepsilon$, one needs to replace step 3 with a slightly more sophisticated version of this step [KKMS ’08]. From now on, we will tacitly assume that this improved version of step 3 is used.)
2.2 Converting the Low-Degree Algorithm into Framework 3 via the Moment-Matching Test.
For a wide range of concept classes $\mathcal{C}$, you can get Framework 3 algorithms by combining the Low-Degree Algorithm with a suitable tester. Here is a general recipe for extending Framework 2 algorithms so they satisfy Framework 3:
- 1. Run6 an algorithm $\mathcal{A}$ for the class $\mathcal{C}$ in Framework 2 to get a classifier $h$.
- 2. Run some algorithm $\mathcal{T}$, which we will call the tester, that accesses $D_{\text{Examples}}$ and outputs Accept or Reject.
- 3. If $\mathcal{T}$ accepts, then return the classifier $h$ from step (1).
- 4. If $\mathcal{T}$ rejects, then run a (potentially very slow) algorithm for the class $\mathcal{C}$ in Framework 1. (See Section 1.3 for more about those.)
Let us compare Framework 2 and Framework 3 to see what specifications the tester needs to meet so that the procedure above satisfies Framework 3. It would suffice if the tester $\mathcal{T}$ accepted whenever $D_{\text{Examples}}=D_{\text{Assumption}}$ and rejected whenever $D_{\text{Examples}}\neq D_{\text{Assumption}}$. Unfortunately, when $D_{\text{Assumption}}$ is, for example, the Gaussian distribution, there is provably no tester $\mathcal{T}$ that can distinguish whether $D_{\text{Examples}}=D_{\text{Assumption}}$ or $D_{\text{Examples}}\neq D_{\text{Assumption}}$.
However, we can get away with less stringent requirements on $\mathcal{T}$ and still achieve our goal:

- Completeness: whenever $D_{\text{Examples}}=D_{\text{Assumption}}$, the tester $\mathcal{T}$ accepts with high probability.
- Soundness: for every distribution $D_{\text{Examples}}$, either the tester $\mathcal{T}$ will with high probability reject $D_{\text{Examples}}$, or $D_{\text{Examples}}$ is such that the algorithm $\mathcal{A}$ will with high probability give a hypothesis $h$ satisfying the $\varepsilon$-optimality guarantee
$$\overbrace{\Pr_{x\sim D_{\text{Examples}}}[h(x)\neq f_{\text{noisy labels}}(x)]}^{\text{Prediction error of }h}\leq\overbrace{\text{opt}_{\mathcal{C}}}^{\text{Optimal prediction error in }\mathcal{C}}+\varepsilon.$$
- Fast Run Time: the tester $\mathcal{T}$ runs in time at most $T$.
The point of the Soundness condition is to permit the tester to accept a distribution $D_{\text{Examples}}$ even if $D_{\text{Examples}}\neq D_{\text{Assumption}}$, as long as $D_{\text{Examples}}$ is still “good enough” for the algorithm $\mathcal{A}$.
Let us come back to the recipe above, taking $\mathcal{A}$ to be the Low-Degree Algorithm. The recipe can be executed using the following algorithm7 as the tester $\mathcal{T}$:
Degree-$k$ Moment-Matching Tester [RV ’23, GKK ’23]:
- Given: access to independent examples from $D_{\text{Examples}}$, a reference distribution $D_{\text{Assumption}}$, a parameter $k$.
- Output: Accept or Reject.
- 1. $S\leftarrow\{N$ examples from $D_{\text{Examples}}\}$.
- 2. For all monomials $m$ of total degree at most $k$: if $\left|\mathbb{E}_{x\sim S}[m(x)]-\mathbb{E}_{x\sim D_{\text{Assumption}}}[m(x)]\right|$ exceeds a sufficiently small slack threshold, output Reject and terminate.
(Note: for many distributions of interest, such as the Gaussian distribution, the quantity $\mathbb{E}_{x\sim D_{\text{Assumption}}}[m(x)]$ can be computed directly. One can also draw samples from $D_{\text{Assumption}}$ and use them to estimate it.)
- 3. If this step is reached, output Accept.
Since in $d$ dimensions there are at most $d^{O(k)}$ monomials of degree at most $k$, the Moment-Matching Tester above runs in time $N\cdot d^{O(k)}$.
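Here is a small self-contained sketch of the Moment-Matching Tester for $D_{\text{Assumption}}=N(0,I_{d})$, whose monomial moments can be computed exactly via double factorials. The slack threshold below is an arbitrary illustrative constant, not the calibrated value from the analysis:

```python
import numpy as np
from itertools import combinations_with_replacement

def gaussian_monomial_moment(exponents):
    """E[prod_j x_j^{a_j}] for x ~ N(0, I): product of 1d moments, (a-1)!! if a even, 0 if odd."""
    total = 1.0
    for a in exponents:
        if a % 2 == 1:
            return 0.0
        total *= np.prod(np.arange(1, a, 2))  # double factorial (a-1)!! = 1*3*...*(a-1)
    return total

def moment_matching_tester(S, k, slack=0.3):
    """Reject if any empirical monomial moment of degree <= k deviates from its
    exact N(0, I) value by more than `slack` (an illustrative, uncalibrated threshold)."""
    N, d = S.shape
    for deg in range(1, k + 1):
        for idx in combinations_with_replacement(range(d), deg):
            emp = np.mean(np.prod(S[:, list(idx)], axis=1))
            exponents = [idx.count(j) for j in set(idx)]
            if abs(emp - gaussian_monomial_moment(exponents)) > slack:
                return "Reject"
    return "Accept"

rng = np.random.default_rng(2)
S_gauss = rng.standard_normal((20000, 2))
S_shift = S_gauss + 1.0  # mean-shifted: its degree-1 moments are ~1, not ~0
assert moment_matching_tester(S_gauss, k=4) == "Accept"
assert moment_matching_tester(S_shift, k=4) == "Reject"
```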
Overall, using the recipe above to combine the degree-$k$ Low-Degree Algorithm with the degree-$k$ Moment-Matching Tester yields Framework 3 algorithms for many concept classes, including linear classifiers, ANDs of linear classifiers, and constant-depth circuits. As a concrete example, when $\mathcal{C}$ is the class of linear classifiers, the resulting algorithm in Framework 3 has a run-time bound of $d^{\tilde{O}(1/\varepsilon^{2})}$. This run-time matches that of the best algorithm in Framework 2 (believed to be optimal).
The proof of correctness for these algorithms in Framework 3 is far from automatic. As explained in Section 2.3, the notion of sandwiching polynomials from the field of pseudorandomness turns out to be very important for the proof of correctness [GKK ’23].
Remark: many papers referenced here phrase their main result as a tester that satisfies the three aforementioned conditions of completeness, soundness and fast run-time (together with a Framework 2 algorithm $\mathcal{A}$). Note that, as described above and also noted in [RV ’23], having such a tester gives8 an algorithm in Framework 3 (in fact, it is equivalent to having an algorithm in Framework 3 [RV ’23]).
2.3 Analysis of the Moment-Matching Tester through Sandwiching Polynomials.
Soon, we will discuss more Framework 3 algorithms, but let us first discuss the proof of correctness for the Moment-Matching Tester. Recall that it needs to satisfy three conditions: Completeness, Soundness and Fast Run Time. Among these three, the Soundness condition is the most challenging one to prove. [GKK ’23] give a principled way of doing this, which we will briefly sketch now. A key step is to show that every function $f$ in the class $\mathcal{C}$ has $\varepsilon$-sandwiching polynomials of degree $k$ under $D_{\text{Assumption}}$, i.e. a pair of polynomials $p_{\text{up}}$ and $p_{\text{down}}$ satisfying:

- Sandwiching: for every point $x$ in $\mathbb{R}^{d}$, we have $p_{\text{down}}(x)\leq f(x)\leq p_{\text{up}}(x)$.
- $\varepsilon$-Closeness: we have $\mathbb{E}_{x\sim D_{\text{Assumption}}}[p_{\text{up}}(x)-p_{\text{down}}(x)]\leq\varepsilon$.
For example, linear classifiers have $\varepsilon$-sandwiching polynomials of degree $\tilde{O}(1/\varepsilon^{2})$ under the Gaussian distribution [DGJSV ’10]. This notion was first studied in the field of pseudorandomness, because the existence of sandwiching polynomials for $f$ can be leveraged to approximate the expectation $\mathbb{E}_{x\sim D_{\text{Assumption}}}[f(x)]$ while only making deterministic queries to $f$.
It is shown in [GKK ’23] that whenever such a pair of polynomials exists for every $f$ in the hypothesis class $\mathcal{C}$, the Soundness condition holds for the Moment-Matching Tester. Let us briefly sketch the argument, denoting by $f^{*}$ the function in $\mathcal{C}$ with the smallest prediction error $\text{opt}_{\mathcal{C}}$. By assumption, there is a pair of sandwiching polynomials $p_{\text{up}}$ and $p_{\text{down}}$ for $f^{*}$.

If the distribution $D_{\text{Examples}}$ is such that the Moment-Matching Tester is likely to pass, then for every monomial $m$ of degree at most $k$ we have $\mathbb{E}_{x\sim D_{\text{Examples}}}[m(x)]\approx\mathbb{E}_{x\sim D_{\text{Assumption}}}[m(x)]$. Therefore,
$$\mathbb{E}_{x\sim D_{\text{Examples}}}\left[\left|f^{*}(x)-p_{\text{down}}(x)\right|\right]\leq\mathbb{E}_{x\sim D_{\text{Examples}}}[p_{\text{up}}(x)-p_{\text{down}}(x)]\approx\mathbb{E}_{x\sim D_{\text{Assumption}}}[p_{\text{up}}(x)-p_{\text{down}}(x)]\leq\varepsilon,$$
where the first step uses the sandwiching property, and the second step breaks $p_{\text{up}}-p_{\text{down}}$ into monomials and then applies9 $\mathbb{E}_{x\sim D_{\text{Examples}}}[m(x)]\approx\mathbb{E}_{x\sim D_{\text{Assumption}}}[m(x)]$ to each monomial $m$. The last step above uses the $\varepsilon$-Closeness property.
Finally, the upper bound on $\mathbb{E}_{x\sim D_{\text{Examples}}}\left[\left|f^{*}(x)-p_{\text{down}}(x)\right|\right]$ tells us that the Low-Degree Algorithm finds a polynomial $p^{*}$ satisfying
$$\frac{1}{2}\mathbb{E}_{x\sim D_{\text{Examples}}}\left[\left|f_{\text{noisy labels}}(x)-p^{*}(x)\right|\right]\lesssim\frac{1}{2}\mathbb{E}_{x\sim D_{\text{Examples}}}\left[\left|f_{\text{noisy labels}}(x)-p_{\text{down}}(x)\right|\right]\lesssim\underbrace{\frac{1}{2}\mathbb{E}_{x\sim D_{\text{Examples}}}\left[\left|f_{\text{noisy labels}}(x)-f^{*}(x)\right|\right]}_{=\text{opt}_{\mathcal{C}}\text{ because }f^{*}\text{ is the best classifier in }\mathcal{C}}+\varepsilon.$$
3. Going Beyond the Low-Degree Algorithm.
To recap: in Framework 2, the Low-Degree Algorithm works for many hypothesis classes $\mathcal{C}$, giving a classifier with error $\text{opt}_{\mathcal{C}}+\varepsilon$. In Framework 3, this algorithm can be used together with the Moment-Matching Tester to get computationally efficient algorithms. However, this approach has a number of limitations:

- The run-time of the Low-Degree Algorithm tends to be exponential in $1/\varepsilon$. For example, to get error $\text{opt}_{\mathcal{C}}+\varepsilon$ for linear classifiers, the Low-Degree Algorithm needs to run in time $d^{\tilde{O}(1/\varepsilon^{2})}$.
- The algorithms are improper, which means the hypothesis $h$ they give is not itself in the class $\mathcal{C}$. To be concrete, compare the following two tasks:
  - (a) We get a labeled data-set $S$, and we want to fit approximately the best linear classifier to $S$.
  - (b) We get a labeled data-set $S$, and we want to fit a classifier to $S$ of the form $\text{sign}(p(x))$, where $p$ is some polynomial; this classifier should be approximately as good as the best linear classifier on $S$.

  The Low-Degree Algorithm solves task (b), not task (a). However, most people would agree that task (a) is far more natural.
- The Moment-Matching Tester is guaranteed to accept when the distribution $D_{\text{Examples}}$ equals a specific distribution $D_{\text{Assumption}}$, but might still reject when $D_{\text{Examples}}$ is merely very similar to $D_{\text{Assumption}}$. Formally, one can construct a distribution $D_{\text{Examples}}$ that will be rejected by the degree-$k$ Moment-Matching Tester, even though $D_{\text{Examples}}$ is very close to $D_{\text{Assumption}}$ in total variation distance.
We will now discuss subsequent work that focuses on removing these limitations.
3.1 The Matter of Efficiency: Polynomial-Time Algorithms in Framework 3.
3.1.1 Learning Beyond the Low-Degree Algorithm.
In contrast with the Low-Degree Algorithm, some algorithms in Framework 2 are tailor-made for specific classes $\mathcal{C}$ and have run-times $\text{poly}(d/\varepsilon)$, at the price of looser error guarantees such as $O(\text{opt}_{\mathcal{C}})+\varepsilon$ or $\tilde{O}(\text{opt}_{\mathcal{C}})+\varepsilon$. For example, consider the following algorithm:
Averaging Algorithm [KKMS ’08]:
- Given: access to independent standard Gaussian examples $x_{1},\dots,x_{N}$, labeled by an unknown function $f_{\text{noisy labels}}$.
- Output: hypothesis $h$.
- 1. $S\leftarrow\{N$ Gaussian examples $x_{i}$ labeled by $f_{\text{noisy labels}}\}$.
- 2. $w\leftarrow\frac{1}{N}\sum_{i=1}^{N}f_{\text{noisy labels}}(x_{i})\,x_{i}$.
- 3. Output $h$ mapping $x$ in $\mathbb{R}^{d}$ to $\text{sign}(w\cdot x)$.
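The three steps translate directly into code. The sketch below uses random (rather than adversarial) label noise purely for illustration, and checks that $w$ ends up close in angle to $w^{*}$:

```python
import numpy as np

def averaging_algorithm(X, y):
    """Steps 1-3: w is the label-weighted average of the examples,
    and the hypothesis maps x to sign(w . x)."""
    w = (y[:, None] * X).mean(axis=0)
    return w, (lambda Xnew: np.sign(Xnew @ w))

rng = np.random.default_rng(3)
d, N = 10, 20000
w_star = np.zeros(d)
w_star[0] = 1.0                # optimal origin-centered classifier sign(w* . x)
X = rng.standard_normal((N, d))
y = np.sign(X @ w_star)
y[rng.random(N) < 0.05] *= -1  # 5% random label noise (for illustration only)

w, h = averaging_algorithm(X, y)
angle = float(np.arccos(w @ w_star / np.linalg.norm(w)))
assert angle < 0.3             # w is closely aligned with w*
assert float(np.mean(h(X) != np.sign(X @ w_star))) < 0.1
```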
When the distribution $D_{\text{Examples}}$ is the standard Gaussian, this simple Framework 2 algorithm runs in time $\text{poly}(d/\varepsilon)$ and gives a classifier with error $\tilde{O}(\text{opt}_{\mathcal{C}})+\varepsilon$. Here $\text{opt}_{\mathcal{C}}$ is the best prediction error of an origin-centered linear classifier (i.e. a classifier of the form $\text{sign}(w^{*}\cdot x)$) [KKMS ’08]. The analysis of the Averaging Algorithm is self-contained and only takes three pages [KKMS ’08]. A key idea is to denote the optimal classifier as $\text{sign}(w^{*}\cdot x)$ and decompose $w$ into its component along $w^{*}$ and its component orthogonal to $w^{*}$, arguing10 as follows:
- (a) If the dataset were labeled by the optimal classifier $\text{sign}(w^{*}\cdot x)$ itself, the angle $\angle(w,w^{*})$ would be at most $\varepsilon$ with high probability. This is argued by observing that the inner product $w\cdot w^{*}$ is likely to be large, while the inner products of $w$ with directions orthogonal to $w^{*}$ will likely be much smaller. Likewise, the norm $\|w\|$ can be shown to be close to its expectation with high probability.
- (b) Since the classifier $\text{sign}(w^{*}\cdot x)$ has the optimal error $\text{opt}_{\mathcal{C}}$, only an $\text{opt}_{\mathcal{C}}$ fraction of points in $S$ will contribute to the disagreement between $f_{\text{noisy labels}}$ and $\text{sign}(w^{*}\cdot x)$ (with high probability over $S$). Yet, it can be shown that small subsets of a Gaussian data-set have small averages. Formally, with high probability, if $S$ comes from the standard Gaussian and has size $\text{poly}(d/\varepsilon)$, every subset $S'$ of size $\alpha|S|$ satisfies
$$\Big\|\sum_{x_{i}\in S'}x_{i}\Big\|\leq\left(\tilde{O}(\alpha)+\varepsilon\right)|S|.$$
This lets us upper-bound the norm of the contribution of the mislabeled points to $w$ by $\tilde{O}(\text{opt}_{\mathcal{C}})+\varepsilon$.
- (c) Steps (a) and (b) together can be used to conclude that $\angle(w,w^{*})\leq\tilde{O}(\text{opt}_{\mathcal{C}})+\varepsilon$.
We now use the bound on the angle $\angle(w,w^{*})$ to bound how much less accurate the classifier $\text{sign}(w\cdot x)$ can be compared to $\text{sign}(w^{*}\cdot x)$. Indeed, for a Gaussian sample $x$, the probability that $\text{sign}(w\cdot x)\neq\text{sign}(w^{*}\cdot x)$ is $\Theta(\angle(w,w^{*}))$, which allows us to conclude that
$$\Pr_{x\sim D_{\text{Examples}}}[\text{sign}(w\cdot x)\neq f_{\text{noisy labels}}(x)]\leq\underbrace{\Pr_{x\sim D_{\text{Examples}}}[\text{sign}(w^{*}\cdot x)\neq\text{sign}(w\cdot x)]}_{=\Theta(\angle(w,w^{*}))\leq\tilde{O}(\text{opt}_{\mathcal{C}})+\varepsilon}+\underbrace{\Pr_{x\sim D_{\text{Examples}}}[\text{sign}(w^{*}\cdot x)\neq f_{\text{noisy labels}}(x)]}_{=\text{opt}_{\mathcal{C}}}=\tilde{O}(\text{opt}_{\mathcal{C}})+\varepsilon.$$
A series of subsequent works developed new algorithms, improving the prediction error to $O(\text{opt}_{\mathcal{C}})+\varepsilon$ [ABL ’14, DKTZ ’20, DKTZ ’22]. Also, unlike the Low-Degree Algorithm, these algorithms are proper, i.e. the hypothesis they produce is itself a linear classifier, rather than a complicated function of the form $\text{sign}(p(x))$ for a polynomial $p$.
However, all these algorithms inherently operate in Framework 2 rather than Framework 3, relying on the Gaussianity assumption not only for their run-time but also for their accuracy guarantee. For example, we can see that all three aforementioned steps — (a), (b) and (c) — in the accuracy analysis of the Averaging Algorithm use Gaussianity in crucial ways.
3.1.2 Modified Moment-Matching Tester.
Can we again use the recipe from Section 2.2 and the Moment-Matching Tester to get a $\text{poly}(d/\varepsilon)$-time algorithm in Framework 3 based on the Averaging Algorithm? Conceptually, the Moment-Matching Tester checks that the distribution $D_{\text{Examples}}$ is indistinguishable from $D_{\text{Assumption}}$ by low-degree monomials. It is therefore unsurprising that this tester works well with the Low-Degree Algorithm, since that algorithm is based on finding the best-fitting polynomial. By the same token, one would not expect the Moment-Matching Tester to work well with tailor-made algorithms such as the Averaging Algorithm, as these algorithms do not use polynomials in any way.
Surprisingly, a modified version of the Moment-Matching Tester can nevertheless be used in this way. First, we run one of these algorithms and obtain a hypothesis $\text{sign}(w\cdot x)$. As in Section 2.2, we need to decide whether to output this classifier or to run a slow Framework 1 algorithm. This is again done by running a tester, but here the tester also uses the vector $w$ obtained prior to running it:
Strip Tester [DKKLZ ’23, GKSV ’23, GKSV ’24]:
- Given: access to independent examples from $latex {D_{\text{Examples}}}&fg=000000$, a reference distribution $latex {D_{\text{Assumption}}}&fg=000000$, unit vector $latex {w}&fg=000000$ in $latex {\mathbb{R}^{d}}&fg=000000$.
- Output: Accept or Reject.
- $latex {S\leftarrow}&fg=000000$ {$latex {\text{poly}(d/\varepsilon)}&fg=000000$ independent examples from $latex {D_{\text{Examples}}}&fg=000000$}.
- For all monomials $latex {m}&fg=000000$ of total degree at most $latex {k}&fg=000000$ and integers $latex {i}&fg=000000$:
- Define $latex \displaystyle \mathbf{1}_{w\cdot x\in[i\varepsilon,(i+1)\varepsilon]}=\begin{cases} 1 & \text{if }w\cdot x\in[i\varepsilon,(i+1)\varepsilon]\\ 0 & \text{otherwise.} \end{cases} &fg=000000$
- If $latex {\lvert\mathbb{E}_{x\sim S}[m(x)\mathbf{1}_{w\cdot x\in[i\varepsilon,(i+1)\varepsilon]}]-\mathbb{E}_{x\sim D_{\text{Assumption}}}[m(x)\mathbf{1}_{w\cdot x\in[i\varepsilon,(i+1)\varepsilon]}]\rvert>\left(\varepsilon/d\right)^{O(1)}}&fg=000000$, output Reject and terminate.
- If this step is reached, output Accept.
Essentially, the algorithm above breaks the $latex {d}&fg=000000$-dimensional space into strips along the vector $latex {w}&fg=000000$ and runs the degree-$latex {k}&fg=000000$ moment tester for each of the strips.
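The Strip Tester can be sketched in a few lines of code. In the sketch below, the rejection threshold and sample sizes are illustrative placeholders rather than the papers' exact $latex {(\varepsilon/d)^{O(1)}}&fg=000000$ settings, and fresh Gaussian draws stand in for exact expectations under $latex {D_{\text{Assumption}}}&fg=000000$:

```python
import itertools
import numpy as np

def strip_tester(samples, w, k=2, eps=0.5, n_ref=100_000, seed=0):
    """Sketch of the Strip Tester: within each strip {x : w.x in [i*eps, (i+1)*eps)},
    compare empirical low-degree moments of the sample set against fresh draws from
    the reference Gaussian.  Threshold and sample sizes are illustrative only."""
    rng = np.random.default_rng(seed)
    d = samples.shape[1]
    ref = rng.standard_normal((n_ref, d))   # stand-in for exact D_Assumption expectations
    tol = eps / d                            # placeholder rejection threshold
    proj_s, proj_r = samples @ w, ref @ w
    for deg in range(k + 1):                 # all monomials of total degree <= k ...
        for idx in itertools.combinations_with_replacement(range(d), deg):
            mono_s = np.prod(samples[:, list(idx)], axis=1)
            mono_r = np.prod(ref[:, list(idx)], axis=1)
            for i in range(-int(4 / eps), int(4 / eps)):  # ... and strips covering w.x in [-4, 4)
                in_s = (proj_s >= i * eps) & (proj_s < (i + 1) * eps)
                in_r = (proj_r >= i * eps) & (proj_r < (i + 1) * eps)
                if abs(np.mean(mono_s * in_s) - np.mean(mono_r * in_r)) > tol:
                    return False             # Reject
    return True                              # Accept
```

On genuinely Gaussian samples the tester accepts with high probability, while a sample set concentrated in a single strip is rejected already by the degree-0 (strip-mass) comparison.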
Combining this Strip Tester with the Framework 2 algorithm of [DKTZ ’20] yields an algorithm in Framework 3 with prediction error $latex {O(\text{opt}_{\mathcal{C}})+\varepsilon}&fg=000000$ and run-time $latex {\text{poly}(d/\varepsilon)}&fg=000000$ [DKKLZ ’23, GKSV ’23, GKSV ’24], where $latex {\mathcal{C}}&fg=000000$ is the class of origin-centered linear classifiers.
Moreover, like the Averaging Algorithm, this algorithm is proper, i.e. the hypothesis it produces is itself a linear classifier. This addresses the second limitation of the Low-Degree Algorithm we pointed out earlier.
3.2 Assumption-Tolerance and the Spectral Tester.
Finally, we briefly discuss some recent work on addressing the third limitation of the Moment-Matching Tester described in the beginning of Section 3. To recap, the limitation is that the Moment-Matching Tester might reject distributions that are very close to the distribution $latex {D_{\text{Assumption}}}&fg=000000$ but not exactly equal to it. To address this, the work of [GSSV ’24] considers the following more stringent version of Framework 3:
Framework 4 [Tolerant Testable Framework] For any $latex {\varepsilon>0}&fg=000000$ and any distribution $latex {D_{\text{Examples}}}&fg=000000$, the learning algorithm should with high probability output a classifier $latex {h}&fg=000000$ with
$latex \displaystyle \overbrace{\Pr_{x\sim D_{\text{Examples}}}[h(x)\neq f_{\text{noisy labels}}(x)]}^{\text{Prediction error of }h.}\leq\underbrace{\overbrace{\text{opt}_{\mathcal{C}}}^{\text{Optimal prediction error in }\mathcal{C}.}+\varepsilon}_{\text{Weaker guarantees such as }O(\text{opt}_{\mathcal{C}})+\varepsilon\text{ also considered.}} \ \ \ \ \ (2) &fg=000000$
Furthermore, for some absolute constant $latex {c>0}&fg=000000$, if $latex {D_{\text{Examples}}}&fg=000000$ is $latex {c\varepsilon}&fg=000000$-close to $latex {D_{\text{Assumption}}}&fg=000000$ in total variation distance, then with high probability the learning algorithm should run within its stated run-time bound.
In [GSSV ’24], a general methodology is developed for designing algorithms in this more general framework. For example, when $latex {D_{\text{Assumption}}}&fg=000000$ is the Gaussian distribution and $latex {\mathcal{C}}&fg=000000$ is the class of linear classifiers, a run-time of $latex {d^{\tilde{O}(1/\varepsilon^{2})}}&fg=000000$ is achieved, matching the best algorithms in Frameworks 2 and 3.
In brief, one of the key insights is that the degree-$latex {k}&fg=000000$ Moment-Matching Tester can often be replaced by what is called the degree-$latex {k}&fg=000000$ Spectral Tester. Given a data-set $latex {S}&fg=000000$, this tester checks that no polynomial $latex {p}&fg=000000$ of degree at most $latex {k}&fg=000000$ has $latex {\mathbb{E}_{x\sim S}[p(x)^{2}]}&fg=000000$ much larger than $latex {\mathbb{E}_{x\sim D_{\text{Assumption}}}[p(x)^{2}]}&fg=000000$.
The second insight is that the degree-$latex {k}&fg=000000$ Spectral Tester can be implemented in time $latex {d^{O(k)}}&fg=000000$ using an eigenvalue computation. The final insight is that when $latex {D_{\text{Examples}}}&fg=000000$ is $latex {O(\varepsilon)}&fg=000000$-close to $latex {D_{\text{Assumption}}}&fg=000000$ in total variation distance, one can remove an $latex {O(\varepsilon)}&fg=000000$ fraction of elements from $latex {S}&fg=000000$ and have it pass the Spectral Tester. [GSSV ’24] designs an algorithm that efficiently finds which $latex {O(\varepsilon)}&fg=000000$ fraction of elements needs to be removed from $latex {S}&fg=000000$ to achieve this.
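To illustrate the eigenvalue-based implementation, here is a toy degree-$latex {k}&fg=000000$ spectral check: it tests that no degree-$latex {\leq k}&fg=000000$ polynomial has a much larger second moment under the sample set than under the reference Gaussian. All constants and sample sizes below are illustrative assumptions, not the settings of [GSSV ’24]:

```python
import itertools
import numpy as np

def spectral_test(samples, k=2, slack=0.25, n_ref=200_000, seed=0):
    """Toy spectral check: the top generalized eigenvalue of (M_S, M_D), where
    M = E[v(x) v(x)^T] for the monomial feature map v of degree <= k, equals
    max over polynomials p of E_S[p^2] / E_D[p^2].  Accept iff it is small."""
    rng = np.random.default_rng(seed)
    d = samples.shape[1]

    def features(x):  # all monomials of total degree <= k, as columns
        cols = [np.prod(x[:, list(idx)], axis=1)
                for deg in range(k + 1)
                for idx in itertools.combinations_with_replacement(range(d), deg)]
        return np.stack(cols, axis=1)

    ref = rng.standard_normal((n_ref, d))   # stand-in for exact expectations under D
    M_ref = features(ref).T @ features(ref) / n_ref
    M_s = features(samples).T @ features(samples) / len(samples)

    # Whiten by M_ref^{-1/2} and take the largest eigenvalue.
    inv_sqrt = np.linalg.inv(np.linalg.cholesky(M_ref))
    top = np.linalg.eigvalsh(inv_sqrt @ M_s @ inv_sqrt.T).max()
    return bool(top <= 1.0 + slack)          # Accept iff no direction is inflated
```

Genuinely Gaussian samples pass, while a sample set with an inflated direction (say, every point far out along one coordinate) is rejected because the linear polynomial along that direction has a blown-up second moment.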
4. Other recent work.
This blog post focused on side-stepping the intractability of Framework 1 by making assumptions on the distribution of examples and not making any assumptions on the labeling function $latex {f_{\text{noisy labels}}}&fg=000000$. Yet, improved algorithms are possible when the labeling function $latex {f_{\text{noisy labels}}}&fg=000000$ is also assumed to be well-behaved. For example, the Random Classification Noise assumption essentially presupposes that $latex {f_{\text{noisy labels}}}&fg=000000$ is formed by taking a hypothesis $latex {h}&fg=000000$ in the class $latex {\mathcal{C}}&fg=000000$ and flipping each function value independently with some probability $latex {\eta<1/2}&fg=000000$. The work of [GKSV ’25] gives algorithms for which (as in Framework 3) the failure of the Random Classification Noise assumption can affect only the run-time rather than the accuracy guarantee. [GKSV ’25] also designs such algorithms for the Massart Noise assumption, which is far more general than the Random Classification Noise assumption.
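Concretely, Random Classification Noise labels can be simulated in a few lines; here `w_true` and `eta` are hypothetical stand-ins for the ground-truth classifier and the flip probability:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 10, 5000, 0.2            # eta: label-flip probability (must be < 1/2)

# Ground-truth linear classifier in the class C (hypothetical example).
w_true = rng.standard_normal(d)
x = rng.standard_normal((n, d))
clean = np.sign(x @ w_true)

# Random Classification Noise: each label is flipped independently with prob. eta.
flips = rng.random(n) < eta
noisy = np.where(flips, -clean, clean)

print(np.mean(noisy != clean))        # empirical flip rate, close to eta
```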
The work of [STW ’24] shows how to use the Low-Degree Algorithm and the Moment-Matching Tester to get algorithms in Framework 3 for the challenging hypothesis class of polynomial threshold functions.
The work of [GKSV ’23] extends Framework 3 to families of distributions. Rather than being required to run in time $latex {T}&fg=000000$ only when $latex {D_{\text{Examples}}=D_{\text{Assumption}}}&fg=000000$ for a specific distribution $latex {D_{\text{Assumption}}}&fg=000000$, the algorithms are required to run in time $latex {T}&fg=000000$ when $latex {D_{\text{Examples}}}&fg=000000$ belongs to an entire family of distributions. Designing such algorithms turns out to be intimately connected with the field of Sum-of-Squares relaxations [KS17].
[KKV24, GSSV’24, CKKSV’24, KSV’24] build on the approaches described here to test distribution shift, i.e. to test that the unlabeled data encountered during deployment of a classifier comes from the same distribution as the training data. Similarly to the test in Section 2.2, the test is required either to guarantee that the classifier $latex {h}&fg=000000$ has good prediction accuracy on this new dataset, or to detect that the new dataset came from a data distribution that differs from the one used during training.
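A toy version of such a distribution-shift check simply compares low-degree empirical moments of the training and deployment datasets (the actual testers in these works are substantially more refined; the threshold below is an illustrative assumption):

```python
import itertools
import numpy as np

def moments_match(train, deploy, k=2, tol=0.1):
    """Toy distribution-shift check: compare all monomial moments of total
    degree <= k between training data and deployment data."""
    d = train.shape[1]
    for deg in range(1, k + 1):
        for idx in itertools.combinations_with_replacement(range(d), deg):
            m_train = np.prod(train[:, list(idx)], axis=1).mean()
            m_deploy = np.prod(deploy[:, list(idx)], axis=1).mean()
            if abs(m_train - m_deploy) > tol:
                return False   # flag: deployment data looks shifted
    return True
```

For instance, a mean shift in the deployment data is caught already by the degree-1 moments.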
5. Open Problems.
There are unsolved problems in all three frameworks we discussed today. Even in Framework 1, there are still large gaps in our understanding of some basic problems:
Open problem 1: Is there an algorithm in Framework 1 with accuracy $latex {O(\text{opt}_{\mathcal{C}})+\varepsilon}&fg=000000$ and run-time bound $latex {\text{poly}(d/\varepsilon)}&fg=000000$, when $latex {\mathcal{C}}&fg=000000$ is the class of linear classifiers? What about algorithms with run-times $latex {2^{O(\sqrt{d})}}&fg=000000$ or $latex {2^{o(d)}}&fg=000000$ when $latex {\varepsilon}&fg=000000$ is fixed to be a small constant (say $latex {0.01}&fg=000000$)?
To the best of our understanding, assuming the Exponential Time Hypothesis, the NP-hardness results of [GR’06, FGKP’06] imply that no $latex {2^{o(d)}}&fg=000000$-time algorithm exists for this problem for a proper learning algorithm, i.e. an algorithm that itself outputs a linear classifier. However, when general improper algorithms are considered (i.e. algorithms with no restrictions on what classifier they can produce), no such hardness result is known.
The following is an open question in Framework 2:
Open problem 2: Is there an algorithm in Framework 2 with accuracy $latex {O(\text{opt}_{\mathcal{C}})+\varepsilon}&fg=000000$ and run-time bound $latex {\text{poly}(d/\varepsilon)}&fg=000000$, when $latex {\mathcal{C}}&fg=000000$ is the class of linear classifiers and $latex {D_{\text{Assumption}}}&fg=000000$ is the uniform distribution over $latex {\{\pm1\}^{d}}&fg=000000$? What about algorithms with weaker accuracy bounds?
For a long time, we have had algorithms with run-time $latex {\text{poly}(d/\varepsilon)}&fg=000000$ and accuracy $latex {O(\text{opt}_{\mathcal{C}})+\varepsilon}&fg=000000$ when $latex {D_{\text{Assumption}}}&fg=000000$ is the standard Gaussian distribution [ABL ’14, DKTZ ’20, DKTZ ’22]. Ultimately, one would hope to achieve this for all product distributions over $latex {\{\pm1\}^{d}}&fg=000000$ and not only the Gaussian distribution, but even for the uniform distribution over $latex {\{\pm1\}^{d}}&fg=000000$ such algorithms are yet to be developed.
Finally, the following is an open question in Framework 3:
Open problem 3: Is there an algorithm in Framework 3 with accuracy $latex {O(\text{opt}_{\mathcal{C}})+\varepsilon}&fg=000000$ and run-time bound $latex {\text{poly}(d/\varepsilon)}&fg=000000$, when $latex {\mathcal{C}}&fg=000000$ is the class of general linear classifiers and $latex {D_{\text{Assumption}}}&fg=000000$ is the standard Gaussian distribution?
Note that [DKLZ’24] gives an algorithm with a $latex {\text{poly}(d/\varepsilon)}&fg=000000$ run-time bound and accuracy $latex {\tilde{O}(\text{opt}_{\mathcal{C}})+\varepsilon}&fg=000000$, and [RV ’23, GKK ’23] give algorithms with accuracy $latex {\text{opt}_{\mathcal{C}}+\varepsilon}&fg=000000$ and run-time $latex {d^{\tilde{O}(1/\varepsilon^{2})}}&fg=000000$. As mentioned earlier, [DKKLZ ’23, GKSV ’23, GKSV ’24] give algorithms with accuracy $latex {O(\text{opt}_{\mathcal{C}})+\varepsilon}&fg=000000$ but only when $latex {\mathcal{C}}&fg=000000$ is the class of origin-centered linear classifiers.
Acknowledgements
We are grateful to Gautam Kamath, Adam Klivans, and Ronitt Rubinfeld for their valuable feedback on a draft of this blog post.
Footnotes
- A proof sketch: Draw $latex {N^{2}}&fg=000000$ samples $latex {S}&fg=000000$ from the Gaussian distribution and let $latex {D_{\text{sub-sample}}}&fg=000000$ be uniform over $latex {S}&fg=000000$. The distribution $latex {D_{\text{sub-sample}}}&fg=000000$ is $latex {\Omega(1)}&fg=000000$-far from Gaussian in total variation distance, but it can be shown that $latex {\Omega(N)}&fg=000000$ samples are necessary to distinguish such $latex {D_{\text{sub-sample}}}&fg=000000$ from the Gaussian distribution. We can take $latex {N}&fg=000000$ to be arbitrarily large, so no finite-sample tester exists.
- One can also consider labels $latex {\{y_{i}\}}&fg=000000$ that are themselves random variables rather than a deterministic function of the example $latex {x}&fg=000000$. Everything in this blog post will apply to that setting as well.
- Throughout this blog post, we will assume a success probability of (say) $latex {0.99}&fg=000000$. This probability can be improved to $latex {1-\delta}&fg=000000$ by repeating the process $latex {\log(1/\delta)}&fg=000000$ times and selecting the classifier that does best on a fresh set of samples.
- Note: even for this easier task, no algorithm is known in Framework 2 with run-time better than $latex {2^{O(d)}}&fg=000000$.
- Henceforth, for simplicity, when we say “a Framework 3 algorithm $latex {\mathcal{A}}&fg=000000$ runs in time $latex {T}&fg=000000$” we really mean “$latex {\mathcal{A}}&fg=000000$ runs in time $latex {T}&fg=000000$ when $latex {D_{\text{Examples}}=D_{\text{Assumption}}}&fg=000000$” as Framework 3 does not require a run-time bound when $latex {D_{\text{Examples}}\neq D_{\text{Assumption}}}&fg=000000$.
- If the algorithm runs longer than $latex {T}&fg=000000$ or fails to output a classifier, stop and go to last step.
- The tester below generalizes the tester of [AGM ’03, AAKMRX ’07, OZ ’18] for testing $latex {k}&fg=000000$-wise independence.
- Inspecting the general recipe described in this section, we see that, strictly speaking, to execute this recipe one needs not only a tester $latex {\mathcal{T}}&fg=000000$ but also an algorithm $latex {\mathcal{A}'}&fg=000000$ for the class $latex {\mathcal{C}}&fg=000000$ in Framework 1 (which is allowed to be very slow). The existence of such an $latex {\mathcal{A}'}&fg=000000$ is a very mild condition and holds for all concept classes $latex {\mathcal{C}}&fg=000000$ for which this problem has been considered.
- To do this, one needs to show that none of the coefficients of $latex {p_{\text{up}}(x)-p_{\text{down}}(x)}&fg=000000$ is too large, which turns out to be true.
- For the rest of this subsection we will refer to the standard Gaussian distribution as simply Gaussian, and to origin-centered linear classifiers as simply linear classifiers.
- The definition is inspired by the setting of tolerant property testing [PRR06].
- This last insight is closely related to the method used in [DKS18] to solve a different problem.
By avasilyan