Introduction

Background

Quasi-generalized linear models (quasi-GLMs) extend ordinary and generalized linear regressions. They allow response variables to follow distributions beyond the normal and exponential families1,2,3,4,5. Robust estimation of heavy-tailed GLMs, especially those with infinite-variance data, has recently gained attention6,7,8,9,10. In practice, data often contain noise or contamination, leading to heavier tails, as observed in finance, engineering, and genomic studies. For instance, the study in11 found that normalized gene expression levels often exhibit high kurtosis, indicating outliers and heavy tails. Outliers, although few, can introduce substantial bias in standard estimation. In finance, extreme events and outliers significantly affect models12,13, with infinite-variance distributions such as t-distributions and Pareto families modeling extreme risks. Ignoring outliers or applying overly aggressive bounded transformations can lead to inconsistent and inefficient estimates14.

Non-identically distributed data. An important factor contributing to data outliers is that samples may originate from different distributions. Classical machine learning and statistical theories often assume that the n samples \(\{(X_i, Y_i)\}_{i=1}^n\) are i.i.d., which is often too restrictive for real-world data. The i.i.d. assumption is frequently violated for three main reasons15: (1) data points are influenced by outliers, such as dataset bias or domain shift; (2) the distribution changes at some point (a change point); (3) the data are generated from multiple distinct distributions rather than a single one, as in multi-armed bandit problems16.

The independent non-identically distributed (i.n.i.d.) data assumption is more relevant to real-world applications, particularly in industrial and complex social problems. I.n.i.d. data often arise in time-dependent contexts where distributions change over time9,10. In behavioral and social applications, significant heterogeneity exists that cannot be captured by the i.i.d. assumption17. For example, the authors of18 performed a classification task on a dataset of cats and dogs with unknown breeds, violating the i.i.d. assumption. In computational biology, datasets often include multiple organisms and phenotypes19. In finance, factor models assume returns are driven by common factors, with effects varying across assets20. GLMs with i.n.i.d. data enable modeling such heterogeneity, improving understanding and prediction. The studies21,22 have also developed distributionally robust i.n.i.d. learning methods for real-world applications.

Heavy-tailed data under finite moments. Extensive research assumes both input and output variables exhibit sub-Gaussian behavior23,24,25. When inputs or outputs are contaminated with measurement errors, these errors are also often assumed to be sub-Gaussian26. However, this assumption lacks robustness for unbounded covariates, as sub-Gaussian distributions possess finite moments of every order27. In practice, particularly in finance and extreme value theory, heavy-tailed data often lack exponential moments, presenting additional challenges13.

Determining data corruption precisely is often impossible, making moment conditions a more practical assumption. We focus on data contaminated by heterogeneous, heavy-tailed errors, assuming heavy tails with bounded lower-order moments (e.g., Pareto distributions). Our goal is to develop robust estimators for general GLMs to ensure reliable estimation.

Notation. For \(\theta \in \mathbb {R}^p\), the \(\ell _q\)-norm is \({\Vert \theta \Vert _{\ell _q}} = {( \sum _{j=1}^p |\theta _j|^q )^{1/q}}\). The \(L_p\)-norm is \({\Vert Z \Vert _p} = (\textrm{E}|Z|^p)^{1/p}\) for \(p> 0\). Let \({\textrm{E}_n}f(X) = \frac{1}{n}\sum _{i=1}^n f(X_i)\) denote the empirical average for independent random variables \(\{X_i\}_{i=1}^n\) in a space \({\mathscr {X}}\). Let \((\Theta , d)\) be a metric space, and let \(K \subset \Theta\). A subset \({\mathscr {N}}(K,\varepsilon ) \subset \mathbb {R}^p\) is an \(\varepsilon\)-net for K if, for all \(x \in K\), there exists \(y \in {\mathscr {N}}(K,\varepsilon )\) such that \(\Vert x - y\Vert _{\ell _2} < \varepsilon\). The covering number \(N(K,\varepsilon )\) is the smallest number of closed balls of radius \(\varepsilon\) centered at points in K that cover K. For a smooth function f(x) on \(\mathbb {R}\), let \(\dot{f}(x) = df(x)/dx\) and \(\ddot{f}(x) = d^2 f(x)/dx^2\). For differentiable \(f: \mathbb {R}^p \rightarrow \mathbb {R}\), let \(\nabla f\) denote the gradient. For a scalar random variable Z, the sub-Gaussian norm \(\Vert \cdot \Vert _{\psi _2}\) is defined by \(\Vert Z\Vert _{\psi _2}=\sup _{p \ge 1} \frac{\left( \textrm{E}|Z|^p\right) ^{1 / p}}{\sqrt{p}}\).

Problem setup and contributions

Let the loss function be \(l(y, x, \theta )\), where \(y \in \mathbb {R}\) is the output, \(x \in \mathbb {R}^p\) is the input, and \(\theta \in \Theta\) is the parameter from the hypothesized space \(\Theta\). The common assumption that data is i.i.d. with a shared distribution F is often unrealistic in robust learning. Instead, we assume the random variables \(\{(X_i, Y_i) \sim F_i\}_{i=1}^n \in \mathbb {R}^p \times \mathbb {R}\) are independent, as concentration inequalities require either independence or a common mean. Define the expected empirical risk as: \(R_l^n(\theta ):= \frac{1}{n} \sum _{i=1}^n \textrm{E} l(Y_i, X_i, \theta )\). The true parameter \(\theta _n^*\in \mathbb {R}^p\) (which may not be unique) is defined as the minimizer of the empirical mean of the expected loss

$$\begin{aligned} \theta _n^* : = \mathop {\arg \min }\limits _{\theta \in {\Theta }} \frac{1}{n}\sum \limits _{i = 1}^n {\textrm{E} l({Y_i}, X_i,\theta )} = \mathop {\arg \min }\limits _{\theta \in {\Theta }} {{R}_l^{n}}(\theta ). \end{aligned}$$
(1)

We refer to \(\theta _n^*\) as the sample-dependent true parameter. In regression, the loss function is \(l(y, x, \theta ) = \ell (y, x^\top \theta )\), where \(\ell (\cdot , \cdot )\) is a bivariate function on \(\mathbb {R}^2\). Since \(\theta _n^*\) depends on n, it differs from the i.i.d. version \(\theta ^*:= \arg \min _{\theta \in \Theta } \mathbb {E}[l(Y_1, X_1, \theta )]\), which is independent of n. The excess risk, \({{R}_l^{n}}(\hat{\theta }) - {{R}_l^{n}}({\theta _n^*})\), is a commonly used measure of prediction accuracy for any estimator \(\hat{\theta }\) approximating the true value \({\theta _n^*}\) in machine learning. When \({{R}_l^{n}}(\hat{\theta }) - {{R}_l^{n}}({\theta _n^*})=o_p(1)\), it indicates prediction consistency in statistical regressions. The loss-function-based estimator is obtained through empirical risk minimization (ERM):

$$\begin{aligned} \bar{\theta }:= \arg \min _{\theta \in \Theta } \frac{1}{n} \sum _{i=1}^n l(Y_i, X_i, \theta ) = \arg \min _{\theta \in \Theta } \hat{R}_l^n(\theta ). \end{aligned}$$

However, \(\bar{\theta }\) is not a robust estimator under the heavy-tailed assumption; see28 and Remark 4 below.

This paper studies log-truncated minimization for quasi-GLMs under potentially unbounded Lipschitz losses. Although the resulting optimization is non-convex, we derive excess risk bounds for it. These bounds demonstrate robustness against heavy-tailed data and ensure predictive consistency, even for cases with finite or infinite variance. Additionally, we establish \(\ell _2\)-estimation error bounds using excess risk bounds from the log-truncated loss for quasi-GLMs. The contributions of this work are fourfold:

  • Independent non-identically distributed data and \((1+\varepsilon )\)-moment conditions, \(\varepsilon \in (0,1]\). A major theoretical contribution of our work is relaxing the i.i.d. assumption for the random input \(X \in \mathbb {R}^p\) and output \(Y \in \mathbb {R}\), allowing for independent but non-identically distributed data. Additionally, we impose only weak moment conditions via a log-truncated loss, better reflecting real-world scenarios.

  • Sharper excess risk bounds. Using refined proof techniques, we derive excess risk bounds with tighter constants, improving upon Theorem 2 in29, which had looser bounds. Additionally, our focus on the convex loss case addresses a gap in29, which primarily examined non-convex losses (e.g., regularized deep neural networks) without \(\ell _2\)-risk bounds.

  • Iteration complexity. Under mild moment conditions, we derive an iteration complexity of \(O(\epsilon ^{-4})\) for the stochastic gradient descent (SGD) log-truncated optimizer, guaranteeing that the average gradient norm is reduced to below \(\epsilon\).

  • Misspecified quasi-GLMs. We relax the requirement for a correctly specified output distribution, allowing for misspecification. Despite this, our results show that the SGD algorithm enables computationally feasible estimation for quasi-GLM regression, even with potential outliers.

Our goal is (i) to robustify against heavy-tailed noise via truncation and (ii) to allow non-identical sampling across i (drift in scale/design). Plain SGD tolerates i.n.i.d. data under light (sub-Gaussian) tails, but its guarantees deteriorate under heavy tails with infinite variance; our analysis shows that truncation restores stability while retaining i.n.i.d. flexibility. The robust approach to GLMs with SGD is well aligned with the goals of inverse problems, addressing challenges such as heavy-tailed noise and ill-posedness of the Hessian matrix.

Outlines

The rest of the paper is organized as follows. In Section 2, we address the non-convex learning problem for quasi-GLMs using log-truncated loss functions. Section 3 establishes excess risk bounds under both finite and infinite moment assumptions, and analyzes the iteration complexity of the SGD algorithm for finding stationary points under mild conditions. We also demonstrate the robustness of the log-truncated loss for misspecified GLMs. Sections 4 and 5 present simulations and real-world data analyses, showing the superiority of log-truncated quasi-GLMs over standard MLE-based GLMs in handling heterogeneous distributions and heavy-tailed noise. Proofs are provided in the Supplementary Materials, along with key examples of quasi-GLMs, such as negative binomial and self-normalized Poisson regression, for which the established excess risk bounds are illustrated.

Robust quasi-GLMs for the i.n.i.d. data

Log-truncated loss

To relax exponential moment conditions to finite moment conditions, a robust procedure can be achieved by log-truncating the original estimating equations or loss functions, requiring only finite second moments30. For i.i.d. data \(\{X_i\}_{i=1}^{n}\) with finite variance, Catoni’s M-estimator is defined as the minimizer

$$\begin{aligned} \hat{\theta }_{n}^{Ca}:= \mathrm{\arg \min }_{\theta \in \mathbb {R}} L_n(\theta ), \end{aligned}$$

where \(L_n(\theta ):=\frac{1}{\alpha ^2 n}\sum _{i=1}^{n}\phi (\alpha (X_i-\theta ))\) with \(\phi (x)=\int _{0}^{x}\psi (s)ds\) (see31), and \(\alpha\) is a tuning parameter. Here, the truncation function \(\psi\) is non-decreasing and satisfies

$$\begin{aligned} -\log \left( 1-x+\frac{x^{2}}{2}\right) \le \psi (x) \le \log \left( 1+x+\frac{x^{2}}{2}\right) . \end{aligned}$$
(2)

The rationale is that a function \(\psi (x)\) which grows significantly more slowly than a linear function, for example a logarithmic function, reduces the impact of extreme outliers. This makes the outliers comparable to typical data points. The unbounded nature of the log-truncated function \(\psi (x)\) retains substantial data variability, offering flexibility without imposing restrictive bounds. This contrasts with traditional bounded M-estimators, which may overly constrain the data and lose valuable information. For example, the Catoni log-truncated score function is

$$\begin{aligned}\psi (x) = \mathrm{{sign}}(x)\log (1 + |x| + {x^2}/2). \end{aligned}$$

The function \(\psi\) is odd, nondecreasing, and satisfies \(\psi (x)=x+O(x^{2})\) as \(x\rightarrow 0\), while \(|\psi (x)|\asymp \log (x^{2})\) as \(|x|\rightarrow \infty\), thereby downweighting extreme residuals. As is well known, the Cauchy distribution is heavy-tailed and has no finite moments; the classical Cauchy loss \(\rho _{c}(x):=\log (1+c x^{2})\) (\(c>0\)) is a standard robust alternative. Our choice may be viewed as an extension of the Cauchy loss/score, with the additional linear term inside the logarithm ensuring a smooth interpolation between the quadratic regime near zero and logarithmic growth in the tails. We apply this \(\psi (x)\) to quasi-GLMs below.
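As a concrete illustration, the following minimal sketch (in Python; the function names are ours) implements the Catoni log-truncated score above and the Cauchy score for comparison, using only the closed forms already given.

```python
import numpy as np

def psi_catoni(x):
    """Catoni log-truncated score: sign(x) * log(1 + |x| + x^2 / 2)."""
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def psi_cauchy_score(x, c=1.0):
    """Score of the Cauchy loss rho_c(x) = log(1 + c * x^2)."""
    return 2.0 * c * x / (1.0 + c * x**2)

x = np.linspace(-50, 50, 11)
# Near zero psi_catoni(x) ~ x; in the tails it grows only logarithmically,
# so extreme residuals are down-weighted instead of dominating the fit.
print(np.round(psi_catoni(x), 3))
print(np.round(psi_cauchy_score(x), 3))
```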

GLMs, quasi-GLMs and robust quasi-GLMs

The exponential family is a flexible class of distributions, including commonly used sub-exponential and sub-Gaussian distributions such as binomial, Poisson, negative binomial, normal, and gamma. This family has convexity properties that ensure finite variance, making it foundational in many statistical models. To explore the loss function from the exponential family, we begin with a dominating measure \(\nu (\cdot )\). Now, consider a random variable Y that follows the natural exponential family \(P_{\eta }\), which is indexed by the canonical parameter \(\eta\):

$$\begin{aligned} P_{\eta }(dy) = dF_Y(y) = c(y) \exp \{y \eta - b(\eta )\} \, \nu (dy), \quad c(y)> 0, \end{aligned}$$
(3)

where c(y) is independent of \(\eta\), and \(\eta\) lies in \(\Theta = \left\{ \eta : \int c(y) \exp \{y \eta \} \, \nu (dy) < \infty \right\}\). This structure highlights the adaptability of exponential family models. They are effective at capturing diverse data-generating processes and are well-suited for supporting robust statistical analysis.

Let \(\eta _i = u(X_i^\top \theta )\) in \(dF_{Y_i}(y) = c(y) \exp \{y \eta _i - b(\eta _i)\}\, \nu (dy)\), representing a non-identical distribution for each \(i \in [n]\), where \(u(\cdot )\) is a known link function. The conditional likelihood of \(\{Y_i | X_i\}_{i=1}^{n}\) is the product of the n individual terms in (3). The average negative log-likelihood function is defined as the empirical risk

$$\begin{aligned} \hat{R}_l^n(\theta ) := \frac{-1}{n} \sum _{i=1}^{n} \left[ Y_i u(X_i^\top \theta ) - b(u(X_i^\top \theta ))\right] = \frac{1}{n} \sum _{i=1}^{n} l(Y_i, X_i^\top \theta ), \quad \theta \in \mathbb {R}^{p}, \end{aligned}$$
(4)

where the loss function is \(l(y, x^\top \theta ):= k(x^\top \theta ) - y u(x^\top \theta )\), with \(k(t):= b \circ u(t)\). If \(u(t) = t\), this is called the canonical link, aligning the model with the natural parameterization of the exponential family.

In robust GLMs, we do not assume that \(Y_i\) follows an exponential family distribution as in (3). For example,32 assumed that \(Y_i\) is sub-Gaussian. Additionally, the sample may contain outliers, and the data may not be i.i.d. Instead, we assume that \(\{ Y_i\}_{i=1}^{n}\) satisfies certain lower-order moment conditions (stated in Theorem 1 below). The function in (4) is called the quasi-log-likelihood, from which we obtain the quasi-GLM loss \(k(x^\top \theta ) - y u(x^\top \theta )\). Let \(\alpha\) and \(\rho\) be tuning parameters. The robust log-truncated ridge-penalized estimator \(\hat{\theta }\) for quasi-GLMs is given by:

$$\begin{aligned} \hat{\theta }: = \mathop {\arg \min }\limits _{\theta \in {\Theta }} \left\{ \frac{1}{{n\alpha }}\sum \limits _{i = 1}^n \psi [\alpha (k(X_{i}^ \top \theta )-Y_iu(X_{i}^ \top \theta ))]+\rho \Vert \theta \Vert _2^2\right\} =\mathop {\arg \min }\limits _{\theta \in {\Theta }}\{{{{\hat{R}}}_{\psi \circ l}}(\theta )+\rho \Vert \theta \Vert _2^2\}, \end{aligned}$$
(5)

where \({{{\hat{R}}}_{\psi \circ l}}(\theta ):=\frac{1}{{n\alpha }}\sum _{i = 1}^n \psi [\alpha (k(X_{i}^ \top \theta )-Y_i u(X_{i}^ \top \theta ))]\) with Catoni’s log-truncated function \(\psi (x) = \mathrm{{sign}}(x)\log (1 + |x| + {x^2}/2)\). Note that Catoni’s score function is \(\dot{\psi }(x) = \frac{{1 + |x|}}{{1 + |x| + {x^2}/2}}\in (0,1]\), and thus the gradient of the penalized loss is given by

$$\begin{aligned}\nabla _{\theta }[{{\hat{R}}_{{\psi } \circ l}}(\theta )+\rho \Vert \theta \Vert _2^2] = \frac{1}{n}\sum \limits _{i = 1}^n {X_i^{}\dot{l}({Y_i},X_i^ \top \theta )\dot{\psi }} [\alpha l({Y_i},X_i^\top \theta )]+2\rho \theta \approx \frac{1}{n}\sum \limits _{i = 1}^n X_i\dot{l}({Y_i},X_i^ \top \theta ),\end{aligned}$$

where the approximation follows from

$$\begin{aligned} \alpha ~\text {and}~\rho \rightarrow 0~\text {and thus}~{\dot{\psi }[\alpha l({Y_i},{X_i}^\top {\theta _n^*})]}\rightarrow 1~\text {as}~n \rightarrow \infty . \end{aligned}$$
(6)

Note that \(\ddot{\psi }(x) = \frac{{ - 2x(2 + \left| x \right| )}}{{{{(2 + 2\left| x \right| + {x^2})}^2}}}\in (-0.5,0.5)\), and the Hessian matrix of \({{{\hat{R}}}_{{\psi } \circ l} }(\theta )\) is

$$\begin{aligned} \nabla _{\theta }^2{{{\hat{R}}}_{{\psi } \circ l}}(\theta )|_{\theta = {\theta _n^*}}&= \frac{1}{n}\sum \limits _{i = 1}^n X_i X_i^\top \{\ddot{l}({Y_i},X_i^\top {\theta _n^*})\dot{\psi }[\alpha l({Y_i},X_i^\top {\theta _n^*})] + \alpha {{\dot{l}}^2}({Y_i},X_i^\top {\theta _n^*})\ddot{\psi }[\alpha l({Y_i},X_i^\top {\theta _n^*})]\}\nonumber \\&\approx \frac{1}{n}\sum \limits _{i = 1}^n X_iX_i^ \top \ddot{l}({Y_i},X_i^\top {\theta _n^*})+O_p(\alpha )=\frac{1}{n}\sum \limits _{i = 1}^n X_iX_i^ \top \ddot{l}({Y_i},X_i^\top {\theta _n^*})+o_p(1), \end{aligned}$$
(7)

as \(\rho \rightarrow 0\). The last approximation holds by (6), provided that \({\mathrm{{E}}_n}\{ XX_{}^ \top {{\dot{l}}^2}(Y,X,{\theta _n^*})\ddot{\psi }[\alpha l(Y,X,{\theta _n^*})]\} < \infty .\) The empirical Hessian matrix of the log-truncated ERM approximates the Hessian matrix of the original ERM. Therefore, an appropriate choice of a sufficiently small parameter \({\alpha }\) ensures that the resulting estimating equation closely resembles the original one.
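For concreteness, the displayed gradient can be coded directly. The sketch below uses our own helper names and, as an illustrative assumption, the canonical Poisson pair \(b(t)=e^t\), \(u(t)=t\); it evaluates \(\nabla _{\theta }[{{\hat{R}}_{{\psi } \circ l}}(\theta )+\rho \Vert \theta \Vert _2^2]\).

```python
import numpy as np

def dpsi(x):
    """Derivative of Catoni's psi: (1 + |x|) / (1 + |x| + x^2 / 2)."""
    return (1.0 + np.abs(x)) / (1.0 + np.abs(x) + 0.5 * x**2)

def truncated_grad(theta, X, Y, alpha, rho, b=np.exp, db=np.exp):
    """Gradient of the log-truncated penalized quasi-GLM risk with the
    canonical link u(t) = t, so l(y, t) = b(t) - y * t and dl/dt = db(t) - y."""
    eta = X @ theta                      # linear predictors X_i^T theta
    l = b(eta) - Y * eta                 # per-sample quasi-GLM losses
    dl = db(eta) - Y                     # derivative of the loss in eta
    w = dpsi(alpha * l)                  # truncation weights in (0, 1]
    return X.T @ (w * dl) / len(Y) + 2.0 * rho * theta

# toy check: with small alpha the weights are close to 1, so the gradient
# is close to the untruncated quasi-likelihood gradient
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); theta0 = rng.uniform(size=5)
Y = rng.poisson(np.exp(X @ theta0)).astype(float)
print(np.round(truncated_grad(theta0, X, Y, alpha=1e-3, rho=0.0), 4))
```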

The classical Newton’s algorithm becomes inapplicable when the empirical Hessian matrix is singular or lacks finite moments, which often occurs when the dimension p is large. This is due to the instability of inverting the Hessian. The problem is exacerbated by heavy-tailed data and the non-convexity of the log-truncated ERM problem. SGD is effective for solving non-convex optimization problems due to its simple iterative updates and avoidance of computationally expensive Hessian inversion33. This makes SGD particularly well-suited for robust machine learning, where traditional second-order methods are impractical. In Section 5, we use SGD to avoid Hessian inversion, relying only on high-probability bounds on the Hessian and moment conditions of the loss. This approach simplifies optimization, enhances efficiency, and effectively handles robust GLM estimation.

Main result

Excess risk bound under only variance conditions

Technically, we consistently impose the following compact space assumption (G.0) to simplify the analysis of the excess risk bound. This assumption allows us to leverage concentration inequalities for certain suprema of empirical processes defined over the compact space.

  • (G.0) Compact parameter space: the domain \(\Theta \subseteq \mathbb {R}^{p}\) is bounded with radius \(r\): \(\Vert \theta \Vert _{\ell _2} \le r, \quad \forall \theta \in \Theta\).

Assumption (G.0) ensures the existence of a finite covering number: for any \(\varepsilon>0,\) there exists a finite \(\varepsilon\)-net of \(\Theta\). Next, we introduce several regularity conditions that are essential for establishing the excess risk bounds and the rate of convergence of the proposed log-truncated ERM estimators.

  • (G.1): Given \(0<A<\infty\), assume that u(x) is continuously differentiable and \(\dot{u}(x)\ge 0\) is locally bounded. That is, there exists a positive function \({g_A}({x})\) such that \(0 \le \dot{u}(x^ \top \theta ) \le {g_A}(x)\) for \(\Vert \theta \Vert _{\ell _2}\le A\).

  • (G.2): Suppose that k(x) is continuously differentiable and \(\dot{k}(x)\ge 0\) is locally bounded. That is, there exists a positive function \({h_A}({x})\) such that \(0<\dot{k}(x^ \top \theta ) \le {h_A}(x)\) for \(\Vert \theta \Vert _{\ell _2}\le A.\)

  • (G.3): \(\mathrm{{E}}_n\Vert {{X}{g_{r + \frac{1}{n} }}({X}){Y}} \Vert _{\ell _2}^2 <\infty\) and \(\mathrm{{E}}_n\Vert {{X}{h_{r + \frac{1}{n} }}({X})}\Vert _{\ell _2}^2<\infty\).

  • (G.4): Let \(R_{{l^2}}^n(\theta ): = {\mathrm{{E}}_n}[k(X_{}^ \top \theta ) - Yu(X_{}^ \top \theta )]^2\). We have \(\sigma _R^\mathrm{{2}}(n): ={\sup }_{\theta \in {\Theta }} {R_{l^2}^n}(\theta )<\infty\).

Remark 1

Conditions (G.1) and (G.2) are technical conditions that ensure the local Lipschitz property of the loss function \(l(y, x, \theta ) = k(x^\top \theta ) - y u(x^\top \theta )\); that is, there exists a Lipschitz function of the data, H(y, x), such that

$$\begin{aligned} |l(y,x,{\eta _2}) - l(y,x,{\eta _1})| \le H(y,x)\Vert {\eta _2} - {\eta _1}\Vert _{\ell _2};{\eta _1},{\eta _2} \in {B_r}=\{\theta \in \mathbb {R}^{p}|\Vert \theta \Vert _{\ell _2} \le r\}, \end{aligned}$$
(8)

see (C.3) in29. The Lipschitz modulus controls the supremum over \(\theta \in \Theta\) via covering numbers: in covering arguments, the local Lipschitz condition transfers pointwise concentration at a net point to the whole parameter ball. The first-order Taylor expansion of \(l(y,x,\cdot )\) reads

$$\begin{aligned}l(y,x,{\eta _2}) = l(y,x,{\eta _1}) +({\eta _2} - {\eta _1})^\top \nabla _\eta l(y,x,(t{\eta _2} + (1 - t){\eta _1})),~~{\eta _1},{\eta _2} \in \Theta ,~\exists ~t \in (0,1),\end{aligned}$$

Hence, we can choose H(y, x) satisfying \(H(y,x)\ge \sup \limits _{{\eta _1},{\eta _2} \in B_{r_n}} \Vert {\nabla _\eta l(y,x,(t{\eta _2} + (1 - t){\eta _1}))}\Vert _2\) with \(r_n>0\). Fixing an \(\eta \in B_{r_n}\), we compute the gradient \(\nabla _\eta l(y,x^\top \eta )= [-y\dot{u}(x^\top \eta )+\dot{k}(x^\top \eta ) ]x^\top\). Accordingly, H(y, x) in (8) can be taken as

$$\begin{aligned} \sup \limits _{{\eta _1},{\eta _2} \in B_{r_n}} \Vert [-y\dot{u}(x^\top \eta )+\dot{k}(x^\top \eta ) ]x^\top \Vert _2&\le \mathop {\sup }\limits _{\Vert \eta \Vert _2\le {r_n}} |-y\dot{u}(x^\top \eta )+\dot{k}(x^\top \eta ) |\cdot \left\| x \right\| _2\le [|y|{g_{r_n}}(x)+{h_{r_n}}(x)]\left\| x \right\| _2 \nonumber \\&:=H(y,x). \end{aligned}$$
(9)

Under conditions (G.1)-(G.3) and putting \({r_n}=r+1/n\), one has \({\mathrm{{E}}_n}H^2(Y,X)\le 2(\mathrm{{E}}_n\Vert {{X}{g_{r + \frac{1}{n} }}({X}){Y}} \Vert _{\ell _2}^2+\mathrm{{E}}_n\Vert {{X}{h_{r + \frac{1}{n} }}({X})}\Vert _{\ell _2}^2) <\infty\), which is exactly the moment condition (C.4) in29.

A trivial example of (G.1) is \(u(x)=x\) with \({g_A}(x) \equiv 1\), which corresponds to the natural link function in GLMs. Recall that \(k(t):=b \circ u (t)\), giving \(\dot{k}(t) = \dot{u}(t)\dot{b}(u(t))\). If \(\{{Y_i}|{X_i}\}_{i=1}^n\) follows a distribution in the exponential family, then we have \(\mathrm{{E}}\{ {Y_i}|{X_i}\} = \dot{b}(u(X_i^ \top \theta _n^*)), i = 1,2, \cdots ,n.\) Under (G.0) and (G.1), it follows that (G.2) holds with \(A=r\) and \({h_r}({X_i}) = {g_r}({X_i})\mathrm{{E}}\{ {Y_i}|{X_i}\}\), as implied by the inequality

$$\begin{aligned}\dot{k}(X_i^ \top \theta _n^*) = \dot{u}(X_i^ \top \theta _n^*)\dot{b}(u(X_i^ \top \theta _n^*)) \le {g_r}({X_i})\mathrm{{E}}\{ {Y_i}|{X_i}\} = {h_r}({X_i}),~\text {for}~| {X_i^ \top \theta _n^*} | \le r\left\| {{X_i}} \right\| _{\ell _2}.\end{aligned}$$

Condition (G.3) only requires \(\mathrm{{E}}_n\Vert {{X}{Y}} \Vert _{\ell _2}^2 <\infty\) and \(\mathrm{{E}}_n\Vert X\Vert _{\ell _2}^2<\infty\), provided that both \({h_A}({x})\) and \({g_A}({x})\) are bounded. For quasi-GLMs with i.i.d. data, the authors of34 imposed the moment condition \(\mathrm{{E}}{Y_1^{7/3}} < \infty\) to establish laws of the iterated logarithm for the quasi-maximum likelihood estimator in GLMs with bounded inputs; a similar bounded-input condition is required in3. However, our (G.4) only requires a second moment condition on the responses, that is,

$$\begin{aligned}R_{{l^2}}^n(\theta ): = {\mathrm{{E}}_n}{[k(X_{}^ \top \theta ) - Yu(X_{}^ \top \theta )]^2} \le 2{\mathrm{{E}}_n}{[k(X_{}^ \top \theta )]^2} + 2{\mathrm{{E}}_n}{[Yu(X_{}^ \top \theta )]^2} \le {C_1} + {C_2}{\mathrm{{E}}_n}Y^2,\end{aligned}$$

where \(C_1\) and \(C_2\) are constants, assuming that both \({h_A}({x})\) and \({g_A}({x})\) are bounded.

The first main result is presented as follows.

Theorem 1

(Excess risk bound under finite variance conditions) Let \(\hat{\theta }\) be defined by (5) and \({\theta _n^*}\) be given by (1) with \(l(y, x,\theta ):=k(x^ \top \theta )-y u(x^ \top \theta )\). If \(\alpha \le \sqrt{\frac{1}{{2n}}\mathop {\sup }\nolimits _{\theta \in {\Theta }}[ R_{l^2}^{ n}(\theta )]^{-1}[\log (\frac{1}{{{\delta ^2}}}) + p\log ( {1 + 2rn})]}\) with \(\delta \in (0,1)\), under (G.0)-(G.4), we have

$$\begin{aligned} {R_{l}^n}(\hat{\theta }) - {R_{l}^n}({\theta _n^*})&\le \rho \Vert {\theta _n^*}\Vert _2^2+\frac{2}{n}{\mathrm{{E}}_n[\Vert {{X}{g_{r + \frac{1}{n} }}({X}){Y}} \Vert _{\ell _2} + \Vert {{X}{h_{r + \frac{1}{n} }}({X})}\Vert _{\ell _2}]} \\&+ \sqrt{\frac{{\log (\frac{1}{{{\delta ^2}}}) + p \log ( {1 + 2rn} )}}{n}} \left[ \frac{3\mathrm{{E}}_n[\Vert {{X}{g_{r + \frac{1}{n} }}({X}){Y}} \Vert _{\ell _2}^2 + \Vert {{X}{h_{r + \frac{1}{n} }}({X})}\Vert _{\ell _2}^2]}{{2\sqrt{2}{\sigma _R(n)}{n^2}}} + \sqrt{2}{\sigma _R}(n)\right] , \end{aligned}$$

with probability at least \(1-2\delta\). Furthermore, if \(u(t)=t\), we have

$$\begin{aligned} R_l^n(\hat{\theta }) - R_l^n(\theta _n^*)&\le \rho \Vert {\theta _n^*}\Vert _2^2+\frac{{2{\mathrm{{E}}_n}[\left\| {YX} \right\| _{{\ell _2}} + \Vert {X{h_{r + \frac{1}{n} }}(X)} \Vert _{{\ell _2}}]}}{n}\\&+ \sqrt{\frac{{\log ({\delta ^{ - 2}}) + p\log (1 + 2rn)}}{{n}}} [ \sqrt{2}{\sigma _R}(n) + \frac{{{\mathrm{{E}}_n}[\left\| {YX} \right\| _{{\ell _2}}^2 +\Vert {X{h_{r + \frac{1}{n} }}(X)} \Vert _{{\ell _2}}^2]}}{{2 \sqrt{2}{\sigma _R}(n){n^2}}}]. \end{aligned}$$

Remark 2

If \(\rho \rightarrow 0\), Theorem 1 provides non-asymptotic excess risk bounds, forming the theoretical foundation for establishing the convergence rate

$$\begin{aligned}R_l^n(\hat{\theta }) - R_l^n(\theta _n^*)=O_p(\rho )+O_p(1/n)+O_p(({p\log n}/{n})^{ \frac{1}{2 }}[1/n^2+O(1)])=O_p(({p\log n}/{n})^{ \frac{1}{2 }})\end{aligned}$$

and the prediction consistency of the proposed estimators. In the i.i.d. setting, the constants \(3/(2\sqrt{2})\approx 1.06\) and \(\sqrt{2}\) of \(\frac{\mathrm{{E}}_n[\Vert {{X}{g_{r + \frac{1}{n} }}({X}){Y}} \Vert _{\ell _2}^2 + \Vert {{X}{h_{r + \frac{1}{n} }}({X})}\Vert _{\ell _2}^2]}{{{\sigma _R(n)}{n^2}}}+ \sqrt{2}{\sigma _R}(n)\) in Theorem 1 are sharper than the constants \(2/\sqrt{3}\approx 1.15\) and 6 in Theorem 2 of29 (also summarized in Corollary 2 with \(\beta =2\) in the following subsection). To see these constants, Theorem 4 (\(\beta =2\)) and equation (43) in29 give

$$\begin{aligned}&{R_l}(\hat{\theta }_{n} ) -{R_l}({\theta ^*})\le \frac{2}{n}{\mathrm{{E}}}H(Y,X) +{\left( {\frac{{\log ({\delta ^{ - 2}}) + p\log \left( {1 + 2rn} \right) }}{{3n}}} \right) ^{{1/2}}}\left[ {\frac{{{\mathrm{{E}}}H^2({Y,X})}}{{{{\sigma _R}n^2}}} + 6{\sigma _R}} \right] \nonumber \\&\le \frac{2}{n}{\mathrm{{E}}}H(Y,X) +{\left( {\frac{{\log ({\delta ^{ - 2}}) + p\log \left( {1 + 2rn} \right) }}{{n}}} \right) ^{{1/2}}}\left[ \frac{{{2\mathrm{{E}}}[\Vert {{X}{g_{r + \frac{1}{n} }}({X}){Y}} \Vert _{\ell _2}^2 + \Vert {{X}{h_{r + \frac{1}{n} }}({X})}\Vert _{\ell _2}^2]}}{{{\sqrt{3}{\sigma _R}n^2}}} + 6{\sigma _R} \right] \end{aligned}$$
(10)

with probability at least \(1 - 2\delta\), where \({R_l}(\hat{\theta }_{n} )\), \({R_l}({\theta ^*})\) and \(\sigma _R\) are the i.i.d. version of \({R_{l}^n}(\hat{\theta })\), \({R_{l}^n}({\theta _n^*})\) and \(\sigma _R(n)\), respectively. So the constant of \(\frac{{{\mathrm{{E}}}[\Vert {{X}{g_{r + \frac{1}{n} }}({X}){Y}} \Vert _{\ell _2}^2 + \Vert {{X}{h_{r + \frac{1}{n} }}({X})}\Vert _{\ell _2}^2]}}{{{{\sigma _R}n^2}}}\) is \(2/\sqrt{3}\) and the constant of \(\sigma _R\) is 6.

Remark 3

If the tuning parameter \(\alpha\) is misspecified, the excess risk bound deteriorates. Letting \(\alpha \downarrow 0\) recovers the untruncated estimating equation \(\frac{1}{n}\sum _{i=1}^n X_i\,\dot{\ell }\!\big (Y_i, X_i^\top \theta \big )=0,\) corresponding to the ERM without truncation. Thus \(\alpha\) governs the usual bias-variance trade-off. Theorem 1 yields consistency when \(\alpha\) balances the stochastic and truncation errors; optimizing the bound gives \(\alpha \asymp \sqrt{\frac{p\log n}{n}}.\) In the proof we enforce the balance by equating the “variance” and “bias” terms, \(2\alpha \sup _{\theta \in \Theta } R^{\,n}_{\ell ^{2}}(\theta ) \;=\; \frac{1}{n\alpha }\,\log \!\left( \frac{N(\Theta ,\varepsilon )}{\delta ^{2}}\right) ,\) where \(R^{\,n}_{\ell ^{2}}(\theta )=n^{-1}\sum _{i=1}^{n}\ell ^{2}\!\big (Y_i,X_i^\top \theta \big )\) and \(N(\Theta ,\varepsilon )\) denotes the \(\varepsilon\)-covering number of \(\Theta\). This choice yields the stated order for \(\alpha\) and the resulting excess risk rate.
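As a minimal numerical sketch of this calibration (the function name is ours, and `sigma2_sup` stands for a plug-in bound on \(\sup _{\theta \in \Theta }R^{\,n}_{\ell ^{2}}(\theta )\), which must be supplied or estimated):

```python
import numpy as np

def alpha_theorem1(n, p, r, sigma2_sup, delta=0.05):
    """Tuning parameter of Theorem 1:
    alpha <= sqrt( [log(1/delta^2) + p*log(1 + 2*r*n)] / (2*n*sigma2_sup) ),
    which is of order sqrt(p * log(n) / n) for fixed r and sigma2_sup."""
    num = np.log(1.0 / delta**2) + p * np.log(1.0 + 2.0 * r * n)
    return np.sqrt(num / (2.0 * n * sigma2_sup))

print(alpha_theorem1(n=1000, p=20, r=1.0, sigma2_sup=4.0))
```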

Remark 4

In the i.i.d. setting,25 analyzed the high-probability performance of the excess risk bound for the ERM estimator \(\bar{\theta }\), showing that \({R_l}(\bar{\theta }) - {R_l}({\theta _n^*})\le \frac{O(1)}{{n }}({{ {{\Vert {{\left\| {{X_1}} \right\| _{\ell _2}}} \Vert }_{{\psi _2}}}}})\) with high probability. This result holds under the assumptions of a three-times continuously differentiable, self-concordant, convex loss function, as well as a sub-Gaussian exponential moment condition on \(X\). Similarly, for regression with quadratic loss,35 established the high-probability excess risk bound \({R_l}(\bar{\theta }_H) - {R_l}({\theta _n^*})\le O(\frac{d}{{n }}+\frac{1}{{n^{3/4} }})\) under an 8th moment condition on the regression noise, where \(\bar{\theta }_H\) is the ERM of a smoothed convex Huber loss.

However, if \(\Vert {\left\| {{X_1}} \right\| _{\ell _2}}\Vert _{{\psi _2}}= \infty\) or the higher moments do not exist, consistency of the excess risk fails, and \(\bar{\theta }\) becomes non-robust under heavy-tailed assumptions. In contrast, under the moment conditions (G.3) and (G.4), our estimator \(\hat{\theta }\), based on the log-truncated loss, is more robust than \(\bar{\theta }\).

In the following result, we present an explicit upper bound for the \(\ell _2\)-error \({\Vert {\hat{\theta }- \theta _n^*} \Vert _{\ell _2}^2}\) under two additional assumptions: one concerning model specification and the other regarding the Hessian matrix condition.

  • (G.5) Specification of the average model: Assume that

    $$\begin{aligned} \sum _{i=1}^n\mathrm{{E}}\{X_i\dot{u}(X_i^ \top \theta _n^*)[Y_i - \dot{b}(u(X_i^ \top \theta _n^*))]\}=0. \end{aligned}$$
    (11)
  • (G.6) Hessian matrix condition (local strong convexity): Assume that there exists a positive constant \({c_{2r} }\) such that

    $$\begin{aligned} \inf _{\Vert \eta \Vert _{\ell _2}\le 2r}{\mathrm{{E}}_n}\left\{ { - XX^ \top \{ \ddot{u}(X^ \top \eta )[Y - \dot{b}(u(X^ \top \eta ))] + {{\dot{u}}^2}(X_{}^ \top \eta )\ddot{b}(u(X^ \top \eta ))\} } \right\} \succ \textrm{I}_{p}{c_{2r} }. \end{aligned}$$
    (12)

(G.5) is the population first-order stationarity of \(R_l^n(\theta )\); it does not require pointwise correct specification. It only requires that the misspecification average out to zero, i.e., that the residuals are mean-zero and orthogonal to \(X_i\dot{u}(\cdot )\) on average. (G.6) is a local strong-convexity condition: the average negative Hessian on the ball \(\{\Vert \eta \Vert _2\le 2r\}\) is uniformly positive definite with constant \(c_{2r}>0\). With the canonical link \(u(t)=t\), it reduces to \(\mathrm{{E}}_n[X X^\top \ddot{b}(X^\top \eta )]\succeq c_{2r} I_p\), which holds if \(\mathrm{{E}}_n[XX^\top ]\succeq \kappa I_p\) and \(\ddot{b}\) is bounded below on the relevant range (e.g., logistic: \(\ddot{b}(z)\in (0,1/4]\); Poisson: \(\ddot{b}(z)=e^z\ge e^{-M}\) over \(|z|\le M\)). For smooth non-canonical links, the term involving \(\ddot{u}(\cdot )[Y-\dot{b}(u(\cdot ))]\) is neutralized by (G.5); the positive curvature comes from \(\dot{u}^2\,\ddot{b}\).
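Under these reductions, (G.6) can be checked numerically for a given design. A minimal sketch for the logistic case with the canonical link (all names are ours) estimates the smallest eigenvalue of \(\mathrm{{E}}_n[X X^\top \ddot{b}(X^\top \eta )]\) over random points of the ball:

```python
import numpy as np

def min_curvature_logistic(X, radius, n_dirs=200, seed=0):
    """Estimate inf over ||eta|| <= 2*radius of lambda_min(E_n[X X^T ddot{b}(X^T eta)])
    for the logistic model, where ddot{b}(z) = sigmoid(z) * (1 - sigmoid(z))."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    lam_min = np.inf
    for _ in range(n_dirs):
        eta = rng.normal(size=p)
        eta *= 2.0 * radius * rng.uniform() / np.linalg.norm(eta)  # random point in the ball
        z = X @ eta
        s = 1.0 / (1.0 + np.exp(-z))
        w = s * (1.0 - s)                     # ddot{b}(z) in (0, 1/4]
        H = (X * w[:, None]).T @ X / n        # E_n[X X^T ddot{b}(X^T eta)]
        lam_min = min(lam_min, np.linalg.eigvalsh(H)[0])
    return lam_min

X = np.random.default_rng(1).normal(size=(500, 5))
print(min_curvature_logistic(X, radius=1.0))   # bounded away from 0 for this design
```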

With (G.5) and (G.6), we can obtain the following non-asymptotic \(\ell _2\)-estimation error bound.

Corollary 1

(\(\ell _2\)-estimation error) Under the notations and assumptions in Theorem 1, and assuming that Conditions (G.5) and (G.6) hold, then we have with probability at least \(1 - 2\delta\)

$$\begin{aligned} {\Vert {\hat{\theta }- \theta _n^*} \Vert _{\ell _2}^2}&\le \frac{\rho \Vert {\theta _n^*}\Vert _2^2}{c_{2r}}+\frac{2}{{c_{2r} }n}{\mathrm{{E}}_n[\Vert {{X}{g_{r + \frac{1}{n} }}({X}){Y}} \Vert _{\ell _2} + \Vert {{X}{h_{r + \frac{1}{n} }}({X})}\Vert _{\ell _2}]} \\&+ \frac{1}{{c_{2r} }}\sqrt{\frac{{\log (\frac{1}{{{\delta ^2}}}) + p \log ( {1 + 2rn} )}}{n}} \left[ \sqrt{2}{\sigma _R}(n) + \frac{{{\mathrm{{E}}_n}[\left\| {YX} \right\| _{{\ell _2}}^2 +\Vert {X{h_{r + \frac{1}{n} }}(X)} \Vert _{{\ell _2}}^2]}}{{2 \sqrt{2}{\sigma _R}(n){n^2}}}\right] . \end{aligned}$$

Corollary 1 with (G.5) implies that our robust quasi-GLMs can be misspecified, i.e., \({\textrm{E}}\{ {Y_i}|{X_i}\} \ne \dot{b}(u(X_i^ \top \theta _n^*))\) for some i. Conditions (G.5) and (G.6) are the minimal additional requirements necessary to ensure consistency in terms of the \(\ell _2\)-estimation error, provided that the excess risk bound in Theorem 1 holds.

Excess risk bound under infinite variance conditions

Motivated by the log-truncated loss for mean estimation under the \((1+\varepsilon )\)-moment condition in29,36 and references therein, we employ a continuous, non-decreasing function \(\psi : \mathbb {R} \rightarrow \mathbb {R}\) as the truncation function, that is,

$$\begin{aligned} -\log \left[ 1-x+\lambda (|x|)\right] \le {\psi }(x) \le \log \left[ 1+x+\lambda (|x|)\right] , \quad \forall x \in \mathbb {R} \end{aligned}$$
(13)

where \(\lambda (x)>0\) is a higher-order function of |x| that satisfies29:

  • (C.1) The function \(\lambda (x): \mathbb {R}_+ \rightarrow \mathbb {R}_+\) is continuous and non-decreasing with \(\lim _{x \rightarrow \infty } \frac{\lambda (x)}{x}=\infty\). Moreover, there exist some \(c_2>0\) and a function \(f: \mathbb {R}_+ \rightarrow \mathbb {R}_+\) such that

    • (C.1.1) \(\lambda (tx) \le f(t)\lambda (x)\) for all \(t, x \in \mathbb {R}_+\), where \(\lim _{t \rightarrow 0^{+}} {f(t)}/{t}=0\);

    • (C.1.2) \(\lambda (x+y)\le c_2[\lambda (x)+\lambda (y)]\) for all \(x, y \in \mathbb {R}_+\).

Throughout this section, we specifically choose \(\lambda (x)=|x|^{\beta }/{\beta }\) in (13), resulting in the following expression:

$$\begin{aligned} \psi _\beta (x)&:=\log [ {1 + x +|x|^{\beta }/{\beta }} ]\mathrm{{1(}}x \ge 0\mathrm{{)}} - \log [ {1 - x +|x|^{\beta }/{\beta }}]\mathrm{{1(}}x \le 0\mathrm{{)}}=\mathrm{{sign}}(x)\log (1 + |x| +|x|^{\beta }/{\beta }),~{\beta } \in (1,2]. \end{aligned}$$
(14)

According to (C.1), for sufficiently small values of x, we have \(\psi _\beta (x) \approx x\), whereas for larger values of x, \(\psi _\beta (x)\) is significantly smaller than x (\(\psi _\beta (x) \ll x\)). Based on (14), the log-truncated robust estimator \(\hat{\theta }\) is defined as

$$\begin{aligned} \hat{\theta }: = \mathop {\arg \min }\limits _{\theta \in {\Theta }} \frac{1}{{n\alpha }}\sum \limits _{i = 1}^n {\psi _\beta }[\alpha l({Y_i} , {{X}}_i^ \top \theta )]+\rho \Vert \theta \Vert _2^2, \end{aligned}$$
(15)

where \(l(y, x,\theta ):=k(x^ \top \theta )-y u(x^ \top \theta )\), and \(\rho>0\) is the penalty parameter.
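A minimal sketch of \(\psi _\beta\) and the penalized objective in (15) (helper names are ours; the canonical Poisson pair \(k(t)=e^t\), \(u(t)=t\) in the example is an illustrative assumption):

```python
import numpy as np

def psi_beta(x, beta=1.5):
    """Log-truncation for the (1+eps)-moment case:
    sign(x) * log(1 + |x| + |x|^beta / beta), beta in (1, 2]."""
    return np.sign(x) * np.log1p(np.abs(x) + np.abs(x)**beta / beta)

def truncated_objective(theta, X, Y, alpha, rho, beta, k, u):
    """Objective of (15): (n*alpha)^{-1} sum psi_beta(alpha * l_i) + rho * ||theta||^2,
    with l_i = k(X_i^T theta) - Y_i * u(X_i^T theta)."""
    t = X @ theta
    l = k(t) - Y * u(t)
    return psi_beta(alpha * l, beta).sum() / (len(Y) * alpha) + rho * theta @ theta

# example with k(t) = exp(t) and u(t) = t
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)); theta = rng.uniform(size=4)
Y = rng.poisson(np.exp(X @ theta)).astype(float)
print(truncated_objective(theta, X, Y, alpha=0.05, rho=0.01, beta=1.5,
                          k=np.exp, u=lambda t: t))
```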

Next, following the theoretical framework of29, we present the second result, which provides guidance on the appropriate order of the tuning parameter \(\alpha\) in Corollary 2. Our proposed tuning parameter is variance-dependent, ensuring that the SGD optimization remains computationally feasible for a wide class of loss functions.

Corollary 2

(Excess risk bound under infinite variance conditions) Let \(\hat{\theta }\) be defined by (15) with \(\beta \in (1,2]\) and \({\theta _n^*}\) be given by (1). Under (G.0)-(G.4), and with the tuning parameter \(\alpha \le \frac{1}{n^{1/\beta }}\left( \frac{\log ({\delta ^{ - 2}}) + p\log \left( {1 + 2rn} \right) }{[(2^{{\beta }-1} + 1)\sup _{\theta \in \Theta } R_{\lambda \circ l}^n(\theta )]}\right) ^{1/\beta }\), we have with probability at least \(1 - 2\delta\),

$$\begin{aligned} {R_l^n}(\hat{\theta }) - {R_l^n}({\theta _n^*})&\le \frac{2}{n}{\mathrm{{E}}_n}\{[|Y|{g_{r}}(X)+{h_{r}}(X)]\Vert X \Vert _2\} +\frac{1}{{{n^{{{\frac{\beta - 1}{\beta } }}}}}}{\left( {\frac{{\log ({\delta ^{ - 2}}) + p\log \left( {1 + 2rn} \right) }}{{{2^{\beta - 1}} + 1}}} \right) ^{{\textstyle {{\beta - \mathrm{{1}}} \over \beta }}}}\\&\cdot \left[ {\frac{{{2^{\beta - 1}}{\mathrm{{E}}_n}\{ [(|Y|{g_{r}}(X)+{h_{r}}(X))\Vert X \Vert _2]^\beta \} }}{{{n^\beta }{{[ {\sup _{\theta \in \Theta }}R_{\lambda \circ l}^n(\theta )]}^{{\textstyle {{\beta - \mathrm{{1}}} \over \beta }}}}}} + 2({2^{\beta - 1}} + 1){{[\mathop {\sup }\limits _{\theta \in \Theta } R_{\lambda \circ l}^n(\theta )]}^{{\beta ^{ - 1}}}}} \right] +\rho \Vert {\theta _n^*}\Vert _2^2, \end{aligned}$$

where we allow \(p=p_n\) to grow slowly with n.

Remark 5

If Condition (G.4) does not hold, we additionally assume that \({\mathrm{{E}}_n}\{[|Y|{g_{r}}(X)+{h_{r}}(X)]\Vert X \Vert _2\} = o(n)\), \({\mathrm{{E}}_n}\{ [(|Y|{g_{r}}(X)+{h_{r}}(X))\Vert X \Vert _2]^\beta \} = O({n^\beta })\) and \(p\log n = o(n).\) Under these conditions, the consistency of the excess risk holds, i.e.,

\({R_l^n}(\hat{\theta }) - {R_l^n}({\theta _n^*})=O\left( {{{\left( {\frac{{p\log n}}{n}} \right) }^{ \frac{{\beta - 1}}{\beta }}}}+\rho \right) =o_p(1),\) provided that \(\rho =o(1)\).

By the definition of the true risk minimization (1), we have \({R_l^n}(\theta ) - {R_l^n}({\theta _n^*})\ge 0\) for all \(\theta\) such that \(\Vert \theta \Vert _{\ell _2} \le r\). Furthermore, a lower bound on the excess risk in terms of the \(\ell _2\)-error is indispensable.

  • (G.7) Hessian condition for the risk function:

    \({R_l^n}(\theta ) - {R_l^n}({\theta _n^*})\ge C_L{\left\| {\theta - \theta _n ^*} \right\| _{\ell _2}^2}\) for all \(\theta \in \Theta\), where \(C_L\) is a positive constant.

Suppose that \({R_l^n}(\theta )\) admits a Taylor expansion at \({\theta _n^*}\), i.e.,

$$\begin{aligned}{R_l^n}(\theta ) - {R_l^n}(\theta _n^*) = {\nabla {R}_l^n}(\theta _n^*)^\top (\theta - \theta _n^*) + \frac{1}{2}{(\theta - \theta _n^*)^\top }{\nabla ^2 {R}_l^n}(t\theta _n^* + (1 - t)\theta )(\theta - \theta _n^*),\end{aligned}$$

for some \(t\in (0, 1)\). Since \(\nabla {R}_l^n(\theta _n^*)=0\) by the definition of the true risk minimization (1), the first-order term vanishes. Under (G.0), Condition (G.7) holds naturally if the Hessian matrix \({\nabla ^2{R}_l^n}(t\theta _n^* + (1 - t)\theta )\) is positive definite under the restriction \(\left\| {t\theta _n^* + (1 - t)\theta } \right\| \le \left\| {\theta _n^*} \right\| + \left\| {\theta } \right\| \le 2r\). Combining this with (G.7), Corollary 2 immediately yields the \(\ell _2\)-risk bounds.

Corollary 3

Under the same conditions as in Corollary 2, if (G.7) holds with a constant \({C_L}\), then with probability at least \(1 - 2\delta\), we have

$$\begin{aligned} {\Vert {\hat{\theta }- \theta _n^*} \Vert _{\ell _2}^2}&\le \frac{2}{{C_L}n}{\mathrm{{E}}_n}\{[|Y|{g_{r}}(X)+{h_{r}}(X)]\Vert X \Vert _2\} +\frac{1}{{{n^{{{\frac{\beta - 1}{\beta } }}}}}}{\left( {\frac{{\log ({\delta ^{ - 2}}) + p\log \left( {1 + 2rn} \right) }}{{{2^{\beta - 1}} + 1}}} \right) ^{{\textstyle {{\beta - \mathrm{{1}}} \over \beta }}}}\\&\cdot \left[ {\frac{{{2^{\beta - 1}}{\mathrm{{E}}_n}\{ [(|Y|{g_{r}}(X)+{h_{r}}(X))\Vert X \Vert _2]^\beta \} }}{{{C_L}{n^\beta }{{[ {\sup _{\theta \in \Theta }}R_{\lambda \circ l}^n(\theta )]}^{{\textstyle {{\beta - \mathrm{{1}}} \over \beta }}}}}} + 2({2^{\beta - 1}} + 1){{[\mathop {\sup }\limits _{\theta \in \Theta } R_{\lambda \circ l}^n(\theta )]}^{{\beta ^{ - 1}}}}} \right] +\frac{\rho \Vert {\theta _n^*}\Vert _2^2}{{C_L}}, \end{aligned}$$

i.e., \({\Vert {\hat{\theta }- \theta _n^*} \Vert _{\ell _2}^2}=O_p( {{{( {\frac{{p\log n}}{n}})}^{\frac{{\beta - 1}}{\beta }}}}+\rho )\).

The SGD algorithm

SGD for Parameter Estimations

Consider the general log-truncated regularized optimization with an \(\ell _2\)-regularization penalty:

$$\begin{aligned} \hat{\theta }_n(\alpha ,\rho ): =\mathop {\arg \min }\limits _{\theta \in {\Theta }} \left\{ \frac{1}{{n\alpha }}\sum _{i = 1}^n {\psi }[\alpha l({Y_i} , {X}_i^\top \theta )]+\rho \Vert \theta \Vert _{\ell _2}^2\right\} , \end{aligned}$$
(16)

where \(l(y,x^\top \theta ):=b(u(x^ \top \theta ))-y u(x^ \top \theta )\) and \({\psi }(x)\) satisfies (13). Here \(\rho> 0\) serves as the penalty parameter to control model complexity, and \(\alpha> 0\) is a robust tuning parameter that needs to be calibrated appropriately. In our simulations, the optimization problem is addressed using SGD, which is implemented as follows:

$$\begin{aligned} \theta _{t+1} = \theta _t - \frac{r_t}{\alpha } \nabla _{\theta } {\psi } \left[ \alpha l(Y_{i_t}, X_{i_t}^\top \theta _t) \right] - 2r_t\rho \theta _t, \quad t = 0, 1, 2, \ldots , \end{aligned}$$
(17)

where \(i_t\) is an index randomly sampled from the data, and \(\{r_t\}\) is the learning rate sequence.

The SGD algorithm aims to approximate the minimizer of the empirical risk by iteratively updating the parameter vector \(\theta\) in the direction opposite to the gradient of the objective function. This iterative approach provides a computationally efficient way to solve the non-convex optimization problem, particularly in robust learning scenarios.
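As a minimal sketch of the recursion (17) (all function names are ours; the per-sample gradient assumes the canonical link \(u(t)=t\), so that \(\nabla _\theta l = X_i(\dot{b}(X_i^\top \theta )-Y_i)\), and the logistic pair used in the demo is an illustrative assumption):

```python
import numpy as np

def dpsi_beta(x, beta=1.5):
    """Derivative of psi_beta(x) = sign(x)*log(1 + |x| + |x|^beta/beta)."""
    a = np.abs(x)
    return (1.0 + a**(beta - 1.0)) / (1.0 + a + a**beta / beta)

def sgd_log_truncated(X, Y, alpha, rho, beta=1.5, T=20000, r0=0.05, seed=0,
                      b=np.exp, db=np.exp):
    """Iteration (17) for the canonical link: l(y, t) = b(t) - y*t, dl/dt = db(t) - y."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for t in range(1, T + 1):
        i = rng.integers(n)                       # random index i_t
        lin = X[i] @ theta
        l = b(lin) - Y[i] * lin                   # per-sample quasi-GLM loss
        grad = dpsi_beta(alpha * l, beta) * (db(lin) - Y[i]) * X[i]
        r_t = r0 / np.sqrt(t)                     # decaying learning rate
        theta = theta - r_t * grad - 2.0 * r_t * rho * theta
    return theta

# toy logistic example: b(t) = log(1 + e^t), db(t) = sigmoid(t)
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5)); theta_star = rng.uniform(size=5)
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)
theta_hat = sgd_log_truncated(X, Y, alpha=0.05, rho=0.001,
                              b=lambda t: np.logaddexp(0.0, t),
                              db=lambda t: 1.0 / (1.0 + np.exp(-t)))
print(np.round(theta_hat, 3))
```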

To select the two tuning parameters, \(\alpha\) and \(\rho\), we employ a five-fold cross-validation (CV) procedure to identify the optimal parameter pair \((\alpha , \rho )\) within an effective subset of \(\mathbb {R}_{+}^2\). The selection criterion is the mean absolute error (MAE) between the observed outputs \(Y_i\) and the estimated outputs \(\hat{Y}_i\), computed as \(\frac{1}{n_0}\sum _{i=1}^{n_0}|\hat{Y}_i(\alpha ,\rho )-Y_i|.\)
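A minimal sketch of this selection step, under the assumption that `fit` is any solver of (16) (for example the SGD sketch above) and `predict_mean` returns the fitted means \(\dot{b}(u(X^\top \hat{\theta }))\); the grid and fold logic below are ours:

```python
import numpy as np

def five_fold_cv(X, Y, alphas, rhos, fit, predict_mean, seed=0):
    """Pick (alpha, rho) minimizing the 5-fold cross-validated mean absolute error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), 5)
    best, best_mae = None, np.inf
    for alpha in alphas:
        for rho in rhos:
            maes = []
            for k in range(5):
                test = folds[k]
                train = np.concatenate([folds[j] for j in range(5) if j != k])
                theta = fit(X[train], Y[train], alpha=alpha, rho=rho)
                maes.append(np.mean(np.abs(predict_mean(X[test], theta) - Y[test])))
            if np.mean(maes) < best_mae:
                best, best_mae = (alpha, rho), np.mean(maes)
    return best, best_mae
```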

For comparison, we also consider the standard ridge regression without truncation, where the optimization problem is \(\hat{\theta }_n(\rho ): =\textrm{argmin }_{\theta \in {\Theta }}\{\frac{1}{{n}}\sum _{i = 1}^n l({Y_i}, {{X}}_i^\top \theta )+\rho \Vert \theta \Vert _{\ell _2}^2\}\), where \(\rho\) is the regularization parameter. The corresponding SGD iterations for solving this optimization problem are given by: \(\theta _{t+1}=\theta _{t}-r_t \nabla _{\theta }l(Y_{i_t}, X_{i_t}^\top \theta _{t}) - 2r_t\rho \theta _t, t=0,1,2,\cdots\), where the penalty parameter \(\rho\) is also selected via cross-validation. This approach ensures that the model complexity is controlled while minimizing the empirical risk, thereby providing a balance between bias and variance.

Iteration Complexity of SGD

In this section, we study the iteration complexity of SGD for minimizing the proposed log-truncated loss functions. Our results show that the number of iterations required to make the average gradient norm less than \(\epsilon\) is \(O(\epsilon ^{-4})\).

Recall that the gradient of the empirical log-truncated loss function is

$$\begin{aligned} \nabla {{{\hat{R}}}_{\psi _{\beta } \circ l}}(\theta ) = \frac{1}{n}\sum \limits _{i = 1}^nX_i\dot{u}(X_i^\top \theta )[\dot{b}(u(X_i^\top \theta )) - Y_i] {\dot{\psi }^\lambda [\alpha (b(u(X_i^\top \theta ))-Y_i u(X_i^\top \theta ) )]}. \end{aligned}$$
(18)

The Hessian matrix of the empirical log-truncated GLM’s loss function is

$$\begin{aligned} \nabla ^2 {{{\hat{R}}}_{\psi \circ l}}(\theta )&= \frac{1}{n}\sum \limits _{i = 1}^n {X_i X_i^\top \{ \ddot{u}(X_i^\top \theta )[\dot{b}(u(X_i^\top \theta ))-Y_i] + {{\dot{u}}^2}(X_i^\top \theta )\ddot{b}(u(X_i^\top \theta ))\} \dot{\psi }^\lambda [\alpha l(Y_i,X_i^\top \theta )]}\\&+\frac{\alpha }{n}\sum \limits _{i = 1}^n {X_i X_i^\top [\dot{u}(X_i^\top \theta )[Y_i - \dot{b}(u(X_i^\top \theta ))]]^2 \ddot{\psi }^\lambda [\alpha l(Y_i,X_i^\top \theta )]}, \end{aligned}$$

where \(l(y, x^\top \theta )=b(u(x^\top \theta ))-y u(x^\top \theta )\). A bound on the Hessian matrix is required to establish the iteration complexity of SGD.

We establish the iteration complexity theory for SGD, based on the second-moment assumption of the gradient function of data33,37. This assumption ensures that the gradient’s variance is controlled, allowing the algorithm to converge efficiently. The results indicate that, under appropriate conditions, the SGD algorithm can reliably minimize the log-truncated loss function, even in non-convex settings.

Theorem 2

Let \(F_n(\theta ):= \frac{1}{{n\alpha }}\sum _{i = 1}^n {\psi }[\alpha l({Y_i}, {X}_i^\top \theta )]+\rho \Vert \theta \Vert _{\ell _2}^2\) be the penalized empirical log-truncated loss function. Define the input vector \(\varvec{X}:= \{X_i\}_{i=1}^n\) and \(\theta ^*: = \mathop {\arg \min }\nolimits _{\theta \in {\Theta }}\textrm{E} [F_n(\theta )\mid \varvec{X}]\), where \(l(y, x^\top \theta )\) is the quasi-GLMs loss function. Suppose the following high-probability Hessian matrix condition holds:

$$\begin{aligned}P(\Vert \nabla ^2 F_n(\theta )\Vert \le J_\epsilon (X) \mid \varvec{X}) \ge 1-\varepsilon ,~\varepsilon \in (0,1),~\theta \in \Theta ,\end{aligned}$$

where \(\Vert \cdot \Vert\) is the spectral norm of the matrix. Assume the following moment condition on the gradient function:

$$\begin{aligned} \textrm{E}&\left[ \left\| \nabla _\theta \frac{1}{{\alpha }}\psi ^{\lambda }[\alpha l(Y_{i_t}, X_{i_t}^\top \theta _{t})] - \frac{1}{{n\alpha }}\sum \nolimits _{i = 1}^n \nabla _\theta {\psi }[\alpha l({Y_i} , {X}_i^\top \theta )]\right\| _{\ell _2}^2\mid \varvec{X}\right] \le \sigma _{\alpha }^2(X)\nonumber \\&\text {and}~~~~{J_\epsilon (X) r_{t}} \textrm{E}[ \Vert \nabla F_n(\theta _t )\Vert _{\ell _2}^{2} \mid \varvec{X}]+\frac{1}{2} \textrm{E}[\Vert \nabla F_n(\theta _t )\Vert _{\ell _2}^{2}\mid \varvec{X}]\nonumber \\&\le \textrm{E}[\nabla \{\frac{1}{{n\alpha }}\sum _{i = 1}^n {\psi }[\alpha l({Y_i} , {X}_i^\top \theta )]+\rho \Vert \theta \Vert _{\ell _2}^2\}^\top \nabla _\theta \{\psi _{\lambda }[\alpha l(Y_{i_t}, X_{i_t}^\top \theta _{t})]/{\alpha } + \rho \Vert \theta _t\Vert _{\ell _2}^2\} \mid \varvec{X}] \end{aligned}$$
(19)

holds for a randomly selected data \((Y_{i_t}, X_{i_t})\) from \(\{(Y_{i}, X_{i})\}_{i=1}^n\).

  1. (a)

    If \(r_t\) satisfies (19) in the SGD algorithm (17), then with probability at least \(1-\varepsilon\)

    $$\begin{aligned} \textrm{E}_R[\textrm{E}[\Vert \nabla F_n(\theta _R )\Vert ^{2}\mid {\varvec{X}}]] \le \frac{2J_\epsilon (X) \textrm{E}[F_n(\theta _{1}) - F_n(\theta ^*)\mid \varvec{X}]}{\sqrt{T}} + \frac{\sigma _{\alpha }^2(X)}{{\sqrt{T}}}, \end{aligned}$$
    (20)

    where \(R \sim U([T])\) and \(U([T])\) denotes the discrete uniform distribution on \([T]=\{1,\ldots ,T\}\).

  2. (b)

    Under the setting in (a), if \(T = O\left( \frac{\max \{\sigma _{\alpha }^4(X), J_\epsilon (X)^2\}}{\epsilon ^4}\right)\) with \(\epsilon> 0\), then the SGD sequence \(\{\theta _t\}_{t \ge 1}\) finds an approximate stationary point with probability at least \(1-\varepsilon\), in the sense that

    \(\textrm{E}_R[\textrm{E}[\left\| \nabla F_n(\theta _R )\right\| _{\ell _2}^{2}\mid \varvec{X}]] \le \epsilon ^2\) for \(R \sim U([T])\).

Remark 6

The high-probability Hessian bound \(\Vert \nabla ^2 F_n(\theta )\Vert \le J_\epsilon (X)\) is a local smoothness condition that yields the one-step descent inequality underlying (20). The moment condition in (19) upper-bounds the conditional variance of the stochastic gradient; the log-truncation (via \(\psi\) and the scale \(\alpha\)) makes \(\sigma _\alpha ^2(X)\) finite even under heavy tails. When the data are i.i.d., we set \(r_t = 1/(J_\epsilon (X) \sqrt{t})\) in the SGD algorithm (17) to satisfy (19) [see the proof in A.5].
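As a minimal sketch of the resulting schedule (treating \(J_\epsilon (X)\) and \(\sigma _\alpha (X)\) as known constants is an assumption; in practice they come from the high-probability Hessian bound and the gradient-variance bound):

```python
import numpy as np

def sgd_budget(J, sigma_alpha, eps):
    """Iteration budget T = O(max(sigma_alpha^4, J^2) / eps^4) from Theorem 2(b)."""
    return int(np.ceil(max(sigma_alpha**4, J**2) / eps**4))

def step_size(J, t):
    """Learning rate r_t = 1 / (J * sqrt(t)) used to verify condition (19) (i.i.d. case)."""
    return 1.0 / (J * np.sqrt(t))

T = sgd_budget(J=5.0, sigma_alpha=2.0, eps=0.3)
print(T, step_size(J=5.0, t=1), step_size(J=5.0, t=T))
```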

Compared with standard SGD analyses37 that assume uniformly bounded second moments of stochastic gradients, the log-truncated loss induces bounded effective gradients (captured by \(\sigma _\alpha ^2(X)\)) without requiring sub-Gaussian tails. This is analogous in spirit to gradient clipping schemes, but keeps the update as an unbiased gradient of a smoothly modified objective.

The theorem matches the state-of-the-art nonconvex SGD iteration complexity under smoothness and bounded variance, while extending it to heavy-tailed settings via log-truncation of loss function.

Simulations

In this section, we conduct simulations to evaluate the performance of our proposed robust GLMs. The experiments compare the estimation errors of log-truncated logistic and negative binomial regressions against those of standard models. All models are optimized via SGD and evaluated under various data-generating processes.

For these two types of GLMs, we first generate their covariates, assuming that the covariates are contaminated by heavy-tailed distributions. Specifically, the covariates \(\{X_i\}_{i=1}^n\) for each \(X_i \in \mathbb {R}^p\) can be expressed as \(X_i=X_i'+\xi _i\), where \(\{X_i'\}_{i=1}^n\) are \(\mathbb {R}^p\)-valued random vectors sampled from different normal distributions \(N(\textbf{0},{\textbf{Q}}(\varsigma ))\), each characterized by a distinct covariance structure. The covariance matrix \({\textbf{Q}}(\varsigma )\) takes the form of either an identity matrix (when \(\varsigma = 0\)) or a Toeplitz matrix (when \(\varsigma \ne 0\)). The Toeplitz matrix is given by

$$\begin{aligned} {\textbf{Q}}(\varsigma )=\left[ \begin{array}{cccccc} 1 & \varsigma & \varsigma ^2 & \cdots & \cdots & \varsigma ^{p-1} \\ \varsigma & 1 & \varsigma & \ddots & & \vdots \\ \varsigma ^2 & \varsigma & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \varsigma & \varsigma ^2 \\ \vdots & & \ddots & \varsigma & 1 & \varsigma \\ \varsigma ^{p-1} & \cdots & \cdots & \varsigma ^2 & \varsigma & 1 \end{array}\right] . \end{aligned}$$

In our simulations, we set \(\varsigma\) to 0, 0.3, and 0.5. For a heterogeneous sample set \(\{X_i'\}_{i=1}^n\), \(\lceil n/3 \rceil\) samples are drawn from \(N({\textbf{0}}, {\textbf{I}})\), \(\lceil n/3 \rceil\) samples from \(N({\textbf{0}}, {\textbf{Q}}(0.3))\), and \(\lfloor n/3 \rfloor\) samples from \(N({\textbf{0}}, {\textbf{Q}}(0.5))\). The noise terms \(\{\xi _i\}_{i=1}^n\) are i.i.d. \(\mathbb {R}^p\)-valued random vectors, with each component of \(\xi _i = (\xi _{i1}, \dots , \xi _{ip})^\top\) independently drawn from a Pareto distribution with a scale parameter of 1 and shape parameters \(\tau \in \{1.6, 1.8, 2.01, 4.01, 6.01\}\). We set \(\beta = 1.5\) if \(\tau < 2\), and \(\beta = 2\) if \(\tau \ge 2\). The true regression coefficients for both GLMs are sampled independently from a uniform distribution on [0, 1]. Tables 1 and 2 present the average \(\ell _2\)-estimation errors and standard errors over 100 replications for the logistic regression coefficient with and without regularization, respectively.
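A minimal sketch of this data-generating process (function names are ours; the classical Pareto draw with scale 1 and shape \(\tau\) is obtained from numpy's Lomax sampler by adding 1):

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_cov(p, varsigma):
    """Q(varsigma) with entries varsigma^{|i-j|}; identity when varsigma = 0."""
    return toeplitz(varsigma ** np.arange(p))

def simulate_covariates(n, p, tau, seed=0):
    """Heterogeneous Gaussian blocks plus i.i.d. Pareto(scale=1, shape=tau) noise."""
    rng = np.random.default_rng(seed)
    sizes = [int(np.ceil(n / 3)), int(np.ceil(n / 3))]
    sizes.append(n - sum(sizes))                      # remaining samples go to the third block
    blocks = [rng.multivariate_normal(np.zeros(p), toeplitz_cov(p, s), size=m)
              for s, m in zip([0.0, 0.3, 0.5], sizes)]
    X_clean = np.vstack(blocks)
    xi = rng.pareto(tau, size=(n, p)) + 1.0           # classical Pareto noise, scale 1
    return X_clean + xi

X = simulate_covariates(n=100, p=50, tau=1.8)
print(X.shape, np.round(X.mean(), 3))
```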

Table 1 compares the average \(\ell _2\)-estimation errors and standard errors for logistic regression under Pareto-distributed noise across various settings of the noise tail parameter \(\tau\), robust loss parameter \(\beta\), and dataset dimensions (np). The results demonstrate that truncation significantly reduces \(\ell _2\)-estimation errors compared to non-truncation, particularly for heavy-tailed noise (\(\tau \le 2.01\)) and smaller datasets (\((n, p) = (100, 50), (200, 100)\)). For instance, at \(\tau = 2.01\), \(\beta = 2.0\), and \((n, p) = (100, 50)\), truncation yields an error of 0.351, while non-truncation results in 0.503, a 30.2% reduction. Optimal performance is observed with larger datasets, such as \((n, p) = (700, 300)\) and (1000, 1000), where errors drop to 0.169–0.185 under truncation. For heavy-tailed noise (\(\tau = 1.60, 1.80, 2.01\)), \(\beta = 2.0\) consistently outperforms \(\beta = 1.5\) for smaller datasets, reducing errors (e.g., 0.351 vs. 0.369 at \(\tau = 1.60\), \((n, p) = (100, 50)\)). For lighter-tailed noise (\(\tau \ge 2.01\)), \(\beta = 2.0\) achieves comparable or better performance, particularly for larger datasets.

Table 1 Average \(\ell _2\)-estimation errors (standard errors) for logistic regression under Pareto noise with regularization.

Table 2 highlights the superior performance of the robust method (truncation) in logistic regression under Pareto noise, with low and stable \(\ell _2\)-estimation errors (0.318–0.340) and standard errors (0.008–0.013) across all settings. Without regularization, the non-truncation method struggles under heavy-tailed noise (\(\tau = 1.60, 1.80\)), exhibiting high errors (up to 63.558) and variability (up to 5.744), but improves for lighter-tailed noise (\(\tau = 4.01, 6.01\)). The robust method is preferable for heavy-tailed noise, while the ridge penalty becomes competitive as the noise tails lighten. It should be noted that logistic regression behaves poorly without the ridge penalty and truncation, since the Hessian matrix of the optimization problem under heavy-tailed inputs is unbounded and the iterations are therefore unstable.

Table 2 Average \(\ell _2\)-estimation errors (standard errors) for logistic regression under Pareto noise without regularization.

Tables 3 and 4 present the average \(\ell _2\)-estimation and standard errors of the negative binomial regression parameters \(\theta _n^*\) with and without regularization, respectively. Table 3 reports average \(\ell _2\)-estimation errors for robust negative binomial regression under Pareto noise, varying the noise tail parameter \(\tau\), robust loss parameter \(\beta\), and dataset dimensions \((n, p)\). Consistent with Table 1, truncation significantly outperforms non-truncation, especially for heavy-tailed noise (\(\tau \le 2.01\)) and smaller datasets. For instance, at \(\tau = 2.01\), \(\beta = 2.0\), and \((n, p) = (100, 50)\), truncation achieves an error of 0.250 versus 0.300 for non-truncation (16.7% reduction). Lowest errors occur at \((n, p) = (700, 300)\) and \((1000, 1000)\), with truncation errors of 0.330–0.358. For heavy-tailed noise (\(\tau = 1.60, 1.80\)), \(\beta = 1.5\) slightly outperforms \(\beta = 2.0\) in smaller datasets (e.g., 0.263 vs. 0.250 at \(\tau = 1.60\), \((n, p) = (100, 50)\)), while \(\beta = 2.0\) performs comparably for lighter-tailed noise (\(\tau \ge 4.01\)).

Table 3 Average \(\ell _2\)-estimation errors (standard errors) for negative binomial under Pareto noise with regularization.

From Table 4, it can be seen that the robust method (truncation) slightly outperforms the non-robust method (non-truncation), showing lower \(\ell _2\)-errors (0.284–0.365 vs. 0.316–0.392) and standard errors (0.038–0.172 vs. 0.056–0.296). The differences relative to the regularized scenario are likely minimal due to the inherent stability of the robust method through truncation and the negative binomial model’s capability to manage overdispersed noise. In comparison to logistic regression, the negative binomial model demonstrates more consistent performance across noise levels, thereby reducing the need for regularization to achieve stable estimates.

Table 4 Average \(\ell _2\)-estimation errors (standard errors) for negative binomial under Pareto noise without regularization.

Real data analysis

To investigate count data regressions, we apply the robust negative binomial regressions to the German health care demand database, which is provided by the German Socioeconomic Panel (GSOEP). This dataset can be accessed at:

http://qed.econ.queensu.ca/jae/2003-v18.4/riphahn-wambach-million/.

The dataset contains 27,326 patient observations (each identified by a unique ID) collected over seven years: 1984, 1985, 1986, 1987, 1988, 1991, and 1994. The sample sizes for each year are {3874, 3794, 3792, 3666, 4483, 4340, 3377}.

For each patient, there are 21 covariates available for analysis, along with two alternative dependent variables: DOCVIS (the number of doctor visits in the three months preceding the survey) and HOSPVIS (the number of hospital visits in the last calendar year). Table 5 presents the descriptive statistics of the variables in the dataset. It is noteworthy that the distributions of some covariates, such as age, handper, hhninc, and educ, exhibit heavy tails, as indicated by their high kurtosis coefficients. Consequently, we employ our proposed robust NBR to separately explore the relationships between the dependent variables DOCVIS and HOSPVIS and the 21 covariates. One challenge with the NBR model is the unknown dispersion parameter \(\eta\). To address this, we first estimate \(\eta\) by maximizing the joint log-likelihood function of the model.

Once we obtain the estimated \(\hat{\eta }\), we incorporate it into the NBR optimization process. For each year, we randomly split the German health care demand dataset into a training set \((X_{\text {train}}, Y_{\text {train}})\), comprising 33% of the observations, and a testing set \((X_{\text {test}}, Y_{\text {test}})\), comprising the remaining 67%. We use the truncation function \(\lambda (|x|) = |x|^{\beta }/\beta\), where \({\beta } \in (1, 2]\) is incremented in steps of 0.1 to determine the optimal moment order. The training dataset is used to fit the NBR models, after which we compute the mean absolute errors (MAEs) of the predicted values of \(Y_{\text {test}}\) to evaluate model accuracy. A standard NBR is also trained using the same procedure.
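A minimal sketch of the per-year evaluation loop (the `fit` and `predict_mean` callables and the \(\beta\) grid handling are illustrative assumptions, not the exact GSOEP processing):

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def evaluate_year(X, Y, fit, predict_mean, betas=np.arange(1.1, 2.01, 0.1), seed=0):
    """33%/67% train-test split; fit the truncated NBR for each beta and report test MAE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    n_train = int(0.33 * len(Y))
    train, test = idx[:n_train], idx[n_train:]
    results = {}
    for beta in betas:
        theta = fit(X[train], Y[train], beta=round(float(beta), 1))
        results[round(float(beta), 1)] = mae(Y[test], predict_mean(X[test], theta))
    return results
```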

Table 5 Descriptive statistics of German health care demand database.
Table 6 Comparison of MAEs on German health care demand dataset.

Table 6 reports MAEs for NBR on the German health care demand dataset, comparing the non-truncation and log-truncation methods for doctor visits (DOCVIS) and hospital visits (HOSPVIS) across 1984–1994. Truncation consistently achieves lower MAEs than non-truncation for both outcomes. For DOCVIS, truncation reduces MAEs by 5.0–9.5%, e.g., in 1991 (\(\beta = 2.0\): 2.167 vs. 2.350, a 7.8% reduction). For HOSPVIS, truncation yields 49.7–50.7% lower MAEs, with the smallest error in 1984 (\(\beta = 1.9\): 0.127 vs. 0.256). Consistent with the synthetic datasets with heavy-tailed Pareto noise (cf. Table 3), where truncation also excels, these results suggest that truncation’s robustness extends to real-world health care data, likely due to effective handling of outliers or noise in the dataset.

Discussion

In this paper, we introduce log-truncated robust regression models specifically designed to address heavy-tailed contamination in both input and output data, with a particular emphasis on quasi-GLMs. Under independent non-identically distributed data, we derive sharp excess risk bounds for the proposed log-truncated ERM estimator, accommodating both finite and infinite variance conditions. The optimization of the log-truncated ERM estimator is carried out via SGD, and we analyze the iteration complexity of SGD in minimizing the associated non-convex loss functions. This ensures the computational feasibility of the proposed methods even for non-identically distributed datasets with potential outliers.

Our numerical simulations demonstrate that the proposed log-truncated quasi-GLMs consistently outperform standard approaches in terms of robustness, particularly in scenarios involving independent non-identically distributed data and heavy-tailed noise. By considering various covariance structures and Pareto-distributed noise, the simulation results reveal that our models achieve substantially lower estimation errors than conventional models.

Furthermore, we applied our robust NBR model to the German health care demand dataset, which is characterized by heavy-tailed covariates, to examine the relationship between medical visit frequencies and socio-economic factors. The results indicate that our truncated NBR model yields significantly smaller prediction errors than standard NBR models, underscoring the practical advantages of our approach. These findings demonstrate the effectiveness of our robust models in handling real-world data with heavy-tailed distributions. The practical relevance of our methods extends beyond healthcare to various domains involving count data regression, making them particularly valuable for applications in fields with unpredictable or non-Gaussian data behavior, such as finance, insurance, and epidemiology.