Lasso算法

統計學機器學習中,Lasso算法(英語:least absolute shrinkage and selection operator,又譯最小絕對值收斂和選擇算子、套索算法)是一種同時進行特徵選擇正則化(數學)的迴歸分析方法,旨在增強統計模型的預測準確性和可解釋性,最初由史丹佛大學統計學教授羅伯特·蒂布希拉尼於1996年基於Leo Breiman非負參數推斷(Nonnegative Garrote, NNG)提出[1][2]。Lasso算法最初用於計算最小平方法模型,這個簡單的算法揭示了很多估計量的重要性質,如估計量嶺迴歸(Ridge regression,也叫吉洪諾夫正則化)和最佳子集選擇的關係,Lasso係數估計值軟閾值(soft thresholding)之間的聯繫。它也揭示了當協變量共線時,Lasso係數估計值不一定唯一(類似標準線性迴歸)。

雖然最早是為應用最小平方法而定義的算法,lasso正則化可以簡單直接地拓展應用於許多統計學模型上,包括廣義線性模型廣義估計方程式成比例災難模型M-估計[3][4]。Lasso選擇子集的能力依賴於限制條件的形式並且有多種表現形式,包括幾何學貝氏統計,和凸分析

Lasso算法與基追蹤降噪聯繫緊密。

歷史來源

蒂布希拉尼最初使用Lasso來提高預測的準確性與迴歸模型的可解釋性,他修改了模型擬合的過程,在協變量中只選擇一個子集應用到最終模型中,而非用上全部協變量。這是基於有著相似目的,但方法有所不同的Breiman的非負參數推斷。

在Lasso之前,選擇模型中協變量最常用的方法是移步選擇,這種方法在某些情況下是準確的,例如一些協變量與模型輸出值有強相關性情況。然而在另一些情況下,這種方法會讓預測結果更差。在當時,嶺迴歸是提高模型預測準確性最常用的方法。嶺迴歸可以通過縮小大的迴歸係數來減少過擬合從而改善模型預測偏差。但是它並不選擇協變量,所以對模型的準確構建和解釋沒有幫助。

Lasso結合了上述的兩種方法,它通過強制讓迴歸係數絕對值之和小於某固定值,即強制一些迴歸係數變為0,有效地選擇了不包括這些迴歸係數對應的協變量的更簡單的模型。這種方法和嶺迴歸類似,在嶺迴歸中,迴歸係數平方和被強制小於某定值,不同點在於嶺迴歸只改變係數的值,而不把任何值設為0。

基本形式

Lasso最初為了最小平方法而被設計出來,Lasso的最小平方法應用能夠簡單明了地展示Lasso的許多特性。

最小平方

假設一個樣本包括N種事件,每個事件包括p個協變量和一個輸出值。讓 為輸出值,並且 為第i種情況的協變量向量,那麼Lasso要計算的目標方程式就是:

對所有  ,計算  [1]

這裡   是一個決定規則化程度的預定的自由參數。 設 協變量矩陣,那麼  ,其中   的第 i 行,那麼上式可以寫成更緊湊的形式:

對所有  ,計算  

這裡   是標準   範數  維的1的向量。

因為  ,所以有

 

對變量進行中心化是常用的數據處理方法。並且共變異數一般規範化為   ,這樣得到的解就不會依賴測量的規模。

它的目標方程式還可以寫為:

 

拉格朗日形式為:

 

其中    的關係取決於數據特徵。

正交協變量

現在考慮一些Lasso迴歸估計的基本性質。

首先假定所有的協變量都是正交的,即  ,其中 克羅內克δ函數。等價的矩陣寫法為  ,使用次梯度法可有如下的表達形式

  [1]

  用於表示軟閾值算子,當這個值非常小的時候為0。一個與之相近的記號 用來表示硬閾值算子,將較小的數值記為0的同時保留原有的較大數值。

與嶺迴歸相比較,其中嶺迴歸的目標在於最小化

 

即有

 

因此嶺迴歸是對OLS迴歸中所有的係數以一致的係數 縮放,並不會進行變量選擇。

同樣也可以對best subset selection算法進行比較,其目標在於最小化

 

其中   表示 "  norm",即0範數,被定義為該向量中非零元素的個數。在這個例子中,可以得到

 

其中   被稱為軟閾值算子,  為示性函數。

總的來說,Lasso估計量展現出了嶺迴歸和最佳子劃分算法的係數收縮的優點,使得部分係數為0。此外,在嶺迴歸全部使用一個常數係數縮放的時候,Lasso迴歸會將一個接近0的係數變為0。


相關協變量

對於一般的情況中,不同的協變量之間可能並不是獨立的,其中一種特例即為變量存在重複,例如變量j和變量k,有  。在這種情況下參數   的Lasso迴歸的估計量不是唯一確定的。

事實上,如果有一些   中存在  ,尋找一個 進行轉換,將 轉換為  的同時有   轉換為  ,並保留其他參數不變,此時Lasso迴歸具有有效的連續性質。一些基於Lasso迴歸的改進,例如彈性網絡正則化,旨在解決這個缺點。

一般形式

Lasso正則化可以擴展為其他目標函數,例如廣義線性模型,廣義估計方程式,比例風險模型和M估計。[1][5] 有目標函數

 

其中Lasso正則化迴歸給出了下面模型的估計量

 

在這裡只有 是一個懲罰項, 是一個自由變量,與最基本的模型中的   變量一樣。

算法解釋

幾何解釋

 
Forms of the constraint regions for lasso and ridge regression.

Lasso迴歸可以使得某些項係數為0,從幾何上來看,不同約束邊界形狀的嶺迴歸則不能。他們都可以解釋為最小化相同的目標函數

 

但是有不同的約束條件:在Lasso迴歸中為   而在嶺迴歸中為  

已隱藏部分未翻譯內容,歡迎參與翻譯

The figure shows that the constraint region defined by the   norm is a square rotated so that its corners lie on the axes (in general a cross-polytope), while the region defined by the   norm is a circle (in general an n-sphere), which is rotationally invariant and, therefore, has no corners. As seen in the figure, a convex object that lies tangent to the boundary, such as the line shown, is likely to encounter a corner (or a higher-dimensional equivalent) of a hypercube, for which some components of   are identically zero, while in the case of an n-sphere, the points on the boundary for which some of the components of   are zero are not distinguished from the others and the convex object is no more likely to contact a point at which some components of   are zero than one for which none of them are.

Making λ easier to interpret with an accuracy-simplicity tradeoff

The lasso can be rescaled so that it becomes easy to anticipate and influence the degree of shrinkage associated with a given value of  .[6] It is assumed that   is standardized with z-scores and that   is centered (zero mean). Let   represent the hypothesized regression coefficients and let   refer to the data-optimized ordinary least squares solutions. We can then define the Lagrangian as a tradeoff between the in-sample accuracy of the data-optimized solutions and the simplicity of sticking to the hypothesized values.[7] This results in

 

where   is specified below. The first fraction represents relative accuracy, the second fraction relative simplicity, and   balances between the two.

 
Solution paths for the   norm and   norm when   and  

Given a single regressor, relative simplicity can be defined by specifying   as  , which is the maximum amount of deviation from   when  . Assuming that  , the solution path can be defined in terms of  :

 

If  , the ordinary least squares solution (OLS) is used. The hypothesized value of   is selected if   is bigger than  . Furthermore, if  , then   represents the proportional influence of  . In other words,   measures in percentage terms the minimal amount of influence of the hypothesized value relative to the data-optimized OLS solution.

If an  -norm is used to penalize deviations from zero given a single regressor, the solution path is given by

 . Like  ,   moves in the direction of the point   when   is close to zero; but unlike  , the influence of   diminishes in   if   increases (see figure).
Given multiple regressors, the moment that a parameter is activated (i.e. allowed to deviate from  ) is also determined by a regressor's contribution to   accuracy. First,

 

An   of 75% means that in-sample accuracy improves by 75% if the unrestricted OLS solutions are used instead of the hypothesized   values. The individual contribution of deviating from each hypothesis can be computed with the   x   matrix

 

where  . If   when   is computed, then the diagonal elements of   sum to  . The diagonal   values may be smaller than 0 or, less often, larger than 1. If regressors are uncorrelated, then the   diagonal element of   simply corresponds to the   value between   and  .

A rescaled version of the adaptive lasso of can be obtained by setting  .[8] If regressors are uncorrelated, the moment that the   parameter is activated is given by the   diagonal element of  . Assuming for convenience that   is a vector of zeros,

 

That is, if regressors are uncorrelated,   again specifies the minimal influence of  . Even when regressors are correlated, the first time that a regression parameter is activated occurs when   is equal to the highest diagonal element of  .

These results can be compared to a rescaled version of the lasso by defining  , which is the average absolute deviation of   from  . Assuming that regressors are uncorrelated, then the moment of activation of the   regressor is given by

 

For  , the moment of activation is again given by  . If   is a vector of zeros and a subset of   relevant parameters are equally responsible for a perfect fit of  , then this subset is activated at a   value of  . The moment of activation of a relevant regressor then equals  . In other words, the inclusion of irrelevant regressors delays the moment that relevant regressors are activated by this rescaled lasso. The adaptive lasso and the lasso are special cases of a '1ASTc' estimator. The latter only groups parameters together if the absolute correlation among regressors is larger than a user-specified value.[6]

Bayesian interpretation

 
Laplace distributions are sharply peaked at their mean with more probability density concentrated there compared to a normal distribution.

Just as ridge regression can be interpreted as linear regression for which the coefficients have been assigned normal prior distributions, lasso can be interpreted as linear regression for which the coefficients have Laplace prior distributions. The Laplace distribution is sharply peaked at zero (its first derivative is discontinuous at zero) and it concentrates its probability mass closer to zero than does the normal distribution. This provides an alternative explanation of why lasso tends to set some coefficients to zero, while ridge regression does not.[1]

Convex relaxation interpretation

Lasso can also be viewed as a convex relaxation of the best subset selection regression problem, which is to find the subset of   covariates that results in the smallest value of the objective function for some fixed  , where n is the total number of covariates. The "  norm",  , (the number of nonzero entries of a vector), is the limiting case of "  norms", of the form   (where the quotation marks signify that these are not really norms for   since   is not convex for  , so the triangle inequality does not hold). Therefore, since p = 1 is the smallest value for which the "  norm" is convex (and therefore actually a norm), lasso is, in some sense, the best convex approximation to the best subset selection problem, since the region defined by   is the convex hull of the region defined by   for  .

應用

LASSO已被應用於經濟和金融領域,可以改善預測結果並選擇有時被忽視的變量。例如:公司破產預測[9]和高增長公司預測[10]

參見

參考文獻

  1. ^ 1.0 1.1 1.2 1.3 1.4 Tibshirani, Robert. 1996. 「Regression Shrinkage and Selection via the lasso」. Journal of the Royal Statistical Society. Series B (methodological) 58 (1). Wiley: 267–88. http://www.jstor.org/stable/2346178頁面存檔備份,存於網際網路檔案館).
  2. ^ Breiman, Leo. Better Subset Regression Using the Nonnegative Garrote. Technometrics. 1995-11-01, 37 (4): 373–384 [2017-10-06]. ISSN 0040-1706. doi:10.2307/1269730. (原始內容存檔於2020-06-08). 
  3. ^ Tibshirani, Robert. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996, 58 (1): 267–288 [2016-07-25]. (原始內容存檔於2020-11-17). 
  4. ^ Tibshirani, Robert. The Lasso Method for Variable Selection in the Cox Model. Statistics in Medicine. 1997-02-28, 16 (4): 385–395. ISSN 1097-0258. doi:10.1002/(sici)1097-0258(19970228)16:4%3C385::aid-sim380%3E3.0.co;2-3 (英語). [永久失效連結]
  5. ^ 引用錯誤:沒有為名為Tibshirani 1997的參考文獻提供內容
  6. ^ 6.0 6.1 Hoornweg, Victor. Chapter 8. Science: Under Submission. Hoornweg Press. 2018 [2023-08-08]. ISBN 978-90-829188-0-9. (原始內容存檔於2023-11-02). 
  7. ^ Motamedi, Fahimeh; Sanchez, Horacio; Mehri, Alireza; Ghasemi, Fahimeh. Accelerating Big Data Analysis through LASSO-Random Forest Algorithm in QSAR Studies. Bioinformatics. October 2021, 37 (19): 469–475. ISSN 1367-4803. PMID 34979024. doi:10.1093/bioinformatics/btab659. 
  8. ^ Zou, Hui. The Adaptive Lasso and Its Oracle Properties (PDF). 2006 [2023-08-08]. (原始內容存檔 (PDF)於2021-07-11). 
  9. ^ Shaonan, Tian; Yu, Yan; Guo, Hui. Variable selection and corporate bankruptcy forecasts. Journal of Banking & Finance. 2015, 52 (1): 89–100. doi:10.1016/j.jbankfin.2014.12.003 . 
  10. ^ Coad, Alex; Srhoj, Stjepan. Catching Gazelles with a Lasso: Big data techniques for the prediction of high-growth firms. Small Business Economics. 2020, 55 (1): 541–565. doi:10.1007/s11187-019-00203-3 .