
As per my understanding from books and Google searches:

GOSS (Gradient-based One-Side Sampling) is a novel sampling method that downsamples the instances on the basis of their gradients. As we know, instances with small gradients are already well trained (small training error), while those with large gradients are undertrained. A naive way to downsample would be to discard the instances with small gradients and focus solely on the instances with large gradients, but this would alter the data distribution. In a nutshell, GOSS retains the instances with large gradients while performing random sampling on the instances with small gradients. (Source)

LightGBM uses a novel technique, Gradient-based One-Side Sampling (GOSS), to filter the data instances used for finding a split value, while XGBoost uses a pre-sorted algorithm and a histogram-based algorithm for computing the best split.
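To make my current understanding concrete, here is a rough sketch of the sampling idea as I picture it (the array names and fractions are just illustrative placeholders, not LightGBM's actual implementation):

```python
import numpy as np

# Rough sketch of the GOSS idea as I understand it: keep every instance
# whose gradient is large, and only a random sample of the rest.
gradients = np.abs(np.random.randn(1000))      # stand-in for per-instance gradients
order = np.argsort(-gradients)                 # largest gradient first

keep_frac, sample_frac = 0.2, 0.1              # illustrative fractions
n_keep = int(keep_frac * len(gradients))
large_idx = order[:n_keep]                     # always kept
small_idx = np.random.choice(order[n_keep:],
                             size=int(sample_frac * len(gradients)),
                             replace=False)    # randomly sampled
subset = np.concatenate([large_idx, small_idx])
```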

Can someone please explain the math behind GOSS?


2 Answers


Wang et al. (2019) provide a nice and clear explanation. Please check out their paper to find the answer you are looking for:

Part II. BAYESIAN OPTIMIZED LIGHTGBM

Section: A. The Principle of the LightGBM

Wang, R., Liu, Y., Ye, X., Tang, Q., Gou, J., Huang, M., & Wen, Y. (2019). Power System Transient Stability Assessment Based on Bayesian Optimized LightGBM. 2019 IEEE 3rd Conference on Energy Internet and Energy System Integration (EI2). doi: 10.1109/ei247390.2019.9062027


GOSS is a technique used in LightGBM to select a subset of the data to use when training a gradient-boosting model. It works by first sorting the training instances by the absolute values of the gradients of the loss function with respect to the current model's predictions, and then selecting a subset of the data based on the magnitude of these gradients.

To understand the math behind GOSS in more detail, it helps to first review some key concepts behind gradient boosting. In gradient boosting, a model is trained by iteratively adding weak learners (e.g., decision trees), with each new learner being fit to the negative gradients of the loss with respect to the current ensemble's predictions (for squared-error loss, these are just the residual errors of the previous learners). This process continues until the model reaches a predefined stopping criterion.
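As a minimal sketch of that loop, assuming squared-error loss (so the negative gradients are simply the residuals) and using scikit-learn's DecisionTreeRegressor purely for convenience:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Toy gradient boosting with squared-error loss: each new tree is fit
    to the negative gradients of the loss, which here are just the residuals."""
    base = float(np.mean(y))                  # initial constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                  # negative gradient of 0.5 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_gbm(base, trees, X, learning_rate=0.1):
    """Sum the base value and the scaled contribution of every tree."""
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```

GOSS plugs into the step where each tree is fit: instead of fitting on all the rows, it fits on a gradient-aware subsample.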

At each iteration, GOSS selects a subset of the training data based on the gradients of the loss function with respect to the current model.

In detail, GOSS uses a two-step process to determine the subset of data instances to consider for split points (a rough code sketch follows the list):

  1. For each data instance, the algorithm computes the absolute value of its gradient and sorts the instances by it. The sorted list is then divided into two parts: the top (i.e. largest) $K$ gradients and the bottom (i.e. smallest) $N-K$ gradients. The top $K$ are treated as the large gradients and the bottom $N-K$ as the small gradients.

  2. For the large gradients, the algorithm includes all of the corresponding data instances in the subset considered for split points. For the small gradients, it randomly samples a fixed fraction of the data instances to include in the subset. The two fractions are controlled by sampling-ratio parameters (in LightGBM, top_rate for the large-gradient share and other_rate for the small-gradient share), and the sampled small-gradient instances are up-weighted by a constant factor $\frac{1-a}{b}$, where $a$ and $b$ are the two fractions, so that the estimated information gain stays approximately unbiased.
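Here is a minimal numpy sketch of that selection step, along the lines of Algorithm 2 in the LightGBM paper (the function name and defaults are mine; `top_rate` and `other_rate` mirror LightGBM's GOSS parameters of the same names):

```python
import numpy as np

def goss_select(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Keep the top_rate fraction of instances with the largest |gradient|,
    randomly sample an other_rate fraction of the remainder, and up-weight
    the sampled small-gradient instances by (1 - top_rate) / other_rate."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))        # largest |gradient| first
    top_idx = order[:n_top]                       # large gradients: always kept
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    idx = np.concatenate([top_idx, other_idx])
    weights = np.ones(len(idx))
    weights[n_top:] = (1.0 - top_rate) / other_rate   # amplification factor
    return idx, weights
```

The returned indices and weights would then be handed to the tree learner, which uses the weighted gradient sums when scoring candidate split points.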

In summary, the basic idea behind GOSS is to sample instances in a way that takes into account the gradient of the loss function with respect to the predictions made by the model. This is done by retaining a certain percentage of instances with the largest gradients, while randomly sampling the remaining instances. This allows the model to focus on instances that are more important for training, while still maintaining a diverse training set.

The exact details of the GOSS algorithm can be found in the LightGBM paper, LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In particular, GOSS is described in Section 3 of the paper, where Algorithm 2 gives the sampling procedure and Section 3.2 the theoretical analysis.
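For the math itself, the key quantity in that analysis is the estimated variance gain computed on the sampled subset. Writing $A$ for the retained large-gradient instances, $B$ for the randomly sampled small-gradient instances, $a$ and $b$ for the two sampling fractions, and $g_i$ for the gradients, the gain of splitting feature $j$ at point $d$ is estimated (as I read Section 3.2) as

$$\tilde{V}_j(d) = \frac{1}{n}\left(\frac{\left(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\right)^2}{n_l^j(d)} + \frac{\left(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\right)^2}{n_r^j(d)}\right),$$

where the subscripts $l$ and $r$ denote the instances falling to the left and right of the split point, and the $\frac{1-a}{b}$ factor re-weights the sampled small-gradient instances so that $\tilde{V}_j(d)$ approximates the gain that would have been computed on the full data.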
