Lexico Gradient Boosting Classifier
Abstract of LexicoGradientBoostingClassifier
Extracted from Ribeiro & Freitas (2024), "Lexicographical random forests for longitudinal data classification".
Standard supervised machine learning methods often ignore the temporal information represented in longitudinal data, but that information can lead to more precise predictions in classification tasks. Data preprocessing techniques and classification algorithms can be adapted to cope directly with longitudinal data inputs, making use of temporal information such as the time-index of features and previous measurements of the class variable. In this article, we propose two changes to the classification task of predicting age-related diseases in a real-world dataset created from the English Longitudinal Study of Ageing. First, we explore the addition of previous measurements of the class variable, and estimating the missing data in those added features using intermediate classifiers. Second, we propose a new split-feature selection procedure for a random forest's decision trees, which considers the candidate features' time-indexes, in addition to the information gain ratio. Our experiments compared the proposed approaches to baseline approaches, in 3 prediction scenarios, varying the "time gap" for the prediction - how many years in advance the class (occurrence of an age-related disease) is predicted. The experiments were performed on 10 datasets varying the class variable, and showed that the proposed approaches increased the random forest's predictive accuracy.
Adapted and integrated into a Gradient Boosting framework, this estimator boosts LexicoDecisionTreeRegressors as base learners, so each successive tree applies the lexicographic split-selection procedure above while fitting the residuals of the previous iterations.
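The residual-fitting loop mentioned above can be illustrated with a minimal sketch. It assumes a squared-error loss on a single numeric feature and uses hand-written decision stumps in place of LexicoDecisionTreeRegressor. The classifier itself fits gradients of a classification loss, so `fit_stump` and `boost` are hypothetical helpers for illustration, not the library's code.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    # Try every threshold that leaves at least one sample on each side.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Fit stumps to the residuals left by the previous iterations."""
    base = mean(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return base, stumps, preds


# Toy data, made up for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

base, stumps, preds = boost(xs, ys)
initial_mse = mean([(y - base) ** 2 for y in ys])
final_mse = mean([(y - p) ** 2 for y, p in zip(ys, preds)])
print(f"MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}")
```

In this sketch a smaller `learning_rate` makes each stump correct less of the remaining residual, so more iterations are needed to reach the same training error.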
What are features_group and non_longitudinal_features?
Two key attributes, features_group and non_longitudinal_features, enable algorithms to interpret the
temporal structure of longitudinal data.
- features_group: A list of lists where each sublist contains indices of a longitudinal attribute's waves, ordered from oldest to most recent. This captures temporal dependencies.
- non_longitudinal_features: A list of indices for static, non-temporal features excluded from the temporal matrix.
Proper setup of these attributes is critical for leveraging temporal patterns effectively.
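As an illustration of the two attributes, the sketch below builds them by hand for a toy column layout. The column names, the `_w1`/`_w2`/`_w3` suffix convention, and the helper `build_feature_groups` are assumptions made for this example; they are not part of the library.

```python
# Hypothetical column layout: two longitudinal attributes measured in three
# waves, followed by two static columns.
columns = [
    "smoke_w1", "smoke_w2", "smoke_w3",                    # indices 0, 1, 2
    "cholesterol_w1", "cholesterol_w2", "cholesterol_w3",  # indices 3, 4, 5
    "age", "gender",                                       # indices 6, 7
]


def build_feature_groups(columns, attributes, wave_suffixes):
    """Return one index list per attribute, ordered from oldest to newest wave."""
    index_of = {name: i for i, name in enumerate(columns)}
    return [
        [index_of[f"{attr}_{wave}"] for wave in wave_suffixes]
        for attr in attributes
    ]


features_group = build_feature_groups(
    columns, ["smoke", "cholesterol"], ["w1", "w2", "w3"]
)

# Every column index that is not part of a longitudinal group is static.
longitudinal = {i for group in features_group for i in group}
non_longitudinal_features = [i for i in range(len(columns)) if i not in longitudinal]

print(features_group)             # [[0, 1, 2], [3, 4, 5]]
print(non_longitudinal_features)  # [6, 7]
```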
LexicoGradientBoostingClassifier
Bases: CustomClassifierMixinEstimator
Lexico Gradient Boosting Classifier for longitudinal data analysis.
This classifier extends scikit-learn's GradientBoostingClassifier for longitudinal data by integrating a
lexicographic optimisation approach within each base learner (a LexicoDecisionTreeRegressor). Splits are
evaluated with a bi-objective rule: the primary objective minimises the loss (friedman_mse criterion), and
the secondary objective favours features from more recent waves whenever competing loss reductions are within
threshold_gain. Boosting aggregates these decisions over successive iterations by fitting residuals.
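The bi-objective rule described above can be sketched in a few lines. This is a simplified illustration, not the library's splitter: `Candidate`, `select_split`, and the `wave` field (a feature's position within its group, higher meaning more recent) are hypothetical names, and candidates are compared against the single best loss reduction.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    feature: int           # column index of the candidate split feature
    loss_reduction: float  # primary objective: higher is better
    wave: int              # position within its group: higher is more recent


def select_split(candidates: List[Candidate], threshold_gain: float = 0.0015) -> Candidate:
    """Pick the most recent feature among the near-best candidates."""
    best = max(candidates, key=lambda c: c.loss_reduction)
    # Candidates within threshold_gain of the best are treated as equivalent.
    near_best = [
        c for c in candidates
        if best.loss_reduction - c.loss_reduction <= threshold_gain
    ]
    # Secondary objective: recency. Remaining ties go to the larger reduction.
    return max(near_best, key=lambda c: (c.wave, c.loss_reduction))


candidates = [
    Candidate(feature=0, loss_reduction=0.2000, wave=0),
    Candidate(feature=1, loss_reduction=0.1990, wave=1),
    Candidate(feature=2, loss_reduction=0.1500, wave=2),
]

print(select_split(candidates).feature)                      # 1
print(select_split(candidates, threshold_gain=0.0).feature)  # 0
```

With the default threshold, feature 1 wins because its loss reduction is within 0.0015 of the best and it comes from a later wave. With a threshold of 0.0, only the best loss reduction counts.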
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold_gain | float | Threshold for comparing loss reductions during split selection. Candidate splits whose loss reductions are within this threshold of each other are decided by recency, so smaller values apply the recency preference only to near-ties and larger values apply it more often. | 0.0015 |
| features_group | List[List[int]] | Temporal matrix of feature indices for longitudinal attributes, with each sublist ordered from the oldest to the most recent wave. Required for longitudinal functionality. | None |
| criterion | str | The split quality metric. Fixed to "friedman_mse"; do not modify. | 'friedman_mse' |
| splitter | str | The split strategy. Fixed to "lexicoRF"; do not modify. | 'lexicoRF' |
| max_depth | Optional[int] | Maximum depth of each tree. | 3 |
| min_samples_split | int | Minimum samples required to split an internal node. | 2 |
| min_samples_leaf | int | Minimum samples required at a leaf node. | 1 |
| min_weight_fraction_leaf | float | Minimum weighted fraction of total sample weight at a leaf. | 0.0 |
| max_features | Optional[Union[int, str]] | Number of features to consider for splits (e.g., "sqrt", "log2", int). | None |
| random_state | Optional[int] | Seed for random number generation. | None |
| max_leaf_nodes | Optional[int] | Maximum number of leaf nodes per tree. | None |
| min_impurity_decrease | float | Minimum impurity decrease required for a split. | 0.0 |
| ccp_alpha | float | Complexity parameter for pruning; non-negative. | 0.0 |
| n_estimators | int | Number of boosting stages (trees) to perform. | 100 |
| learning_rate | float | Learning rate shrinks the contribution of each tree. There is a trade-off between | 0.1 |
Attributes:
| Name | Type | Description |
|---|---|---|
| _lexico_gradient_boosting | GradientBoostingClassifier | The underlying gradient boosting model. |
| classes_ | ndarray | The class labels. |
Examples:
Basic Usage
```python
from sklearn.metrics import accuracy_score

from scikit_longitudinal.data_preparation import LongitudinalDataset
from scikit_longitudinal.estimators.ensemble import LexicoGradientBoostingClassifier

# Load the dataset and declare the target column
dataset = LongitudinalDataset('./stroke_longitudinal.csv')
dataset.load_data()
dataset.load_target(target_column="stroke_w2")

# Group the features by wave using the "elsa" option, then split the data
dataset.setup_features_group("elsa")
dataset.load_train_test_split(test_size=0.2, random_state=42)

# Fit the classifier and evaluate it on the held-out split
clf = LexicoGradientBoostingClassifier(features_group=dataset.feature_groups())
clf.fit(dataset.X_train, dataset.y_train)
y_pred = clf.predict(dataset.X_test)
print(f"Accuracy: {accuracy_score(dataset.y_test, y_pred)}")
```
Advanced: tuning learning rate and threshold gain
```python
# ... Similar setup as above ...
# X and y are assumed to come from that setup, with four feature columns
clf = LexicoGradientBoostingClassifier(
    features_group=[[0, 1], [2, 3]],
    threshold_gain=0.001,  # Adjusted for hyperparameter tuning
    learning_rate=0.01,    # Lower learning rate for more gradual learning
    n_estimators=200,      # More estimators to offset the lower learning rate
)
clf.fit(X, y)

# Predicting on the data used for fitting reports training accuracy
y_pred = clf.predict(X)
print(f"Accuracy: {accuracy_score(y, y_pred)}")
# ... Similar evaluation as above ...
```
Source code in scikit_longitudinal/estimators/ensemble/lexicographical/lexico_gradient_boosting.py
feature_importances_ property
Return the feature importances.
Returns:
| Type | Description |
|---|---|
| ndarray | np.ndarray: The feature importances. |
Note
Feature importances are calculated based on the impurity decrease across all trees in the ensemble.
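One way to read the note above is that each tree contributes a vector of impurity decreases per feature, and the ensemble averages those vectors and rescales the result to sum to 1. The sketch below assumes that aggregation scheme; `ensemble_importances` and the per-tree numbers are invented for illustration and do not come from the library.

```python
def ensemble_importances(per_tree):
    """Average per-tree importance vectors and rescale them to sum to 1."""
    n_trees = len(per_tree)
    n_features = len(per_tree[0])
    averaged = [
        sum(tree[j] for tree in per_tree) / n_trees
        for j in range(n_features)
    ]
    total = sum(averaged)
    # Leave the vector untouched if no tree made a split.
    return [value / total for value in averaged] if total > 0 else averaged


# Hypothetical impurity decreases for two trees over three features.
per_tree = [
    [0.6, 0.4, 0.0],
    [0.2, 0.8, 0.0],
]

importances = ensemble_importances(per_tree)
print([round(value, 3) for value in importances])
```

A feature that no tree splits on, such as the third one here, ends up with an importance of zero.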
fit(X, y=None, sample_weight=None)
Fit the classifier to the training data.
Validates X (and y when provided) with scikit-learn's
check_X_y / check_array and then delegates to the subclass
implementation in _fit. sample_weight is forwarded only when
the subclass's _fit declares it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Training input samples of shape | required |
| y | ndarray | Target class labels of shape | None |
| sample_weight | ndarray | Per-sample weights of shape | None |
Returns:
| Name | Type | Description |
|---|---|---|
| CustomClassifierMixinEstimator | CustomClassifierMixinEstimator | The fitted estimator ( |
Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py
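The conditional forwarding of sample_weight described for fit can be sketched with the standard library's `inspect.signature`. This is an assumption about how such a check might look, not the template's actual code: `TemplateEstimator`, `Weighted`, and `Unweighted` are hypothetical classes, and input validation is left out.

```python
import inspect


class TemplateEstimator:
    """Hypothetical stand-in for a template base class with a public fit."""

    def fit(self, X, y=None, sample_weight=None):
        # Validation with check_X_y / check_array is omitted in this sketch.
        accepts_weight = "sample_weight" in inspect.signature(self._fit).parameters
        if accepts_weight:
            return self._fit(X, y, sample_weight=sample_weight)
        return self._fit(X, y)


class Weighted(TemplateEstimator):
    def _fit(self, X, y=None, sample_weight=None):
        self.seen_weight_ = sample_weight
        return self


class Unweighted(TemplateEstimator):
    def _fit(self, X, y=None):
        # This subclass never receives sample_weight.
        self.seen_weight_ = "not forwarded"
        return self


weighted = Weighted().fit([[1.0]], [0], sample_weight=[2.0])
unweighted = Unweighted().fit([[1.0]], [0], sample_weight=[2.0])
print(weighted.seen_weight_)    # [2.0]
print(unweighted.seen_weight_)  # not forwarded
```

Checking the signature lets subclasses that do not support weights keep a simpler `_fit` while the public `fit` signature stays the same for every estimator.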
predict(X)
Predict class labels for the input samples.
Validates X with scikit-learn's check_array and delegates to
the subclass implementation in _predict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input samples of shape | required |
Returns:
| Type | Description |
|---|---|
| ndarray | np.ndarray: Predicted class labels of shape |
Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py
predict_proba(X)
Predict class probabilities for the input samples.
Validates X with scikit-learn's check_array and delegates to
the subclass implementation in _predict_proba.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input samples of shape | required |
Returns:
| Type | Description |
|---|---|
| ndarray | np.ndarray: Class probabilities of shape |
| ndarray | with columns ordered as in |
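To show how a probability matrix relates to class labels, the sketch below maps each row to the label of its highest-probability column. The probabilities, the label names, and the helper `labels_from_proba` are invented for illustration. It assumes one column per entry of the label list, in the same order, and does not describe how the library's predict is implemented.

```python
# Hypothetical class labels and predicted probabilities for three samples.
class_labels = ["no_stroke", "stroke"]
proba = [
    [0.90, 0.10],
    [0.35, 0.65],
    [0.50, 0.50],
]


def labels_from_proba(proba, class_labels):
    """Map each row to the label of its highest-probability column."""
    labels = []
    for row in proba:
        # max() returns the first index on ties, so ties go to the earlier class.
        best_column = max(range(len(row)), key=row.__getitem__)
        labels.append(class_labels[best_column])
    return labels


print(labels_from_proba(proba, class_labels))
# ['no_stroke', 'stroke', 'no_stroke']
```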