Lexico Gradient Boosting Classifier

Abstract of LexicoGradientBoostingClassifier

Extracted from Ribeiro & Freitas (2024), "Lexicographical random forests for longitudinal data classification".

Standard supervised machine learning methods often ignore the temporal information represented in longitudinal data, but that information can lead to more precise predictions in classification tasks. Data preprocessing techniques and classification algorithms can be adapted to cope directly with longitudinal data inputs, making use of temporal information such as the time-index of features and previous measurements of the class variable. In this article, we propose two changes to the classification task of predicting age-related diseases in a real-world dataset created from the English Longitudinal Study of Ageing. First, we explore the addition of previous measurements of the class variable, and estimating the missing data in those added features using intermediate classifiers. Second, we propose a new split-feature selection procedure for a random forest's decision trees, which considers the candidate features' time-indexes, in addition to the information gain ratio. Our experiments compared the proposed approaches to baseline approaches, in 3 prediction scenarios, varying the "time gap" for the prediction - how many years in advance the class (occurrence of an age-related disease) is predicted. The experiments were performed on 10 datasets varying the class variable, and showed that the proposed approaches increased the random forest's predictive accuracy.

Adapted and integrated into a Gradient Boosting framework, this estimator boosts LexicoDecisionTreeRegressors as base learners, so each successive tree applies the lexicographic split-selection procedure above while fitting the residuals of the previous iterations.

See More In References

What are features_group and non_longitudinal_features?

Two key attributes, features_group and non_longitudinal_features, enable algorithms to interpret the temporal structure of longitudinal data.

  • features_group: A list of lists where each sublist contains indices of a longitudinal attribute's waves, ordered from oldest to most recent. This captures temporal dependencies.
  • non_longitudinal_features: A list of indices for static, non-temporal features excluded from the temporal matrix.

Proper setup of these attributes is critical for leveraging temporal patterns effectively.
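
As an illustration of the two attributes, the sketch below derives them from column names. It assumes a naming convention in which longitudinal columns end in `_w<wave number>`; `build_features_group` is a hypothetical helper written for this example, not part of the library.

```python
import re
from collections import defaultdict


def build_features_group(columns, wave_pattern=r"^(?P<name>.+)_w(?P<wave>\d+)$"):
    """Hypothetical helper: group column indices by attribute, oldest wave first."""
    waves = defaultdict(list)  # attribute name -> [(wave number, column index)]
    non_longitudinal = []      # indices of columns without a wave suffix
    for idx, col in enumerate(columns):
        match = re.match(wave_pattern, col)
        if match:
            waves[match["name"]].append((int(match["wave"]), idx))
        else:
            non_longitudinal.append(idx)
    # Sort each attribute's waves chronologically and keep only the column indices.
    features_group = [[idx for _, idx in sorted(pairs)] for pairs in waves.values()]
    return features_group, non_longitudinal


# Assumed column layout, for illustration only.
columns = ["smoke_w1", "cholesterol_w1", "smoke_w2", "cholesterol_w2", "age", "gender"]
features_group, non_longitudinal_features = build_features_group(columns)
print(features_group)             # [[0, 2], [1, 3]]
print(non_longitudinal_features)  # [4, 5]
```

Each inner list holds one attribute's column indices from the oldest wave to the most recent, which is the ordering described above.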

See More In Temporal Dependency Guide

LexicoGradientBoostingClassifier

Bases: CustomClassifierMixinEstimator

Lexico Gradient Boosting Classifier for longitudinal data analysis.

This classifier extends scikit-learn's GradientBoostingClassifier for longitudinal data by integrating a lexicographic optimisation approach within each base learner (a LexicoDecisionTreeRegressor). Splits are evaluated with a bi-objective rule: the primary objective minimises the loss (friedman_mse criterion), and the secondary objective favours features from more recent waves whenever competing loss reductions are within threshold_gain. Boosting aggregates these decisions over successive iterations by fitting residuals.
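
The bi-objective rule can be sketched in plain Python as follows. This is a simplified illustration, not the library's implementation: `select_split_feature` and `wave_position` are hypothetical names, loss reductions are given as a ready-made dictionary, recency is taken to be a feature's position within its group, and static features are assumed to rank below every wave.

```python
def wave_position(feature, features_group):
    """Return the feature's position within its group (0 = oldest wave), or -1 if static."""
    for group in features_group:
        if feature in group:
            return group.index(feature)
    return -1  # assumption: non-longitudinal features rank below every wave


def select_split_feature(gains, features_group, threshold_gain=0.0015):
    """Pick a split feature: best loss reduction first, recency as tie-breaker."""
    best_gain = max(gains.values())
    # Candidates whose loss reduction is within threshold_gain of the best one.
    tied = [f for f, g in gains.items() if best_gain - g <= threshold_gain]
    # Among near-ties prefer the most recent wave, then the larger loss reduction.
    return max(tied, key=lambda f: (wave_position(f, features_group), gains[f]))


features_group = [[0, 1], [2, 3]]
# Made-up loss reductions per candidate feature.
gains = {0: 0.1200, 1: 0.1190, 2: 0.0800, 3: 0.1195}

print(select_split_feature(gains, features_group))                      # 3
print(select_split_feature(gains, features_group, threshold_gain=0.0))  # 0
```

With the default threshold, features 0, 1 and 3 are near-ties and the more recent feature 3 wins; with a threshold of zero only the single best loss reduction (feature 0) qualifies.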

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| threshold_gain | float | Threshold for comparing loss reductions during split selection. Candidate splits whose loss reductions differ by less than this value count as tied, and the tie goes to the feature from the more recent wave. Lower values restrict the recency preference to near-exact ties; higher values apply it more often. | 0.0015 |
| features_group | List[List[int]] | Temporal matrix of feature indices for longitudinal attributes: one inner list per attribute, with its waves ordered from oldest to most recent. Required for longitudinal functionality. | None |
| criterion | str | The split quality metric. Fixed to "friedman_mse"; do not modify. | 'friedman_mse' |
| splitter | str | The split strategy. Fixed to "lexicoRF"; do not modify. | 'lexicoRF' |
| max_depth | Optional[int] | Maximum depth of each tree. | 3 |
| min_samples_split | int | Minimum samples required to split an internal node. | 2 |
| min_samples_leaf | int | Minimum samples required at a leaf node. | 1 |
| min_weight_fraction_leaf | float | Minimum weighted fraction of total sample weight at a leaf. | 0.0 |
| max_features | Optional[Union[int, str]] | Number of features to consider for splits (e.g., "sqrt", "log2", int). | None |
| random_state | Optional[int] | Seed for random number generation. | None |
| max_leaf_nodes | Optional[int] | Maximum number of leaf nodes per tree. | None |
| min_impurity_decrease | float | Minimum impurity decrease required for a split. | 0.0 |
| ccp_alpha | float | Complexity parameter for pruning; non-negative. | 0.0 |
| n_estimators | int | Number of boosting stages (trees) to perform. | 100 |
| learning_rate | float | Shrinks the contribution of each tree. There is a trade-off between learning_rate and n_estimators: lower rates usually need more stages. | 0.1 |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| _lexico_gradient_boosting | GradientBoostingClassifier | The underlying gradient boosting model. |
| classes_ | ndarray | The class labels. |

Examples:

Basic Usage

from sklearn.metrics import accuracy_score
from scikit_longitudinal.estimators.ensemble import LexicoGradientBoostingClassifier
import numpy as np
from scikit_longitudinal.data_preparation import LongitudinalDataset

# Load dataset
dataset = LongitudinalDataset('./stroke_longitudinal.csv')
dataset.load_data()
dataset.load_target(target_column="stroke_w2")
dataset.setup_features_group("elsa")
dataset.load_train_test_split(test_size=0.2, random_state=42)

clf = LexicoGradientBoostingClassifier(features_group=dataset.feature_groups())
clf.fit(dataset.X_train, dataset.y_train)
y_pred = clf.predict(dataset.X_test)
print(f"Accuracy: {accuracy_score(dataset.y_test, y_pred)}")

Advanced: tuning learning rate and threshold gain

# ... Similar setup as above ...

clf = LexicoGradientBoostingClassifier(
    features_group=[[0, 1], [2, 3]],
    threshold_gain=0.001,  # Narrower tie window, so recency decides fewer splits
    learning_rate=0.01,  # Lower learning rate for more gradual learning
    n_estimators=200,  # More boosting stages to offset the lower learning rate
)
clf.fit(X, y)
y_pred = clf.predict(X)
print(f"Accuracy: {accuracy_score(y, y_pred)}")  # Training-set accuracy, since X was used for fitting

# ... Similar evaluation as above ...
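
The interplay between learning_rate and n_estimators in the advanced example can be illustrated with an idealised calculation. The sketch assumes every stage fits the current residual exactly, which real trees do not, so the numbers only show the direction of the trade-off; `remaining_residual` is a hypothetical helper.

```python
def remaining_residual(learning_rate, n_estimators, initial_residual=1.0):
    """Residual left after boosting when each stage fits the residual exactly."""
    residual = initial_residual
    for _ in range(n_estimators):
        # The stage predicts the full residual; shrinkage applies only a fraction of it.
        residual -= learning_rate * residual
    return residual


for lr, n in [(0.1, 100), (0.01, 200), (0.01, 1000)]:
    left = remaining_residual(lr, n)
    print(f"learning_rate={lr}, n_estimators={n}: residual fraction={left:.6f}")
```

Under this assumption the residual shrinks by a factor of (1 - learning_rate) per stage, so 200 stages at 0.01 still leave roughly 13% of the initial residual, whereas 100 stages at 0.1 leave almost none. A lower learning rate therefore generally needs many more stages to reach a comparable fit.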
Source code in scikit_longitudinal/estimators/ensemble/lexicographical/lexico_gradient_boosting.py
class LexicoGradientBoostingClassifier(CustomClassifierMixinEstimator):
    """
    Lexico Gradient Boosting Classifier for longitudinal data analysis.

    This classifier extends scikit-learn's `GradientBoostingClassifier` for longitudinal data by integrating a
    lexicographic optimisation approach within each base learner (a `LexicoDecisionTreeRegressor`). Splits are
    evaluated with a bi-objective rule: the primary objective minimises the loss (`friedman_mse` criterion), and
    the secondary objective favours features from more recent waves whenever competing loss reductions are within
    `threshold_gain`. Boosting aggregates these decisions over successive iterations by fitting residuals.

    Args:
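        threshold_gain (float, default=0.0015):
            Threshold for comparing loss reductions during split selection. Candidate splits whose loss reductions
            differ by less than this value count as tied, and the tie goes to the feature from the more recent wave.
            Lower values restrict the recency preference to near-exact ties; higher values apply it more often.
        features_group (List[List[int]], optional):
            Temporal matrix of feature indices for longitudinal attributes: one inner list per attribute, with its
            waves ordered from oldest to most recent. Required for longitudinal functionality.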
        criterion (str, default="friedman_mse"):
            The split quality metric. Fixed to "friedman_mse"; do not modify.
        splitter (str, default="lexicoRF"):
            The split strategy. Fixed to "lexicoRF"; do not modify.
        max_depth (Optional[int], default=3):
            Maximum depth of each tree.
        min_samples_split (int, default=2):
            Minimum samples required to split an internal node.
        min_samples_leaf (int, default=1):
            Minimum samples required at a leaf node.
        min_weight_fraction_leaf (float, default=0.0):
            Minimum weighted fraction of total sample weight at a leaf.
        max_features (Optional[Union[int, str]], default=None):
            Number of features to consider for splits (e.g., "sqrt", "log2", int).
        random_state (Optional[int], default=None):
            Seed for random number generation.
        max_leaf_nodes (Optional[int], default=None):
            Maximum number of leaf nodes per tree.
        min_impurity_decrease (float, default=0.0):
            Minimum impurity decrease required for a split.
        ccp_alpha (float, default=0.0):
            Complexity parameter for pruning; non-negative.
        n_estimators (int, default=100):
            Number of boosting stages (trees) to perform.
        learning_rate (float, default=0.1):
            Learning rate shrinks the contribution of each tree. There is a trade-off between `learning_rate` and
            `n_estimators`.

    Attributes:
        _lexico_gradient_boosting (GradientBoostingClassifier):
            The underlying gradient boosting model.
        classes_ (ndarray):
            The class labels.

    Examples:
        !!! example "Basic Usage"

            ```python
            from sklearn.metrics import accuracy_score
            from scikit_longitudinal.estimators.ensemble import LexicoGradientBoostingClassifier
            import numpy as np
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            # Load dataset
            dataset = LongitudinalDataset('./stroke_longitudinal.csv')
            dataset.load_data()
            dataset.load_target(target_column="stroke_w2")
            dataset.setup_features_group("elsa")
            dataset.load_train_test_split(test_size=0.2, random_state=42)

            clf = LexicoGradientBoostingClassifier(features_group=dataset.feature_groups())
            clf.fit(dataset.X_train, dataset.y_train)
            y_pred = clf.predict(dataset.X_test)
            print(f"Accuracy: {accuracy_score(dataset.y_test, y_pred)}")
            ```

        !!! example "Advanced: tuning learning rate and threshold gain"

            ```python
            # ... Similar setup as above ...

            clf = LexicoGradientBoostingClassifier(
                features_group=[[0, 1], [2, 3]],
                threshold_gain=0.001,  # Narrower tie window, so recency decides fewer splits
                learning_rate=0.01,  # Lower learning rate for more gradual learning
                n_estimators=200,  # More boosting stages to offset the lower learning rate
            )
            clf.fit(X, y)
            y_pred = clf.predict(X)
            print(f"Accuracy: {accuracy_score(y, y_pred)}")  # Training-set accuracy, since X was used for fitting

            # ... Similar evaluation as above ...
            ```
    """

    def __init__(
        self,
        threshold_gain: float = 0.0015,
        features_group: List[List[int]] = None,
        criterion: str = "friedman_mse",  # Do not change this value
        splitter: str = "lexicoRF",  # Do not change this value
        max_depth: Optional[int] = 3,
        min_samples_split: int = 2,
        min_samples_leaf: int = 1,
        min_weight_fraction_leaf: float = 0.0,
        max_features: Optional[Union[int, str]] = None,
        random_state: Optional[int] = None,
        max_leaf_nodes: Optional[int] = None,
        min_impurity_decrease: float = 0.0,
        ccp_alpha: float = 0.0,
        n_estimators: int = 100,
        learning_rate: float = 0.1,
    ):
        self.threshold_gain = threshold_gain
        self.features_group = features_group
        self.criterion = criterion
        self.splitter = splitter
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_weight_fraction_leaf = min_weight_fraction_leaf
        self.max_features = max_features
        self.max_leaf_nodes = max_leaf_nodes
        self.min_impurity_decrease = min_impurity_decrease
        self.ccp_alpha = ccp_alpha
        self.random_state = random_state
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate

        self._lexico_gradient_boosting = None
        self.classes_ = None

    @ensure_valid_state
    @override
    def _fit(
        self,
        X: np.ndarray,
        y: np.ndarray,
        sample_weight=None,
    ) -> "LexicoGradientBoostingClassifier":
        """Fit the Lexico Gradient Boosting Classifier model according to the given training data.

        Args:
            X (np.ndarray):
                The training input samples.
            y (np.ndarray):
                The target values (class labels).
            sample_weight (np.ndarray, optional):
                Per-sample weights passed through to the underlying model's `fit`.

        Returns:
            LexicoGradientBoostingClassifier: The fitted classifier.

        Raises:
            ValueError:
                If fewer than two feature groups are provided.

        !!! tip "Tuning Tip"
            Adjust `n_estimators` and `learning_rate` to balance model complexity and convergence speed. A lower
            `learning_rate` with more `n_estimators` can improve generalization but increases computation time.
        """
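        # Forward every stored hyperparameter so that none is silently ignored. This assumes the imported
        # GradientBoostingClassifier keeps scikit-learn's standard keyword arguments.
        self._lexico_gradient_boosting = GradientBoostingClassifier(
            n_estimators=self.n_estimators,
            learning_rate=self.learning_rate,
            criterion=self.criterion,
            splitter=self.splitter,
            threshold_gain=self.threshold_gain,
            features_group=self.features_group,
            max_depth=self.max_depth,
            min_samples_split=self.min_samples_split,
            min_samples_leaf=self.min_samples_leaf,
            min_weight_fraction_leaf=self.min_weight_fraction_leaf,
            max_features=self.max_features,
            max_leaf_nodes=self.max_leaf_nodes,
            min_impurity_decrease=self.min_impurity_decrease,
            ccp_alpha=self.ccp_alpha,
            random_state=self.random_state,
        )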
        self._lexico_gradient_boosting.fit(X, y, sample_weight=sample_weight)
        self.classes_ = getattr(
            self._lexico_gradient_boosting, "classes_", unique_labels(y)
        )
        return self

    @ensure_valid_state
    @override
    def _predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels for samples in X.

        Args:
            X (np.ndarray):
                The input samples.

        Returns:
            np.ndarray:
                The predicted class labels for each input sample.

        !!! tip "Quick Predictions"
            Call the public `predict` method after fitting; it validates `X` and delegates to this method, which
            returns the labels predicted by the boosted ensemble.
        """
        return self._lexico_gradient_boosting.predict(X)

    @ensure_valid_state
    @override
    def _predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict class probabilities for samples in X.

        Args:
            X (np.ndarray):
                The input samples.

        Returns:
            np.ndarray:
                The predicted class probabilities for each input sample.
        """
        return self._lexico_gradient_boosting.predict_proba(X)

    @property
    def feature_importances_(self) -> np.ndarray:
        """Return the feature importances.

        Returns:
            np.ndarray:
                The feature importances.

        !!! note
            Feature importances are calculated based on the impurity decrease across all trees in the ensemble.
        """
        return self._lexico_gradient_boosting.feature_importances_

feature_importances_ property

Return the feature importances.

Returns:

| Type | Description |
| --- | --- |
| ndarray | The feature importances. |

Note

Feature importances are calculated based on the impurity decrease across all trees in the ensemble.
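
Because each longitudinal attribute spans several columns, it can help to sum the per-feature importances over an attribute's waves. The sketch below is one way to do that; `group_importances` is a hypothetical helper, and the importance values are made up rather than taken from a fitted model.

```python
def group_importances(importances, features_group, group_names=None):
    """Hypothetical helper: sum per-feature importances over each attribute's waves."""
    names = group_names or [f"group_{i}" for i in range(len(features_group))]
    return {
        name: sum(importances[idx] for idx in group)
        for name, group in zip(names, features_group)
    }


# Made-up importances for four longitudinal columns and one static column.
importances = [0.10, 0.30, 0.05, 0.35, 0.20]
features_group = [[0, 1], [2, 3]]

totals = group_importances(importances, features_group, ["smoke", "cholesterol"])
print(totals)
```

In practice `importances` would be the array returned by `clf.feature_importances_` on a fitted classifier, indexed in the same column order as the training data.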

fit(X, y=None, sample_weight=None)

Fit the classifier to the training data.

Validates X (and y when provided) with scikit-learn's check_X_y / check_array and then delegates to the subclass implementation in _fit. sample_weight is forwarded only when the subclass's _fit declares it.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | ndarray | Training input samples of shape (n_samples, n_features). | required |
| y | ndarray | Target class labels of shape (n_samples,). | None |
| sample_weight | ndarray | Per-sample weights of shape (n_samples,). Forwarded to _fit only when supported. | None |

Returns:

| Type | Description |
| --- | --- |
| CustomClassifierMixinEstimator | The fitted estimator (self). |

Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py
@final
def fit(
    self, X: np.ndarray, y: np.ndarray = None, sample_weight: np.ndarray = None
) -> "CustomClassifierMixinEstimator":
    """Fit the classifier to the training data.

    Validates ``X`` (and ``y`` when provided) with scikit-learn's
    ``check_X_y`` / ``check_array`` and then delegates to the subclass
    implementation in ``_fit``. ``sample_weight`` is forwarded only when
    the subclass's ``_fit`` declares it.

    Args:
        X (np.ndarray):
            Training input samples of shape ``(n_samples, n_features)``.
        y (np.ndarray, optional):
            Target class labels of shape ``(n_samples,)``.
        sample_weight (np.ndarray, optional):
            Per-sample weights of shape ``(n_samples,)``. Forwarded to
            ``_fit`` only when supported.

    Returns:
        CustomClassifierMixinEstimator: The fitted estimator (``self``).
    """
    if y is None:
        return self._check_array_decorator(self._fit)(X)
    _fit_sig = inspect.signature(self._fit)
    if "sample_weight" in _fit_sig.parameters:
        return self._check_X_y_decorator(self._fit)(
            X, y, sample_weight=sample_weight
        )
    else:
        return self._check_X_y_decorator(self._fit)(X, y)
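
To see the practical effect of the signature check, the sketch below reproduces the dispatch with two toy classes. `call_fit`, `WeightedToy` and `UnweightedToy` are hypothetical names for this example, and input validation is left out.

```python
import inspect


def call_fit(estimator, X, y, sample_weight=None):
    """Forward sample_weight only if the estimator's _fit accepts it."""
    if "sample_weight" in inspect.signature(estimator._fit).parameters:
        return estimator._fit(X, y, sample_weight=sample_weight)
    return estimator._fit(X, y)


class WeightedToy:
    def _fit(self, X, y, sample_weight=None):
        self.received_weight = sample_weight
        return self


class UnweightedToy:
    def _fit(self, X, y):
        self.received_weight = "not supported"
        return self


X, y, w = [[0], [1]], [0, 1], [0.5, 2.0]
print(call_fit(WeightedToy(), X, y, sample_weight=w).received_weight)    # [0.5, 2.0]
print(call_fit(UnweightedToy(), X, y, sample_weight=w).received_weight)  # not supported
```

Note that weights passed to an estimator whose `_fit` lacks the parameter are dropped without an error, so it is worth checking the subclass signature before relying on them.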

predict(X)

Predict class labels for the input samples.

Validates X with scikit-learn's check_array and delegates to the subclass implementation in _predict.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | ndarray | Input samples of shape (n_samples, n_features). | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Predicted class labels of shape (n_samples,). |

Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py
@final
def predict(self, X: np.ndarray) -> np.ndarray:
    """Predict class labels for the input samples.

    Validates ``X`` with scikit-learn's ``check_array`` and delegates to
    the subclass implementation in ``_predict``.

    Args:
        X (np.ndarray):
            Input samples of shape ``(n_samples, n_features)``.

    Returns:
        np.ndarray: Predicted class labels of shape ``(n_samples,)``.
    """
    return self._check_array_decorator(self._predict)(X)

predict_proba(X)

Predict class probabilities for the input samples.

Validates X with scikit-learn's check_array and delegates to the subclass implementation in _predict_proba.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | ndarray | Input samples of shape (n_samples, n_features). | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Class probabilities of shape (n_samples, n_classes), with columns ordered as in self.classes_. |

Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py
@final
def predict_proba(self, X: np.ndarray) -> np.ndarray:
    """Predict class probabilities for the input samples.

    Validates ``X`` with scikit-learn's ``check_array`` and delegates to
    the subclass implementation in ``_predict_proba``.

    Args:
        X (np.ndarray):
            Input samples of shape ``(n_samples, n_features)``.

    Returns:
        np.ndarray: Class probabilities of shape ``(n_samples, n_classes)``,
        with columns ordered as in ``self.classes_``.
    """
    return self._check_array_decorator(self._predict_proba)(X)
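
Since the probability columns follow the order of `classes_`, a row can be mapped back to a label by taking the column with the highest probability. The sketch below uses plain lists as stand-ins for `clf.classes_` and the output of `clf.predict_proba(X)`; `labels_from_proba` is a hypothetical helper and the values are made up.

```python
def labels_from_proba(proba, classes):
    """Map each probability row to the class label with the highest probability."""
    labels = []
    for row in proba:
        # Index of the largest probability; ties resolve to the first column.
        best_column = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_column])
    return labels


classes = ["healthy", "stroke"]               # stands in for clf.classes_
proba = [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]  # stands in for clf.predict_proba(X)
print(labels_from_proba(proba, classes))      # ['healthy', 'stroke', 'healthy']
```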