Lexico Gradient Boosting Classifier

Abstract of LexicoGradientBoostingClassifier

Extracted from Ribeiro & Freitas (2024), "Lexicographical random forests for longitudinal data classification".

Standard supervised machine learning methods often ignore the temporal information represented in longitudinal data, but that information can lead to more precise predictions in classification tasks. Data preprocessing techniques and classification algorithms can be adapted to cope directly with longitudinal data inputs, making use of temporal information such as the time-index of features and previous measurements of the class variable. In this article, we propose two changes to the classification task of predicting age-related diseases in a real-world dataset created from the English Longitudinal Study of Ageing. First, we explore the addition of previous measurements of the class variable, and estimating the missing data in those added features using intermediate classifiers. Second, we propose a new split-feature selection procedure for a random forest's decision trees, which considers the candidate features' time-indexes, in addition to the information gain ratio. Our experiments compared the proposed approaches to baseline approaches, in 3 prediction scenarios, varying the "time gap" for the prediction - how many years in advance the class (occurrence of an age-related disease) is predicted. The experiments were performed on 10 datasets varying the class variable, and showed that the proposed approaches increased the random forest's predictive accuracy.

Adapted and integrated into a Gradient Boosting framework, this estimator boosts LexicoDecisionTreeRegressors as base learners, so each successive tree applies the lexicographic split-selection procedure above while fitting the residuals of the previous iterations.

See More In References

What are features_group and non_longitudinal_features?

Two key attributes, features_group and non_longitudinal_features, enable algorithms to interpret the temporal structure of longitudinal data.

features_group: A list of lists where each sublist contains indices of a longitudinal attribute's waves, ordered from oldest to most recent. This captures temporal dependencies.
non_longitudinal_features: A list of indices for static, non-temporal features excluded from the temporal matrix.

Proper setup of these attributes is critical for leveraging temporal patterns effectively.
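As a rough illustration of the two attributes described above, the sketch below derives them from column names. The `<name>_w<wave>` naming convention, the helper `build_feature_groups`, and the example columns are assumptions made for this example only; they are not part of the library.

```python
import re
from collections import defaultdict


def build_feature_groups(columns):
    """Group column indices by attribute, ordering waves from oldest to most recent.

    Assumes (hypothetically) that longitudinal columns are named `<name>_w<wave>`;
    every other column is treated as non-longitudinal.
    """
    waves = defaultdict(list)  # attribute name -> [(wave number, column index), ...]
    non_longitudinal = []
    for index, column in enumerate(columns):
        match = re.fullmatch(r"(.+)_w(\d+)", column)
        if match:
            waves[match.group(1)].append((int(match.group(2)), index))
        else:
            non_longitudinal.append(index)
    # Sorting the (wave, index) pairs puts the oldest wave first in each sublist.
    features_group = [[idx for _, idx in sorted(pairs)] for pairs in waves.values()]
    return features_group, non_longitudinal


columns = ["smoke_w1", "smoke_w2", "chol_w2", "chol_w1", "age", "sex"]
features_group, non_longitudinal_features = build_feature_groups(columns)
print(features_group)             # [[0, 1], [3, 2]]
print(non_longitudinal_features)  # [4, 5]
```

Note that `chol_w2` appears before `chol_w1` in the column list, so its group is `[3, 2]`: what matters is the temporal order of the waves, not the column order.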

See More In Temporal Dependency Guide

LexicoGradientBoostingClassifier

Bases: CustomClassifierMixinEstimator

Lexico Gradient Boosting Classifier for longitudinal data analysis.

This classifier extends scikit-learn's GradientBoostingClassifier for longitudinal data by integrating a lexicographic optimisation approach within each base learner (a LexicoDecisionTreeRegressor). Splits are evaluated with a bi-objective rule: the primary objective minimises the loss (friedman_mse criterion), and the secondary objective favours features from more recent waves whenever competing loss reductions are within threshold_gain. Boosting aggregates these decisions over successive iterations by fitting residuals.
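The bi-objective rule can be pictured with a small, simplified sketch. This is not the library's splitter: the function `select_split`, the tuple layout, and the tie-breaking detail are assumptions chosen to illustrate the idea that recency only matters among candidates whose loss reduction is within `threshold_gain` of the best one.

```python
def select_split(candidates, threshold_gain=0.0015):
    """Pick a split feature lexicographically (illustrative only).

    `candidates` is a list of (feature_index, gain, wave) tuples, where a
    larger `wave` means a more recent measurement.
    """
    best_gain = max(gain for _, gain, _ in candidates)
    # Primary objective: keep candidates whose gain is within threshold_gain of the best.
    near_best = [c for c in candidates if best_gain - c[1] <= threshold_gain]
    # Secondary objective: prefer the most recent wave; break remaining ties by gain.
    return max(near_best, key=lambda c: (c[2], c[1]))[0]


candidates = [
    (0, 0.200, 1),  # best gain, oldest wave
    (1, 0.199, 2),  # almost as good, more recent wave
    (2, 0.150, 3),  # most recent wave, but clearly worse gain
]
print(select_split(candidates, threshold_gain=0.0015))  # 1
print(select_split(candidates, threshold_gain=0.0))     # 0
```

With the default threshold, features 0 and 1 count as equivalent and the more recent one wins; with a threshold of zero, only the strictly best gain is considered.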

Parameters:

Name	Type	Description	Default
`threshold_gain`	`float, default=0.0015`	Maximum difference in loss reduction for two candidate splits to be treated as equivalent. Within this threshold the feature from the more recent wave is preferred, so higher values let recency decide more often and lower values let loss reduction dominate.	`0.0015`
`features_group`	`List[List[int]]`	Temporal matrix of feature indices: one sublist per longitudinal attribute, with wave indices ordered from oldest to most recent. Required for longitudinal functionality.	`None`
`criterion`	`str, default="friedman_mse"`	The split quality metric. Fixed to "friedman_mse"; do not modify.	`'friedman_mse'`
`splitter`	`str, default="lexicoRF"`	The split strategy. Fixed to "lexicoRF"; do not modify.	`'lexicoRF'`
`max_depth`	`Optional[int], default=3`	Maximum depth of each tree.	`3`
`min_samples_split`	`int, default=2`	Minimum samples required to split an internal node.	`2`
`min_samples_leaf`	`int, default=1`	Minimum samples required at a leaf node.	`1`
`min_weight_fraction_leaf`	`float, default=0.0`	Minimum weighted fraction of total sample weight at a leaf.	`0.0`
`max_features`	`Optional[Union[int, str]], default=None`	Number of features to consider for splits (e.g., "sqrt", "log2", int).	`None`
`random_state`	`Optional[int], default=None`	Seed for random number generation.	`None`
`max_leaf_nodes`	`Optional[int], default=None`	Maximum number of leaf nodes per tree.	`None`
`min_impurity_decrease`	`float, default=0.0`	Minimum impurity decrease required for a split.	`0.0`
`ccp_alpha`	`float, default=0.0`	Complexity parameter for pruning; non-negative.	`0.0`
`n_estimators`	`int, default=100`	Number of boosting stages (trees) to perform.	`100`
`learning_rate`	`float, default=0.1`	Learning rate shrinks the contribution of each tree. There is a trade-off between `learning_rate` and `n_estimators`.	`0.1`
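The trade-off between `learning_rate` and `n_estimators` can be seen in a deliberately idealised toy calculation. It assumes a single residual and a base learner that fits it exactly at every stage, which is far simpler than the classifier's actual loss; the function `remaining_residual` is hypothetical and exists only for this illustration.

```python
def remaining_residual(learning_rate, n_estimators, initial_residual=1.0):
    """Residual left after boosting when each stage fits the residual exactly.

    Every stage removes a `learning_rate` fraction of what is left, so the
    result equals initial_residual * (1 - learning_rate) ** n_estimators.
    """
    residual = initial_residual
    for _ in range(n_estimators):
        stage_prediction = residual  # idealised base learner: perfect fit
        residual -= learning_rate * stage_prediction
    return residual


for lr, n in [(0.1, 100), (0.01, 100), (0.01, 1000)]:
    print(f"learning_rate={lr}, n_estimators={n}: {remaining_residual(lr, n):.2e}")
```

In this toy setting, cutting the learning rate by a factor of ten needs roughly ten times as many stages to reach a comparable residual, which is why the two parameters are usually tuned together.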

Attributes:

Name	Type	Description
`_lexico_gradient_boosting`	`GradientBoostingClassifier`	The underlying gradient boosting model.
`classes_`	`ndarray`	The class labels.

Examples:

Basic Usage

from sklearn.metrics import accuracy_score
from scikit_longitudinal.estimators.ensemble import LexicoGradientBoostingClassifier
import numpy as np
from scikit_longitudinal.data_preparation import LongitudinalDataset

# Load dataset
dataset = LongitudinalDataset('./stroke_longitudinal.csv')
dataset.load_data()
dataset.load_target(target_column="stroke_w2")
dataset.setup_features_group("elsa")
dataset.load_train_test_split(test_size=0.2, random_state=42)

clf = LexicoGradientBoostingClassifier(features_group=dataset.feature_groups())
clf.fit(dataset.X_train, dataset.y_train)
y_pred = clf.predict(dataset.X_test)
print(f"Accuracy: {accuracy_score(dataset.y_test, y_pred)}")

Advanced: tuning learning rate and threshold gain

# ... Similar setup as above ...

clf = LexicoGradientBoostingClassifier(
    features_group=[[0, 1], [2, 3]],
    threshold_gain=0.001, # Narrower tie window, so recency decides fewer splits
    learning_rate=0.01, # Lower learning rate for more gradual learning
    n_estimators=200 # More stages to compensate for the lower learning rate
)
clf.fit(X, y)
y_pred = clf.predict(X)
print(f"Accuracy: {accuracy_score(y, y_pred)}")

# ... Similar evaluation as above ...

Source code in scikit_longitudinal/estimators/ensemble/lexicographical/lexico_gradient_boosting.py

class LexicoGradientBoostingClassifier(CustomClassifierMixinEstimator):
    """
    Lexico Gradient Boosting Classifier for longitudinal data analysis.

    This classifier extends scikit-learn's `GradientBoostingClassifier` for longitudinal data by integrating a
    lexicographic optimisation approach within each base learner (a `LexicoDecisionTreeRegressor`). Splits are
    evaluated with a bi-objective rule: the primary objective minimises the loss (`friedman_mse` criterion), and
    the secondary objective favours features from more recent waves whenever competing loss reductions are within
    `threshold_gain`. Boosting aggregates these decisions over successive iterations by fitting residuals.

    Args:
        threshold_gain (float, default=0.0015):
            Maximum difference in loss reduction for two candidate splits to be treated as equivalent. Within
            this threshold the feature from the more recent wave is preferred, so higher values let recency
            decide more often and lower values let loss reduction dominate.
        features_group (List[List[int]], optional):
            Temporal matrix of feature indices: one sublist per longitudinal attribute, with wave indices
            ordered from oldest to most recent. Required for longitudinal functionality.
        criterion (str, default="friedman_mse"):
            The split quality metric. Fixed to "friedman_mse"; do not modify.
        splitter (str, default="lexicoRF"):
            The split strategy. Fixed to "lexicoRF"; do not modify.
        max_depth (Optional[int], default=3):
            Maximum depth of each tree.
        min_samples_split (int, default=2):
            Minimum samples required to split an internal node.
        min_samples_leaf (int, default=1):
            Minimum samples required at a leaf node.
        min_weight_fraction_leaf (float, default=0.0):
            Minimum weighted fraction of total sample weight at a leaf.
        max_features (Optional[Union[int, str]], default=None):
            Number of features to consider for splits (e.g., "sqrt", "log2", int).
        random_state (Optional[int], default=None):
            Seed for random number generation.
        max_leaf_nodes (Optional[int], default=None):
            Maximum number of leaf nodes per tree.
        min_impurity_decrease (float, default=0.0):
            Minimum impurity decrease required for a split.
        ccp_alpha (float, default=0.0):
            Complexity parameter for pruning; non-negative.
        n_estimators (int, default=100):
            Number of boosting stages (trees) to perform.
        learning_rate (float, default=0.1):
            Learning rate shrinks the contribution of each tree. There is a trade-off between `learning_rate` and
            `n_estimators`.

    Attributes:
        _lexico_gradient_boosting (GradientBoostingClassifier):
            The underlying gradient boosting model.
        classes_ (ndarray):
            The class labels.

    Examples:
        !!! example "Basic Usage"

            ```python
            from sklearn.metrics import accuracy_score
            from scikit_longitudinal.estimators.ensemble import LexicoGradientBoostingClassifier
            import numpy as np
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            # Load dataset
            dataset = LongitudinalDataset('./stroke_longitudinal.csv')
            dataset.load_data()
            dataset.load_target(target_column="stroke_w2")
            dataset.setup_features_group("elsa")
            dataset.load_train_test_split(test_size=0.2, random_state=42)

            clf = LexicoGradientBoostingClassifier(features_group=dataset.feature_groups())
            clf.fit(dataset.X_train, dataset.y_train)
            y_pred = clf.predict(dataset.X_test)
            print(f"Accuracy: {accuracy_score(dataset.y_test, y_pred)}")
            ```

        !!! example "Advanced: tuning learning rate and threshold gain"

            ```python
            # ... Similar setup as above ...

            clf = LexicoGradientBoostingClassifier(
                features_group=[[0, 1], [2, 3]],
                threshold_gain=0.001, # Narrower tie window, so recency decides fewer splits
                learning_rate=0.01, # Lower learning rate for more gradual learning
                n_estimators=200 # More stages to compensate for the lower learning rate
            )
            clf.fit(X, y)
            y_pred = clf.predict(X)
            print(f"Accuracy: {accuracy_score(y, y_pred)}")

            # ... Similar evaluation as above ...
            ```
    """

    def __init__(
        self,
        threshold_gain: float = 0.0015,
        features_group: List[List[int]] = None,
        criterion: str = "friedman_mse",  # Do not change this value
        splitter: str = "lexicoRF",  # Do not change this value
        max_depth: Optional[int] = 3,
        min_samples_split: int = 2,
        min_samples_leaf: int = 1,
        min_weight_fraction_leaf: float = 0.0,
        max_features: Optional[Union[int, str]] = None,
        random_state: Optional[int] = None,
        max_leaf_nodes: Optional[int] = None,
        min_impurity_decrease: float = 0.0,
        ccp_alpha: float = 0.0,
        n_estimators: int = 100,
        learning_rate: float = 0.1,
    ):
        self.threshold_gain = threshold_gain
        self.features_group = features_group
        self.criterion = criterion
        self.splitter = splitter
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_weight_fraction_leaf = min_weight_fraction_leaf
        self.max_features = max_features
        self.max_leaf_nodes = max_leaf_nodes
        self.min_impurity_decrease = min_impurity_decrease
        self.ccp_alpha = ccp_alpha
        self.random_state = random_state
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate

        self._lexico_gradient_boosting = None
        self.classes_ = None

    @ensure_valid_state
    @override
    def _fit(
        self,
        X: np.ndarray,
        y: np.ndarray,
        sample_weight=None,
    ) -> "LexicoGradientBoostingClassifier":
        """Fit the Lexico Gradient Boosting Classifier model according to the given training data.

        Args:
            X (np.ndarray):
                The training input samples.
            y (np.ndarray):
                The target values (class labels).
            sample_weight (np.ndarray, optional):
                Per-sample weights, forwarded to the underlying model's `fit`.

        Returns:
            LexicoGradientBoostingClassifier: The fitted classifier.

        Raises:
            ValueError:
                If there are fewer than two feature groups.

        !!! tip "Tuning Tip"
            Adjust `n_estimators` and `learning_rate` to balance model complexity and convergence speed. A lower
            `learning_rate` with more `n_estimators` can improve generalization but increases computation time.
        """
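        # Forward every constructor hyperparameter; otherwise the underlying model would
        # silently fall back to its own defaults (e.g. for n_estimators and learning_rate).
        self._lexico_gradient_boosting = GradientBoostingClassifier(
            threshold_gain=self.threshold_gain,
            features_group=self.features_group,
            criterion=self.criterion,
            splitter=self.splitter,
            max_depth=self.max_depth,
            min_samples_split=self.min_samples_split,
            min_samples_leaf=self.min_samples_leaf,
            min_weight_fraction_leaf=self.min_weight_fraction_leaf,
            max_features=self.max_features,
            random_state=self.random_state,
            max_leaf_nodes=self.max_leaf_nodes,
            min_impurity_decrease=self.min_impurity_decrease,
            ccp_alpha=self.ccp_alpha,
            n_estimators=self.n_estimators,
            learning_rate=self.learning_rate,
        )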
        self._lexico_gradient_boosting.fit(X, y, sample_weight=sample_weight)
        self.classes_ = getattr(
            self._lexico_gradient_boosting, "classes_", unique_labels(y)
        )
        return self

    @ensure_valid_state
    @override
    def _predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels for samples in X.

        Args:
            X (np.ndarray):
                The input samples.

        Returns:
            np.ndarray:
                The predicted class labels for each input sample.

        !!! tip "Quick Predictions"
            After fitting, use this method to generate predictions efficiently. It leverages the boosted ensemble for
            accurate classification.
        """
        return self._lexico_gradient_boosting.predict(X)

    @ensure_valid_state
    @override
    def _predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict class probabilities for samples in X.

        Args:
            X (np.ndarray):
                The input samples.

        Returns:
            np.ndarray:
                The predicted class probabilities for each input sample.
        """
        return self._lexico_gradient_boosting.predict_proba(X)

    @property
    def feature_importances_(self) -> np.ndarray:
        """Return the feature importances.

        Returns:
            np.ndarray:
                The feature importances.

        !!! note
            Feature importances are calculated based on the impurity decrease across all trees in the ensemble.
        """
        return self._lexico_gradient_boosting.feature_importances_

`feature_importances_` `property`

Return the feature importances.

Returns:

Type	Description
`ndarray`	The feature importances.

Note

Feature importances are calculated based on the impurity decrease across all trees in the ensemble.
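Because each longitudinal attribute is spread over several columns, it can be convenient to sum the per-column importances within each feature group. The helper `group_importances` and the numbers below are hypothetical; the sketch only assumes that the importances array is indexed by column, as in scikit-learn.

```python
def group_importances(importances, features_group):
    """Sum per-column importances over the waves of each longitudinal attribute."""
    return [sum(importances[i] for i in group) for group in features_group]


# Made-up importances for five columns; column 4 is non-longitudinal.
importances = [0.10, 0.30, 0.05, 0.25, 0.30]
features_group = [[0, 1], [2, 3]]

per_group = [round(v, 2) for v in group_importances(importances, features_group)]
print(per_group)  # [0.4, 0.3]
```

This gives one score per attribute, while the per-column values still show which wave contributed most.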

`fit(X, y=None, sample_weight=None)`

Fit the classifier to the training data.

Validates X (and y when provided) with scikit-learn's check_X_y / check_array and then delegates to the subclass implementation in _fit. sample_weight is forwarded only when the subclass's _fit declares it.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Training input samples of shape `(n_samples, n_features)`.	required
`y`	`ndarray`	Target class labels of shape `(n_samples,)`.	`None`
`sample_weight`	`ndarray`	Per-sample weights of shape `(n_samples,)`. Forwarded to `_fit` only when supported.	`None`

Returns:

Name	Type	Description
`CustomClassifierMixinEstimator`	`CustomClassifierMixinEstimator`	The fitted estimator (`self`).

Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py

@final
def fit(
    self, X: np.ndarray, y: np.ndarray = None, sample_weight: np.ndarray = None
) -> "CustomClassifierMixinEstimator":
    """Fit the classifier to the training data.

    Validates ``X`` (and ``y`` when provided) with scikit-learn's
    ``check_X_y`` / ``check_array`` and then delegates to the subclass
    implementation in ``_fit``. ``sample_weight`` is forwarded only when
    the subclass's ``_fit`` declares it.

    Args:
        X (np.ndarray):
            Training input samples of shape ``(n_samples, n_features)``.
        y (np.ndarray, optional):
            Target class labels of shape ``(n_samples,)``.
        sample_weight (np.ndarray, optional):
            Per-sample weights of shape ``(n_samples,)``. Forwarded to
            ``_fit`` only when supported.

    Returns:
        CustomClassifierMixinEstimator: The fitted estimator (``self``).
    """
    if y is None:
        return self._check_array_decorator(self._fit)(X)
    _fit_sig = inspect.signature(self._fit)
    if "sample_weight" in _fit_sig.parameters:
        return self._check_X_y_decorator(self._fit)(
            X, y, sample_weight=sample_weight
        )
    else:
        return self._check_X_y_decorator(self._fit)(X, y)
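The signature check used by `fit` can be tried in isolation. The sketch below is a stripped-down stand-in, not the library code: `call_fit` and the two toy `_fit`-like functions are hypothetical, and the validation decorators are left out.

```python
import inspect


def call_fit(fit_impl, X, y, sample_weight=None):
    """Forward sample_weight only if fit_impl declares a parameter with that name."""
    if "sample_weight" in inspect.signature(fit_impl).parameters:
        return fit_impl(X, y, sample_weight=sample_weight)
    return fit_impl(X, y)


def fit_with_weights(X, y, sample_weight=None):
    return ("weighted", sample_weight)


def fit_without_weights(X, y):
    return ("unweighted", None)


X, y, w = [[0.0], [1.0]], [0, 1], [0.5, 1.5]
result_a = call_fit(fit_with_weights, X, y, sample_weight=w)
result_b = call_fit(fit_without_weights, X, y, sample_weight=w)
print(result_a)  # ('weighted', [0.5, 1.5])
print(result_b)  # ('unweighted', None)
```

One consequence of this design is that weights passed to an estimator whose `_fit` does not declare `sample_weight` are dropped silently rather than raising an error.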

`predict(X)`

Predict class labels for the input samples.

Validates X with scikit-learn's check_array and delegates to the subclass implementation in _predict.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Input samples of shape `(n_samples, n_features)`.	required

Returns:

Type	Description
`ndarray`	Predicted class labels of shape `(n_samples,)`.

Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py

@final
def predict(self, X: np.ndarray) -> np.ndarray:
    """Predict class labels for the input samples.

    Validates ``X`` with scikit-learn's ``check_array`` and delegates to
    the subclass implementation in ``_predict``.

    Args:
        X (np.ndarray):
            Input samples of shape ``(n_samples, n_features)``.

    Returns:
        np.ndarray: Predicted class labels of shape ``(n_samples,)``.
    """
    return self._check_array_decorator(self._predict)(X)

`predict_proba(X)`

Predict class probabilities for the input samples.

Validates X with scikit-learn's check_array and delegates to the subclass implementation in _predict_proba.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Input samples of shape `(n_samples, n_features)`.	required

Returns:

Type	Description
`ndarray`	Class probabilities of shape `(n_samples, n_classes)`, with columns ordered as in `self.classes_`.

Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py

@final
def predict_proba(self, X: np.ndarray) -> np.ndarray:
    """Predict class probabilities for the input samples.

    Validates ``X`` with scikit-learn's ``check_array`` and delegates to
    the subclass implementation in ``_predict_proba``.

    Args:
        X (np.ndarray):
            Input samples of shape ``(n_samples, n_features)``.

    Returns:
        np.ndarray: Class probabilities of shape ``(n_samples, n_classes)``,
        with columns ordered as in ``self.classes_``.
    """
    return self._check_array_decorator(self._predict_proba)(X)
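Since the probability columns follow the order of `classes_`, mapping a probability row back to a label is an index lookup. The sketch below uses plain lists and made-up values in place of a fitted estimator; `labels_from_proba` is a hypothetical helper.

```python
def labels_from_proba(proba, classes):
    """Return, for each row, the class whose column holds the highest probability."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in proba]


classes = ["healthy", "stroke"]      # stands in for clf.classes_
proba = [[0.8, 0.2], [0.35, 0.65]]   # stands in for clf.predict_proba(X)
labels = labels_from_proba(proba, classes)
print(labels)  # ['healthy', 'stroke']
```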
