Lexico Deep Forest Classifier

Abstract of LexicoDeepForestClassifier

Extracted from Ribeiro & Freitas (2024), "Lexicographical random forests for longitudinal data classification".

Standard supervised machine learning methods often ignore the temporal information represented in longitudinal data, but that information can lead to more precise predictions in classification tasks. Data preprocessing techniques and classification algorithms can be adapted to cope directly with longitudinal data inputs, making use of temporal information such as the time-index of features and previous measurements of the class variable. In this article, we propose two changes to the classification task of predicting age-related diseases in a real-world dataset created from the English Longitudinal Study of Ageing. First, we explore the addition of previous measurements of the class variable, and estimating the missing data in those added features using intermediate classifiers. Second, we propose a new split-feature selection procedure for a random forest's decision trees, which considers the candidate features' time-indexes, in addition to the information gain ratio. Our experiments compared the proposed approaches to baseline approaches, in 3 prediction scenarios, varying the "time gap" for the prediction - how many years in advance the class (occurrence of an age-related disease) is predicted. The experiments were performed on 10 datasets varying the class variable, and showed that the proposed approaches increased the random forest's predictive accuracy.

This estimator integrates that split-selection procedure into a Deep Forest cascade. It stacks layers of LexicoRandomForestClassifier instances (plus optional diversity learners); every layer applies the lexicographic split-selection procedure described above, and each layer's predictions are passed on to the next layer.
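As a rough illustration of the cascade idea, the sketch below appends one layer's class-probability outputs to the original features before the next layer sees them. It is plain Python with hypothetical names (`augment_features`, `layer_probas`); it is not the DF21 or scikit-longitudinal implementation, only a minimal picture of "predictions become additional features".

```python
from typing import List


def augment_features(
    X: List[List[float]], layer_probas: List[List[List[float]]]
) -> List[List[float]]:
    """Append every estimator's class probabilities to each sample's features.

    layer_probas[e][i] is the probability vector that (hypothetical)
    estimator e produced for sample i.
    """
    augmented = []
    for i, row in enumerate(X):
        new_row = list(row)  # original features stay first, in the same order
        for estimator_probas in layer_probas:
            new_row.extend(estimator_probas[i])
        augmented.append(new_row)
    return augmented


X = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
# Two estimators on a binary problem: each emits [P(class 0), P(class 1)].
layer_probas = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]
X_next = augment_features(X, layer_probas)
print(len(X_next[0]))  # 7 = 3 original features + 2 estimators * 2 classes
```

In this sketch the original columns keep their positions, so index-based groupings such as features_group would still point at the same columns in later layers. Whether the real cascade preserves that ordering is an assumption here.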

See More In References

What are features_group and non_longitudinal_features?

Two key attributes, features_group and non_longitudinal_features, enable algorithms to interpret the temporal structure of longitudinal data.

  • features_group: A list of lists where each sublist contains indices of a longitudinal attribute's waves, ordered from oldest to most recent. This captures temporal dependencies.
  • non_longitudinal_features: A list of indices for static, non-temporal features excluded from the temporal matrix.

Proper setup of these attributes is critical for leveraging temporal patterns effectively.
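To make the two attributes concrete, here is a minimal sketch that derives them from column names. It assumes, purely for illustration, that longitudinal columns end in `_w<number>`; the helper `build_feature_groups` is hypothetical and not part of the library.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def build_feature_groups(columns: List[str]) -> Tuple[List[List[int]], List[int]]:
    """Group column indices by attribute, ordering waves oldest to most recent.

    Assumes longitudinal columns are named `<attribute>_w<wave number>`.
    """
    waves: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    non_longitudinal: List[int] = []
    for index, name in enumerate(columns):
        match = re.fullmatch(r"(.+)_w(\d+)", name)
        if match:
            waves[match.group(1)].append((int(match.group(2)), index))
        else:
            non_longitudinal.append(index)
    # Sorting the (wave, index) pairs puts the oldest wave first.
    features_group = [[idx for _, idx in sorted(pairs)] for pairs in waves.values()]
    return features_group, non_longitudinal


columns = ["smoke_w1", "cholesterol_w1", "age", "smoke_w2", "cholesterol_w2", "gender"]
features_group, non_longitudinal_features = build_feature_groups(columns)
print(features_group)             # [[0, 3], [1, 4]]
print(non_longitudinal_features)  # [2, 5]
```

Each sublist holds one attribute's column indices across waves, and the static columns (`age`, `gender`) end up in the second list.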

See More In Temporal Dependency Guide

LexicoDeepForestClassifier

Bases: CustomClassifierMixinEstimator

Lexico Deep Forest Classifier for longitudinal data analysis.

This classifier extends the Deep Forest framework for longitudinal data by stacking layers of longitudinal-adapted base estimators (typically LexicoRandomForestClassifier) so each layer's predictions become additional features for the next. Every base tree applies a lexicographic split-selection rule: the primary objective maximises the information-gain ratio (entropy criterion), and the secondary objective favours features from more recent waves whenever competing gain ratios are within threshold_gain. For more information on Deep Forest, see DF21.
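The two-level rule can be sketched in a few lines. This is one plausible reading of the description, not the code used by LexicoRandomForestClassifier: the function name, the `wave_of` mapping, and the tie-breaking by higher gain ratio are assumptions made for illustration.

```python
from typing import Dict


def lexicographic_pick(
    gain_ratios: Dict[int, float],
    wave_of: Dict[int, int],
    threshold_gain: float,
) -> int:
    """Return the feature index chosen by a two-level lexicographic rule.

    1. Find the best gain ratio among the candidates.
    2. Among features whose gain ratio is within `threshold_gain` of the best,
       prefer the most recent wave; break remaining ties by higher gain ratio.
    """
    best = max(gain_ratios.values())
    near_best = [f for f, g in gain_ratios.items() if best - g <= threshold_gain]
    return max(near_best, key=lambda f: (wave_of[f], gain_ratios[f]))


# Hypothetical candidates: feature index -> gain ratio, feature index -> wave.
gain_ratios = {0: 0.300, 1: 0.299, 2: 0.250}
wave_of = {0: 1, 1: 3, 2: 4}

print(lexicographic_pick(gain_ratios, wave_of, threshold_gain=0.0015))  # 1
print(lexicographic_pick(gain_ratios, wave_of, threshold_gain=0.0))     # 0
```

With a threshold of 0.0015, features 0 and 1 are treated as equally good, so the more recent wave (feature 1) wins. With a threshold of 0, the rule reduces to picking the highest gain ratio.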

Parameters:

Name Type Description Default
features_group List[List[int]]

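Temporal matrix of feature indices for longitudinal attributes: one sublist per attribute, with its waves ordered from oldest to most recent. Required in practice: fitting raises a ValueError if this is None or contains fewer than two groups.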

None
longitudinal_base_estimators Optional[List[LongitudinalEstimatorConfig]]

List of configurations for longitudinal base estimators. Each config specifies the classifier type, count, and optional hyperparameters. Available types: LEXICO_RF, COMPLETE_RANDOM_LEXICO_RF.

None
non_longitudinal_features List[Union[int, str]]

Indices of non-longitudinal features. Defaults to None.

None
diversity_estimators bool

Whether to include diversity estimators (weak learners) in the ensemble. If True, two completely random LexicoRandomForestClassifier instances are added.

True
class_weight Optional[Union[dict, List[dict], str]]

Class weights passed to each longitudinal base estimator unless explicitly provided in the estimator's hyperparameters.

None
random_state int

Seed for random number generation. Defaults to None.

None
single_classifier_type Optional[Union[LongitudinalClassifierType, str]]

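Type of a single classifier to use instead of listing longitudinal_base_estimators. Takes effect only when single_count is also set; when both are set, the resulting single configuration replaces any longitudinal_base_estimators value at fit time.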

None
single_count Optional[int]

Number of instances of the single classifier type. Used only together with single_classifier_type.

None
max_layers int

Maximum number of cascade layers in the deep forest.

5
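The parameters above determine how many estimators each cascade layer holds. The sketch below mirrors that bookkeeping with a hypothetical stand-in class (`Config`) and plain strings in place of real estimators; it is meant only to show how `count` and `diversity_estimators` combine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Config:  # hypothetical stand-in for LongitudinalEstimatorConfig
    classifier_type: str
    count: int = 2


def estimators_per_layer(
    configs: List[Config], diversity_estimators: bool = True
) -> List[str]:
    """List the estimator types that one cascade layer would hold."""
    layer: List[str] = []
    for config in configs:
        layer.extend([config.classifier_type] * config.count)
    if diversity_estimators:
        # Two completely random forests are appended, per the description above.
        layer.extend(["COMPLETE_RANDOM_LEXICO_RF"] * 2)
    return layer


configs = [Config("LEXICO_RF", count=3)]
print(len(estimators_per_layer(configs)))                              # 5
print(len(estimators_per_layer(configs, diversity_estimators=False)))  # 3
```

Because the diversity estimators are added on top of the configured ones, setting `diversity_estimators=False` is the way to get exactly the estimators listed in the configuration.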

Attributes:

Name Type Description
_deep_forest CascadeForestClassifier

The underlying deep forest model.

classes_ ndarray

The class labels.

Examples:

Basic Usage

from scikit_longitudinal.estimators.ensemble.lexicographical.lexico_deep_forest import (
    LexicoDeepForestClassifier,
    LongitudinalEstimatorConfig,
    LongitudinalClassifierType,
)
import numpy as np
from sklearn.metrics import accuracy_score
from scikit_longitudinal.data_preparation import LongitudinalDataset

# Load dataset
dataset = LongitudinalDataset('./stroke_longitudinal.csv')
dataset.load_data()
dataset.load_target(target_column="stroke_w2")
dataset.setup_features_group("elsa")
dataset.load_train_test_split(test_size=0.2, random_state=42)


# Configure base estimators
lexico_rf_config = LongitudinalEstimatorConfig(
    classifier_type=LongitudinalClassifierType.LEXICO_RF,
    count=3,
)

clf = LexicoDeepForestClassifier(
    features_group=dataset.feature_groups(),
    longitudinal_base_estimators=[lexico_rf_config],
)

clf.fit(dataset.X_train, dataset.y_train)
# Evaluate on the held-out split rather than on the training data
y_pred = clf.predict(dataset.X_test)
print(f"Accuracy: {accuracy_score(dataset.y_test, y_pred):.3f}")

Advanced: multiple estimator types

# ... Similar setup as above ...

complete_random_lexico_rf = LongitudinalEstimatorConfig(
    classifier_type=LongitudinalClassifierType.COMPLETE_RANDOM_LEXICO_RF,
    count=2,
)
clf = LexicoDeepForestClassifier(
    features_group=dataset.feature_groups(),
    longitudinal_base_estimators=[lexico_rf_config, complete_random_lexico_rf],
)
clf.fit(dataset.X_train, dataset.y_train)

# ... Similar prediction and evaluation as above ...

Advanced: disabling diversity estimators

# ... Similar setup as above ...

clf = LexicoDeepForestClassifier(
    features_group=dataset.feature_groups(),
    longitudinal_base_estimators=[lexico_rf_config],
    diversity_estimators=False,  # Disable diversity estimators
)
clf.fit(dataset.X_train, dataset.y_train)

# ... Similar prediction and evaluation as above ...
Source code in scikit_longitudinal/estimators/ensemble/lexicographical/lexico_deep_forest.py
class LexicoDeepForestClassifier(CustomClassifierMixinEstimator):
    """
    Lexico Deep Forest Classifier for longitudinal data analysis.

    This classifier extends the Deep Forest framework for longitudinal data by stacking layers of
    longitudinal-adapted base estimators (typically `LexicoRandomForestClassifier`) so each layer's predictions
    become additional features for the next. Every base tree applies a lexicographic split-selection rule: the
    primary objective maximises the information-gain ratio (entropy criterion), and the secondary objective
    favours features from more recent waves whenever competing gain ratios are within `threshold_gain`. For more
    information on Deep Forest, see [DF21](https://deep-forest.readthedocs.io/en/stable/).

    Args:
        features_group (List[List[int]], optional):
            Temporal matrix of feature indices for longitudinal attributes, ordered by recency. Required for longitudinal
            functionality.
        longitudinal_base_estimators (Optional[List[LongitudinalEstimatorConfig]], optional):
            List of configurations for longitudinal base estimators. Each config specifies the classifier type, count,
            and optional hyperparameters. Available types: `LEXICO_RF`, `COMPLETE_RANDOM_LEXICO_RF`.
        non_longitudinal_features (List[Union[int, str]], optional):
            Indices of non-longitudinal features. Defaults to None.
        diversity_estimators (bool, default=True):
            Whether to include diversity estimators (weak learners) in the ensemble. If True, two completely random
            `LexicoRandomForestClassifier` instances are added.
        class_weight (Optional[Union[dict, List[dict], str]]):
            Class weights passed to each longitudinal base estimator unless explicitly provided in the estimator's
            hyperparameters.
        random_state (int, optional):
            Seed for random number generation. Defaults to None.
        single_classifier_type (Optional[Union[LongitudinalClassifierType, str]], optional):
            Type of a single classifier to use if `longitudinal_base_estimators` is not provided.
        single_count (Optional[int], optional):
            Number of instances of the single classifier type.
        max_layers (int, default=5):
            Maximum number of cascade layers in the deep forest.

    Attributes:
        _deep_forest (CascadeForestClassifier):
            The underlying deep forest model.
        classes_ (ndarray):
            The class labels.

    Examples:
        !!! example "Basic Usage"

            ```python
            from scikit_longitudinal.estimators.ensemble.lexicographical.lexico_deep_forest import LexicoDeepForestClassifier, \
                LongitudinalEstimatorConfig, LongitudinalClassifierType
            import numpy as np
            from sklearn.metrics import accuracy_score
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            # Load dataset
            dataset = LongitudinalDataset('./stroke_longitudinal.csv')
            dataset.load_data()
            dataset.load_target(target_column="stroke_w2")
            dataset.setup_features_group("elsa")
            dataset.load_train_test_split(test_size=0.2, random_state=42)


            # Configure base estimators
            lexico_rf_config = LongitudinalEstimatorConfig(
                classifier_type=LongitudinalClassifierType.LEXICO_RF,
                count=3,
            )

            clf = LexicoDeepForestClassifier(
                features_group=dataset.feature_groups(),
                longitudinal_base_estimators=[lexico_rf_config],
            )

            clf.fit(dataset.X_train, dataset.y_train)
            # Evaluate on the held-out split rather than on the training data
            y_pred = clf.predict(dataset.X_test)
            print(f"Accuracy: {accuracy_score(dataset.y_test, y_pred):.3f}")
            ```

        !!! example "Advanced: multiple estimator types"

            ```python
            # ... Similar setup as above ...

            complete_random_lexico_rf = LongitudinalEstimatorConfig(
                classifier_type=LongitudinalClassifierType.COMPLETE_RANDOM_LEXICO_RF,
                count=2,
            )
            clf = LexicoDeepForestClassifier(
                features_group=dataset.feature_groups(),
                longitudinal_base_estimators=[lexico_rf_config, complete_random_lexico_rf],
            )
            clf.fit(dataset.X_train, dataset.y_train)

            # ... Similar prediction and evaluation as above ...
            ```

        !!! example "Advanced: disabling diversity estimators"

            ```python
            # ... Similar setup as above ...

            clf = LexicoDeepForestClassifier(
                features_group=dataset.feature_groups(),
                longitudinal_base_estimators=[lexico_rf_config],
                diversity_estimators=False,  # Disable diversity estimators
            )
            clf.fit(dataset.X_train, dataset.y_train)

            # ... Similar prediction and evaluation as above ...
            ```
    """

    # pylint: disable=too-many-arguments,invalid-name,signature-differs,no-member
    def __init__(
        self,
        features_group: List[List[int]] = None,
        longitudinal_base_estimators: Optional[
            List[LongitudinalEstimatorConfig]
        ] = None,
        non_longitudinal_features: List[Union[int, str]] = None,
        diversity_estimators: bool = True,
        class_weight: Optional[Union[dict, List[dict], str]] = None,
        random_state: int = None,
        single_classifier_type: Optional[Union[LongitudinalClassifierType, str]] = None,
        single_count: Optional[int] = None,
        max_layers: int = 5,
    ):
        self.features_group = features_group
        self.non_longitudinal_features = non_longitudinal_features
        self.single_classifier_type = single_classifier_type
        self.single_count = single_count
        self.longitudinal_base_estimators = longitudinal_base_estimators
        self.diversity_estimators = diversity_estimators
        self.class_weight = class_weight
        self.random_state = random_state
        self._deep_forest = None
        self.classes_ = None
        self.max_layers = max_layers

    @property
    def base_longitudinal_estimators(self) -> List[ClassifierMixin]:
        estimators: List[ClassifierMixin] = []
        for estimator_info in self.longitudinal_base_estimators:
            base_hyperparameters = estimator_info.hyperparameters or {}
            for _ in range(estimator_info.count):
                estimators.append(
                    self._create_longitudinal_estimator(
                        estimator_info.classifier_type, **dict(base_hyperparameters)
                    )
                )

        if self.diversity_estimators:
            for _ in range(2):
                estimators.append(
                    self._create_longitudinal_estimator(
                        LongitudinalClassifierType.COMPLETE_RANDOM_LEXICO_RF
                    )
                )
        return estimators

    def _create_longitudinal_estimator(
        self,
        classifier_type: Union[str, LongitudinalClassifierType],
        **hyperparameters: Any,
    ) -> ClassifierMixin:
        resolved_hyperparameters = dict(hyperparameters)
        if (
            "class_weight" not in resolved_hyperparameters
            and self.class_weight is not None
        ):
            resolved_hyperparameters["class_weight"] = self.class_weight

        if classifier_type in {
            LongitudinalClassifierType.LEXICO_RF,
            LongitudinalClassifierType.LEXICO_RF.value,
        }:
            return LexicoRandomForestClassifier(
                features_group=self.features_group, **resolved_hyperparameters
            )
        if classifier_type in {
            LongitudinalClassifierType.COMPLETE_RANDOM_LEXICO_RF,
            LongitudinalClassifierType.COMPLETE_RANDOM_LEXICO_RF.value,
        }:
            resolved_hyperparameters.setdefault("max_features", 1)
            return LexicoRandomForestClassifier(
                features_group=self.features_group, **resolved_hyperparameters
            )
        # Use repr() so plain strings (which have no `.value`) are reported correctly.
        raise ValueError(f"Unsupported classifier type: {classifier_type!r}")

    @ensure_valid_state
    @override
    def _fit(
        self, X: np.ndarray, y: np.ndarray, sample_weight=None
    ) -> "LexicoDeepForestClassifier":
        """Fit the Lexico Deep Forest Classifier model according to the given training data.

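        Args:
            X (np.ndarray):
                The training input samples.
            y (np.ndarray):
                The target values (class labels).
            sample_weight (np.ndarray, optional):
                Per-sample weights forwarded to the underlying cascade. Defaults to None.

        Returns:
            LexicoDeepForestClassifier: The fitted classifier.

        Raises:
            ValueError:
                If `longitudinal_base_estimators` is None and no single classifier type and count are set, or if
                `features_group` is None or contains fewer than two feature groups.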

        !!! tip "Configuration Tip"
            Experiment with different combinations of `longitudinal_base_estimators` and `diversity_estimators` to
            find the optimal balance between accuracy and diversity for your dataset.
        """
        if self.single_classifier_type is not None and self.single_count is not None:
            self.longitudinal_base_estimators = [
                LongitudinalEstimatorConfig(
                    classifier_type=self.single_classifier_type,
                    count=self.single_count,
                )
            ]
        elif self.longitudinal_base_estimators is None:
            raise ValueError("longitudinal_base_estimators must be provided.")
        if self.features_group is None or len(self.features_group) <= 1:
            raise ValueError("features_group must contain more than one feature group.")
        self._deep_forest = CascadeForestClassifier(
            random_state=self.random_state,
            max_layers=self.max_layers,
        )
        self._deep_forest.set_estimator(self.base_longitudinal_estimators, n_splits=2)
        self._deep_forest.fit(X, y, sample_weight=sample_weight)
        self.classes_ = getattr(self._deep_forest, "classes_", unique_labels(y))
        return self

    @ensure_valid_state
    @override
    def _predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels for samples in X.

        Args:
            X (np.ndarray):
                The input samples.

        Returns:
            np.ndarray:
                The predicted class labels for each input sample.

        !!! tip "Quick Predictions"
            Call the public `predict` method after fitting; it validates `X` and then delegates to this method, which
            returns the fitted cascade's predictions.
        """
        return self._deep_forest.predict(X)

    @ensure_valid_state
    @override
    def _predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict class probabilities for samples in X.

        Args:
            X (np.ndarray):
                The input samples.

        Returns:
            np.ndarray:
                The predicted class probabilities for each input sample.
        """
        return self._deep_forest.predict_proba(X)

fit(X, y=None, sample_weight=None)

Fit the classifier to the training data.

Validates X (and y when provided) with scikit-learn's check_X_y / check_array and then delegates to the subclass implementation in _fit. sample_weight is forwarded only when the subclass's _fit declares it.

Parameters:

Name Type Description Default
X ndarray

Training input samples of shape (n_samples, n_features).

required
y ndarray

Target class labels of shape (n_samples,).

None
sample_weight ndarray

Per-sample weights of shape (n_samples,). Forwarded to _fit only when supported.

None

Returns:

Name Type Description
CustomClassifierMixinEstimator CustomClassifierMixinEstimator

The fitted estimator (self).
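The conditional forwarding of sample_weight relies on signature inspection. The toy classes below (`Base`, `WithWeights`, `WithoutWeights`, all hypothetical) reproduce just that mechanism with `inspect.signature`, leaving out the input validation the real method performs.

```python
import inspect


class Base:  # hypothetical stand-in for the template class
    def fit(self, X, y, sample_weight=None):
        # Forward sample_weight only if the subclass hook declares it.
        if "sample_weight" in inspect.signature(self._fit).parameters:
            return self._fit(X, y, sample_weight=sample_weight)
        return self._fit(X, y)


class WithWeights(Base):
    def _fit(self, X, y, sample_weight=None):
        self.received = sample_weight
        return self


class WithoutWeights(Base):
    def _fit(self, X, y):
        self.received = "not forwarded"
        return self


X, y, w = [[0], [1]], [0, 1], [1.0, 2.0]
print(WithWeights().fit(X, y, sample_weight=w).received)     # [1.0, 2.0]
print(WithoutWeights().fit(X, y, sample_weight=w).received)  # not forwarded
```

One consequence of this design is that weights passed to an estimator whose `_fit` does not accept them are dropped silently rather than raising an error.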

Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py
@final
def fit(
    self, X: np.ndarray, y: np.ndarray = None, sample_weight: np.ndarray = None
) -> "CustomClassifierMixinEstimator":
    """Fit the classifier to the training data.

    Validates ``X`` (and ``y`` when provided) with scikit-learn's
    ``check_X_y`` / ``check_array`` and then delegates to the subclass
    implementation in ``_fit``. ``sample_weight`` is forwarded only when
    the subclass's ``_fit`` declares it.

    Args:
        X (np.ndarray):
            Training input samples of shape ``(n_samples, n_features)``.
        y (np.ndarray, optional):
            Target class labels of shape ``(n_samples,)``.
        sample_weight (np.ndarray, optional):
            Per-sample weights of shape ``(n_samples,)``. Forwarded to
            ``_fit`` only when supported.

    Returns:
        CustomClassifierMixinEstimator: The fitted estimator (``self``).
    """
    if y is None:
        return self._check_array_decorator(self._fit)(X)
    _fit_sig = inspect.signature(self._fit)
    if "sample_weight" in _fit_sig.parameters:
        return self._check_X_y_decorator(self._fit)(
            X, y, sample_weight=sample_weight
        )
    else:
        return self._check_X_y_decorator(self._fit)(X, y)

predict(X)

Predict class labels for the input samples.

Validates X with scikit-learn's check_array and delegates to the subclass implementation in _predict.

Parameters:

Name Type Description Default
X ndarray

Input samples of shape (n_samples, n_features).

required

Returns:

Type Description
ndarray

Predicted class labels of shape (n_samples,).

Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py
@final
def predict(self, X: np.ndarray) -> np.ndarray:
    """Predict class labels for the input samples.

    Validates ``X`` with scikit-learn's ``check_array`` and delegates to
    the subclass implementation in ``_predict``.

    Args:
        X (np.ndarray):
            Input samples of shape ``(n_samples, n_features)``.

    Returns:
        np.ndarray: Predicted class labels of shape ``(n_samples,)``.
    """
    return self._check_array_decorator(self._predict)(X)

predict_proba(X)

Predict class probabilities for the input samples.

Validates X with scikit-learn's check_array and delegates to the subclass implementation in _predict_proba.

Parameters:

Name Type Description Default
X ndarray

Input samples of shape (n_samples, n_features).

required

Returns:

Type Description
ndarray

Class probabilities of shape (n_samples, n_classes), with columns ordered as in self.classes_.
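Because the probability columns follow the order of classes_, a column index can be mapped back to a label. The sketch below does this in plain Python with made-up values; `labels_from_proba`, the label names, and the probabilities are all illustrative.

```python
from typing import List, Sequence


def labels_from_proba(proba: List[Sequence[float]], classes: Sequence) -> list:
    """Pick, for each row, the class whose column holds the largest probability."""
    labels = []
    for row in proba:
        best_column = max(range(len(row)), key=lambda j: row[j])
        labels.append(classes[best_column])
    return labels


classes_ = ["healthy", "stroke"]     # hypothetical fitted label order
proba = [[0.8, 0.2], [0.35, 0.65]]   # hypothetical predict_proba output
print(labels_from_proba(proba, classes_))  # ['healthy', 'stroke']
```

The same column-to-label mapping applies when a custom decision threshold is used on a single class's probability instead of the arg-max.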

Source code in scikit_longitudinal/templates/custom_classifier_mixin_estimator.py
@final
def predict_proba(self, X: np.ndarray) -> np.ndarray:
    """Predict class probabilities for the input samples.

    Validates ``X`` with scikit-learn's ``check_array`` and delegates to
    the subclass implementation in ``_predict_proba``.

    Args:
        X (np.ndarray):
            Input samples of shape ``(n_samples, n_features)``.

    Returns:
        np.ndarray: Class probabilities of shape ``(n_samples, n_classes)``,
        with columns ordered as in ``self.classes_``.
    """
    return self._check_array_decorator(self._predict_proba)(X)

LongitudinalClassifierType

Bases: Enum

Enumeration of classifier types that are adapted for longitudinal data analysis.

This enumeration provides identifiers for longitudinal-adapted classifiers that can be used within the LexicoDeepForestClassifier ensemble.

Attributes:

Name Type Description
LEXICO_RF

Identifier for a Lexico Random Forest Classifier.

COMPLETE_RANDOM_LEXICO_RF

Identifier for a Lexico Random Forest Classifier with complete randomness.
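Since classifier types may be given either as enum members or as strings, it can help to normalise them up front. The sketch below uses a hypothetical mirror enum (`ClassifierType`) with the same string values as the source listing on this page; the `normalise` helper is an assumption for illustration, not a library function.

```python
from enum import Enum
from typing import Union


class ClassifierType(Enum):  # hypothetical mirror of LongitudinalClassifierType
    LEXICO_RF = "LexicoRandomForestClassifier"
    COMPLETE_RANDOM_LEXICO_RF = "LexicoCompleteRFClassifier"


def normalise(classifier_type: Union[str, ClassifierType]) -> ClassifierType:
    """Accept an enum member or its string value; reject anything else."""
    if isinstance(classifier_type, ClassifierType):
        return classifier_type
    try:
        return ClassifierType(classifier_type)  # Enum lookup by value
    except ValueError:
        raise ValueError(
            f"Unsupported classifier type: {classifier_type!r}"
        ) from None


print(normalise("LexicoRandomForestClassifier") is ClassifierType.LEXICO_RF)  # True
print(normalise(ClassifierType.COMPLETE_RANDOM_LEXICO_RF).value)
```

Normalising early means later code only has to compare enum members, and an unknown string fails with a readable message instead of an attribute error.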

Source code in scikit_longitudinal/estimators/ensemble/lexicographical/lexico_deep_forest.py
class LongitudinalClassifierType(Enum):
    """Enumeration of classifier types that are adapted for longitudinal data analysis.

    This enumeration provides identifiers for longitudinal-adapted classifiers that can be used within the
    LexicoDeepForestClassifier ensemble.

    Attributes:
        LEXICO_RF: Identifier for a Lexico Random Forest Classifier.
        COMPLETE_RANDOM_LEXICO_RF: Identifier for a Lexico Random Forest Classifier with complete randomness.

    """

    LEXICO_RF = "LexicoRandomForestClassifier"
    COMPLETE_RANDOM_LEXICO_RF = "LexicoCompleteRFClassifier"

LongitudinalEstimatorConfig dataclass

Configuration for a longitudinal base estimator within the LexicoDeepForestClassifier ensemble.

This configuration class is used to specify the type of longitudinal classifier, the number of times it should be instantiated within the ensemble, and any hyperparameters for the individual classifiers.

Parameters:

Name Type Description Default
classifier_type LongitudinalClassifierType

The type of longitudinal classifier to be used.

required
count int

The number of times the classifier should be replicated in the ensemble. Defaults to 2.

2
hyperparameters Optional[Dict[str, Any]]

A dictionary of hyperparameters for the classifier. Defaults to None.

None
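Per-estimator hyperparameters interact with the ensemble-level settings: a value given in `hyperparameters` wins, and ensemble defaults only fill gaps. The following sketch captures that precedence with a hypothetical helper (`resolve_hyperparameters`); `n_estimators` is used only as an example key.

```python
from typing import Any, Dict, Optional


def resolve_hyperparameters(
    hyperparameters: Optional[Dict[str, Any]],
    ensemble_class_weight: Optional[Any] = None,
    completely_random: bool = False,
) -> Dict[str, Any]:
    """Merge per-estimator hyperparameters with ensemble-level defaults."""
    resolved = dict(hyperparameters or {})  # copy so the config is not mutated
    if ensemble_class_weight is not None:
        # A per-estimator value wins; the ensemble value only fills a gap.
        resolved.setdefault("class_weight", ensemble_class_weight)
    if completely_random:
        # The completely random variant defaults to one candidate feature per split.
        resolved.setdefault("max_features", 1)
    return resolved


print(resolve_hyperparameters({"n_estimators": 50}, ensemble_class_weight="balanced"))
print(resolve_hyperparameters({"class_weight": {0: 1, 1: 5}}, ensemble_class_weight="balanced"))
print(resolve_hyperparameters(None, completely_random=True))
```

The first call picks up the ensemble-level class weight, the second keeps its own, and the third receives only the completely random default.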
Source code in scikit_longitudinal/estimators/ensemble/lexicographical/lexico_deep_forest.py
@dataclass
class LongitudinalEstimatorConfig:
    """Configuration for a longitudinal base estimator within the LexicoDeepForestClassifier ensemble.

    This configuration class is used to specify the type of longitudinal classifier, the number of times it should be
    instantiated within the ensemble, and any hyperparameters for the individual classifiers.

    Args:
        classifier_type (LongitudinalClassifierType):
            The type of longitudinal classifier to be used.
        count (int):
            The number of times the classifier should be replicated in the ensemble. Defaults to 2.
        hyperparameters (Optional[Dict[str, Any]]):
            A dictionary of hyperparameters for the classifier. Defaults to None.

    """

    classifier_type: LongitudinalClassifierType
    count: int = 2
    hyperparameters: Optional[Dict[str, Any]] = None