Lexicographical Decision Tree Regressor

Abstract of LexicoDecisionTreeRegressor

Extracted from Ribeiro & Freitas (2024), "Lexicographical random forests for longitudinal data classification".

Standard supervised machine learning methods often ignore the temporal information represented in longitudinal data, but that information can lead to more precise predictions in classification tasks. Data preprocessing techniques and classification algorithms can be adapted to cope directly with longitudinal data inputs, making use of temporal information such as the time-index of features and previous measurements of the class variable. In this article, we propose two changes to the classification task of predicting age-related diseases in a real-world dataset created from the English Longitudinal Study of Ageing. First, we explore the addition of previous measurements of the class variable, and estimating the missing data in those added features using intermediate classifiers. Second, we propose a new split-feature selection procedure for a random forest's decision trees, which considers the candidate features' time-indexes, in addition to the information gain ratio. Our experiments compared the proposed approaches to baseline approaches, in 3 prediction scenarios, varying the "time gap" for the prediction - how many years in advance the class (occurrence of an age-related disease) is predicted. The experiments were performed on 10 datasets varying the class variable, and showed that the proposed approaches increased the random forest's predictive accuracy.

Adapted to regression, this estimator applies the same lexicographic split-selection procedure inside DecisionTreeRegressor, replacing information-gain ratio with variance reduction (friedman_mse) as the primary objective while still preferring more recent waves on near-ties.
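As a rough illustration of the variance-reduction objective, the sketch below scores one candidate split with Friedman's improvement formula. The function name `friedman_improvement` is hypothetical and the code uses the standard unweighted form of the criterion; it is not taken from this library's implementation.

```python
from statistics import fmean
from typing import Sequence


def friedman_improvement(y_left: Sequence[float], y_right: Sequence[float]) -> float:
    """Friedman's improvement score for a candidate split (unweighted samples).

    improvement = n_left * n_right / (n_left + n_right) * (mean_left - mean_right) ** 2
    """
    n_left, n_right = len(y_left), len(y_right)
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that leaves one side empty separates nothing
    mean_gap = fmean(y_left) - fmean(y_right)
    return n_left * n_right / (n_left + n_right) * mean_gap ** 2


# A split that separates low targets from high targets scores higher
# than one that mixes them across both children.
good_split = friedman_improvement([1.0, 1.2, 0.9], [3.0, 3.1, 2.8])
poor_split = friedman_improvement([1.0, 3.0, 0.9], [1.2, 3.1, 2.8])
print(good_split > poor_split)
```

The score grows with the gap between the two child means and with how balanced the children are, which is why it can serve as the primary objective before recency is considered.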

See More In References

What are features_group and non_longitudinal_features?

Two key attributes, features_group and non_longitudinal_features, enable algorithms to interpret the temporal structure of longitudinal data.

features_group: A list of lists where each sublist contains indices of a longitudinal attribute's waves, ordered from oldest to most recent. This captures temporal dependencies.
non_longitudinal_features: A list of indices for static, non-temporal features excluded from the temporal matrix.

Proper setup of these attributes is critical for leveraging temporal patterns effectively.
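To make the two attributes concrete, here is a minimal sketch that derives them from column names. The column names, the `_w<k>` wave suffix, and the helper `build_groups` are all assumptions made for illustration; real datasets may encode waves differently.

```python
from typing import Dict, List, Tuple

# Hypothetical column layout: two longitudinal attributes measured over three waves,
# plus two static columns. The "_w<k>" suffix is an assumed naming convention.
columns = ["age", "sex", "bmi_w1", "bmi_w2", "bmi_w3", "smoke_w1", "smoke_w2", "smoke_w3"]


def build_groups(names: List[str]) -> Tuple[List[List[int]], List[int]]:
    """Derive features_group and non_longitudinal_features from column names."""
    waves: Dict[str, List[Tuple[int, int]]] = {}
    static: List[int] = []
    for index, name in enumerate(names):
        base, sep, wave = name.rpartition("_w")
        if sep and wave.isdigit():
            waves.setdefault(base, []).append((int(wave), index))
        else:
            static.append(index)
    # Sort each attribute's columns by wave number so the oldest wave comes first.
    groups = [[index for _, index in sorted(pairs)] for pairs in waves.values()]
    return groups, static


features_group, non_longitudinal_features = build_groups(columns)
print(features_group)             # [[2, 3, 4], [5, 6, 7]]
print(non_longitudinal_features)  # [0, 1]
```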

See More In Temporal Dependency Guide

LexicoDecisionTreeRegressor

Bases: DecisionTreeRegressor

Lexico Decision Tree Regressor for longitudinal data regression.

This regressor extends scikit-learn's DecisionTreeRegressor for longitudinal data by integrating a lexicographic optimisation approach that prioritises more recent waves during split selection. Splits are evaluated with a bi-objective rule: the primary objective maximises the variance-reduction gain (friedman_mse criterion), and the secondary objective favours features from more recent waves whenever competing gains are within threshold_gain of each other. It is intended for modelling time-dependent phenomena such as patient health trends or economic forecasts.
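The bi-objective rule can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the library's splitter: the helpers `wave_lookup` and `lexico_choice` are hypothetical, gains are assumed to be precomputed per feature, and treating static features as older than any wave is an assumption of this sketch.

```python
from typing import Dict, List, Optional


def wave_lookup(features_group: List[List[int]]) -> Dict[int, int]:
    """Map each feature index to its position in its group (0 = oldest wave)."""
    return {feature: wave for group in features_group for wave, feature in enumerate(group)}


def lexico_choice(
    gains: Dict[int, float],
    features_group: List[List[int]],
    threshold_gain: float = 0.0015,
) -> Optional[int]:
    """Pick a split feature: best gain first, most recent wave among near-ties."""
    if not gains:
        return None
    waves = wave_lookup(features_group)
    best_gain = max(gains.values())
    # Candidates whose gain is within threshold_gain of the best are treated as ties.
    near_ties = [f for f, g in gains.items() if best_gain - g <= threshold_gain]
    # Among ties prefer the latest wave (assumption: static features rank as -1),
    # then fall back to the higher gain.
    return max(near_ties, key=lambda f: (waves.get(f, -1), gains[f]))


groups = [[0, 1, 2], [3, 4, 5]]
candidate_gains = {0: 0.2000, 2: 0.1990, 4: 0.1500}
print(lexico_choice(candidate_gains, groups))                      # 2
print(lexico_choice(candidate_gains, groups, threshold_gain=0.0))  # 0
```

In the first call features 0 and 2 are near-ties, so the more recent wave (feature 2) wins even though its gain is slightly lower. With a threshold of zero only the best gain survives, and recency plays no role.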

Parameters:

Name	Type	Description	Default
`threshold_gain`	`float, default=0.0015`	Maximum difference in gain for two candidate splits to be treated as near-ties. Higher values let recency decide more often; lower values let gain decide more often.	`0.0015`
`features_group`	`List[List[int]]`	A list of lists where each sublist contains feature indices for a longitudinal attribute, ordered from oldest to most recent wave. Required for longitudinal functionality.	`None`
`criterion`	`str, default="friedman_mse"`	The split quality metric. Fixed to "friedman_mse"; do not modify.	`'friedman_mse'`
`splitter`	`str, default="lexicoRF"`	The split strategy. Fixed to "lexicoRF"; do not modify.	`'lexicoRF'`
`max_depth`	`Optional[int], default=None`	Maximum tree depth. If None, grows until purity or other limits are reached.	`None`
`min_samples_split`	`int, default=2`	Minimum samples required to split a node.	`2`
`min_samples_leaf`	`int, default=1`	Minimum samples required at a leaf node.	`1`
`min_weight_fraction_leaf`	`float, default=0.0`	Minimum weighted fraction of total sample weight at a leaf.	`0.0`
`max_features`	`Optional[Union[int, str]], default=None`	Number of features to consider for splits (e.g., "auto", "sqrt", int).	`None`
`random_state`	`Optional[int], default=None`	Seed for random number generation.	`None`
`max_leaf_nodes`	`Optional[int], default=None`	Maximum number of leaf nodes.	`None`
`min_impurity_decrease`	`float, default=0.0`	Minimum impurity decrease required for a split.	`0.0`
`ccp_alpha`	`float, default=0.0`	Complexity parameter for pruning; non-negative.	`0.0`

Attributes:

Name	Type	Description
`n_features_`	`int`	Number of features in the fitted model.
`n_outputs_`	`int`	Number of outputs (fixed to 1 for regression).
`feature_importances_`	`ndarray of shape (n_features,)`	Impurity-based feature importances.
`max_features_`	`int`	Inferred value of `max_features` after fitting.
`tree_`	`Tree object`	The underlying decision tree structure.

Examples:

Sklong currently focuses on classification tasks only, and this regressor is used by our LexicographicalGradientBoosting primitive. Feel free to experiment with it in your own longitudinal regression tasks, but we do not guarantee its performance.

Source code in scikit_longitudinal/estimators/trees/lexicographical/lexico_decision_tree_regressor.py

class LexicoDecisionTreeRegressor(DecisionTreeRegressor):
    """
    Lexico Decision Tree Regressor for longitudinal data regression.

    This regressor extends scikit-learn's `DecisionTreeRegressor` for longitudinal data by integrating a
    lexicographic optimisation approach that prioritises more recent waves during split selection. Splits are
    evaluated with a bi-objective rule: the primary objective maximises the variance-reduction information gain
    (`friedman_mse` criterion), and the secondary objective favours features from more recent waves whenever
    competing gains are within `threshold_gain`. This is a powerful tool for modelling time-dependent phenomena
    like patient health trends or economic forecasts.


    Args:
        threshold_gain (float, default=0.0015):
            Maximum difference in gain for two candidate splits to be treated as near-ties. Higher values let
            recency decide more often; lower values let gain decide more often.
        features_group (List[List[int]], optional):
            A list of lists where each sublist contains feature indices for a longitudinal attribute, ordered from
            oldest to most recent wave. Required for longitudinal functionality.
        criterion (str, default="friedman_mse"):
            The split quality metric. Fixed to "friedman_mse"; do not modify.
        splitter (str, default="lexicoRF"):
            The split strategy. Fixed to "lexicoRF"; do not modify.
        max_depth (Optional[int], default=None):
            Maximum tree depth. If None, grows until purity or other limits are reached.
        min_samples_split (int, default=2):
            Minimum samples required to split a node.
        min_samples_leaf (int, default=1):
            Minimum samples required at a leaf node.
        min_weight_fraction_leaf (float, default=0.0):
            Minimum weighted fraction of total sample weight at a leaf.
        max_features (Optional[Union[int, str]], default=None):
            Number of features to consider for splits (e.g., "auto", "sqrt", int).
        random_state (Optional[int], default=None):
            Seed for random number generation.
        max_leaf_nodes (Optional[int], default=None):
            Maximum number of leaf nodes.
        min_impurity_decrease (float, default=0.0):
            Minimum impurity decrease required for a split.
        ccp_alpha (float, default=0.0):
            Complexity parameter for pruning; non-negative.

    Attributes:
        n_features_ (int):
            Number of features in the fitted model.
        n_outputs_ (int):
            Number of outputs (fixed to 1 for regression).
        feature_importances_ (ndarray of shape (n_features,)):
            Impurity-based feature importances.
        max_features_ (int):
            Inferred value of `max_features` after fitting.
        tree_ (Tree object):
            The underlying decision tree structure.

    Examples:
        `Sklong` currently focuses on classification tasks only, and this regressor is used by
        our LexicographicalGradientBoosting primitive. Feel free to experiment with it in your own
        longitudinal regression tasks, but we do not guarantee its performance.
    """

    def __init__(
        self,
        threshold_gain: float = 0.0015,
        features_group: Optional[List[List[int]]] = None,
        criterion: str = "friedman_mse",  # Do not change this value
        splitter: str = "lexicoRF",  # Do not change this value
        max_depth: Optional[int] = None,
        min_samples_split: int = 2,
        min_samples_leaf: int = 1,
        min_weight_fraction_leaf: float = 0.0,
        max_features: Optional[Union[int, str]] = None,
        random_state: Optional[int] = None,
        max_leaf_nodes: Optional[int] = None,
        min_impurity_decrease: float = 0.0,
        ccp_alpha: float = 0.0,
        store_leaf_values: bool = False,
        monotonic_cst: Optional[List[int]] = None,
    ):
        self.threshold_gain = threshold_gain
        self.features_group = features_group

        super().__init__(
            criterion=criterion,
            threshold_gain=threshold_gain,
            features_group=self.features_group,
            splitter=splitter,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_features=max_features,
            random_state=random_state,
            max_leaf_nodes=max_leaf_nodes,
            min_impurity_decrease=min_impurity_decrease,
            ccp_alpha=ccp_alpha,
            store_leaf_values=store_leaf_values,
            monotonic_cst=monotonic_cst,
        )

    def fit(self, X, y, *args, **kwargs):
        """
        Fit the Lexico Decision Tree Regressor to the data.

        Args:
            X (array-like of shape (n_samples, n_features)): Training input samples.
            y (array-like of shape (n_samples,)): Target values.
            *args: Additional positional arguments for the superclass `fit`.
            **kwargs: Additional keyword arguments for the superclass `fit`.

        Returns:
            self: Fitted regressor instance.

        Raises:
            ValueError: If `features_group` is not provided.

        !!! tip "Data Prep Tip"
            Ensure `X` matches the `features_group` structure for accurate temporal modeling.
        """
        if self.features_group is None:
            raise ValueError("The features_group parameter must be provided.")

        return super().fit(X, y, *args, **kwargs)

    def predict(self, X, check_input=True):
        """Predict regression target values for the input samples.

        Inherited from scikit-learn's `DecisionTreeRegressor`. The Lexico tree only customises
        split selection at fit time; prediction is the standard tree-traversal routine.

        Args:
            X (array-like of shape (n_samples, n_features)):
                Input samples.
            check_input (bool, default=True):
                Whether to validate the input. Set to False to bypass validation. Forwarded to scikit-learn.

        Returns:
            np.ndarray: Predicted target values of shape `(n_samples,)`.
        """
        return super().predict(X, check_input=check_input)

`fit(X, y, *args, **kwargs)`

Fit the Lexico Decision Tree Regressor to the data.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Training input samples.	required
`y`	`array-like of shape (n_samples,)`	Target values.	required
`*args`		Additional positional arguments for the superclass `fit`.	`()`
`**kwargs`		Additional keyword arguments for the superclass `fit`.	`{}`

Returns:

Name	Type	Description
`self`		Fitted regressor instance.

Raises:

Type	Description
`ValueError`	If `features_group` is not provided.

Data Prep Tip

Ensure X matches the features_group structure for accurate temporal modeling.
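One way to act on this tip is to validate `features_group` against the number of columns in `X` before calling `fit`. The helper below is a hypothetical sketch, not part of the library. It assumes that every index must be a valid column position, that no index appears twice, and that no padding values are used.

```python
from typing import Sequence


def check_features_group(features_group: Sequence[Sequence[int]], n_features: int) -> None:
    """Raise ValueError if features_group cannot describe a matrix with n_features columns."""
    seen = set()
    for group in features_group:
        for index in group:
            if not 0 <= index < n_features:
                raise ValueError(f"Feature index {index} is outside 0..{n_features - 1}.")
            if index in seen:
                raise ValueError(f"Feature index {index} appears more than once.")
            seen.add(index)


# Passes silently: four columns, two attributes with two waves each.
check_features_group([[0, 1], [2, 3]], n_features=4)

# Fails: index 4 does not exist in a four-column matrix.
try:
    check_features_group([[0, 4]], n_features=4)
except ValueError as error:
    print(error)
```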

Source code in scikit_longitudinal/estimators/trees/lexicographical/lexico_decision_tree_regressor.py

def fit(self, X, y, *args, **kwargs):
    """
    Fit the Lexico Decision Tree Regressor to the data.

    Args:
        X (array-like of shape (n_samples, n_features)): Training input samples.
        y (array-like of shape (n_samples,)): Target values.
        *args: Additional positional arguments for the superclass `fit`.
        **kwargs: Additional keyword arguments for the superclass `fit`.

    Returns:
        self: Fitted regressor instance.

    Raises:
        ValueError: If `features_group` is not provided.

    !!! tip "Data Prep Tip"
        Ensure `X` matches the `features_group` structure for accurate temporal modeling.
    """
    if self.features_group is None:
        raise ValueError("The features_group parameter must be provided.")

    return super().fit(X, y, *args, **kwargs)

`predict(X, check_input=True)`

Predict regression target values for the input samples.

Inherited from scikit-learn's DecisionTreeRegressor. The Lexico tree only customises split selection at fit time; prediction is the standard tree-traversal routine.

Parameters:

Name	Type	Description	Default
`X`	`array-like of shape (n_samples, n_features)`	Input samples.	required
`check_input`	`bool, default=True`	Whether to validate the input. Set to False to bypass validation. Forwarded to scikit-learn.	`True`

Returns:

Type	Description
`np.ndarray`	Predicted target values of shape `(n_samples,)`.
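Because prediction is ordinary tree traversal, it can be sketched without any longitudinal logic. The example below walks a hand-built, array-encoded tree. The array layout mirrors the parallel-array style of scikit-learn's `tree_` object (a child index of -1 marks a leaf, samples go left when the feature value is at most the threshold), but the function `predict_one` and the tiny tree are hypothetical.

```python
from typing import Sequence

LEAF = -1  # child index used to mark "no child", as in scikit-learn's tree arrays


def predict_one(
    x: Sequence[float],
    feature: Sequence[int],
    threshold: Sequence[float],
    left: Sequence[int],
    right: Sequence[int],
    value: Sequence[float],
) -> float:
    """Walk from the root to a leaf and return the leaf's stored value."""
    node = 0
    while left[node] != LEAF:
        if x[feature[node]] <= threshold[node]:
            node = left[node]
        else:
            node = right[node]
    return value[node]


# Hand-built tree: the root splits on feature 2 at 0.5; nodes 1 and 2 are leaves.
feature = [2, -2, -2]
threshold = [0.5, -2.0, -2.0]
left = [1, LEAF, LEAF]
right = [2, LEAF, LEAF]
value = [15.0, 10.0, 20.0]

print(predict_one([0.0, 0.0, 0.3], feature, threshold, left, right, value))  # 10.0
print(predict_one([0.0, 0.0, 0.9], feature, threshold, left, right, value))  # 20.0
```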

Source code in scikit_longitudinal/estimators/trees/lexicographical/lexico_decision_tree_regressor.py

def predict(self, X, check_input=True):
    """Predict regression target values for the input samples.

    Inherited from scikit-learn's `DecisionTreeRegressor`. The Lexico tree only customises
    split selection at fit time; prediction is the standard tree-traversal routine.

    Args:
        X (array-like of shape (n_samples, n_features)):
            Input samples.
        check_input (bool, default=True):
            Whether to validate the input. Set to False to bypass validation. Forwarded to scikit-learn.

    Returns:
        np.ndarray: Predicted target values of shape `(n_samples,)`.
    """
    return super().predict(X, check_input=check_input)
