Correlation Based Feature Selection Per Group (CFS Per Group)¶

Abstract of CorrelationBasedFeatureSelectionPerGroup

Extracted from Pomsuwan & Freitas (2017), "Feature selection for the classification of longitudinal human ageing data".

We propose a new variant of the Correlation-based Feature Selection (CFS) method for coping with longitudinal data - where variables are repeatedly measured across different time points. The proposed CFS variant is evaluated on ten datasets created using data from the English Longitudinal Study of Ageing (ELSA), with different age-related diseases used as the class variables to be predicted. The results show that, overall, the proposed CFS variant leads to better predictive performance than the standard CFS and the baseline approach of no feature selection, when using Naïve Bayes and J48 decision tree induction as classification algorithms (although the difference in performance is very small in the results for J4.8). We also report the most relevant features selected by J48 across the datasets.

See More In References

What are features_group and non_longitudinal_features?

Two key attributes, features_group and non_longitudinal_features, enable algorithms to interpret the temporal structure of longitudinal data.

features_group: A list of lists where each sublist contains indices of a longitudinal attribute's waves, ordered from oldest to most recent. This captures temporal dependencies.
non_longitudinal_features: A list of indices for static, non-temporal features excluded from the temporal matrix.

Proper setup of these attributes is critical for leveraging temporal patterns effectively.
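
To make the two structures concrete, here is a minimal sketch that derives them by hand from column names. The column names and the `_w<N>` suffix convention are assumptions chosen for illustration; in practice `LongitudinalDataset.setup_features_group` prepares these attributes for you.

```python
import re
from collections import defaultdict

# Hypothetical column layout: two longitudinal attributes with two waves each,
# plus one static attribute.
columns = ["smoke_w1", "smoke_w2", "bmi_w1", "bmi_w2", "sex"]

waves = defaultdict(list)          # base name -> [(wave number, column index), ...]
non_longitudinal_features = []     # indices of static columns

for idx, name in enumerate(columns):
    match = re.match(r"^(.+)_w(\d+)$", name)
    if match:
        waves[match.group(1)].append((int(match.group(2)), idx))
    else:
        non_longitudinal_features.append(idx)

# One sublist per longitudinal attribute, ordered from oldest to most recent wave.
features_group = [[idx for _, idx in sorted(pairs)] for pairs in waves.values()]

print(features_group)             # [[0, 1], [2, 3]]
print(non_longitudinal_features)  # [4]
```

Each sublist holds column indices, not names, and the order inside a sublist encodes time.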

See More In Temporal Dependency Guide

CorrelationBasedFeatureSelectionPerGroup ¶

Bases: CustomTransformerMixinEstimator

Correlation-based Feature Selection (CFS) per group (CFS Per Group).

The CorrelationBasedFeatureSelectionPerGroup class implements the CFS-Per-Group algorithm, a longitudinal variant of the standard CFS method. It is designed to handle feature selection in longitudinal datasets by considering temporal variations across multiple waves (time points). The algorithm operates in two phases:

Phase 1: For each longitudinal feature group, CFS with a specified search method (e.g., exhaustive or greedy) is applied to select relevant and non-redundant features across waves. The selected features are then aggregated.
Phase 2: The aggregated features from Phase 1 are combined with non-longitudinal features, and a standard CFS is applied to further refine the selection by removing redundant features.
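
The two phases can be sketched in plain NumPy. This is a simplified illustration, not the library's implementation: it assumes absolute Pearson correlation as the correlation measure and a basic greedy forward search, and the names `cfs_merit`, `greedy_forward`, and `cfs_per_group_sketch` are hypothetical.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff), with |Pearson r|."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([
        abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
        for i, a in enumerate(subset)
        for b in subset[i + 1:]
    ])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)


def greedy_forward(X, y, candidates):
    """Add the feature that most improves the merit until nothing improves it."""
    selected, best = [], -np.inf
    remaining = list(candidates)
    while remaining:
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if score <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected


def cfs_per_group_sketch(X, y, features_group, non_longitudinal_features, version=1):
    # Phase 1: run CFS inside each longitudinal group and aggregate the survivors.
    phase1 = [j for group in features_group for j in greedy_forward(X, y, group)]
    combined = phase1 + list(non_longitudinal_features)
    if version == 1:
        return combined  # no outer search
    # Phase 2: run CFS once more over aggregated + non-longitudinal features.
    return greedy_forward(X, y, combined)


# Synthetic data (assumption): column 0 tracks the target, the rest is noise.
rng = np.random.default_rng(0)
y = np.tile([0.0, 1.0], 100)
X = rng.normal(size=(200, 5))
X[:, 0] = y + 0.1 * rng.normal(size=200)

selected_v1 = cfs_per_group_sketch(X, y, [[0, 1], [2, 3]], [4], version=1)
selected_v2 = cfs_per_group_sketch(X, y, [[0, 1], [2, 3]], [4], version=2)
print(selected_v1, selected_v2)
```

In this sketch, version 1 always keeps the non-longitudinal features, while version 2 can only return a subset of what version 1 returns.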

Parameters:

Name	Type	Description	Default
`non_longitudinal_features`	`Optional[List[int]]`	List of indices for non-longitudinal features. These features are not part of the temporal matrix and are treated separately. Defaults to None.	`None`
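`search_method`	`str`	The search method used when `features_group` is None (standard CFS). It is also the fallback for `outer_search_method`. When `features_group` is set, Phase 1 uses `inner_search_method` instead. Options are "exhaustiveSearch" or "greedySearch". Defaults to "greedySearch".	`'greedySearch'`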
`features_group`	`Optional[List[List[int]]]`	A temporal matrix where each sublist contains indices of a longitudinal attribute's waves. Required for the longitudinal component. Defaults to None.	`None`
`parallel`	`bool`	Whether to use parallel processing for CFS (useful for exhaustive search with multiple groups). Defaults to False.	`False`
`outer_search_method`	`str`	The search method for Phase 2 (outer search). If None, defaults to `search_method`. Defaults to None.	`None`
`inner_search_method`	`str`	The search method for Phase 1 (inner search). Defaults to "exhaustiveSearch".	`'exhaustiveSearch'`
`version`	`int`	The version of the CFS-Per-Group algorithm to use. Version 1 applies CFS per group without an outer search, while Version 2 includes an outer CFS on the aggregated features. Defaults to 1.	`1`
`num_cpus`	`int`	Number of CPUs for parallel processing. If -1, uses all available CPUs. Defaults to -1.	`-1`

Attributes:

Name	Type	Description
`selected_features_`	`ndarray`	Indices of the selected features after fitting.

Examples:

Below are examples demonstrating the usage of the CorrelationBasedFeatureSelectionPerGroup class.

Basic Usage

from scikit_longitudinal.preprocessors.feature_selection.correlation_feature_selection import CorrelationBasedFeatureSelectionPerGroup
from scikit_longitudinal.data_preparation import LongitudinalDataset


# Load dataset
dataset = LongitudinalDataset('./stroke_longitudinal.csv')
dataset.load_data()
dataset.load_target(target_column="stroke_w2")
dataset.setup_features_group("elsa")
dataset.load_train_test_split(test_size=0.2, random_state=42)

# Initialize CFS-Per-Group
cfs_longitudinal = CorrelationBasedFeatureSelectionPerGroup(
    features_group=dataset.feature_groups(),
    non_longitudinal_features=dataset.non_longitudinal_features()
)

# Fit to data
cfs_longitudinal.fit(dataset.X_train, dataset.y_train)

# Transform data
X_selected = cfs_longitudinal.apply_selected_features_and_rename(dataset.X_train, cfs_longitudinal.selected_features_)
print(X_selected)

Advanced: parallel processing

# ... Same dataset setup as in Basic Usage ...

# Initialize with parallel processing
cfs_longitudinal = CorrelationBasedFeatureSelectionPerGroup(
    features_group=dataset.feature_groups(),
    search_method="exhaustiveSearch",
    parallel=True,  # Enable parallel processing
    num_cpus=4,  # Number of CPUs to use; -1 uses all available CPUs
)

# ... Fit and apply the selection as in Basic Usage ...

Advanced: version 2 with outer search

# ... Same dataset setup as in Basic Usage ...

# Initialize with version 2 and outer search method
cfs_longitudinal = CorrelationBasedFeatureSelectionPerGroup(
    features_group=dataset.feature_groups(),
    non_longitudinal_features=dataset.non_longitudinal_features(),
    version=2,  # Use version 2 of CFS-Per-Group
    outer_search_method="greedySearch",  # Search method for the outer (Phase 2) CFS
)

# ... Fit and apply the selection as in Basic Usage ...

Source code in scikit_longitudinal/preprocessors/feature_selection/correlation_feature_selection/cfs_per_group.py

class CorrelationBasedFeatureSelectionPerGroup(CustomTransformerMixinEstimator):
    """Correlation-based Feature Selection (CFS) per group (CFS Per Group).

    The `CorrelationBasedFeatureSelectionPerGroup` class implements the CFS-Per-Group algorithm, a longitudinal variant
    of the standard CFS method. It is designed to handle feature selection in longitudinal datasets by considering
    temporal variations across multiple waves (time points). The algorithm operates in two phases:

    1. **Phase 1**: For each longitudinal feature group, CFS with a specified search method (e.g., exhaustive or greedy)
       is applied to select relevant and non-redundant features across waves. The selected features are then aggregated.
    2. **Phase 2**: The aggregated features from Phase 1 are combined with non-longitudinal features, and a standard CFS
       is applied to further refine the selection by removing redundant features.

    Args:
        non_longitudinal_features (Optional[List[int]], optional): List of indices for non-longitudinal features.
            These features are not part of the temporal matrix and are treated separately. Defaults to None.
        search_method (str, optional): The search method for Phase 1. Options are "exhaustiveSearch" or "greedySearch".
            Defaults to "greedySearch".
        features_group (Optional[List[List[int]]], optional): A temporal matrix where each sublist contains indices of a
            longitudinal attribute's waves. Required for the longitudinal component. Defaults to None.
        parallel (bool, optional): Whether to use parallel processing for CFS (useful for exhaustive search with multiple
            groups). Defaults to False.
        outer_search_method (str, optional): The search method for Phase 2 (outer search). If None, defaults to
            `search_method`. Defaults to None.
        inner_search_method (str, optional): The search method for Phase 1 (inner search). Defaults to "exhaustiveSearch".
        version (int, optional): The version of the CFS-Per-Group algorithm to use. Version 1 applies CFS per group
            without an outer search, while Version 2 includes an outer CFS on the aggregated features. Defaults to 1.
        num_cpus (int, optional): Number of CPUs for parallel processing. If -1, uses all available CPUs. Defaults to -1.

    Attributes:
        selected_features_ (ndarray): Indices of the selected features after fitting.

    Examples:
        Below are examples demonstrating the usage of the `CorrelationBasedFeatureSelectionPerGroup` class.

        !!! example "Basic Usage"
            ```python
            from scikit_longitudinal.preprocessors.feature_selection.correlation_feature_selection import CorrelationBasedFeatureSelectionPerGroup
            from scikit_longitudinal.data_preparation import LongitudinalDataset


            # Load dataset
            dataset = LongitudinalDataset('./stroke_longitudinal.csv')
            dataset.load_data()
            dataset.load_target(target_column="stroke_w2")
            dataset.setup_features_group("elsa")
            dataset.load_train_test_split(test_size=0.2, random_state=42)

            # Initialize CFS-Per-Group
            cfs_longitudinal = CorrelationBasedFeatureSelectionPerGroup(
                features_group=dataset.feature_groups(),
                non_longitudinal_features=dataset.non_longitudinal_features()
            )

            # Fit to data
            cfs_longitudinal.fit(dataset.X_train, dataset.y_train)

            # Transform data
            X_selected = cfs_longitudinal.apply_selected_features_and_rename(dataset.X_train, cfs_longitudinal.selected_features_)
            print(X_selected)
            ```

        !!! example "Advanced: parallel processing"
            ```python
            # ... Same dataset setup as in Basic Usage ...

            # Initialize with parallel processing
            cfs_longitudinal = CorrelationBasedFeatureSelectionPerGroup(
                features_group=dataset.feature_groups(),
                search_method="exhaustiveSearch",
                parallel=True,  # Enable parallel processing
                num_cpus=4,  # Number of CPUs to use; -1 uses all available CPUs
            )

            # ... Fit and apply the selection as in Basic Usage ...
            ```

        !!! example "Advanced: version 2 with outer search"
            ```python
            # ... Same dataset setup as in Basic Usage ...

            # Initialize with version 2 and outer search method
            cfs_longitudinal = CorrelationBasedFeatureSelectionPerGroup(
                features_group=dataset.feature_groups(),
                non_longitudinal_features=dataset.non_longitudinal_features(),
                version=2,  # Use version 2 of CFS-Per-Group
                outer_search_method="greedySearch",  # Search method for the outer (Phase 2) CFS
            )

            # ... Fit and apply the selection as in Basic Usage ...
            ```
    """

    # pylint: disable=too-many-arguments,invalid-name,signature-differs,no-member
    def __init__(
        self,
        non_longitudinal_features: Optional[List[int]] = None,
        search_method: str = "greedySearch",
        features_group: Optional[List[List[int]]] = None,
        parallel: bool = False,
        outer_search_method: str = None,
        inner_search_method: str = "exhaustiveSearch",
        version=1,
        num_cpus: int = -1,
    ):
        assert search_method in {
            "exhaustiveSearch",
            "greedySearch",
        }, "search_method must be: 'exhaustiveSearch', or 'greedySearch'"

        self.search_method = search_method
        self.features_group = features_group
        self.parallel = parallel
        self.outer_search_method = (
            self.search_method if outer_search_method is None else outer_search_method
        )
        self.inner_search_method = inner_search_method
        self.non_longitudinal_features = non_longitudinal_features
        self.num_cpus = num_cpus
        self.version = version
        self.selected_features_ = []
        self.selected_longitudinal_features_ = []

    @override
    def _fit(
        self, X: np.ndarray, y: np.ndarray
    ) -> "CorrelationBasedFeatureSelectionPerGroup":
        """Fit the CFS-Per-Group algorithm to the data.

        This method applies the CFS-Per-Group algorithm, selecting features that are highly correlated with the target
        while minimizing redundancy within feature groups.

        Args:
            X (np.ndarray): Input data of shape (n_samples, n_features).
            y (np.ndarray): Target variable of shape (n_samples,).

        Returns:
            CorrelationBasedFeatureSelectionPerGroup: The fitted instance.
        """
        ray = get_ray_for_parallel(
            self.parallel and self.features_group is not None, self.num_cpus
        )

        # TODO: Make sure to rework the too many branches warning
        if self.features_group is not None:
            self.search_method = self.inner_search_method
            group_features_copy, group_selected_features = (
                (self.features_group.copy(), []) if self.features_group else ([], [])
            )

            self.features_group = None

            if self.parallel and ray is not None:
                remote_fit_subset = ray.remote(_fit_subset_remote)
                futures = [
                    remote_fit_subset.remote(self, X, y, group)
                    for group in group_features_copy
                ]
                while futures:
                    ready_futures, remaining_futures = ray.wait(futures)
                    result = ray.get(ready_futures[0])
                    group_selected_features.append(result)
                    futures = remaining_futures
            else:
                group_selected_features = [
                    self._fit_subset(X, y, group) for group in group_features_copy
                ]

            if self.version == 2:
                combined_features = [
                    index for sublist in group_selected_features for index in sublist
                ] + (self.non_longitudinal_features or [])
                self.search_method = self.outer_search_method
                selected_indices = self._fit(
                    X[:, combined_features], y
                ).selected_features_
                flattened_list = np.array(combined_features)
                self.selected_features_ = flattened_list[selected_indices].tolist()
            elif self.version == 1:
                flattened_list = np.array(
                    [index for sublist in group_selected_features for index in sublist]
                )
                self.selected_features_ = (flattened_list.tolist() or []) + (
                    self.non_longitudinal_features or []
                )
            else:
                raise ValueError(
                    f"Version {self.version} is not supported. Please choose version 1 or 2."
                )
        else:
            if self.search_method == "exhaustiveSearch":
                self.selected_features_ = _exhaustive_search(X, y)
            elif self.search_method == "greedySearch":
                self.selected_features_ = _greedy_search(X, y)
            else:
                raise ValueError(
                    f"Search method {self.search_method} is not supported."
                )

        return self

    @override
    def _transform(self, X: np.ndarray) -> np.ndarray:
        """Return the input data unchanged.

        This method overrides `CustomTransformerMixinEstimator._transform`. It does not subset `X` by
        `selected_features_`; the selection is applied through `apply_selected_features_and_rename`.

        !!! warning "Usage Note"
            Not to be used directly. CFS Per Group has a specific behavior for longitudinal features
            that this method does not account for. Use the `apply_selected_features_and_rename` method
            instead for proper handling of longitudinal features.

        Args:
            X (np.ndarray): Input data of shape (n_samples, n_features).

        Returns:
            np.ndarray: The input data, unchanged.
        """
        return X

    def _fit_subset(self, X: np.ndarray, y: np.ndarray, group: Tuple[int]) -> List[int]:
        """Fit CFS on a specific feature group.

        This method applies the CFS algorithm to a subset of features defined by the group, selecting the most relevant
        features within that group.

        Args:
            X (np.ndarray): Input data.
            y (np.ndarray): Target variable.
            group (Tuple[int]): Indices of features in the group.

        Returns:
            List[int]: Selected feature indices from the group.
        """
        X_group = X[:, group]
        self._fit(X_group, y)
        return [group[i] for i in self.selected_features_]

    # pylint: disable=W9016
    @staticmethod
    def apply_selected_features_and_rename(
        df: pd.DataFrame, selected_features: List, regex_match=r"^(.+)_w(\d+)$"
    ) -> pd.DataFrame:
        """Apply selected features to the DataFrame and rename non-longitudinal features.

        This method selects the specified features from the DataFrame and renames any features that, after selection,
        appear as single-wave features (i.e., non-longitudinal). This ensures that such features are not misinterpreted
        as longitudinal in future processing.

        !!! warning "Usage Note"
            This method should be used instead of the standard `transform` method to handle both feature selection and
            renaming in one step, especially in pipelines where the temporal structure needs to be preserved.

        !!! question "Regex Match, what is that all about?"
            The regex identifies wave-based (longitudinal) columns. The default pattern `^(.+)_w(\d+)$` matches a
            base name followed by a wave number (e.g., `feature_w1`, `feature_w2`), so it works out of the box with
            ELSA-style column names.

            The first group `(.+)` captures the base name of the feature and the second group `(\d+)` captures the
            wave number. Columns are grouped by base name; any base name left with a single wave after selection is
            renamed from `<base>_w<N>` to `<base>_wave<N>`, so it no longer matches the pattern.

            Why is that important? A feature with only one remaining wave carries no temporal information, so
            later steps that detect longitudinal features by name should not treat it as longitudinal.

        Args:
            df (pd.DataFrame): Input DataFrame.
            selected_features (List): List of selected feature indices.
            regex_match (str, optional): Regex pattern to identify wave-based features. Defaults to "^(.+)_w(\d+)$".

        Returns:
            pd.DataFrame: DataFrame with selected features and renamed non-longitudinal features.
        """
        # Apply selected features
        if selected_features:
            df = df.iloc[:, selected_features].copy()

        # Rename non-longitudinal features
        non_longitudinal_features: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
        for col in df.columns:
            if not isinstance(col, str):
                continue
            if match := re.match(regex_match, col):
                feature_base_name, wave_number = match.groups()
                non_longitudinal_features[feature_base_name].append((col, wave_number))

        for base_name, columns in non_longitudinal_features.items():
            if len(columns) == 1:
                old_name, wave_number = columns[0]
                new_name = f"{base_name}_wave{wave_number}"
                df.rename(columns={old_name: new_name}, inplace=True)
        return df

`fit(X, y=None)` ¶

Fit the transformer to the input data.

Validates X (and y when provided) with scikit-learn's check_X_y / check_array and then delegates to the subclass implementation in _fit.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Training input samples of shape `(n_samples, n_features)`.	required
`y`	`ndarray`	Target values of shape `(n_samples,)`.	`None`

Returns:

Name	Type	Description
`CustomTransformerMixinEstimator`	`CustomTransformerMixinEstimator`	The fitted transformer (`self`).
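
The validate-then-delegate pattern described here can be sketched as follows. This is an illustration only: `ColumnSelectorSketch` is a hypothetical class, and it calls scikit-learn's `check_array` and `check_X_y` directly instead of the decorators used by `CustomTransformerMixinEstimator`.

```python
import numpy as np
from sklearn.utils import check_array, check_X_y


class ColumnSelectorSketch:
    """Hypothetical transformer: public methods validate, private ones do the work."""

    def fit(self, X, y=None):
        # Validate the inputs first, then hand over to the subclass hook.
        if y is None:
            return self._fit(check_array(X))
        X, y = check_X_y(X, y)
        return self._fit(X, y)

    def transform(self, X):
        return self._transform(check_array(X))

    def _fit(self, X, y=None):
        # Toy selection rule (assumption): keep columns with non-zero variance.
        self.kept_ = np.flatnonzero(X.var(axis=0) > 0)
        return self

    def _transform(self, X):
        return X[:, self.kept_]


sketch = ColumnSelectorSketch().fit([[1, 0], [2, 0], [3, 0]], [0, 1, 0])
X_out = sketch.transform([[4, 0], [5, 0]])
print(X_out.tolist())  # [[4], [5]]
```

Because validation lives in the public methods, the `_fit` and `_transform` hooks can assume they receive well-formed NumPy arrays.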

Source code in scikit_longitudinal/templates/custom_transformer_mixin_estimator.py

@final
def fit(
    self, X: np.ndarray, y: np.ndarray = None
) -> "CustomTransformerMixinEstimator":
    """Fit the transformer to the input data.

    Validates ``X`` (and ``y`` when provided) with scikit-learn's
    ``check_X_y`` / ``check_array`` and then delegates to the subclass
    implementation in ``_fit``.

    Args:
        X (np.ndarray):
            Training input samples of shape ``(n_samples, n_features)``.
        y (np.ndarray, optional):
            Target values of shape ``(n_samples,)``.

    Returns:
        CustomTransformerMixinEstimator: The fitted transformer (``self``).
    """
    if y is None:
        return self._check_array_decorator(self._fit)(X)
    return self._check_X_y_decorator(self._fit)(X, y)

`transform(X)` ¶

Apply the transformation to the input data.

Validates X with scikit-learn's check_array and delegates to the subclass implementation in _transform.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Input samples of shape `(n_samples, n_features)`.	required

Returns:

Type	Description
`ndarray`	Transformed array.

Source code in scikit_longitudinal/templates/custom_transformer_mixin_estimator.py

@final
def transform(self, X: np.ndarray) -> np.ndarray:
    """Apply the transformation to the input data.

    Validates ``X`` with scikit-learn's ``check_array`` and delegates to
    the subclass implementation in ``_transform``.

    Args:
        X (np.ndarray):
            Input samples of shape ``(n_samples, n_features)``.

    Returns:
        np.ndarray: Transformed array.
    """
    return self._check_array_decorator(self._transform)(X)

`apply_selected_features_and_rename(df, selected_features, regex_match='^(.+)_w(\\d+)$')` `staticmethod` ¶

Apply selected features to the DataFrame and rename non-longitudinal features.

This method selects the specified features from the DataFrame and renames any features that, after selection, appear as single-wave features (i.e., non-longitudinal). This ensures that such features are not misinterpreted as longitudinal in future processing.

Usage Note

This method should be used instead of the standard transform method to handle both feature selection and renaming in one step, especially in pipelines where the temporal structure needs to be preserved.

Regex Match, what is that all about?

The regex identifies wave-based (longitudinal) columns. The default pattern ^(.+)_w(\d+)$ matches a base name followed by a wave number (e.g., feature_w1, feature_w2), so it works out of the box with ELSA-style column names.

The first group (.+) captures the base name of the feature and the second group (\d+) captures the wave number. Columns are grouped by base name; any base name left with a single wave after selection is renamed from <base>_w<N> to <base>_wave<N>, so it no longer matches the pattern.

Why is that important? A feature with only one remaining wave carries no temporal information, so later steps that detect longitudinal features by name should not treat it as longitudinal.
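
The renaming rule can be reproduced in a few lines. This sketch mirrors the logic shown in the source listing for this method; the DataFrame contents and column names are made up for illustration.

```python
import re
from collections import defaultdict

import pandas as pd

pattern = r"^(.+)_w(\d+)$"

# Hypothetical result of a selection step: "bmi" kept only its wave-2 column.
df = pd.DataFrame(
    {"smoke_w1": [0, 1], "smoke_w2": [1, 1], "bmi_w2": [22.5, 30.1], "sex": [0, 1]}
)

# Group wave-based columns by base name.
waves_by_base = defaultdict(list)
for col in df.columns:
    match = re.match(pattern, col)
    if match:
        base, wave = match.groups()
        waves_by_base[base].append((col, wave))

# A base name with exactly one remaining wave is renamed so it stops matching the pattern.
renames = {
    cols[0][0]: f"{base}_wave{cols[0][1]}"
    for base, cols in waves_by_base.items()
    if len(cols) == 1
}
renamed = df.rename(columns=renames)

print(list(renamed.columns))  # ['smoke_w1', 'smoke_w2', 'bmi_wave2', 'sex']
```

`smoke` keeps both waves and stays untouched, while `bmi_w2` becomes `bmi_wave2`; `sex` never matched the pattern and is left alone.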

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame.	required
`selected_features`	`List`	List of selected feature indices.	required
`regex_match`	`str`	Regex pattern to identify wave-based features. Defaults to "^(.+)_w(\d+)$".	`'^(.+)_w(\\d+)$'`

Returns:

Type	Description
`DataFrame`	DataFrame with selected features and renamed non-longitudinal features.

Source code in scikit_longitudinal/preprocessors/feature_selection/correlation_feature_selection/cfs_per_group.py

@staticmethod
def apply_selected_features_and_rename(
    df: pd.DataFrame, selected_features: List, regex_match=r"^(.+)_w(\d+)$"
) -> pd.DataFrame:
    """Apply selected features to the DataFrame and rename non-longitudinal features.

    This method selects the specified features from the DataFrame and renames any features that, after selection,
    appear as single-wave features (i.e., non-longitudinal). This ensures that such features are not misinterpreted
    as longitudinal in future processing.

    !!! warning "Usage Note"
        This method should be used instead of the standard `transform` method to handle both feature selection and
        renaming in one step, especially in pipelines where the temporal structure needs to be preserved.

    !!! question "Regex Match, what is that all about?"
        The regex identifies wave-based (longitudinal) columns. The default pattern `^(.+)_w(\d+)$` matches a
        base name followed by a wave number (e.g., `feature_w1`, `feature_w2`), so it works out of the box with
        ELSA-style column names.

        The first group `(.+)` captures the base name of the feature and the second group `(\d+)` captures the
        wave number. Columns are grouped by base name; any base name left with a single wave after selection is
        renamed from `<base>_w<N>` to `<base>_wave<N>`, so it no longer matches the pattern.

        Why is that important? A feature with only one remaining wave carries no temporal information, so
        later steps that detect longitudinal features by name should not treat it as longitudinal.

    Args:
        df (pd.DataFrame): Input DataFrame.
        selected_features (List): List of selected feature indices.
        regex_match (str, optional): Regex pattern to identify wave-based features. Defaults to "^(.+)_w(\d+)$".

    Returns:
        pd.DataFrame: DataFrame with selected features and renamed non-longitudinal features.
    """
    # Apply selected features
    if selected_features:
        df = df.iloc[:, selected_features].copy()

    # Rename non-longitudinal features
    non_longitudinal_features: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for col in df.columns:
        if not isinstance(col, str):
            continue
        if match := re.match(regex_match, col):
            feature_base_name, wave_number = match.groups()
            non_longitudinal_features[feature_base_name].append((col, wave_number))

    for base_name, columns in non_longitudinal_features.items():
        if len(columns) == 1:
            old_name, wave_number = columns[0]
            new_name = f"{base_name}_wave{wave_number}"
            df.rename(columns={old_name: new_name}, inplace=True)
    return df
