
Longitudinal Dataset

LongitudinalDataset

LongitudinalDataset is the entry-point to manage longitudinal datasets for machine learning tasks in Sklong.

The LongitudinalDataset class handles longitudinal data, offering data management and transformation tools to support machine learning algorithms designed for longitudinal classification tasks. Recall that longitudinal datasets are tabular, but they carry temporal information: the same attribute is measured repeatedly across waves. The class is therefore designed to manage this temporal information and provide a clean interface for machine learning tasks throughout the Sklong library.

What are features_group and non_longitudinal_features?

Two key attributes, features_group and non_longitudinal_features, enable algorithms to interpret the temporal structure of longitudinal data.

  • features_group: A list of lists where each sublist contains indices of a longitudinal attribute's waves, ordered from oldest to most recent. This captures temporal dependencies.
  • non_longitudinal_features: A list of indices for static, non-temporal features excluded from the temporal matrix.

Proper setup of these attributes is critical for leveraging temporal patterns effectively.
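
As a rough illustration of what the two attributes encode, the sketch below applies a hand-written features_group and non_longitudinal_features to a single row. It uses plain Python only; the column names, the row values, and the helper `describe` are made up for the example and are not part of Sklong's API.

```python
# Hypothetical column layout: two longitudinal attributes with two waves each,
# followed by two static features.
columns = ["smoke_w1", "smoke_w2", "chol_w1", "chol_w2", "age", "gender"]
features_group = [[0, 1], [2, 3]]      # waves ordered oldest -> most recent
non_longitudinal_features = [4, 5]     # static columns

row = [1, 0, 5.2, 6.1, 63, "F"]        # one made-up individual


def describe(row, columns, features_group, non_longitudinal_features):
    """Split one row into per-attribute trajectories and static values."""
    trajectories = {}
    for group in features_group:
        # Strip the wave suffix of the first column to recover the attribute name
        base = columns[group[0]].rsplit("_w", 1)[0]
        trajectories[base] = [row[i] for i in group]
    static = {columns[i]: row[i] for i in non_longitudinal_features}
    return trajectories, static


trajectories, static = describe(row, columns, features_group, non_longitudinal_features)
print(trajectories)  # {'smoke': [1, 0], 'chol': [5.2, 6.1]}
print(static)        # {'age': 63, 'gender': 'F'}
```

Each sublist of features_group reads as one attribute's trajectory over time, which is the structure a longitudinal algorithm can exploit; the static columns are kept apart.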

See More In Temporal Dependency Guide

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_path | Union[str, Path] | Path to the dataset file (supports ARFF and CSV formats). | required |
| data_frame | Optional[DataFrame] | If provided, uses this DataFrame as the dataset, ignoring file_path. | None |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| data | DataFrame | Read-only access to the loaded dataset. |
| target | Series | Read-only access to the target variable. |
| X_train | ndarray | Read-only access to the training data. |
| X_test | ndarray | Read-only access to the test data. |
| y_train | Series | Read-only access to the training target. |
| y_test | Series | Read-only access to the test target. |

Examples:

Below are examples illustrating the class's usage.

Basic Usage

from scikit_longitudinal.data_preparation import LongitudinalDataset

# Initialize with a file path
dataset = LongitudinalDataset('./data/stroke.csv') # Replace with your file path

# Load the data
dataset.load_data()

# Load the target variable
dataset.load_target(target_column="stroke_w2")

# Set up feature groups with the "elsa" strategy
dataset.setup_features_group("elsa")

# Split into train and test sets (uses sklearn's train_test_split)
dataset.load_train_test_split(test_size=0.2, random_state=42)
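
To make the "elsa" strategy less opaque, here is a standalone sketch of suffix-based grouping, modelled on `_create_elsa_feature_groups` in the source listing further down this page. It works on a plain list of column names rather than a DataFrame; the function name `elsa_groups` and the column names are illustrative assumptions, not library API.

```python
import re


def elsa_groups(columns):
    """Group column indices by base name, padding missing waves with -1."""
    pattern = re.compile(r"_w(\d+)$")
    by_base = {}   # base name -> list of (wave number, column index)
    max_wave = 0
    for idx, name in enumerate(columns):
        match = pattern.search(name)
        if match is None:
            continue  # no wave suffix: treated as non-longitudinal
        wave = int(match.group(1))
        by_base.setdefault(name[: match.start()], []).append((wave, idx))
        max_wave = max(max_wave, wave)

    groups = []
    for pairs in by_base.values():
        group = [-1] * max_wave          # one slot per wave, -1 means "missing"
        for wave, idx in pairs:
            group[wave - 1] = idx
        groups.append(group)
    return groups


columns = ["smoke_w1", "smoke_w2", "chol_w1", "chol_w2", "chol_w3", "age"]
print(elsa_groups(columns))  # [[0, 1, -1], [2, 3, 4]]
```

In this sketch every group has as many slots as the highest wave number seen, so attributes with fewer waves are padded with -1.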

Advanced: custom feature groups

from scikit_longitudinal.data_preparation import LongitudinalDataset

# Initialize with a file path
dataset = LongitudinalDataset('./data/stroke.csv')

# Load data and target in one step
dataset.load_data_target_train_test_split(target_column="stroke_w2", test_size=0.2, random_state=42)

# Define custom feature groups
custom_groups = [[0, 1], [2, 3]]  # Indices for smoke and cholesterol waves

# Set up feature groups
dataset.setup_features_group(custom_groups)
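
Hard-coded indices are easy to get wrong when columns are reordered, which is one reason setup_features_group also accepts column names. The sketch below shows, under simplifying assumptions, how name-based groups could be resolved to indices and checked for the two-wave minimum. `names_to_indices` and the column list are hypothetical and only loosely follow the `_convert_feature_names_to_indices` method in the source listing.

```python
def names_to_indices(columns, named_groups):
    """Map groups of column names to groups of column indices."""
    position = {name: i for i, name in enumerate(columns)}
    index_groups = []
    for group in named_groups:
        missing = [name for name in group if name not in position]
        if missing:
            raise ValueError(f"Feature name(s) not found in dataset: {missing}")
        if len(group) < 2:
            raise ValueError(f"A longitudinal feature needs at least two waves: {group}")
        index_groups.append([position[name] for name in group])
    return index_groups


columns = ["smoke_w1", "smoke_w2", "chol_w1", "chol_w2", "age"]
named = [["smoke_w1", "smoke_w2"], ["chol_w1", "chol_w2"]]
print(names_to_indices(columns, named))  # [[0, 1], [2, 3]]
```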

Advanced: converting file formats

from scikit_longitudinal.data_preparation import LongitudinalDataset

# Initialize and load an ARFF file
dataset = LongitudinalDataset('./data/elsa_core_dd.arff')
dataset.load_data()

# Convert to CSV
dataset.convert('./data/elsa_core_dd.csv')
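
For readers unfamiliar with ARFF, the following sketch parses a small dense ARFF snippet with the standard library only. It is a simplified stand-in for the handmade reader in the source listing, not the library's implementation: sparse rows are not handled, and the names `parse_arff`, `convert_cell`, and `ARFF_TEXT` are invented for the example.

```python
import math

ARFF_TEXT = """\
@relation demo
@attribute smoke_w1 numeric
@attribute smoke_w2 numeric
@attribute gender {F,M}
@data
1,0,F
0,?,M
"""


def convert_cell(value):
    """Turn one ARFF cell into a float, NaN (for '?'), or a plain string."""
    if value == "?":
        return math.nan
    try:
        return float(value)
    except ValueError:
        return value.strip("'")


def parse_arff(text):
    """Return (column names, rows) from dense ARFF text."""
    columns, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if in_data:
            rows.append([convert_cell(cell.strip()) for cell in line.split(",")])
        elif line.lower().startswith("@attribute "):
            columns.append(line.split()[1])
        elif line.lower().startswith("@data"):
            in_data = True
    return columns, rows


columns, rows = parse_arff(ARFF_TEXT)
print(columns)  # ['smoke_w1', 'smoke_w2', 'gender']
print(rows[0])  # [1.0, 0.0, 'F']
```

Writing the resulting rows out with a header line would give the CSV counterpart of the file.
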
Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
class LongitudinalDataset:
    """LongitudinalDataset is the entry-point to manage longitudinal datasets for machine learning tasks in `Sklong`.

    The `LongitudinalDataset` class handles longitudinal data, offering data management and transformation tools
    to support machine learning algorithms designed for longitudinal classification tasks. Recall that
    longitudinal datasets are tabular, but they carry temporal information: the same attribute is measured
    repeatedly across waves. The class is therefore designed to manage this temporal information and provide
    a clean interface for machine learning tasks throughout the `Sklong` library.

    !!! question "What are features_group and non_longitudinal_features?"
        Two key attributes, `features_group` and `non_longitudinal_features`, enable algorithms to interpret the
        temporal structure of longitudinal data.

        - **features_group**: A list of lists where each sublist contains indices of a longitudinal attribute's
          waves, ordered from oldest to most recent. This captures temporal dependencies.
        - **non_longitudinal_features**: A list of indices for static, non-temporal features excluded from the
          temporal matrix.

        Proper setup of these attributes is critical for leveraging temporal patterns effectively.

        [See More In Temporal Dependency Guide :fontawesome-solid-timeline:](../../tutorials/temporal_dependency.md){ .md-button }

    Args:
        file_path (Union[str, Path]): Path to the dataset file (supports ARFF and CSV formats).
        data_frame (Optional[pd.DataFrame], optional): If provided, uses this DataFrame as the dataset, ignoring
            `file_path`.

    Attributes:
        data (pd.DataFrame): Read-only access to the loaded dataset.
        target (pd.Series): Read-only access to the target variable.
        X_train (np.ndarray): Read-only access to the training data.
        X_test (np.ndarray): Read-only access to the test data.
        y_train (pd.Series): Read-only access to the training target.
        y_test (pd.Series): Read-only access to the test target.

    Examples:
        Below are examples illustrating the class's usage.

        !!! example "Basic Usage"
            ```python
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            # Initialize with a file path
            dataset = LongitudinalDataset('./data/stroke.csv') # Replace with your file path

            # Load the data
            dataset.load_data()

            # Load the target variable
            dataset.load_target(target_column="stroke_w2")

            # Set up feature groups with the "elsa" strategy
            dataset.setup_features_group("elsa")

            # Split into train and test sets (uses sklearn's train_test_split)
            dataset.load_train_test_split(test_size=0.2, random_state=42)
            ```

        !!! example "Advanced: custom feature groups"
            ```python
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            # Initialize with a file path
            dataset = LongitudinalDataset('./data/stroke.csv')

            # Load data and target in one step
            dataset.load_data_target_train_test_split(target_column="stroke_w2", test_size=0.2, random_state=42)

            # Define custom feature groups
            custom_groups = [[0, 1], [2, 3]]  # Indices for smoke and cholesterol waves

            # Set up feature groups
            dataset.setup_features_group(custom_groups)
            ```

        !!! example "Advanced: converting file formats"
            ```python
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            # Initialize and load an ARFF file
            dataset = LongitudinalDataset('./data/elsa_core_dd.arff')
            dataset.load_data()

            # Convert to CSV
            dataset.convert('./data/elsa_core_dd.csv')
            ```
    """

    @check_extension([".csv", ".arff"])
    def __init__(
        self, file_path: Union[str, Path], data_frame: Optional[pd.DataFrame] = None
    ):
        if data_frame is not None:
            self._data = data_frame
            self.file_path = None  # type: ignore
        else:
            self.file_path = Path(file_path) if file_path is not None else None  # type: ignore
            self._data = None
            if self.file_path and not self.file_path.is_file():
                raise FileNotFoundError(f"File not found: {self.file_path}")

        self._target = None
        self._feature_groups = None
        self._non_longitudinal_features = None
        self._X_train = None
        self._X_test = None
        self._y_train = None
        self._y_test = None

        if self._data is None and self.file_path is None:
            raise ValueError("Either file_path or data_frame must be provided.")

    def load_data(self) -> None:
        """Load data from the file into a pandas DataFrame.

        Supports `ARFF` and `CSV` formats. If a DataFrame was provided at initialization, this method does nothing.

        Raises:
            ValueError: If the file format is unsupported (only ARFF and CSV are allowed).
            FileNotFoundError: If the file specified in `file_path` does not exist.
        """
        if self._data is not None:
            return

        file_ext = self.file_path.suffix.lower()

        if file_ext == ".arff":
            self._data = self._arff_to_csv(self.file_path)  # type: ignore
        elif file_ext == ".csv":
            self._data = pd.read_csv(self.file_path)  # type: ignore
        else:
            raise ValueError(
                f"Unsupported file format: {file_ext}. Only ARFF and CSV are supported."
            )

    def load_target(
        self,
        target_column: str,
        target_wave_prefix: str = "class_",
        remove_target_waves: bool = False,
    ) -> None:
        """Extract the target variable from the dataset.

        Optionally removes other target-related columns (e.g., from different waves) if specified.

        Args:
            target_column (str): Name of the target column in the dataset.
            target_wave_prefix (str, optional): Prefix for target columns across waves. Defaults to "class_".
            remove_target_waves (bool, optional): If True, removes all columns with `target_wave_prefix` except
                `target_column`. Defaults to False.

        Raises:
            ValueError: If no data is loaded or `target_column` is not in the dataset.
        """
        if self._data is None:
            raise ValueError("No data is loaded. Load data first.")

        if target_column not in self._data.columns:
            raise ValueError(
                f"Target column '{target_column}' not found in the dataset."
            )

        if remove_target_waves:
            self._data = self._data[
                [
                    col
                    for col in self._data.columns
                    if not (col.startswith(target_wave_prefix) and col != target_column)
                ]
            ]

        self._target = self._data[target_column]
        # Drop without `inplace` so a caller-supplied DataFrame (or a slice of it) is not mutated
        self._data = self._data.drop(columns=[target_column])

    def load_train_test_split(
        self, test_size: float = 0.2, random_state: Optional[int] = None
    ) -> None:
        """Split the dataset into training and testing sets.

        Utilises `sklearn.model_selection.train_test_split` for the split.

        Args:
            test_size (float, optional): Proportion of data for the test set. Defaults to 0.2.
            random_state (int, optional): Seed for reproducible splitting. Defaults to None.

        Raises:
            ValueError: If data or target is not loaded.
        """
        if self._data is None or self._target is None:
            raise ValueError("No data or target is loaded. Load them first.")

        self._X_train, self._X_test, self._y_train, self._y_test = train_test_split(
            self._data, self._target, test_size=test_size, random_state=random_state
        )

    def load_data_target_train_test_split(
        self,
        target_column: str,
        target_wave_prefix: str = "class_",
        remove_target_waves: bool = False,
        test_size: float = 0.2,
        random_state: Optional[int] = None,
    ) -> None:
        """Load data, extract target, and split into train/test sets in one call.

        Combines `load_data`, `load_target`, and `load_train_test_split` for streamlined setup.

        Args:
            target_column (str): Name of the target column.
            target_wave_prefix (str, optional): Prefix for target columns across waves. Defaults to "class_".
            remove_target_waves (bool, optional): If True, removes other target wave columns. Defaults to False.
            test_size (float, optional): Proportion of data for the test set. Defaults to 0.2.
            random_state (int, optional): Seed for reproducible splitting. Defaults to None.

        Raises:
            ValueError: If data or target loading fails due to invalid inputs.
        """
        self.load_data()
        self.load_target(target_column, target_wave_prefix, remove_target_waves)
        self.load_train_test_split(test_size, random_state)

    @property
    def data(self) -> pd.DataFrame:
        """Access the loaded dataset.

        Returns:
            pd.DataFrame: The dataset as a pandas DataFrame.
        """
        return self._data

    @property
    def target(self) -> pd.Series:
        """Access the target variable.

        Returns:
            pd.Series: The target variable as a pandas Series.
        """
        return self._target

    @property
    def X_train(self) -> np.ndarray:
        """Access the training data.

        Returns:
            np.ndarray: The training data as a NumPy array.
        """
        return self._X_train

    @property
    def X_test(self) -> np.ndarray:
        """Access the test data.

        Returns:
            np.ndarray: The test data as a NumPy array.
        """
        return self._X_test

    @property
    def y_train(self) -> pd.Series:
        """Access the training target.

        Returns:
            pd.Series: The training target as a pandas Series.
        """
        return self._y_train

    @property
    def y_test(self) -> pd.Series:
        """Access the test target.

        Returns:
            pd.Series: The test target as a pandas Series.
        """
        return self._y_test

    @staticmethod
    def _arff_to_csv(input_path: Union[str, Path]) -> pd.DataFrame:
        """Convert an ARFF file to a pandas DataFrame.

        !!! note "Disclaimer: This is handmade"
            If new libraries handle such conversion, we highly recommend using them instead of this
            handmade conversion. This is a neat and quick solution, but it is not the most efficient one, in
            our humble opinion.

        Args:
            input_path (Union[str, Path]): Path to the ARFF file.

        Returns:
            pd.DataFrame: The converted DataFrame.
        """

        def parse_row(line: str, row_len: int) -> List[Any]:
            line = line.strip()  # Strip the newline character
            if "{" in line and "}" in line:
                # Sparse data row
                line = line.replace("{", "").replace("}", "")
                row = np.zeros(row_len, dtype=object)
                for data in line.split(","):
                    index, value = data.split()
                    try:
                        row[int(index)] = float(value)
                    except ValueError:
                        row[int(index)] = np.nan if value == "?" else value.strip("'")
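            else:
                # Dense data row: try a numeric conversion first so that negative
                # and exponent-notation values are parsed as floats as well
                row = []
                for value in line.split(","):
                    try:
                        row.append(float(value))
                    except ValueError:
                        row.append(np.nan if value == "?" else value.strip("'"))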

            return row

        def extract_columns_and_data_start_index(
            file_content: List[str],
        ) -> Tuple[List[str], int]:
            columns = []
            len_attr = len("@attribute")

            for i, line in enumerate(file_content):
                if line.lower().startswith("@attribute "):
                    col_name = line[len_attr:].split()[0]
                    columns.append(col_name)
                elif line.lower().startswith("@data"):
                    return columns, i

            return columns, 0

        with open(input_path, "r") as fp:
            file_content = fp.readlines()

        columns, data_index = extract_columns_and_data_start_index(file_content)
        len_row = len(columns)
        rows = [parse_row(line, len_row) for line in file_content[data_index + 1 :]]
        return pd.DataFrame(data=rows, columns=columns)

    @staticmethod
    def _csv_to_arff(df: pd.DataFrame, relation_name: str) -> dict:  # pragma: no cover
        """Convert a DataFrame to an ARFF dictionary.

        !!! note "Disclaimer: This is handmade"
            If new libraries handle such conversion, we highly recommend using them instead of this
            handmade conversion. This is a neat and quick solution, but it is not the most efficient one, in
            our humble opinion.

        Args:
            df (pd.DataFrame): Input DataFrame.
            relation_name (str): Name for the ARFF relation.

        Returns:
            dict: ARFF dictionary with relation, attributes, and data.
        """
        # Work on a filled copy so the caller's DataFrame keeps its missing values
        df = df.fillna("?")

        return {
            "relation": relation_name,
            "attributes": [(col, df[col].dtype.name) for col in df.columns],
            "data": df.values.tolist(),
        }

    @ensure_data_loaded
    @check_extension([".csv", ".arff"])
    def convert(self, output_path: Union[str, Path]) -> None:  # pragma: no cover
        """Convert the dataset to another format (ARFF or CSV).

        !!! note "Disclaimer: This is handmade"
            If new libraries handle such conversion, we highly recommend using them instead of this
            handmade conversion. This is a neat and quick solution, but it is not the most efficient one, in
            our humble opinion.

        Args:
            output_path (Union[str, Path]): Path to save the converted file.

        Raises:
            ValueError: If no data is loaded or the output format is unsupported.
        """
        if self._data is None:
            raise ValueError("No data to convert. Load data first.")

        file_ext = Path(output_path).suffix.lower()

        if file_ext == ".arff":
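            # `file_path` is None when the dataset was built from a DataFrame,
            # so fall back to the output file name for the relation name
            relation_name = self.file_path.stem if self.file_path else Path(output_path).stem
            arff_data = self._csv_to_arff(self._data, relation_name)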
            with open(output_path, "w") as f:
                arff.dump(
                    {
                        "description": "",
                        "relation": arff_data["relation"],
                        "attributes": arff_data["attributes"],
                        "data": arff_data["data"],
                    },
                    f,
                )
        elif file_ext == ".csv":
            self._data.to_csv(output_path, index=False, na_rep="")
        else:
            raise ValueError(f"Unsupported file format: {file_ext}")

    @ensure_data_loaded
    def save_data(self, output_path: Union[str, Path]) -> None:  # pragma: no cover
        """Save the dataset to a file.

        Wraps the `convert` method to save in the specified format.

        Args:
            output_path (Union[str, Path]): Path to save the file.

        Raises:
            ValueError: If no data is loaded.
        """
        if self._data is None:
            raise ValueError("No data to save. Load or convert data first.")

        self.convert(output_path)

    @validate_feature_groups
    def setup_features_group(
        self, input_data: Union[str, List[List[Union[str, int]]]]
    ) -> None:
        """Configure feature groups and non-longitudinal features for longitudinal analysis.

        This method defines how features are grouped to capture temporal dependencies across waves. It supports three
        distinct input types, each suited to different use cases, with detailed examples and explanations below.

        === "Using `elsa` for Automatic Grouping"

            Automatically groups features based on wave suffixes (e.g., "_w1", "_w2") found in column
            names. This is ideal for datasets like the English Longitudinal Study of Ageing (ELSA), where features are
            consistently named with wave indicators. The ELSA dataset, focused on individuals aged 50+ in England,
            includes longitudinal data such as "smoke_w1", "smoke_w2", etc., which this method organizes into groups.

            !!! tip "Where to find those datasets"
                To find those datasets, feel free to open an issue and ask us!

                [Open An Issue! :fontawesome-brands-square-github:](https://scikit-longitudinal.readthedocs.io/latest//issues){ .md-button }

            How It Works:

            - [x] Identifies base feature names (e.g., "smoke") and their wave suffixes (e.g., "_w1", "_w2").
            - [x] Creates groups with indices ordered from oldest to most recent wave, padding with -1 for missing waves.
            - [x] Non-longitudinal features (e.g., "age_wave8") are excluded from groups unless explicitly renamed.

            Example:

            ```python
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            # Load an ELSA dataset
            dataset = LongitudinalDataset('./data/elsa_core.csv')
            dataset.load_data()
            dataset.load_target("stroke_w2")

            # Automatically group features by wave suffixes
            dataset.setup_features_group("elsa")

            # Resulting groups might look like (every group is padded to the highest wave number):
            # [[0, 1, -1], [2, 3, 4]]  # e.g., [smoke_w1, smoke_w2, N/A], [chol_w1, chol_w2, chol_w3]
            print(dataset.feature_groups())  # Indices
            print(dataset.feature_groups(names=True))  # Names
            ```

            Use Case:

            Best for ELSA or similarly structured datasets with clear wave-based naming conventions.

        === "List of Lists of Integers for Direct Indices"

            Allows manual specification of feature indices for each group. This provides precise control
            over which columns are grouped together, useful when wave patterns are irregular or known in advance.

            How It Works:

            - [x] Each sublist contains integer indices corresponding to columns in the DataFrame.
            - [x] Order matters: indices should reflect temporal sequence (oldest to newest).
            - [x] Use -1 to pad groups if waves are missing, ensuring alignment across groups.

            Example:

            ```python
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            # Load a dataset
            dataset = LongitudinalDataset('./data/health.csv')
            dataset.load_data()

            # Define groups manually with indices
            custom_groups = [[0, 1], [2, 3, -1]]  # e.g., [bp_w1, bp_w2], [weight_w1, weight_w2, N/A]
            dataset.setup_features_group(custom_groups)

            # Verify setup
            print(dataset.feature_groups())  # [[0, 1], [2, 3, -1]]
            ```

            Use Case:

            Ideal when you have specific knowledge of column indices and need fine-grained control.

        === "List of Lists of Strings for Feature Names"

            Specifies feature groups using column names, which are then converted to indices. This is
            intuitive for users familiar with dataset feature names, enhancing readability and reducing errors.

            How It Works:

            - [x] Each sublist contains strings matching DataFrame column names.
            - [x] Names are mapped to their respective indices internally.
            - [x] No padding is needed in the input; alignment is handled post-conversion.

            Example:

            ```python
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            # Load a dataset
            dataset = LongitudinalDataset('./data/stroke.csv')
            dataset.load_data()

            # Define groups with feature names
            custom_names = [['smoke_w1', 'smoke_w2'], ['chol_w1', 'chol_w2']]
            dataset.setup_features_group(custom_names)

            # Verify setup
            print(dataset.feature_groups(names=True))  # [['smoke_w1', 'smoke_w2'], ['chol_w1', 'chol_w2']]
            print(dataset.feature_groups())  # Corresponding indices, e.g., [[0, 1], [2, 3]]
            ```

            Use Case:

            Perfect for datasets where feature names are meaningful and users prefer working with names
            over indices.

        ––––––––––––––––––––––––––––

        !!! question "Want more automatic handlers?"
            If you want more automatic handlers, like the one for the ELSA databases, feel free to open an issue
            and ask us!

            [Open An Issue! :fontawesome-brands-square-github:](https://scikit-longitudinal.readthedocs.io/latest//issues){ .md-button }

        Args:
            input_data (Union[str, List[List[Union[str, int]]]]): Input to define feature groups:

                - [x] "elsa": Auto-groups based on wave suffixes.
                - [x] List[List[int]]: Feature indices.
                - [x] List[List[str]]: Feature names.

        Raises:
            ValueError: If `input_data` is invalid, feature names are missing, or groups lack sufficient waves.
        """
        if isinstance(input_data, str) and input_data.lower() == "elsa":
            self._feature_groups = self._create_elsa_feature_groups()
        elif isinstance(input_data, list) and all(
            isinstance(group, list) for group in input_data
        ):
            if all(isinstance(item, int) for group in input_data for item in group):
                self._feature_groups = input_data
            elif all(isinstance(item, str) for group in input_data for item in group):
                self._feature_groups = self._convert_feature_names_to_indices(
                    input_data
                )

        if self._feature_groups is None:
            raise ValueError(
                f"Invalid input data: {input_data} or unknown error has occurred."
            )

        for group in self._feature_groups:
            if len(group) == 1 or (len(group) == 2 and -1 in group):
                # Build a single message string instead of passing two arguments to ValueError
                raise ValueError(
                    f"A longitudinally represented feature should be in at least two waves: {group}"
                )
        # Populate non_longitudinal_features
        feature_group_names = [
            name for group in self.feature_groups(names=True) for name in group
        ]
        non_longitudinal_feature_names = set(self._data.columns) - set(
            feature_group_names
        )
        # Sort the indices so the result is deterministic (set iteration order is not)
        self._non_longitudinal_features = sorted(
            self._data.columns.get_loc(name) for name in non_longitudinal_feature_names
        )

    @validate_feature_groups
    def _convert_feature_names_to_indices(
        self, feature_groups: List[List[str]]
    ) -> List[List[int]]:
        """Convert feature names to their column indices.

        Args:
            feature_groups (List[List[str]]): Feature groups as lists of feature names.

        Returns:
            List[List[int]]: Feature groups as lists of indices.

        Raises:
            ValueError: If a feature name is not in the dataset.
        """
        column_indices = {col: i for i, col in enumerate(self._data.columns)}
        index_groups = []
        for group in feature_groups:
            index_group = []
            for feature_name in group:
                if feature_name not in column_indices:
                    raise ValueError(
                        f"Feature name not found in dataset: {feature_name}"
                    )
                index_group.append(column_indices[feature_name])
            index_groups.append(index_group)

        return index_groups

    def _create_elsa_feature_groups(self) -> List[List[int]]:
        """Generate feature groups for the "elsa" strategy.

        Groups features by base name and wave suffix (e.g., "_w1", "_w2"), padding with -1 for alignment.

        Returns:
            List[List[int]]: Feature groups as indices, padded where necessary.
        """
        wave_columns = {}
        wave_suffix_pattern = re.compile(r"_w(\d+)$")
        max_wave = 0

        for idx, col_name in enumerate(self._data.columns):
            if match := wave_suffix_pattern.search(col_name):
                wave_num = int(match[1])
                base_name = col_name[: match.start()]
                if base_name not in wave_columns:
                    wave_columns[base_name] = []
                wave_columns[base_name].append((wave_num, idx))
                if wave_num > max_wave:
                    max_wave = wave_num

        feature_groups = []
        for columns in wave_columns.values():
            sorted_columns = sorted(columns, key=lambda x: x[0])
            padded_group = [-1] * max_wave
            for wave_num, idx in sorted_columns:
                padded_group[wave_num - 1] = idx
            feature_groups.append(padded_group)

        return feature_groups

    def feature_groups(self, names: bool = False) -> List[List[Union[int, str]]]:
        """Retrieve the feature groups.

        Returns -1 placeholders as "N/A" when `names=True`.

        Args:
            names (bool, optional): If True, returns feature names instead of indices. Defaults to False.

        Returns:
            List[List[Union[int, str]]]: Feature groups as indices or names.
        """
        if names:
            return [
                [self._data.columns[i] if i != -1 else "N/A" for i in group]
                for group in self._feature_groups
            ]
        return self._feature_groups

    def non_longitudinal_features(self, names: bool = False) -> List[Union[int, str]]:
        """Retrieve the non-longitudinal features.

        Args:
            names (bool, optional): If True, returns feature names instead of indices. Defaults to False.

        Returns:
            List[Union[int, str]]: Non-longitudinal features as indices or names.
        """
        if names:
            return [self._data.columns[i] for i in self._non_longitudinal_features]
        return self._non_longitudinal_features

    def set_data(self, data: pd.DataFrame) -> None:
        """Set the data attribute.

        Args:
            data (pd.DataFrame):
                The data.

        """
        self._data = data

    def set_target(self, target: pd.Series) -> None:
        """Set the target attribute.

        Args:
            target (pd.Series):
                The target.

        """
        self._target = target

    def setX_train(self, X_train: pd.DataFrame) -> None:
        """Set the training data attribute.

        Args:
            X_train (pd.DataFrame):
                The training data.

        """
        self._X_train = X_train

    def setX_test(self, X_test: pd.DataFrame) -> None:
        """Set the test data attribute.

        Args:
            X_test (pd.DataFrame):
                The test data.

        """
        self._X_test = X_test

    def sety_train(self, y_train: pd.Series) -> None:
        """Set the training target data attribute.

        Args:
            y_train (pd.Series):
                The training target data.

        """
        self._y_train = y_train

    def sety_test(self, y_test: pd.Series) -> None:
        """Set the test target data attribute.

        Args:
            y_test (pd.Series):
                The test target data.

        """
        self._y_test = y_test

    def to_wide(
        self,
        *,
        id_col: str,
        time_col: str,
        longitudinal_columns: List[str],
        static_columns: List[str] = (),
        wave_format: str = "{feature}_w{wave}",
        output_path: Optional[Union[str, Path]] = None,
    ) -> pd.DataFrame:
        """Pivot the dataset's long-format data into wide format.

        One row per `(subject, time)` becomes one row per subject, with each
        longitudinal column expanded into one column per observed wave
        (oldest \u2192 newest). The dataset's `data`, `feature_groups`, and
        `non_longitudinal_features` are updated in place to match the new
        layout.

        Args:
            id_col (str): Subject identifier column.
            time_col (str): Observation time column.
            longitudinal_columns (List[str]): Columns expanded across waves.
            static_columns (List[str], optional): Columns kept as-is per subject.
                Must be constant within each subject. Defaults to `()`.
            wave_format (str): Python `str.format` template that names every wide column
                produced from a longitudinal source column. Two placeholders are substituted
                per cell: `{feature}` is the original long-format column name (e.g. `"bp"`),
                and `{wave}` is the value taken straight from `time_col` for that observation
                (rendered with its native type — typically an int like `0`, `1`, `2`, but
                strings such as `"2008"` work too). Waves are emitted in sorted order, so the
                template fully determines the wide schema: the default `"{feature}_w{wave}"`
                yields `bp_w0, bp_w1, bp_w2, chol_w0, ...`; `"{feature}.t{wave}"` yields
                `bp.t0, bp.t1, ...`; and `"{wave}_{feature}"` flips the order to
                `0_bp, 1_bp, ...`. The template must contain both placeholders and must
                produce unique column names across `(feature, wave)` pairs. Defaults to
                `"{feature}_w{wave}"`.
            output_path (Optional[Union[str, Path]]): If set, also write the wide dataframe
                to a CSV file at this path.

        Returns:
            pd.DataFrame: The newly stored wide-format dataframe.

        Raises:
            ValueError: If no data is loaded, if `(id_col, time_col)` rows are duplicated,
                if a static column varies within a subject, or if any referenced column
                is missing.

        Examples:
            !!! example "Basic Usage"
                ```python
                from scikit_longitudinal.data_preparation import LongitudinalDataset

                dataset = LongitudinalDataset(file_path=None, data_frame=long_df)
                dataset.to_wide(
                    id_col="pid",
                    time_col="wave",
                    longitudinal_columns=["bp", "chol"],
                    static_columns=["sex"],
                    output_path="./wide.csv",
                )
                ```
        """
        if self._data is None:
            raise ValueError("No data is loaded. Load data first.")

        wide_df, feature_groups, non_long = long_to_wide(
            self._data,
            id_col=id_col,
            time_col=time_col,
            longitudinal_columns=list(longitudinal_columns),
            static_columns=list(static_columns),
            wave_format=wave_format,
        )
        self._data = wide_df
        self._feature_groups = feature_groups
        self._non_longitudinal_features = non_long
        if output_path is not None:
            wide_df.to_csv(output_path, index=True, na_rep="")
        return wide_df

    def to_long(
        self,
        *,
        feature_base_names: Optional[List[str]] = None,
        id_col: str = "subject_id",
        time_col: str = "wave",
        keep_static: bool = True,
        output_path: Optional[Union[str, Path]] = None,
    ) -> pd.DataFrame:
        """Reshape the dataset's wide-format data into long format.

        Uses the dataset's own `feature_groups` (set via `setup_features_group`)
        to drive the reshape. The dataset's `data` is replaced with the long
        dataframe and `feature_groups` / `non_longitudinal_features` are
        cleared, since they no longer apply to a long layout.

        Args:
            feature_base_names (Optional[List[str]]): Names for the long-format feature
                columns, one per group. Defaults to `["feature_0", "feature_1", ...]`.
            id_col (str): Output id column name. Defaults to `"subject_id"`.
            time_col (str): Output wave column name. Defaults to `"wave"`.
            keep_static (bool): Whether to repeat static columns on every long row.
                Defaults to `True`.
            output_path (Optional[Union[str, Path]]): If set, also write the long dataframe
                to a CSV file at this path.

        Returns:
            pd.DataFrame: The newly stored long-format dataframe.

        Raises:
            ValueError: If no data is loaded or `setup_features_group(...)` has not been
                called yet.

        Examples:
            !!! example "Basic Usage"
                ```python
                from scikit_longitudinal.data_preparation import LongitudinalDataset

                dataset = LongitudinalDataset('./stroke.csv')
                dataset.load_data()
                dataset.setup_features_group("elsa")
                dataset.to_long(feature_base_names=["bp", "chol"], output_path="./long.csv")
                ```
        """
        if self._data is None:
            raise ValueError("No data is loaded. Load data first.")
        if self._feature_groups is None:
            raise ValueError(
                "setup_features_group(...) must be called before to_long()."
            )

        long_df = wide_to_long(
            self._data,
            features_group=self._feature_groups,
            non_longitudinal_features=self._non_longitudinal_features,
            feature_base_names=feature_base_names,
            id_col=id_col,
            time_col=time_col,
            keep_static=keep_static,
        )
        self._data = long_df
        self._feature_groups = None
        self._non_longitudinal_features = None
        if output_path is not None:
            long_df.to_csv(output_path, index=False, na_rep="")
        return long_df

load_data()

Load data from the file into a pandas DataFrame.

Supports ARFF and CSV formats. If a DataFrame was provided at initialization, this method does nothing.

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the file format is unsupported (only ARFF and CSV are allowed). |
| FileNotFoundError | If the file specified in file_path does not exist. |

Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def load_data(self) -> None:
    """Load data from the file into a pandas DataFrame.

    Supports `ARFF` and `CSV` formats. If a DataFrame was provided at initialization, this method does nothing.

    Raises:
        ValueError: If the file format is unsupported (only ARFF and CSV are allowed).
        FileNotFoundError: If the file specified in `file_path` does not exist.
    """
    if self._data is not None:
        return

    file_ext = self.file_path.suffix.lower()

    if file_ext == ".arff":
        self._data = self._arff_to_csv(self.file_path)  # type: ignore
    elif file_ext == ".csv":
        self._data = pd.read_csv(self.file_path)  # type: ignore
    else:
        raise ValueError(
            f"Unsupported file format: {file_ext}. Only ARFF and CSV are supported."
        )
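
The extension check in load_data() can be tried in isolation. The snippet below is a minimal sketch, not the library's code: `detect_format` is a hypothetical helper that mirrors the suffix dispatch shown above and only reports which loader would be selected, without reading any file.

```python
from pathlib import Path


def detect_format(file_path):
    """Return "arff" or "csv" based on the file extension (hypothetical helper)."""
    # Lower-casing the suffix makes the check case-insensitive, as in load_data().
    file_ext = Path(file_path).suffix.lower()
    if file_ext == ".arff":
        return "arff"
    if file_ext == ".csv":
        return "csv"
    raise ValueError(
        f"Unsupported file format: {file_ext}. Only ARFF and CSV are supported."
    )


print(detect_format("./data/stroke.CSV"))  # csv
```

Because the decision rests on the suffix alone, a CSV file saved with another extension (for example `.txt`) is rejected by this check.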

load_target(target_column, target_wave_prefix='class_', remove_target_waves=False)

Extract the target variable from the dataset.

Optionally removes other target-related columns (e.g., from different waves) if specified.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| target_column | str | Name of the target column in the dataset. | required |
| target_wave_prefix | str | Prefix for target columns across waves. Defaults to "class_". | 'class_' |
| remove_target_waves | bool | If True, removes all columns with target_wave_prefix except target_column. Defaults to False. | False |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If no data is loaded or target_column is not in the dataset. |

Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def load_target(
    self,
    target_column: str,
    target_wave_prefix: str = "class_",
    remove_target_waves: bool = False,
) -> None:
    """Extract the target variable from the dataset.

    Optionally removes other target-related columns (e.g., from different waves) if specified.

    Args:
        target_column (str): Name of the target column in the dataset.
        target_wave_prefix (str, optional): Prefix for target columns across waves. Defaults to "class_".
        remove_target_waves (bool, optional): If True, removes all columns with `target_wave_prefix` except
            `target_column`. Defaults to False.

    Raises:
        ValueError: If no data is loaded or `target_column` is not in the dataset.
    """
    if self._data is None:
        raise ValueError("No data is loaded. Load data first.")

    if target_column not in self._data.columns:
        raise ValueError(
            f"Target column '{target_column}' not found in the dataset."
        )

    if remove_target_waves:
        self._data = self._data[
            [
                col
                for col in self._data.columns
                if not (col.startswith(target_wave_prefix) and col != target_column)
            ]
        ]

    self._target = self._data[target_column]
    self._data.drop(columns=[target_column], inplace=True)
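
The effect of remove_target_waves can be illustrated on a plain list of column names. This is a sketch under assumptions: `filter_target_waves` and the column names are hypothetical, and only the filtering condition is taken from the source above.

```python
def filter_target_waves(columns, target_column, target_wave_prefix="class_"):
    """Keep every column except targets from other waves (hypothetical helper)."""
    return [
        col
        for col in columns
        # Drop a column only if it carries the prefix AND is not the chosen target.
        if not (col.startswith(target_wave_prefix) and col != target_column)
    ]


columns = ["smoke_w1", "smoke_w2", "class_stroke_w1", "class_stroke_w2"]
print(filter_target_waves(columns, target_column="class_stroke_w2"))
# ['smoke_w1', 'smoke_w2', 'class_stroke_w2']
```

The chosen target survives this filter; in load_target() it is removed from the data only afterwards, once it has been stored as the target.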

load_train_test_split(test_size=0.2, random_state=None)

Split the dataset into training and testing sets.

Utilises sklearn.model_selection.train_test_split for the split.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| test_size | float | Proportion of data for the test set. Defaults to 0.2. | 0.2 |
| random_state | Optional[int] | Seed for reproducible splitting. Defaults to None. | None |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If data or target is not loaded. |

Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def load_train_test_split(
    self, test_size: float = 0.2, random_state: Optional[int] = None
) -> None:
    """Split the dataset into training and testing sets.

    Utilises `sklearn.model_selection.train_test_split` for the split.

    Args:
        test_size (float, optional): Proportion of data for the test set. Defaults to 0.2.
        random_state (int, optional): Seed for reproducible splitting. Defaults to None.

    Raises:
        ValueError: If data or target is not loaded.
    """
    if self._data is None or self._target is None:
        raise ValueError("No data or target is loaded. Load them first.")

    self._X_train, self._X_test, self._y_train, self._y_test = train_test_split(
        self._data, self._target, test_size=test_size, random_state=random_state
    )

setup_features_group(input_data)

Configure feature groups and non-longitudinal features for longitudinal analysis.

This method defines how features are grouped to capture temporal dependencies across waves. It supports three distinct input types, each suited to different use cases, with detailed examples and explanations below.

Using elsa for Automatic Grouping

Automatically groups features based on wave suffixes (e.g., "_w1", "_w2") found in column names. This is ideal for datasets like the English Longitudinal Study of Ageing (ELSA), where features are consistently named with wave indicators. The ELSA dataset, focused on individuals aged 50+ in England, includes longitudinal data such as "smoke_w1", "smoke_w2", etc., which this method organizes into groups.

Where to find these datasets

To obtain these datasets, open an issue and ask us!

Open An Issue!

How It Works:

  • Identifies base feature names (e.g., "smoke") and their wave suffixes (e.g., "_w1", "_w2").
  • Creates groups with indices ordered from oldest to most recent wave, padding with -1 for missing waves.
  • Columns whose names do not end in a "_w<number>" suffix (e.g., "age_wave8") are treated as non-longitudinal and left out of the groups unless they are renamed to match.

Example:

from scikit_longitudinal.data_preparation import LongitudinalDataset

# Load an ELSA dataset
dataset = LongitudinalDataset('./data/elsa_core.csv')
dataset.load_data()
dataset.load_target("stroke_w2")

# Automatically group features by wave suffixes
dataset.setup_features_group("elsa")

# Every group is padded with -1 up to the highest wave number, so the result might look like:
# [[0, 1, -1], [2, 3, 4]]  # e.g., [smoke_w1, smoke_w2, N/A], [chol_w1, chol_w2, chol_w3]
print(dataset.feature_groups())  # Indices
print(dataset.feature_groups(names=True))  # Names

Use Case:

Best for ELSA or similarly structured datasets with clear wave-based naming conventions.
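
The grouping steps listed above can be reproduced on a plain list of column names. The following is a minimal sketch modelled on the `_create_elsa_feature_groups` source; `group_by_wave_suffix` and the column names are hypothetical and no DataFrame is involved.

```python
import re


def group_by_wave_suffix(column_names):
    """Group column indices by base name using a "_w<number>" suffix (sketch)."""
    wave_suffix_pattern = re.compile(r"_w(\d+)$")
    wave_columns = {}
    max_wave = 0
    for idx, name in enumerate(column_names):
        match = wave_suffix_pattern.search(name)
        if match is None:
            continue  # no wave suffix: the column belongs to no group
        wave_num = int(match.group(1))
        base_name = name[: match.start()]
        wave_columns.setdefault(base_name, []).append((wave_num, idx))
        max_wave = max(max_wave, wave_num)

    groups = []
    for columns in wave_columns.values():
        padded = [-1] * max_wave  # slot k holds the column of wave k + 1
        for wave_num, idx in columns:
            padded[wave_num - 1] = idx
        groups.append(padded)
    return groups


columns = ["smoke_w1", "smoke_w2", "chol_w1", "chol_w2", "chol_w3", "age_wave8"]
print(group_by_wave_suffix(columns))  # [[0, 1, -1], [2, 3, 4]]
```

Two details are worth noting: "age_wave8" is skipped because "_w" is not followed directly by digits, and the slot arithmetic assumes wave numbers start at 1.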

List of Lists of Integers for Direct Indices

Allows manual specification of feature indices for each group. This provides precise control over which columns are grouped together, useful when wave patterns are irregular or known in advance.

How It Works:

  • Each sublist contains integer indices corresponding to columns in the DataFrame.
  • Order matters: indices should reflect temporal sequence (oldest to newest).
  • Use -1 to pad groups if waves are missing, ensuring alignment across groups.

Example:

from scikit_longitudinal.data_preparation import LongitudinalDataset

# Load a dataset
dataset = LongitudinalDataset('./data/health.csv')
dataset.load_data()

# Define groups manually with indices
custom_groups = [[0, 1], [2, 3, -1]]  # e.g., [bp_w1, bp_w2], [weight_w1, weight_w2, N/A]
dataset.setup_features_group(custom_groups)

# Verify setup
print(dataset.feature_groups())  # [[0, 1], [2, 3, -1]]

Use Case:

Ideal when you have specific knowledge of column indices and need fine-grained control.
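
Index groups are checked against a minimum-wave rule before they are accepted. The sketch below isolates that rule as it appears in the setup_features_group source; `check_minimum_waves` is a hypothetical standalone helper, not a function exported by the library.

```python
def check_minimum_waves(feature_groups):
    """Raise ValueError for groups rejected by the two-wave rule (sketch)."""
    for group in feature_groups:
        # Rejected: a single index, or a pair in which one entry is padding.
        if len(group) == 1 or (len(group) == 2 and -1 in group):
            raise ValueError(
                f"A longitudinally represented feature should be in at least two waves: {group}"
            )


check_minimum_waves([[0, 1], [2, 3, -1]])  # passes silently

try:
    check_minimum_waves([[0, 1], [4, -1]])
except ValueError as error:
    print(error)
```

As written, the condition only inspects groups of length one or two, so a longer group such as `[4, -1, -1]` is not rejected by this particular check.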

List of Lists of Strings for Feature Names

Specifies feature groups using column names, which are then converted to indices. This is intuitive for users familiar with dataset feature names, enhancing readability and reducing errors.

How It Works:

  • Each sublist contains strings matching DataFrame column names.
  • Names are mapped to their respective indices internally.
  • No padding is needed in the input; alignment is handled post-conversion.

Example:

from scikit_longitudinal.data_preparation import LongitudinalDataset

# Load a dataset
dataset = LongitudinalDataset('./data/stroke.csv')
dataset.load_data()

# Define groups with feature names
custom_names = [['smoke_w1', 'smoke_w2'], ['chol_w1', 'chol_w2']]
dataset.setup_features_group(custom_names)

# Verify setup
print(dataset.feature_groups(names=True))  # [['smoke_w1', 'smoke_w2'], ['chol_w1', 'chol_w2']]
print(dataset.feature_groups())  # Corresponding indices, e.g., [[0, 1], [2, 3]]

Use Case:

Perfect for datasets where feature names are meaningful and users prefer working with names over indices.
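
The name-to-index mapping can be sketched on a plain list of column names. This follows the logic of `_convert_feature_names_to_indices`; the function name `names_to_indices` and the sample columns are assumptions made for illustration.

```python
def names_to_indices(column_names, feature_groups):
    """Translate groups of column names into groups of column positions (sketch)."""
    column_indices = {name: i for i, name in enumerate(column_names)}
    index_groups = []
    for group in feature_groups:
        index_group = []
        for feature_name in group:
            if feature_name not in column_indices:
                raise ValueError(f"Feature name not found in dataset: {feature_name}")
            index_group.append(column_indices[feature_name])
        index_groups.append(index_group)
    return index_groups


columns = ["smoke_w1", "smoke_w2", "chol_w1", "chol_w2", "sex"]
print(names_to_indices(columns, [["smoke_w1", "smoke_w2"], ["chol_w1", "chol_w2"]]))
# [[0, 1], [2, 3]]
```

Building the lookup dictionary once keeps each name resolution constant-time, and a misspelled name fails early with a message that names the offending feature.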

––––––––––––––––––––––––––––

Want more automatic handlers?

If you would like automatic handlers for other datasets, similar to the one for ELSA, open an issue and ask us!

Open An Issue!

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input_data | Union[str, List[List[Union[str, int]]]] | Input to define feature groups: "elsa" (auto-groups based on wave suffixes), List[List[int]] (feature indices), or List[List[str]] (feature names). | required |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If input_data is invalid, feature names are missing, or groups lack sufficient waves. |

Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
@validate_feature_groups
def setup_features_group(
    self, input_data: Union[str, List[List[Union[str, int]]]]
) -> None:
    """Configure feature groups and non-longitudinal features for longitudinal analysis.

    This method defines how features are grouped to capture temporal dependencies across waves. It supports three
    distinct input types, each suited to different use cases, with detailed examples and explanations below.

    === "Using `elsa` for Automatic Grouping"

        Automatically groups features based on wave suffixes (e.g., "_w1", "_w2") found in column
        names. This is ideal for datasets like the English Longitudinal Study of Ageing (ELSA), where features are
        consistently named with wave indicators. The ELSA dataset, focused on individuals aged 50+ in England,
        includes longitudinal data such as "smoke_w1", "smoke_w2", etc., which this method organizes into groups.

        !!! tip "Where to find these datasets"
            To obtain these datasets, open an issue and ask us!

            [Open An Issue! :fontawesome-brands-square-github:](https://scikit-longitudinal.readthedocs.io/latest//issues){ .md-button }

        How It Works:

        - [x] Identifies base feature names (e.g., "smoke") and their wave suffixes (e.g., "_w1", "_w2").
        - [x] Creates groups with indices ordered from oldest to most recent wave, padding with -1 for missing waves.
        - [x] Columns whose names do not end in a `_w<number>` suffix (e.g., "age_wave8") are treated as
              non-longitudinal and left out of the groups unless they are renamed to match.

        Example:

        ```python
        from scikit_longitudinal.data_preparation import LongitudinalDataset

        # Load an ELSA dataset
        dataset = LongitudinalDataset('./data/elsa_core.csv')
        dataset.load_data()
        dataset.load_target("stroke_w2")

        # Automatically group features by wave suffixes
        dataset.setup_features_group("elsa")

        # Every group is padded with -1 up to the highest wave number, so the result might look like:
        # [[0, 1, -1], [2, 3, 4]]  # e.g., [smoke_w1, smoke_w2, N/A], [chol_w1, chol_w2, chol_w3]
        print(dataset.feature_groups())  # Indices
        print(dataset.feature_groups(names=True))  # Names
        ```

        Use Case:

        Best for ELSA or similarly structured datasets with clear wave-based naming conventions.

    === "List of Lists of Integers for Direct Indices"

        Allows manual specification of feature indices for each group. This provides precise control
        over which columns are grouped together, useful when wave patterns are irregular or known in advance.

        How It Works:

        - [x] Each sublist contains integer indices corresponding to columns in the DataFrame.
        - [x] Order matters: indices should reflect temporal sequence (oldest to newest).
        - [x] Use -1 to pad groups if waves are missing, ensuring alignment across groups.

        Example:

        ```python
        from scikit_longitudinal.data_preparation import LongitudinalDataset

        # Load a dataset
        dataset = LongitudinalDataset('./data/health.csv')
        dataset.load_data()

        # Define groups manually with indices
        custom_groups = [[0, 1], [2, 3, -1]]  # e.g., [bp_w1, bp_w2], [weight_w1, weight_w2, N/A]
        dataset.setup_features_group(custom_groups)

        # Verify setup
        print(dataset.feature_groups())  # [[0, 1], [2, 3, -1]]
        ```

        Use Case:

        Ideal when you have specific knowledge of column indices and need fine-grained control.

    === "List of Lists of Strings for Feature Names"

        Specifies feature groups using column names, which are then converted to indices. This is
        intuitive for users familiar with dataset feature names, enhancing readability and reducing errors.

        How It Works:

        - [x] Each sublist contains strings matching DataFrame column names.
        - [x] Names are mapped to their respective indices internally.
        - [x] No padding is needed in the input; alignment is handled post-conversion.

        Example:

        ```python
        from scikit_longitudinal.data_preparation import LongitudinalDataset

        # Load a dataset
        dataset = LongitudinalDataset('./data/stroke.csv')
        dataset.load_data()

        # Define groups with feature names
        custom_names = [['smoke_w1', 'smoke_w2'], ['chol_w1', 'chol_w2']]
        dataset.setup_features_group(custom_names)

        # Verify setup
        print(dataset.feature_groups(names=True))  # [['smoke_w1', 'smoke_w2'], ['chol_w1', 'chol_w2']]
        print(dataset.feature_groups())  # Corresponding indices, e.g., [[0, 1], [2, 3]]
        ```

        Use Case:

        Perfect for datasets where feature names are meaningful and users prefer working with names
        over indices.

    ––––––––––––––––––––––––––––

    !!! question "Want more automatic handlers?"
        If you would like automatic handlers for other datasets, similar to the one for ELSA, open an issue
        and ask us!

        [Open An Issue! :fontawesome-brands-square-github:](https://scikit-longitudinal.readthedocs.io/latest//issues){ .md-button }

    Args:
        input_data (Union[str, List[List[Union[str, int]]]]): Input to define feature groups:

            - [x] "elsa": Auto-groups based on wave suffixes.
            - [x] List[List[int]]: Feature indices.
            - [x] List[List[str]]: Feature names.

    Raises:
        ValueError: If `input_data` is invalid, feature names are missing, or groups lack sufficient waves.
    """
    if isinstance(input_data, str) and input_data.lower() == "elsa":
        self._feature_groups = self._create_elsa_feature_groups()
    elif isinstance(input_data, list) and all(
        isinstance(group, list) for group in input_data
    ):
        if all(isinstance(item, int) for group in input_data for item in group):
            self._feature_groups = input_data
        elif all(isinstance(item, str) for group in input_data for item in group):
            self._feature_groups = self._convert_feature_names_to_indices(
                input_data
            )

    if self._feature_groups is None:
        raise ValueError(
            f"Invalid input data: {input_data} or unknown error has occurred."
        )

    for group in self._feature_groups:
        if len(group) == 1 or (len(group) == 2 and -1 in group):
            raise ValueError(
                "A longitudinally represented feature should be in at least "
                f"two waves: {group}"
            )
    # Populate non_longitudinal_features
    feature_group_names = [
        name for group in self.feature_groups(names=True) for name in group
    ]
    non_longitudinal_feature_names = set(self._data.columns) - set(
        feature_group_names
    )
    self._non_longitudinal_features = [
        self._data.columns.get_loc(name) for name in non_longitudinal_feature_names
    ]

feature_groups(names=False)

Retrieve the feature groups.

Returns -1 placeholders as "N/A" when names=True.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| names | bool | If True, returns feature names instead of indices. Defaults to False. | False |

Returns:

| Type | Description |
| --- | --- |
| List[List[Union[int, str]]] | Feature groups as indices or names. |

Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def feature_groups(self, names: bool = False) -> List[List[Union[int, str]]]:
    """Retrieve the feature groups.

    Returns -1 placeholders as "N/A" when `names=True`.

    Args:
        names (bool, optional): If True, returns feature names instead of indices. Defaults to False.

    Returns:
        List[List[Union[int, str]]]: Feature groups as indices or names.
    """
    if names:
        return [
            [self._data.columns[i] if i != -1 else "N/A" for i in group]
            for group in self._feature_groups
        ]
    return self._feature_groups
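
The "N/A" substitution performed for names=True can be shown without a dataset. This is a small sketch of the same list comprehension; `groups_as_names` and the sample columns are hypothetical.

```python
def groups_as_names(column_names, feature_groups):
    """Render index groups as names, showing -1 padding as "N/A" (sketch)."""
    return [
        [column_names[i] if i != -1 else "N/A" for i in group]
        for group in feature_groups
    ]


columns = ["smoke_w1", "smoke_w2", "chol_w1", "chol_w2", "chol_w3"]
print(groups_as_names(columns, [[0, 1, -1], [2, 3, 4]]))
# [['smoke_w1', 'smoke_w2', 'N/A'], ['chol_w1', 'chol_w2', 'chol_w3']]
```

The explicit test for -1 matters: in Python, indexing with -1 would otherwise silently return the last column instead of marking a missing wave.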

non_longitudinal_features(names=False)

Retrieve the non-longitudinal features.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| names | bool | If True, returns feature names instead of indices. Defaults to False. | False |

Returns:

| Type | Description |
| --- | --- |
| List[Union[int, str]] | Non-longitudinal features as indices or names. |

Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def non_longitudinal_features(self, names: bool = False) -> List[Union[int, str]]:
    """Retrieve the non-longitudinal features.

    Args:
        names (bool, optional): If True, returns feature names instead of indices. Defaults to False.

    Returns:
        List[Union[int, str]]: Non-longitudinal features as indices or names.
    """
    if names:
        return [self._data.columns[i] for i in self._non_longitudinal_features]
    return self._non_longitudinal_features
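
Non-longitudinal features are the columns that belong to no feature group. The sketch below computes that complement on indices; it is an illustration under assumptions (`non_longitudinal_indices` is hypothetical, and returning the result in column order is a choice of this sketch rather than documented library behaviour).

```python
def non_longitudinal_indices(column_names, feature_groups):
    """Return indices of columns that appear in no feature group (sketch)."""
    # Collect every real column index used by a group, skipping -1 padding.
    grouped = {i for group in feature_groups for i in group if i != -1}
    return [i for i in range(len(column_names)) if i not in grouped]


columns = ["smoke_w1", "smoke_w2", "chol_w1", "chol_w2", "sex", "age"]
print(non_longitudinal_indices(columns, [[0, 1], [2, 3]]))  # [4, 5]
```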

to_wide(*, id_col, time_col, longitudinal_columns, static_columns=(), wave_format='{feature}_w{wave}', output_path=None)

Pivot the dataset's long-format data into wide format.

One row per (subject, time) becomes one row per subject, with each longitudinal column expanded into one column per observed wave (oldest → newest). The dataset's data, feature_groups, and non_longitudinal_features are updated in place to match the new layout.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| id_col | str | Subject identifier column. | required |
| time_col | str | Observation time column. | required |
| longitudinal_columns | List[str] | Columns expanded across waves. | required |
| static_columns | List[str] | Columns kept as-is per subject. Must be constant within each subject. Defaults to (). | () |
| wave_format | str | Python str.format template that names every wide column produced from a longitudinal source column. Two placeholders are substituted per cell: {feature} is the original long-format column name (e.g. "bp"), and {wave} is the value taken straight from time_col for that observation (rendered with its native type, typically an int like 0, 1, 2, but strings such as "2008" work too). Waves are emitted in sorted order, so the template fully determines the wide schema: the default "{feature}_w{wave}" yields bp_w0, bp_w1, bp_w2, chol_w0, ...; "{feature}.t{wave}" yields bp.t0, bp.t1, ...; and "{wave}_{feature}" flips the order to 0_bp, 1_bp, .... The template must contain both placeholders and must produce unique column names across (feature, wave) pairs. Defaults to "{feature}_w{wave}". | '{feature}_w{wave}' |
| output_path | Optional[Union[str, Path]] | If set, also write the wide dataframe to a CSV file at this path. | None |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | The newly stored wide-format dataframe. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If no data is loaded, if (id_col, time_col) rows are duplicated, if a static column varies within a subject, or if any referenced column is missing. |
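
The column names produced by a given wave_format can be previewed with plain str.format. This is a sketch of the naming rule only, not of the pivot itself; `wide_column_names` is a hypothetical helper, and the feature-by-feature ordering is assumed from the examples in the parameter description.

```python
def wide_column_names(features, waves, wave_format="{feature}_w{wave}"):
    """List wide column names, feature by feature, waves in sorted order (sketch)."""
    return [
        wave_format.format(feature=feature, wave=wave)
        for feature in features
        for wave in sorted(waves)
    ]


print(wide_column_names(["bp", "chol"], [0, 1, 2]))
# ['bp_w0', 'bp_w1', 'bp_w2', 'chol_w0', 'chol_w1', 'chol_w2']
print(wide_column_names(["bp"], [0, 1], wave_format="{feature}.t{wave}"))
# ['bp.t0', 'bp.t1']
print(wide_column_names(["bp"], [0, 1], wave_format="{wave}_{feature}"))
# ['0_bp', '1_bp']
```

Previewing names this way is a cheap way to confirm that a custom template yields unique columns before running the actual reshape.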

Examples:

Basic Usage

from scikit_longitudinal.data_preparation import LongitudinalDataset

dataset = LongitudinalDataset(file_path=None, data_frame=long_df)
dataset.to_wide(
    id_col="pid",
    time_col="wave",
    longitudinal_columns=["bp", "chol"],
    static_columns=["sex"],
    output_path="./wide.csv",
)
Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def to_wide(
    self,
    *,
    id_col: str,
    time_col: str,
    longitudinal_columns: List[str],
    static_columns: List[str] = (),
    wave_format: str = "{feature}_w{wave}",
    output_path: Optional[Union[str, Path]] = None,
) -> pd.DataFrame:
    """Pivot the dataset's long-format data into wide format.

    One row per `(subject, time)` becomes one row per subject, with each
    longitudinal column expanded into one column per observed wave
    (oldest \u2192 newest). The dataset's `data`, `feature_groups`, and
    `non_longitudinal_features` are updated in place to match the new
    layout.

    Args:
        id_col (str): Subject identifier column.
        time_col (str): Observation time column.
        longitudinal_columns (List[str]): Columns expanded across waves.
        static_columns (List[str], optional): Columns kept as-is per subject.
            Must be constant within each subject. Defaults to `()`.
        wave_format (str): Python `str.format` template that names every wide column
            produced from a longitudinal source column. Two placeholders are substituted
            per cell: `{feature}` is the original long-format column name (e.g. `"bp"`),
            and `{wave}` is the value taken straight from `time_col` for that observation
            (rendered with its native type — typically an int like `0`, `1`, `2`, but
            strings such as `"2008"` work too). Waves are emitted in sorted order, so the
            template fully determines the wide schema: the default `"{feature}_w{wave}"`
            yields `bp_w0, bp_w1, bp_w2, chol_w0, ...`; `"{feature}.t{wave}"` yields
            `bp.t0, bp.t1, ...`; and `"{wave}_{feature}"` flips the order to
            `0_bp, 1_bp, ...`. The template must contain both placeholders and must
            produce unique column names across `(feature, wave)` pairs. Defaults to
            `"{feature}_w{wave}"`.
        output_path (Optional[Union[str, Path]]): If set, also write the wide dataframe
            to a CSV file at this path.

    Returns:
        pd.DataFrame: The newly stored wide-format dataframe.

    Raises:
        ValueError: If no data is loaded, if `(id_col, time_col)` rows are duplicated,
            if a static column varies within a subject, or if any referenced column
            is missing.

    Examples:
        !!! example "Basic Usage"
            ```python
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            dataset = LongitudinalDataset(file_path=None, data_frame=long_df)
            dataset.to_wide(
                id_col="pid",
                time_col="wave",
                longitudinal_columns=["bp", "chol"],
                static_columns=["sex"],
                output_path="./wide.csv",
            )
            ```
    """
    if self._data is None:
        raise ValueError("No data is loaded. Load data first.")

    wide_df, feature_groups, non_long = long_to_wide(
        self._data,
        id_col=id_col,
        time_col=time_col,
        longitudinal_columns=list(longitudinal_columns),
        static_columns=list(static_columns),
        wave_format=wave_format,
    )
    self._data = wide_df
    self._feature_groups = feature_groups
    self._non_longitudinal_features = non_long
    if output_path is not None:
        wide_df.to_csv(output_path, index=True, na_rep="")
    return wide_df
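
As a rough illustration of the reshape that `to_wide` describes, the sketch below reproduces the pivot with plain pandas. It does not call `long_to_wide`, whose internals are not shown on this page. The toy values, the choice to place static columns first, and the way the group indices are computed are all assumptions made for this example.

```python
import pandas as pd

# Toy long-format data: one row per (subject, wave). Column names follow the
# docstring example; the values are made up for illustration.
long_df = pd.DataFrame(
    {
        "pid": [1, 1, 2, 2],
        "wave": [0, 1, 0, 1],
        "bp": [120, 125, 110, 115],
        "chol": [5.1, 5.3, 4.8, 4.9],
        "sex": ["F", "F", "M", "M"],
    }
)

wave_format = "{feature}_w{wave}"
longitudinal_columns = ["bp", "chol"]
waves = sorted(long_df["wave"].unique())

# Static columns: one value per subject (assumed constant within a subject).
wide = long_df.groupby("pid")[["sex"]].first()

# Expand each longitudinal column into one column per wave, oldest first.
feature_groups = []
for feature in longitudinal_columns:
    pivoted = long_df.pivot(index="pid", columns="wave", values=feature)
    group = []
    for wave in waves:
        name = wave_format.format(feature=feature, wave=wave)
        group.append(len(wide.columns))  # positional index of the new column
        wide[name] = pivoted[wave]
    feature_groups.append(group)

print(list(wide.columns))
print(feature_groups)
```

In this sketch the wide columns come out as `sex, bp_w0, bp_w1, chol_w0, chol_w1`, and each inner list of `feature_groups` holds the column positions of one feature's waves in temporal order. The column order the library itself produces may differ.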

to_long(*, feature_base_names=None, id_col='subject_id', time_col='wave', keep_static=True, output_path=None)

Reshape the dataset's wide-format data into long format.

Uses the dataset's own feature_groups (set via setup_features_group) to drive the reshape. The dataset's data is replaced with the long dataframe and feature_groups / non_longitudinal_features are cleared, since they no longer apply to a long layout.

Parameters:

Name Type Description Default
feature_base_names Optional[List[str]]

Names for the long-format feature columns, one per group. Defaults to ["feature_0", "feature_1", ...].

None
id_col str

Output id column name. Defaults to "subject_id".

'subject_id'
time_col str

Output wave column name. Defaults to "wave".

'wave'
keep_static bool

Whether to repeat static columns on every long row. Defaults to True.

True
output_path Optional[Union[str, Path]]

If set, also write the long dataframe to a CSV file at this path.

None

Returns:

Type Description
DataFrame

The newly stored long-format dataframe.

Raises:

Type Description
ValueError

If no data is loaded or setup_features_group(...) has not been called yet.

Examples:

Basic Usage

from scikit_longitudinal.data_preparation import LongitudinalDataset

dataset = LongitudinalDataset('./stroke.csv')
dataset.load_data()
dataset.setup_features_group("elsa")
dataset.to_long(feature_base_names=["bp", "chol"], output_path="./long.csv")
Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def to_long(
    self,
    *,
    feature_base_names: Optional[List[str]] = None,
    id_col: str = "subject_id",
    time_col: str = "wave",
    keep_static: bool = True,
    output_path: Optional[Union[str, Path]] = None,
) -> pd.DataFrame:
    """Reshape the dataset's wide-format data into long format.

    Uses the dataset's own `feature_groups` (set via `setup_features_group`)
    to drive the reshape. The dataset's `data` is replaced with the long
    dataframe and `feature_groups` / `non_longitudinal_features` are
    cleared, since they no longer apply to a long layout.

    Args:
        feature_base_names (Optional[List[str]]): Names for the long-format feature
            columns, one per group. Defaults to `["feature_0", "feature_1", ...]`.
        id_col (str): Output id column name. Defaults to `"subject_id"`.
        time_col (str): Output wave column name. Defaults to `"wave"`.
        keep_static (bool): Whether to repeat static columns on every long row.
            Defaults to `True`.
        output_path (Optional[Union[str, Path]]): If set, also write the long dataframe
            to a CSV file at this path.

    Returns:
        pd.DataFrame: The newly stored long-format dataframe.

    Raises:
        ValueError: If no data is loaded or `setup_features_group(...)` has not been
            called yet.

    Examples:
        !!! example "Basic Usage"
            ```python
            from scikit_longitudinal.data_preparation import LongitudinalDataset

            dataset = LongitudinalDataset('./stroke.csv')
            dataset.load_data()
            dataset.setup_features_group("elsa")
            dataset.to_long(feature_base_names=["bp", "chol"], output_path="./long.csv")
            ```
    """
    if self._data is None:
        raise ValueError("No data is loaded. Load data first.")
    if self._feature_groups is None:
        raise ValueError(
            "setup_features_group(...) must be called before to_long()."
        )

    long_df = wide_to_long(
        self._data,
        features_group=self._feature_groups,
        non_longitudinal_features=self._non_longitudinal_features,
        feature_base_names=feature_base_names,
        id_col=id_col,
        time_col=time_col,
        keep_static=keep_static,
    )
    self._data = long_df
    self._feature_groups = None
    self._non_longitudinal_features = None
    if output_path is not None:
        long_df.to_csv(output_path, index=False, na_rep="")
    return long_df
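
To make the role of the feature groups in this reshape concrete, here is a minimal pandas sketch of a wide-to-long conversion driven by column indices. It is not the library's `wide_to_long`. It assumes every group has the same number of waves, uses the row position as the subject id, and numbers waves from 0, none of which is confirmed by this page.

```python
import pandas as pd

# Toy wide-format data: one row per subject, one column per (feature, wave).
wide_df = pd.DataFrame(
    {
        "sex": ["F", "M"],
        "bp_w0": [120, 110],
        "bp_w1": [125, 115],
        "chol_w0": [5.1, 4.8],
        "chol_w1": [5.3, 4.9],
    }
)
features_group = [[1, 2], [3, 4]]  # column indices, oldest to most recent wave
non_longitudinal_features = [0]    # index of the static column
feature_base_names = ["bp", "chol"]

n_waves = len(features_group[0])  # assumption: all groups have equal length

frames = []
for wave in range(n_waves):
    # One block of rows per wave: subject id, wave number, then the values.
    frame = pd.DataFrame({"subject_id": wide_df.index, "wave": wave})
    for name, group in zip(feature_base_names, features_group):
        frame[name] = wide_df.iloc[:, group[wave]].to_numpy()
    for idx in non_longitudinal_features:
        # Static columns are repeated on every long row (like keep_static=True).
        frame[wide_df.columns[idx]] = wide_df.iloc[:, idx].to_numpy()
    frames.append(frame)

long_df = (
    pd.concat(frames, ignore_index=True)
    .sort_values(["subject_id", "wave"])
    .reset_index(drop=True)
)
print(long_df)
```

The result has one row per `(subject_id, wave)` pair, which is why index-based feature groups stop being meaningful after the reshape.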

convert(output_path)

Convert the dataset to another format (ARFF or CSV).

Disclaimer: This is handmade

If a dedicated library handles this conversion, we recommend using it instead. This handmade conversion is a quick, convenient solution, but in our opinion it is not the most efficient one.

Parameters:

Name Type Description Default
output_path Union[str, Path]

Path to save the converted file.

required

Raises:

Type Description
ValueError

If no data is loaded or the output format is unsupported.

Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
@ensure_data_loaded
@check_extension([".csv", ".arff"])
def convert(self, output_path: Union[str, Path]) -> None:  # pragma: no cover
    """Convert the dataset to another format (ARFF or CSV).

    !!! note "Disclaimer: This is handmade"
        If a dedicated library handles this conversion, we recommend using it instead. This
        handmade conversion is a quick, convenient solution, but in our opinion it is not the
        most efficient one.

    Args:
        output_path (Union[str, Path]): Path to save the converted file.

    Raises:
        ValueError: If no data is loaded or the output format is unsupported.
    """
    if self._data is None:
        raise ValueError("No data to convert. Load data first.")

    file_ext = Path(output_path).suffix.lower()

    if file_ext == ".arff":
        arff_data = self._csv_to_arff(self._data, self.file_path.stem)
        with open(output_path, "w") as f:
            arff.dump(
                {
                    "description": "",
                    "relation": arff_data["relation"],
                    "attributes": arff_data["attributes"],
                    "data": arff_data["data"],
                },
                f,
            )
    elif file_ext == ".csv":
        self._data.to_csv(output_path, index=False, na_rep="")
    else:
        raise ValueError(f"Unsupported file format: {file_ext}")
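
The `check_extension` decorator is applied to `convert` above, but its implementation is not shown on this page. The sketch below is a hypothetical stand-in that shows one way such a guard could validate the output suffix before the method body runs. The decorator body and the `Exporter` class are assumptions for illustration, not the library's code.

```python
from functools import wraps
from pathlib import Path
from typing import Callable, List, Union


def check_extension(allowed: List[str]) -> Callable:
    """Hypothetical stand-in: reject output paths whose suffix is not allowed."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(self, output_path: Union[str, Path], *args, **kwargs):
            # Compare the lower-cased suffix so ".CSV" and ".csv" are treated alike.
            suffix = Path(output_path).suffix.lower()
            if suffix not in allowed:
                raise ValueError(f"Unsupported file format: {suffix}")
            return func(self, output_path, *args, **kwargs)

        return wrapper

    return decorator


class Exporter:
    """Toy class used only to exercise the decorator."""

    @check_extension([".csv", ".arff"])
    def convert(self, output_path: Union[str, Path]) -> str:
        # A real implementation would write the file; this one only reports the format.
        return Path(output_path).suffix.lower()


exporter = Exporter()
print(exporter.convert("data.CSV"))
try:
    exporter.convert("data.xlsx")
except ValueError as err:
    print(err)
```

Validating the suffix in a decorator keeps the method body focused on the conversion itself, although `convert` above still ends with its own `else` branch as a second line of defence.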

save_data(output_path)

Save the dataset to a file.

Wraps the convert method, which picks the output format (CSV or ARFF) from the file extension of output_path.

Parameters:

Name Type Description Default
output_path Union[str, Path]

Path to save the file.

required

Raises:

Type Description
ValueError

If no data is loaded.

Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
@ensure_data_loaded
def save_data(self, output_path: Union[str, Path]) -> None:  # pragma: no cover
    """Save the dataset to a file.

    Wraps the `convert` method, which picks the output format (CSV or ARFF)
    from the file extension of `output_path`.

    Args:
        output_path (Union[str, Path]): Path to save the file.

    Raises:
        ValueError: If no data is loaded.
    """
    if self._data is None:
        raise ValueError("No data to save. Load or convert data first.")

    self.convert(output_path)
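
Because `save_data` delegates to `convert`, a `.csv` path ends in the `to_csv(index=False, na_rep="")` call shown in the source of `convert`. The sketch below uses only pandas and an in-memory buffer to show what those two arguments do to missing values. The toy frame is made up for illustration.

```python
import io

import pandas as pd

# Toy frame with one missing measurement.
df = pd.DataFrame({"bp_w0": [120.0, None], "sex": ["F", "M"]})

# Same arguments as in `convert`: no index column, missing values as empty fields.
buffer = io.StringIO()
df.to_csv(buffer, index=False, na_rep="")
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back turns the empty field into NaN again.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored["bp_w0"].isna().tolist())
```

Writing missing values as empty fields means they survive a round trip through `read_csv` as missing values rather than as a literal string.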

set_data(data)

Set the data attribute.

Parameters:

Name Type Description Default
data DataFrame

The DataFrame to store as the dataset's data.

required
Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def set_data(self, data: pd.DataFrame) -> None:
    """Set the data attribute.

    Args:
        data (pd.DataFrame):
            The DataFrame to store as the dataset's data.

    """
    self._data = data

set_target(target)

Set the target attribute.

Parameters:

Name Type Description Default
target Series

The Series to store as the dataset's target variable.

required
Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def set_target(self, target: pd.Series) -> None:
    """Set the target attribute.

    Args:
        target (pd.Series):
            The Series to store as the dataset's target variable.

    """
    self._target = target

setX_train(X_train)

Set the training data attribute.

Parameters:

Name Type Description Default
X_train DataFrame

The training data.

required
Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def setX_train(self, X_train: pd.DataFrame) -> None:
    """Set the training data attribute.

    Args:
        X_train (pd.DataFrame):
            The training data.

    """
    self._X_train = X_train

setX_test(X_test)

Set the test data attribute.

Parameters:

Name Type Description Default
X_test DataFrame

The test data.

required
Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def setX_test(self, X_test: pd.DataFrame) -> None:
    """Set the test data attribute.

    Args:
        X_test (pd.DataFrame):
            The test data.

    """
    self._X_test = X_test

sety_train(y_train)

Set the training target data attribute.

Parameters:

Name Type Description Default
y_train Series

The training target data.

required
Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def sety_train(self, y_train: pd.Series) -> None:
    """Set the training target data attribute.

    Args:
        y_train (pd.Series):
            The training target data.

    """
    self._y_train = y_train

sety_test(y_test)

Set the test target data attribute.

Parameters:

Name Type Description Default
y_test Series

The test target data.

required
Source code in scikit_longitudinal/data_preparation/longitudinal_dataset.py
def sety_test(self, y_test: pd.Series) -> None:
    """Set the test target data attribute.

    Args:
        y_test (pd.Series):
            The test target data.

    """
    self._y_test = y_test
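
The setters above write to private attributes, while the attribute table for the class describes `X_train`, `y_train` and the others as read-only. A minimal sketch of that pattern follows. `SplitHolder` is a stand-in written for this example, and the use of Python properties for the read-only side is an assumption, since the library's property definitions are not shown in this section.

```python
import pandas as pd


class SplitHolder:
    """Stand-in for illustration only; it is not the library's LongitudinalDataset."""

    def __init__(self) -> None:
        self._X_train = None
        self._y_train = None

    @property
    def X_train(self):
        # No property setter is defined, so `holder.X_train = ...` raises AttributeError.
        return self._X_train

    @property
    def y_train(self):
        return self._y_train

    def setX_train(self, X_train: pd.DataFrame) -> None:
        self._X_train = X_train

    def sety_train(self, y_train: pd.Series) -> None:
        self._y_train = y_train


features = pd.DataFrame({"bp_w0": [120, 110, 130], "bp_w1": [125, 115, 128]})
labels = pd.Series([0, 1, 0], name="stroke")

holder = SplitHolder()
holder.setX_train(features.iloc[:2])
holder.sety_train(labels.iloc[:2])

try:
    holder.X_train = features  # direct assignment is rejected
except AttributeError:
    print("X_train is read-only; use setX_train instead")
```

Routing every write through an explicit setter makes it visible in calling code where a custom split replaces the stored one.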