nnspike.data.preprocess

This module provides functions for preprocessing and preparing datasets of image frames with sensor data.

The module contains utilities for creating labeled datasets from image files, balancing data distributions, sorting by frame numbers, and merging sensor status data with image metadata. It’s designed to work with robot control data that includes motor positions, sensor readings, and image frames.

Functions:
balance_dataset(df: pd.DataFrame, col_name: str, max_samples: int, num_bins: int) -> pd.DataFrame:

Balances the dataset by limiting the number of samples in each bin of a specified column. Uses histogram binning to ensure uniform distribution across value ranges.

sort_by_frames_number(df: pd.DataFrame) -> pd.DataFrame:

Sorts a DataFrame by the frame number extracted from the ‘image_path’ column and adds the frame_number column as the second column in the DataFrame.

create_label_dataframe(path_pattern: str, course: str) -> pd.DataFrame:

Creates a comprehensive DataFrame with image paths and associated metadata columns including motor speeds, positions, sensor readings, and labeling fields. Initializes all sensor columns with default values for subsequent population.

set_spike_status(label_df: pd.DataFrame, status_df: pd.DataFrame) -> pd.DataFrame:

Merges comprehensive sensor and motor data from status_df into label_df based on matching frame numbers. Updates multiple columns including motor speeds, positions, and various sensor readings (distance, color sensors).

Functions

balance_dataset(df, col_name, max_samples, ...)

Balances the dataset by limiting the number of samples in each bin of a specified column.

create_label_dataframe(path_pattern, course)

Creates a comprehensive DataFrame with image paths and associated metadata for labeling tasks.

set_spike_status(label_df, status_df)

Merges comprehensive sensor and motor data from status_df into label_df based on matching frame numbers.

sort_by_frames_number(df)

Sorts a DataFrame by the frame number extracted from the 'image_path' column.

nnspike.data.preprocess.balance_dataset(df, col_name, max_samples, num_bins)

Balances the dataset by limiting the number of samples in each bin of a specified column.

This function creates a histogram of the specified column and ensures that no bin has more than max_samples samples. If a bin exceeds this limit, excess samples are randomly removed to balance the dataset.

Parameters:
  • df (pd.DataFrame) – The input DataFrame containing the data to be balanced.

  • col_name (str) – The name of the column to be used for creating bins.

  • max_samples (int) – The maximum number of samples allowed per bin.

  • num_bins (int) – The number of bins to divide the column into.

Returns:

A DataFrame with the dataset balanced according to the specified column and bin limits.

Return type:

pd.DataFrame

Note

Make sure the column does not contain:
  1. None/NaN values

  2. empty strings

Otherwise, ValueError: autodetected range of [nan, nan] is not finite may be raised.
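
The binning and capping described above can be sketched roughly as follows. This is an illustrative re-implementation, not the nnspike source: the name balance_dataset_sketch, the seed parameter, the up-front validation, and the combination of np.histogram with np.digitize are all assumptions made for the example. It also assumes the DataFrame index is unique.

```python
import numpy as np
import pandas as pd


def balance_dataset_sketch(df, col_name, max_samples, num_bins, seed=0):
    """Hypothetical sketch: cap the number of rows per histogram bin."""
    values = pd.to_numeric(df[col_name], errors="coerce")
    # Fail early with a readable message instead of NumPy's
    # "autodetected range of [nan, nan] is not finite".
    if values.isna().any():
        raise ValueError(f"column {col_name!r} has missing or non-numeric values")

    # np.histogram returns num_bins + 1 edges spanning [min, max].
    _, edges = np.histogram(values, bins=num_bins)
    # Using only the interior edges maps every row to a bin id in
    # 0..num_bins-1, with the maximum value landing in the last bin.
    bin_ids = np.digitize(values, edges[1:-1])

    rng = np.random.default_rng(seed)
    to_drop = []
    for b in range(num_bins):
        idx = df.index[bin_ids == b].to_numpy()
        if len(idx) > max_samples:
            # Randomly pick the surplus rows of this bin for removal.
            surplus = len(idx) - max_samples
            to_drop.extend(rng.choice(idx, size=surplus, replace=False))
    return df.drop(index=to_drop)
```

Under these assumptions, cleaning the column first (for example with df.dropna(subset=[col_name])) avoids the error mentioned in the note.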

nnspike.data.preprocess.sort_by_frames_number(df)

Sorts a DataFrame by the frame number extracted from the ‘image_path’ column.

This function extracts the frame number from the ‘image_path’ column of the DataFrame, sorts the DataFrame based on these frame numbers, and keeps the ‘frame_number’ column as the 2nd column in the DataFrame.

Parameters:

df (pd.DataFrame) – The input DataFrame containing an ‘image_path’ column with file paths that include frame numbers in the format ‘frame_<number>’.

Returns:

The sorted DataFrame with rows ordered by the extracted frame numbers and the frame_number column as the 2nd column.

Return type:

pd.DataFrame
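
A minimal sketch of the extraction and ordering described above, assuming every path contains a frame_<number> token. The function name sort_by_frames_number_sketch, the regular expression, and the index reset are assumptions for illustration; the actual implementation may differ.

```python
import pandas as pd


def sort_by_frames_number_sketch(df):
    """Hypothetical sketch: numeric sort on the frame number in image_path."""
    # Pull the digits after "frame_" and convert them to integers so that
    # frame_10 sorts after frame_2 (a plain string sort would not).
    frame_numbers = (
        df["image_path"].str.extract(r"frame_(\d+)", expand=False).astype(int)
    )
    # Drop any stale frame_number column, then re-insert it as column 2.
    out = df.drop(columns=["frame_number"], errors="ignore").copy()
    out.insert(1, "frame_number", frame_numbers)
    return out.sort_values("frame_number").reset_index(drop=True)
```

Sorting on the integer value rather than on the path string is the point of the helper: lexical ordering would place frame_10 before frame_2.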

nnspike.data.preprocess.create_label_dataframe(path_pattern, course)

Creates a comprehensive DataFrame with image paths and associated metadata for labeling tasks.

This function searches for image files matching the given path pattern and constructs a DataFrame containing the paths to these images along with multiple columns for sensor data and robot control information. The DataFrame includes columns for:
  • Image metadata: image_path, course, data_type, use flag

  • Target and mode information: mode, target_x

  • Motor data: motor_a_speed, motor_b_speed, motor_a_relative_position, motor_b_relative_position

  • Sensor readings: distance_sensor, color_reflected, color_ambient, color_value

Columns are initialized with default values (0 for numeric sensor and motor fields, NaN for mode and target_x, True for the use flag) and are meant to be populated later by other functions.

Parameters:
  • path_pattern (str) – A glob pattern to match image file paths.

  • course (str) – The name of the course associated with the images.

Returns:

A DataFrame containing the image paths and comprehensive metadata columns with default values for subsequent data population.

Return type:

pd.DataFrame
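
The column layout and defaults described above could be built along these lines. This is a sketch under stated assumptions, not the library's code: the name create_label_dataframe_sketch, the column order, the sorted file listing, and the None default for data_type (whose default the documentation does not state) are all hypothetical.

```python
import glob

import numpy as np
import pandas as pd

# Numeric motor and sensor columns that the description says default to 0.
NUMERIC_COLUMNS = [
    "motor_a_speed", "motor_b_speed",
    "motor_a_relative_position", "motor_b_relative_position",
    "distance_sensor", "color_reflected", "color_ambient", "color_value",
]


def create_label_dataframe_sketch(path_pattern, course):
    """Hypothetical sketch: one row per matched image, defaults elsewhere."""
    paths = sorted(glob.glob(path_pattern))
    df = pd.DataFrame({"image_path": paths})
    df["course"] = course
    df["data_type"] = None    # assumption: default not documented
    df["use"] = True          # every frame is usable until flagged otherwise
    df["mode"] = np.nan       # to be labeled or merged in later
    df["target_x"] = np.nan   # to be labeled later
    for col in NUMERIC_COLUMNS:
        df[col] = 0
    return df
```

A call such as create_label_dataframe_sketch("frames/*.jpg", "demo_course") would use a made-up pattern and course name; substitute the real ones for your recordings.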

nnspike.data.preprocess.set_spike_status(label_df, status_df)

Merges comprehensive sensor and motor data from status_df into label_df based on matching frame numbers.

This function performs a comprehensive merge of robot sensor and control data from status_df into label_df by matching frame_number values. It updates multiple columns including:
  • Motor control: motor_a_speed, motor_b_speed, motor_a_relative_position, motor_b_relative_position

  • Sensor readings: distance_sensor, color_reflected, color_ambient, color_value

  • Robot mode: mode (preserving manual labels from label_df when available)

The merge is performed as a left join, preserving all rows in label_df and only updating sensor/motor data where matching frame numbers exist in status_df. If label_df doesn’t have a frame_number column, it will be automatically added by calling sort_by_frames_number().

Parameters:
  • label_df (pd.DataFrame) – The label DataFrame containing image paths and metadata. Must have or be able to generate a frame_number column.

  • status_df (pd.DataFrame) – The status DataFrame containing comprehensive sensor and motor data with frame_number column for matching.

Returns:

The updated label DataFrame with sensor and motor data merged from status_df. All original columns are preserved, and sensor data is filled where frame matches exist.

Return type:

pd.DataFrame
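
The left-join behaviour described above can be approximated with pandas as shown below. This is a sketch, not the nnspike implementation: the name set_spike_status_sketch and the suffix-based update strategy are assumptions, label_df is assumed to already have a frame_number column, and status_df is assumed to hold at most one row per frame.

```python
import pandas as pd

# Motor and sensor columns listed in the description above.
STATUS_COLUMNS = [
    "motor_a_speed", "motor_b_speed",
    "motor_a_relative_position", "motor_b_relative_position",
    "distance_sensor", "color_reflected", "color_ambient", "color_value",
]


def set_spike_status_sketch(label_df, status_df):
    """Hypothetical sketch: left-merge status readings by frame_number."""
    wanted = ["frame_number", "mode"] + STATUS_COLUMNS
    status = status_df[[c for c in wanted if c in status_df.columns]]
    # A left join keeps every label row; unmatched frames get NaN readings.
    merged = label_df.merge(
        status, on="frame_number", how="left", suffixes=("", "_status")
    )
    for col in status.columns.drop("frame_number"):
        incoming = f"{col}_status"
        if incoming not in merged.columns:
            continue  # label_df lacked this column, so merge added it as-is
        if col == "mode":
            # Manual labels win; status fills only the missing modes.
            merged[col] = merged[col].fillna(merged[incoming])
        else:
            # Status readings win; label defaults remain where no match exists.
            merged[col] = merged[incoming].fillna(merged[col])
        merged = merged.drop(columns=incoming)
    return merged
```

Because unmatched rows pass through NaN during the merge, integer columns may come back as floats in this sketch; cast them afterwards if the dtype matters.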