{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# vitalDSP Data Loader Tutorial\n", "\n", "This notebook walks through the data loading capabilities of vitalDSP, from single-file CSV loading to multi-channel data, validation, metadata extraction, and export.\n", "\n", "## Contents\n", "1. Basic Loading\n", "2. Format-Specific Examples\n", "3. Multi-Channel Data\n", "4. Data Validation\n", "5. Metadata Extraction\n", "6. Data Export\n", "7. Advanced Features\n", "8. Real-World Examples" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from pathlib import Path\n", "\n", "from vitalDSP.utils.data_processing.data_loader import (\n", "    DataLoader,\n", "    DataFormat,\n", "    SignalType,\n", "    load_signal,\n", "    load_multi_channel\n", ")\n", "\n", "# Set plotting style\n", "plt.style.use('seaborn-v0_8-darkgrid')\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Basic Loading\n", "\n", "### Creating Sample Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create sample ECG data\n", "fs = 250  # Sampling rate (Hz)\n", "duration = 10  # Duration (seconds)\n", "# endpoint=False keeps the sample spacing at exactly 1/fs, so the time axis matches fs\n", "t = np.linspace(0, duration, fs * duration, endpoint=False)\n", "\n", "# Simulate an ECG-like signal: two sinusoids (1.2 Hz and 2.4 Hz) plus Gaussian noise\n", "ecg = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.sin(2 * np.pi * 2.4 * t)\n", "ecg += 0.1 * np.random.randn(len(t))\n", "\n", "# Create DataFrame\n", "df = pd.DataFrame({\n", "    'time': t,\n", "    'ecg': ecg\n", "})\n", "\n", "# Save to CSV\n", "df.to_csv('sample_ecg.csv', index=False)\n", "\n", "print(f\"Created sample ECG data: {len(df)} samples at {fs} Hz\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading CSV Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the CSV file\n", "loader = DataLoader('sample_ecg.csv', sampling_rate=250.0, signal_type='ecg')\n", "data = loader.load(time_column='time')\n", "\n", 
"print(f\"Loaded {len(data)} samples\")\n", "print(f\"Columns: {list(data.columns)}\")\n", "print(f\"\\nFirst few rows:\")\n", "print(data.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick Loading with Convenience Function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Quick load\n", "data_quick = load_signal('sample_ecg.csv', sampling_rate=250)\n", "\n", "print(f\"Quick loaded: {data_quick.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing Loaded Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(14, 5))\n", "plt.plot(data['time'], data['ecg'])\n", "plt.xlabel('Time (s)')\n", "plt.ylabel('ECG (mV)')\n", "plt.title('Loaded ECG Signal')\n", "plt.grid(True)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Format-Specific Examples\n", "\n", "### JSON Format" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# Create JSON data with metadata\n", "json_data = {\n", " 'sampling_rate': 250,\n", " 'signal_type': 'ecg',\n", " 'duration': 10,\n", " 'data': [\n", " {'time': float(t), 'ecg': float(e)} \n", " for t, e in zip(data['time'][:100], data['ecg'][:100])\n", " ]\n", "}\n", "\n", "with open('sample_ecg.json', 'w') as f:\n", " json.dump(json_data, f)\n", "\n", "# Load JSON\n", "loader_json = DataLoader('sample_ecg.json')\n", "data_json = loader_json.load()\n", "\n", "print(f\"Loaded from JSON: {len(data_json)} samples\")\n", "print(f\"Metadata: {loader_json.metadata}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NumPy Format" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save as .npy\n", "np.save('sample_ecg.npy', data['ecg'].values)\n", "\n", "# Load .npy\n", "loader_npy = DataLoader('sample_ecg.npy')\n", 
"data_npy = loader_npy.load()\n", "\n", "print(f\"Loaded from .npy: {data_npy.shape}\")\n", "print(f\"Metadata: {loader_npy.metadata}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multi-signal NPZ Format" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create multi-signal data\n", "ppg = 0.5 * np.sin(2 * np.pi * 1.0 * t) + 0.1 * np.random.randn(len(t))\n", "resp = 0.3 * np.sin(2 * np.pi * 0.3 * t) + 0.05 * np.random.randn(len(t))\n", "\n", "# Save as .npz\n", "np.savez('multi_signals.npz', ecg=ecg, ppg=ppg, resp=resp, time=t)\n", "\n", "# Load .npz\n", "loader_npz = DataLoader('multi_signals.npz')\n", "data_npz = loader_npz.load()\n", "\n", "print(f\"Loaded signals: {list(data_npz.keys())}\")\n", "print(f\"ECG shape: {data_npz['ecg'].shape}\")\n", "print(f\"PPG shape: {data_npz['ppg'].shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Multi-Channel Data\n", "\n", "### Creating Multi-Channel Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create multi-channel dataset\n", "df_multi = pd.DataFrame({\n", " 'time': t,\n", " 'ECG': ecg,\n", " 'PPG': ppg,\n", " 'RESP': resp\n", "})\n", "\n", "df_multi.to_csv('multi_channel.csv', index=False)\n", "\n", "print(f\"Created multi-channel data: {df_multi.shape}\")\n", "print(f\"Channels: {list(df_multi.columns)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading Specific Channels" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load only ECG and PPG\n", "loader_multi = DataLoader('multi_channel.csv')\n", "data_selected = loader_multi.load(columns=['time', 'ECG', 'PPG'])\n", "\n", "print(f\"Selected channels: {list(data_selected.columns)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using load_multi_channel" ] }, { "cell_type": "code", "execution_count": null, "metadata": 
{}, "outputs": [], "source": [ "# Load as dictionary of channels\n", "channels = load_multi_channel('multi_channel.csv', channels=['ECG', 'PPG', 'RESP'])\n", "\n", "for name, signal in channels.items():\n", " print(f\"{name}: {len(signal)} samples, mean={signal.mean():.3f}, std={signal.std():.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing Multi-Channel Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)\n", "\n", "# Plot each channel\n", "axes[0].plot(t, channels['ECG'])\n", "axes[0].set_ylabel('ECG (mV)')\n", "axes[0].set_title('ECG Signal')\n", "axes[0].grid(True)\n", "\n", "axes[1].plot(t, channels['PPG'])\n", "axes[1].set_ylabel('PPG (AU)')\n", "axes[1].set_title('PPG Signal')\n", "axes[1].grid(True)\n", "\n", "axes[2].plot(t, channels['RESP'])\n", "axes[2].set_ylabel('RESP (AU)')\n", "axes[2].set_xlabel('Time (s)')\n", "axes[2].set_title('Respiratory Signal')\n", "axes[2].grid(True)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. 
Data Validation\n", "\n", "### Creating Data with Issues" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create data with NaN and Inf\n", "signal_with_issues = ecg.copy()\n", "signal_with_issues[100:110] = np.nan # Add NaN\n", "signal_with_issues[500] = np.inf # Add Inf\n", "\n", "df_issues = pd.DataFrame({\n", " 'time': t,\n", " 'signal': signal_with_issues\n", "})\n", "\n", "df_issues.to_csv('signal_with_issues.csv', index=False)\n", "\n", "print(f\"Created signal with {df_issues['signal'].isna().sum()} NaN values\")\n", "print(f\"Created signal with {np.isinf(df_issues['signal']).sum()} Inf values\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading with Validation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "# Enable validation (will show warnings)\n", "print(\"Loading with validation enabled:\")\n", "loader_validate = DataLoader('signal_with_issues.csv', validate=True)\n", "data_validated = loader_validate.load()\n", "\n", "print(f\"\\nData loaded: {data_validated.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Metadata Extraction\n", "\n", "### Extracting Comprehensive Info" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load with metadata extraction\n", "loader = DataLoader('multi_channel.csv', sampling_rate=250.0, signal_type='ecg')\n", "data = loader.load(time_column='time')\n", "\n", "# Get full info\n", "info = loader.get_info()\n", "\n", "print(\"=== Data Information ===\")\n", "print(f\"File: {info['file_path']}\")\n", "print(f\"Format: {info['format']}\")\n", "print(f\"Signal Type: {info['signal_type']}\")\n", "print(f\"Sampling Rate: {info['sampling_rate']} Hz\")\n", "print(f\"\\nMetadata:\")\n", "for key, value in info['metadata'].items():\n", " print(f\" {key}: {value}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automatic Sampling Rate Detection" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load without specifying sampling rate\n", "loader_auto = DataLoader('multi_channel.csv')\n", "data_auto = loader_auto.load(time_column='time')\n", "\n", "print(f\"Computed sampling rate: {loader_auto.metadata.get('computed_sampling_rate', 'N/A')} Hz\")\n", "print(f\"Expected: 250 Hz\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. 
Data Export\n", "\n", "### Exporting to Multiple Formats" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load original data\n", "loader = DataLoader('multi_channel.csv')\n", "data = loader.load()\n", "\n", "# Export to different formats\n", "print(\"Exporting to multiple formats...\")\n", "\n", "loader.export(data, 'exported_data.csv')\n", "print(\"\u2713 Exported to CSV\")\n", "\n", "loader.export(data, 'exported_data.json')\n", "print(\"\u2713 Exported to JSON\")\n", "\n", "loader.export(data, 'exported_data.pkl')\n", "print(\"\u2713 Exported to Pickle\")\n", "\n", "print(\"\\nExport complete!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Verifying Exported Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load back the exported JSON\n", "loader_verify = DataLoader('exported_data.json')\n", "data_verify = loader_verify.load()\n", "\n", "print(f\"Original shape: {data.shape}\")\n", "print(f\"Exported and reloaded shape: {data_verify.shape}\")\n", "print(f\"\\nData integrity: {'\u2713 PASSED' if data.shape == data_verify.shape else '\u2717 FAILED'}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. 
Advanced Features\n", "\n", "### Loading from NumPy Array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create array\n", "signal_array = np.random.randn(1000)\n", "\n", "# Load from array\n", "loader = DataLoader()\n", "df_from_array = loader.load_from_array(\n", " signal_array,\n", " sampling_rate=250.0,\n", " signal_type='ecg'\n", ")\n", "\n", "print(f\"Loaded from array: {df_from_array.shape}\")\n", "print(f\"Sampling rate: {loader.sampling_rate} Hz\")\n", "print(f\"Signal type: {loader.signal_type.value}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading from DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create DataFrame\n", "df_custom = pd.DataFrame({\n", " 'ecg': np.random.randn(500),\n", " 'ppg': np.random.randn(500)\n", "})\n", "\n", "# Load from DataFrame\n", "loader = DataLoader()\n", "df_loaded = loader.load_from_dataframe(df_custom, sampling_rate=100.0)\n", "\n", "print(f\"Loaded from DataFrame: {df_loaded.shape}\")\n", "print(f\"Metadata: {loader.metadata}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Format Detection" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test format detection\n", "test_files = [\n", " 'data.csv',\n", " 'data.json',\n", " 'data.xlsx',\n", " 'data.npy',\n", " 'data.mat'\n", "]\n", "\n", "print(\"=== Format Detection ===\")\n", "for filename in test_files:\n", " # Create empty file\n", " Path(filename).touch()\n", " \n", " loader = DataLoader(filename)\n", " print(f\"{filename:15} -> {loader.format.value}\")\n", " \n", " # Clean up\n", " Path(filename).unlink()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List Supported Formats" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get supported formats\n", "formats = 
DataLoader.list_supported_formats()\n", "\n", "print(\"=== Supported Formats ===\")\n", "for i, fmt in enumerate(formats, 1):\n", "    if fmt != 'unknown':\n", "        req = DataLoader.get_format_requirements(fmt)\n", "        if req:\n", "            print(f\"{i}. {fmt.upper():10} - {req.get('description', 'N/A')}\")\n", "        else:\n", "            print(f\"{i}. {fmt.upper()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Real-World Examples\n", "\n", "### Example 1: ECG Analysis Pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Complete ECG analysis workflow\n", "\n", "# 1. Load data\n", "loader = DataLoader('sample_ecg.csv', signal_type='ecg')\n", "df = loader.load(time_column='time')\n", "\n", "# 2. Extract signal\n", "ecg_signal = df['ecg'].values\n", "time = df['time'].values\n", "\n", "# 3. Basic statistics\n", "print(\"=== ECG Analysis ===\")\n", "print(f\"Duration: {time[-1]:.1f} seconds\")\n", "print(f\"Samples: {len(ecg_signal)}\")\n", "print(f\"Mean: {ecg_signal.mean():.3f} mV\")\n", "print(f\"Std Dev: {ecg_signal.std():.3f} mV\")\n", "print(f\"Min: {ecg_signal.min():.3f} mV\")\n", "print(f\"Max: {ecg_signal.max():.3f} mV\")\n", "\n", "# 4. 
Visualize\n", "plt.figure(figsize=(14, 5))\n", "plt.plot(time, ecg_signal)\n", "plt.xlabel('Time (s)')\n", "plt.ylabel('ECG (mV)')\n", "plt.title('ECG Signal Analysis')\n", "plt.grid(True, alpha=0.3)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 2: Batch Processing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Simulate multiple patient recordings\n", "for i in range(3):\n", "    # Create a slightly different signal per patient (5 s at 250 Hz)\n", "    t_patient = np.linspace(0, 5, 1250, endpoint=False)\n", "    signal = np.sin(2 * np.pi * (1.0 + i * 0.1) * t_patient)\n", "    signal += 0.1 * np.random.randn(len(t_patient))\n", "\n", "    df_patient = pd.DataFrame({'time': t_patient, 'ecg': signal})\n", "    df_patient.to_csv(f'patient_{i+1}.csv', index=False)\n", "\n", "# Batch process all patients\n", "results = []\n", "\n", "for i in range(3):\n", "    filename = f'patient_{i+1}.csv'\n", "\n", "    # Load\n", "    loader = DataLoader(filename, sampling_rate=250.0)\n", "    data = loader.load()\n", "\n", "    # Analyze\n", "    signal = data['ecg'].values\n", "\n", "    results.append({\n", "        'patient': f'Patient {i+1}',\n", "        'samples': len(signal),\n", "        'mean': signal.mean(),\n", "        'std': signal.std()\n", "    })\n", "\n", "# Display results\n", "results_df = pd.DataFrame(results)\n", "print(\"=== Batch Processing Results ===\")\n", "print(results_df.to_string(index=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example 3: Multi-Signal Comparison" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load multi-channel data\n", "channels = load_multi_channel('multi_channel.csv')\n", "\n", "# Compare signals\n", "print(\"=== Multi-Signal Comparison ===\")\n", "print(f\"{'Signal':<10} {'Mean':<12} {'Std':<12} {'Peak-to-Peak':<15}\")\n", "print(\"-\" * 50)\n", "\n", "for name, signal in channels.items():\n", "    if name != 'time':\n", "        mean_val = signal.mean()\n", "        std_val = signal.std()\n", "        p2p = signal.max() - signal.min()\n", "        print(f\"{name:<10} {mean_val:<12.4f} {std_val:<12.4f} {p2p:<15.4f}\")\n", "\n", "# Visualize comparison; drop the time axis first so every signal gets its own subplot\n", "signal_items = [(name, signal) for name, signal in channels.items() if name != 'time']\n", "fig, axes = plt.subplots(1, len(signal_items), figsize=(15, 4), squeeze=False)\n", "\n", "for ax, (name, signal) in zip(axes[0], signal_items):\n", "    ax.hist(signal, bins=50, edgecolor='black', alpha=0.7)\n", "    ax.set_xlabel('Amplitude')\n", "    ax.set_ylabel('Frequency')\n", "    ax.set_title(f'{name} Distribution')\n", "    ax.grid(True, alpha=0.3)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup\n", "\n", "Remove the files generated by this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# List of files to remove\n", "cleanup_files = [\n", "    'sample_ecg.csv', 'sample_ecg.json', 'sample_ecg.npy',\n", "    'multi_signals.npz', 'multi_channel.csv',\n", "    'signal_with_issues.csv',\n", "    'exported_data.csv', 'exported_data.json', 'exported_data.pkl',\n", "    'patient_1.csv', 'patient_2.csv', 'patient_3.csv'\n", "]\n", "\n", "for filename in cleanup_files:\n", "    if os.path.exists(filename):\n", "        os.remove(filename)\n", "        print(f\"Removed: {filename}\")\n", "\n", "print(\"\\nCleanup complete!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This notebook demonstrated:\n", "\n", "1. \u2713 Loading data from multiple formats (CSV, JSON, NumPy, etc.)\n", "2. \u2713 Working with multi-channel physiological signals\n", "3. \u2713 Data validation and quality checks\n", "4. \u2713 Metadata extraction and sampling rate detection\n", "5. \u2713 Data export to various formats\n", "6. \u2713 Advanced features (array/DataFrame loading, format detection)\n", "7. 
\u2713 Real-world analysis workflows\n", "\n", "### Next Steps\n", "\n", "- Explore preprocessing with `vitalDSP.preprocess`\n", "- Try feature extraction with `vitalDSP.physiological_features`\n", "- Analyze signal quality with `vitalDSP.signal_quality_assessment`\n", "\n", "### Resources\n", "\n", "- [Data Loader Documentation](../data_loader_guide.rst)\n", "- [vitalDSP GitHub](https://github.com/Oucru-Innovations/vital-DSP)\n", "- [API Reference](../api/index.rst)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }