====== Turning Datasets into RKNS Format ======

This guide describes how to convert a structured dataset (e.g., BIDS) into the **RKNS format** using the [[https://code.rekonas.com/rekonas/airflow-template-dataset-to-rkns|airflow-template-dataset-to-rkns]] template repository. The process is divided into three main phases, described in the sections below.
----
  
====== 1. Transform Logic (main.py) ======
  
The core conversion logic lives in ''main.py''. It reads:
  * the raw EDF recording,
  * the event annotations (TSV), and
  * the participant metadata (''participants.tsv'' / ''participants.json''),
and outputs a validated ''*.rkns*'' file (a Zarr zip store).
  
==== CLI Interface ====
  
It is recommended to keep the CLI interface unchanged if possible; it is designed for easier Airflow usage.
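A hypothetical invocation for orientation (every flag name here is a placeholder; the real argument names are defined in ''main.py''):
<code bash>
# Illustrative only; check main.py for the actual CLI arguments.
python main.py \
  --edf sub-0001_task-sleep_eeg.edf \
  --annotations sub-0001_task-sleep_events.tsv \
  --participants-tsv participants.tsv \
  --participants-json participants.json \
  --output out/sub-0001.rkns
</code>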
The ''edf_to_rkns()'' function executes the following steps:
  
==== 1.1 Load and Standardize Signals ====
<code python>
rkns_obj = rkns.from_external_format(
    ...
)
</code>
Uses regex mappings in ''assets/replace_channels.json'' to rename EDF channels to standardized names (e.g., "EEG(sec)" → "EEG-C3-A2").
  
==== 1.2 Add Annotations ====
Converts the TSV file to BIDS-compatible format and adds events:
<code python>
...
</code>
  
==== 1.3 Extract and Categorize Metadata ====
  - Matches participant ID (e.g., "sub-0001") from the EDF filename (see the sketch below this list).
  - Looks up the participant in ''participants.tsv''.
  - Groups metadata by category and adds each to the RKNS object.
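For illustration, the ID-matching step might look like this sketch (the helper name ''extract_participant_id'' is hypothetical; the real logic lives in ''main.py''):
<code python>
import re
from pathlib import Path

def extract_participant_id(edf_path: str) -> str:
    """Pull the BIDS participant ID (e.g., "sub-0001") from an EDF filename."""
    match = re.search(r"(sub-\d+)", Path(edf_path).name)
    if match is None:
        raise ValueError(f"no participant ID in {edf_path!r}")
    return match.group(1)

print(extract_participant_id("sub-0001_task-sleep_eeg.edf"))  # -> sub-0001
</code>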
  
==== 1.4 Finalize and Export ====
<code python>
rkns_obj.populate()                     # Build internal structure
...
</code>
  
==== 1.5 Validate ====
<code python>
validate_rkns_checksum(output_file)     # Verify checksums on all data
</code>
----
  
====== 2. Preprocessing: Generate RKNS-Compatible Event Annotations ======
  
RKNS requires events in a strict tab-separated (TSV) format with three columns: ''onset'', ''duration'', and ''event''.
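A minimal example of the expected layout (columns are tab-separated; the event names here are illustrative):
<code>
onset	duration	event
0.0	30.0	sleep_stage_wake
30.0	30.0	sleep_stage_n1
</code>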
----
  
====== 3. Standardizing Names via Regex Mappings ======
  
Once you have a compliant TSV, normalize channel and event names non-destructively using regex mappings defined in two JSON files.
  
===== Channel Names → assets/replace_channels.json =====
  
EDF channel labels vary wildly (e.g., "EEG C3-A2", "EEG(sec)"). Map them to standardized RKNS names:
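An illustrative mapping (entries adapted from the reference snippets elsewhere on this page; your dataset will need its own patterns):
<code json>
{
  "(?i)^EEG\\(sec\\)\\s*$": "EEG-C3-A2",
  "(?i)^ABDO\\sRES\\s*$": "RESP-ABD",
  "_": "-"
}
</code>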
Keys are regex patterns (applied sequentially), and values are replacement strings. Groups in the pattern can be referenced in the replacement (e.g., ''\1'').
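The sequential semantics can be reproduced in a few lines of Python (a sketch; ''rename'' is a hypothetical helper and the two mapping entries are illustrative):
<code python>
import re

mapping = {
    r"(?i)^EEG\(sec\)\s*$": "EEG-C3-A2",
    r"_": "-",
}

def rename(channel: str) -> str:
    # Apply each rule in order; later rules see the result of earlier ones.
    for pattern, replacement in mapping.items():
        channel = re.sub(pattern, replacement, channel)
    return channel

print(rename("EEG(sec)"))  # -> EEG-C3-A2
</code>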
  
==== Validation ====
  
Use ''test_replace_channels.py'' to validate your mappings:
  
1. Extract unique channel names from your EDF files:
<code bash>
find /data -name "*.edf" -exec edf-peek {} \; | grep "signal_labels" | sort | uniq > assets/extracted_channels.txt
</code>
2. Run the test:
<code bash>
python test_replace_channels.py
</code>
3. Inspect the output:
<code bash>
cat out/renamed_channels.csv
</code>
  
==== AI-Assisted Mapping Generation ====
  
If you have a list of unique channel names, use this prompt with your LLM to accelerate mapping creation:
  
<code>
I will provide you a list of physiological signal channels (e.g., EEG, EOG, EMG, respiratory, cardiac) extracted from original EDF files.
I require an output in JSON format that maps each raw channel name to a standardized name using **sequential, grep-style regex replacements**.

Key requirements:
1. **Include early normalization rules** to handle common delimiters (e.g., `_`, spaces, `.`, `/`, parentheses) by converting them to hyphens (`-`), collapsing multiple hyphens, and trimming leading/trailing hyphens.
2. All patterns must be **case-insensitive** (use `(?i)`).
3. Use **physiologically meaningful, NSRR/AASM-aligned names**, such as:
   - `EEG-C3-M2` (not `EEG-C3_A2` or ambiguous forms)
   - `EMG-LLEG` / `EMG-RLEG` for leg EMG (not `LAT`/`RAT` as position)
   - `RESP-AIRFLOW-THERM` or `RESP-AIRFLOW-PRES` (not generic `RESP-NASAL`)
   - `EOG-LOC` / `EOG-ROC` for eye channels
   - `EMG-CHIN` for chin EMG
   - `PULSE` for heart rate or pulse signals (unless raw ECG → `ECG`)
4. **Do not include a final catch-all rule** (e.g., `^(.+)$ → MISC-\1`) unless explicitly requested; most channels in the input list should be known and mapped specifically.
5. Replacements are applied **in order**, with each rule operating on the result of the previous one.

Example reference snippet:
```json
{
  "(?i)[\\s_\\./\\(\\),]+": "-",
  "-+": "-",
  "^-|-$": "",
  "(?i)^abdomen$": "RESP-ABD",
  "(?i)^c3_m2$": "EEG-C3-M2",
  "(?i)^lat$": "EMG-LLEG"
}
```

Now, generate a similar JSON mapping for the following list of channel names:
```
[INSERT YOUR UNIQUE CHANNEL LIST HERE]
```

Provide the output **within a JSON code block only**; no explanations.
</code>
===== Event Descriptions → assets/event_description_mapping.json =====
  
Normalize inconsistent event labels (e.g., "stage_AASM_e1_W" → "sleep_stage_wake"):
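An illustrative mapping (the stage patterns are assumptions modeled on the example label above; the final rule plays the catch-all role described next, with a negative lookahead so already-standardized labels are left untouched):
<code json>
{
  "(?i)^stage_AASM_e\\d+_W$": "sleep_stage_wake",
  "(?i)^stage_AASM_e\\d+_N1$": "sleep_stage_n1",
  "(?i)^(?!sleep_stage_)(.+)$": "[comment] \\1"
}
</code>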
The last pattern acts as a catch-all: unknown events are prefixed with ''[comment]'' to prevent validation errors.
  
==== Validation ====
  
Use ''test_replace_events.py'' to validate your mappings:
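At minimum, the run step mirrors the channel-name validation above:
<code bash>
python test_replace_events.py
</code>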
  
==== AI-Assisted Mapping Generation ====
  
Use this prompt with your LLM to accelerate mapping creation:
----
  
====== 4. Metadata Handling ======
  
RKNS groups metadata by high-level categories (e.g., ''demographics'', ''clinical'', ''questionnaires'') to organize heterogeneous data sources into a consistent internal structure.
  
==== Category Mapping ====
  
The script uses the ''category_mapping'' dictionary (defined at the top of ''main.py'') to automatically map folder paths to one of these standardized categories:
  * ''treatment'' – CPAP, therapy, adherence
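A hypothetical excerpt of that dictionary (the folder paths are assumptions; only the category names come from this page):
<code python>
# Maps metadata folder paths to standardized RKNS categories.
category_mapping = {
    "phenotype/demographics": "demographics",
    "phenotype/clinical": "clinical",
    "phenotype/questionnaires": "questionnaires",
    "treatment/cpap": "treatment",
}
</code>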
  
==== How It Works ====
  
  1. The script extracts the participant ID from the EDF filename (e.g., "sub-0001" from "sub-0001_task-sleep_eeg.edf").
  2. It looks up that participant's row in ''participants.tsv''.
  3. Each metadata source is assigned a category via ''category_mapping''.
  4. The grouped metadata is added to the RKNS object.
----
  
====== 5. CLI: Python & Docker ======
  
==== Testing the Development CLI ====
  
Once you've implemented your conversion logic in ''main.py'', test it end-to-end with your actual data:
Among other steps, the run outputs a ''.rkns'' file with 777 permissions.
  
==== Building and Testing the Docker Image ====
  
Build the Docker image:
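A typical invocation (an assumption; the repository's ''build.sh'' wraps this step and is the recommended entry point):
<code bash>
./build.sh
# or, directly (the tag is illustrative):
docker build -t dataset-to-rkns:latest .
</code>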
**Security Note:** Build args are visible in intermediate layer history. Avoid storing secrets (API keys, credentials) as build args; use runtime environment variables or mount secrets instead.
  
==== Testing the Docker-CLI ====
  
Build with ''./build.sh'' (credentials stay in builder stage) and try running it on the example data with the same arguments as ''run_example.sh''.
----
  
====== 6. Orchestration: Airflow DAG ======
  
The provided ''edf_migration_dag.py'' serves as a template for processing all EDF files in your dataset at scale using Apache Airflow.
  
==== DAG Overview ====
  
The DAG:
  1. Scans ''base_path'' for EDF files.
  2. Derives the matching annotation file for each EDF.
  3. Runs the Docker conversion task on each file.
  4. Collects results and validates completion.
  
==== Required Adaptations ====
  
=== a) Input Dataset Path ===
  
Update ''base_path'' to point to your root BIDS-like directory:
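A minimal sketch (''/mnt/bids'' matches the full example in section g; use your own path):
<code python>
base_path = "/mnt/bids"  # root of the BIDS-like dataset on the Airflow host
</code>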
  
=== b) Annotation File Naming Logic ===
  
The DAG assumes each EDF file has a corresponding annotation file. The default logic replaces ''.edf'' with ''_events.tsv'':
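A sketch of that rule (the DAG's actual helper may be structured differently):
<code python>
annotation_path = edf_path.replace(".edf", "_events.tsv")
</code>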
  
=== c) Metadata File Paths ===
  
The DAG uses global ''participants.tsv'' and ''participants.json'' files. Ensure they exist at the repository root:
  * Validate that files exist before passing to the Docker task.
  
=== d) Docker Image Name ===
  
Update the image parameter to match your built container:
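For example, in the DAG's conversion task (a sketch; ''task_id'' is illustrative and the tag must match the image you built):
<code python>
from airflow.providers.docker.operators.docker import DockerOperator

convert = DockerOperator(
    task_id="edf_to_rkns",           # illustrative task name
    image="dataset-to-rkns:latest",  # must match your built tag
)
</code>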
  
=== e) Output Directory ===
  
The output path specifies where ''.rkns'' files are written:
  * Sufficient disk space is available.
  
=== f) Volume Mounts ===
  
The DAG binds host directories into the container. Update both the host path and container mount point:
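A sketch using docker-py's ''Mount'' type, which Airflow's DockerOperator accepts (host paths and container targets here are assumptions):
<code python>
from docker.types import Mount

mounts = [
    Mount(source="/mnt/bids", target="/data", type="bind"),           # input dataset
    Mount(source="/mnt/rkns_output", target="/output", type="bind"),  # .rkns output
]
</code>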
  
=== g) Example: Full Adaptation ===
  
Here's a complete example for a BIDS dataset stored at ''/mnt/bids'':
----
  
====== 7. Project Structure ======
  
<code>
...
</code>
----
  
====== 8. Workflow Summary ======
  
==== End-to-End Process ====
  
  1. **Prepare Annotations**