====== Turning Datasets into RKNS Format ======
  
This guide describes how to convert a structured dataset (e.g., BIDS) into the **RKNS format** using the [[https://code.rekonas.com/rekonas/airflow-template-dataset-to-rkns|airflow-template-dataset-to-rkns]] template repository. The process is divided into three main phases:
  
  - **Transform Logic** – Implement the conversion from raw data to RKNS (via ''main.py'').
  - **Containerization** – Package the logic into a portable Docker image (''build.sh'' + ''Dockerfile'').
  - **Orchestration** – Run the transformation at scale using Apache Airflow.

Below, we detail each phase with practical guidance.

----
  
====== 1. Transform Logic (main.py) ======
  
The core conversion logic lives in ''main.py''. It reads:

  * the raw EDF recording,
  * an RKNS-compatible annotations TSV (''onset'', ''duration'', ''event''),
  * subject-level metadata from ''participants.tsv'' and ''participants.json'',

and outputs a validated ''*.rkns'' file (a Zarr zip store).
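
Because the output is a plain Zarr zip store, a quick sanity check is possible with standard tooling. A minimal sketch (the output filename below is hypothetical):

<code python>
import zipfile

# Any *.rkns produced by the template is a Zarr zip store, so ordinary zip
# tooling can confirm the archive is readable. The path is hypothetical.
with zipfile.ZipFile("out/sub-0001.rkns") as zf:
    assert zf.testzip() is None, "corrupt member detected"
    print(f"{len(zf.namelist())} members in the Zarr zip store")
</code>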
  
==== CLI Interface ====
  
It is recommended to keep the CLI interface unchanged if possible, since it is designed for easier Airflow usage:

The ''edf_to_rkns()'' function executes the following steps:
  
==== 1.1 Load and Standardize Signals ====
<code python>
rkns_obj = rkns.from_external_format(
</code>
Uses regex mappings in ''assets/replace_channels.json'' to rename EDF channels to standardized names (e.g., "EEG(sec)" → "EEG-C3-A2").
  
==== 1.2 Add Annotations ====
Converts the TSV file to a BIDS-compatible format and adds events:
<code python>
</code>
  
==== 1.3 Extract and Categorize Metadata ====
  - Matches the participant ID (e.g., "sub-0001") from the EDF filename.
  - Looks up the participant in ''participants.tsv''.
  - Assigns each metadata column to a category via the ''folder'' keys in ''participants.json'' and ''category_mapping''.
  - Groups metadata by category and adds each to the RKNS object (see the sketch below).
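
A minimal sketch of these steps using plain ''pandas''/''re''; the function name, the ''participant_id'' column, and the ''misc'' fallback are assumptions, not the actual ''main.py'' implementation:

<code python>
import json
import re

import pandas as pd

def extract_participant_metadata(edf_name, participants_tsv, participants_json, category_mapping):
    """Illustrative only; the real logic lives in main.py and may differ."""
    # 1. Match the participant ID (e.g., "sub-0001") in the EDF filename.
    sub_id = re.search(r"sub-\d+", edf_name).group(0)

    # 2. Look up that participant's row in participants.tsv
    #    (column name follows the BIDS convention).
    participants = pd.read_csv(participants_tsv, sep="\t")
    row = participants.loc[participants["participant_id"] == sub_id].iloc[0]

    # 3. Group columns by category via the "folder" key in participants.json
    #    and the category_mapping dictionary ("misc" fallback is hypothetical).
    with open(participants_json) as f:
        codebook = json.load(f)
    grouped = {}
    for column, value in row.items():
        folder = codebook.get(column, {}).get("folder")
        category = category_mapping.get(folder, "misc")
        grouped.setdefault(category, {})[column] = value
    return grouped
</code>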
  
==== 1.4 Finalize and Export ====
<code python>
rkns_obj.populate()                     # Build internal structure
</code>
  
==== 1.5 Validate ====
<code python>
validate_rkns_checksum(output_file)     # Verify checksums on all data
</code>

----
  
====== 2. Preprocessing: Generate RKNS-Compatible Event Annotations ======
  
RKNS requires events in a strict tab-separated (TSV) format with three columns: ''onset'', ''duration'', and ''event''.
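
How you produce that TSV depends on your source annotations. As a rough illustration, assuming the raw events can already be loaded into a ''pandas'' DataFrame (the file and column names here are hypothetical):

<code python>
import pandas as pd

# Hypothetical raw export with differently named columns.
raw = pd.read_csv("raw_annotations.csv")

events = pd.DataFrame({
    "onset": raw["start_seconds"],        # event start, in seconds
    "duration": raw["length_seconds"],    # event length, in seconds
    "event": raw["label"],                # free-text event label
})

# RKNS expects exactly these three columns, tab-separated.
events.to_csv("sub-0001_events.tsv", sep="\t", index=False)
</code>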

----
  
====== 3. Standardizing Names via Regex Mappings ======
  
Once you have a compliant TSV, normalize channel and event names non-destructively using regex mappings defined in two JSON files.
  
===== Channel Names → assets/replace_channels.json =====
  
EDF channel labels vary wildly (e.g., "EEG C3-A2", "EEG(sec)"). Map them to standardized RKNS names:
Keys are regex patterns (applied sequentially), and values are replacement strings. Groups in the pattern can be referenced in the replacement (e.g., ''\1'').
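
A minimal sketch of how such a mapping can be applied, assuming the JSON is read in file order (Python dictionaries preserve insertion order); the authoritative implementation lives in the template itself:

<code python>
import json
import re

with open("assets/replace_channels.json") as f:
    mapping = json.load(f)          # insertion order == order in the file

def standardize(name: str) -> str:
    # Each pattern is applied to the result of the previous replacement.
    for pattern, replacement in mapping.items():
        name = re.sub(pattern, replacement, name)
    return name

print(standardize("EEG(sec)"))      # e.g. -> "EEG-C3-A2", given a matching rule
</code>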
  
==== Validation ====
  
Use ''test_replace_channels.py'' to validate your mappings:
  
1. Extract unique channel names from your EDF files:
<code bash>
find /data -name "*.edf" -exec edf-peek {} \; | grep "signal_labels" | sort | uniq > assets/extracted_channels.txt
</code>
2. Run the test:
<code bash>
python test_replace_channels.py
</code>
3. Inspect the output:
<code bash>
cat out/renamed_channels.csv
</code>
  
==== AI-Assisted Mapping Generation ====
  
If you have a list of unique channel names, use this prompt with your LLM to accelerate mapping creation:
  
<code>
I will provide you a list of physiological signal channels (e.g., EEG, EOG, EMG, respiratory, cardiac) extracted from original EDF files.
I require an output in JSON format that maps each raw channel name to a standardized name using **sequential, grep-style regex replacements**.

Key requirements:
1. **Include early normalization rules** to handle common delimiters (e.g., `_`, spaces, `.`, `/`, parentheses) by converting them to hyphens (`-`), collapsing multiple hyphens, and trimming leading/trailing hyphens.
2. All patterns must be **case-insensitive** (use `(?i)`).
3. Use **physiologically meaningful, NSRR/AASM-aligned names**, such as:
   - `EEG-C3-M2` (not `EEG-C3_A2` or ambiguous forms)
   - `EMG-LLEG` / `EMG-RLEG` for leg EMG (not `LAT`/`RAT` as position)
   - `RESP-AIRFLOW-THERM` or `RESP-AIRFLOW-PRES` (not generic `RESP-NASAL`)
   - `EOG-LOC` / `EOG-ROC` for eye channels
   - `EMG-CHIN` for chin EMG
   - `PULSE` for heart rate or pulse signals (unless raw ECG → `ECG`)
4. **Do not include a final catch-all rule** (e.g., `^(.+)$ → MISC-\1`) unless explicitly requested—most channels in the input list should be known and mapped specifically.
5. Replacements are applied **in order**, with each rule operating on the result of the previous one.

Example reference snippet:
```json
{
  "(?i)[\\s_\\./\\(\\),]+": "-",
  "-+": "-",
  "^-|-$": "",
  "(?i)^abdomen$": "RESP-ABD",
  "(?i)^c3_m2$": "EEG-C3-M2",
  "(?i)^lat$": "EMG-LLEG"
}
```

Now, generate a similar JSON mapping for the following list of channel names:
```
[INSERT YOUR UNIQUE CHANNEL LIST HERE]
```

Provide the output **within a JSON code block only**—no explanations.
</code>

===== Event Descriptions → assets/event_description_mapping.json =====
  
Normalize inconsistent event labels (e.g., "stage_AASM_e1_W" → "sleep_stage_wake"):

The last pattern acts as a catch-all: unknown events are prefixed with ''[comment]'' to prevent validation errors.
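
For example, with the catch-all pattern from the reference mapping shown in the AI prompt below, recognized labels pass through untouched while unknown ones become comments. A quick sketch:

<code python>
import re

# Catch-all pattern from the reference event mapping further below.
catch_all = r'^\s*(?!sleep_stage_|arousal|apnea|hypopnea|desaturation|artifact|body_position)"?([^"]+)"?\s*$'

print(re.sub(catch_all, r"[comment] \1", "Lights off"))        # -> "[comment] Lights off"
print(re.sub(catch_all, r"[comment] \1", "sleep_stage_wake"))  # unchanged: no match
</code>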
  
==== Validation ====
  
Use ''test_replace_events.py'' to validate your mappings:
  
1. Extract unique event names from your annotations:
<code bash>
tail -n +2 /data/**/*_events.tsv | cut -f3 | sort | uniq > assets/extracted_annotation_events.txt
</code>
2. Run the test:
<code bash>
python test_replace_events.py
</code>
3. Inspect the output:
<code bash>
cat out/renamed_events.csv
</code>
  
==== AI-Assisted Mapping Generation ====
  
Use this prompt with your LLM to accelerate mapping creation:
  
<code>
I will provide you a list of sleep events extracted from original annotation files.
I require an output in JSON format that maps event names to standardized names using grep-style regex replacements.

For example, this is a reference mapping. Note that replacements are applied in sequential order:
```json
{
  "(?i)^stage_AASM_e1_W$": "sleep_stage_wake",
  "(?i)^stage_AASM_e1_R$": "sleep_stage_rem",
  "(?i)^arousals_e1_\\('arousal_standard',\\s*'EEG_C3'\\)$": "arousal_eeg_c3",
  "^\\s*(?!sleep_stage_|arousal|apnea|hypopnea|desaturation|artifact|body_position)\"?([^\"]+)\"?\\s*$": "[comment] \\1"
}
```

The keys are regex patterns (case-insensitive), and the values are replacement strings. Groups in the pattern can be referenced in the replacement (e.g., ''\1'').

Based on the above example, generate a similar JSON mapping for the following list of event names:
```
[INSERT YOUR UNIQUE EVENT LIST HERE]
```

Provide the output within a JSON code block.
</code>

----
  
====== 4. Metadata Handling ======
  
RKNS groups metadata by high-level categories (e.g., ''demographics'', ''clinical'', ''questionnaires'') to organize heterogeneous data sources into a consistent internal structure.
  
==== Category Mapping ====
  
The script uses the ''category_mapping'' dictionary (defined at the top of ''main.py'') to automatically map folder paths to one of these standardized categories (a sketch of such a dictionary follows the list):

  * ''treatment'' – CPAP, therapy, adherence
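
A minimal illustrative sketch of such a dictionary; the folder paths below are hypothetical and the real ''category_mapping'' in ''main.py'' will differ:

<code python>
# Illustrative only: maps "folder" values from participants.json to RKNS categories.
category_mapping = {
    "phenotype/demographics": "demographics",
    "phenotype/medical_history": "clinical",
    "phenotype/questionnaires": "questionnaires",
    "phenotype/cpap": "treatment",
}
</code>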
  
==== How It Works ====

  1. The script extracts the participant ID from the EDF filename (e.g., "sub-0001" from "sub-0001_task-sleep_eeg.edf").

----
  
====== 5. CLI: Python & Docker ======
  
==== Testing the Development CLI ====
  
Once you've implemented your conversion logic in ''main.py'', test it end-to-end with your actual data:

  5. Output a ''.rkns'' file with 777 permissions.
  
==== Building and Testing the Docker Image ====
  
Build the Docker image:

**Security Note:** Build args are visible in intermediate layer history. Avoid storing secrets (API keys, credentials) as build args; use runtime environment variables or mount secrets instead.
  
==== Testing the Docker-CLI ====
  
Build with ''./build.sh'' (credentials stay in the builder stage) and try running it on the example data with the same arguments as ''run_example.sh''.

----
  
====== 6. Orchestration: Airflow DAG ======
  
The provided ''edf_migration_dag.py'' serves as a template for processing all EDF files in your dataset at scale using Apache Airflow.
  
==== DAG Overview ====
  
The DAG:

  4. Collects results and validates completion (a condensed sketch of such a DAG follows).
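
A condensed, hypothetical sketch of this pattern, assuming Airflow 2.x with the Docker provider installed; the dataset root, image tag, and the ''--input-annotations'' flag name are placeholders, and the real ''edf_migration_dag.py'' is more complete:

<code python>
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

base_path = Path("/mnt/bids")                 # hypothetical dataset root on the host

with DAG(
    dag_id="edf_migration_sketch",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Discover EDF files at parse time and create one containerized task per file.
    for idx, edf in enumerate(sorted(base_path.rglob("*.edf"))):
        # Default annotation naming logic: replace ".edf" with "_events.tsv".
        events_tsv = edf.with_name(edf.name.replace(".edf", "_events.tsv"))
        DockerOperator(
            task_id=f"convert_edf_{idx}",
            image="dataset-to-rkns:latest",   # hypothetical image name
            command=[
                "--input-edf", f"/data/{edf.relative_to(base_path)}",
                "--input-annotations", f"/data/{events_tsv.relative_to(base_path)}",  # flag name hypothetical
                "--create-dirs",
                # ... plus the metadata and output arguments expected by the template's CLI.
            ],
            mounts=[Mount(source=str(base_path), target="/data", type="bind")],
        )
</code>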
  
==== Required Adaptations ====
  
=== a) Input Dataset Path ===
  
Update ''base_path'' to point to your root BIDS-like directory:
  
=== b) Annotation File Naming Logic ===
  
The DAG assumes each EDF file has a corresponding annotation file. The default logic replaces ''.edf'' with ''_events.tsv'':
  
=== c) Metadata File Paths ===
  
The DAG uses global ''participants.tsv'' and ''participants.json'' files. Ensure they exist at the repository root:

  * Validate that files exist before passing to the Docker task.
  
=== d) Docker Image Name ===
  
Update the image parameter to match your built container:
  
=== e) Output Directory ===
  
The output path specifies where ''.rkns'' files are written:

  * Sufficient disk space is available.
  
=== f) Volume Mounts ===
  
The DAG binds host directories into the container. Update both the host path and the container mount point:
  
=== g) Example: Full Adaptation ===
  
Here's a complete example for a BIDS dataset stored at ''/mnt/bids'':

----
  
====== 7. Project Structure ======
  
<code>
</code>

----
  
====== 8. Workflow Summary ======
  
==== End-to-End Process ====
  
  1. **Prepare Annotations**

----
  
====== 9. Troubleshooting ======
  
==== EDF File Not Found ====
  - Verify the path in ''--input-edf'' is correct and exists.
  - If using Docker, ensure the path is relative to the container mount point (not the host).
  
==== Annotations File Not Found ====
  - Check that the annotation file naming logic matches your dataset.
  - Ensure the TSV is in RKNS-compatible format (three columns: onset, duration, event).
  
==== Participant Not Found in participants.tsv ====
  - The participant ID extraction regex may not match your filename pattern.
  - Update ''extract_sub_part()'' in ''main.py'' to match your naming convention (a rough sketch follows):
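
A rough sketch of such a helper, illustrative only; check the actual signature and pattern in ''main.py'':

<code python>
import re

def extract_sub_part(filename: str) -> str | None:
    """Return the participant ID embedded in an EDF filename, or None."""
    # Adjust this pattern to your own naming scheme, e.g. "patient_0001_night1.edf".
    match = re.search(r"sub-\d+", filename)
    return match.group(0) if match else None

# extract_sub_part("sub-0001_task-sleep_eeg.edf") -> "sub-0001"
</code>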
  
==== Metadata Columns Not Recognized ====
  - Verify that ''participants.json'' contains a ''"folder"'' key for each column.
  - Check that the ''folder'' values are in ''category_mapping''.
  - Add missing mappings to ''category_mapping'' as needed.
  
==== Channel or Event Names Not Mapped ====
  - Extract raw names and add them to the regex mapping JSON.
  - Test with ''test_replace_channels.py'' or ''test_replace_events.py''.
  - Use the LLM prompts to generate regex patterns if needed.
  
==== Permission Denied on Output File ====
  - Ensure the output directory is writable by the process (or Airflow worker).
  - Use ''--create-dirs'' to auto-create the directory with correct permissions.
  
==== Docker Build Fails ====
  - Check that all dependencies in ''pyproject.toml'' are compatible.
  - Verify the ''Dockerfile'' references the correct base image and Python version.

----
  
====== 10. Key Files Reference ======
  
==== main.py ====
  - **''edf_to_rkns()''** – Main conversion function. Orchestrates all steps.
  - **''validate_rkns_checksum()''** – Validates data integrity by reading all signal blocks.
  - **''parse_args()''** – CLI argument parser. Do not modify.
  
==== assets/replace_channels.json ====
  - Regex patterns (keys) → standardized channel names (values).
  - Applied sequentially in order.
  - Examples: "EEG(sec)" → "EEG-C3-A2", "SaO2" → "SPO2".
  
==== assets/event_description_mapping.json ====
  - Regex patterns (keys) → standardized event labels (values).
  - Applied sequentially in order.
  - Catch-all pattern: unknown events prefixed with ''[comment]''.
  
==== participants.json ====
  - Column codebook with ''Description'' and **required** ''folder'' keys.
  - Folder values are mapped to categories by ''category_mapping'' in ''main.py''.
  
==== participants.tsv ====
  - Subject-level metadata table (BIDS-compatible).
  - Rows = subjects, columns = variables (must match ''participants.json'').

----
  
====== 11. Contributing & Customization ======
  
This template is intentionally configurable. The main customization points are:
  
  - **''extract_sub_part()'' in main.py** – Update the regex if your participant ID format differs.
  - **''category_mapping'' in main.py** – Add or modify mappings if your dataset uses different folder paths.
  - **''assets/replace_channels.json''** – Adapt channel mappings to your EDF sources.
  - **''assets/event_description_mapping.json''** – Adapt event mappings to your annotation sources.
  - **''edf_migration_dag.py''** – Adjust discovery logic, paths, and Docker config for your environment.
  
For questions or issues, refer to the troubleshooting section or the inline code comments in ''main.py''.