
Commit 27255a1

gagank1 and claude committed
Add MFU tracking documentation to recipe READMEs
Document the `log_mfu=true` flag and `flops.py` CLI utilities in all 4 native_te recipe READMEs: llama3, esm2, codonfm, opengenome2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
1 parent 1635aab commit 27255a1

File tree: 4 files changed, +74 additions, 0 deletions


bionemo-recipes/recipes/codonfm_native_te/README.md

Lines changed: 17 additions & 0 deletions
@@ -177,6 +177,23 @@ python train_fsdp2.py \
A final model suitable for uploading to the Hugging Face Hub can be exported at the end of training by setting
`checkpoint.save_final_model=true`.

## MFU Tracking

Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:

```bash
torchrun --nproc_per_node=1 train_fsdp2.py --config-name encodon_1b log_mfu=true
```

This logs MFU (%), TFLOPS/GPU, and step time at each optimizer step. The module auto-detects the model architecture from the model config.

The `flops.py` CLI provides standalone utilities:

```bash
python flops.py gpu-info                          # Show GPU and peak TFLOPS
torchrun --nproc_per_node=2 flops.py bandwidth    # Measure P2P GPU bandwidth
```
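MFU relates achieved throughput to hardware peak. A minimal sketch of the calculation, with every number hypothetical (per-step FLOPs and step time come from your own run, and peak TFLOPS depends on GPU and dtype):

```python
# All values below are illustrative assumptions, not outputs of flops.py.
model_flops_per_step = 1.2e15  # total fwd+bwd FLOPs per optimizer step (assumed)
step_time_s = 0.85             # measured wall-clock seconds per step (assumed)
num_gpus = 2
peak_tflops_per_gpu = 989.0    # dense BF16 peak of an H100 SXM (assumption)

# Achieved TFLOPS per GPU: work done, divided by time and GPU count.
achieved_tflops = model_flops_per_step / step_time_s / num_gpus / 1e12
mfu_pct = 100.0 * achieved_tflops / peak_tflops_per_gpu
print(f"{achieved_tflops:.0f} TFLOPS/GPU, MFU {mfu_pct:.1f}%")
```

Comparing the two lines of the per-step log against this kind of hand calculation is a quick way to sanity-check the tracker's output.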
## Developer Guide

### Running Tests

bionemo-recipes/recipes/esm2_native_te/README.md

Lines changed: 19 additions & 0 deletions
@@ -374,6 +374,25 @@ output = model(**inputs)
- [ESM-2 Training with Accelerate](../esm2_accelerate_te/README.md)

## MFU Tracking

Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:

```bash
torchrun --nproc_per_node=2 train_fsdp2.py --config-name L1_3B log_mfu=true
```

This logs MFU (%), TFLOPS/GPU, and step time at each optimizer step. The module auto-detects the model architecture (MHA, standard FFN, etc.) from the model config.

The `flops.py` CLI provides standalone utilities:

```bash
python flops.py gpu-info                                                      # Show GPU and peak TFLOPS
python flops.py flops --config-path ./model_configs/nvidia/esm2_t6_8M_UR50D   # Compute FLOPs for a local config
python flops.py flops --config-path nvidia/esm2_t36_3B_UR50D                  # Compute FLOPs from a Hugging Face Hub config
torchrun --nproc_per_node=2 flops.py bandwidth                                # Measure P2P GPU bandwidth
```
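As a sanity check on reported FLOPs, the classic dense-transformer approximation is about 6 FLOPs per parameter per trained token (2 forward, 4 backward). A sketch with assumed parameter and token counts (this rough estimate ignores the sequence-length-quadratic attention term, so it undercounts at long sequence lengths):

```python
def approx_train_flops(num_params: float, num_tokens: float) -> float:
    """6*N*D estimate for a dense transformer: 2*N*D forward + 4*N*D backward."""
    return 6.0 * num_params * num_tokens

# Assumed: a 3B-parameter model, global batch of 16 sequences of 1024 tokens.
per_step = approx_train_flops(3e9, 16 * 1024)  # about 2.9e14 FLOPs per step
```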
## Developer Guide

### Running Tests

bionemo-recipes/recipes/llama3_native_te/README.md

Lines changed: 19 additions & 0 deletions
@@ -412,6 +412,25 @@ Once converted, the model can be loaded by any library that supports Llama 3, su
vllm serve path/to/hf_converted_model
```

## MFU Tracking

Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:

```bash
torchrun --nproc_per_node=2 train_fsdp2_cp.py --config-name L2_lingua_1b log_mfu=true
```

This logs MFU (%), TFLOPS/GPU, and step time at each optimizer step. The module auto-detects the model architecture (GQA, SwiGLU, etc.) from the model config.

The `flops.py` CLI provides standalone utilities:

```bash
python flops.py gpu-info                                                      # Show GPU and peak TFLOPS
python flops.py flops --config-path ./model_configs/lingua-1B                 # Compute FLOPs for a config
python flops.py cp-comm --config-path ./model_configs/lingua-1B --cp-size 2   # Context-parallel communication estimate
torchrun --nproc_per_node=2 flops.py bandwidth                                # Measure P2P GPU bandwidth
```
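The auto-detected GQA setting matters for the FLOPs count: K/V projections scale with the number of KV heads, not the number of query heads. A sketch of per-layer attention projection sizes under assumed Llama-style dimensions (every number here is hypothetical, not taken from a shipped config):

```python
def attn_proj_params(hidden: int, n_heads: int, n_kv_heads: int) -> int:
    """Parameters in the Q, K, V, and output projections of one attention layer."""
    head_dim = hidden // n_heads
    q = hidden * n_heads * head_dim          # Q projection
    kv = 2 * hidden * n_kv_heads * head_dim  # K and V projections
    o = n_heads * head_dim * hidden          # output projection
    return q + kv + o

# Assumed dims: hidden=2048, 16 query heads; GQA with 4 KV heads vs. full MHA.
gqa = attn_proj_params(2048, 16, 4)
mha = attn_proj_params(2048, 16, 16)
```

Since matmul FLOPs are proportional to these parameter counts, mistakenly counting a GQA model as MHA would overstate its FLOPs and hence its MFU.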
## Developer Guide

### Running tests

bionemo-recipes/recipes/opengenome2_llama_native_te/README.md

Lines changed: 19 additions & 0 deletions
@@ -411,6 +411,25 @@ Validation logging during training can be enabled with `validation.enabled=true`
validation data (e.g. a JSONL file). The `og2_7b_thd_gqa` config enables validation by default.
Control evaluation frequency with `validation.eval_interval` and `validation.num_batches`. This can be helpful when debugging training convergence.

## MFU Tracking

Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:

```bash
torchrun --nproc_per_node=2 train_fsdp2_cp.py log_mfu=true
```

This logs MFU (%), TFLOPS/GPU, and step time at each optimizer step. The module auto-detects the model architecture (GQA, SwiGLU, etc.) from the model config.

The `flops.py` CLI provides standalone utilities:

```bash
python flops.py gpu-info                                                                    # Show GPU and peak TFLOPS
python flops.py flops --config-path ./model_configs/meta-llama/Llama-3.1-8B                 # Compute FLOPs
python flops.py cp-comm --config-path ./model_configs/meta-llama/Llama-3.1-8B --cp-size 2   # Context-parallel communication estimate
torchrun --nproc_per_node=2 flops.py bandwidth                                              # Measure P2P GPU bandwidth
```
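For intuition about the kind of number a context-parallel communication estimate produces: ring-style context parallelism passes each rank's K/V shard around the ring once per layer. A generic back-of-envelope sketch, not the formula `flops.py` uses, with all dimensions assumed:

```python
def ring_kv_bytes_per_layer(seq_len: int, n_kv_heads: int, head_dim: int,
                            cp_size: int, dtype_bytes: int = 2) -> int:
    """Bytes each rank sends per layer when K and V shards traverse a cp_size ring."""
    shard_tokens = seq_len // cp_size
    kv_shard = 2 * shard_tokens * n_kv_heads * head_dim * dtype_bytes  # K and V
    return kv_shard * (cp_size - 1)  # each shard visits the other cp_size-1 ranks

# Assumed: 8192-token sequence, 8 KV heads of dim 128, cp_size=2, BF16 (2 bytes).
mb = ring_kv_bytes_per_layer(8192, 8, 128, 2) / 1e6
```

Dividing such a per-step volume by the bandwidth reported by `flops.py bandwidth` gives a rough lower bound on communication time per step.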
## Developer Guide

### Running tests
