
Commit 27255a1

gagank1 and claude committed
Add MFU tracking documentation to recipe READMEs
Document the `log_mfu=true` flag and `flops.py` CLI utilities in all 4 native_te recipe READMEs: llama3, esm2, codonfm, opengenome2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
1 parent 1635aab commit 27255a1

File tree: 4 files changed, +74 additions, 0 deletions


bionemo-recipes/recipes/codonfm_native_te/README.md

Lines changed: 17 additions & 0 deletions
@@ -177,6 +177,23 @@ python train_fsdp2.py \
A final model suitable for uploading to the Hugging Face Hub can be exported at the end of training by setting
`checkpoint.save_final_model=true`.

## MFU Tracking

Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:

```bash
torchrun --nproc_per_node=1 train_fsdp2.py --config-name encodon_1b log_mfu=true
```

This logs MFU (%), TFLOPS/GPU, and step time at each optimizer step. The module auto-detects the model architecture from the model config.

The `flops.py` CLI provides standalone utilities:

```bash
python flops.py gpu-info                          # Show GPU and peak TFLOPS
torchrun --nproc_per_node=2 flops.py bandwidth    # Measure P2P GPU bandwidth
```
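MFU relates achieved throughput to hardware peak. A minimal sketch of the calculation, with every number hypothetical (per-step FLOPs and step time come from your own run, and peak TFLOPS depends on GPU and dtype):

```python
# All values below are illustrative assumptions, not outputs of flops.py.
model_flops_per_step = 1.2e15  # total fwd+bwd FLOPs per optimizer step (assumed)
step_time_s = 0.85             # measured wall-clock seconds per step (assumed)
num_gpus = 2
peak_tflops_per_gpu = 989.0    # dense BF16 peak of an H100 SXM (assumption)

# Achieved TFLOPS per GPU: work done, divided by time and GPU count.
achieved_tflops = model_flops_per_step / step_time_s / num_gpus / 1e12
mfu_pct = 100.0 * achieved_tflops / peak_tflops_per_gpu
print(f"{achieved_tflops:.0f} TFLOPS/GPU, MFU {mfu_pct:.1f}%")
```

Comparing the two lines of the per-step log against this kind of hand calculation is a quick way to sanity-check the tracker's output.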
## Developer Guide

### Running Tests

bionemo-recipes/recipes/esm2_native_te/README.md

Lines changed: 19 additions & 0 deletions
@@ -374,6 +374,25 @@ output = model(**inputs)
- [ESM-2 Training with Accelerate](../esm2_accelerate_te/README.md)

## MFU Tracking

Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:

```bash
torchrun --nproc_per_node=2 train_fsdp2.py --config-name L1_3B log_mfu=true
```

This logs MFU (%), TFLOPS/GPU, and step time at each optimizer step. The module auto-detects the model architecture (MHA, standard FFN, etc.) from the model config.

The `flops.py` CLI provides standalone utilities:

```bash
python flops.py gpu-info                                                      # Show GPU and peak TFLOPS
python flops.py flops --config-path ./model_configs/nvidia/esm2_t6_8M_UR50D   # Compute FLOPs for a local config
python flops.py flops --config-path nvidia/esm2_t36_3B_UR50D                  # Compute FLOPs from a Hugging Face Hub config
torchrun --nproc_per_node=2 flops.py bandwidth                                # Measure P2P GPU bandwidth
```
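As a sanity check on reported FLOPs, the classic dense-transformer approximation is about 6 FLOPs per parameter per trained token (2 forward, 4 backward). A sketch with assumed parameter and token counts (this rough estimate ignores the sequence-length-quadratic attention term, so it undercounts at long sequence lengths):

```python
def approx_train_flops(num_params: float, num_tokens: float) -> float:
    """6*N*D estimate for a dense transformer: 2*N*D forward + 4*N*D backward."""
    return 6.0 * num_params * num_tokens

# Assumed: a 3B-parameter model, global batch of 16 sequences of 1024 tokens.
per_step = approx_train_flops(3e9, 16 * 1024)  # about 2.9e14 FLOPs per step
```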
## Developer Guide

### Running Tests

bionemo-recipes/recipes/llama3_native_te/README.md

Lines changed: 19 additions & 0 deletions
@@ -412,6 +412,25 @@ Once converted, the model can be loaded by any library that supports Llama 3, su
vllm serve path/to/hf_converted_model
```

## MFU Tracking

Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:

```bash
torchrun --nproc_per_node=2 train_fsdp2_cp.py --config-name L2_lingua_1b log_mfu=true
```

This logs MFU (%), TFLOPS/GPU, and step time at each optimizer step. The module auto-detects the model architecture (GQA, SwiGLU, etc.) from the model config.

The `flops.py` CLI provides standalone utilities:

```bash
python flops.py gpu-info                                                      # Show GPU and peak TFLOPS
python flops.py flops --config-path ./model_configs/lingua-1B                 # Compute FLOPs for a config
python flops.py cp-comm --config-path ./model_configs/lingua-1B --cp-size 2   # Context-parallel communication estimate
torchrun --nproc_per_node=2 flops.py bandwidth                                # Measure P2P GPU bandwidth
```
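The auto-detected GQA setting matters for the FLOPs count: K/V projections scale with the number of KV heads, not the number of query heads. A sketch of per-layer attention projection sizes under assumed Llama-style dimensions (every number here is hypothetical, not taken from a shipped config):

```python
def attn_proj_params(hidden: int, n_heads: int, n_kv_heads: int) -> int:
    """Parameters in the Q, K, V, and output projections of one attention layer."""
    head_dim = hidden // n_heads
    q = hidden * n_heads * head_dim          # Q projection
    kv = 2 * hidden * n_kv_heads * head_dim  # K and V projections
    o = n_heads * head_dim * hidden          # output projection
    return q + kv + o

# Assumed dims: hidden=2048, 16 query heads; GQA with 4 KV heads vs. full MHA.
gqa = attn_proj_params(2048, 16, 4)
mha = attn_proj_params(2048, 16, 16)
```

Since matmul FLOPs are proportional to these parameter counts, mistakenly counting a GQA model as MHA would overstate its FLOPs and hence its MFU.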
## Developer Guide

### Running tests

bionemo-recipes/recipes/opengenome2_llama_native_te/README.md

Lines changed: 19 additions & 0 deletions
@@ -411,6 +411,25 @@ Validation logging during training can be enabled with `validation.enabled=true`
validation data (e.g. a JSONL file). The `og2_7b_thd_gqa` config enables validation by default.
Control evaluation frequency with `validation.eval_interval` and `validation.num_batches`. This can be helpful when debugging training convergence.

## MFU Tracking

Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:

```bash
torchrun --nproc_per_node=2 train_fsdp2_cp.py log_mfu=true
```

This logs MFU (%), TFLOPS/GPU, and step time at each optimizer step. The module auto-detects the model architecture (GQA, SwiGLU, etc.) from the model config.

The `flops.py` CLI provides standalone utilities:

```bash
python flops.py gpu-info                                                                    # Show GPU and peak TFLOPS
python flops.py flops --config-path ./model_configs/meta-llama/Llama-3.1-8B                 # Compute FLOPs
python flops.py cp-comm --config-path ./model_configs/meta-llama/Llama-3.1-8B --cp-size 2   # Context-parallel communication estimate
torchrun --nproc_per_node=2 flops.py bandwidth                                              # Measure P2P GPU bandwidth
```
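For intuition about the kind of number a context-parallel communication estimate produces: ring-style context parallelism passes each rank's K/V shard around the ring once per layer. A generic back-of-envelope sketch, not the formula `flops.py` uses, with all dimensions assumed:

```python
def ring_kv_bytes_per_layer(seq_len: int, n_kv_heads: int, head_dim: int,
                            cp_size: int, dtype_bytes: int = 2) -> int:
    """Bytes each rank sends per layer when K and V shards traverse a cp_size ring."""
    shard_tokens = seq_len // cp_size
    kv_shard = 2 * shard_tokens * n_kv_heads * head_dim * dtype_bytes  # K and V
    return kv_shard * (cp_size - 1)  # each shard visits the other cp_size-1 ranks

# Assumed: 8192-token sequence, 8 KV heads of dim 128, cp_size=2, BF16 (2 bytes).
mb = ring_kv_bytes_per_layer(8192, 8, 128, 2) / 1e6
```

Dividing such a per-step volume by the bandwidth reported by `flops.py bandwidth` gives a rough lower bound on communication time per step.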
## Developer Guide

### Running tests
