diff --git a/docs/source/eager_tutorials/finetuning.rst b/docs/source/eager_tutorials/finetuning.rst
index 7332715e24..0de283124c 100644
--- a/docs/source/eager_tutorials/finetuning.rst
+++ b/docs/source/eager_tutorials/finetuning.rst
@@ -290,20 +290,7 @@ such as regular INT4 or even newer `MXFP4 or NVFP4 `__.
 
-You can also try it out by running the following command,
-or refer to their `QLoRA tutorial `__
-for more details.
-
-.. code::
-
-  tune run lora_finetune_single_device --config llama3_2/3B_qlora_single_device.yaml
-
-Option 2: HuggingFace PEFT Integration
+Option 1: HuggingFace PEFT Integration
 ======================================
 
 `HuggingFace PEFT `__
@@ -332,18 +319,8 @@ Float8 Quantized Fine-tuning
 Similar to `pre-training `__, we can also leverage float8 in fine-tuning
 for higher training throughput with no accuracy degradation and no increase
 in memory usage.
-Float8 training is integrated into TorchTune's distributed
-full fine-tuning recipe, leveraging the same APIs as our
-integration with TorchTitan. Users can invoke this fine-tuning
-recipe as follows:
-
-.. code::
-
-  tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed --config llama3_2/3B_full
-  enable_fp8_training=true \
-  fp8_recipe_name=tensorwise \
-  compile=True
-
+Float8 fine-tuning leverages the same float8 training APIs as our
+integration with `TorchTitan `__.
 Initial experiments saw up to 16.5% throughput improvement for
 fine-tuning Llama3.2-3B in float8: