29 changes: 3 additions & 26 deletions docs/source/eager_tutorials/finetuning.rst
@@ -290,20 +290,7 @@ such as regular INT4 or even newer `MXFP4 or NVFP4 <https://github.com/pytorch/a
targeting Blackwell GPUs to reap similar memory benefits with
varying tradeoffs.

Option 1: TorchTune Integration
===============================

TorchTune incorporates the `NF4Tensor` in its QLoRA fine-tuning
recipe through their implementation of `LoRALinear <https://github.com/pytorch/torchtune/blob/a6290a5b40758f13bca61c386bc8756a49ef417e/torchtune/modules/peft/lora.py#L19>`__.
You can also try it out by running the following command,
or refer to their `QLoRA tutorial <https://docs.pytorch.org/torchtune/stable/tutorials/qlora_finetune.html>`__
for more details.

.. code::

   tune run lora_finetune_single_device --config llama3_2/3B_qlora_single_device.yaml

Option 2: HuggingFace PEFT Integration
Option 1: HuggingFace PEFT Integration
======================================

`HuggingFace PEFT <https://huggingface.co/docs/peft/main/en/developer_guides/quantization#torchao-pytorch-architecture-optimization>`__
@@ -332,18 +319,8 @@ Float8 Quantized Fine-tuning
Similar to `pre-training <pretraining.html>`__, we can also
leverage float8 in fine-tuning for higher training throughput
with no accuracy degradation and no increase in memory usage.
Float8 training is integrated into TorchTune's distributed
full fine-tuning recipe, leveraging the same APIs as our
integration with TorchTitan. Users can invoke this fine-tuning
recipe as follows:

.. code::

   tune run --nnodes 1 --nproc_per_node 4 full_finetune_distributed --config llama3_2/3B_full \
     enable_fp8_training=true \
     fp8_recipe_name=tensorwise \
     compile=True

Float8 fine-tuning leverages the same float8 training APIs as our
integration with `TorchTitan <https://github.com/pytorch/torchtitan>`__.

Contributor

Can we link to the float8 README there specifically?

Initial experiments saw up to 16.5% throughput improvement
for fine-tuning Llama3.2-3B in float8:
