Closed
38 commits
027f439
Move dataset validation to load_trace_rows
SkiHatDuckie Jun 15, 2026
6ec92f5
Rename `TraceColumn` in test file to `TraceColumnGenerator`
SkiHatDuckie Jun 15, 2026
76f43df
Add relative_timestamp column to deserialized dataset
SkiHatDuckie Jun 15, 2026
d79be07
Switch to streaming datasets for synthetic trace
SkiHatDuckie Jun 16, 2026
00ccea7
Update docs
SkiHatDuckie Jun 16, 2026
7de32d6
Add trace_common.py + classes
SkiHatDuckie Jun 16, 2026
6f6e464
Repair broken test files
SkiHatDuckie Jun 17, 2026
9cd8729
Instantiate/Validate/Dispatch formats through TraceFormatArgs
SkiHatDuckie Jun 17, 2026
26b740f
Rework format handling; flatten data args for CLI
SkiHatDuckie Jun 18, 2026
7911a87
Repair tests
SkiHatDuckie Jun 18, 2026
9aea389
Remove TraceDataset from __all__
SkiHatDuckie Jun 18, 2026
27687b2
Move common funcs to trace_common
SkiHatDuckie Jun 18, 2026
73b2eda
Add test_trace_common.py and rearrange tests
SkiHatDuckie Jun 22, 2026
2cdb3b8
Refactor test_trace_synthetic
SkiHatDuckie Jun 22, 2026
04626fe
Rename trace_synthetic to trace_minimal
SkiHatDuckie Jun 22, 2026
8f1ab50
Improve text coverage
SkiHatDuckie Jun 22, 2026
dff58aa
Update inline docs
SkiHatDuckie Jun 22, 2026
342bd0c
Update docs
SkiHatDuckie Jun 22, 2026
37f3b10
Cleanup linting & docs
SkiHatDuckie Jun 22, 2026
7bc2f96
Spread `kind`ness
SkiHatDuckie Jun 24, 2026
e0fe688
Update docs
SkiHatDuckie Jun 24, 2026
955a8f5
Fix: Register formats with deserializer
SkiHatDuckie Jun 24, 2026
8908132
Satisfy linting
SkiHatDuckie Jun 24, 2026
e344e80
Move `timestamps` outside the loop
SkiHatDuckie Jun 25, 2026
1e842a4
Register formats w/ deserializer outside trace_common
SkiHatDuckie Jun 29, 2026
f68217f
Update TraceDataArgs
SkiHatDuckie Jun 29, 2026
73617d1
Move trace_io contents to trace_common
SkiHatDuckie Jun 29, 2026
d6f0e67
Fix typo
SkiHatDuckie Jun 30, 2026
9dd48e4
Remove TraceColumn
SkiHatDuckie Jun 30, 2026
f32a01e
Specify bad path reason
SkiHatDuckie Jun 30, 2026
5c51338
Add comment to create_prompt
SkiHatDuckie Jun 30, 2026
9edb490
Re-register trace_minimal as trace_synthetic
SkiHatDuckie Jun 30, 2026
9a2c7c1
Support more filetypes + update docs
SkiHatDuckie Jun 30, 2026
8346b53
Rename trace_file_formats.md to trace_replay.md
SkiHatDuckie Jun 30, 2026
2741e11
Make margin_of_safety an optional parameter
SkiHatDuckie Jun 30, 2026
5b955d1
Update exception msgs
SkiHatDuckie Jun 30, 2026
00efe7b
Update exception msgs x2
SkiHatDuckie Jun 30, 2026
89ba55d
Merge of #829
mergify[bot] Jun 30, 2026
10 changes: 5 additions & 5 deletions docs/getting-started/benchmark.md
Expand Up @@ -195,7 +195,7 @@ guidellm run --profile kind=sweep,sweep_size=10,rampup_duration=10,strategy_type

#### Replay Profile

Replays trace events using timestamps from a `trace_synthetic` dataset. See [Trace Replay Benchmarking](#trace-replay-benchmarking-beta) below for data setup.
Replays trace events using timestamps from a trace file dataset. See [Trace Replay Benchmarking](#trace-replay-benchmarking) below for data setup.

```bash
guidellm run --profile kind=replay,time_scale=1.0
Expand Down Expand Up @@ -225,9 +225,9 @@ guidellm run \

You can customize synthetic data generation with additional parameters such as standard deviation, minimum, and maximum values. See the [Datasets Synthetic data documentation](../guides/datasets.md#synthetic-data) for more details.

### Trace Replay Benchmarking (beta)
### Trace Replay Benchmarking

For realistic load testing, replay trace events using each row's timestamp and token lengths. Trace files must be JSONL and are loaded with the `trace_synthetic` data type. By default, each row uses `timestamp`, `input_length`, and `output_length` fields. Timestamps may be absolute or monotonic values; GuideLLM sorts them and converts them to offsets from the first event before scheduling:
For realistic load testing, replay trace events using each row's timestamp and token lengths. Trace files must be JSONL, JSON, CSV, or Parquet and are loaded with a supported [trace file format](../guides/trace_replay.md#supported-formats). Timestamps may be absolute or monotonic values; GuideLLM sorts them and converts them to offsets from the first event before scheduling:

```json
{"timestamp": 1234500.0, "input_length": 256, "output_length": 128}
Expand All @@ -249,7 +249,7 @@ The replay profile parameter `time_scale` acts as a scaling factor for the inter

GuideLLM orders trace rows by timestamp before scheduling and payload generation, so each scheduled event uses the token lengths from the same sorted row. Use `--data-loader kind=pytorch,samples=1000` to limit how many trace rows are loaded and replayed. `--constraint kind=max_requests,count=1000` remains a runtime completion constraint; it does not truncate the trace dataset.
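
The ordering and scaling described above can be sketched roughly as follows. This is an illustration only, not GuideLLM's implementation: the helper name `relative_offsets` is hypothetical, the second trace row uses made-up token lengths, and the sketch assumes `time_scale` simply multiplies each offset from the first event.

```python
import json


def relative_offsets(rows, time_scale=1.0, timestamp_column="timestamp"):
    """Sort trace rows by timestamp and return (offset_seconds, row) pairs.

    Hypothetical helper for illustration only. It assumes time_scale
    multiplies the interval between each event and the first event.
    """
    ordered = sorted(rows, key=lambda row: row[timestamp_column])
    if not ordered:
        return []
    start = ordered[0][timestamp_column]
    return [((row[timestamp_column] - start) * time_scale, row) for row in ordered]


# Rows are deliberately out of order to show that sorting keeps each
# timestamp paired with the token lengths from the same row.
trace_lines = [
    '{"timestamp": 1234500.5, "input_length": 512, "output_length": 64}',
    '{"timestamp": 1234500.0, "input_length": 256, "output_length": 128}',
]
rows = [json.loads(line) for line in trace_lines]

for offset, row in relative_offsets(rows, time_scale=2.0):
    print(offset, row["input_length"], row["output_length"])
```

Under these assumptions, a `time_scale` of 2.0 stretches the 0.5 second gap between the two events to 1.0 second, while each event keeps its own token lengths.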

If your trace uses different column names, include `timestamp_column`, `prompt_tokens_column`, and `output_tokens_column` in the data config:
Every format by default looks for the columns "timestamp", "input_length", and "output_length". If your trace uses different column names, include `timestamp_column`, `prompt_tokens_column`, and `output_tokens_column` in the data config:

```bash
guidellm run \
Expand All @@ -258,7 +258,7 @@ guidellm run \
--profile kind=replay,time_scale=1.0
```

For very small prompts (roughly under 15 tokens, depending on the tokenizer), GuideLLM may not have enough room to include the full per-row unique prefix. Different rows can then produce similar or identical prompts, which reduces cache resistance in replay benchmarks.
This functionality extends to columns required by specific formats. These additional columns and other format-specific arguments are described in the [Trace File Formats documentation](../guides/trace_replay.md).

### Working with Real Data

Expand Down
14 changes: 6 additions & 8 deletions docs/guides/datasets.md
Expand Up @@ -20,11 +20,11 @@ The following arguments configure datasets and their processing:
- `synthetic_text` — generates synthetic prompts on the fly. Required field: `prompt_tokens`. Optional: `output_tokens`, `turns`, `prefix_tokens`, `prefix_count`, `prefix_buckets`, and distribution controls (`prompt_tokens_stdev`, `output_tokens_stdev`, etc.).
- `huggingface` (alias `hf`) — loads from HuggingFace Hub or a local directory/file. Required field: `source` (dataset ID or path). Pass dataset loading arguments (for example `split`, `name`) via `load_kwargs`.
- `json_file`, `csv_file`, `text_file`, `parquet_file`, `arrow_file`, `hdf5_file`, `db_file`, `tar_file` — loads from a local file. Required field: `path`.
- `trace_synthetic` — loads a JSONL trace file for replay benchmarking. Required field: `path`. Optional: `timestamp_column` (default: `timestamp`), `prompt_tokens_column` (default: `input_length`), `output_tokens_column` (default: `output_length`).
- `trace_synthetic`, `mooncake` — loads a JSONL, JSON, CSV, or Parquet trace file for replay benchmarking. Required field: `path`. Optional: `timestamp_column` (default: `timestamp`), `prompt_tokens_column` (default: `input_length`), `output_tokens_column` (default: `output_length`).

In addition, you can specify additional arguments to the dataset loading with the data argument `loader_kwargs`:
In addition, you can specify additional arguments to the dataset loading with the data argument `load_kwargs`:

- loader_kwargs: Additional arguments to the dataset loading. For example, dataset splits can be specified with `--data '{"kind":"huggingface","source":"my/dataset","loader_kwargs":{"split":"test"}}'`.
- load_kwargs: Additional arguments to the dataset loading. For example, dataset splits can be specified with `--data '{"kind":"huggingface","source":"my/dataset","load_kwargs":{"split":"test"}}'`.

### Data Loader

Expand Down Expand Up @@ -188,7 +188,7 @@ GuideLLM supports various file formats for datasets, including text, CSV, JSON,
{"prompt": "What is your name?", "output_tokens_count": 3, "additional_column": "baz", "additional_column2": "qux"}
```

- **Trace files (`.jsonl` with `trace_synthetic` type)**: Specialized JSONL files for replay benchmarking with `timestamp`, `input_length`, and `output_length` fields. Used with `--profile kind=replay` to replay trace events using each row's timestamp and token lengths. Timestamps must be numbers expressed in seconds on a shared timeline with any consistent zero point; GuideLLM sorts them and converts them to offsets from the first event before scheduling. Date strings are not parsed yet, so provide timestamps as numbers. See [Trace Replay Benchmarking](../getting-started/benchmark.md#trace-replay-benchmarking-beta).
- **Trace files (`.jsonl`, `.json`, `.csv` or `.parquet` with a supported trace file format)**: Specialized files for replay. Used with `--profile kind=replay` to replay trace events using each row's timestamp and token lengths. Timestamps must be numbers expressed in seconds on a shared timeline with any consistent zero point; GuideLLM sorts them and converts them to offsets from the first event before scheduling. Date strings are not parsed yet, so provide timestamps as numbers. See [Trace Replay Benchmarking](../getting-started/benchmark.md#trace-replay-benchmarking).

```json
{"timestamp": 1234500.0, "input_length": 256, "output_length": 128}
Expand All @@ -197,7 +197,7 @@ GuideLLM supports various file formats for datasets, including text, CSV, JSON,

In this example, the second request is scheduled 0.5 seconds after the first request. Trace rows are ordered by timestamp before GuideLLM schedules requests and generates synthetic payloads. This keeps each scheduled event aligned with the prompt and output token lengths from the same row.

Use `trace_synthetic` to enable trace loading:
Use a supported [trace file format](./trace_replay.md#supported-formats) to enable trace loading:

```bash
guidellm run \
Expand All @@ -206,7 +206,7 @@ GuideLLM supports various file formats for datasets, including text, CSV, JSON,
--data kind=trace_synthetic,path=path/to/trace.jsonl
```

If your trace uses different column names, include `timestamp_column`, `prompt_tokens_column`, and `output_tokens_column` in the data config:
All trace formats by default look for the columns "timestamp", "input_length", and "output_length". If your trace uses different column names, include `timestamp_column`, `prompt_tokens_column`, and `output_tokens_column` in the data config:

```bash
guidellm run \
Expand All @@ -217,8 +217,6 @@ GuideLLM supports various file formats for datasets, including text, CSV, JSON,

For replay, `time_scale` on the profile is a time scale for the intervals between trace events rather than requests per second. Use `--data-loader kind=pytorch,samples=1000` to limit how many trace rows are loaded and replayed. Use `--constraint kind=max_requests,count=<n>` only as a runtime completion constraint; it does not limit the trace rows loaded from the file.

Very small `input_length` values (roughly under 15 tokens, depending on the tokenizer) may not leave enough room for the full per-row unique prefix in the synthetic prompt. This can make prompts more similar across rows and weaken cache resistance. See [Trace Replay Benchmarking](../getting-started/benchmark.md#trace-replay-benchmarking) for details.

- **JSON files (`.json`)**: Where the entire dataset is represented as a JSON array of objects nested under a specific key. To surface the correct key to use, a `--data-column-mapper` argument must be passed in of `"field": "NAME"` for where the array exists. The objects should include `prompt` or other common names for the prompt which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the `--data-column-mapper` argument.

```json
Expand Down
44 changes: 44 additions & 0 deletions docs/guides/trace_replay.md
@@ -0,0 +1,44 @@
# Trace File Formats

Many trace files are formatted in ways that need to be specially handled to create an accurate replay. This guide covers all trace file formats currently supported by GuideLLM, along with the format-agnostic and format-specific data arguments.

Detailed use of the replay profile and file-based datasets as a whole is explained in [Trace Replay Benchmarking](../getting-started/benchmark.md#trace-replay-benchmarking).

## Supported Formats

These are passed to the `--data` argument as `kind=format`:

- `trace_synthetic`: A trace format that does the bare minimum needed to complete a fully functioning trace replay benchmark with synthetic prompt generation
- `mooncake`: The trace format used by the serving platform Mooncake, as defined in [https://doi.org/10.48550/arXiv.2407.00079](https://doi.org/10.48550/arXiv.2407.00079)

## Format-Agnostic Data Arguments

All trace formats can accept the following optional data arguments:

| Argument | Default | Description |
| ---------------------- | --------------- | ----------------------------------------------------- |
| `timestamp_column` | "timestamp" | Column name for timestamps in the trace file |
| `prompt_tokens_column` | "input_length" | Column name for prompt token counts in the trace file |
| `output_tokens_column` | "output_length" | Column name for output token counts in the trace file |

These are passed through the `--data` argument like below:

```bash
guidellm benchmark \
--target http://localhost:8000 \
--profile kind=replay \
--data "kind=trace_synthetic,path=replay.jsonl,timestamp_column=ts,prompt_tokens_column=input_tokens,output_tokens_column=generated_tokens"
```

`trace_synthetic` can be thought of as the format-agnostic option, only looking for the timestamp, prompt token count and output token count columns and ignoring all other features contained in a dataset. While primarily used for testing, `trace_synthetic` may be used as a fallback for trace formats not currently supported by GuideLLM.
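
As a rough sketch of preparing such a fallback trace, the snippet below writes a small JSONL file containing only the three default columns. The helper name `write_trace`, the temporary file location, and the event values are illustrative assumptions, not part of GuideLLM.

```python
import json
import tempfile
from pathlib import Path


def write_trace(path, events):
    """Write (timestamp, input_length, output_length) tuples as JSONL rows."""
    with open(path, "w", encoding="utf-8") as handle:
        for timestamp, input_length, output_length in events:
            row = {
                "timestamp": timestamp,
                "input_length": input_length,
                "output_length": output_length,
            }
            handle.write(json.dumps(row) + "\n")


# Illustrative values only; timestamps are numeric seconds on a shared timeline.
events = [(0.0, 256, 128), (0.5, 512, 64), (2.0, 128, 32)]
trace_path = Path(tempfile.mkdtemp()) / "replay.jsonl"
write_trace(trace_path, events)
print(trace_path.read_text(encoding="utf-8"))
```

A file written this way uses the default column names, so it should be loadable with `kind=trace_synthetic,path=<file>` without any column overrides.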

## Format-Specific Data Arguments

### `mooncake`

The Mooncake format expects an additional column for hash IDs. During prompt generation, hash IDs sharing the same previous ID are required to represent distinct blocks of token IDs.

| Argument | Default | Description |
| -------------------- | ---------- | --------------------------------------------------- |
| `hash_ids_column` | "hash_ids" | Column name for lists of hash IDs in the trace file |
| `hash_id_block_size` | 512 | Amount of tokens represented by one hash ID |
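
To make the relationship between `hash_id_block_size` and a row's token count concrete, here is a minimal sketch. It assumes that each hash ID covers up to `hash_id_block_size` tokens and that the final block may be only partially filled; the helper name `expected_hash_id_count` and the row values are hypothetical and do not describe GuideLLM's actual validation.

```python
import math


def expected_hash_id_count(input_length, hash_id_block_size=512):
    """Number of blocks needed to cover input_length tokens.

    Assumption for illustration: the final block may be partially filled.
    """
    return math.ceil(input_length / hash_id_block_size)


# Illustrative row shaped like the columns described above.
row = {
    "timestamp": 0,
    "input_length": 1300,
    "output_length": 40,
    "hash_ids": [0, 1, 2],
}

needed = expected_hash_id_count(row["input_length"])
# Compare the computed block count with the number of hash IDs in the row.
print(needed, len(row["hash_ids"]) == needed)
```

A check like this can help spot rows whose `hash_ids` list does not match the configured block size before starting a replay.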
24 changes: 18 additions & 6 deletions src/guidellm/data/deserializers/__init__.py
Expand Up @@ -28,8 +28,16 @@
SyntheticTextDataset,
SyntheticTextDatasetDeserializer,
)
from .trace_mooncake import TraceMooncakeDataArgs, TraceMooncakeDatasetDeserializer
from .trace_synthetic import TraceSyntheticDataArgs, TraceSyntheticDatasetDeserializer
from .trace_common import (
TraceDataArgs,
TraceDatasetDeserializer,
TraceFormatBase,
TraceFormatRegistry,
decode_prompt,
generate_token_ids,
)
from .trace_minimal import MinimalTraceFormatArgs
from .trace_mooncake import MooncakeTraceFormatArgs

__all__ = [
"ArrowFileDatasetDeserializer",
Expand All @@ -49,14 +57,18 @@
"InMemoryItemListDataArgs",
"InMemoryItemListDatasetDeserializer",
"JSONFileDatasetDeserializer",
"MinimalTraceFormatArgs",
"MooncakeTraceFormatArgs",
"ParquetFileDatasetDeserializer",
"SyntheticTextDataArgs",
"SyntheticTextDataset",
"SyntheticTextDatasetDeserializer",
"TarFileDatasetDeserializer",
"TextFileDatasetDeserializer",
"TraceMooncakeDataArgs",
"TraceMooncakeDatasetDeserializer",
"TraceSyntheticDataArgs",
"TraceSyntheticDatasetDeserializer",
"TraceDataArgs",
"TraceDatasetDeserializer",
"TraceFormatBase",
"TraceFormatRegistry",
"decode_prompt",
"generate_token_ids",
]