
Add temperature and power monitoring via amdgpu sensor API #183

Open
mgajda wants to merge 8 commits into clbr:master from mgajda:add-extended-monitoring
mgajda commented Apr 15, 2026

Add GPU temperature and power monitoring using the amdgpu sensor API. No root is required; both readings are available on modern amdgpu drivers.

mgajda added 8 commits April 14, 2026 23:15
On RDNA, the GRBM_STATUS register layout changed:
- Bit 10 (Event Engine): reserved/undefined
- Bits 16-17 (VGT): replaced by Geometry Engine at bit 21
- Bit 21 (Sequencer Instruction Cache): now Geometry Engine

Repurpose the VGT field to read GE_BUSY (bit 21) and label it
"Geometry Engine" in the UI and dump output. Hide Event Engine
and Sequencer Instruction Cache on RDNA since those blocks no
longer exist. Other bits (TA, SX, SPI, SC, PA, DB, CB, GUI)
are unchanged across all generations.
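
The bit remapping described above can be sketched as a pair of mask checks. This is a minimal illustration, assuming only the bit positions stated in the commit message; the macro and function names are hypothetical and are not taken from the radeontop source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as stated in the commit message (names are hypothetical). */
#define GRBM_VGT_MASK (0x3u << 16) /* bits 16-17: VGT on pre-RDNA parts */
#define GRBM_GE_BUSY  (1u << 21)   /* bit 21: Geometry Engine on RDNA */
#define GRBM_EE_BUSY  (1u << 10)   /* bit 10: Event Engine, pre-RDNA only */

/* Is the geometry front end busy in this GRBM_STATUS sample? */
static bool geometry_busy(uint32_t grbm_status, bool is_rdna)
{
    if (is_rdna)
        return (grbm_status & GRBM_GE_BUSY) != 0;
    return (grbm_status & GRBM_VGT_MASK) != 0;
}

/* Event Engine is hidden on RDNA, where bit 10 is reserved. */
static bool event_engine_busy(uint32_t grbm_status, bool is_rdna)
{
    return !is_rdna && (grbm_status & GRBM_EE_BUSY) != 0;
}
```

Reusing one field for both generations keeps the sampling loop unchanged; only the mask and the UI label differ.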

VCN (Video Core Next) is the video decode/encode block on RDNA GPUs,
replacing UVD/VCE from older architectures. Monitor bit 1 of SRBM_STATUS2
for VCN utilization on NAVI10+.

Changes:
- Add bits.vcn field to track VCN busy status in SRBM_STATUS2 bit 1
- Configure VCN bit in initbits() for RDNA (fam >= NAVI10)
- Collect VCN data in ticks.c and aggregate in result histograms
- Display VCN utilization in ui.c and dump.c output

VCN monitoring is available without root via the amdgpu driver
(SRBM_STATUS2 is whitelisted for read access).
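
As a rough sketch of how sampled register values could be turned into a utilization figure, the helper below counts how many samples have bit 1 set. It assumes the samples were already read from SRBM_STATUS2; the helper and macro names are hypothetical, and the real aggregation in ticks.c may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit 1 of SRBM_STATUS2, per the commit message (macro name is hypothetical). */
#define SRBM2_VCN_BUSY (1u << 1)

/* Percentage of samples in which the VCN busy bit was set. */
static double vcn_busy_percent(const uint32_t *samples, size_t count)
{
    size_t busy = 0;

    if (count == 0)
        return 0.0; /* nothing sampled yet */

    for (size_t i = 0; i < count; i++) {
        if (samples[i] & SRBM2_VCN_BUSY)
            busy++;
    }
    return 100.0 * (double)busy / (double)count;
}
```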

Document a detailed proposal for extending the amdgpu kernel driver's register
whitelist to enable unprivileged monitoring of:

1. GRBM_STATUS2 (0x8014) - Extended GPU status bits
2. GC_USER_SHADER_THROTTLE_CONFIG (0x30E4) - Thermal throttle detection
3. GC_USER_RAS_STATUS registers - ECC error monitoring (professional cards)

Includes:
- Register safety analysis and bit layout documentation
- Kernel patch locations and implementation guide
- Integration plan with radeontop code examples
- Risk assessment and mitigation strategies
- Submission strategy for kernel mailing list

This enables tools like radeontop to provide thermal throttle warnings
and extended performance monitoring without root privileges once the
kernel patch is merged.

Add GPU temperature and power monitoring using the amdgpu_query_sensor_info
API, available without root on modern amdgpu drivers. These sensors are
available on most RDNA and newer APUs/discrete GPUs.

Changes:
- Add temperature and power fields to bits_t struct
- Add gettemp() and getpower() function pointers to amdgpu backend
- Initialize sensor callbacks in amdgpu init (DRM 3.11+)
- Collect temperature (millidegrees C) and power (milliwatts) samples
- Average temperature and power over monitoring interval
- Display temperature in Celsius and power in Watts in UI
- Include temperature and power in CSV/text dump output

Temperature and power are queried via amdgpu_query_sensor_info:
- AMDGPU_INFO_SENSOR_GPU_TEMP - GPU junction temperature
- AMDGPU_INFO_SENSOR_GPU_AVG_POWER - Average power consumption

No root required - data comes from whitelisted amdgpu sensor API.
Gracefully skips if sensors unavailable on older drivers or hardware.
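
The sample-and-average flow listed above might look roughly like the sketch below. It assumes a backend callback that returns 0 on success and a raw reading in milli-units, as the commit message describes; in the real backend that callback would presumably wrap amdgpu_query_sensor_info. All type and function names here are hypothetical, and the two stub callbacks exist only to make the sketch self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: returns 0 on success and stores a raw reading. */
typedef int (*sensor_fn)(uint32_t *out);

struct sensor_avg {
    uint64_t sum;     /* sum of raw readings (millidegrees C or milliwatts) */
    uint32_t samples; /* number of successful reads */
};

/* Take one sample; a missing callback or a failed read is skipped silently. */
static void sensor_sample(struct sensor_avg *acc, sensor_fn read)
{
    uint32_t raw;

    if (read != NULL && read(&raw) == 0) {
        acc->sum += raw;
        acc->samples++;
    }
}

/* Average in base units (degrees C or watts); 0.0 if nothing was sampled. */
static double sensor_average(const struct sensor_avg *acc)
{
    if (acc->samples == 0)
        return 0.0;
    return (double)acc->sum / (double)acc->samples / 1000.0;
}

/* Stand-in for a working sensor: always reports 45000 millidegrees C. */
static int fake_temp(uint32_t *out)
{
    *out = 45000;
    return 0;
}

/* Stand-in for an older driver or hardware without the sensor. */
static int failing_sensor(uint32_t *out)
{
    (void)out;
    return -1;
}
```

Accumulating raw integer readings and dividing once at display time avoids rounding on every sample.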

Add APU detection to show appropriate memory labels:
- APU (unified memory): 'Reserved Memory' and 'Unified Memory'
- Discrete GPU: 'VRAM' and 'GTT' (original labels)

Implementation:
- detect.c: Add is_apu flag
- amdgpu.c: Set is_apu based on AMDGPU_IDS_FLAGS_FUSION
- include/radeontop.h: Declare is_apu extern
- ui.c: Conditional labels in ncurses display
- dump.c: Conditional labels in CSV/text output

This helps users understand that APUs use system memory while
discrete GPUs have separate VRAM, and that GTT means unified memory
on APUs but the graphics translation table on discrete GPUs.
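
The label selection described above can be illustrated with a small lookup. This is a sketch only: the flag constant is assumed to mirror AMDGPU_IDS_FLAGS_FUSION from amdgpu_drm.h and is redefined locally so the example compiles without libdrm headers, and the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror AMDGPU_IDS_FLAGS_FUSION; redefined here only so the
 * sketch builds without libdrm headers. */
#define SKETCH_IDS_FLAGS_FUSION 0x1u

struct mem_labels {
    const char *local;  /* pool shown as VRAM on discrete GPUs */
    const char *shared; /* pool shown as GTT on discrete GPUs */
};

/* Derive the APU flag from the device's ids_flags bitfield. */
static bool is_apu_from_flags(uint64_t ids_flags)
{
    return (ids_flags & SKETCH_IDS_FLAGS_FUSION) != 0;
}

/* Pick the label pair used by both the ncurses UI and the dump output. */
static struct mem_labels memory_labels(bool is_apu)
{
    struct mem_labels l;

    l.local = is_apu ? "Reserved Memory" : "VRAM";
    l.shared = is_apu ? "Unified Memory" : "GTT";
    return l;
}
```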

APUs (integrated graphics) may not report AMDGPU_INFO_SENSOR_GPU_AVG_POWER
through the driver. Skip error reporting for this sensor on APUs since
it's not expected to be available.

Discrete GPUs will still show errors if the power sensor is unavailable,
as this is unexpected and should be investigated.
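
The reporting rule in this commit reduces to a single predicate. The sketch below assumes a query that returns 0 on success and nonzero on failure; the function name is hypothetical.

```c
#include <stdbool.h>

/* Should a failed power-sensor query be reported to the user?
 * APUs are expected to lack the sensor, so their failures stay silent. */
static bool should_report_power_error(bool is_apu, int query_rc)
{
    if (query_rc == 0)
        return false; /* query succeeded, nothing to report */
    return !is_apu;
}
```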

Add a has_power_sensor flag to track power sensor availability.
Display power as 0.0 W if the sensor is unavailable, rather than hiding
the field entirely. This makes it clear that power monitoring is available
but returns 0 W on systems where the sensor is not accessible.

Changes:
- detect.c: Add has_power_sensor flag
- amdgpu.c: Set flag when power sensor is successfully initialized
- include/radeontop.h: Declare has_power_sensor extern
- ui.c: Show power field even if power_avg is 0, if sensor exists
- dump.c: Include power field even if power_avg is 0, if sensor exists

This provides better clarity to users about sensor availability.
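
A minimal sketch of the display rule from the change list above: the field is emitted whenever the sensor exists, even if the average is zero. The function name and the "Power" label text are hypothetical, and the power value is assumed to be already converted to watts.

```c
#include <stdbool.h>
#include <stdio.h>

/* Write the power field into buf, or an empty string if the sensor was
 * never detected. Returns the number of characters the field needs. */
static int format_power(char *buf, size_t len, bool has_power_sensor,
                        double power_avg_watts)
{
    if (!has_power_sensor) {
        if (len > 0)
            buf[0] = '\0';
        return 0;
    }
    /* Shown even when the average is 0, as long as the sensor exists. */
    return snprintf(buf, len, "Power %.1f W", power_avg_watts);
}
```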

For APUs (integrated GPUs), the AMDGPU_INFO_SENSOR_GPU_TEMP sensor reports
the shared die temperature, which is more accurate than discrete GPU readings
since CPU and GPU cores share the same thermal management system.

- Add clarifying comments about APU temperature behavior
- Silently skip temperature errors on APUs (sensor may not be separate)

mgajda commented Apr 15, 2026

Author

@JCoMcL @clbr Any chance to merge this and refresh this excellent application?

JCoMcL commented Apr 28, 2026

@mgajda looks great! Will merge when I get the chance :))
