Add temperature and power monitoring via amdgpu sensor API #183
Open
mgajda wants to merge 8 commits into
On RDNA, the GRBM_STATUS register layout changed:
- Bit 10 (Event Engine): reserved/undefined
- Bits 16-17 (VGT): replaced by Geometry Engine at bit 21
- Bit 21 (Sequencer Instruction Cache): now Geometry Engine

Repurpose the VGT field to read GE_BUSY (bit 21) and label it "Geometry Engine" in the UI and dump output. Hide Event Engine and Sequencer Instruction Cache on RDNA since those blocks no longer exist.

Other bits (TA, SX, SPI, SC, PA, DB, CB, GUI) are unchanged across all generations.
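The bit remapping described above can be sketched as a small mask table. This is an illustration only: `struct grbm_masks`, `grbm_masks_for()` and `block_busy()` are hypothetical names, not the identifiers used in the patch, and a mask of 0 is an assumed convention for "hidden on this generation".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the GRBM_STATUS masks named in the commit text. */
struct grbm_masks {
    uint32_t ee;   /* Event Engine; 0 means hidden */
    uint32_t vgt;  /* VGT on older parts, Geometry Engine on RDNA */
    uint32_t sic;  /* Sequencer Instruction Cache; 0 means hidden */
};

/* Pick masks for the given generation (is_rdna: 0 or 1). */
struct grbm_masks grbm_masks_for(int is_rdna)
{
    struct grbm_masks m;

    if (is_rdna) {
        m.ee  = 0;            /* bit 10 is reserved on RDNA */
        m.vgt = 1U << 21;     /* GE_BUSY reuses the VGT field */
        m.sic = 0;            /* block no longer exists */
    } else {
        m.ee  = 1U << 10;
        m.vgt = (1U << 16) | (1U << 17);
        m.sic = 1U << 21;
    }

    /* The same bit must never be decoded as two different blocks. */
    assert((m.vgt & m.sic) == 0);
    return m;
}

/* Returns 1 if any bit of mask is set in reg; a zero mask is never busy. */
int block_busy(uint32_t reg, uint32_t mask)
{
    return mask != 0 && (reg & mask) != 0;
}
```

With this layout, a UI or dump routine can skip any field whose mask is 0 instead of carrying a separate per-generation branch for each block.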
VCN (Video Core Next) is the video decode/encode block on RDNA GPUs, replacing UVD/VCE from older architectures. Monitor bit 1 of SRBM_STATUS2 for VCN utilization on NAVI10+.

Changes:
- Add bits.vcn field to track VCN busy status in SRBM_STATUS2 bit 1
- Configure VCN bit in initbits() for RDNA (fam >= NAVI10)
- Collect VCN data in ticks.c and aggregate in result histograms
- Display VCN utilization in ui.c and dump.c output

VCN monitoring is available without root via the amdgpu driver (SRBM_STATUS2 is whitelisted for read access).
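As a rough sketch of the sampling idea above, the following counts how often bit 1 of a sampled SRBM_STATUS2 value is set and turns that into a percentage. `struct vcn_counter`, `vcn_sample()` and `vcn_percent()` are hypothetical names for illustration. How the register value is read is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VCN_BUSY_MASK (1U << 1)  /* SRBM_STATUS2 bit 1, per the commit text */

/* Hypothetical per-interval counter. */
struct vcn_counter {
    unsigned int busy;   /* samples with the VCN bit set */
    unsigned int total;  /* all samples taken */
};

/* Record one sampled SRBM_STATUS2 value. */
void vcn_sample(struct vcn_counter *c, uint32_t srbm_status2)
{
    assert(c != NULL);
    if (srbm_status2 & VCN_BUSY_MASK)
        c->busy++;
    c->total++;
}

/* Utilization over the interval, in percent; 0 if nothing was sampled. */
float vcn_percent(const struct vcn_counter *c)
{
    assert(c != NULL);
    if (c->total == 0)
        return 0.0f;
    return 100.0f * (float)c->busy / (float)c->total;
}
```

The zero-sample guard keeps the first display refresh from dividing by zero before any ticks have been collected.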
Document a detailed proposal for extending the amdgpu kernel driver register whitelist to enable unprivileged monitoring of:
1. GRBM_STATUS2 (0x8014) - Extended GPU status bits
2. GC_USER_SHADER_THROTTLE_CONFIG (0x30E4) - Thermal throttle detection
3. GC_USER_RAS_STATUS registers - ECC error monitoring (professional cards)

Includes:
- Register safety analysis and bit layout documentation
- Kernel patch locations and implementation guide
- Integration plan with radeontop code examples
- Risk assessment and mitigation strategies
- Submission strategy for kernel mailing list

This enables tools like radeontop to provide thermal throttle warnings and extended performance monitoring without root privileges once the kernel patch is merged.
Add GPU temperature and power monitoring using the amdgpu_query_sensor_info API, available without root on modern amdgpu drivers. These sensors are available on most RDNA and newer APUs/discrete GPUs.

Changes:
- Add temperature and power fields to bits_t struct
- Add gettemp() and getpower() function pointers to amdgpu backend
- Initialize sensor callbacks in amdgpu init (DRM 3.11+)
- Collect temperature (millidegrees C) and power (milliwatts) samples
- Average temperature and power over monitoring interval
- Display temperature in Celsius and power in Watts in UI
- Include temperature and power in CSV/text dump output

Temperature and power are queried via amdgpu_query_sensor_info:
- AMDGPU_INFO_SENSOR_GPU_TEMP - GPU junction temperature
- AMDGPU_INFO_SENSOR_GPU_AVG_POWER - Average power consumption

No root required - data comes from the whitelisted amdgpu sensor API. Gracefully skips if sensors are unavailable on older drivers or hardware.
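The averaging step above might look roughly like the following. `struct sensor_acc`, `sensor_add()` and `sensor_avg()` are hypothetical names, and the raw values are assumed to have already been read from the driver. The scale is passed in rather than hard-coded. A scale of 1000 matches the milli-units stated above, but the units the driver actually returns are worth checking against the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accumulator for one sensor over a monitoring interval. */
struct sensor_acc {
    uint64_t sum;        /* sum of raw readings */
    unsigned int count;  /* number of readings */
};

/* Add one raw reading (e.g. millidegrees C). */
void sensor_add(struct sensor_acc *a, uint32_t raw)
{
    assert(a != NULL);
    a->sum += raw;
    a->count++;
}

/*
 * Average over the interval, converted to base units.
 * scale is the number of raw units per base unit (assumed 1000 for
 * milli-units). Returns 0.0 when no samples were taken.
 */
double sensor_avg(const struct sensor_acc *a, double scale)
{
    assert(a != NULL && scale > 0.0);
    if (a->count == 0)
        return 0.0;
    return (double)a->sum / (double)a->count / scale;
}
```

Summing into a 64-bit field avoids overflow when many 32-bit samples are collected between refreshes.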
Add APU detection to show appropriate memory labels:
- APU (unified memory): 'Reserved Memory' and 'Unified Memory'
- Discrete GPU: 'VRAM' and 'GTT' (original labels)

Implementation:
- detect.c: Add is_apu flag
- amdgpu.c: Set is_apu based on AMDGPU_IDS_FLAGS_FUSION
- include/radeontop.h: Declare is_apu extern
- ui.c: Conditional labels in ncurses display
- dump.c: Conditional labels in CSV/text output

This helps users understand that APUs use system memory while discrete GPUs have separate VRAM, and that GTT is unified memory on APUs versus the graphics translation table on discrete GPUs.
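A minimal sketch of the label selection described above. `IDS_FLAGS_FUSION` is a local stand-in assumed to mirror AMDGPU_IDS_FLAGS_FUSION from amdgpu_drm.h, and `detect_apu()`, `struct mem_labels` and `labels_for()` are hypothetical names. The mapping of 'Reserved Memory' to the VRAM slot and 'Unified Memory' to the GTT slot follows the order given in the commit text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: same value as AMDGPU_IDS_FLAGS_FUSION in amdgpu_drm.h. */
#define IDS_FLAGS_FUSION 0x1

/* Hypothetical pair of display labels. */
struct mem_labels {
    const char *vram;
    const char *gtt;
};

/* Returns 1 if the device flags mark the GPU as an APU. */
int detect_apu(uint64_t ids_flags)
{
    return (ids_flags & IDS_FLAGS_FUSION) != 0;
}

/* Fill out with the labels appropriate for the device type. */
void labels_for(int is_apu, struct mem_labels *out)
{
    assert(out != NULL);
    out->vram = is_apu ? "Reserved Memory" : "VRAM";
    out->gtt  = is_apu ? "Unified Memory"  : "GTT";
}
```

Keeping the choice in one helper lets the ncurses and dump paths share the same strings instead of repeating the conditional.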
APUs (integrated graphics) may not report AMDGPU_INFO_SENSOR_GPU_AVG_POWER through the driver. Skip error reporting for this sensor on APUs, since it is not expected to be available there.

Discrete GPUs will still show errors if the power sensor is unavailable, as this is unexpected and should be investigated.
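The reporting rule above reduces to a small predicate. The sketch below assumes the sensor query returns 0 on success and nonzero on failure. `report_power_error()` is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>

/*
 * Returns 1 if a failed power-sensor query should be reported.
 * query_ret: result of the sensor query (assumed 0 on success).
 * is_apu:    1 for an APU, 0 for a discrete GPU.
 */
int report_power_error(int query_ret, int is_apu)
{
    assert(is_apu == 0 || is_apu == 1);
    if (query_ret == 0)
        return 0;    /* query succeeded, nothing to report */
    return !is_apu;  /* a failure is only unexpected on discrete GPUs */
}
```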
Add has_power_sensor flag to track power sensor availability. Display power as 0.0 W if the sensor is unavailable, rather than hiding the field entirely. This makes it clear that power monitoring is available but returns 0 W on systems where the sensor is not accessible.

Changes:
- detect.c: Add has_power_sensor flag
- amdgpu.c: Set flag when power sensor is successfully initialized
- include/radeontop.h: Declare has_power_sensor extern
- ui.c: Show power field even if power_avg is 0, if sensor exists
- dump.c: Include power field even if power_avg is 0, if sensor exists

This provides better clarity to users about sensor availability.
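One reading of the change list above is that the field is emitted whenever the sensor was initialized, even for a 0 W average, and omitted otherwise. The sketch below follows that reading. `format_power()` and the "Power" label are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write the power field into buf.
 * Emits the field whenever the sensor exists, including a 0.0 W average.
 * Returns the number of characters written (per snprintf), or 0 with an
 * empty string when no power sensor was initialized.
 */
int format_power(char *buf, size_t len, int has_power_sensor, float power_avg_w)
{
    assert(buf != NULL && len > 0);
    if (!has_power_sensor) {
        buf[0] = '\0';
        return 0;
    }
    return snprintf(buf, len, "Power %.1f W", (double)power_avg_w);
}
```

Gating on the flag rather than on the value is what distinguishes "sensor present but idle" from "no sensor at all".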
For APUs (integrated GPUs), the AMDGPU_INFO_SENSOR_GPU_TEMP sensor reports the shared die temperature, which is more accurate than discrete GPU readings since CPU and GPU cores share the same thermal management system.
- Add clarifying comments about APU temperature behavior
- Silently skip temperature errors on APUs (sensor may not be separate)
@mgajda looks great! Will merge when I get the chance :))
Add GPU temperature and power monitoring using the amdgpu sensor API. No root required. Temperature and power are available on modern amdgpu drivers.