ROCm 6.2.0 release notes
The release notes provide a comprehensive summary of changes since the previous ROCm release, covering the following sections:

- Release highlights
- Operating system and hardware support changes
- ROCm components versioning
- Detailed component changes
- ROCm known issues
- ROCm upcoming changes
The Compatibility matrix provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases.
Release notes for previous ROCm releases are available in earlier versions of the documentation.
See the ROCm documentation release history.

Release highlights
This section introduces notable new features and improvements in ROCm 6.2. See
Detailed component changes for changes to individual components.

New components
ROCm 6.2.0 introduces the following new components to the ROCm software stack.
- Omniperf -- A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads
  running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line
  or a GUI dashboard. For more information, see Omniperf.
- Omnitrace -- A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and
  GPU. It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for
  determining which function and line number are executing. For more information, see Omnitrace.
- rocPyDecode -- A tool to access rocDecode APIs in Python. It connects Python and C/C++ libraries, enabling function
  calling and data passing between the two languages. The rocpydecode.so library, a wrapper, uses rocDecode APIs
  written primarily in C/C++ within Python. For more information, see rocPyDecode.
- ROCprofiler-SDK -- A profiling and tracing library for HIP and ROCm applications, used to identify application
  performance bottlenecks and optimize performance. The new APIs add restrictions for more efficient implementations
  and improved thread safety. A new window restriction specifies the services the tool can use. ROCprofiler-SDK also
  provides a tool library to help you write your own tool implementations. rocprofv3 uses this tool library to profile
  and trace applications for performance bottlenecks. Examples include API tracing and kernel tracing. For more
  information, see ROCprofiler-SDK. Note that ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change.
ROCm Offline Installer Creator introduced
The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU
driver, or a combination of the two on a target system without network access. The tool can create multiple unique
installer configurations for use when installing ROCm on a target. Other notable features include:
- A lightweight, easy-to-use user interface for configuring the creation of the installer
- Support for multiple Linux distributions
- Installer support for different ROCm releases and specific ROCm components
- Optional driver or driver-only installer creation
- Optional post-install preferences
- Lightweight installer packages, which are unique to the preconfigured ROCm setup
- Resolution and inclusion of dependency packages for offline installation
For more information, see ROCm Offline Installer Creator.

Math libraries default to Clang instead of HIPCC
The default compiler used to build the math libraries on Linux changes from hipcc to amdclang++. Appropriate compiler
flags are added to ensure these compilations build correctly. This change only applies when building the libraries.
Applications using the libraries can continue to be compiled using hipcc or amdclang++, as described in the ROCm
compiler reference. The math libraries can also be built with hipcc using any of the previously available methods
(for example, the CXX environment variable, the CMAKE_CXX_COMPILER CMake variable, and so on). This change shouldn't
affect performance or functionality.

Framework and library changes
This section highlights updates to supported deep learning frameworks and notable third-party library optimizations.
Additional PyTorch and TensorFlow support
ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16.
See Installing PyTorch for ROCm and Installing TensorFlow for ROCm for installation instructions. Refer to the
Third-party support matrix for a comprehensive list of third-party frameworks and libraries supported by ROCm.

Optimized framework support for OpenXLA
PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler
ecosystem that enables developers to compile and optimize models from all leading ML frameworks. For more information, see
Installing PyTorch for ROCm and Installing TensorFlow for ROCm.
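To make the compile step concrete, here is a minimal sketch of requesting XLA compilation from TensorFlow; the
function body and tensor shapes are arbitrary assumptions, and it presumes a ROCm-enabled TensorFlow build with a
visible GPU:

```python
# Minimal sketch: XLA-compiled computation in TensorFlow for ROCm.
# Assumes a ROCm-enabled TensorFlow build; shapes are arbitrary.
import tensorflow as tf

@tf.function(jit_compile=True)  # request XLA compilation for this function
def scaled_matmul(a, b):
    return 0.5 * tf.matmul(a, b)

a = tf.random.normal((512, 512))
b = tf.random.normal((512, 512))
print(scaled_matmul(a, b).shape)  # (512, 512)
```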
PyTorch support for Autocast (automatic mixed precision)

PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm. This can help to reduce computational
workloads and improve performance. Based on the magnitude of values, Autocast can substitute the original float32
linear layers and convolutions with their float16 or bfloat16 variants. For more information, see
Automatic mixed precision.
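As an illustration, here is a minimal sketch of Autocast wrapping an LSTM forward pass; the layer sizes and tensor
shapes are arbitrary assumptions for the example:

```python
# Minimal sketch: Autocast around an RNN forward pass on ROCm.
# On ROCm, PyTorch exposes the GPU through the "cuda" device type,
# so the standard autocast API applies unchanged.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2).to("cuda")
x = torch.randn(16, 8, 64, device="cuda")  # (seq_len, batch, features)

with torch.autocast(device_type="cuda", dtype=torch.float16):
    output, (h_n, c_n) = lstm(x)

print(output.dtype)  # typically torch.float16 inside the autocast region
```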
Memory savings for bitsandbytes model quantization

The ROCm-aware bitsandbytes library is a lightweight Python wrapper around HIP custom functions, in particular the
8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions.
ROCm 6.2.0 introduces the following bitsandbytes changes:
- Int8 matrix multiplication is enabled, and it includes the following functions:
  - extract-outliers – extracts rows and columns that have outliers in the inputs. They’re later used for matrix
    multiplication without quantization.
  - transform – row-to-column and column-to-row transformations are enabled, along with transpose operations. These
    are used before and after matmul computation.
  - igemmlt – a new function for the GEMM computation A*B^T. It uses hipblasLtMatMul and performs 8-bit GEMM
    operations.
  - dequant_mm – dequantizes the output matrix to the original data type using scaling factors from vector-wise
    quantization.
- Blockwise quantization – input tensors are quantized for a fixed block size.
- 4-bit quantization and dequantization functions – normalized Float4 quantization, quantile estimation, and quantile
  quantization functions are enabled.
- 8-bit and 32-bit optimizers are enabled.
These functions are included in bitsandbytes; they are not part of ROCm. However, ROCm 6.2.0 enables the fixes and features needed to run them.
For more information, see Model quantization techniques.
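As a rough illustration of how these features are typically used, here is a minimal sketch that swaps a float32
linear layer for its 8-bit counterpart and sets up the 8-bit optimizer; the layer sizes are arbitrary, and it assumes
a ROCm-enabled bitsandbytes build installed as the bitsandbytes package:

```python
# Minimal sketch: 8-bit quantization with a ROCm-aware bitsandbytes build.
import torch
import torch.nn as nn
import bitsandbytes as bnb

# Replace a float32 linear layer with its 8-bit counterpart.
fp32_layer = nn.Linear(1024, 1024)
int8_layer = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False)
int8_layer.load_state_dict(fp32_layer.state_dict())
int8_layer = int8_layer.to("cuda")  # weights are quantized during the transfer

x = torch.randn(4, 1024, dtype=torch.float16, device="cuda")
y = int8_layer(x)  # runs the int8 matmul path under the hood

# The 8-bit Adam optimizer keeps optimizer state in 8 bits to save memory
# during training of an ordinary (non-quantized) model.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda")
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)
```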
Improved vLLM support
ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding capabilities for FP16/BF16
precision for LLMs and FP8 support for Llama.

ROCm 6.2.0 adds support for the following vLLM features:
- MP: Multi-GPU execution. Choose between MP and Ray using a flag. To set it to MP, use
  --distributed-executor-backend=mp. The default depends on the commit, which is in flux.
- FP8 KV cache: Enhances computational efficiency and performance by significantly reducing memory usage and
  bandwidth requirements. The QUARK quantizer currently only supports Llama.
- Triton Flash Attention: ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is
  Triton, but you can change this setting using the VLLM_USE_FLASH_ATTN_TRITON=False environment variable.
- PyTorch TunableOp: Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later. A short
  sketch follows this list.
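As a brief sketch of what enabling TunableOp looks like in practice, the snippet below turns it on through its
environment variables before running a GEMM; the matrix sizes and results filename are arbitrary assumptions:

```python
# Minimal sketch: PyTorch TunableOp on ROCm. The environment variables
# must be set before the first GEMM is dispatched.
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"                # enable TunableOp
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tuned_gemm.csv"  # cache of tuned results

import torch

a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
c = a @ b  # this GEMM shape is tuned on first use; later calls reuse the result
```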
For more information about enabling these features, see vLLM inference.
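To make this concrete, here is a minimal sketch of multi-GPU inference through vLLM's Python API; the model name and
GPU count are placeholder assumptions, and it presumes a ROCm-enabled vLLM build recent enough to accept the
distributed_executor_backend argument:

```python
# Minimal sketch: vLLM inference on ROCm with the MP (multiprocessing)
# executor backend instead of Ray.
import os
os.environ["VLLM_USE_FLASH_ATTN_TRITON"] = "False"  # optional: use CK flash attention

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # placeholder model
    tensor_parallel_size=2,             # placeholder GPU count
    distributed_executor_backend="mp",  # mirrors --distributed-executor-backend=mp
)
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is ROCm?"], params)
print(outputs[0].outputs[0].text)
```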
ROCm has a vLLM branch for experimental features, including performance improvements and accuracy and correctness
testing. These features include:

- FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMM using the shapes
  used in the model's execution. It only supports Llama because the QUARK quantizer currently only supports Llama.
- Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention
  computation in large-scale models. This benefits all workloads in FP16 configurations.

To enable these experimental features, see vLLM inference.
Use the rocm/vllm branch when cloning the GitHub repo. The vllm/ROCm_performance.md document outlines all the
accessible features, and the vllm/Dockerfile.rocm file can be used.

Enhanced performance tuning on AMD Instinct accelerators
ROCm is pretuned for high-performance computing workloads including large language models, generative AI, and scientific computing.
The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes
detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these
accelerators for optimal performance. For more information, see
AMD MI300X tuning guides and
AMD MI300A system optimization.

Removed clang-ocl
As of version 6.2, ROCm no longer provides the clang-ocl package. See the clang-ocl README.

ROCm documentation changes
The documentation for the ROCm components has been reorganized and reformatted with a standard look and feel,
improving the usability and readability of the documentation. For more information about the ROCm components, see
What is ROCm?

Since the release of ROCm 6.1, the following key topics have been added to the documentation:
- AMD Instinct MI300X workload tuning guide
- AMD Instinct MI300X system tuning guide
- AMD Instinct MI300A system tuning guide
- Using ROCm for AI
- Using ROCm for HPC
- Fine-tuning LLMs and inference optimization
- LLVM reference documentation
Several other topics have been significantly improved, expanded, or both.
All ROCm projects are open source and available on GitHub. To contribute to ROCm documentation, see the [ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).