
A new version of AMD's compute platform, ROCm, has been released.



ROCm 6.2.0 Release

ROCm 6.2.0 release notes

The release notes provide a comprehensive summary of changes since the previous ROCm release.

  • Release highlights

  • Operating system and hardware support changes

  • ROCm components versioning

  • Detailed component changes

  • ROCm known issues

  • ROCm upcoming changes

The Compatibility matrix provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases.

Release notes for previous ROCm releases are available in earlier versions of the documentation.
See the ROCm documentation release history.

Release highlights

This section introduces notable new features and improvements in ROCm 6.2. See the
Detailed component changes for individual component changes.

New components

ROCm 6.2.0 introduces the following new components to the ROCm software stack.

  • Omniperf -- A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads
    running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line
    or a GUI dashboard. For more information, see
    Omniperf.

  • Omnitrace -- A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and GPU.
    It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for determining
    which function and line number are executing. For more information, see
    Omnitrace.

  • rocPyDecode -- A tool for accessing the rocDecode APIs from Python. It bridges Python and the C/C++ rocDecode
    libraries, enabling function calls and data passing between the two languages. The rocpydecode.so wrapper library
    exposes the rocDecode APIs, written primarily in C/C++, for use within Python. For more information, see
    rocPyDecode.

  • ROCprofiler-SDK -- A profiling and tracing library for HIP and ROCm applications on AMD ROCm software,
    used to identify application performance bottlenecks and optimize performance. The new APIs add restrictions that
    allow more efficient implementations and improved thread safety. A new window restriction specifies which services
    the tool can use. ROCprofiler-SDK also provides a tool library to help you write your own tool implementations;
    rocprofv3 uses this tool library to profile and trace applications for performance bottlenecks, including API
    tracing and kernel tracing. For more information, see ROCprofiler-SDK.

    ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change.
    

ROCm Offline Installer Creator introduced

The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU
driver, or a combination of the two on a target system without network access. The tool can produce multiple unique
installer configurations for use when installing ROCm on a target. Other notable features include:

  • A lightweight, easy-to-use user interface for configuring the creation of the installer

  • Support for multiple Linux distributions

  • Installer support for different ROCm releases and specific ROCm components

  • Optional driver or driver-only installer creation

  • Optional post-install preferences

  • Lightweight installer packages, which are unique to the preconfigured ROCm setup

  • Resolution and inclusion of dependency packages for offline installation

For more information, see
ROCm Offline Installer Creator.

Math libraries default to Clang instead of HIPCC

The default compiler used to build the math libraries on Linux changes from hipcc to amdclang++.
Appropriate compiler flags are added to ensure these compilations build correctly. This change only applies when
building the libraries. Applications using the libraries can continue to be compiled using hipcc or amdclang++ as
described in  ROCm compiler reference.
The math libraries can also be built with hipcc using any of the previously available methods (for example, the CXX
environment variable, the CMAKE_CXX_COMPILER CMake variable, and so on). This change shouldn't affect performance or
functionality.

Framework and library changes

This section highlights updates to supported deep learning frameworks and notable third-party library optimizations.

Additional PyTorch and TensorFlow support

ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16.

See Installing PyTorch for ROCm and Installing TensorFlow for ROCm for installation instructions.
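As a quick post-install sanity check, you can confirm the framework version and that the ROCm (HIP) backend is visible. The following is a minimal sketch, assuming a ROCm build of PyTorch, where AMD GPUs are exposed through the familiar cuda device alias:

```python
# Sanity check for a ROCm build of PyTorch.
# On ROCm builds, torch.version.hip is set (it is None on CUDA builds),
# and AMD GPUs are addressed through the "cuda" device alias.
import torch

print("PyTorch version:", torch.__version__)         # e.g. 2.3.x
print("HIP runtime:", torch.version.hip)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))  # e.g. an AMD Instinct accelerator
```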

Refer to the
Third-party support matrix
for a comprehensive list of third-party frameworks and libraries supported by ROCm.

Optimized framework support for OpenXLA

PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler
ecosystem that enables developers to compile and optimize models from all leading ML frameworks. For more information, see
Installing PyTorch for ROCm and Installing TensorFlow for ROCm.

PyTorch support for Autocast (automatic mixed precision)

PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm, which can reduce computational
workloads and improve performance. Based on the magnitude of runtime values, Autocast can substitute the
original float32 linear layers and convolutions with their float16 or bfloat16 variants. For more information, see
Automatic mixed precision.
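For illustration, here is a minimal Autocast sketch around an RNN forward pass. It assumes a ROCm build of PyTorch with a visible GPU (addressed through the cuda device type); the layer sizes are arbitrary:

```python
# Minimal Autocast (automatic mixed precision) sketch for an RNN on ROCm.
# Assumes a GPU is available; PyTorch on ROCm uses the "cuda" device type.
import torch

rnn = torch.nn.LSTM(input_size=64, hidden_size=128, batch_first=True).to("cuda")
x = torch.randn(8, 32, 64, device="cuda")  # (batch, seq_len, features)

with torch.autocast(device_type="cuda", dtype=torch.float16):
    out, (h, c) = rnn(x)  # eligible ops run in float16 inside this region

print(out.dtype)  # torch.float16 within the autocast region
```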

Memory savings for bitsandbytes model quantization

The ROCm-aware bitsandbytes library is a lightweight Python wrapper around HIP
custom functions, in particular the 8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions.
ROCm 6.2.0 introduces the following bitsandbytes changes:

  • Int8 matrix multiplication is enabled, and it includes the following functions:
    • extract-outliers – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization.
    • transform – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after matmul computation.
    • igemmlt – new function for GEMM computation A*B^T. It uses
      hipblasLtMatMul and performs 8-bit GEMM operations.
    • dequant_mm – dequantizes output matrix to original data type using scaling factors from vector-wise quantization.
  • Blockwise quantization – input tensors are quantized for a fixed block size.
  • 4-bit quantization and dequantization functions – normalized Float4 quantization, quantile estimation, and quantile quantization functions are enabled.
  • 8-bit and 32-bit optimizers are enabled.
These functions are included in bitsandbytes rather than ROCm; however, ROCm 6.2.0 enables the fixes and
features needed to run them.

For more information, see  Model quantization techniques.
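As a usage illustration, the sketch below loads a Hugging Face model with 8-bit quantization through bitsandbytes. It assumes the transformers library and a ROCm-aware bitsandbytes build are installed; the model name is only an example:

```python
# Sketch: 8-bit quantization via the ROCm-aware bitsandbytes build.
# Assumes `transformers` and a ROCm-enabled `bitsandbytes` are installed;
# the model name is a placeholder. For 4-bit (NF4), use
# BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4") instead.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # Int8 matmul path

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                  # placeholder model
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

inputs = tokenizer("ROCm 6.2 enables bitsandbytes", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```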

Improved vLLM support

ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding FP16/BF16 precision
for LLMs and FP8 support for Llama.
ROCm 6.2.0 adds support for the following vLLM features:

  • MP: Multi-GPU execution. Choose between MP and Ray using a flag. To select MP,
    use --distributed-executor-backend=mp. The default backend varies by commit and is subject to change.

  • FP8 KV cache: Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements.
    The QUARK quantizer currently only supports Llama.

  • Triton Flash Attention:

    ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is Triton, but you can change this
    setting using the VLLM_USE_FLASH_ATTN_TRITON=False environment variable.

  • PyTorch TunableOp:

    Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later.

For more information about enabling these features, see
vLLM inference.
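To make these switches concrete, here is a hedged offline-inference sketch using vLLM's Python API. The model name, parallelism degree, and prompt are placeholders, and the argument names follow upstream vLLM, so they may differ in the ROCm fork:

```python
# Sketch: vLLM offline inference on ROCm using the features described above.
# Model name, tensor_parallel_size, and the prompt are placeholder assumptions.
import os

# Triton Flash Attention is the default; set to "False" to fall back to
# Composable Kernel, per the environment variable named above.
os.environ.setdefault("VLLM_USE_FLASH_ATTN_TRITON", "True")

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",    # placeholder model
    tensor_parallel_size=2,              # multi-GPU execution
    distributed_executor_backend="mp",   # MP instead of Ray
    kv_cache_dtype="fp8",                # FP8 KV cache
)

outputs = llm.generate(["What is ROCm?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```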

ROCm maintains a vLLM branch for experimental features that are still undergoing performance, accuracy, and
correctness testing. These features include:

  • FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMMs for the shapes
    used in the model's execution. Only Llama is supported, because the QUARK quantizer currently supports only Llama.

  • Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention
    computation in large-scale models. This benefits all workloads in FP16 configurations.

To enable these experimental new features, see
vLLM inference.
Use the rocm/vllm branch when cloning the GitHub repo. The vllm/ROCm_performance.md document outlines
all the accessible features, and the vllm/Dockerfile.rocm file can be used to build a ROCm-enabled container image.

Enhanced performance tuning on AMD Instinct accelerators

ROCm is pretuned for high-performance computing workloads including large language models, generative AI, and scientific computing.
The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes
detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these
accelerators for optimal performance. For more information, see
AMD MI300X tuning guides and
AMD MI300A system optimization.

Removed clang-ocl

As of version 6.2, ROCm no longer provides the clang-ocl package.
See the clang-ocl README.

ROCm documentation changes

The documentation for the ROCm components has been reorganized and reformatted with a standard look and feel. This
improves the usability and readability of the documentation. For more information about the ROCm components, see
What is ROCm?.

Since the release of ROCm 6.1, key topics have been added to the documentation, and other topics have been
significantly improved, expanded, or both.

All ROCm projects are open source and available on GitHub. To contribute to ROCm documentation, see the
[ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
