
A new version of AMD's compute platform, ROCm, has been released.



ROCm 6.2.0 Release

ROCm 6.2.0 release notes

The release notes provide a comprehensive summary of changes since the previous ROCm release.

  • Release highlights

  • Operating system and hardware support changes

  • ROCm components versioning

  • Detailed component changes

  • ROCm known issues

  • ROCm upcoming changes

The Compatibility matrix provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases.

Release notes for previous ROCm releases are available in earlier versions of the documentation.
See the ROCm documentation release history.

Release highlights

This section introduces notable new features and improvements in ROCm 6.2. See the
Detailed component changes for individual component changes.

New components

ROCm 6.2.0 introduces the following new components to the ROCm software stack.

  • Omniperf -- A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads
    running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line
    or a GUI dashboard. For more information, see
    Omniperf.

  • Omnitrace -- A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and GPU.
    It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for determining
    which function and line number are executing. For more information, see
    Omnitrace.

  • rocPyDecode -- A tool for accessing the rocDecode APIs from Python. It bridges Python and the C/C++ rocDecode
    libraries, enabling function calls and data passing between the two languages. The rocpydecode.so wrapper library
    exposes the rocDecode APIs, written primarily in C/C++, for use within Python. For more information, see
    rocPyDecode.

  • ROCprofiler-SDK -- A profiling and tracing library for HIP and ROCm applications on AMD ROCm software,
    used to identify application performance bottlenecks and optimize performance. The new APIs add restrictions that
    allow more efficient implementations and improved thread safety. A new window restriction specifies which services
    the tool can use. ROCprofiler-SDK also provides a tool library to help you write your own tool implementations;
    rocprofv3 uses this tool library to profile and trace applications for performance bottlenecks, including API
    tracing and kernel tracing. For more information, see ROCprofiler-SDK.

    ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change.
    

ROCm Offline Installer Creator introduced

The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU
driver, or a combination of the two on a target system without network access. The tool can produce multiple unique
installer configurations for use when installing ROCm on a target. Other notable features include:

  • A lightweight, easy-to-use user interface for configuring the creation of the installer

  • Support for multiple Linux distributions

  • Installer support for different ROCm releases and specific ROCm components

  • Optional driver or driver-only installer creation

  • Optional post-install preferences

  • Lightweight installer packages, which are unique to the preconfigured ROCm setup

  • Resolution and inclusion of dependency packages for offline installation

For more information, see
ROCm Offline Installer Creator.

Math libraries default to Clang instead of HIPCC

The default compiler used to build the math libraries on Linux changes from hipcc to amdclang++.
Appropriate compiler flags are added to ensure these compilations build correctly. This change only applies when
building the libraries. Applications using the libraries can continue to be compiled using hipcc or amdclang++ as
described in  ROCm compiler reference.
The math libraries can also be built with hipcc using any of the previously available methods (for example, the CXX
environment variable, the CMAKE_CXX_COMPILER CMake variable, and so on). This change shouldn't affect performance or
functionality.

Framework and library changes

This section highlights updates to supported deep learning frameworks and notable third-party library optimizations.

Additional PyTorch and TensorFlow support

ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16.

See Installing PyTorch for ROCm and Installing TensorFlow for ROCm for installation instructions.
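As a quick post-install sanity check, you can confirm the framework version and that the ROCm (HIP) backend is visible. The following is a minimal sketch, assuming a ROCm build of PyTorch, where AMD GPUs are exposed through the familiar cuda device alias:

```python
# Sanity check for a ROCm build of PyTorch.
# On ROCm builds, torch.version.hip is set (it is None on CUDA builds),
# and AMD GPUs are addressed through the "cuda" device alias.
import torch

print("PyTorch version:", torch.__version__)         # e.g. 2.3.x
print("HIP runtime:", torch.version.hip)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))  # e.g. an AMD Instinct accelerator
```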

Refer to the
Third-party support matrix
for a comprehensive list of third-party frameworks and libraries supported by ROCm.

Optimized framework support for OpenXLA

PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler
ecosystem that enables developers to compile and optimize models from all leading ML frameworks. For more information, see
Installing PyTorch for ROCm and Installing TensorFlow for ROCm.

PyTorch support for Autocast (automatic mixed precision)

PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm, which can reduce computational
workloads and improve performance. Based on the magnitude of runtime values, Autocast can substitute the
original float32 linear layers and convolutions with their float16 or bfloat16 variants. For more information, see
Automatic mixed precision.
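For illustration, here is a minimal Autocast sketch around an RNN forward pass. It assumes a ROCm build of PyTorch with a visible GPU (addressed through the cuda device type); the layer sizes are arbitrary:

```python
# Minimal Autocast (automatic mixed precision) sketch for an RNN on ROCm.
# Assumes a GPU is available; PyTorch on ROCm uses the "cuda" device type.
import torch

rnn = torch.nn.LSTM(input_size=64, hidden_size=128, batch_first=True).to("cuda")
x = torch.randn(8, 32, 64, device="cuda")  # (batch, seq_len, features)

with torch.autocast(device_type="cuda", dtype=torch.float16):
    out, (h, c) = rnn(x)  # eligible ops run in float16 inside this region

print(out.dtype)  # torch.float16 within the autocast region
```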

Memory savings for bitsandbytes model quantization

The ROCm-aware bitsandbytes library is a lightweight Python wrapper around HIP
custom functions, in particular the 8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions.
ROCm 6.2.0 introduces the following bitsandbytes changes:

  • Int8 matrix multiplication is enabled, and it includes the following functions:
    • extract-outliers – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization.
    • transform – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after matmul computation.
    • igemmlt – new function for GEMM computation A*B^T. It uses
      hipblasLtMatMul and performs 8-bit GEMM operations.
    • dequant_mm – dequantizes output matrix to original data type using scaling factors from vector-wise quantization.
  • Blockwise quantization – input tensors are quantized for a fixed block size.
  • 4-bit quantization and dequantization functions – normalized Float4 quantization, quantile estimation, and quantile quantization functions are enabled.
  • 8-bit and 32-bit optimizers are enabled.
These functions are included in bitsandbytes rather than ROCm; however, ROCm 6.2.0 enables the fixes and
features needed to run them.

For more information, see  Model quantization techniques.
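As a usage illustration, the sketch below loads a Hugging Face model with 8-bit quantization through bitsandbytes. It assumes the transformers library and a ROCm-aware bitsandbytes build are installed; the model name is only an example:

```python
# Sketch: 8-bit quantization via the ROCm-aware bitsandbytes build.
# Assumes `transformers` and a ROCm-enabled `bitsandbytes` are installed;
# the model name is a placeholder. For 4-bit (NF4), use
# BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4") instead.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # Int8 matmul path

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                  # placeholder model
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

inputs = tokenizer("ROCm 6.2 enables bitsandbytes", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```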

Improved vLLM support

ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding FP16/BF16 precision
for LLMs and FP8 support for Llama.
ROCm 6.2.0 adds support for the following vLLM features:

  • MP: Multi-GPU execution. Choose between MP and Ray using a flag. To select MP,
    use --distributed-executor-backend=mp. The default backend varies by commit and is subject to change.

  • FP8 KV cache: Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements.
    The QUARK quantizer currently only supports Llama.

  • Triton Flash Attention:

    ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is Triton, but you can change this
    setting using the VLLM_USE_FLASH_ATTN_TRITON=False environment variable.

  • PyTorch TunableOp:

    Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later.

For more information about enabling these features, see
vLLM inference.
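To make these switches concrete, here is a hedged offline-inference sketch using vLLM's Python API. The model name, parallelism degree, and prompt are placeholders, and the argument names follow upstream vLLM, so they may differ in the ROCm fork:

```python
# Sketch: vLLM offline inference on ROCm using the features described above.
# Model name, tensor_parallel_size, and the prompt are placeholder assumptions.
import os

# Triton Flash Attention is the default; set to "False" to fall back to
# Composable Kernel, per the environment variable named above.
os.environ.setdefault("VLLM_USE_FLASH_ATTN_TRITON", "True")

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",    # placeholder model
    tensor_parallel_size=2,              # multi-GPU execution
    distributed_executor_backend="mp",   # MP instead of Ray
    kv_cache_dtype="fp8",                # FP8 KV cache
)

outputs = llm.generate(["What is ROCm?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```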

ROCm maintains a vLLM branch for experimental features that are still undergoing performance, accuracy, and
correctness testing. These features include:

  • FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMMs for the shapes
    used in the model's execution. Only Llama is supported, because the QUARK quantizer currently supports only Llama.

  • Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention
    computation in large-scale models. This benefits all workloads in FP16 configurations.

To enable these experimental new features, see
vLLM inference.
Use the rocm/vllm branch when cloning the GitHub repo. The vllm/ROCm_performance.md document outlines
all the accessible features, and the vllm/Dockerfile.rocm file can be used to build a ROCm-enabled container image.

Enhanced performance tuning on AMD Instinct accelerators

ROCm is pretuned for high-performance computing workloads including large language models, generative AI, and scientific computing.
The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes
detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these
accelerators for optimal performance. For more information, see
AMD MI300X tuning guides and
AMD MI300A system optimization.

Removed clang-ocl

As of version 6.2, ROCm no longer provides the clang-ocl package.
See the clang-ocl README.

ROCm documentation changes

The documentation for the ROCm components has been reorganized and reformatted with a standard look and feel. This
improves the usability and readability of the documentation. For more information about the ROCm components, see
What is ROCm?.

Since the release of ROCm 6.1, key topics have been added to the documentation, and other topics have been
significantly improved, expanded, or both.

All ROCm projects are open source and available on GitHub. To contribute to ROCm documentation, see the
[ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
