
FEX-Emu, which enables the execution of x86 and x86-64 binaries on an AArch64 host, has been updated.



FEX-2403

Read the blog post at FEX-Emu's Site!

Welcome back to another new tagged version of FEX-Emu! This month we have quite a few important bug fixes and optimizations, so let's get right into
it!

Steam fix

As of Steam's February 27th update, there was a fairly major change to how Steam starts its embedded Chromium instance.
With this most recent change, it now runs inside the Steam Linux Runtime environment. In turn, Steam has disabled the sandbox feature of the
Chromium instance because it is incompatible. FEX was already disabling this sandbox by forcibly passing in the argument to disable it.

Chromium really didn't like the argument being passed in twice, and the duplicate was causing it to crash early. We have now removed our application
profile and let Steam configure the arguments as required.

As a side effect of Steam updating their version of Chromium, some users have noted that they are experiencing problems with GPU acceleration on
Raspberry Pi systems. This is seemingly a video driver problem and unrelated to the crash that was fixed. It is currently unknown if we can fix this
problem, as it is working fine on Tegra and Snapdragon systems.

Rootfs images updated

FEX's rootfs images have been updated to include the latest versions of Mesa, gfxreconstruct, and Renderdoc. The major change here is having Mesa
updated to 24.0 as the other two packages are mostly for developers.

Fix a potential hang on forking with memory allocations

We have fixed a known hang that occurs when a process is forking while another thread is allocating memory. This tended to show up as a hang when
running Proton applications. While this fixes one hang, we still have another one that happens sporadically and that we haven't tracked down. While
the occurrence is relatively rare, it's good to watch the process trees if the program is stuck waiting on a futex.

A bunch of CPU optimizations

As per usual, this last month has added a bunch of CPU optimizations! We have noted up to a 14% performance improvement in one benchmark and an
average of around 4% in Geekbench. We need to commend our developers for hammering out these optimizations; even a small optimization can have a
big impact on games that abuse a particular feature.

  • Use FlagM SETF8/16 for INC/DEC
  • Optimize LOCK DEC
  • Optimize ADC/SBC
  • Fuse add+cmn into adds
  • Misc other optimizations
  • Optimize less than 32-bit add and sub

Small timestamp counter scaling

Recently we found out that some games rely on an x86 CPU's RDTSC instruction operating at GHz frequencies. While making this assumption is not a
good idea, it is relatively common for x86 CPUs to have a really high-frequency cycle counter. Most laptop CPUs operate in the 1-2 GHz
range, while desktop CPUs can go up to 3 GHz in our testing.

Unreal Engine 5 has a new work graph system that spin-loops CPUs for a fixed number of cycles, expecting to not spin for very long. While this is
relatively okay at 1 GHz, since the spin is over very quickly, it starts encountering problems when an ARM CPU's cycle counter gets added to the mix.

The primary problem here is that all 64-bit Snapdragon processors ship with a fixed-rate 19.2 MHz cycle counter. This continues all the way to their
latest flagship, the Snapdragon 8 Gen 3. Additionally, other ARM devices we have tested, like the Nvidia Jetson ARM boards and Apple M1, also ship a
similarly low-frequency cycle counter. So while Unreal Engine will only spin-loop for a measly 1597 cycles, on an ARM board this takes ~51,000
nanoseconds but on an x86 PC it only takes 591 nanoseconds! This was causing games to burn CPU time unnecessarily and run slower than they should!
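
As a rough check of these figures, the wall-clock length of a fixed-cycle spin is just the cycle count divided by the counter frequency. The sketch below is illustrative arithmetic only. The function name is made up for this example, and the 31.25 MHz and 2.7 GHz rates are assumptions picked because they reproduce the ~51,000 ns and 591 ns figures; only the 19.2 MHz rate and the 1597-cycle spin come from the text.

```python
def spin_duration_ns(cycles: int, counter_hz: float) -> float:
    """Nanoseconds needed for `cycles` ticks of a counter running at `counter_hz`."""
    return cycles / counter_hz * 1e9


SPIN_CYCLES = 1597  # spin length quoted in the text

# Counter rates to compare. Only 19.2 MHz is stated in the text;
# the other two are assumed values for illustration.
COUNTERS = [
    ("19.2 MHz (Snapdragon)", 19.2e6),
    ("31.25 MHz (assumed ARM board)", 31.25e6),
    ("2.7 GHz (assumed x86 PC)", 2.7e9),
]

for label, hz in COUNTERS:
    print(f"{label:>30}: {spin_duration_ns(SPIN_CYCLES, hz):10.0f} ns")
```

Under these assumptions the same 1597-cycle spin lasts roughly 83 microseconds at 19.2 MHz, about 51 microseconds at 31.25 MHz, and about 591 nanoseconds at 2.7 GHz, so the slower the counter, the longer the game burns CPU in the loop.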

To compensate for these slow cycle counters on most ARM devices, we are now scaling the value we return to applications by multiplying it
by 128! This makes Snapdragon cycle counters behave more like a 2457 MHz cycle counter, but with a 128-cycle granularity. This improves
the FPS in Tekken 8 and will also improve performance in all other Unreal Engine 5 games. There may be other games affected as well!
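
The effect of the scaling can be sketched in a few lines. This is a minimal model rather than FEX's actual implementation: the names `TSC_SCALE`, `scaled_tsc`, and `effective_frequency_mhz` are hypothetical, and the 64-bit wraparound mask is an assumption based on RDTSC returning a 64-bit value.

```python
TSC_SCALE = 128           # multiplier described in the text
U64_MASK = (1 << 64) - 1  # assumption: the guest sees a 64-bit counter


def scaled_tsc(raw_counter: int) -> int:
    """Value reported to the guest for a raw host counter reading."""
    return (raw_counter * TSC_SCALE) & U64_MASK


def effective_frequency_mhz(host_hz: float) -> float:
    """Apparent counter frequency seen by the guest, in MHz."""
    return host_hz * TSC_SCALE / 1e6


# A 19.2 MHz host counter appears to tick at about 2457.6 MHz.
print(effective_frequency_mhz(19.2e6))

# Consecutive host ticks are 128 guest cycles apart: the granularity
# mentioned in the text.
print([scaled_tsc(tick) for tick in range(4)])
```

The guest never observes values between multiples of 128, which is the granularity trade-off: the counter looks fast, but its resolution is still that of the underlying host counter.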

As a side note, a 1 GHz cycle counter is now mandated by the ARMv8.6 and ARMv9.1 specs! So this problem will soon go away as new SoCs come on to the market.

Introduce ARM64EC static register allocation mapping

As part of the ongoing effort to support WINE's Arm64EC code, we have changed the order in which our registers get allocated to more closely match
what the Arm64EC ABI wants for register layout. Matching that layout means that when the JIT jumps out to some code, we shuffle less data
around, which gives a performance improvement. The Linux side of the code doesn't need this, so this only happens when building as a WINE module.

32-bit thunking improvements

This last month has had an exciting milestone for 32-bit thunking! We have landed support for thunking Wayland on 32-bit. This means that, with our
previously implemented Vulkan thunking, we can now run some games using Wayland plus Vulkan, with Zink thrown into the mix! In particular, we have
been able to test that Super Meat Boy works in this configuration! We still have more work to do before X11 and GLX work with 32-bit thunking, so
stay tuned!

Memory leak fix

FEX had an issue with long-running processes leaking memory. This showed up in applications that would start hundreds of threads and tear them down
over and over. Steam is one of these long-running processes, and it would starve the system of memory if left open overnight. This is because the
program spins up helper threads fairly aggressively and then shuts them down.

We have fixed one major memory leak, but we still have a few more to go before it's nailed down!
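
The general shape of this kind of leak, and of its fix, can be modelled in a few lines. This is a simplified sketch and not FEX's code: `ThreadRegistry` and `run_helper` are hypothetical names, and the 64 KiB buffer merely stands in for whatever per-thread state an emulator might keep. The point is that each thread's state must be released on every exit path, otherwise repeated create/teardown cycles accumulate dead entries.

```python
import threading


class ThreadRegistry:
    """Tracks per-thread state objects (a stand-in for emulator thread state)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._states = {}

    def register(self, state):
        with self._lock:
            self._states[threading.get_ident()] = state

    def unregister(self):
        with self._lock:
            self._states.pop(threading.get_ident(), None)

    def __len__(self):
        with self._lock:
            return len(self._states)


def run_helper(registry, work):
    """Run `work` on the current thread, always releasing its state afterwards."""
    registry.register(bytearray(64 * 1024))  # hypothetical per-thread state
    try:
        work()
    finally:
        registry.unregister()  # without this, every finished thread leaks its state


registry = ThreadRegistry()
for _ in range(10):  # repeatedly spin up and tear down helper threads
    threads = [
        threading.Thread(target=run_helper, args=(registry, lambda: None))
        for _ in range(20)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

print(len(registry))  # no state is left behind once all helpers have exited
```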

Syscall passthrough optimization

One important thing to be wary of when running games is syscall overhead. Every time an ioctl or other syscall is made, FEX can incur significant
overhead compared to running the application natively. Additionally if we are passing syscalls through to glibc helpers then this can add more
overhead and sometimes introduce bad behaviour.

This month we spent some time looking at how syscalls are handled when we know that we can pass the data directly to the kernel. This allows us to
more quickly add new system calls when the kernel adds them, and ensures they are as fast as possible. With this optimization in place, FEX now
directly emits small syscall handlers per syscall and jumps directly to the kernel if possible. This lowers CPU overhead for the most common syscalls,
thus removing emulation overhead. While FEX's syscalls were already fairly low overhead, this just improves the situation further!
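
The idea of a small dedicated handler per syscall can be modelled as follows. This is a conceptual sketch only: FEX emits native code, whereas this model uses Python closures, and the `PASSTHROUGH` table, `build_handlers`, and `FakeKernel` are hypothetical names invented for the example. The syscall numbers shown are the standard x86-64 and AArch64 Linux numbers for `read`, `write`, and `getpid`.

```python
from typing import Callable, Dict, List, Tuple

# x86-64 guest syscall number -> (AArch64 host syscall number, argument count).
# Only a few well-known entries are listed, purely for illustration.
PASSTHROUGH: Dict[int, Tuple[int, int]] = {
    0: (63, 3),    # read
    1: (64, 3),    # write
    39: (172, 0),  # getpid
}

Kernel = Callable[[int, Tuple[int, ...]], int]
Handler = Callable[..., int]


def build_handlers(kernel: Kernel) -> Dict[int, Handler]:
    """Create one tiny handler per syscall that forwards straight to `kernel`."""
    handlers: Dict[int, Handler] = {}
    for guest_nr, (host_nr, nargs) in PASSTHROUGH.items():
        # Bind host_nr and nargs as defaults so each handler keeps its own values.
        def handler(*args: int, _nr: int = host_nr, _n: int = nargs) -> int:
            return kernel(_nr, args[:_n])

        handlers[guest_nr] = handler
    return handlers


class FakeKernel:
    """Records calls instead of entering a real kernel."""

    def __init__(self) -> None:
        self.calls: List[Tuple[int, Tuple[int, ...]]] = []

    def __call__(self, nr: int, args: Tuple[int, ...]) -> int:
        self.calls.append((nr, args))
        return 0


kernel = FakeKernel()
handlers = build_handlers(kernel)
handlers[39]()              # guest getpid
handlers[1](1, 0x1000, 5)   # guest write(fd=1, buf=0x1000, count=5)
print(kernel.calls)
```

Because each handler already knows its host number and argument count, the dispatch path does no per-call lookup or argument inspection beyond a single table index, which is the kind of overhead the passthrough change is aimed at.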

Raw Changes

FEX Release FEX-2403

  • ASM
      • Another sign extend bug in #3421 (b7984e8)

  • Arm64
      • Stop moving source in atomic swap (98572b9)

  • Arm64Emitter
      • Introduce ARM64EC SRA mappings (6ec628f)

  • CMake
      • Define _M_ARM_64EC when building for ARM64EC (3e5694b)

  • FEXCore
      • Expose AbsoluteLoopTopAddress to the frontend (d4be2dc)
      • Add a frontend pointer to InternalThreadState (5769ffb)

  • FEXLoader
      • Allocate the second 4GB of virtual memory when executing 32-bit (0b34035)

  • FileFormatCheck
      • Fixes FD leak (0505b30)

  • InstCountCI
      • enable preserve_all (2f9449c)
      • add FMOD block (cc9c80d)
      • add The Witcher 3 block (60e8da0)

  • InstcountCI
      • Add a monster of a game block (f41674b)
      • Adds a vector addition loop from bytemark (44d1502)
      • Adds addressing limitations to instcountci (cf06799)

  • Linux
      • Converts passthrough syscalls to direct passthrough handlers (b74de53)
      • More safe stack cleanup for clone (9687ac5)
      • Make sure to destroy thread object when thread shuts down (a7c7fe4)

  • OpcodeDispatcher
      • Don't use AddShift with no shift (d24446e)

  • RedundantFlagCalculationElimination
      • fix missing NEG case (32a4abb)

  • Syscalls
      • Fix SourcecodeMap generation for GDB JIT integration (7e2f20c)

  • Misc
      • Removes steamwebhelper config (d66a83a)
      • Use SETF8/16 for 8/16-bit INC/DEC (ea7d169)
      • Library Forwarding: Update Vulkan definitions to v1.3.278 (2dd922c)
      • Optimize lock dec (009ae55)
      • Misc little opts (f27e224)
      • Fix reserving range check (4779fb7)
      • Update xxhash to v0.8.2 (139367d)
      • Optimize SBC (49e798a)
      • Update vixl (8c0d5c6)
      • Capture a 64-bit process trying to jump to 32-bit syscall handler (946c805)
      • Track unittest dependencies through to the custom target (118b8b2)
      • Adds a unittest for a bug from #3421 (aa9d7c5)
      • Optimize ADC (5f16f35)
      • Fixes zero register flag generation (0ef72bf)
      • Adds MGRR hottest block on render thread (49ca0e2)
      • Fuse add + cmn -> adds (d8a1868)
      • Moves SignalDelegator TLS tracking to the frontend (9b93495)
      • Moves JITSymbol allocation (59ec88f)
      • Fix instcountci (9025673)
      • Simplify CalculateAF (5378ae2)
      • Library Forwarding: Add support for 32-bit Wayland (66feea9)
      • Implement small TSC scaling (2bcd285)
      • Cleanup NZCV metadata (9c38332)
      • Fixes VDSO crash in 64-bit code (bbac014)
      • Library Forwarding: Allocate packed arguments on the guest stack if needed (9cab746)
      • Library Forwarding: Disable struct padding for packed arguments (b888bb5)
      • Optimize 8/16-bit adds/subs (4721825)

Release FEX-2403 · FEX-Emu/FEX