Software 42837 Published by

The FEX-EMU, which enables the execution of x86 and x86-64 binaries on an AArch64 host, has been updated.



Release FEX Release FEX-2306

Another interesting month of changes for FEX-Emu! While this release is shorter than last, this also only has a month of work rather than two. We had
some great work done this month, including a bunch of plumbing that most people won't notice. Let's see what changed!

Adds support for hardware TSO memory emulation prctl

Emulating the x86 memory model is the number one thing that slows down FEX emulation today. Apple Silicon supports this memory model in hardware which
is why Rosetta on MacOS can get amazing performance. With some  recent changes from the
Asahi developers, FEX can ask the hardware to enable the TSO emulation bit. If the kernel reports back that hardware TSO memory is enabled, FEX can
take a more lax approach to its memory emulation, getting an automatic speedup for Asahi systems.

Additionally not only is this a speed-up, it's required for correct emulation. When this feature isn't supported by the hardware, FEX needs to emulate
the memory model using atomics and LRCPC instructions. This absolutely demolishes performance so it is usually recommended to disable the emulation to
get "free" performance. This issue with this is that it can crash instability in the most awkward and peculiar of ways. We even found out in this last
month that Unity games with their complex buffer management are highly likely to crash due to old cachelines of data hanging around. The only fix is
to use emulate the memory model using hardware TSO flags or our atomic/LRCPC path. Sadly's ARM Cortex's hardware LRCPC implementation is barely any
faster than atomics.

Steamwebhelper crash fix

New beta versions of Steam has started relying on AT_EXECFN existing. FEX didn't previously emulate this auxv value which was causing it to crash. With this fixed, steamwebhelper is now working again.

More AVX work!

This has been a long time coming and we're almost there finally. After these changes FEX only needs to fix a few implementation bugs in the string
operations, and implement the XSAVE instructions before allowing AVX emulation. In addition to that, FEX is also almost able to supprot AVX2 with the
only instructions that need to be implemented is the gather load instructions!

Implements support for XGETBV

This is a fairly simple instruction as it lets the application query which CPU features are enabled. Necessary for an application to check before
enabling any AVX usage.

Handle PCMPESTRM, PCMPISTRM and AVX variants

These are the remaining string instructions that FEX has implemented. While mostly implemented there are still a couple of edge case behaviours that
aren't quite correct and just need to be fixed.

Implement support for deferring asynchronous signals

This change has been a long time coming to make FEX's JIT faster in the face of handling asynchronous signals. This issue is that FEX needs to enter
code regions that are effectively "uninterruptible" until it is complete. This is basically a reentrancy problem where a piece of code executing could
lock a mutex, or non-atomically updating a container's data, then when a signal occurs it will jump out of the code and potentially come back to this
corrupted state.

As an initial workaround to this problem, FEX would just disable all signals in each of these "signal-deferring regions." This had the overhead that
every region would have a system call going in to it and then another one coming out of it. If how frequently these regions happened was little then
it would be a non-issue; but as is commonly the case, FEX's JIT needs to be wrapped in this signal blocks. If a game is executing a bunch of code,
this means we can be doing thousands of additional system calls per second which adds up as direct overhead.

With this new change, FEX marks that it is in a signal-deferring region with some very cheap memory accesses and if a signal doesn't occur the
overhead is negligible. In the case that a signal does occur, it will get stored to a queue, FEX will finish its signal-deferring region, and then
come back to handle the signal.

This mostly works because asynchronous signals don't have guarantees about the timeliness of the signal being delivered. Sadly this can result in
signal queue depths being subtly incorrect but we are monitoring the situation to know if any game is affected. All in all this finally makes it so
FEX can be straced without being overwhelmed and improves stutter problems!

Grand Theft Auto 5 AVX fix

FEX was accidentally reporting support for BMI1 and BMI2 CPU instructions. These extensions have a requirement that AVX must be implemented for these.
This was causing Grand Theft Auto 5 to crash early trying to use AVX. We will now only report these extensions if emulated AVX is supported, which fixes this game.

Make vfork wait for the process to exit

FEX's previous implementation of vfork actually behaved like fork. The difference between these two syscalls is fairly subtle. In the case of vfork, the parent process will end up sleeping until the child either exits or executes execve.
We were instead treating this like a fork, where the parent continues immediately without waiting. While no known issues were encountered, it is good
to ensure this behaviour is correct for future work.

Getdents optimization

This classic syscall is used for querying directory contents, FEX needs to emulate this syscall since 64-bit applications now use a new syscall called
getdents64. FEX's original implementation was fairly slow due to a misunderstanding as to how this syscall worked. It would create a temporary working
buffer and copy data around a couple of times. With the new implementation it is able to use the buffer provided by the application and doing some
minor fixups to make the overhead fairly light now. This improves performance when an application is doing heavy folder scanning, which mostly means
it improves Proton startup time.

Minor optimizations

There were a handful of minor optimizations that improve performance so minorly that it falls within noise, but is nice to have.

Optimize ARM64 thunk trampolines

This is a very small optimization that changes an indirect load in to a PC relative load, removing a single data dependency.

Minor x87 FCMOV optimization

FEX was duplicating a mask from a GPR in to a vector register using two instruction and now it only uses one instruction.

Optimize ADC/ADD OF flag calculation

This was a small mistake where a bitwise negate was using two instructions instead of one.

Optimize EFLAG unpacking

Each time FEX was unpacking the EFLAG register it was using four instructions per bit of the flag. This has now been improved to only two. Cutting
flag unpacking to 82% of its original size in some edge cases.

Supported emulated Linux kernel version up to 6.2

FEX used to max out the reported kernel version up to 5.18. Now we can report up to 6.2 with this change. 6.3 is going to be harder since it
introduces a new prctl that FEX needs to work around.

Video game showcase

As said previously about Unity engine games having issues without TSO emulation. Here is a clip of Hollow Knight running under FEX full speed on a
Lenovo X13s. Even with the overhead of emulating the x86-TSO memory model, this game runs remarkably well. With x86-TSO emulation disabled this game
would have crashed a few seconds in.

Raw Changes

  • AOTIR

  • Stop passing a mutex around. It's already guarded ( f47caf4)

  • ARM64

  • Fixes SRA disabled codepath ( dc65a5e)

  • CPUID

  • Only enable BMI1 and BMI2 if AVX is supported ( cc7a56b)

  • Context

  • Remove debug namespace ( 5be798e)

  • ELFCodeLoader

  • Fixes missing AT_EXECFN ( 00dc373)

  • FEXConfig

  • Removes Emulated CPU cores option ( 1a9b6a8)

  • FEXCore

  • Implements support for xgetbv ( 737f917)

  • Support Wine syscalls ( 77e8be1)

  • Convert Core and Telemetry over to fextl::file::File ( 5674d3a)

  • Adds support for hardware x86-TSO prctl ( ed69eb9)

  • FEXLoader

  • Allow simulated kernel version up to 6.2 ( ada226b)

  • FEXRootFSFetcher

  • Support rolling release distros ( 9473025)

  • IRDumper

  • Fixes ssa number in arguments. ( 0f4a5ed)

  • InstallFEX

  • Updates helper install script for Ubuntu 23.04 ( 02f15f4)

  • Linux

  • Make vfork act more similar to how it should. ( 95b7592)

  • OpcodeDispatcher

  • Optimize ADC/ADD OF flag calculation ( 5b58082)

  • Optimize EFLAG unpacking ( 69181d4)

  • Handle PCMPESTRM/VPCMPESTRM ( 182010c)

  • Handle PCMPISTRM/VPCMPISTRM ( e924468)

  • Removes a warning that cropped up. ( e03b859)

  • Syscalls

  • Optimize getdents{64,} ( de0f398)

  • Telemetry

  • Save on signal terminate ( c9d1f0d)

  • Thunks

  • Mostly reverts  #2672 ( 6017a91)

  • Optimize ARM64 trampoline ( ce4e991)

  • Tools

  • Removes visual debugger ( 09997cf)

  • X87

  • Super minor FCMOV optimization ( 4e01452)

  • Misc

  • Implement support for deferred asynchronous signals ( 8bc33e9)

  • Wine TestHarnessRunner support ( 0ad6f98)

  • github

  • Updates some actions to v3 ( 52f64a0)

  • unittests

  • Adds step to remove stale SHM regions. ( 0a4bf10)



Release FEX Release FEX-2306 · FEX-Emu/FEX