
A new version of FEX, an x86 emulation frontend that runs x86 and x86-64 binaries on AArch64 hosts, has been released. The update fixes 3DNow! reciprocal precision on hardware that supports FEAT_RPRES, where the shared reciprocal implementation had inadvertently dropped 3DNow! results to 12-bit precision instead of the 14 or 15 bits those instructions require.

FEAT_RPRES raises the precision of ARM64 reciprocal estimate instructions from 8 bits to 12 bits, matching SSE and AVX. FEX wired up support for it at the end of 2023, and the Qualcomm Snapdragon Elite SoC and Apple's M4 now implement the extension. The release also fixes a long-running FEXServer background startup bug that left new users with a cryptic "Failure to setup client" message. The multiblock option is now enabled by default, letting the JIT compile more code at once for a performance gain. Several x86 SHA instructions (SHA1RNDS4, SHA1MSG2, and SHA256MSG2) are now implemented with roughly equivalent ARM SHA instructions. The only SHA instruction still unoptimized is SHA256RNDS2, which differs considerably from the two ARM instructions that cover its functionality.





The Mangohud FEX profile now shows sampling-based statistics that FEX exposes to external applications, making it easier to tell whether FEX is responsible for a game's performance. A Mangohud configuration option reveals details about FEX's JIT so they can be correlated directly with a game's FPS. The statistics cover SIGBUS events, SMC events, softfloat events, and JIT time. Using the feature requires building Mangohud with the new FEX option enabled, building FEX with the new sampling option enabled, and toggling a configuration option.

A Tracy backend has been added as a developer-focused timeline profiler for seeing where code spends its time. Tracy uses a ring buffer in user space, so its per-event overhead is usually lower than that of GPUVis, which relies entirely on ftrace. The raw change list touches many components, including Async, CPUBackend, CodeEmitter, CommonTools, Config, FEX, FEXCore, FEXServer, FEXServerClient, FEXpidof, FileManagement, JIT, Linux, OpcodeDispatcher, Profiler, Seccomp, and Library Forwarding.

The release also removes warnings about uninitialized variables, implements SIDT/LSL, and fixes Library Forwarding build problems on some platforms.

FEX-2503

Read the blog post at FEX-Emu's Site!

Here we are again, another month and some more cool changes with FEX. Let's dive in and see what has changed!

Fix 3DNow! reciprocal precision on FEAT_RPRES supporting hardware

This change is kind of fun due to the nature of how reciprocals work on modern x86 hardware. With SSE and AVX, reciprocal and reciprocal square root are two instructions that are designed by specification to only provide 12 bits' worth of precision for the initial estimate. If you want more precision than that, you use a Newton-Raphson refinement instruction, or eat the cost of some floating point division instructions.

While this is nice for x86, in ARM64 land our reciprocal instructions are limited to 8-bit precision. In order to emulate x86 precision we must ALWAYS do a refinement step, or eat the cost of the float division. For our initial implementation we had just eaten the cost of the float division and called it a day. As we were crying that we needed more precision, ARM decided to bless us with the FEAT_RPRES extension. This extension is perfect for our needs: it increases the precision of the reciprocal instructions to 12 bits, just like x86! So we wired this up at the end of 2023 with the hope that some hardware would support it. Today we finally have the Qualcomm Snapdragon Elite SoC and Apple's M4, which support this extension! This means those platforms get faster reciprocals for free.

In our haste to implement this, we had forgotten one key player in this story. 3DNow!, it turns out, is a special little extension that decided its reciprocals need to have 14-bit or 15-bit precision depending on the instruction. This doesn't work with RPRES, but because the implementations are shared in our JIT, we had inadvertently reduced precision on those platforms! Thankfully Paulo found this problem and hunted it down. Now for 3DNow! we use an estimate plus a Newton-Raphson refinement step in order to get the precision we need! Hopefully all those POD Gold players out there appreciate the precision improvements!
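To make the estimate-plus-refinement idea concrete, here is a minimal Python sketch of the general technique. It is not FEX's JIT code: `reciprocal_estimate` is a hypothetical stand-in that models a low-precision hardware estimate by truncating the exact result (real hardware uses lookup tables), and all function names are made up for illustration. The point is only that one Newton-Raphson step roughly doubles the number of accurate bits.

```python
import math


def reciprocal_estimate(x: float, bits: int) -> float:
    """Model a hardware estimate of 1/x with relative error below 2**-bits.

    Assumption: truncating the mantissa is only a stand-in for what a real
    estimate instruction does.
    """
    mantissa, exponent = math.frexp(1.0 / x)  # 0.5 <= |mantissa| < 1
    scale = 1 << (bits + 1)
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def newton_raphson_step(x: float, estimate: float) -> float:
    """One refinement step for 1/x: y' = y * (2 - x * y)."""
    return estimate * (2.0 - x * estimate)


def accurate_bits(x: float, approx: float) -> float:
    """Number of accurate bits, measured as -log2 of the relative error."""
    relative_error = abs(approx * x - 1.0)
    return math.inf if relative_error == 0.0 else -math.log2(relative_error)


if __name__ == "__main__":
    x = 3.0
    coarse = reciprocal_estimate(x, bits=8)
    refined = newton_raphson_step(x, coarse)
    print(f"estimate: {accurate_bits(x, coarse):.1f} bits")
    print(f"refined:  {accurate_bits(x, refined):.1f} bits")
```

If the estimate has relative error e, the refined value has relative error e squared, which is why a single step is enough to get from an 8-bit or 12-bit starting point past the 14 or 15 bits that 3DNow! expects in this simplified model.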

Fix FEXServer background startup

This has been a fairly long-running issue that was hard to reproduce. When new users were attempting to use FEX, they would try running their initial application and get a cryptic "Failure to setup client" message. It turns out that the way we detected that the FEXServer was ready would break in certain conditions when squashfs or erofs images were in use! We fixed this bug and now this problem is finally resolved. Hopefully no more confused new users.

Enable multiblock option by default

This feature has been around for quite a while, but due to some bugs in our JIT it wasn't ever quite safe to enable. Plus we had some major JIT performance concerns before optimizations in our JIT landed late last year. With a whole bunch of changes in place, we have now decided to turn this option on by default!

This option basically makes our JIT compile more code at once, allowing the JIT to stretch its legs a bit and gain some free performance. There may still be some small bugs in it, but without actually dogfooding it we'll never find them. Let us know how it goes and may your games run faster than ever before.
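As a rough illustration of what "compile more code at once" means, the toy Python sketch below compares compiling a single guest block with gathering every block reachable through direct branches. This is an assumption-laden model and not how FEX's JIT is written: the control-flow graph, the `max_blocks` limit, and the function name are all hypothetical.

```python
from collections import deque

# Hypothetical guest control-flow graph: block address -> direct branch targets.
GUEST_CFG = {
    0x1000: [0x1010, 0x1020],
    0x1010: [0x1030],
    0x1020: [0x1030],
    0x1030: [],
}


def blocks_to_compile(entry, cfg, multiblock, max_blocks=8):
    """Return the guest blocks handed to the JIT for one compilation request."""
    if not multiblock:
        # Single-block mode: compile only the block we are about to execute.
        return [entry]

    # Multiblock mode: breadth-first walk over direct branch targets.
    selected = [entry]
    queue = deque([entry])
    while queue and len(selected) < max_blocks:
        block = queue.popleft()
        for target in cfg.get(block, []):
            if target not in selected and len(selected) < max_blocks:
                selected.append(target)
                queue.append(target)
    return selected


if __name__ == "__main__":
    print([hex(b) for b in blocks_to_compile(0x1000, GUEST_CFG, multiblock=False)])
    print([hex(b) for b in blocks_to_compile(0x1000, GUEST_CFG, multiblock=True)])
```

In this model the single-block mode needs four separate compilation requests to cover the function, while the multiblock mode covers it in one, which also gives an optimizer more code to look at in a single pass.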

Optimize SHA instructions by using ARM SHA instructions

This month there were a few optimizations around the x86 SHA extension. Emulating this extension is a bit peculiar because the instructions between the two architectures don't quite match up. This requires a bit of noodling to figure out exactly how to get the ARM versions of the instructions to behave like the x86 instructions. With these optimizations in place, we now optimize SHA1RNDS4, SHA1MSG2, and SHA256MSG2 using roughly equivalent ARM instructions!

The only SHA instruction remaining that isn't optimized is SHA256RNDS2, which is quite a bit different than the two ARM instructions that match the functionality. If anyone wants a brain teaser to implement this optimization, this can be a fun target to try and optimize!
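For readers who want to see what one of these x86 instructions actually computes, here is a small Python reference model of SHA256MSG2, written from Intel's published pseudocode as best understood here. It shows only the x86 side and makes no claim about which ARM instruction sequence FEX emits. Representing each 128-bit register as a list of four 32-bit words (index 0 holding bits 31:0) is a choice made for this sketch.

```python
MASK32 = 0xFFFFFFFF


def ror32(value: int, amount: int) -> int:
    """Rotate a 32-bit value right by `amount` bits."""
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def small_sigma1(x: int) -> int:
    """SHA-256 message schedule function sigma1."""
    return ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10)


def sha256msg2(src1: list, src2: list) -> list:
    """Reference model of x86 SHA256MSG2 on two 4x32-bit registers.

    src1 holds the partially computed schedule words, src2 supplies W14
    and W15 in its upper two lanes.
    """
    w14, w15 = src2[2], src2[3]
    w16 = (src1[0] + small_sigma1(w14)) & MASK32
    w17 = (src1[1] + small_sigma1(w15)) & MASK32
    # The upper two results depend on the two words just computed.
    w18 = (src1[2] + small_sigma1(w16)) & MASK32
    w19 = (src1[3] + small_sigma1(w17)) & MASK32
    return [w16, w17, w18, w19]


if __name__ == "__main__":
    result = sha256msg2([0, 0, 0, 0], [0, 0, 1, 0])
    print([hex(word) for word in result])
```

The dependency of the upper two lanes on the lower two is part of what makes a one-to-one mapping between architectures awkward: an emulator has to line up operand order and lane layout before an equivalent host instruction can be used.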

Add Mangohud FEX profile sampling stats

When testing games running under FEX, it is sometimes difficult to understand if a game's performance is due to FEX getting in the way or something unrelated. This month we implemented a sampling-based stats mechanism that external applications can read to get some insight into what is happening. Primarily we have implemented this as a Mangohud config option that exposes details about FEX's JIT. This allows us to directly correlate FEX's JIT with a game's FPS, which is quite handy when running things.
In particular we expose a few data points:

  • SIGBUS events - Is the game hammering unaligned memory accesses?
  • SMC events - Is the game constantly invalidating code?
  • Softfloat events - Is the game hammering x87 or other transcendentals?
  • JIT time - Total time spent having FEX's JIT actually emitting code rather than executing

These stats can be seen in the following image, where we can see the game is executing roughly 33 million softfloat events per second, maxing out a CPU core and not able to hit 60FPS. This allows us to see that maybe we should try enabling the x87 reduced precision option to get some performance back.
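The per-second figures above come from sampling counters over time. The Python sketch below shows one way an external reader could turn two samples into rates. It is purely illustrative: the `Sample` fields and the assumption that the counters are cumulative are hypothetical and do not describe FEX's actual stats layout or Mangohud's code.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical snapshot of cumulative counters at one point in time."""
    timestamp_s: float
    sigbus_events: int
    smc_events: int
    softfloat_events: int
    jit_time_s: float


def rates(prev: Sample, curr: Sample) -> dict:
    """Convert two cumulative samples into per-second rates."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        "sigbus_per_s": (curr.sigbus_events - prev.sigbus_events) / elapsed,
        "smc_per_s": (curr.smc_events - prev.smc_events) / elapsed,
        "softfloat_per_s": (curr.softfloat_events - prev.softfloat_events) / elapsed,
        # Share of wall-clock time spent emitting code rather than running it.
        "jit_time_percent": 100.0 * (curr.jit_time_s - prev.jit_time_s) / elapsed,
    }


if __name__ == "__main__":
    before = Sample(0.0, 0, 0, 0, 0.0)
    after = Sample(2.0, 10, 4, 66_000_000, 0.5)
    for name, value in rates(before, after).items():
        print(f"{name}: {value:,.1f}")
```

With these made-up numbers the softfloat rate works out to 33 million events per second, the kind of value that would suggest trying the x87 reduced precision option.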

This does require some work to set up today. It requires building Mangohud with the new FEX option enabled, building FEX with the new sampling option enabled, and toggling a config option to turn it on. But we do invite enthusiasts that know what they're doing to enable these options and see if there are interesting stats in their games.

Add a Tracy backend

This option is completely developer focused, as making use of Tracy stats requires a developer digging into FEX's code to find optimizations. This is a new timeline profiler backend for developers to see where code is spending time. We have had GPUVis support wired up in FEX for quite a while, but Tracy is quite nice because the user interface is easier to handle. Additionally, Tracy's per-event overhead is usually lower than GPUVis's, since it uses a ringbuffer in userspace while GPUVis relies entirely on ftrace.
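To give a feel for why a userspace ringbuffer keeps per-event cost low, here is a toy Python ring buffer for profiling events. It is a generic sketch of the data structure, not Tracy's implementation, and the class and method names are invented for this example. Recording an event is just a slot write and a counter bump, with no system call on the hot path.

```python
class EventRingBuffer:
    """Fixed-size buffer that overwrites the oldest events when full."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._capacity = capacity
        self._written = 0  # total number of events ever recorded

    def record(self, name: str, timestamp_ns: int) -> None:
        """Store one event, overwriting the oldest slot once the buffer is full."""
        self._slots[self._written % self._capacity] = (name, timestamp_ns)
        self._written += 1

    def snapshot(self) -> list:
        """Return the retained events from oldest to newest."""
        count = min(self._written, self._capacity)
        start = self._written - count
        return [self._slots[i % self._capacity] for i in range(start, self._written)]


if __name__ == "__main__":
    events = EventRingBuffer(capacity=3)
    for index, name in enumerate(["compile", "dispatch", "syscall", "compile"]):
        events.record(name, timestamp_ns=index * 1000)
    print(events.snapshot())
```

A separate reader can drain such a buffer at its own pace, so the code being profiled pays only for the write itself.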

Raw changes

FEX Release FEX-2503

  • ArchHelpers

    • Arm64
  • ArgumentLoader

  • Async

    • Add framework for multiplexing IO on network sockets and other file descriptors ( 7579330)
  • CPUBackend

    • Move bool to end of JITCodeTail ( d22bd9c)
  • CodeEmitter

    • Adds missing assert checks ( 2435ebe)
    • Minor fixes to SVE fcvtz{u,s} ( da76023)
  • CommonTools

    • Removes ELFSymbolDatabase ( c37dc81)
  • Config

    • Correctly handle relative paths with portable ( 1fc8270)
    • Enable multiblock by default ( a49d30f)
  • FEX

    • Implements new sampling based stats ( 6a39a8d)
  • FEXCore

    • Keep PCMPISTRI arguments in vectors longer ( 1b144ba)

    • Optimize VPCMPISTRX implicit length calculation ( 2943cff)

    • JIT

      • Pass Softfloat arguments as vector registers ( 0d7a9f9)
    • Profiler

      • Implement support for JIT float fallbacks ( 9eccc01)
  • FEXServer

    • Fixes background startup ( dce9de2)
  • FEXServerClient

    • Fix searching for FEXServer ( 7a03681)
  • FEXpidof

    • Fixes searching for wine applications ( 717015b)
  • FileManagement

    • Throw a warning if /proc can't be opened ( a7c6fdb)
  • InstCountCI

  • JIT

    • Optimize packed float min/max if AFP is supported ( 4f46f55)
    • Optimize memory stores with zero ( b46e5d4)
  • Linux

    • Fixes PAGESIZE checks that could return <= 0 ( afa5ad5)
  • OpcodeDispatcher

    • Reuse PSHUFD shuffle mask for sha data shuffling ( 57ed466)
    • Emulate SHA1RNDS4 with ARM sha extensions ( dbb58d1)
    • Implements a few more pshufd masks ( beef9ee)
    • Implement support for SHA1MSG2 using SHA instructions ( 34a274d)
    • Implement SHA256MSG2 using SHA256 operation ( 02d7261)
    • Use offset for LRCPC2 more frequently ( a85cc85)
  • Profiler

    • Fixes potential crash due to uninitialized variables ( 4b17506)
    • Drop accidentally included debugging code ( b09b948)
    • Fixes zeroing of allocated slots. ( 54412f1)
    • Add Tracy backend ( 6651f9e)
  • Seccomp

    • Fix a couple minor things. ( f69ef86)
  • Various

    • Removes warnings about uninitialized variables ( 3964018)
  • Misc

    • Implement SIDT/LSL ( 81434cd)
    • Improve reciprocal estimate and tests ( b7f58e6)
    • Update vixl to ff82b3328c59fa4cf2fe36697b44eae15a650371 ( 530d3d8)
    • Library Forwarding: Add GUI for enabling use of individual host libraries ( caf15a2)
    • Remove unused code in various places ( 6b82664)
    • Library Forwarding: Fix build problems on some platforms ( d14b6e1)
    • Fixes a couple locations where a variable is copied when it could be moved ( b76a296)
    • Fixes some instances of auto usage with unintentional copy ( 42b0fbd)
    • Remove unused IMGui header and obsolete debugger documentation ( 4186b2a)

Release FEX-2503 · FEX-Emu/FEX