A new version of FEX-Emu, which runs x86 and x86-64 binaries on an AArch64 host, has been released. FEX-2409 brings a significant performance improvement, most of it attributable to changes in how the emulator handles flags.

Flags are bits that reflect the processor state, and both x86 and Arm have them, so FEX maps x86 flags onto Arm flags to reduce emulation overhead. The carry flag is the difficult case. After an addition it indicates whether the result overflowed, and it can feed into a further addition to implement 128-bit additions. For subtraction the two architectures disagree: x86 sets the carry flag when there is a borrow, while Arm sets it when there is no borrow. Because flags are mostly consumed after comparisons, which are subtractions, storing the inverse of the x86 carry flag usually wins. FEX now does exactly that, which speeds up typical workloads by a few percent. FEX's translations of address modes, push/pop, AVX load/stores, and more have also been optimized, and benchmarks are upwards of 10% faster than in the previous release. Additionally, the FEXConfig tool has been rewritten as a Qt application, improving its aesthetics, usability, and accessibility.





FEX-2409

Read the blog post at FEX-Emu's Site!

FEX-2409 is now released... with a big performance boost.

I'm tired, carry me.

Little differences between x86 and Arm can cause big performance penalties for emulators. None more so than flags.

What are those?

Flags are bits representing the processor state. For example, if an operation results in zero, the "zero flag" is set. Both x86 and Arm have flags, so for emulating x86 on Arm, we map x86 flags to Arm flags to reduce emulation overhead. That is possible because x86 and Arm have similar flags. By contrast, architectures like RISC-V lack flags, slowing down x86-on-RISC-V emulators.

Many arithmetic operations set flags. Programs can then conditionally jump ("branch") according to the flags. On x86, the flags are thus the building blocks of if statements and loops. To check if two variables are equal, x86 code subtracts them and checks the zero flag. To check if one variable is less than another, x86 code subtracts and checks the negative flag. This pattern -- subtracting, setting flags, and discarding the actual result -- is so common that it has a special instruction: CMP ("CoMPare").
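To make the compare pattern concrete, here is a minimal Python sketch of what a CMP-style operation computes. The helper name `cmp_flags` and the 64-bit width are assumptions for illustration, not FEX code. It models only the zero and negative flags; real x86 signed comparisons also consult the overflow flag, which this sketch ignores.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit register width


def cmp_flags(a: int, b: int) -> dict:
    """Subtract b from a, keep only the flags, and discard the result."""
    result = (a - b) & MASK64
    return {
        "zero": result == 0,            # set when the operands are equal
        "negative": bool(result >> 63)  # top bit of the 64-bit result
    }


print(cmp_flags(7, 7))  # operands equal: zero flag set
print(cmp_flags(3, 7))  # 3 - 7 wraps around: negative flag set
```

A conditional branch would then inspect one of these bits instead of the subtraction result itself.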

If the story ended here, emulation would be easy. Unfortunately, we need to talk about the carry flag.

After an addition, the carry flag indicates if the result overflowed. Programs can then check the carry flag to detect overflows. The flag can also be input to another addition to implement 128-bit additions.
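As a rough illustration of chaining the carry flag, the following Python sketch builds a 128-bit addition out of two 64-bit additions. The function names `add64` and `add128` are made up for this example; on real hardware the second step would be an add-with-carry instruction.

```python
MASK64 = (1 << 64) - 1


def add64(a: int, b: int, carry_in: int = 0):
    """Add two 64-bit values plus a carry-in; return (result, carry_out)."""
    total = a + b + carry_in
    return total & MASK64, total >> 64


def add128(a_lo: int, a_hi: int, b_lo: int, b_hi: int):
    """Add two 128-bit values given as (low, high) 64-bit halves."""
    lo, carry = add64(a_lo, b_lo)             # low halves set the carry
    hi, carry_out = add64(a_hi, b_hi, carry)  # high halves consume it
    return lo, hi, carry_out


# The low half overflows, so the carry propagates into the high half.
print(add128(MASK64, 0, 1, 0))
```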

Subtractions are similar. In hardware, subtractions are additions with an operand negated. Because they are additions in hardware, subtractions set the carry flag. Precisely how is the carry flag defined for subtraction? There are two competing conventions.

The first sets the flag when there is a borrow, by mathematical analogy with addition. x86 uses this "borrow flag" convention, as it seems more natural.

The second option sets the flag when there is not a borrow. Isn't that backwards? It turns out that adding a (two's complement) negated operand overflows exactly when the subtraction does not borrow. This "true carry" convention matches actual hardware behaviour, while the "borrow" x86 convention requires extra gates to invert carry. Arm uses the "true carry" convention to save a few gates.
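The relationship between the two conventions can be checked with a small Python sketch. It assumes 64-bit unsigned operands and models the hardware subtraction as `a + NOT(b) + 1`; the function name `sub_carry_conventions` is hypothetical.

```python
MASK64 = (1 << 64) - 1


def sub_carry_conventions(a: int, b: int):
    """Return (borrow_flag, true_carry) for the 64-bit subtraction a - b."""
    # "Borrow flag" convention (x86): set when b is larger than a.
    borrow = a < b
    # "True carry" convention (Arm): carry out of a + NOT(b) + 1,
    # which is how the subtraction is modelled as an addition here.
    total = a + (~b & MASK64) + 1
    true_carry = (total >> 64) == 1
    return borrow, true_carry


for a, b in [(5, 3), (3, 5), (0, 0), (0, MASK64)]:
    borrow, true_carry = sub_carry_conventions(a, b)
    print(a, b, borrow, true_carry)
```

For every pair the two values come out as opposites, which is the inversion an emulator has to account for.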

Which convention should FEX use?

We could store the x86 carry flag in the Arm carry flag. Unfortunately, that requires an extra instruction after each subtraction to invert carry to get the borrow flag.

The counter-intuitive alternative is storing the opposite of the x86 flag. That requires an extra invert after every addition, but it eliminates the invert after subtraction.
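The two storage choices can be sketched in Python as follows. This is a toy model under stated assumptions, not FEX's implementation: `host_carry` stands in for the Arm carry flag after an add or subtract, `guest_carry` for the x86 carry flag, and `needs_invert` (a hypothetical helper) reports whether the host flag has to be flipped to hold the chosen representation.

```python
MASK64 = (1 << 64) - 1


def host_carry(op: str, a: int, b: int) -> bool:
    """Arm-style carry: overflow for 'add', 'no borrow' for 'sub'."""
    if op == "add":
        return (a + b) > MASK64
    return a >= b


def guest_carry(op: str, a: int, b: int) -> bool:
    """x86-style carry: overflow for 'add', borrow for 'sub'."""
    if op == "add":
        return (a + b) > MASK64
    return a < b


def needs_invert(op: str, a: int, b: int, store_inverted: bool) -> bool:
    """True if the host carry must be flipped to match the stored value."""
    # Stored value is the guest carry, or its opposite if store_inverted.
    stored = guest_carry(op, a, b) != store_inverted
    return host_carry(op, a, b) != stored


for store_inverted in (False, True):
    for op in ("add", "sub"):
        print(store_inverted, op, needs_invert(op, 3, 5, store_inverted))
```

Storing the flag directly costs an invert on subtraction only, and storing it inverted costs an invert on addition only.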

Either we pay after additions or after subtractions. Which should we pick?

While addition is common, using the flags from an addition is not. Flags are typically used with comparisons, which are subtractions. Therefore, the inverted convention usually wins. This month, Alyssa adjusted FEX to invert carries, speeding up typical workloads by a few percent.
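The trade-off can be illustrated with a toy cost count in Python. The instruction mix below is invented purely for illustration and is not measured from any workload, and the model assumes an invert is only paid when the carry is actually read afterwards. `count_inverts` is a hypothetical helper.

```python
def count_inverts(trace, store_inverted: bool) -> int:
    """Count extra carry inversions for a trace of (op, carry_is_read) pairs."""
    # Direct storage pays after subtractions, inverted storage after additions.
    costly_op = "add" if store_inverted else "sub"
    return sum(
        1 for op, carry_is_read in trace
        if carry_is_read and op == costly_op
    )


# Made-up mix: many compares whose flags are read, many additions whose
# flags are ignored, and one addition whose carry is consumed.
trace = [("sub", True)] * 8 + [("add", False)] * 5 + [("add", True)]

print("direct storage:  ", count_inverts(trace, store_inverted=False))
print("inverted storage:", count_inverts(trace, store_inverted=True))
```

With a mix like this, the inverted convention pays far fewer inversions, which matches the reasoning above.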

After tackling the carry flag, Alyssa optimized FEX's translations of address modes, push/pop, AVX load/stores, and more. Overall, benchmarks are upwards of 10% faster since the last release.

A Qt change

What about more user-visible changes? If you use the FEXConfig tool to configure the emulator, you're in for a treat. While it works, this ImGui-based tool isn't exactly known for its convenience. In between his work optimizing the [redacted] out of FEX's [redacted], Tony rewrote FEXConfig as a simple Qt application, improving aesthetics, usability, and accessibility all in one go. 

Besides look and feel, we've polished first-time setup for logging, library forwarding, and RootFS images. We've also made tweaking various emulation settings a bit nicer. Users of our Ubuntu PPA can simply update to unlock these improvements without any further action.

But with so much optimization, who needs speedhacks anymore?

Raw Changes

FEX Release FEX-2409

  • ARM64EC
      • Always use the CPU area context for BeginSimulation (cc589ba)
      • Fix ContextFlag member tests (47b3637)

  • AVX128
      • Fixes 256-bit float compares (2478abb)
      • Fixes incorrect size usage in AVX128_Vector_CVT_Int_To_Float (a4db585)

  • Arm64
      • Implement support for strict in-process split-locks (74e95df)
      • Ensure 256-bit operations always assert without 256-bit SVE (90f7cc9)
      • Fixes SIGBUS handler for FEAT_LRCPC (2357253)
      • Allow directly correlating an ARM register back to an x86 register (92c951c)
      • Handle backpatching in a thread-safe manner (226f5e2)

  • CMake
      • Adds an AArch64 cross-compile toolchain file (06497fd)
      • Don't install binfmts for MinGW builds (f009a00)

  • CPUID
      • add missing Apple core part numbers (8296bfc)

  • ConstProp
      • stop pooling inline constants (5013b8a)

  • DeadStoreElimination
      • handle PF/AF invalidate (03832b2)

  • External
      • Update robin-map from 1.2.1 to 1.3.0 (96055cb)
      • Update fmt from 10.1.1 to 11.0.2 (f1d7879)

  • FEX
      • Moves sigreturn symbols to frontend (4baeffe)

  • FEXCore
      • Dynamically scale TSC (46a2a06)
      • Splits up atomic enablement checks (689b461)
      • Stop installing static library (4abac0c)
      • Disable vixl linking if vixl disasm or simulator is disabled (5e706df)

  • FEXInterpreter
      • Support portable installs (9336e35)

  • FEXLoader
      • don't install FEXUpdateAOTIRCache (1e1bcc4)

  • FEXQConfig
      • Add strict split-lock option (8fe1e95)

  • FEXQonfig
      • Fix minor saving/loading quirks (e234e11)
      • Minor followup changes (c74df6a)

  • FEXRootFSFetcher
      • Allow UI override through options (5b65f30)

  • Frontend
      • short-circuit code generation on invalid instructions with multiblock (92ddc00)

  • HostFeatures
      • Moves MIDR querying to the frontend (fbf62f1)
      • Removes vixl usage (1caa31c)
      • fix clang reformat constantly with missing comment (86e5e1a)

  • IR
      • fix scalar FMA tied sources (a66fac6)

  • InstCountCI
      • include x86 instruction count (50a9cea)

  • InstructionCountCI
      • explicitly enable flagm for multiinst (33558e6)

  • LinuxEmulation
      • Implement support for seccomp (b368223)
      • Moves guest signal frame generation to its own file (d9544e7)

  • LinuxSyscalls
      • Adds missing header (4f40416)
      • Implements less invasive assertion only EFAULT handlers (ce88f5f)
      • Some minor cleanups (27cd399)

  • OpcodeDispatcher
      • Convert more template handlers to Bind handlers (2829ad5)
      • Allow x86 code to read CNTVCT on ARM64EC (e6aa268)
      • fix tso checks (2a170cf)
      • Remove old bad assumption in INC/DEC (c634c53)

  • SignalDelegator
      • Refactor how thread local data is stored (114112a)
      • Changes AFmt to ERROR_AND_DIE_FMT (f897579)
      • Make sure to defer a signal if the guest signal mask desires (8617150)

  • SpinWaitLock
      • Update comment about WFE spurious wakeups (894aaa9)

  • Thunks
      • Removes global static initializer (b4a67a6)

  • VDSO
      • Fixes a pretty nasty bug where we were never using the host VDSO (0754aff)

  • WOW64
      • Improve exception handling (8e1695a)

  • Windows
      • Load per-application configs (49b40b7)
      • Fix missing pragma and license text (e415b94)
      • Fill in per-core MIDR information (7a9eb01)
      • Fix some file operations on actual Windows (b6cb897)
      • Implement CreateDirectoryW CRT function (eadb502)
      • Install libraries to CMAKE_INSTALL_LIBDIR (21f9841)
      • Support CPU feature detection from ID registry keys (97c229d)

  • Misc
      • More EFAULT handlers (62e1767)
      • Add Qt-based config editor (3020a0d)
      • Install GDBSymbol integration in the correct location (ac814ac)
      • Eliminate a move in 64-bit umul (2039950)
      • Adds more syscall memory access checks (b526c60)
      • Fix multiblock on ARM64EC (812224a)
      • Optimize AXFLAG-less systems (42f2851)
      • Optimize adc x, 0 (43cd897)
      • Optimise test al, 1 (0e5f5a2)
      • Optimize test (cfa2ad8)
      • Rearrange SRA to let us coalesce cmpxchg moves (877b2f4)
      • Optimize AVX load/store with ldp/stp (5ac7d5d)
      • Optimize BTC (a7138f2)
      • Improvements from bytemark "huffman" (99afd87)
      • binfmt_misc: Support systemd binfmt_misc (d2c82ba)
      • Library Forwarding: Follow up from #3964 (e5149fb)
      • Install libraries in the correct location (abbd655)
      • Clean up our JSON dependencies (00ef1ba)
      • small optimizations for returns (933c65d)
      • Support cooperative suspend on ARM64EC (df0ecad)
      • Add a hack for multiple destinations & make good use of it (aa5d2ff)
      • Little opcodedispatcher optimizations (6f43c8f)
      • Invert carry flag internally (8aa7d1a)

  • x32
      • Signals
          • Fixes bug in the sigqueue syscalls (0416950)

Release FEX-2409 · FEX-Emu/FEX