FEX-2407
Read the blog post at FEX-Emu's Site!
I hope you're ready to game, because your Arm system just got support for AVX. And AVX2. And FMA3. And F16C. And BMI1. And BMI2. And VAES. And VPCLMULQDQ.
Yeah, we've been busy.
We did a little team-building exercise this month: "bring up AVX on 128-bit hardware in a week". Now our team is built, and AVX games run on FEX.
AVX on 128-bit Arm
Computers traditionally perform one operation at a time. The hardware decodes an instruction, evaluates the operation on a single pair of numbers, and repeats for the next instruction. In mathematical terms, the instructions operate on scalars.
That design leaves performance on the table.
Many programs repeat one operation many times with different data. Modern instruction sets exploit that repetition. A single "vector" instruction can operate on multiple pieces of data at once. Programs will perform the same amount of arithmetic overall, but there are fewer instructions to decode and the arithmetic is more predictable. That enables more efficient hardware.
A "scalar" instruction adds a pair of numbers; a "vector" instruction adds multiple pairs. How many pairs? That is, what length is the vector?
That's a design trade-off. Increasing the vector length decreases the number of instructions we need to execute while increasing the hardware cost. Supporting large vectors efficiently requires a large register file and many arithmetic logic units. Besides, there are diminishing returns past a certain vector length.
There is no one-size-fits-all vector length. Different instruction sets make different choices. In x86, the SSE instruction set uses 128-bit vectors, while AVX and AVX-512 instructions support 256-bit and 512-bit vectors respectively. For Arm, the traditional ASIMD (NEON) instructions use 128-bit vectors. Depending on the specific Arm hardware implementation, the flexible new SVE instructions can use either 128-bit or 256-bit.
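To make the trade-off concrete, here is a small illustrative model (not FEX code) that counts how many add instructions it takes to process an array at each vector length. The 32-bit element size and the function names are assumptions chosen for the example.

```python
import math

LANE_BITS = 32  # assumed element size for this example


def instructions_needed(num_elements, vector_bits):
    """Count add instructions when each one processes vector_bits of data."""
    lanes = vector_bits // LANE_BITS
    return math.ceil(num_elements / lanes)


def instruction_counts(num_elements=1024):
    """Compare a scalar add with the vector lengths mentioned in the text."""
    widths = {"scalar": 32, "SSE / ASIMD": 128, "AVX": 256, "AVX-512": 512}
    return {name: instructions_needed(num_elements, bits)
            for name, bits in widths.items()}
```

Calling instruction_counts() shows the instruction count halving each time the vector length doubles. The model deliberately ignores the hardware cost side of the trade-off.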
For performance, we try to translate each x86 instruction to an equivalent arm64 instruction. There's no perfect 1:1 correspondence, but we can get close. For vector instructions, we translate 128-bit SSE instructions into equivalent 128-bit ASIMD instructions.
In theory, we can do the same for AVX, mapping 256-bit AVX instructions to 256-bit SVE instructions. Mai implemented that last year, speculating that their work would enable AVX on future 256-bit SVE hardware.
...
That hardware never came. Some recent hardware supports SVE, but only at 128 bits. Other hardware lacks SVE entirely and supports only ASIMD. We had a shiny AVX-256 implementation with no 256-bit hardware to use it with.
Our position remains that efficient AVX emulation requires 256-bit SVE. Unfortunately, many games today have a hard requirement on AVX. We want to let you play those games on your Arm devices, so we need to plug our noses and implement 256-bit AVX on 128-bit hardware.
The idea is simple; the implementation is not. To translate a 256-bit instruction, we decompose it into two 128-bit instructions operating on each half of the 256-bit vector. In effect, we partially "undo" the vectorization.
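The decomposition can be sketched in a few lines. This is a toy model, not FEX's implementation: vectors are Python lists of 32-bit lanes, and add_128 and add_256_decomposed are hypothetical names standing in for one host instruction and one emulated guest instruction.

```python
MASK32 = 0xFFFFFFFF  # lanes wrap around like 32-bit integers


def add_128(a, b):
    """Model of one 128-bit vector add: four independent 32-bit lanes."""
    assert len(a) == len(b) == 4
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def add_256_decomposed(a, b):
    """Emulate a 256-bit add (eight 32-bit lanes) as two 128-bit adds."""
    assert len(a) == len(b) == 8
    low = add_128(a[:4], b[:4])    # lower half of the 256-bit vector
    high = add_128(a[4:], b[4:])   # upper half of the 256-bit vector
    return low + high
```

Lane-wise operations like this one split cleanly because no lane depends on another. Instructions that move data between the two halves do not split this neatly, which is part of why the implementation is harder than the idea.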
This plan has a gaping hole: the register file. Arm has more general purpose registers than x86, so we statically assign each x86 register to an Arm register. If we didn't, accessing certain x86 registers would require slow memory accesses. All efficient x86-on-Arm emulators therefore statically map registers.
This scheme unfortunately fails with AVX emulation. Our Arm hardware has 128-bit vectors, but we're emulating 256-bit AVX vectors. We would need twice as many Arm vector registers as x86 vector registers, and we don't have enough.
Still, running AVX games suboptimally is better than not running at all. Yes, we're out of registers -- we'll just have to keep some in memory. The assembly isn't pretty, but it works well enough.
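A rough sketch of the register pressure follows. x86-64 AVX has 16 ymm registers and arm64 has 32 vector registers, so mapping both halves of every ymm register would use all 32 and leave nothing for temporaries. The layout below, with low halves in host registers and high halves in memory, is an assumption made for illustration and is not taken from FEX's source.

```python
MASK32 = 0xFFFFFFFF
X86_YMM_REGS = 16      # ymm0..ymm15 in 64-bit mode
ARM_VECTOR_REGS = 32   # v0..v31 on arm64


def host_registers_needed():
    """Each 256-bit guest register needs two 128-bit host registers."""
    return X86_YMM_REGS * 2


def add_128(a, b):
    """Model of one 128-bit add over four 32-bit lanes."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


class SplitRegisterFile:
    """Hypothetical layout: low halves in registers, high halves in memory."""

    def __init__(self):
        self.low = [[0] * 4 for _ in range(X86_YMM_REGS)]
        self.high_in_memory = [[0] * 4 for _ in range(X86_YMM_REGS)]
        self.loads = 0
        self.stores = 0

    def _load_high(self, reg):
        self.loads += 1  # every access to a high half costs a memory load
        return list(self.high_in_memory[reg])

    def _store_high(self, reg, value):
        self.stores += 1  # writing a high half costs a memory store
        self.high_in_memory[reg] = list(value)

    def add_256(self, dst, src1, src2):
        # Low half: register-to-register, no memory traffic.
        self.low[dst] = add_128(self.low[src1], self.low[src2])
        # High half: load both sources, add, store the result.
        a = self._load_high(src1)
        b = self._load_high(src2)
        self._store_high(dst, add_128(a, b))
```

In this model a single emulated 256-bit add costs two loads and one store on top of the two 128-bit adds, which is the kind of overhead the text describes as suboptimal but workable.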
Building on Mai's SVE-256 implementation of AVX, Ryan whipped together a version supporting both SVE-128 and ASIMD. That means it should work on all arm64 devices, all the way back to ARMv8.0.
F16C, FMA, and more
AVX isn't the only x86 extension new games require. After beating AVX, x86 has a post-AVX questline for emulator developers, with extensions like F16C. Fortunately, nothing could scare Ryan and Mai after AVX, and they tackled the new extensions without a hitch.
Speeding up translation
Dynamically assigning AVX registers means we can't translate instructions directly and expect good results. While we don't need a full optimizing compiler, we do need enough intelligence for basic inter-instruction optimizations.
FEX has had some optimizations since day one, but the priority has always been on bringing up new games. There's been even less focus on the translation time -- not how fast the generated arm64 code executes but how fast we can generate the arm64 code. Translation overhead contributes to slow loading screens and in-game stutter, so while it flies under the issue radar, it does matter.
So, Alyssa optimized the FEX optimizer this month by merging compiler passes. That both simplifies the code and speeds up translation. Since the start of June, we've reduced translation time by 10%... and more optimization is coming.
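As a generic illustration of why merging passes helps, the sketch below runs two simple peephole rewrites first as separate walks over the instructions and then as one combined walk. The instruction format and both rewrite rules are invented for this example and are not FEX's actual IR or passes.

```python
def simplify_add_zero(inst):
    """Rewrite 'add dst, src, 0' into a plain move."""
    op, dst, src, imm = inst
    if op == "add" and imm == 0:
        return ("mov", dst, src, None)
    return inst


def simplify_mul_one(inst):
    """Rewrite 'mul dst, src, 1' into a plain move."""
    op, dst, src, imm = inst
    if op == "mul" and imm == 1:
        return ("mov", dst, src, None)
    return inst


def run_separate_passes(ir):
    """Two passes: the instruction list is walked twice."""
    visits = 0
    after_first = []
    for inst in ir:
        visits += 1
        after_first.append(simplify_add_zero(inst))
    after_second = []
    for inst in after_first:
        visits += 1
        after_second.append(simplify_mul_one(inst))
    return after_second, visits


def run_merged_pass(ir):
    """One merged pass: each instruction is visited once for both rewrites."""
    visits = 0
    out = []
    for inst in ir:
        visits += 1
        out.append(simplify_mul_one(simplify_add_zero(inst)))
    return out, visits
```

Both versions produce the same output, but the merged pass touches each instruction once instead of twice. Merging is only this simple when the rewrites are independent per instruction, as they are in this toy case.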
Happy gaming :-)
FEX Release FEX-2407
ARM64
Adds new FMA vector instructions ( 00cf8d5)
AVX128
Fixes vmovlhps ( e2d4010)
Minor optimization to 256-bit vpshufb ( 5821054)
Fixes scalar FMA accidentally using vector wide ( 4626145)
Minor optimization to vmov{l,h}{ps,pd} ( cf24d3c)
Some quick bugfixes ( 739ac0f)
Enable all the things ( e519bf5)
Fixes wide shifts ( a031a49)
F16C support ( 4d821b8)
Implement support for gathers ( 6226c7f)
FMA3 ( 7d939a3)
fix VPCLMULQDQ ( 5da205d)
More instructions Part 4 ( 77aaa9a)
Fix vmovntdqa failing to zero upper 128-bits ( 7ff9622)
Fixes SSE4.2 string compare instructions ( dce1b24)
More instructions Part 3 ( 8c751d7)
More instructions ( ddb9f6d)
More various instructions ( be8ff9c)
Some pun pickles, moves and conversions ( 7bbbd95)
Move moves! ( f489135)
Arm64
Minor VBSL optimization with SVE128 ( 2e6b08c)
Remove contiguous masked element optimization ( dc44eb4)
Implement support for emulated masked vector loadstores ( 3d26e23)
Arm64Emitter
drop out of date comment ( 825d2c9)
BMI2
Ensure rorx immediate masks by operation size correctly. ( 54a1f7d)
CMake
Add a clang version check ( 5d67223)
CPUID
Oops, forgot to enable AVX2 ( 58e949e)
Update labeling on some reserved bits ( 2da1e90)
CodeEmitter
Fixes vector {ldr,str}{b,h} with reg-reg source ( 2e84f21)
CoreState
Move InlineJITBlockHeader to the start of the struct ( 933d622)
Externals
Update vixl submodule ( b092b7a)
FEX
Consolidate JSON allocators and fix 3691 ( 643e964)
FEXConfig
Clear up TSO emulation string ( 8d92902)
FEXCore
remove very out-of-date optimizer docs ( 1a0d135)
Fixes address size override on GPR sources and destinations ( 53b1d15)
Implement AVX reconstruction helpers ( 6edf461)
Disentangle the SVE256 feature from AVX ( 13ebfb1)
Fixes Call with 32-bit displacement and address size override ( e4ff3da)
FEXGetConfig
Support the ability to get TSO emulation facts ( 2e009be)
Frontend
Expose AVX W flag ( 2e5fa1e)
HostFeatures
Always disable AVX in 32-bit mode to protect from stack overflows ( e99e252)
Work around Qualcomm Oryon RNG errata ( df96bc8)
IR
rename _VBic -> _VAndn ( 67e1ac0)
InstCountCI
add segment register cases ( e7bdb86)
explicitly disable AFP everywhere ( 7c7d767)
JIT
delete silly assert ( 61ff1b3)
Linux
Calculate cycle counter frequency for cpuinfo ( 76f3391)
LinuxEmulation
Add a helper for getting the ThreadStateObject from CPU frame ( 89b05a2)
OpcodeDispatcher
Implement support for non-temporal vector stores ( 472a373)
Optimize x86 canonical vector zero register ( a451420)
Handle F16C operations ( 94fd100)
Fixes bug in pshuf{lw,hw} ( dfda673)
refactor Comiss helper ( fac9972)
Adds initial groundwork for decomposed AVX operations ( b2eb8aa)
Refactor address modes ( db0bdd4)
refactor zero vector loads ( c57e9e0)
RA
priorize remat over spilling ( 46ca53a)
SMCTracking
Fix incorrect mprotect tracking ( d7348c8)
Scripts
add update_instcountci.sh script ( 1971404)
InstallFEX
update PPA URL ( 9744d8d)
Vector
Helper refactorings ( 3f232e6)
Windows
Small fixes for compat with newer toolchains/wine versions ( 4e5da49)
X86Tables
add Literal() helper ( 8f769ce)
Misc
Use number of jobs as defined by TEST_JOB_COUNT ( f453e15)
Revert removing RCLSE ( 02a218c)
Optimize gathers slightly ( ba04da8)
Largest x87 blocks of code from games ( 500ad34)
( 1700d54)
OpcodeDispatcher: optimize nzcv with asimd masked load/store ( 2e32426)
Tiny opt for vzeroall ( ad4d4c9)
Optimize vcvtps2ph ( 756fa2e)
Remove RCLSE ( d1d41f5)
( 1c24d63)
Fix VMOVLHPS instruction ( 3a310b8)
Adds back cmake error on x86-64 hosts ( b2db04f)
Set tag properly in X87 FST(reg) ( 6e3643c)
Fixes AFP.NEP handling on scalar insertions ( da21ee3)
VAES support ( d2baef2)
Ignore python files for clang-format ( 053620f)
FXCH should set C1 to zero ( 87fe1d6)
First few commits from Ryan's AVX branch ( 29f6442)
json_ir_generator: don't print unrecoverable temps ( f863b30)
Clean ups from my RCLSE branch ( 4965344)
Revert "OpcodeDispatcher: optimize logical flags" ( 9aa82ec)
Use FEX_HOSTFEATURES instead of FEX_ENABLEAVX ( b17a2e9)
unittests
Adds MMX and x87 conflating unit test ( d2437e6)
Fixes typo in vpcmpgtw test ( d884eb9)
Split up vtestps unittest to accumulate flags in independent registers. ( 4d00a52)
A new version of FEX-Emu, which allows x86 and x86-64 binaries to run on an AArch64 host, has been released. The new version adds AVX128 support, allowing games that require AVX to run on Arm systems.