FEX-2308
Read the blog post at FEX-Emu's Site!
Whoa jeez, another month already? We've had our heads down working hard this last month, trying to make FEX-Emu the greatest x86/x86-64 emulator on
Linux. A huge focus this month is optimizations because of course what we want is to go fast. We're all cats and we've got the zoomies.
Every day we're optimizing
As said before, this month has been an absolute mess of optimizations as we've been optimizing the project as thoroughly as possible. We could spend
another month talking about the optimizations that we did this last month, so let's blast through what we did. First let's show a graph for how much
FEX has improved over this last month.
Look at those numbers! Some benchmarks from bytemark have cracked the 200% mark! While a couple of benchmarks do have regressions, we're pretty sure that
we know what they are and they will be rectified soon. These are the sorts of optimizations that can be felt in real games though.
So let's quickly run through some of the optimizations we made this last month.
Switch to using half-barriers for memory accesses
When ARM hits an unaligned atomic memory access, we previously wrapped that load or store in two slow barrier instructions. We can now safely use
only one barrier on one half of the instruction! This makes unaligned accesses quite a bit quicker.
Optimize x87 memory accesses
This removes a couple instructions when we access 80-bit floats.
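For readers who haven't looked at x87 memory operands before, here is a minimal sketch of what an 80-bit access involves. It does not show which instructions were removed (the post doesn't say); it only illustrates that a 10-byte x87 value is naturally handled as one 8-byte piece plus one 2-byte piece. The `X87Value` struct and both function names are hypothetical, not FEX's types, and a little-endian host is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical in-register view of an x87 extended-precision value.
struct X87Value {
  uint64_t significand;   // 64-bit significand, explicit integer bit at bit 63
  uint16_t sign_exponent; // bit 15: sign, bits 14..0: biased exponent
};

// Load a 10-byte x87 value from guest memory (little-endian host assumed).
// One 8-byte access plus one 2-byte access cover the whole value.
inline X87Value LoadX87(const uint8_t* mem) {
  assert(mem != nullptr);
  X87Value value{};
  std::memcpy(&value.significand, mem, 8);
  std::memcpy(&value.sign_exponent, mem + 8, 2);
  return value;
}

// Store the value back using the same 8 + 2 byte split.
inline void StoreX87(uint8_t* mem, const X87Value& value) {
  assert(mem != nullptr);
  std::memcpy(mem, &value.significand, 8);
  std::memcpy(mem + 8, &value.sign_exponent, 2);
}
```

As a sanity check, the x87 encoding of 1.0 is the byte sequence `00 00 00 00 00 00 00 80 FF 3F`, which loads as a significand of 0x8000000000000000 and a sign/exponent word of 0x3FFF.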
Only clear icache for code
Some large code blocks can generate a decent amount of metadata that doesn't need an icache clear. Skipping it can remove a bit of stutter.
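The idea can be sketched roughly as follows, assuming a JIT allocation that stores machine code first and metadata after it. The `JitAllocation` layout and the `FlushCodeOnly` name are made up for illustration and are not FEX's actual structures. `__builtin___clear_cache` is the GCC/Clang builtin for this, so the sketch assumes one of those compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: executable code first, then non-executable metadata.
struct JitAllocation {
  std::vector<uint8_t> bytes; // code followed by metadata
  size_t code_size;           // number of leading bytes that are code
};

// Flush the instruction cache for the code portion only and return how many
// bytes were covered. The trailing metadata is only ever read as data, so it
// does not need an icache clear.
inline size_t FlushCodeOnly(JitAllocation& alloc) {
  assert(alloc.code_size <= alloc.bytes.size());
  char* begin = reinterpret_cast<char*>(alloc.bytes.data());
  char* end = begin + alloc.code_size;
  __builtin___clear_cache(begin, end);
  return alloc.code_size;
}
```

With a 4096-byte allocation holding 1024 bytes of code, only the first 1024 bytes are flushed instead of the whole block.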
Const prop BFI operation
Sometimes, when a BFI instruction has constants in it, we can remove the BFI instruction entirely.
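Here is a small sketch of the simplest version of that fold, assuming both inputs to the bitfield insert are known constants. The post doesn't spell out exactly which cases FEX handles, so treat this as an illustration. `Bfi` follows the AArch64 BFI semantics; `IrValue`, `FoldResult` and `TryFoldBfi` are hypothetical names, not FEX's IR.

```cpp
#include <cstdint>

// Mask with the low `width` bits set (width is expected to be 1..64).
constexpr uint64_t FieldMask(unsigned width) {
  return width >= 64 ? ~0ULL : ((1ULL << width) - 1);
}

// Reference semantics of a bitfield insert: copy the low `width` bits of
// `src` into `dst` starting at bit `lsb`, leaving the rest of `dst` alone.
// Callers are expected to keep lsb + width <= 64.
constexpr uint64_t Bfi(uint64_t dst, uint64_t src, unsigned lsb, unsigned width) {
  return (dst & ~(FieldMask(width) << lsb)) | ((src & FieldMask(width)) << lsb);
}

// Hypothetical IR operand: either a known constant or an unknown value.
struct IrValue {
  bool is_constant;
  uint64_t constant;
};

struct FoldResult {
  bool folded;    // true if the BFI can be replaced by a constant
  uint64_t value; // the constant to use when folded is true
};

// If both operands are constants, the whole operation can be computed at
// compile time and the BFI instruction never needs to be emitted.
constexpr FoldResult TryFoldBfi(IrValue dst, IrValue src, unsigned lsb, unsigned width) {
  return (dst.is_constant && src.is_constant)
           ? FoldResult{true, Bfi(dst.constant, src.constant, lsb, width)}
           : FoldResult{false, 0};
}
```

For example, inserting the 8-bit constant 0xAB at bit 4 of the constant 0xFFFF0000 folds to 0xFFFF0AB0, while a non-constant destination is left for the JIT to emit as a real BFI.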
Optimize vector TSO loadstores
Vector operations typically need an additional add on their address if the offset can't fit in the instruction encoding's immediate offset. We missed the
optimization for the case in which the immediate offset CAN actually fit. This commonly removes an instruction per vector loadstore.
Use TST instead of CMN
Sometimes these instructions hit a slow path on Cortex-A57 so a minor win there.
Optimize xor reg, reg
x86's universally agreed upon instruction for generating zero in a register is xor. This instruction isn't actually optimal on ARM hardware. We now
emit a move of constant zero, which gets optimized to a register rename on most ARM hardware.
More instructions optimized
These mostly just make the implementations use fewer instructions, which makes them faster. There will be way more of this in the coming month.
- rotate flag calculations
- phsubsw/phaddsw
- cmpxchg8b/16b
- psad*
- 8-bit, 16-bit rcr
- fcmov
- shld/shrd
- movss
- maskmovdqu
- maskmovq
- phminposuw
- fild
- PF flag calculation optimization
- Optimizing packing RFLAGS
- Optimize ADD/ADC OF flag packing
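Of the items above, the PF calculation is the easiest to show in isolation. The sketch below is only a reference for what the flag means on x86 (set when the low 8 bits of a result contain an even number of 1 bits), using a plain XOR-fold. It is not the instruction sequence FEX emits, and the `ParityFlag` name is made up for this example.

```cpp
#include <cstdint>

// x86 PF: true when the low byte of the result has an even number of set bits.
// Bits above the low byte never take part in the calculation.
constexpr bool ParityFlag(uint64_t result) {
  unsigned low = static_cast<unsigned>(result & 0xFF);
  low ^= low >> 4; // fold 8 bits down to 4
  low ^= low >> 2; // fold 4 bits down to 2
  low ^= low >> 1; // fold 2 bits down to 1: bit 0 is now the XOR of all 8 bits
  return (low & 1) == 0;
}
```

For instance, 0x03 has two bits set in its low byte so PF is set, 0x07 has three so PF is clear, and 0x1FF behaves like 0xFF because only the low byte counts.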
Fixes bug in SSE4.2 pcmpestri
This was causing Java applications to crash. Together with a different bug that we fixed last month, this gets Java working to an extent. It still crashes on
shutdown, which is interesting, and not all games are expected to work. But good luck testing random Java games!
Pack NZCV flags
This is the first step towards FEX generating x86 flags in a more optimal way. These flags match the ARM flags fairly closely and can be emulated in a
more optimal way if we pack them together. This is likely what causes the regression in bytemark, but since this is an intermediate step it is
expected to go away with the next optimization after this. Look forward to future optimizations that make this faster!
Remove weak symbol declarations in thunks
A bug that cropped up in thunks has been a crash that occurs when trying to use thunks from Ubuntu's PPA system. This has been a major thorn in FEX's
side for months because once you rebuild the project locally, it would never reproduce. The problem stems from the fact that clang would decide that
it can inline a "weak" symbol if its implementation is visible. This would only occur on Canonical's ARM builders, potentially due to whatever device
they use to compile the code. This would cause our thunks to crash almost immediately if a user tried them from the PPA system. We have now worked
around this clang quirk, which fixes thunks when enabled from the PPA system.
Mingw build work
As part of FEX's effort towards supporting running as a WINE DLL, we have been slowly adding support for compiling FEXCore as a Windows DLL.
This month we have removed a bunch of Linux assumptions and API usages from FEXCore and moved them to the frontend FEXInterpreter application. In doing
so, FEXCore can now be compiled using llvm-mingw as a WINE specific DLL. This is completely unusable for users today but lays the groundwork for
what will eventually become a WoW64 integration. We have also added mingw building of FEXCore to our CI so we ensure it doesn't get
broken.
To be clear, even though this work allows us to compile as a Windows DLL, this doesn't allow us to run under Windows. FEX still does a bunch of things
that are Linux specific inside of the code.
ARMEmitter cleanups
Another improvement that doesn't affect our users but is worth a shoutout for our developers. @Lioncache has spent a good amount of time this last month
adding missing instructions and aliases to our AArch64 code emitter. While our code emitter covers a
decent amount of the AArch64 instruction space, it takes time to ensure full coverage. Whenever we're writing code for our JIT and an instruction is
missing, it slows down whatever we are working on. So kudos for improving our coverage because it makes everyone's lives easier.
Implement missing accept4, recvmmsg, sendmmsg for 32-bit socketcall
A recent Steam client update started using accept4 for some background task. This would cause it to spam a bunch of logs when failing to
accept some connection. With a simple fix for a few missing system calls, Steam is no longer complaining loudly.
Fix variadic packing in X11 thunking
WINE had broken X11 thunking for all of FEX's history without any indication as to why. We never had time to look into this, but this last month we
finally hit a game that crashed, which made this easier to debug. This bug occurred because WINE is one of the few applications that pass more than
seven arguments through a few variadic API calls. This triggered a bug in FEX's variadic repacking code once we started packing the arguments on to
the stack. With this fixed, WINE X11 thunking now works in significantly more games. This means that both OpenGL and Vulkan applications can be
thunked under WINE.
Fixes dead context store elimination pass
This optimization pass removes redundant stores to FEX's CPU context state. While this usually doesn't save much, it can improve performance for some
edge cases in FEX's JIT. While this is a performance optimization, it likely won't affect many things.
Fix 16-bit POPA instruction
This instruction was accidentally zero extending the 16-bit value into the 32-bit register. We now insert the 16 bits as expected. This fixes an
issue with OpenAL in some cases.
Raw Changes
ARMEmitter
Add missing atomic aliases ( 68cb6e6)
Add cinc/cinv/csetm aliases ( 30ab4d3)
Add ngc/ngcs aliases ( eebcbfd)
Add bfc/bfxil aliases ( d2bca9b)
Add sbfiz/tst/ubfiz aliases ( 4681061)
Finish off remaining SVE Integer Wide Immediate - Unpredicated categories ( 2fc6542)
Implement cmn alias ( 1ce0ea8)
Arm64
Switch to using half barriers ( 94273fb)
Fixes LR corruption in 128-bit divides ( 5821175)
Optimize {Load,Store}ContextIndexed address generation ( 536b2ed)
Only clear icache for code ( 0674dfa)
Emitter: Handle LD1{}/LDFF1{} Vector + Immediate encodings ( 421214e)
Emitter
Add remaining missing SVE predicate range assertions ( 64c7243)
Deduplicate some more SVE implementations pt. 2 ( 0a1820d)
Deduplicate some more SVE implementations ( 9c175da)
Reorganize some base opcode and assert locations ( 0be68a5)
Simplify SVE immediate shift helper ( e689c6f)
Collapse encoding cases for indexed dup ( d6697fc)
Handle SVE FP convert precision group ( e633ef7)
Handle SVE FP arithmetic with immediate (predicated) group ( 754bc18)
Handle SVE XAR ( 842b71c)
Add helper for encoding SVE shift immediates ( 41b3c52)
Deduplicate some opcode values ( d7200e2)
Handle unsized contiguous STR variants ( 46d3f28)
Handle ST1{*} scatter store variants ( 724a8e1)
Simplify SVEMemOperand data union ( 699c3f5)
Handle LD1{}/LDFF1{} scalar + vector load variants ( 5c27feb)
VectorOps
Hoist asserts out of VInsElement cases ( 5b261b0)
ArmEmitter
Support 32-bit bitmask moves ( d9b52fd)
CMake
Allow overriding linker ( 0d6837f)
Fix pkg version extraction for xcb ( 96b428d)
Stop installing fmt ( 2c3361b)
ConstProp
Handle constant Bfi ( 29dc77c)
Fix shift mask in const-prop ( daeba06)
Context
Pull out some long std::function declarations into aliases ( 7562ff2)
FEXCore
Ensure that the man page follows DESTDIR ( d196709)
Remove erroneous asserts in the project ( f526244)
Removes unused TLS variable ( 6455c48)
Fixes WIN32 compiling again ( 2283c73)
Minor cleanup ( 462feec)
Config
Adds support for enum mask configuration array ( 2c91b5c)
FEXLinuxTests
Fixes race in smc-mt-2 ( af35e18)
FEXRootFSFetcher
Make verification percent easier to read ( 7765bbc)
FHU
Avoid calling faccessat on WIN32 ( b34401b)
Fix WIN32 getcpu implementation ( 70d5412)
Frontend
Remove unused ModRMDecoded instance ( f9a9645)
Remove redundant lookups in BranchTargetInMultiblockRange() ( c81b1c3)
Handle VSIB byte ( 4a7fa7f)
Github
Adds mingw build test workflow ( 0f748bd)
IR
Expand to 16-bit opcodes ( 003c88e)
Print SSA values as %123 instead of %ssa123 ( 599b64e)
Optimize vector TSO loadstore address calculation ( 810c7d9)
Passes
Fixes DeadStoreElimination pass ( ee66985)
JIT
Use TST instead of CMN ( 5a53c92)
Jit
Add block links directly through the lookup cache on thread exit ( 8d8b64d)
Linux
Implement {recv,send}mmsg inside of socketcall ( 0a5db6c)
Implement accept4 inside of socketcall ( 7c4e4c4)
LongDivideRemovalPass
Remove unused variable ( 759747b)
Mingw
Fixes compiling again ( dc8f063)
OpcodeDispatcher
Optimize rotates ( 25b2af1)
Optimize phsubsw/phaddsw ( 7605994)
Optimize CMPXCHG{8B,16B} final comparison ( 6e15c9c)
Optimize PSAD* to use vuabdl{2,} ( 173b70d)
Fixes SHRD by immediate OF flag calculation ( 52d7efd)
Optimize 8/16-bit RCR ( 072f027)
Another FCMov minor optimization ( f7b7997)
Handle VSIB byte ( 6979dc9)
Remove unused member variables and reorganize ( be71886)
Minor optimization to phminposuw ( d3a2795)
Optimize 32/64-bit SH{L,R}D with extr ( 3cd6c2d)
Minor optimization to FILD ( 1b1e9e0)
Narrow use of LoadXMMRegister in StoreResult_WithOpSize ( 68a2441)
Fix and optimize PF calculation ( 1d7b4bb)
Optimize ADD/ADC OF flag packing ( 9722c4c)
Remove spurious bfe with flag storing ( ddd6dbf)
Optimize MOVSS to register ( 7f2557e)
Optimize MOVSS to memory destination ( 0121e85)
Optimize some shifts size masking ( 98eda5e)
Fixes bug with pcmpestri ( 5929357)
Optimize GetPackedRFLAG ( 8a4bfba)
Optimize MASKMOVDQU and MASKMOVQ ( 69ea03f)
RCLSE
Rename Node to ValueNode ( d83960d)
SignalDelegator
Moves last TLS variable to the frontend ( b86abfb)
StructPackVerifier
Fixes missing cursorkind again ( 296adf1)
Thunks
Set bitness flags for 64-bit guests ( 96aa0a8)
Remove weak symbol definitions ( ff3b404)
Fixes thunks in non-multiarch ( 9def04c)
X11
Fixes variadic packing and callbacks. ( 82295b2)
xcb
Fixes typo in version check. ( 8c53a37)
ThunksDB
Adds X11 dependency to XCB ( 597da88)
X86Tables
Fixes CLZero destination address ( 5d0b206)
Adds spaceship operator to couple op types ( 20aaad1)
Misc
Fix 16-bit popa insertion behaviour ( 21eb6e0)
Pack NZCV flags ( 91bd3aa)
Preserve PF across zero shift ( abf5e8c)
Optimize xor %eax, %eax ( 77c88ff)
github-actions: Adds timeout to upload results ( 6baee3b)
fix spelling errors ( 22f95e6)
unittests
Adds missing header ( 573f339)
gvisor
Adds all socket tests to flakes ( 6f2452e)
FEX-Emu, which enables the execution of x86 and x86-64 binaries on an AArch64 host, has been updated.