//===- README_X86_64.txt - Notes for X86-64 code gen ----------------------===//

AMD64 Optimization Manual 8.2 has some nice information about optimizing integer
multiplication by a constant. How much of it applies to Intel's X86-64
implementation? There are definite trade-offs to consider: latency vs. register
pressure vs. code size.

//===---------------------------------------------------------------------===//

Are we better off using branches instead of cmove to implement FP to
unsigned i64?

_conv:
        ucomiss LC0(%rip), %xmm0
        cvttss2siq %xmm0, %rdx
        jb L3
        subss LC0(%rip), %xmm0
        movabsq $-9223372036854775808, %rax
        cvttss2siq %xmm0, %rdx
        xorq %rax, %rdx
L3:
        movq %rdx, %rax
        ret

instead of

_conv:
        movss LCPI1_0(%rip), %xmm1
        cvttss2siq %xmm0, %rcx
        movaps %xmm0, %xmm2
        subss %xmm1, %xmm2
        cvttss2siq %xmm2, %rax
        movabsq $-9223372036854775808, %rdx
        xorq %rdx, %rax
        ucomiss %xmm1, %xmm0
        cmovb %rcx, %rax
        ret

Seems like the jb branch has high likelihood of being taken. It would have
saved a few instructions.

//===---------------------------------------------------------------------===//

It's not possible to reference the AH, BH, CH, and DH registers in an
instruction requiring a REX prefix. However, divb and mulb both produce results
in AH. If isel emits a CopyFromReg, it gets turned into a movb whose destination
can be allocated to r8b - r15b, and those registers can only be encoded with a
REX prefix.

To get around this, isel emits a CopyFromReg from AX, then right-shifts it
by 8 and truncates it. It's not pretty but it works. We need some register
allocation magic to make the hack go away (e.g. putting additional constraints
on the result of the movb).
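
The data flow of that workaround can be modeled in C. This is only a sketch,
not the isel code itself: the helper names divb_ax and remainder_via_ax are
made up for illustration, and divb_ax simply packs its result the way divb
leaves it in AX (quotient in AL, remainder in AH).

```c
#include <stdint.h>

/* Hypothetical model of divb: divide a 16-bit dividend by an 8-bit divisor
   and return AX with the quotient in AL and the remainder in AH.
   Assumes divisor != 0 and that the quotient fits in 8 bits, which is
   what divb itself requires. */
static uint16_t divb_ax(uint16_t dividend, uint8_t divisor) {
  uint8_t al = (uint8_t)(dividend / divisor);
  uint8_t ah = (uint8_t)(dividend % divisor);
  return (uint16_t)((ah << 8) | al);
}

/* Read the remainder without ever naming AH: copy all of AX, shift it
   right by 8, then truncate to 8 bits. */
static uint8_t remainder_via_ax(uint16_t dividend, uint8_t divisor) {
  uint16_t ax = divb_ax(dividend, divisor);
  return (uint8_t)(ax >> 8);
}
```

Because the copy reads AX rather than AH, the result can land in any 8-bit
register, including r8b - r15b, at the cost of the extra shift.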

//===---------------------------------------------------------------------===//

The x86-64 ABI for hidden-argument struct returns requires that the
incoming value of %rdi be copied into %rax by the callee upon return.

The idea is that it saves callers from having to remember this value,
which would often require a callee-saved register. Callees usually
need to keep this value live for most of their body anyway, so it
doesn't add a significant burden on them.

We currently implement this in codegen; however, this is suboptimal
because it makes it quite awkward to implement the optimization for
callers.

A better implementation would be to relax the LLVM IR rules for sret
arguments to allow a function with an sret argument to have a non-void
return type, and to have the front-end set up the sret argument value
as the return value of the function.
The front-end could more easily
emit uses of the returned struct value to be in terms of the function's
lowered return value, and it would free non-C frontends from a
complication only required by a C-based ABI.

//===---------------------------------------------------------------------===//

We get a redundant zero extension for code like this:

int mask[1000];
int foo(unsigned x) {
 if (x < 10)
   x = x * 45;
 else
   x = x * 78;
 return mask[x];
}

_foo:
LBB1_0: ## entry
        cmpl $9, %edi
        jbe LBB1_3 ## bb
LBB1_1: ## bb1
        imull $78, %edi, %eax
LBB1_2: ## bb2
        movl %eax, %eax                    <----
        movq _mask@GOTPCREL(%rip), %rcx
        movl (%rcx,%rax,4), %eax
        ret
LBB1_3: ## bb
        imull $45, %edi, %eax
        jmp LBB1_2 ## bb2

Before regalloc, we have:

        %reg1025<def> = IMUL32rri8 %reg1024, 45, %EFLAGS<imp-def>
        JMP mbb<bb2,0x203afb0>
    Successors according to CFG: 0x203afb0 (#3)

bb1: 0x203af60, LLVM BB @0x1e02310, ID#2:
    Predecessors according to CFG: 0x203aec0 (#0)
        %reg1026<def> = IMUL32rri8 %reg1024, 78, %EFLAGS<imp-def>
    Successors according to CFG: 0x203afb0 (#3)

bb2: 0x203afb0, LLVM BB @0x1e02340, ID#3:
    Predecessors according to CFG: 0x203af10 (#1) 0x203af60 (#2)
        %reg1027<def> = PHI %reg1025, mbb<bb,0x203af10>,
                            %reg1026, mbb<bb1,0x203af60>
        %reg1029<def> = MOVZX64rr32 %reg1027

so we'd have to know that IMUL32rri8 leaves the high word zero extended and to
be able to recognize the zero extend. This could also presumably be implemented
if we have whole-function selectiondags.
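
To make the redundancy concrete, here is a small C model of the register
state. It is a sketch of the semantics only (C can't show what the hardware
does), and the function names are invented for illustration. It relies on one
architectural fact: on x86-64, any write to a 32-bit register clears bits
63:32 of the full 64-bit register.

```c
#include <stdint.h>

/* Model of "imull $k, %edi, %eax": the 32-bit product wraps mod 2^32 and,
   as with any 32-bit register write on x86-64, bits 63:32 of RAX are
   cleared no matter what RAX held before. */
static uint64_t rax_after_imull(uint64_t rax_before, uint32_t x) {
  (void)rax_before; /* the old contents do not survive the write */
  uint32_t prod = (x < 10) ? x * 45u : x * 78u;
  return (uint64_t)prod;
}

/* Model of "movl %eax, %eax": keep the low 32 bits, clear the rest. */
static uint64_t rax_after_movl(uint64_t rax) {
  return rax & 0xFFFFFFFFu;
}
```

In this model rax_after_movl is the identity on anything rax_after_imull
returns, on both sides of the PHI, which is the fact the MOVZX64rr32 would
need to see through.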

//===---------------------------------------------------------------------===//

Take the following code
(from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653):
extern unsigned long table[];
unsigned long foo(unsigned char *p) {
  unsigned long tag = *p;
  return table[tag >> 4] + table[tag & 0xf];
}

Current code generated:
        movzbl (%rdi), %eax
        movq %rax, %rcx
        andq $240, %rcx
        shrq %rcx
        andq $15, %rax
        movq table(,%rax,8), %rax
        addq table(%rcx), %rax
        ret

Issues:
1. First movq should be movl; saves a byte.
2. Both andq's should be andl; saves another two bytes. I think this was
   implemented at one point, but subsequently regressed.
3. shrq should be shrl; saves another byte.
4. The first andq can be completely eliminated by using a slightly more
   expensive addressing mode.

//===---------------------------------------------------------------------===//

Consider the following (contrived testcase, but contains common factors):

#include <stdarg.h>
int test(int x, ...) {
  int sum = 0, i;
  va_list l;
  va_start(l, x);
  for (i = 0; i < x; i++)
    sum += va_arg(l, int);
  va_end(l);
  return sum;
}

Testcase given in C because fixing it will likely involve changing the IR
generated for it. The primary issue with the result is that it doesn't do any
of the optimizations which are possible if we know the address of a va_list
in the current function is never taken:
1. We shouldn't spill the XMM registers because we only call va_arg with "int".
2. It would be nice if we could sroa the va_list.
3. Probably overkill, but it'd be cool if we could peel off the first five
   iterations of the loop.

Other optimizations involving functions which use va_arg on floats which don't
have the address of a va_list taken:
1. Conversely to the above, we shouldn't spill general registers if we only
   call va_arg on "double".
2. If we know nothing more than 64 bits wide is read from the XMM registers,
   we can change the spilling code to reduce the amount of stack used by half.

//===---------------------------------------------------------------------===//
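
For the va_arg entry above, the numbers behind "first five iterations" and
"by half" can be written down in C. This is a sketch assuming the SysV x86-64
register save area layout (six 8-byte GPR slots followed by eight 16-byte XMM
slots); the enum names and sum_ints are made up for illustration, and
XMM_SAVE_BYTES_64 is one reading of what a 64-bit-only spill would cost.

```c
#include <stdarg.h>

enum {
  GPR_SAVE_BYTES = 6 * 8,     /* rdi, rsi, rdx, rcx, r8, r9 */
  XMM_SAVE_BYTES = 8 * 16,    /* xmm0-xmm7, full 128 bits each */
  XMM_SAVE_BYTES_64 = 8 * 8,  /* assumed cost if only the low 64 bits are spilled */
  GPR_VARARGS_AFTER_X = 6 - 1 /* x takes %edi; five GPRs remain for variadic ints */
};

/* Same shape as the testcase: sums n variadic int arguments. */
int sum_ints(int n, ...) {
  int sum = 0, i;
  va_list l;
  va_start(l, n);
  for (i = 0; i < n; i++)
    sum += va_arg(l, int);
  va_end(l);
  return sum;
}
```

With one named int argument, the first five variadic ints arrive in registers,
so five peeled iterations could read them directly. The remaining arguments
come from the stack.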