//===- README_X86_64.txt - Notes for X86-64 code gen ----------------------===//

AMD64 Optimization Manual 8.2 has some nice information about optimizing integer
multiplication by a constant. How much of it applies to Intel's X86-64
implementation? There are definite trade-offs to consider: latency vs. register
pressure vs. code size.
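
As a rough illustration of the kind of rewrite in question, here is a minimal C
sketch that multiplies by 45 using two LEA-shaped steps (45 = 9 * 5) next to a
plain multiply. The function names are made up for this note, and the asm in the
comments is only what one might expect a compiler to pick; whether the two-step
form actually wins depends on the latency, register pressure and code size
trade-offs mentioned above.

```c
#include <stdint.h>

/* Multiply by 45 as two "x + x*scale" steps, each of which fits the
   base + index*scale form of an LEA.  Hypothetical helper name. */
uint64_t mul45_lea(uint64_t x) {
    uint64_t t = x + x * 8;   /* roughly: leaq (%rdi,%rdi,8), %rax  -> 9x  */
    return t + t * 4;         /* roughly: leaq (%rax,%rax,4), %rax  -> 45x */
}

/* Reference form: a single multiply by the constant. */
uint64_t mul45_imul(uint64_t x) {
    return x * 45;            /* roughly: imulq $45, %rdi, %rax */
}
```

Both functions wrap modulo 2^64 in the same way, so they agree for every input.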

//===---------------------------------------------------------------------===//

Are we better off using branches instead of cmov to implement FP to
unsigned i64?

_conv:
	ucomiss	LC0(%rip), %xmm0
	cvttss2siq	%xmm0, %rdx
	jb	L3
	subss	LC0(%rip), %xmm0
	movabsq	$-9223372036854775808, %rax
	cvttss2siq	%xmm0, %rdx
	xorq	%rax, %rdx
L3:
	movq	%rdx, %rax
	ret

instead of

_conv:
	movss LCPI1_0(%rip), %xmm1
	cvttss2siq %xmm0, %rcx
	movaps %xmm0, %xmm2
	subss %xmm1, %xmm2
	cvttss2siq %xmm2, %rax
	movabsq $-9223372036854775808, %rdx
	xorq %rdx, %rax
	ucomiss %xmm1, %xmm0
	cmovb %rcx, %rax
	ret

The jb branch seems highly likely to be taken, in which case the branching
version would save a few instructions.
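
For reference, the branching sequence corresponds roughly to the C below. This
is only a sketch: the name conv_branch is made up, it assumes the LC0 constant
is 2^63 (the movabsq immediate suggests this), and it assumes the input lies in
[0, 2^64) and is not a NaN.

```c
#include <stdint.h>

/* Sketch of float -> unsigned 64-bit using only the signed conversion.
   Assumes 0 <= x < 2^64 and x is not a NaN. */
uint64_t conv_branch(float x) {
    const float two63 = 9223372036854775808.0f;  /* assumed value of LC0 */
    if (x < two63)
        return (uint64_t)(int64_t)x;             /* fast path: one cvttss2siq */
    /* Slow path: bias down by 2^63, convert, then set the top bit back. */
    return (uint64_t)(int64_t)(x - two63) ^ 0x8000000000000000ULL;
}
```

The fast path is the one the jb covers; the subtraction and xor only run for
inputs of 2^63 or more.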

//===---------------------------------------------------------------------===//

It's not possible to reference the AH, BH, CH, and DH registers in an
instruction requiring a REX prefix. However, divb and mulb both produce results
in AH. If isel emits a CopyFromReg from AH, it gets turned into a movb whose
destination can be allocated to one of r8b - r15b, which requires a REX prefix.

To get around this, isel emits a CopyFromReg from AX, then right-shifts it
down by 8 and truncates it. It's not pretty but it works. We need some register
allocation magic to make the hack go away (e.g. putting additional constraints
on the result of the movb).
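
The workaround amounts to the following, shown here as a small C model rather
than real isel output. The helper names are invented for this note; divb_ax only
models what an 8-bit divide leaves in AX (quotient in AL, remainder in AH) and
assumes the quotient fits in 8 bits, as the hardware requires.

```c
#include <stdint.h>

/* Model of AX after divb: AL = quotient, AH = remainder.
   Assumes divisor != 0 and that the quotient fits in 8 bits. */
uint16_t divb_ax(uint16_t dividend, uint8_t divisor) {
    uint8_t quot = (uint8_t)(dividend / divisor);
    uint8_t rem  = (uint8_t)(dividend % divisor);
    return (uint16_t)((rem << 8) | quot);
}

/* The hack: copy all of AX, shift right by 8, truncate to 8 bits.
   No instruction here ever names AH directly. */
uint8_t high_byte(uint16_t ax) {
    return (uint8_t)(ax >> 8);
}
```

For example, high_byte(divb_ax(100, 7)) yields the remainder of 100 / 7 without
a byte move out of AH.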

//===---------------------------------------------------------------------===//

The x86-64 ABI for hidden-argument struct returns requires that the
incoming value of %rdi be copied into %rax by the callee upon return.

The idea is that it saves callers from having to remember this value,
which would often require a callee-saved register. Callees usually
need to keep this value live for most of their body anyway, so it
doesn't add a significant burden on them.

We currently implement this in codegen; however, this is suboptimal
because it means that it would be quite awkward to implement the
optimization for callers.

A better implementation would be to relax the LLVM IR rules for sret
arguments to allow a function with an sret argument to have a non-void
return type, and to have the front-end set up the sret argument value
as the return value of the function. The front-end could more easily
emit uses of the returned struct value in terms of the function's
lowered return value, and it would free non-C frontends from a
complication only required by a C-based ABI.
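
To make the proposal concrete, here is a hand-lowered C sketch of what such a
function could look like: the hidden pointer is an explicit first parameter and
is also the return value. The struct and function names are hypothetical, and
this is C standing in for IR, not what any front-end emits today.

```c
/* A struct large enough to be returned through a hidden pointer. */
struct big { long v[8]; };

/* Hand-lowered form of a function returning struct big by value:
   the sret pointer comes in as the first argument and is returned,
   mirroring the ABI's "copy incoming %rdi into %rax". */
struct big *make_big_lowered(struct big *sret, long seed) {
    for (int i = 0; i < 8; i++)
        sret->v[i] = seed + i;
    return sret;
}

/* A caller that uses the returned pointer instead of remembering
   the address it passed in. */
long first_plus_last(long seed) {
    struct big tmp;
    struct big *p = make_big_lowered(&tmp, seed);
    return p->v[0] + p->v[7];
}
```

In first_plus_last the caller never needs &tmp after the call, which is the
caller-side saving the ABI rule is meant to enable.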

//===---------------------------------------------------------------------===//

We get a redundant zero extension for code like this:

int mask[1000];
int foo(unsigned x) {
  if (x < 10)
    x = x * 45;
  else
    x = x * 78;
  return mask[x];
}

_foo:
LBB1_0:	## entry
	cmpl	$9, %edi
	jbe	LBB1_3	## bb
LBB1_1:	## bb1
	imull	$78, %edi, %eax
LBB1_2:	## bb2
	movl	%eax, %eax                    <----
	movq	_mask@GOTPCREL(%rip), %rcx
	movl	(%rcx,%rax,4), %eax
	ret
LBB1_3:	## bb
	imull	$45, %edi, %eax
	jmp	LBB1_2	## bb2

Before regalloc, we have:

        %reg1025<def> = IMUL32rri8 %reg1024, 45, %EFLAGS<imp-def>
        JMP mbb<bb2,0x203afb0>
    Successors according to CFG: 0x203afb0 (#3)

bb1: 0x203af60, LLVM BB @0x1e02310, ID#2:
    Predecessors according to CFG: 0x203aec0 (#0)
        %reg1026<def> = IMUL32rri8 %reg1024, 78, %EFLAGS<imp-def>
    Successors according to CFG: 0x203afb0 (#3)

bb2: 0x203afb0, LLVM BB @0x1e02340, ID#3:
    Predecessors according to CFG: 0x203af10 (#1) 0x203af60 (#2)
        %reg1027<def> = PHI %reg1025, mbb<bb,0x203af10>,
                            %reg1026, mbb<bb1,0x203af60>
        %reg1029<def> = MOVZX64rr32 %reg1027

so we'd have to know that IMUL32rri8 leaves the high 32 bits zeroed, and be
able to recognize the zero extend across the PHI.  This could also presumably be
implemented if we have whole-function selectiondags.

//===---------------------------------------------------------------------===//

Take the following code
(from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653):

extern unsigned long table[];
unsigned long foo(unsigned char *p) {
  unsigned long tag = *p;
  return table[tag >> 4] + table[tag & 0xf];
}

Current code generated:
	movzbl	(%rdi), %eax
	movq	%rax, %rcx
	andq	$240, %rcx
	shrq	%rcx
	andq	$15, %rax
	movq	table(,%rax,8), %rax
	addq	table(%rcx), %rax
	ret

Issues:
1. First movq should be movl; saves a byte.
2. Both andq's should be andl; saves another two bytes.  I think this was
   implemented at one point, but subsequently regressed.
3. shrq should be shrl; saves another byte.
4. The first andq can be completely eliminated by using a slightly more
   expensive addressing mode.
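
Issue 4 rests on the two ways of forming the byte offset for table[tag >> 4]
being equivalent. The C below is a small check of that, with made-up function
names; the asm in the comments is a guess at the instructions involved, not
output from any particular compiler.

```c
#include <stdint.h>

/* Byte offset as currently generated: andq $240 ; shrq (by 1). */
uint64_t offset_and_shift(uint8_t tag) {
    return ((uint64_t)tag & 240) >> 1;
}

/* Byte offset via a scaled index: shift right by 4, then let the
   addressing mode multiply by 8, e.g. table(,%rcx,8). */
uint64_t offset_scaled(uint8_t tag) {
    return ((uint64_t)tag >> 4) * 8;
}

/* Returns 1 if both forms agree for every possible byte value. */
int offsets_agree(void) {
    for (unsigned t = 0; t < 256; t++)
        if (offset_and_shift((uint8_t)t) != offset_scaled((uint8_t)t))
            return 0;
    return 1;
}
```

Since the value was loaded with movzbl, shifting right by 4 already discards the
low nibble, so the mask with 240 is not needed once the scale moves into the
addressing mode.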

//===---------------------------------------------------------------------===//

Consider the following (contrived testcase, but contains common factors):

#include <stdarg.h>
int test(int x, ...) {
  int sum = 0, i;
  va_list l;
  va_start(l, x);
  for (i = 0; i < x; i++)
    sum += va_arg(l, int);
  va_end(l);
  return sum;
}

Testcase given in C because fixing it will likely involve changing the IR
generated for it.  The primary issue with the result is that it doesn't do any
of the optimizations which are possible if we know the address of a va_list
in the current function is never taken:
1. We shouldn't spill the XMM registers because we only call va_arg with "int".
2. It would be nice if we could sroa the va_list.
3. Probably overkill, but it'd be cool if we could peel off the first five
   iterations of the loop.

Other optimizations for functions which use va_arg on floating-point values and
never take the address of a va_list:
1. Conversely to the above, we shouldn't spill general registers if we only
   call va_arg on "double".
2. If we know nothing more than 64 bits wide is read from the XMM registers,
   we can change the spilling code to reduce the amount of stack used by half.
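
The sizes behind these points can be worked out from the System V x86-64
calling convention, where up to six integer arguments and eight XMM arguments
arrive in registers. The C below is just that arithmetic, with invented names;
the "five iterations" in the first list presumably corresponds to the five
integer argument registers left after the named argument x.

```c
#include <stddef.h>

enum {
    NUM_GPR_ARGS  = 6,   /* rdi, rsi, rdx, rcx, r8, r9 */
    NUM_XMM_ARGS  = 8,   /* xmm0 .. xmm7 */
    GPR_SLOT      = 8,   /* bytes spilled per general register */
    XMM_SLOT_FULL = 16,  /* bytes per XMM register when spilled whole */
    XMM_SLOT_HALF = 8    /* bytes per XMM register if only 64 bits are read */
};

/* Bytes used to spill the XMM argument registers at a given slot size. */
size_t xmm_area(size_t slot) {
    return NUM_XMM_ARGS * slot;
}

/* Full register save area: all GPR slots plus all full XMM slots. */
size_t save_area_full(void) {
    return NUM_GPR_ARGS * GPR_SLOT + xmm_area(XMM_SLOT_FULL);
}

/* Save area if only integer va_arg is used, so no XMM spills. */
size_t save_area_int_only(void) {
    return NUM_GPR_ARGS * GPR_SLOT;
}

/* Integer varargs that can still arrive in registers after the
   given number of named integer arguments. */
int gpr_varargs_after(int named_int_args) {
    return NUM_GPR_ARGS - named_int_args;
}
```

With these numbers, skipping the XMM spills shrinks the save area from 176 to
48 bytes, and spilling only 64 bits per XMM register halves the XMM portion
from 128 to 64 bytes.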

//===---------------------------------------------------------------------===//