# Go internal ABI specification

Self-link: [go.dev/s/regabi](https://go.dev/s/regabi)

This document describes Go’s internal application binary interface
(ABI), known as ABIInternal.
Go's ABI defines the layout of data in memory and the conventions for
calling between Go functions.
This ABI is *unstable* and will change between Go versions.
If you’re writing assembly code, please instead refer to Go’s
[assembly documentation](/doc/asm.html), which describes Go’s stable
ABI, known as ABI0.

All functions defined in Go source follow ABIInternal.
However, ABIInternal and ABI0 functions are able to call each other
through transparent *ABI wrappers*, described in the [internal calling
convention proposal](https://golang.org/design/27539-internal-abi).

Go uses a common ABI design across all architectures.
We first describe the common ABI, and then cover per-architecture
specifics.

*Rationale*: For the reasoning behind using a common ABI across
architectures instead of the platform ABI, see the [register-based Go
calling convention proposal](https://golang.org/design/40724-register-calling).

## Memory layout

Go's built-in types have the following sizes and alignments.
Many, though not all, of these sizes are guaranteed by the [language
specification](/doc/go_spec.html#Size_and_alignment_guarantees).
Those that aren't guaranteed may change in future versions of Go (for
example, we've considered changing the alignment of int64 on 32-bit).

| Type                        | 64-bit |       | 32-bit |       |
|-----------------------------|--------|-------|--------|-------|
|                             | Size   | Align | Size   | Align |
| bool, uint8, int8           | 1      | 1     | 1      | 1     |
| uint16, int16               | 2      | 2     | 2      | 2     |
| uint32, int32               | 4      | 4     | 4      | 4     |
| uint64, int64               | 8      | 8     | 8      | 4     |
| int, uint                   | 8      | 8     | 4      | 4     |
| float32                     | 4      | 4     | 4      | 4     |
| float64                     | 8      | 8     | 8      | 4     |
| complex64                   | 8      | 4     | 8      | 4     |
| complex128                  | 16     | 8     | 16     | 4     |
| uintptr, *T, unsafe.Pointer | 8      | 8     | 4      | 4     |

The types `byte` and `rune` are aliases for `uint8` and `int32`,
respectively, and hence have the same size and alignment as these
types.

The layout of `map`, `chan`, and `func` types is equivalent to *T.

To describe the layout of the remaining composite types, we first
define the layout of a *sequence* S of N fields with types
t<sub>1</sub>, t<sub>2</sub>, ..., t<sub>N</sub>.
We define the byte offset at which each field begins relative to a
base address of 0, as well as the size and alignment of the sequence
as follows:

```
offset(S, i) = 0  if i = 1
             = align(offset(S, i-1) + sizeof(t_(i-1)), alignof(t_i))
alignof(S)   = 1  if N = 0
             = max(alignof(t_i) | 1 <= i <= N)
sizeof(S)    = 0  if N = 0
             = align(offset(S, N) + sizeof(t_N), alignof(S))
```

Where sizeof(T) and alignof(T) are the size and alignment of type T,
respectively, and align(x, y) rounds x up to a multiple of y.

The `interface{}` type is a sequence of 1. a pointer to the runtime type
description for the interface's dynamic type and 2. an `unsafe.Pointer`
data field.
Any other interface type (besides the empty interface) is a sequence
of 1. a pointer to the runtime "itab" that gives the method pointers and
the type of the data field and 2. an `unsafe.Pointer` data field.
An interface can be "direct" or "indirect" depending on the dynamic
type: a direct interface stores the value directly in the data field,
and an indirect interface stores a pointer to the value in the data
field.
An interface can only be direct if the value consists of a single
pointer word.

An array type `[N]T` is a sequence of N fields of type T.

The slice type `[]T` is a sequence of a `*[cap]T` pointer to the slice
backing store, an `int` giving the `len` of the slice, and an `int`
giving the `cap` of the slice.

The `string` type is a sequence of a `*[len]byte` pointer to the
string backing store, and an `int` giving the `len` of the string.

A struct type `struct { f1 t1; ...; fM tM }` is laid out as the
sequence t1, ..., tM, tP, where tP is either:

- Type `byte` if sizeof(tM) = 0 and any of sizeof(t*i*) ≠ 0.
- Empty (size 0 and align 1) otherwise.

The padding byte prevents creating a past-the-end pointer by taking
the address of the final, empty fM field.

Note that user-written assembly code should generally not depend on Go
type layout and should instead use the constants defined in
[`go_asm.h`](/doc/asm.html#data-offsets).

## Function call argument and result passing

Function calls pass arguments and results using a combination of the
stack and machine registers.
Each argument or result is passed either entirely in registers or
entirely on the stack.
Because access to registers is generally faster than access to the
stack, arguments and results are preferentially passed in registers.
However, any argument or result that contains a non-trivial array or
does not fit entirely in the remaining available registers is passed
on the stack.

Each architecture defines a sequence of integer registers and a
sequence of floating-point registers.
At a high level, arguments and results are recursively broken down
into values of base types and these base values are assigned to
registers from these sequences.

Arguments and results can share the same registers, but do not share
the same stack space.
Beyond the arguments and results passed on the stack, the caller also
reserves spill space on the stack for all register-based arguments
(but does not populate this space).

The receiver, arguments, and results of function or method F are
assigned to registers or the stack using the following algorithm:

1. Let NI and NFP be the length of integer and floating-point register
   sequences defined by the architecture.
   Let I and FP be 0; these are the indexes of the next integer and
   floating-point register.
   Let S, the type sequence defining the stack frame, be empty.
1. If F is a method, assign F’s receiver.
1. For each argument A of F, assign A.
1. Add a pointer-alignment field to S. This has size 0 and the same
   alignment as `uintptr`.
1. Reset I and FP to 0.
1. For each result R of F, assign R.
1. Add a pointer-alignment field to S.
1. For each register-assigned receiver and argument of F, let T be its
   type and add T to the stack sequence S.
   This is the argument's (or receiver's) spill space and will be
   uninitialized at the call.
1. Add a pointer-alignment field to S.

Assigning a receiver, argument, or result V of underlying type T works
as follows:

1. Remember I and FP.
1. If T has zero size, add T to the stack sequence S and return.
1. Try to register-assign V.
1. If step 3 failed, reset I and FP to the values from step 1, add T
   to the stack sequence S, and assign V to this field in S.

Register-assignment of a value V of underlying type T works as follows:

1. If T is a boolean or integral type that fits in an integer
   register, assign V to register I and increment I.
1. If T is an integral type that fits in two integer registers, assign
   the least significant and most significant halves of V to registers
   I and I+1, respectively, and increment I by 2.
1. If T is a floating-point type and can be represented without loss
   of precision in a floating-point register, assign V to register FP
   and increment FP.
1. If T is a complex type, recursively register-assign its real and
   imaginary parts.
1. If T is a pointer type, map type, channel type, or function type,
   assign V to register I and increment I.
1. If T is a string type, interface type, or slice type, recursively
   register-assign V’s components (2 for strings and interfaces, 3 for
   slices).
1. If T is a struct type, recursively register-assign each field of V.
1. If T is an array type of length 0, do nothing.
1. If T is an array type of length 1, recursively register-assign its
   one element.
1. If T is an array type of length > 1, fail.
1. If I > NI or FP > NFP, fail.
1. If any recursive assignment above fails, fail.

The above algorithm produces an assignment of each receiver, argument,
and result to registers or to a field in the stack sequence.
The final stack sequence looks like: stack-assigned receiver,
stack-assigned arguments, pointer-alignment, stack-assigned results,
pointer-alignment, spill space for each register-assigned argument,
pointer-alignment.
The following diagram shows what this stack frame looks like on the
stack, using the typical convention where address 0 is at the bottom:

    +------------------------------+
    |             . . .            |
    | 2nd reg argument spill space |
    | 1st reg argument spill space |
    | <pointer-sized alignment>    |
    |             . . .            |
    | 2nd stack-assigned result    |
    | 1st stack-assigned result    |
    | <pointer-sized alignment>    |
    |             . . .            |
    | 2nd stack-assigned argument  |
    | 1st stack-assigned argument  |
    | stack-assigned receiver      |
    +------------------------------+ ↓ lower addresses

To perform a call, the caller reserves space starting at the lowest
address in its stack frame for the call stack frame, stores arguments
in the registers and argument stack fields determined by the above
algorithm, and performs the call.
At the time of a call, spill space, result stack fields, and result
registers are left uninitialized.
Upon return, the callee must have stored results to all result
registers and result stack fields determined by the above algorithm.

There are no callee-save registers, so a call may overwrite any
register that doesn’t have a fixed meaning, including argument
registers.

### Example

Consider the function `func f(a1 uint8, a2 [2]uintptr, a3 uint8) (r1
struct { x uintptr; y [2]uintptr }, r2 string)` on a 64-bit
architecture with hypothetical integer registers R0–R9.

On entry, `a1` is assigned to `R0`, `a3` is assigned to `R1` and the
stack frame is laid out in the following sequence:

    a2      [2]uintptr
    r1.x    uintptr
    r1.y    [2]uintptr
    a1Spill uint8
    a3Spill uint8
    _       [6]uint8  // alignment padding

In the stack frame, only the `a2` field is initialized on entry; the
rest of the frame is left uninitialized.

On exit, `r2.base` is assigned to `R0`, `r2.len` is assigned to `R1`,
and `r1.x` and `r1.y` are initialized in the stack frame.

There are several things to note in this example.
First, `a2` and `r1` are stack-assigned because they contain arrays.
The other arguments and results are register-assigned.
Result `r2` is decomposed into its components, which are individually
register-assigned.
On the stack, the stack-assigned arguments appear at lower addresses
than the stack-assigned results, which appear at lower addresses than
the argument spill area.
Only arguments, not results, are assigned a spill area on the stack.

### Rationale

Each base value is assigned to its own register to optimize
construction and access.
An alternative would be to pack multiple sub-word values into
registers, or to simply map an argument's in-memory layout to
registers (this is common in C ABIs), but this typically adds cost to
pack and unpack these values.
Modern architectures have more than enough registers to pass all
arguments and results this way for nearly all functions (see the
appendix), so there’s little downside to spreading base values across
registers.

Arguments that can’t be fully assigned to registers are passed
entirely on the stack in case the callee takes the address of that
argument.
If an argument could be split across the stack and registers and the
callee took its address, it would need to be reconstructed in memory,
a process that would be proportional to the size of the argument.

Non-trivial arrays are always passed on the stack because indexing
into an array typically requires a computed offset, which generally
isn’t possible with registers.
Arrays in general are rare in function signatures (only 0.7% of
functions in the Go 1.15 standard library and 0.2% in kubelet).
We considered allowing array fields to be passed on the stack while
the rest of an argument’s fields are passed in registers, but this
creates the same problems as other large structs if the callee takes
the address of an argument, and would benefit <0.1% of functions in
kubelet (and even these very little).

We make exceptions for 0 and 1-element arrays because these don’t
require computed offsets, and 1-element arrays are already decomposed
in the compiler’s SSA representation.

The ABI assignment algorithm above is equivalent to Go’s stack-based
ABI0 calling convention if there are zero architecture registers.
This is intended to ease the transition to the register-based internal
ABI and make it easy for the compiler to generate either calling
convention.
An architecture may still define register meanings that aren’t
compatible with ABI0, but these differences should be easy to account
for in the compiler.

The assignment algorithm assigns zero-sized values to the stack
(assignment step 2) in order to support ABI0-equivalence.
While these values take no space themselves, they do result in
alignment padding on the stack in ABI0.
Without this step, the internal ABI would register-assign zero-sized
values even on architectures that provide no argument registers
because they don't consume any registers, and hence not add alignment
padding to the stack.

The algorithm reserves spill space for arguments in the caller’s frame
so that the compiler can generate a stack growth path that spills into
this reserved space.
If the callee has to grow the stack, it may not be able to reserve
enough additional stack space in its own frame to spill these, which
is why it’s important that the caller do so.
These slots also act as the home location if these arguments need to
be spilled for any other reason, which simplifies traceback printing.

There are several options for how to lay out the argument spill space.
We chose to lay out each argument according to its type's usual memory
layout but to separate the spill space from the regular argument
space.
Using the usual memory layout simplifies the compiler because it
already understands this layout.
Also, if a function takes the address of a register-assigned argument,
the compiler must spill that argument to memory in its usual memory
layout and it's more convenient to use the argument spill space for
this purpose.

Alternatively, the spill space could be structured around argument
registers.
In this approach, the stack growth spill path would spill each
argument register to a register-sized stack word.
However, if the function takes the address of a register-assigned
argument, the compiler would have to reconstruct it in memory layout
elsewhere on the stack.

The spill space could also be interleaved with the stack-assigned
arguments so the arguments appear in order whether they are register-
or stack-assigned.
This would be close to ABI0, except that register-assigned arguments
would be uninitialized on the stack and there's no need to reserve
stack space for register-assigned results.
We expect separating the spill space to perform better because of
memory locality.
Separating the space is also potentially simpler for `reflect` calls
because this allows `reflect` to summarize the spill space as a single
number.
Finally, the long-term intent is to remove reserved spill slots
entirely – allowing most functions to be called without any stack
setup and easing the introduction of callee-save registers – and
separating the spill space makes that transition easier.

## Closures

A func value (e.g., `var x func()`) is a pointer to a closure object.
A closure object begins with a pointer-sized program counter
representing the entry point of the function, followed by zero or more
bytes containing the closed-over environment.

Closure calls follow the same conventions as static function and
method calls, with one addition.
Each architecture specifies a
*closure context pointer* register and calls to closures store the
address of the closure object in the closure context pointer register
prior to the call.

## Software floating-point mode

In "softfloat" mode, the ABI simply treats the hardware as having zero
floating-point registers.
As a result, any arguments containing floating-point values will be
passed on the stack.

*Rationale*: Softfloat mode is about compatibility over performance
and is not commonly used.
Hence, we keep the ABI as simple as possible in this case, rather than
adding additional rules for passing floating-point values in integer
registers.

## Architecture specifics

This section describes per-architecture register mappings, as well as
other per-architecture special cases.

### amd64 architecture

The amd64 architecture uses the following sequence of 9 registers for
integer arguments and results:

    RAX, RBX, RCX, RDI, RSI, R8, R9, R10, R11

It uses X0 – X14 for floating-point arguments and results.

*Rationale*: These sequences are chosen from the available registers
to be relatively easy to remember.

Registers R12 and R13 are permanent scratch registers.
R15 is a scratch register except in dynamically linked binaries.

*Rationale*: Some operations such as stack growth and reflection calls
need dedicated scratch registers in order to manipulate call frames
without corrupting arguments or results.

Special-purpose registers are as follows:

| Register | Call meaning | Return meaning | Body meaning |
| --- | --- | --- | --- |
| RSP | Stack pointer | Same | Same |
| RBP | Frame pointer | Same | Same |
| RDX | Closure context pointer | Scratch | Scratch |
| R12 | Scratch | Scratch | Scratch |
| R13 | Scratch | Scratch | Scratch |
| R14 | Current goroutine | Same | Same |
| R15 | GOT reference temporary if dynlink | Same | Same |
| X15 | Zero value (*) | Same | Scratch |

(*) Except on Plan 9, where X15 is a scratch register because SSE
registers cannot be used in note handlers (so the compiler avoids
using them except when absolutely necessary).

*Rationale*: These register meanings are compatible with Go’s
stack-based calling convention except for R14 and X15, which will have
to be restored on transitions from ABI0 code to ABIInternal code.
In ABI0, these are undefined, so transitions from ABIInternal to ABI0
can ignore these registers.

*Rationale*: For the current goroutine pointer, we chose a register
that requires an additional REX byte.
While this adds one byte to every function prologue, it is hardly ever
accessed outside the function prologue and we expect making more
single-byte registers available to be a net win.

*Rationale*: We could allow R14 (the current goroutine pointer) to be
a scratch register in function bodies because it can always be
restored from TLS on amd64.
However, we designate it as a fixed register for simplicity and for
consistency with other architectures that may not have a copy of the
current goroutine pointer in TLS.

*Rationale*: We designate X15 as a fixed zero register because
functions often have to bulk zero their stack frames, and this is more
efficient with a designated zero register.

*Implementation note*: Registers with fixed meaning at calls but not
in function bodies must be initialized by "injected" calls such as
signal-based panics.

#### Stack layout

The stack pointer, RSP, grows down and is always aligned to 8 bytes.

The amd64 architecture does not use a link register.

A function's stack frame is laid out as follows:

    +------------------------------+
    | return PC                    |
    | RBP on entry                 |
    | ... locals ...               |
    | ... outgoing arguments ...   |
    +------------------------------+ ↓ lower addresses

The "return PC" is pushed as part of the standard amd64 `CALL`
operation.
On entry, a function subtracts from RSP to open its stack frame and
saves the value of RBP directly below the return PC.
A leaf function that does not require any stack space may omit the
saved RBP.

The Go ABI's use of RBP as a frame pointer register is compatible with
amd64 platform conventions so that Go can inter-operate with platform
debuggers and profilers.

#### Flags

The direction flag (D) is always cleared (set to the “forward”
direction) at a call.
The arithmetic status flags are treated like scratch registers and not
preserved across calls.
All other bits in RFLAGS are system flags.

At function calls and returns, the CPU is in x87 mode (not MMX
technology mode).

*Rationale*: Go on amd64 does not use either the x87 registers or MMX
registers. Hence, we follow the SysV platform conventions in order to
simplify transitions to and from the C ABI.

At calls, the MXCSR control bits are always set as follows:

| Flag | Bit | Value | Meaning |
| --- | --- | --- | --- |
| FZ | 15 | 0 | Do not flush to zero |
| RC | 14/13 | 0 (RN) | Round to nearest |
| PM | 12 | 1 | Precision masked |
| UM | 11 | 1 | Underflow masked |
| OM | 10 | 1 | Overflow masked |
| ZM | 9 | 1 | Divide-by-zero masked |
| DM | 8 | 1 | Denormal operations masked |
| IM | 7 | 1 | Invalid operations masked |
| DAZ | 6 | 0 | Do not zero de-normals |

The MXCSR status bits are callee-save.

*Rationale*: Having a fixed MXCSR control configuration allows Go
functions to use SSE operations without modifying or saving the MXCSR.
Functions are allowed to modify it between calls (as long as they
restore it), but as of this writing Go code never does.
The above fixed configuration matches the process initialization
control bits specified by the ELF AMD64 ABI.

The x87 floating-point control word is not used by Go on amd64.

### arm64 architecture

The arm64 architecture uses R0 – R15 for integer arguments and results.

It uses F0 – F15 for floating-point arguments and results.

*Rationale*: 16 integer registers and 16 floating-point registers are
more than enough for passing arguments and results for practically all
functions (see Appendix). While there are more registers available,
using more registers provides little benefit. Additionally, it would add
overhead on code paths where the number of arguments is not statically
known (e.g. reflect call), and would consume more stack space when there
is only limited stack space available to fit in the nosplit limit.

Registers R16 and R17 are permanent scratch registers. They are also
used as scratch registers by the linker (Go linker and external
linker) in trampolines.

Register R18 is reserved and never used. It is reserved for the OS
on some platforms (e.g. macOS).

Registers R19 – R25 are permanent scratch registers. In addition,
R27 is a permanent scratch register used by the assembler when
expanding instructions.

Floating-point registers F16 – F31 are also permanent scratch
registers.

Special-purpose registers are as follows:

| Register | Call meaning | Return meaning | Body meaning |
| --- | --- | --- | --- |
| RSP | Stack pointer | Same | Same |
| R30 | Link register | Same | Scratch (non-leaf functions) |
| R29 | Frame pointer | Same | Same |
| R28 | Current goroutine | Same | Same |
| R27 | Scratch | Scratch | Scratch |
| R26 | Closure context pointer | Scratch | Scratch |
| R18 | Reserved (not used) | Same | Same |
| ZR | Zero value | Same | Same |

*Rationale*: These register meanings are compatible with Go’s
stack-based calling convention.

*Rationale*: The link register, R30, holds the function return
address at the function entry. For functions that have frames
(including most non-leaf functions), R30 is saved to stack in the
function prologue and restored in the epilogue. Within the function
body, R30 can be used as a scratch register.

*Implementation note*: Registers with fixed meaning at calls but not
in function bodies must be initialized by "injected" calls such as
signal-based panics.

#### Stack layout

The stack pointer, RSP, grows down and is always aligned to 16 bytes.

*Rationale*: The arm64 architecture requires the stack pointer to be
16-byte aligned.

A function's stack frame, after the frame is created, is laid out as
follows:

    +------------------------------+
    | ... locals ...               |
    | ... outgoing arguments ...   |
    | return PC                    | ← RSP points to
    | frame pointer on entry       |
    +------------------------------+ ↓ lower addresses

The "return PC" is loaded to the link register, R30, as part of the
arm64 `CALL` operation.

On entry, a function subtracts from RSP to open its stack frame, and
saves the values of R30 and R29 at the bottom of the frame.
Specifically, R30 is saved at 0(RSP) and R29 is saved at -8(RSP),
after RSP is updated.

A leaf function that does not require any stack space may omit the
saved R30 and R29.

The Go ABI's use of R29 as a frame pointer register is compatible with
the arm64 architecture requirements so that Go can inter-operate with
platform debuggers and profilers.

This stack layout is used by both register-based (ABIInternal) and
stack-based (ABI0) calling conventions.

#### Flags

The arithmetic status flags (NZCV) are treated like scratch registers
and not preserved across calls.
All other bits in PSTATE are system flags and are not modified by Go.

The floating-point status register (FPSR) is treated like a scratch
register and not preserved across calls.

At calls, the floating-point control register (FPCR) bits are always
set as follows:

| Flag | Bit | Value | Meaning |
| --- | --- | --- | --- |
| DN | 25 | 0 | Propagate NaN operands |
| FZ | 24 | 0 | Do not flush to zero |
| RC | 23/22 | 0 (RN) | Round to nearest, choose even if tied |
| IDE | 15 | 0 | Denormal operations trap disabled |
| IXE | 12 | 0 | Inexact trap disabled |
| UFE | 11 | 0 | Underflow trap disabled |
| OFE | 10 | 0 | Overflow trap disabled |
| DZE | 9 | 0 | Divide-by-zero trap disabled |
| IOE | 8 | 0 | Invalid operations trap disabled |
| NEP | 2 | 0 | Scalar operations do not affect higher elements in vector registers |
| AH | 1 | 0 | No alternate handling of de-normal inputs |
| FIZ | 0 | 0 | Do not zero de-normals |

*Rationale*: Having a fixed FPCR control configuration allows Go
functions to use floating-point and vector (SIMD) operations without
modifying or saving the FPCR.
Functions are allowed to modify it between calls (as long as they
restore it), but as of this writing Go code never does.

### loong64 architecture

The loong64 architecture uses R4 – R19 for integer arguments and results.

It uses F0 – F15 for floating-point arguments and results.

Registers R20 – R21, R23 – R28, R30 – R31, and F16 – F31 are permanent
scratch registers.

Register R2 is reserved and never used.

Registers R20 and R21 are used by runtime.duffcopy and runtime.duffzero.

Special-purpose registers used within Go generated code and Go assembly code
are as follows:

| Register | Call meaning | Return meaning | Body meaning |
| --- | --- | --- | --- |
| R0 | Zero value | Same | Same |
| R1 | Link register | Link register | Scratch |
| R3 | Stack pointer | Same | Same |
| R20,R21 | Scratch | Scratch | Used by duffcopy, duffzero |
| R22 | Current goroutine | Same | Same |
| R29 | Closure context pointer | Same | Same |
| R30, R31 | Used by the assembler | Same | Same |

*Rationale*: These register meanings are compatible with Go’s stack-based
calling convention.

#### Stack layout

The stack pointer, R3, grows down and is aligned to 8 bytes.

A function's stack frame, after the frame is created, is laid out as
follows:

    +------------------------------+
    | ... locals ...               |
    | ... outgoing arguments ...   |
    | return PC                    | ← R3 points to
    +------------------------------+ ↓ lower addresses

This stack layout is used by both register-based (ABIInternal) and
stack-based (ABI0) calling conventions.

The "return PC" is loaded to the link register, R1, as part of the
loong64 `JAL` operation.

#### Flags

All bits in CSR are system flags and are not modified by Go.

### ppc64 architecture

The ppc64 architecture uses R3 – R10 and R14 – R17 for integer arguments
and results.

It uses F1 – F12 for floating-point arguments and results.

Register R31 is a permanent scratch register in Go.

Special-purpose registers used within Go generated code and Go
assembly code are as follows:

| Register | Call meaning | Return meaning | Body meaning |
| --- | --- | --- | --- |
| R0 | Zero value | Same | Same |
| R1 | Stack pointer | Same | Same |
| R2 | TOC register | Same | Same |
| R11 | Closure context pointer | Scratch | Scratch |
| R12 | Function address on indirect calls | Scratch | Scratch |
| R13 | TLS pointer | Same | Same |
| R20,R21 | Scratch | Scratch | Used by duffcopy, duffzero |
| R30 | Current goroutine | Same | Same |
| R31 | Scratch | Scratch | Scratch |
| LR | Link register | Link register | Scratch |

*Rationale*: These register meanings are compatible with Go’s
stack-based calling convention.

The link register, LR, holds the function return
address at the function entry and is set to the correct return
address before exiting the function. It is also used
in some cases as the function address when doing an indirect call.

The register R2 contains the address of the TOC (table of contents) which
contains data or code addresses used when generating position independent
code. Non-Go code generated when using cgo contains TOC-relative addresses
which depend on R2 holding a valid TOC. Go code compiled with -shared or
-dynlink initializes and maintains R2 and uses it in some cases for
function calls; Go code compiled without these options does not modify R2.

When making a function call, R12 contains the function address for use by the
code to generate R2 at the beginning of the function. R12 can be used for
other purposes within the body of the function, such as trampoline generation.

R20 and R21 are used in duffcopy and duffzero, which could be generated
before arguments are saved, so they should not be used for register arguments.

The Count register CTR can be used as the call target for some branch instructions.
It holds the return address when preemption has occurred.

On PPC64, when a float32 is loaded it becomes a float64 in the register. This
differs from other platforms and must be recognized by the internal
implementation of reflection so that float32 arguments are passed correctly.

Registers R18 – R29 and F13 – F31 are considered scratch registers.

#### Stack layout

The stack pointer, R1, grows down and is aligned to 8 bytes in Go, but is
changed to 16 bytes when calling cgo.

A function's stack frame, after the frame is created, is laid out as
follows:

    +------------------------------+
    | ... locals ...               |
    | ... outgoing arguments ...   |
    | 24  TOC register R2 save     | When compiled with -shared/-dynlink
    | 16  Unused in Go             | Not used in Go
    |  8  CR save                  | nonvolatile CR fields
    |  0  return PC                | ← R1 points to
    +------------------------------+ ↓ lower addresses

The "return PC" is loaded to the link register, LR, as part of the
ppc64 `BL` operation.

On entry to a non-leaf function, the stack frame size is subtracted from R1 to
create its stack frame, and the value of LR is saved at the bottom of the frame.

A leaf function that does not require any stack space does not modify R1 and
does not save LR.

*NOTE*: We might need to save the frame pointer on the stack as
in the PPC64 ELF v2 ABI so Go can inter-operate with platform debuggers
and profilers.

This stack layout is used by both register-based (ABIInternal) and
stack-based (ABI0) calling conventions.

#### Flags

The condition register consists of 8 condition code register fields
CR0-CR7.
Go generated code only sets and uses CR0, commonly set by
compare functions and used to determine the target of a conditional
branch. The generated code does not set or use CR1-CR7.

The floating-point status and control register (FPSCR) is initialized
to 0 by the kernel at startup of the Go program and not changed by
the Go generated code.

### riscv64 architecture

The riscv64 architecture uses X10 – X17, X8, X9, X18 – X23 for integer arguments
and results.

It uses F10 – F17, F8, F9, F18 – F23 for floating-point arguments and results.

Special-purpose registers used within Go generated code and Go
assembly code are as follows:

| Register | Call meaning | Return meaning | Body meaning |
| --- | --- | --- | --- |
| X0 | Zero value | Same | Same |
| X1 | Link register | Link register | Scratch |
| X2 | Stack pointer | Same | Same |
| X3 | Global pointer | Same | Used by dynamic linker |
| X4 | TLS (thread pointer) | TLS | Scratch |
| X24,X25 | Scratch | Scratch | Used by duffcopy, duffzero |
| X26 | Closure context pointer | Scratch | Scratch |
| X27 | Current goroutine | Same | Same |
| X31 | Scratch | Scratch | Scratch |

*Rationale*: These register meanings are compatible with Go’s
stack-based calling convention. The context register will change from X20 to X26,
and the duffcopy and duffzero registers will change to X24 and X25, before this register ABI is adopted.
The order X10 – X17, X8, X9, X18 – X23 matches the order A0 – A7, S0 – S7 in the platform ABI.
The order F10 – F17, F8, F9, F18 – F23 matches the order FA0 – FA7, FS0 – FS7 in the platform ABI.
X8 – X23, F8 – F15 are used for compressed instructions (RVC), which will benefit code size in the future.

#### Stack layout

The stack pointer, X2, grows down and is aligned to 8 bytes.

A function's stack frame, after the frame is created, is laid out as
follows:

    +------------------------------+
    | ... locals ...               |
    | ... outgoing arguments ...   |
    | return PC                    | ← X2 points to
    +------------------------------+ ↓ lower addresses

The "return PC" is loaded to the link register, X1, as part of the
riscv64 `CALL` operation.

#### Flags

The riscv64 architecture has the Zicsr extension for control and status
registers (CSRs), which are treated as scratch registers.
All bits in CSR are system flags and are not modified by Go.

## Future directions

### Spill path improvements

The ABI currently reserves spill space for argument registers so the
compiler can statically generate an argument spill path before calling
into `runtime.morestack` to grow the stack.
This ensures there will be sufficient spill space even when the stack
is nearly exhausted and keeps stack growth and stack scanning
essentially unchanged from ABI0.

However, this wastes stack space (the median wastage is 16 bytes per
call), resulting in larger stacks and increased cache footprint.
A better approach would be to reserve stack space only when spilling.
One way to ensure enough space is available to spill would be for
every function to ensure there is enough space for the function's own
frame *as well as* the spill space of all functions it calls.
For most functions, this would change the threshold for the prologue
stack growth check.
For `nosplit` functions, this would change the threshold used in the
linker's static stack size check.

Allocating spill space in the callee rather than the caller may also
allow for faster reflection calls in the common case where a function
takes only register arguments, since it would allow reflection to make
these calls directly without allocating any frame.

The statically-generated spill path also increases code size.
It is possible to instead have a generic spill path in the runtime, as
part of `morestack`.
However, this complicates reserving the spill space, since spilling
all possible register arguments would, in most cases, take
significantly more space than spilling only those used by a particular
function.
Some options are to spill to a temporary space and copy back only the
registers used by the function, or to grow the stack if necessary
before spilling to it (using a temporary space if necessary), or to
use a heap-allocated space if insufficient stack space is available.
These options all add enough complexity that we will have to make this
decision based on the actual code size growth caused by the static
spill paths.

### Clobber sets

As defined, the ABI does not use callee-save registers.
This significantly simplifies the garbage collector and the compiler's
register allocator, but at some performance cost.
A potentially better balance for Go code would be to use *clobber
sets*: for each function, the compiler records the set of registers it
clobbers (including those clobbered by functions it calls) and any
register not clobbered by function F can remain live across calls to
F.

This is generally a good fit for Go because Go's package DAG allows
function metadata like the clobber set to flow up the call graph, even
across package boundaries.
Clobber sets would require relatively little change to the garbage
collector, unlike general callee-save registers.
One disadvantage of clobber sets over callee-save registers is that
they don't help with indirect function calls or interface method
calls, since static information isn't available in these cases.

### Large aggregates

Go encourages passing composite values by value, and this simplifies
reasoning about mutation and races.
However, this comes at a performance cost for large composite values.
It may be possible to instead transparently pass large composite
values by reference and delay copying until it is actually necessary.

## Appendix: Register usage analysis

In order to understand the impacts of the above design on register
usage, we
[analyzed](https://github.com/aclements/go-misc/tree/master/abi) the
impact of the above ABI on a large code base: cmd/kubelet from
[Kubernetes](https://github.com/kubernetes/kubernetes) at tag v1.18.8.

The following table shows the impact of different numbers of available
integer and floating-point registers on argument assignment:

```
|      |        |       |      stack args |          spills |     stack total |
| ints | floats | % fit | p50 | p95 | p99 | p50 | p95 | p99 | p50 | p95 | p99 |
|    0 |      0 |  6.3% |  32 | 152 | 256 |   0 |   0 |   0 |  32 | 152 | 256 |
|    0 |      8 |  6.4% |  32 | 152 | 256 |   0 |   0 |   0 |  32 | 152 | 256 |
|    1 |      8 | 21.3% |  24 | 144 | 248 |   8 |   8 |   8 |  32 | 152 | 256 |
|    2 |      8 | 38.9% |  16 | 128 | 224 |   8 |  16 |  16 |  24 | 136 | 240 |
|    3 |      8 | 57.0% |   0 | 120 | 224 |  16 |  24 |  24 |  24 | 136 | 240 |
|    4 |      8 | 73.0% |   0 | 120 | 216 |  16 |  32 |  32 |  24 | 136 | 232 |
|    5 |      8 | 83.3% |   0 | 112 | 216 |  16 |  40 |  40 |  24 | 136 | 232 |
|    6 |      8 | 87.5% |   0 | 112 | 208 |  16 |  48 |  48 |  24 | 136 | 232 |
|    7 |      8 | 89.8% |   0 | 112 | 208 |  16 |  48 |  56 |  24 | 136 | 232 |
|    8 |      8 | 91.3% |   0 | 112 | 200 |  16 |  56 |  64 |  24 | 136 | 232 |
|    9 |      8 | 92.1% |   0 | 112 | 192 |  16 |  56 |  72 |  24 | 136 | 232 |
|   10 |      8 | 92.6% |   0 | 104 | 192 |  16 |  56 |  72 |  24 | 136 | 232 |
|   11 |      8 | 93.1% |   0 | 104 | 184 |  16 |  56 |  80 |  24 | 128 | 232 |
|   12 |      8 | 93.4% |   0 | 104 | 176 |  16 |  56 |  88 |  24 | 128 | 232 |
|   13 |      8 | 94.0% |   0 |  88 | 176 |  16 |  56 |  96 |  24 | 128 | 232 |
|   14 |      8 | 94.4% |   0 |  80 | 152 |  16 |  64 | 104 |  24 | 128 | 232 |
|   15 |      8 | 94.6% |   0 |  80 | 152 |  16 |  64 | 112 |  24 | 128 | 232 |
|   16 |      8 | 94.9% |   0 |  16 | 152 |  16 |  64 | 112 |  24 | 128 | 232 |
|    ∞ |      8 | 99.8% |   0 |   0 |   0 |  24 | 112 | 216 |  24 | 120 | 216 |
```

The first two columns show the number of available integer and
floating-point registers.
The first row shows the results for 0 integer and 0 floating-point
registers, which is equivalent to ABI0.
We found that any reasonable number of floating-point registers has
the same effect, so we fixed it at 8 for all other rows.

The “% fit” column gives the fraction of functions where all arguments
and results are register-assigned and no arguments are passed on the
stack.
The three “stack args” columns give the median, 95th and 99th
percentile number of bytes of stack arguments.
The “spills” columns likewise summarize the number of bytes in
on-stack spill space.
And “stack total” summarizes the sum of stack arguments and on-stack
spill slots.
Note that these are three different distributions; for example,
there’s no single function that takes 0 stack argument bytes, 16 spill
bytes, and 24 total stack bytes.

From this, we can see that the fraction of functions that fit entirely
in registers grows very slowly once it reaches about 90%, though
curiously there is a small minority of functions that could benefit
from a huge number of registers.
Making 9 integer registers available on amd64 puts it in this realm.
We also see that the stack space required for most functions is fairly
small.
While the increasing space required for spills largely balances out
the decreasing space required for stack arguments as the number of
available registers increases, there is a general reduction in the
total stack space required with more available registers.
This does, however, suggest that eliminating spill slots in the future
would noticeably reduce stack requirements.
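
As an illustration of how a “% fit” figure like the one in the table could be computed, here is a minimal sketch in Go. It is not the analysis tool linked above. It assumes a deliberately simplified model in which each function is summarized by a hypothetical `funcSig` record giving the number of integer and floating-point registers its arguments and results would need, and it ignores the rules that force some values onto the stack regardless of register availability. The names `funcSig`, `fits`, and `fitFraction`, and the sample data, are invented for this example.

```go
package main

import "fmt"

// funcSig is a hypothetical summary of one function: the number of
// integer and floating-point registers that its arguments and results
// would need if all of them were register-assigned. (Assumption: the
// counts already take the larger of the argument and result needs.)
type funcSig struct {
	name   string
	ints   int
	floats int
}

// fits reports whether sig can be fully register-assigned given the
// number of available integer and floating-point registers.
func fits(sig funcSig, intRegs, floatRegs int) bool {
	return sig.ints <= intRegs && sig.floats <= floatRegs
}

// fitFraction returns the fraction of sigs that fit entirely in
// registers. It returns 0 for an empty input.
func fitFraction(sigs []funcSig, intRegs, floatRegs int) float64 {
	if len(sigs) == 0 {
		return 0
	}
	n := 0
	for _, s := range sigs {
		if fits(s, intRegs, floatRegs) {
			n++
		}
	}
	return float64(n) / float64(len(sigs))
}

func main() {
	// Made-up sample data, not taken from the kubelet analysis.
	sigs := []funcSig{
		{"f0", 0, 0},
		{"f1", 2, 0},
		{"f2", 3, 1},
		{"f3", 12, 0},
	}
	for _, intRegs := range []int{0, 2, 9} {
		fmt.Printf("ints=%d floats=8 fit=%.1f%%\n",
			intRegs, 100*fitFraction(sigs, intRegs, 8))
	}
}
```

In this model, a function with no arguments or results fits even with zero registers, which is consistent with the non-zero “% fit” value in the ABI0 row of the table.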