18 Jan 2014.
This is my writeup of how system calls work at the assembler level. This article specifically deals with syscalls on Linux on x86_64, but some of the material is more widely applicable.
I apologize for the AT&T assembler syntax, but all the GNU tools default to it. I figured insisting on Intel syntax everywhere would have made the writeup more confusing.
Hello World
Here's a simple "hello world" program:
#include <unistd.h> int main() { const char msg[] = "Hello world!\n"; write(STDOUT_FILENO, msg, sizeof(msg) - 1); return 0; }
(source code: hello1.c)
Note that sizeof(msg)
would return fourteen because the size of the string includes its NUL terminator.
The code above builds and runs as you'd expect:
$ gcc hello1.c $ ./a.out Hello world! $
But how is write()
actually implemented?
Like most syscalls, there isn't a write.c
in glibc
that implements it. Instead, the code is mechanically generated at build time:
$ cd glibc-build-dir $ make [...] (echo '#define SYSCALL_NAME write'; \ echo '#define SYSCALL_NARGS 3'; \ echo '#define SYSCALL_SYMBOL __libc_write'; \ echo '#define SYSCALL_CANCELLABLE 1'; \ echo '#include <syscall-template.S>'; \ echo 'weak_alias (__libc_write, __write)'; \ echo 'libc_hidden_weak (__write)'; \ echo 'weak_alias (__libc_write, write)'; \ echo 'libc_hidden_weak (write)'; \ ) | gcc [lots of args elided...]
It's interesting to read through syscall-template.S
and its dependencies, but it's much quicker to just look at the disassembly of the resulting object file:
$ objdump -d io/write.o [...] 0000000000000009 <__write_nocancel>: 9: b8 01 00 00 00 mov $0x1,%eax e: 0f 05 syscall 10: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax 16: 0f 83 00 00 00 00 jae 1c <__write_nocancel+0x13> 1c: c3 retq
Let's ignore the nocancel
and error handling bits for now, and implement our own write()
like this:
ssize_t my_write(int fd, const void *buf, size_t count) { asm ( "movl $1, %%eax\n\t" "syscall" ::: "eax" /* tell GCC the above clobbers the eax register */ ); }
(source code: hello2.c)
And it works!
The reason it works, in two parts: First, when main()
calls my_write()
, it uses the x86_64 function call convention where the first six arguments are passed in registers rdi
, rsi
, rdx
, rcx
, r8
, r9
.
Second, when my_write()
issues the syscall
instruction, the kernel is expecting the syscall number in rax
, and six arguments in the registers rdi
, rsi
, rdx
, r10
, r8
, r9
.
So all three of the arguments to the write
syscall are already in the correct registers, we just have to set rax
.
Some asides:
- We set
eax
(the 32-bit register that the 80386 had) instead ofrax
(the 64-bit register introduced in x86_64) because the former has a shorter encoding (by two bytes) and is functionally equivalent because it clears the upper half ofrax
anyway.
b8 01 00 00 00 mov $0x1,%eax 48 c7 c0 01 00 00 00 mov $0x1,%rax
- The inline assembler we specify in our C source is pasted directly into the assembler output that
cc1
produces, and that gets passed togas
. We need to include newlines to separate instructions, and (optionally) tabs for style. - Syscalls are limited to six arguments, whereas C functions can be passed additional arguments on the stack. Of course, there's a lot more to this, and if you're interested you should go and read the System V Application Binary Interface - AMD64 Architecture Processor Supplement.
Here's what main()
looks like:
main: .LFB1: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movl $13, %edx # <-- movl $msg, %esi # <-- movl $1, %edi # <-- call my_write # <-- movl $0, %eax popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc
The optimizer
If we turn on optimization, our code no longer works, because my_write()
gets inlined into main()
and instead of the above, we get:
main: .LFB1: .cfi_startproc #APP # 4 "hello2.c" 1 movl $1, %eax syscall # 0 "" 2 #NO_APP movl $0, %eax ret .cfi_endproc
The optimizer has helpfully optimized out all the register shuffling between the two functions it merged, so our syscall no longer gets any args!
The way to fix this is to explicitly tell GCC what value needs to be in which register, instead of relying on the machine's function call convention. We do this by specifying "constraints" on our inline assembler block:
ssize_t my_write(int fd, const void *buf, size_t count) { int64_t result; asm volatile ( "syscall" : /* outputs */ "=a" (result) : /* inputs */ "a" (__NR_write), // rax = syscall number "D" (fd), // rdi = arg1 "S" (buf), // rsi = arg2 "d" (count) // rdx = arg3 : /* clobbers */ "cc", "r11", "rcx" ); return 0; }
(source code: hello3.c)
Now our code works again, and survives inlining and optimization!
Some things to note:
- Instead of setting
rax
ourselves, we use an input constraint to set it to the syscall number we get fromasm/unistd_64.h
. This reduces the amount of icky AT&T assembler we have to write.
- We specify the three exact registers and which arguments they should take. If we expanded this to a syscall with a fourth argument, where the userland and syscall calling conventions disagree, inlining and optimizing would result in less register shuffling than calling a
glibc
function!
- Syscalls return a result in
rax
, and we use an output constraint to tie that to a local variable we can use later. Note that this local variable might not actually exist on the stack, it's just a way of capturing the syscall result and giving it a name. Its concrete representation will depend on the optimizer.
- Since we don't end up using the
result
variable, we have to mark theasm
block asvolatile
or GCC optimizes the entire thing away as a dead write!
- Input and output constraints use wacky single-character codes like uppercase
D
versus lowercased
instead of register names likerdi
andrdx
(respectively). If you'd like to learn more, read the GCC documentation on inline assembler and constraints.
- The kernel is allowed to change the values of
r11
,rcx
andEFLAGS
(which the GCC folks spellcc
). Specifying these as "clobbers" tells GCC that it needs to save and restore their values if it cares about them surviving across ourasm
block.
Our optimized main()
looks very clean now:
main: .LFB1: .cfi_startproc movl $1, %edi movl $13, %edx movl $msg, %esi movl %edi, %eax #APP # 7 "hello3.c" 1 syscall # 0 "" 2 #NO_APP xorl %eax, %eax ret
And everything except the syscall
instruction is coming from GCC's code generator instead of our inline assembler.
errno
We can use our output constraint above to correctly set errno
after the write()
syscall:
ssize_t my_write(int fd, const void *buf, size_t count) { int64_t result; asm ( "syscall" : /* outputs */ "=a" (result) : /* inputs */ "a" (__NR_write), // rax = syscall number "D" (fd), // rdi = arg1 "S" (buf), // rsi = arg2 "d" (count) // rdx = arg3 : /* clobbers */ "cc", "r11", "rcx" ); if (result >= -4095 && result <= -1) { errno = -result; return -1; } else { return result; } }
(source code: hello4.c)
putchar
Let's implement putchar()
and make it take an int
instead of a char
argument like the C89 standard says:
void my_putchar(int c) { char ch = c; my_write(STDOUT_FILENO, &ch, 1); } static const char msg1[] = "This is "; static const char msg2[] = "broken.\n"; int main() { write(STDOUT_FILENO, msg1, sizeof(msg1) - 1); my_putchar('n'); my_putchar('o'); my_putchar('t'); my_putchar(' '); write(STDOUT_FILENO, msg2, sizeof(msg2) - 1); return 0; }
(source code: putchar1.c)
You can tell what the punch-line is going to be:
$ gcc -O2 putchar1.c $ ./a.out This is broken. $
So what does the generated code look like?
my_putchar: .LFB1: .cfi_startproc leaq -1(%rsp), %rsi movl $1, %edi movl $1, %edx movl %edi, %eax #APP # 7 "putchar1.c" 1 syscall # 0 "" 2 #NO_APP ret
As always, the write
is inlined. my_putchar
is passed a single argument in rdi
and clobbers it. The buffer it passes to the write syscall is the next byte on the stack, but it doesn't set it to anything.
There's a subtle change we need to make to the input constraints for write
:
--- putchar1.c 2014-01-19 00:19:23.213002952 +1100 +++ putchar2.c 2014-01-19 00:24:25.697744948 +1100 @@ -12,6 +12,7 @@ "a" (__NR_write), // rax = syscall number "D" (fd), // rdi = arg1 "S" (buf), // rsi = arg2 + "m" (buf), // buf has to be a memory location "d" (count) // rdx = arg3 : /* clobbers */ "cc",
(source code: putchar2.c)
This gives us a working putchar
:
my_putchar: .LFB1: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movl $1, %eax movl $1, %edx movq %rsp, %rbp .cfi_def_cfa_register 6 andq $-32, %rsp addq $16, %rsp leaq -80(%rsp), %rsi # rsi = pointer to variable on stack movb %dil, -80(%rsp) # write %dil (DL in Intel) to that location movq %rsi, -48(%rsp) movl %eax, %edi #APP # 7 "putchar2.c" 1 syscall # 0 "" 2 #NO_APP leave .cfi_def_cfa 7, 8 ret