Recall strace: shows syscalls invoked during program execution.
malloc() does not appear in the strace output, because it is not a syscall. How does a process increase the size of its heap?
See the following lines in strace output:
brk(NULL) = 0x558d52610000
brk(0x558d52631000) = 0x558d52631000
brk(): changes the location of the program break, which defines the end of the process’s data segment (i.e., the program break is the first location after the end of the uninitialized data segment). Increasing the program break has the effect of allocating memory to the process; decreasing the break deallocates memory.
A program’s break is the address of the top of its heap. At the raw syscall level shown by strace, brk(NULL) gets the current program break and brk(addr) sets the break to addr. In the output above, the break moves from 0x558d52610000 to 0x558d52631000, growing the heap by 0x21000 bytes (132 KiB).
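The effect of moving the break can be observed from userspace. The following is a minimal sketch, assuming Linux with glibc; it uses sbrk(), the library call that adjusts the break by a relative amount, and grow_break is a hypothetical helper name chosen for illustration.

```c
#define _DEFAULT_SOURCE   /* exposes sbrk() in glibc headers */
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: raise the program break by `bytes` and report
 * how far it actually moved. Returns -1 if sbrk() fails. */
intptr_t grow_break(intptr_t bytes)
{
    void *before = sbrk(0);            /* sbrk(0) returns the current break */
    if (sbrk(bytes) == (void *)-1)     /* ask the kernel to move the break */
        return -1;
    void *after = sbrk(0);             /* read the break again */
    return (intptr_t)((char *)after - (char *)before);
}
```

Calling grow_break(4096) from main() and running the program under strace should show a matching brk() call. No library calls are made between the two sbrk(0) measurements, so nothing else moves the break in a single-threaded program.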
Program setup involves mapping in the C standard library:
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
mmap(0x7f6e911ec000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7f6e911ec000
close(3)
Recall from L05-ipc: use mmap() to create a file-backed mapping in virtual memory:
Region of virtual memory refers to a “snapshot” of the file in memory, i.e. a “page”.
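A file-backed mapping like the one the loader creates for libc can be sketched in a few lines. This is an illustrative example, assuming a POSIX system; read_via_mmap is a hypothetical helper name, and it maps the file read-only rather than with the exact flags shown in the strace output.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map the file at `path` read-only and copy up to
 * cap - 1 of its leading bytes into `out` as a NUL-terminated string.
 * Returns the number of bytes copied, or -1 on error or empty file. */
long read_via_mmap(const char *path, char *out, size_t cap)
{
    if (cap == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;

    size_t n = (size_t)st.st_size < cap - 1 ? (size_t)st.st_size : cap - 1;
    memcpy(out, p, n);         /* file contents are read as ordinary memory */
    out[n] = '\0';

    munmap(p, (size_t)st.st_size);
    return (long)n;
}
```

Note that the descriptor is closed right after mmap(), mirroring the openat/mmap/close sequence in the strace output.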
All regions of virtual memory map to a page of physical memory:
All processes have a direct mapping to shared kernel code & data. task_struct lists, per-process kernel stacks, and file sharing data structures are stored here.
The kernel mapping is above the stack, but is not accessible to userspace programs because of memory protection. How do syscalls safely access the kernel mapping?
CPU runs with a privilege mode that determines what kind of operations it can perform:
A userspace program (running in user mode) can issue a syscall (e.g. read()) to enter kernel mode so that it can perform a privileged operation (e.g. issue an I/O request).
Syscalls act as predefined entry-points into the kernel. They allow userspace programs to “trap” into the kernel to perform a privileged operation and then “return-from-trap” back to user mode.
Need some sort of indicator to tell the processor to stop executing user code and trap into the kernel to perform privileged operations.
Three kinds of interrupts:
Software interrupts (e.g. int 0x80, debugger). int: raises a software interrupt.
Simplified CPU hardware execution loop:
while (1) {
    if (interrupt or exception) {
        n = interrupt/exception type
        call interrupt handler n
    }
    fetch next instruction
    if (instruction == int n)
        call interrupt handler n
    else
        run instruction
}
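Note: do not confuse interrupts with userspace signals (e.g. SIGFPE, SIGSEGV)! Interrupts and exceptions are handled by the kernel; signals are what the kernel may then deliver to a process.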
User program invokes read():
Puts __NR_read into the %eax register and raises interrupt 0x80 (system call)
CPU looks up 0x80 in the Interrupt Descriptor Table (IDT)
CPU invokes 0x80’s handler: system_call()
system_call() looks up __NR_read in sys_call_table
system_call() invokes __NR_read’s handler: sys_read()
Kernel performs read()’s real work in sys_read()
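The role of the syscall number can be seen from userspace by bypassing the read() wrapper. The following is a minimal sketch, assuming Linux with glibc; raw_read is a hypothetical helper name. SYS_read is glibc's name for __NR_read, and syscall() is glibc's generic wrapper, which uses whatever entry mechanism the architecture provides rather than necessarily int 0x80.

```c
#define _DEFAULT_SOURCE   /* exposes syscall() in glibc headers */
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invoke the read syscall by number instead of
 * calling the read() library wrapper. Returns the byte count, or -1
 * with errno set on failure (same convention as read()). */
long raw_read(int fd, void *buf, size_t count)
{
    /* The kernel uses SYS_read (__NR_read) to pick the handler
     * from its syscall table. */
    return syscall(SYS_read, fd, buf, count);
}
```

Under strace, a call to raw_read() shows up as an ordinary read(), since strace reports the syscall number the kernel received rather than the library function that was called.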
Notes:
int 0x80 is how syscalls were invoked in 32-bit x86. The process varies in 64-bit/other architectures.
The x86 entry code lives in /arch/x86/entry.
sys_read() implementation in /fs/read_write.c:
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    return ksys_read(fd, buf, count);
}
Syscall parameters are passed via registers; larger arguments are passed by pointer (e.g. struct sigaction *).
We can’t let user programs trick the kernel using malicious addresses:
read()/write(): what if buf actually points to kernel memory?
System calls must validate pointer parameters before copying:
// /include/linux/uaccess.h
static __always_inline unsigned long __must_check
copy_to_user(void __user *to, const void *from, unsigned long n);
static __always_inline unsigned long __must_check
copy_from_user(void *to, const void __user *from, unsigned long n);
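The result of this validation is visible from userspace: when the copy to the user buffer fails, the syscall returns an error instead of writing to the address. The following is a minimal sketch, assuming Linux; read_into_forbidden_buffer is a hypothetical helper name, and a PROT_NONE mapping stands in for an address the process is not allowed to touch.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS in glibc headers */
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to read() into a buffer the
 * process may not access. Returns the errno observed, 0 if the read
 * unexpectedly succeeded, or -1 if the setup itself failed. */
int read_into_forbidden_buffer(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;
    if (write(fds[1], "data", 4) != 4) {   /* give read() something to copy */
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* A mapping with no access permissions: any copy into it must fail. */
    void *bad = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (bad == MAP_FAILED) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    errno = 0;
    ssize_t n = read(fds[0], bad, 4);      /* kernel cannot copy to `bad` */
    int err = (n < 0) ? errno : 0;

    munmap(bad, 4096);
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

On Linux the expected result is EFAULT ("Bad address"): the process keeps running and only the syscall fails, in contrast to dereferencing the same pointer directly in userspace, which would raise SIGSEGV.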
Last updated: 2023-02-19