******************************************************************************
* system calls, cont.

A syscall is code running inside the kernel, in privileged mode, on behalf
of the user.

syscall: open(filename, mode), read(fd, buf, len)
libc:    malloc(size), printf(...)

both cases: fxn has a name, takes 0+ args, may return a value, and may have
side effects (malloc creates mem, write modifies a file).

But syscalls are more expensive.  The kernel executes code for the user as if
the kernel were a "remote" entity.  Not much different from executing network
remote procedure calls (RPCs): pack fxn name and args, ship them to the
remote side, wait for a response w/ errors and/or return data/values.

(A) In userland

1. process invokes system call foo(arg1, arg2, arg3)

2. this invokes a libc "wrapper" for the syscall:
   syscall3(NR_FOO, arg1, arg2, arg3)
   - prog has to pass to the kernel the args and the "id" of the syscall
     (a fixed number is faster than a string)
   - need a shared medium for prog to pass info to kernel: memory, cpu
     registers.
   - for every syscall there's an "NR_XXX" #define in a system header:
     kernel and user/libc MUST agree on these numbers!

3. libc wrapper stores NR_FOO into register 1 (R1), then args into R2, R3, ...

4. libc wrapper invokes a special syscall interrupt, to tell the kernel to
   run the syscall.  On 32-bit x86 Linux this is software interrupt 0x80
   ("int 80h"); MS-DOS used "int 21h" the same way.

(B) In kernel mode

1. CPU is stopped from executing whatever it's running (the user program)

2. CPU register state is preserved (e.g., in some memory)

3. invoke syscall interrupt handler

4. put user process to sleep (WAIT state)
   - possible to defer this till we know how expensive the actual syscall is
     (e.g., would it have any I/O?  some syscalls don't have any I/O, like
     getpid(2)).

5. syscall handler gets the preserved info about the syscall: namely the
   syscall number and its arguments.

6. based on syscall#, look up the handler in a dispatch table of syscall
   functions; set up a stack frame with arg1/arg2/arg3/etc. as passed from
   the user program.

7.
   Invoke the function, named like sys_FOO(args...).  May be prefixed by
   "asmlinkage".

8. system call is running.... (in kernel mode, while process is sleeping)
   - we'll discuss this in great detail in later lectures.
   - syscall runs in kernel mode, on its own kernel stack (separate from the
     process's user stack).

9. once entry fxn for syscall is done, it can simply "return val".

10. syscall "dispatch" code prepares to restart the process that invoked the
    syscall.
    - puts return val into a pre-designated register, say, R6 (or mem of
      process).
    - moves process from WAIT to READY
    - yields to the scheduler

11. at some point, scheduler will pick this process and run it.
    - restore state of CPU registers, including "R6" where retval is stored.
    - switch context to run the process

(C) Back in userland

5. we're still inside the libc wrapper
   - when a syscall fails, the wrapper returns -1.
   - to find out in userland what the error was, consult the global "errno"
     variable.  errno is global in libc, for a given user process.
   - however, the (Linux) kernel returns negative error numbers: -ENOENT,
     -ENOMEM, -EPERM; and on success, it returns 0 or positive numbers.
   - last task of libc wrapper is to check actual retval from kernel
     1. if less than 0, set "errno = abs(retval)" and return -1
     2. else return retval.
   - now we return from libc wrapper and we're back at the next instruction
     after "foo(args...)"

* Notes

In Linux, "struct task_struct" records the state of a running process.  When
running a system call, Linux makes the currently executing task available
under a fixed name, "current" (a struct task_struct *).

Side effects of a system call: it can "return" data by reading/writing
user-provided buffers.

On some architectures, there aren't enough registers to hold all args of
syscalls with many params (e.g., 5-6 or more args).  So put the args in a
shared mem buffer in the user program, and pass the start addr of that buf in
a single reg.  The kernel has to know, for such a syscall, to look for its
args in user mem, not registers.
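The round trip above (number + args in, dispatch table lookup, negative
retval translated to errno) can be sketched in plain userland C.  This is
only a simulation under stated assumptions: there is no real trap or mode
switch, and every name here (NR_GETANSWER, NR_DOUBLE, NR_OPENFAIL,
syscall_table, kernel_entry, my_syscall3) is made up for illustration.

```c
#include <errno.h>

/* Hypothetical syscall numbers: "kernel" and "libc" must agree on them. */
enum { NR_GETANSWER = 0, NR_DOUBLE = 1, NR_OPENFAIL = 2, NR_MAX = 3 };

typedef long (*syscall_fn)(long, long, long);

/* Hypothetical "kernel" entry functions: >= 0 on success, -Exxx on error. */
static long sys_getanswer(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return 42;
}

static long sys_double(long a, long b, long c)
{
    (void)b; (void)c;
    if (a < 0)
        return -EINVAL;     /* keep successful results non-negative */
    return 2 * a;
}

static long sys_openfail(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return -ENOENT;         /* always fails, to show the error path */
}

/* Dispatch table indexed by syscall number (step B6 in the notes). */
static const syscall_fn syscall_table[NR_MAX] = {
    [NR_GETANSWER] = sys_getanswer,
    [NR_DOUBLE]    = sys_double,
    [NR_OPENFAIL]  = sys_openfail,
};

/* Stand-in for the trap into the kernel: validate the number, dispatch. */
static long kernel_entry(long nr, long a1, long a2, long a3)
{
    if (nr < 0 || nr >= NR_MAX)
        return -ENOSYS;
    return syscall_table[nr](a1, a2, a3);
}

/* "libc wrapper": translate the kernel's negative retval into errno + -1. */
long my_syscall3(long nr, long a1, long a2, long a3)
{
    long ret = kernel_entry(nr, a1, a2, a3);

    if (ret < 0) {
        errno = (int)-ret;
        return -1;
    }
    return ret;
}
```

For example, my_syscall3(NR_DOUBLE, 21, 0, 0) returns 42, while
my_syscall3(NR_OPENFAIL, 0, 0, 0) returns -1 and leaves errno set to ENOENT.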
* writing kernel code

// entry point into the read syscall in kernel
int sys_read(unsigned int fd, char __user *buf, size_t count)
{
    // 1. check params passed in by the user program
    //    - is fd for an opened file?
    //    - is fd open for reading?
    //    - is fd num. above max allowed open fds (OSs set it to 1024 or
    //      the like)?
    //    - fd 0 is valid (stdin); fd 1 is stdout; fd 2 is stderr
    if (fd > FD_MAX)
        return -EBADF;  // read(2) reports a bad fd as EBADF
    if (fd not opened for reading)  // pseudocode
        return -EBADF;

    // verify that user buf, all the way to count bytes, is valid addr
    // space, and that kernel can WRITE to it.  (the write(2) syscall needs
    // READ access to buf).
    // can use access_ok(), and many other functions; verify_area() is the
    // older interface.
    if (verify_area(VERIFY_WRITE, buf, count) != 0)
        return -EFAULT;
    // mapping tables in kernel are in granularity of pages (4KB).
    // byte addr is 32 bits; page is 4KB (2^12); shift addr right (>> 12)
    // to find the page# to consult in the mapping tables.
    // general: for any buf and len, convert first byte buf[0] and last byte
    // buf[count-1] into page addrs, and verify EACH page addr from start to
    // end, including any intermediate pages.

    // 2. perform some initializations

    // 3. actually perform the read

    // 4. post conditions to check, verify, and return from call
}

// imaginary syscall to copy file1 to file2, len bytes
int sys_cp(char *file1, char *file2, u_int len)
{
    void *buf;
    struct file *fp1, *fp2;

    // 1. verify valid params

    // 2. initializations
    // open file1 for reading
    fp1 = filp_open(file1, O_RDONLY, ...);
    if (fp1 == NULL) {
        // then filp_open failed (WARNING: not quite; the real filp_open()
        // reports failure as an ERR_PTR() value, tested with IS_ERR())
        return -ENOENT; // or return some other error
    }
    // open file2 for writing
    fp2 = filp_open(file2, O_WRONLY, ...);
    if (fp2 == NULL) {
        // then filp_open failed (WARNING: not quite, as above)
        filp_close(fp1);
        return -ENOENT; // or return some other error
    }
    // allocate buffer of "len" to read/write bytes
    buf = kmalloc(len, ...);
    if (buf == NULL) {
        // kmalloc failed: undo everything set up so far, in reverse order
        filp_close(fp2);
        filp_close(fp1);
        return -ENOMEM;
    }
    // if some other init here, will need to kfree + filp_close x 2

    // 3.
    //    actually do the work: copy len bytes from file1 to file2.

    // 4. cleanup (reverse order of the initializations)
    kfree(buf);
    filp_close(fp2);
    filp_close(fp1);
    return 0;
}
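Every new initialization in sys_cp() adds one more resource to release on
each later error path.  One common way to keep that manageable is a single
exit path with goto labels that unwind in reverse order.  Below is a minimal
userland sketch of that pattern, assuming stdio and malloc as stand-ins for
filp_open/kmalloc; copy_prefix() and its fixed error codes are hypothetical,
not kernel API.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical userland analog of the notes' sys_cp(): copy up to len bytes
 * from src to dst.  Returns 0 on success or a negative errno-style code.
 */
int copy_prefix(const char *src, const char *dst, size_t len)
{
    FILE *fin, *fout;
    char *buf;
    size_t n;
    int err = 0;

    /* 1. verify params */
    if (src == NULL || dst == NULL || len == 0)
        return -EINVAL;

    /* 2. initializations; each failure jumps to the matching undo label */
    fin = fopen(src, "rb");
    if (fin == NULL) {
        err = -ENOENT;      /* simplification: one fixed code, as in the notes */
        goto done;
    }
    fout = fopen(dst, "wb");
    if (fout == NULL) {
        err = -EACCES;      /* simplification, see above */
        goto close_in;
    }
    buf = malloc(len);
    if (buf == NULL) {
        err = -ENOMEM;
        goto close_out;
    }

    /* 3. do the work: a short read (src smaller than len) is not an error */
    n = fread(buf, 1, len, fin);
    if (fwrite(buf, 1, n, fout) != n)
        err = -EIO;

    /* 4. cleanup, in reverse order of setup; error paths join in below */
    free(buf);
close_out:
    if (fclose(fout) != 0 && err == 0)
        err = -EIO;         /* buffered data may fail to flush on close */
close_in:
    fclose(fin);
done:
    return err;
}
```

Adding another initialization then means adding one label, instead of
editing the cleanup code in every earlier error branch.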