******************************************************************************
* system calls, cont.

code running inside kernel, in privileged mode, on behalf of user.

syscall: open(filename, mode), read(fd, buf, len)
libc:    malloc(size), printf(...)

both cases: fxn has a name, takes 0+ args, may return a value, and may have
side effects (malloc allocates mem, write modifies a file).  But syscalls are
more expensive: each one crosses from user mode into the kernel and back.

Kernel executes code for user as if the kernel is a "remote" entity.  Not
much different than executing network remote procedure calls (RPCs): pack
fxn name and args, ship it to remote side, wait for response w/ errors
and/or return data/values.

(A) In userland

1. process invokes system call foo(arg1, arg2, arg3)
2. invoking a libc "wrapper" for syscall: syscall3(NR_FOO, arg1, arg2, arg3)
- prog has to pass to kernel the args and "id" of syscall (fixed number is
  faster than a string)
- need a shared medium for prog. to pass info to kernel: memory, cpu
  registers.
- for every syscall there's an "NR_XXX" #define in some system header like
  <sys/syscall.h>: kernel and user/libc MUST agree on these numbers!
3. libc wrapper stores NR_FOO into register 1 (R1), then args into R2,
   R3, ...
4. libc wrapper will invoke a special syscall interrupt, to tell kernel to
   run the syscall.  On 32-bit x86 Linux this is "int 0x80" (MS-DOS used
   "int 21h" in the same way).
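
Sketch of steps 1-4 from the user's side (not from the lecture; assumes Linux
with glibc).  glibc exposes a generic wrapper, syscall(2), that takes the
syscall number followed by the args, much like the "syscall3(NR_FOO, ...)"
idea above.  The helper name raw_getpid is made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* makes sure glibc declares syscall() */
#endif
#include <sys/syscall.h>   /* SYS_getpid and the other syscall numbers */
#include <unistd.h>        /* syscall(), getpid() */

/* Invoke getpid by number, bypassing the named libc wrapper.
 * syscall() loads the number and args into registers and traps into
 * the kernel, which is what steps 3 and 4 describe. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Both paths end up in the same kernel function, so raw_getpid() and getpid()
should report the same pid.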

(B) In kernel mode

1. CPU is stopped from executing whatever it's running (the user program)
2. CPU register state is preserved (e.g., in some memory)
3. invoke syscall interrupt handler
4. put user process to sleep (WAIT state)
- possible to defer this till we know how expensive the actual syscall is
  (e.g., would it have any I/O? some syscalls don't have any I/O, like
  getpid(2)).
5. syscall handler gets the preserved info about the syscall: namely the
   syscall number and its arguments.
6. based on syscall#, invoke a syscall dispatch table of syscall functions,
   setup stack frame with content of arg1/arg2/arg3/etc. as passed from user
   program.
7. Invokes function named like sys_FOO(args...).  May be prefixed by
   "asmlinkage".
8. system call is running.... (in kernel mode, while process is sleeping)
- we'll discuss this in great detail in later lectures.
- syscall runs in kernel mode on the same CPU that took the trap, and uses the
  process's own kernel stack (separate from its user stack).
9. once entry fxn for syscall is done, it can simply "return val".
10. syscall "patch" code prepares to restart process that invoked syscall.
- puts return val into a pre-designated register, say, R6 (or mem of process).
- move process from WAIT to READY
- yield to the scheduler
11. at some point, scheduler will pick this process and run it.
- restore state of CPU registers, including "R6" where retval is stored.
- switch context to run the process
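
Sketch of step 6, the dispatch table (a user-space toy, not kernel code).  The
numbers NR_GETPID/NR_ADD, the *_demo functions, and dispatch() are all
hypothetical names; a real kernel has one table entry per syscall and takes
the number and args from the saved registers.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical syscall numbers; user side and "kernel" side must agree. */
enum { NR_GETPID = 0, NR_ADD = 1, NR_MAX = 2 };

/* Every entry has the same shape, so the table is an array of fxn pointers. */
typedef long (*syscall_fn)(long a1, long a2, long a3);

static long sys_getpid_demo(long a1, long a2, long a3)
{
    (void)a1; (void)a2; (void)a3;
    return 42;                      /* pretend pid */
}

static long sys_add_demo(long a1, long a2, long a3)
{
    (void)a3;
    return a1 + a2;
}

static const syscall_fn dispatch_table[NR_MAX] = {
    [NR_GETPID] = sys_getpid_demo,
    [NR_ADD]    = sys_add_demo,
};

/* Look up the handler by number; unknown numbers fail with -ENOSYS. */
static long dispatch(unsigned long nr, long a1, long a2, long a3)
{
    if (nr >= NR_MAX || dispatch_table[nr] == NULL)
        return -ENOSYS;
    return dispatch_table[nr](a1, a2, a3);
}
```

Indexing an array by number is a constant-time lookup, which is why a fixed
syscall number beats matching on a string name.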

(C) Back in userland

5. we're still inside libc wrapper
- when syscall fails, the wrapper returns -1 to the caller.
- to find out in userland what was the error, consult the "errno"
  variable.  errno lives in libc, per user process (modern libcs keep one
  copy per thread).
- however, the (Linux) kernel returns negative error numbers: -ENOENT,
  -ENOMEM, -EPERM; and on success, it returns 0 or positive numbers.
- last task of libc wrapper is to check actual retval from kernel
	1. if less than 0, set "errno = abs(retval)" and return -1
	2. else return retval.
- now we return from libc wrapper and we're back to next instruction after
  "foo(args...)"

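Sketch of the wrapper's last task (the function name wrapper_finish is made
up; a real libc does this inside each wrapper, often in assembly).  It takes
the raw kernel return value and applies the two rules above.

```c
#include <errno.h>

/* Convert a raw kernel return value into the libc convention:
 * negative -> store the positive error code in errno and return -1,
 * otherwise pass the value through unchanged. */
static long wrapper_finish(long raw)
{
    if (raw < 0) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

E.g., a raw -ENOENT becomes "return -1" with errno == ENOENT, while a raw 7
(say, 7 bytes read) is returned as 7 and errno is left alone.
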
* Notes

In Linux, "struct task_struct" records the state of a process.  While a
system call runs, kernel code reaches the task that invoked it through the
fixed name "current" (a "struct task_struct *").

Side effects of system call: can "return" data by reading/writing into user
provided buffers.

On some architectures, there aren't enough registers to hold all args of
syscalls with many params (e.g., 5-6 or more args).  So use a shared mem
buffer in user program, and pass start addr of that buf in a single reg.
Kernel has to know, for such a syscall, to look for its args in user mem
not registers.
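
Sketch of the args-in-memory trick (user-space toy; struct many_args,
sys_sum6_demo, and sum6 are hypothetical names).  The caller packs six args
into one block and hands over only the block's address, standing in for the
single register.

```c
#include <stdint.h>

/* Hypothetical argument block living in the user program's memory. */
struct many_args {
    long a[6];
};

/* "Kernel" side: gets one register-sized value and must know that, for
 * this call, it is the address of the arg block rather than an arg. */
static long sys_sum6_demo(uintptr_t reg)
{
    const struct many_args *args = (const struct many_args *)reg;
    long sum = 0;

    for (int i = 0; i < 6; i++)
        sum += args->a[i];
    return sum;
}

/* "User" side wrapper: pack the args, pass the block's start address. */
static long sum6(long a, long b, long c, long d, long e, long f)
{
    struct many_args args = { { a, b, c, d, e, f } };

    return sys_sum6_demo((uintptr_t)&args);
}
```

A real kernel cannot just dereference that address: it first has to validate
the range and copy the block in from user memory (copy_from_user() in Linux).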

* writing kernel code

// entry point into the read syscall in kernel
int sys_read(unsigned int fd, char __user * buf, size_t count)
{
  // 1. check params passed in by the user program
  // test if fd is for an opened file
  // test that fd is open for reading
  // is fd num. at or above max allowed open fds (OSs set it to 1024 or the like)
  // fd 0 == stdin (valid); fd 1 == stdout; fd 2 == stderr
  if (fd >= FD_MAX)   // valid fds are 0 .. FD_MAX-1
    return -EBADF;    // read(2) reports a bad fd as EBADF
  if (fd not opened for reading)
    return -EBADF;    // read(2) uses EBADF for this case too, not EPERM

  // verify that user buf, all the way to count bytes, is valid addr space, and
  // that kernel can WRITE to it. (the write(2) syscall needs READ access to buf).
  // verify_area() is the older interface (returns 0 if OK); newer kernels use
  // access_ok() instead, which returns nonzero if OK
  if (verify_area(VERIFY_WRITE, buf, count) != 0)
    return -EFAULT;
  // mapping tables in kernel are in granularity of pages (4KB)
  // byte addr is 32-bits; 4KB (2^12); shift addr right >>12 find page# to
  // consult mapping tables.
  // general: for any buf and len, convert first buf[0], and last byte
  // buf[count-1] into page addrs, and verify EACH page addr from start to
  // end, including any intermediate pages.

  // 2. perform some initializations

  // 3. actually perform the read

  // 4. post conditions to check, verify, and return from call
}
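
Sketch of the page-by-page check described in the comments above (user-space
toy, not the kernel's actual code).  PAGE_SHIFT_DEMO, range_ok, and the
page_ok callback are hypothetical; the callback stands in for "consult the
mapping tables for this page#".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_DEMO 12   /* 4KB pages: page# = addr >> 12 */

/* Stand-in for a mapping-table lookup: is this page usable? */
typedef bool (*page_check_fn)(uintptr_t page_no);

/* Check every page touched by [addr, addr + count - 1]. */
static bool range_ok(uintptr_t addr, size_t count, page_check_fn page_ok)
{
    uintptr_t first, last, p;

    if (count == 0)
        return true;                 /* nothing to access */
    if (addr + (count - 1) < addr)
        return false;                /* range wraps around the addr space */

    first = addr >> PAGE_SHIFT_DEMO;               /* page of buf[0] */
    last  = (addr + count - 1) >> PAGE_SHIFT_DEMO; /* page of buf[count-1] */
    for (p = first; p <= last; p++) {
        if (!page_ok(p))
            return false;
    }
    return true;
}

/* Example policy for trying it out: only pages 0..2 are mapped. */
static bool demo_page_ok(uintptr_t page_no)
{
    return page_no <= 2;
}
```

Note that a 2-byte buffer starting at 0x0FFF spans two pages (0 and 1), so
checking only the first byte's page is not enough.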

// imaginary syscall to copy file1 to file2, len bytes
int sys_cp(char *file1, char *file2, u_int len)
{
  void *buf;
  struct file *fp1, *fp2; // both are pointers ("*fp1, fp2" makes fp2 a struct)

  // 1. verify valid params

  // 2. initializations
  // open file1 for reading
  fp1 = filp_open(file1, O_RDONLY, ...);
  if (IS_ERR(fp1)) { // filp_open never returns NULL; it encodes -errno in the ptr
    return PTR_ERR(fp1); // pass the actual error back
  }
  // open file2 for writing
  fp2 = filp_open(file2, O_WRONLY, ...);
  if (IS_ERR(fp2)) { // same check as for fp1
    filp_close(fp1);
    return PTR_ERR(fp2); // pass the actual error back
  }
  // allocate buffer of "len" to read/write bytes
  buf = kmalloc(len, ...);
  if (buf == NULL) { // kmalloc failed
    filp_close(fp2);
    filp_close(fp1);
    return -ENOMEM;
  }
  // if some other init here, will need to kfree + filp_close x 2

  // 3. actually doing the work
  // copy len bytes from file1 to file 2.

  // 4. cleanup
  kfree(buf);
  filp_close(fp2);
  filp_close(fp1);
  return 0; // success
}
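
Sketch of a common way to avoid repeating the cleanup calls at every failure
point: undo in reverse order through goto labels.  This is a user-space
analogue, not kernel code: fopen/malloc stand in for filp_open/kmalloc, and
the name copy_prefix is made up.  It returns 0 or a negative errno value,
like the kernel convention above.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Copy up to len bytes from src to dst. */
static int copy_prefix(const char *src, const char *dst, size_t len)
{
    int err = 0;
    FILE *fin, *fout;
    char *buf;
    size_t got;

    fin = fopen(src, "rb");
    if (fin == NULL) {
        err = -errno;
        goto out;
    }
    fout = fopen(dst, "wb");
    if (fout == NULL) {
        err = -errno;
        goto out_close_in;
    }
    buf = malloc(len ? len : 1);    /* avoid a zero-size request */
    if (buf == NULL) {
        err = -ENOMEM;
        goto out_close_out;
    }

    /* the actual work */
    got = fread(buf, 1, len, fin);
    if (fwrite(buf, 1, got, fout) != got)
        err = -EIO;

    /* cleanup: each label undoes one init step, in reverse order */
    free(buf);
out_close_out:
    if (fclose(fout) != 0 && err == 0)
        err = -EIO;                 /* flush on close can fail */
out_close_in:
    fclose(fin);
out:
    return err;
}
```

Adding another init step then means adding one label, instead of editing
every earlier error path.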