Shell Lab [Updated 7/28/03] (README, Writeup, Release Notes, Self-Study Handout)

Students implement their own simple Unix shell program with job control, including the ctrl-c and ctrl-z keystrokes, fg, bg, and jobs commands. This is the students’ first introduction to application level concurrency, and gives them a clear idea of Unix process control, signals, and signal handling.

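Preface

This post walks through the CSAPP Shell Lab in detail and implements a simple (read: crude) shell called tsh. tsh supports the following features:

Running external programs
Four built-in commands, with these names and behaviors:
- quit: exit the shell
- jobs: list all background jobs
- bg <job>: resume a stopped background job and keep it running in the background; <job> can be a PID or a %JID
- fg <job>: move a running or stopped background job to the foreground and continue running it there
Pressing ctrl + c terminates the foreground job
Pressing ctrl + z stops the foreground job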

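The lab handout already provides some functions; we only need to implement the following core ones:

eval: parse and execute a command line
builtin_cmd: recognize and execute built-in commands
do_bgfg: execute the fg and bg commands
waitfg: block the shell until the given job is no longer the foreground job
sigchld_handler: catch SIGCHLD
sigint_handler: catch SIGINT
sigtstp_handler: catch SIGTSTP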

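Checking the theory

Q1: Does wait mean waiting for a child process to terminate so that the parent can reap it?

Q2: Does kill mean that the parent kills a child process?

If you answered yes to either question, it is well worth going through the textbook or the slides again. man wait and man kill are also excellent references (all of them recommended).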

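Signal handlers

sigint_handler and sigtstp_handler

The main job of these two functions is to "forward" a SIGINT or SIGTSTP that arrives at the shell to the job running in the shell. This is easy: use fgpid to get the pid of the foreground job (why only the foreground job? Because SIGINT and SIGTSTP typed at the keyboard are meant for the foreground job only), then call kill to send the same signal to that child's process group.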

/*
 * sigint_handler - The kernel sends a SIGINT to the shell whenever the
 *    user types ctrl-c at the keyboard.  Catch it and send it along
 *    to the foreground job.
 */
void sigint_handler(int sig) {
  int old_errno = errno;
  pid_t pid = fgpid(jobs);
  if (pid > 0) {
    kill(-pid, sig);
  }
  errno = old_errno;
}

/*
 * sigtstp_handler - The kernel sends a SIGTSTP to the shell whenever
 *     the user types ctrl-z at the keyboard. Catch it and suspend the
 *     foreground job by sending it a SIGTSTP.
 */
void sigtstp_handler(int sig) {
  int old_errno = errno;
  pid_t pid = fgpid(jobs);
  if (pid > 0) {
    kill(-pid, sig);
  }
  errno = old_errno;
}

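Q3: Why kill(-pid, sig)?

If the child forked by the shell never forks children of its own, passing pid works fine. If it does fork (so the shell now has grandchildren), the child and the grandchild have different pids. A positive pid only guarantees that the child receives the signal; the grandchild is not so lucky. It loses its parent (becomes an orphan) and keeps running once the operating system "adopts" it. A negative argument makes kill signal every process in the process group whose ID is pid, which covers the grandchild as well.

This also shows that kill merely sends a signal to a process; the signal is not necessarily SIGKILL.

Aside: SIGKILL cannot be caught, blocked, or ignored.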
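The difference between a positive and a negative pid can be observed directly. The following is a minimal sketch, not part of tsh.c: the helper name group_kill_reaches_grandchild is made up for illustration, and a pipe is used only as a way to detect that every descendant has died (read returns EOF once no process holds the write end).

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo helper (not part of tsh.c).
 * Returns 1 if signalling the process group killed both the child and
 * the grandchild, 0 if not, and -1 on a setup error. */
int group_kill_reaches_grandchild(void) {
  int fds[2];
  if (pipe(fds) < 0) return -1;

  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    close(fds[0]);
    signal(SIGTERM, SIG_DFL);  /* make sure SIGTERM terminates us */
    setpgid(0, 0);             /* child becomes leader of a new group */
    pid_t gpid = fork();       /* grandchild inherits that group */
    if (gpid < 0) _exit(1);
    if (gpid == 0) {
      /* tell the shell-like parent that the grandchild exists */
      if (write(fds[1], "g", 1) != 1) _exit(1);
    }
    for (;;) pause();          /* both processes sleep until signalled */
  }

  close(fds[1]);  /* from now on only child and grandchild hold the write end */

  char c;
  if (read(fds[0], &c, 1) != 1) return -1;  /* wait for the grandchild */

  kill(-pid, SIGTERM);     /* negative pid: signal every member of group pid */
  waitpid(pid, NULL, 0);   /* reap the direct child */

  /* EOF (0) is returned only once every holder of the write end is gone,
   * that is, once the grandchild has died as well. */
  ssize_t n = read(fds[0], &c, 1);
  close(fds[0]);
  return n == 0 ? 1 : 0;
}
```

Calling group_kill_reaches_grandchild() from a small main should return 1. With kill(pid, SIGTERM) instead, the grandchild would keep the pipe open and the final read would block.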

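Q4: Do the handlers need to block signals themselves?

While a handler runs, the kernel implicitly blocks further signals of the type being handled, so these handlers do not need an explicit sigprocmask call.

Aside: per guideline G2, a handler should save errno on entry and restore it before returning.

Per G1, a handler should not really call standard I/O functions such as printf, because they are not async-signal-safe. But since the provided sigquit_handler uses printf, we will assume that we may use it too.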
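For readers who prefer to follow G1 strictly, output from a handler can go through write(2), which is async-signal-safe. The sketch below is an illustration under assumptions: safe_puts, usr1_handler and demo_safe_handler are hypothetical names, and SIGUSR1 stands in for the signals tsh actually handles.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr1 = 0;

/* Hypothetical helper: print a string using only write(2). */
static void safe_puts(const char *s) {
  size_t len = 0;
  while (s[len] != '\0') len++;  /* hand-rolled length, no library calls */
  ssize_t ignored = write(STDOUT_FILENO, s, len);
  (void)ignored;                 /* nothing sensible to do on failure here */
}

static void usr1_handler(int sig) {
  int saved_errno = errno;       /* G2: save errno on entry */
  (void)sig;
  safe_puts("caught SIGUSR1\n"); /* G1: async-signal-safe output */
  got_usr1 = 1;
  errno = saved_errno;           /* G2: restore errno before returning */
}

/* Hypothetical demo: returns 1 if the handler ran and errno survived. */
int demo_safe_handler(void) {
  struct sigaction sa;
  sa.sa_handler = usr1_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  if (sigaction(SIGUSR1, &sa, NULL) < 0) return -1;

  errno = EINTR;   /* sentinel value to check that errno is preserved */
  raise(SIGUSR1);  /* in a single-threaded program the handler runs
                      before raise() returns */
  return got_usr1 == 1 && errno == EINTR;
}
```

The same save and restore pattern is what the handlers in this post do with old_errno; only the output routine differs.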

sigchld_handler

Reading the code comment, two requirements stand out:

or stops because it received a SIGSTOP or SIGTSTP signal
but doesn’t wait for any other currently running children to terminate.

man waitpid is worth reading; it contains some useful passages:

All of these system calls are used to wait for state changes in a child of the calling process, and obtain information about the child whose state has changed. A state change is considered to be: the child terminated; the child was stopped by a signal; or the child was resumed by a signal.

This answers Q1: wait waits for a process to change state, and a state change is not limited to termination.

In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a wait is not performed, then the terminated child remains in a “zombie” state (see NOTES below).

If a child has already changed state, then these calls return immediately. Otherwise, they block until either a child changes state or a signal handler interrupts the call (assuming that system calls are not automatically restarted using the SA_RESTART flag of sigaction(2)).

This makes clear that wait can also be called on a child that has already terminated: the child may terminate first and the parent may wait afterwards.

/*
 * sigchld_handler - The kernel sends a SIGCHLD to the shell whenever
 *     a child job terminates (becomes a zombie), or stops because it
 *     received a SIGSTOP or SIGTSTP signal. The handler reaps all
 *     available zombie children, but doesn't wait for any other
 *     currently running children to terminate.
 */
void sigchld_handler(int sig) {
  int old_errno = errno;
  pid_t pid;
  int status;
  while ((pid = waitpid(-1, &status, WNOHANG | WUNTRACED)) > 0) {
    if (WIFEXITED(status)) {
      deletejob(jobs, pid);
    } else if (WIFSIGNALED(status)) {
      int jid = pid2jid(pid);
      printf("Job [%d] (%d) terminated by signal %d\n", jid, pid,
             WTERMSIG(status));
      deletejob(jobs, pid);
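    } else if (WIFSTOPPED(status)) {
      struct job_t *job = getjobpid(jobs, pid);
      if (job) job->state = ST;  /* guard against a pid missing from the list */
      int jid = pid2jid(pid);
      printf("Job [%d] (%d) stopped by signal %d\n", jid, pid,
             WSTOPSIG(status));
    }
  }
  errno = old_errno;  /* restore the errno saved on entry (G2) */
}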

Option meanings:

WNOHANG: return immediately if no child has exited.
WUNTRACED: also return if a child has stopped (but not traced via ptrace(2)). Status for traced children which have stopped is provided even if this option is not specified.
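To see what the two options report, here is a small self-contained sketch. It is not taken from tsh.c; demo_wait_states is a hypothetical name, and the child stops itself with SIGSTOP only to produce a "stopped" state change for the parent to observe.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if the parent saw "stopped" first,
 * then "nothing new" with WNOHANG, then "killed by a signal". */
int demo_wait_states(void) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    raise(SIGSTOP);     /* the child stops itself */
    for (;;) pause();   /* only reached if it is ever continued */
  }

  int status;
  /* WUNTRACED: a stopped child counts as a state change. Without it,
   * this call would block until the child terminated. */
  if (waitpid(pid, &status, WUNTRACED) != pid) return -1;
  int saw_stop = WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP;

  /* WNOHANG: the stop was already reported and nothing else changed,
   * so waitpid returns 0 immediately instead of blocking. */
  pid_t r = waitpid(pid, &status, WNOHANG | WUNTRACED);
  int no_hang = (r == 0);

  kill(pid, SIGKILL);   /* terminate the stopped child */
  if (waitpid(pid, &status, 0) != pid) return -1;
  int saw_kill = WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;

  return saw_stop && no_hang && saw_kill;
}
```

The three checks correspond to the WIFSTOPPED branch, the loop exit condition, and the WIFSIGNALED branch of sigchld_handler.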

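eval and waitfg

eval

Q5: The shell never terminates, so its fg children get reaped normally, but how are bg children reaped?

Background children are reaped by sigchld_handler whenever SIGCHLD arrives, so the shell never has to block on them. The other part of the answer is to detach each child into its own process group with setpgid(0, 0);. Doing so also solves another problem:

When we press Ctrl + C to terminate the child, a child that still shared the shell's process group would receive the SIGINT together with the shell, and both could terminate together. Resetting the pgid in the child avoids this.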

/*
 * eval - Evaluate the command line that the user has just typed in
 *
 * If the user has requested a built-in command (quit, jobs, bg or fg)
 * then execute it immediately. Otherwise, fork a child process and
 * run the job in the context of the child. If the job is running in
 * the foreground, wait for it to terminate and then return.  Note:
 * each child process must have a unique process group ID so that our
 * background children don't receive SIGINT (SIGTSTP) from the kernel
 * when we type ctrl-c (ctrl-z) at the keyboard.
 */
void eval(char *cmdline) {
  char *argv[MAXARGS];
  pid_t pid;

  sigset_t mask_all, mask_one, prev_mask;
  sigfillset(&mask_all);
  sigemptyset(&mask_one);
  sigaddset(&mask_one, SIGCHLD);

  int bg = parseline(cmdline, argv);
  if (!argv[0]) return;
  if (builtin_cmd(argv)) return;

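  /* Block SIGCHLD before forking so the handler cannot run deletejob
   * before the parent has run addjob. */
  sigprocmask(SIG_BLOCK, &mask_one, &prev_mask);
  if ((pid = Fork()) == 0) {
    sigprocmask(SIG_SETMASK, &prev_mask, NULL);  /* child: restore the mask */
    setpgid(0, 0);  /* put the child in its own process group */
    Execve(argv[0], argv, environ);
  }
  /* Block every signal while the job list is in use; the mask is
   * restored at the end of eval. */
  sigprocmask(SIG_BLOCK, &mask_all, NULL);
  addjob(jobs, pid, (bg ? BG : FG), cmdline);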

  if (!bg) {
    waitfg(pid);
  } else {
    printf("[%d] (%d) %s", pid2jid(pid), pid, cmdline);
  }
  sigprocmask(SIG_SETMASK, &prev_mask, NULL);
}

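There is also a synchronization issue here, handled by blocking and unblocking signals: without it, there is no guarantee that deletejob in the handler runs after addjob in the parent. The slides explain this race in detail.

The Fork and Execve calls used in eval are wrapper functions in the CMU style: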
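The ordering guarantee can be checked with a reduced model of the job list. This is a sketch under stated assumptions, not the lab's code: a counter njobs stands in for the job list, chld_handler and demo_add_before_delete are hypothetical names, and the parent deliberately waits until SIGCHLD is already pending to force the worst case.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t njobs = 0;        /* stand-in for the job list */
static volatile sig_atomic_t delete_first = 0; /* set if delete beat add */

static void chld_handler(int sig) {
  int saved_errno = errno;
  (void)sig;
  while (waitpid(-1, NULL, WNOHANG) > 0) {
    if (njobs == 0) delete_first = 1;  /* "deletejob" on a job never added */
    else njobs--;                      /* normal "deletejob" */
  }
  errno = saved_errno;
}

/* Hypothetical demo: returns 1 if "addjob" ran before "deletejob". */
int demo_add_before_delete(void) {
  struct sigaction sa;
  sa.sa_handler = chld_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  if (sigaction(SIGCHLD, &sa, NULL) < 0) return -1;

  sigset_t mask_chld, prev, pending;
  sigemptyset(&mask_chld);
  sigaddset(&mask_chld, SIGCHLD);
  sigprocmask(SIG_BLOCK, &mask_chld, &prev);  /* block before fork */

  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) _exit(0);                     /* child dies immediately */

  /* Worst case for the race: do not "addjob" until the child's SIGCHLD
   * is already pending. */
  struct timespec ts = {0, 1000000};          /* 1 ms */
  do {
    nanosleep(&ts, NULL);
    sigpending(&pending);
  } while (!sigismember(&pending, SIGCHLD));

  njobs++;                                    /* "addjob" */
  sigprocmask(SIG_SETMASK, &prev, NULL);      /* pending SIGCHLD is
                                                 delivered here */
  return njobs == 0 && delete_first == 0;
}
```

If the SIG_BLOCK call is removed, the handler can run as soon as the child exits, before njobs++, and delete_first ends up set.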

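/* Fork - fork() wrapper that reports failure through unix_error */
pid_t Fork(void) {
  pid_t pid = fork();
  if (pid < 0) {
    unix_error("Fork error");
  }
  return pid;
}

/* Execve - execve() wrapper; execve only returns when it fails, in which
 * case the child prints the tsh error message and exits */
int Execve(const char *path, char *const argv[], char *const envp[]) {
  int result = execve(path, argv, envp);
  if (result < 0) {
    printf("%s: Command not found\n", argv[0]);
    exit(1);
  }
  return result;
}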

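waitfg

Apart from sigsuspend, the other approaches do not work well:

(The remark "Program is correct, but very wasteful" in the slide figure for this step refers to the busy-wait loop while (!pid) ;)

int sigsuspend(const sigset_t *mask) is described as follows: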

sigsuspend() temporarily replaces the signal mask of the calling thread with the mask given
by mask and then suspends the thread until delivery of a signal whose action is to invoke a
signal handler or to terminate a process.

If the signal terminates the process, then sigsuspend() does not return. If the signal is
caught, then sigsuspend() returns after the signal handler returns, and the signal mask is
restored to the state before the call to sigsuspend().

It is not possible to block SIGKILL or SIGSTOP; specifying these signals in mask, has no effect on the thread’s signal mask.

/*
 * waitfg - Block until process pid is no longer the foreground process
 */
void waitfg(pid_t pid) {
  sigset_t mask;
  sigemptyset(&mask);
  while (fgpid(jobs) == pid) {
    sigsuspend(&mask);
  }
}
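The same waiting pattern can be tried outside the shell. The sketch below is an illustration, not the lab's code: a flag fg_done replaces the fgpid(jobs) check, and reap_handler and demo_sigsuspend_wait are hypothetical names. SIGCHLD stays blocked except while sigsuspend is sleeping, so the signal cannot slip in between the test of the flag and the start of the wait.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t fg_done = 0;  /* stand-in for fgpid(jobs) != pid */

static void reap_handler(int sig) {
  int saved_errno = errno;
  (void)sig;
  while (waitpid(-1, NULL, WNOHANG) > 0) fg_done = 1;
  errno = saved_errno;
}

/* Hypothetical demo: returns 1 once the child has been waited for. */
int demo_sigsuspend_wait(void) {
  struct sigaction sa;
  sa.sa_handler = reap_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  if (sigaction(SIGCHLD, &sa, NULL) < 0) return -1;

  sigset_t block_chld, prev, wait_mask;
  sigemptyset(&block_chld);
  sigaddset(&block_chld, SIGCHLD);
  sigprocmask(SIG_BLOCK, &block_chld, &prev);  /* SIGCHLD blocked from here */
  wait_mask = prev;
  sigdelset(&wait_mask, SIGCHLD);              /* mask used while sleeping */

  fg_done = 0;
  pid_t pid = fork();
  if (pid < 0) {
    sigprocmask(SIG_SETMASK, &prev, NULL);
    return -1;
  }
  if (pid == 0) {
    struct timespec ts = {0, 10 * 1000 * 1000};  /* run for about 10 ms */
    nanosleep(&ts, NULL);
    _exit(0);
  }

  /* sigsuspend atomically installs wait_mask and sleeps; when it returns
   * the handler has run and SIGCHLD is blocked again. */
  while (!fg_done) {
    sigsuspend(&wait_mask);
  }

  sigprocmask(SIG_SETMASK, &prev, NULL);
  return fg_done;
}
```

Replacing the loop body with an unblock followed by pause() would reintroduce the window in which SIGCHLD arrives before pause() and the parent sleeps forever.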

builtin_cmd and do_bgfg

builtin_cmd

/*
 * builtin_cmd - If the user has typed a built-in command then execute
 *    it immediately.
 */
int builtin_cmd(char **argv) {
  int is_builtin = 1;
  if (!strcmp(argv[0], "quit")) {
    exit(0);
  } else if (!strcmp(argv[0], "fg") || !strcmp(argv[0], "bg")) {
    do_bgfg(argv);
  } else if (!strcmp(argv[0], "jobs")) {
    listjobs(jobs);
  } else {
    is_builtin = 0;
  }
  return is_builtin;
}

do_bgfg

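Job state transitions are as follows (as listed in the header comment of tsh.c):

FG -> ST: ctrl-z
ST -> FG: fg command
ST -> BG: bg command
BG -> FG: fg command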

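Q6: What is a job, and how is it used?

The lab writeup puts it this way: The child processes created as a result of interpreting a single command line are known collectively as a job. In general, a job can consist of multiple child processes connected by Unix pipes.

Q7: What does fg %2 do to the process / process group with jid = 2?

For bg, we simply send SIGCONT to the target so that it resumes. For fg, we could first check whether the target is stopped and resume it only in that case; it is simpler to send SIGCONT to the whole process group of the target job for both bg and fg, since SIGCONT has no effect on a process that is already running. After that, fg calls waitfg to wait for the job to leave the foreground. Note that the first argument to kill must be negative here as well.

When the user interacts with the command line, there is usually just one foreground process (rather than a whole foreground process group) running, so waiting for that single process is enough.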

/*
 * do_bgfg - Execute the builtin bg and fg commands
 */
void do_bgfg(char **argv) {
  char *cmd = argv[0];
  char *id = argv[1];
  struct job_t *job;
  if (!id) {
    printf("%s command requires PID or %%jobid argument\n", cmd);
    return;
  }
  if (id[0] == '%') {
    if (!(job = getjobjid(jobs, atoi(&id[1])))) {
      printf("%s: No such job\n", id);
      return;
    }
  } else if (atoi(id) > 0) {
    if (!(job = getjobpid(jobs, atoi(id)))) {
      printf("(%s): No such process\n", id);
      return;
    }
  } else {
    printf("%s: argument must be a PID or %%jobid\n", cmd);
    return;
  }

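  kill(-job->pid, SIGCONT);  /* resume every process in the job's group */
  /* strcmp returns 0 on a match, so the test must be negated */
  if (!strcmp(cmd, "bg")) {
    job->state = BG;
    printf("[%d] (%d) %s", job->jid, job->pid, job->cmdline);
  } else if (!strcmp(cmd, "fg")) {
    job->state = FG;
    waitfg(job->pid);
  }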
}

Note the two calls kill(-job->pid, SIGCONT) and waitfg(job->pid): the signal goes to the whole process group, while the wait is on the single pid.
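One detail of the argument handling is that atoi accepts trailing garbage, so an argument such as %2abc is treated as job 2. A stricter parser could look like the sketch below. It is an optional variation, not what the lab requires: parse_job_arg and the id_kind enum are hypothetical names introduced only for this example.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

enum id_kind { ID_INVALID = 0, ID_PID, ID_JID };

/* Hypothetical helper (not part of tsh.c): classify a bg/fg argument.
 * "1234" -> ID_PID, "%2" -> ID_JID, anything else -> ID_INVALID.
 * On success the numeric value is stored in *out. */
enum id_kind parse_job_arg(const char *arg, int *out) {
  if (arg == NULL || *arg == '\0') return ID_INVALID;

  enum id_kind kind = ID_PID;
  if (*arg == '%') {  /* a leading % marks a job ID */
    kind = ID_JID;
    arg++;
  }
  /* Require a digit here so that "", "+1", "-1" and " 1" are rejected. */
  if (*arg < '0' || *arg > '9') return ID_INVALID;

  char *end;
  errno = 0;
  long v = strtol(arg, &end, 10);
  /* Reject overflow, trailing characters, and non-positive values. */
  if (errno != 0 || *end != '\0' || v <= 0 || v > INT_MAX) return ID_INVALID;

  *out = (int)v;
  return kind;
}
```

In do_bgfg the result would select between getjobjid and getjobpid, with ID_INVALID leading to the "argument must be a PID or %jobid" message.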

References

【深入理解计算机系统实验4 CSAPP】Shell Lab 实现 CMU 详细讲解 shelllab, https://www.bilibili.com/video/BV1EF411h791/?share_source=copy_web&vd_source=1e8c177289cfed3be80e766714c3f11f (video walkthrough by 郭郭wg)
csapp-shlab 详解, article by 独小雪 on Zhihu, https://zhuanlan.zhihu.com/p/422490811 (easy to follow)
CSAPP 之 ShellLab 详解, 之一Yo, 博客园 (cnblogs.com) (concise and clear)
CSAPP 之 ShellLab 详解, 之一Yo, 博客园 (cnblogs.com) (trace-by-trace analysis)