llhuii's Blog

Happy coding

JOS

I spent a month working through JOS[1] and reading xv6[2] and linux-0.11[3], and gained a lot.
MIT's course and Linus both live up to their reputations (even if it is only linux-0.11).

## Reading source code is a pleasure in itself.

-----------------------------------
JOS

It consists of seven labs:
1. Booting.
2. Memory management.
3. User environments.
4. Preemptive multitasking.
5. File system.
6. Network driver.
7. A shell.

============================
Lab 1: Booting a PC.
Power-on -> BIOS -> boot loader -> kernel; see also [4].

The boot loader does two main things:
1. Switch the processor from real mode to 32-bit protected mode. (boot/boot.S)
2. Read the kernel from disk and hand control over to it. (boot/main.c)

Once the kernel has control, its code is still running at low addresses, so it
enables paging and then jumps to the high-address C initialization code via an
indirect call. (kern/entry.S, kern/entrypgdir.c)

Kernel initialization (kern/init.c):
1) console: keyboard/vga/serial. (kern/console.c)
2) For now it drops straight into the monitor. (kern/monitor.c) (Later labs add
        more initialization, e.g. memory and the file system.)

Challenges:
1) Enhance the console, see the "字符设备驱动程序" (character device drivers) chapter of [3].

Summary: It took me quite some time to review not-so-familiar material:
assembly instructions, 80x86 registers, and gdb.


============================
Lab 2: Memory Management.

Part 1: Physical memory management. (kern/pmap.c)
1) Page initialization.
2) Page allocation and freeing.

Part 2: Virtual memory. (kern/pmap.c)
1) Translation among virtual, linear, and physical addresses.
2) Page table management: updating the page directory and page tables when
        inserting and removing pages.

Part 3: Kernel address space
1) The JOS memory layout. (inc/memlayout.h)
2) Kernel protection. (page table entries, segment selectors)

Challenges:
1) Reducing the pages used for the KERNBASE mapping, see [2] (entrypgdir in main.c).
2) Malloc/free, see [2] (umalloc.c) and [3] (lib/malloc.c).

Summary: I now have a much clearer picture of these three kinds of addresses,
and of memory layout, allocation, freeing, etc.


============================
Lab 3: User Environments.

Part A: User environments and exception handling

A JOS user environment is similar to a Unix process. (kern/env.c)
1) The state the kernel keeps for each user environment. (env_id has an
        interesting component)
2) Allocating environment space, creating and running user environments.
        (Since there is no file system yet, JOS embeds the binaries into the
        kernel.)

Exception handling: setting up the interrupt descriptor table (IDT).
(kern/trap.c, kern/trapentry.S)

Part B: Page faults, breakpoints, and system calls
Page faults, CR2. (kern/trap.c)

Breakpoints, mainly used for user-mode panics. (kern/trap.c)

System calls. (kern/syscall.c)
1) Arguments are passed in five registers (edx, ecx, ebx, edi, esi); eax holds
        the system call number.

Challenges:
1) Clean up trapentry.S; it can be generated by a script, see [2] (vectors.pl).
2) Support 'continue': set TF and RF in EFLAGS, then return to user mode.
3) Sysenter/sysexit, see [5].

Summary: Traps, exceptions, and syscalls; kernel protection; kernel/user
address-space switching.


============================
Lab 4: Preemptive multitasking

Part A: Multiprocessor support and cooperative multitasking

Multiprocessor support: (kern/lapic.c) ## I didn't read this part closely.
1) The BSP (bootstrap processor) activates the APs (application processors).
2) Per-CPU state and initialization. (Kernel stack, TSS and TSS descriptor,
        current env, registers, idle env)

Locking (kern/spinlock.c): the big kernel lock.

Scheduling (kern/sched.c): round-robin.

System calls for creating user environments: (mind the permission and
address-validity checks)
1) sys_exofork, like Unix fork, creates a new user environment; the child is
        marked not runnable.
2) sys_env_set_status, marks an env runnable or not runnable.
3) sys_page_alloc, allocates a new page.
4) sys_page_map, copies a page mapping.
5) sys_page_unmap, unmaps a page.

Part B: Copy-on-Write Fork

User-level page fault handling:
1) Registering the page fault handler.
2) The exception stack.
3) Invoking the page fault handler.
4) The entrypoint. ## A bit hard, and interesting.

Implementing CoW fork: ## The hard part: understanding vpt/vpd.
1) Register the page fault handler.
2) sys_exofork creates the new env.
3) In both parent and child, remap every cow-or-writable page as COW, except
        the exception stack.
4) Set the child's page fault handler.
5) Mark the child runnable.

Part C: IPC

1) send/receive.
2) lib wrappers.

## What surprised me most was primes.c, which implements a prime sieve in a novel way.

## The challenges in this part are great fun and took me a lot of time, especially power-series.
Challenges:
1) Drop big kernel lock, see [2] and [4].
2) Add a scheduling policy, well, todo.
3) SFORK, easy, but heed the global ``thisenv'' pointer.
4) Un-loop ipc_send; the only approach I can come up with is blocking waits.
5) Matrix multiplication, using sfork.
6) Power series, GOOD ONE, also see [6](Python implementation).
7) "Improving IPC by Kernel Design", so many tricks :-), todo.

Summary: The other CPUs are activated by the BSP; locking and scheduling are
both big topics; COW fork is easy to describe but tricky to implement; page
faults are also tricky; IPC...

============================
Lab 5: File System
## The main work here is reading code.

On-disk data structures of the file system:
1) Sectors and blocks
2) Superblock
3) Block bitmap
4) File metadata

Components of the file system:
1) Disk access, see also the "块设备驱动程序" (block device drivers) section of [3].
2) Block cache management.
3) Bitmap management.
4) File operations.

The file system is implemented by a dedicated env; JOS lets other envs use it
via IPC.

Spawning processes
## This code is quite interesting.

Challenges:
1) Crash resilience, see [2].
2) Inodes, see [2].
3) Unix-style exec, tricky, as the kernel must perform the entire replacement.
4) Mmap-style, not that hard.

Summary: After delving into this simple FS, I understand file systems much more clearly.


============================
Lab 6: Network Driver

## This lab is mostly about reading Intel's 8254x manual, plus reading source
## code (skipping lwIP).

## I had been using an officially built QEMU; this lab needs 6.828's patched
## version for easier debugging (see JOS-tool).

## Starting with this lab the grade test is implemented in Python, a big
## improvement: tests can now be run individually, which is very convenient.

The network driver consists of four parts:
1) The core network server environment
2) The input environment
3) The output environment
4) The timer
## The work is mainly in 2/3/4.

Part A: Initialization and packet transmission

NIC discovery and initialization:
1) The PCI interface
2) Memory-mapped I/O
3) DMA

Packet transmission: initializing and updating the transmit queue; watch out
for the queue-full case.

The output environment; watch out for the waiting case.

Part B: Packet reception and the web server

Packet reception, similar to transmission: handling of the receive queue.

The input environment; watch out for races on the IPC page.

The web server, a simple web server implementation.

Challenges:
1) Reading the MAC address from the EEPROM. According to the 8254x manual
        there are two ways to do this: the EEPROM Read register and the control
        register. The first didn't work for me; the second implementation is
        borrowed from linux-2.6.38.
2) Zero-copy, messy, see [7].
3) For the other challenges, I'm sorry: either you're too hard or I'm too busy.

Summary: I was forced to read Intel's 8254x manual, and now feel comfortable
with hardware details. Networking seems interesting.


============================
Lab 7: Final Project
1) Sharing library code, PTE_COW.
2) The keyboard interface, see [3].
3) The shell, see also [2].

Challenges:
1) The project. I'm not yet good enough to pull this off; I don't have a good enough idea.

Summary: Apart from the challenge, this lab requires the least code. If I had
more time, I would like to read the bash source.


============================
Needed requirements:
1) OS concepts
2) 80x86 concepts
3) Assembly (gas)/C/inline assembler
4) GDB
5) ELF, in part.

Optional requirements:
1) Hardware programming, keyboard/console/disk/network etc.
2) Make, Makefile of course.
3) Shell, grade test for lab1-5.
4) Python, grade test for lab6-7. ## It took me half a day to understand it,
        and another half to deal with the qemu bug.
5) Qemu, well I didn't dive into it.
6) Gcc/ld, ditto.

Tools I use:
## well, this might deserve an article of its own.

1) Vim. You need a damn great editor and make full use of it. For vim, I heavily
        use cscope, marks, tabs, global/substitute, etc, and good plugins[8].
2) Tmux. A terminal multiplexer (before that I used GNU screen).

-----------------------------------
Xv6
Read the book and source code provided by [2].

-----------------------------------
Linux-0.11
Read [3] and the linux-0.11 source.



References: ## The JOS homepage itself contains plenty of good resources.
[1] MIT-JOS, http://pdos.csail.mit.edu/6.828/2011/overview.html
[2] Xv6, http://pdos.csail.mit.edu/6.828/2011/xv6.html
[3] Linux-0.11, <<linux内核完全剖析>> (a complete analysis of the linux-0.11 kernel) by 赵炯
[4] Booting, http://www.ruanyifeng.com/blog/2013/02/booting.html
[5] Sysenter, http://articles.manugarg.com/systemcallinlinux2_6.html
[6] Power-series, https://github.com/pdonis/powerseries
[7] Zero-copy, http://www.ibm.com/developerworks/library/j-zerocopy/index.html
[8] Vim-good-plugins, https://github.com/carlhuda/janus

Load

Let's dive into load average.

Sample output of `uptime':
 18:38:47 up 28 days,  9:15,  9 users,  load average: 2.63, 2.67, 2.50
Current time, uptime, number of logged-in users, and the load averages over the past 1/5/15 minutes.

What is the load average?
From uptime(1):
System load averages is the average number of processes that are either
        in a runnable or uninterruptable state.  A process in a runnable  state
        is  either using the CPU or waiting to use the CPU.  A process in unin‐
        terruptable state is waiting for some I/O access, eg waiting for  disk.
        The  averages  are  taken over the three time intervals.  Load averages
        are not normalized for the number of CPUs in a system, so a load  aver‐
        age  of 1 means a single CPU system is loaded all the time while on a 4
        CPU system it means it was idle 75% of the time.

Kernel magic:
source code in kernel/sched.c, include/linux/sched.h

/* in calc_global_load() */
avenrun[0] = calc_load(avenrun[0], EXP_1, active);  /* 1 min */
avenrun[1] = calc_load(avenrun[1], EXP_5, active);  /* 5 min */
avenrun[2] = calc_load(avenrun[2], EXP_15, active); /* 15 min */

calc_load_update += LOAD_FREQ;

And the definition of calc_load:

static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
  load *= exp;
  load += active * (FIXED_1 - exp);
  load += 1UL << (FSHIFT - 1);
  return load >> FSHIFT;
}

So the update formula (including the rounding term) is:
load(t1) = (load(t0)*exp + active*(FIXED_1-exp) + (1 << (FSHIFT-1))) >> FSHIFT

exp is one of EXP_1, EXP_5, EXP_15, for the 1-, 5-, and 15-minute averages.
But what are EXP_*, FSHIFT, and FIXED_1?

/*
 * These are the constant used to fake the fixed-point load-average
 * counting. Some notes:
 *  - 11 bit fractions expand to 22 bits by the multiplies: this gives
 *    a load-average precision of 10 bits integer + 11 bits fractional
 *  - if you want to count load-averages more often, you need more
 *    precision, or rounding will get you. With 2-second counting freq,
 *    the EXP_n values would be 1981, 2034 and 2043 if still using only
 *    11 bit fractions.
 */
#define FSHIFT    11          /* nr of bits of precision */
#define FIXED_1   (1<<FSHIFT) /* 1.0 as fixed-point */
#define LOAD_FREQ (5*HZ+1)    /* 5 sec intervals */
#define EXP_1     1884        /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5     2014        /* 1/exp(5sec/5min) */
#define EXP_15    2037        /* 1/exp(5sec/15min) */

From this we get the real formula:
load(t1) = load(t0) * exp_m + active * (1 - exp_m)
where exp_m, corresponding to EXP_M, is a constant less than 1: the load
factor, i.e. the weight given to the previous load.

How are the EXP_* values computed?

The code implements exp_m = exp(-load_freq/T), where load_freq is the update
interval and T the averaging window. Since the load is recomputed every 5
seconds, the load factor for the past m minutes is
exp_m = exp(-5s / (m * 60s)), so
EXP_M = exp_m << FSHIFT
      = 2**11 * exp(-5/(60*m))
      = 2**(11 - 5*log2(e)/(60*m))

EXP_1 = 2**(11 - 5*log2(e)/60) ≈ 1884
Likewise, EXP_5 ≈ 2014 and EXP_15 ≈ 2037.

References:
http://en.wikipedia.org/wiki/Load_(computing)
http://www.teamquest.com/pdfs/whitepaper/ldavg1.pdf