Course website: https://pdos.csail.mit.edu/6.824/schedule.html
The following comes from the course lecture notes; since the raw plain-text notes are hard to read, they have been reformatted here with some personal annotations added.

Today’s topic: storage + gfs + consistency

Why hard?

high performance -> shard data across servers
many servers -> constant faults
fault tolerance -> replication
replication -> potential inconsistencies
strong consistency -> lower performance

Ideal consistency: behave as if single system
Two sources of risk: concurrency and faults.

Why are storage systems important?

Once we have a fault-tolerant storage system, applications can be stateless: the storage system holds the persistent state.

GFS context

  • Many Google services needed a big fast unified storage system
    Mapreduce, crawler, indexer, log storage/analysis
  • Shared among multiple applications e.g. crawl, index, analyze
  • Huge capacity
  • Huge performance
  • Fault tolerant
  • But:
    • just for internal Google use
    • aimed at batch big-data performance, not interactive

Google actually used GFS in production; its successor Colossus, as well as other cluster file systems such as HDFS (used by Hadoop MapReduce), drew their inspiration from GFS.
GFS's design was unorthodox in two ways:

  1. There is only one master, and the master is not replicated, so this fault-tolerant file system still contains a single point of failure.
  2. Inconsistencies: GFS does not guarantee strong consistency.
    Even today there is a running tension among fault tolerance, replication, performance, and consistency.

GFS overview

  • 100s/1000s of clients (e.g. MapReduce worker machines)
  • 100s of chunkservers, each with its own disk
  • one coordinator

Capacity story?

  • big files split into 64 MB chunks
  • each file’s chunks striped/sharded over chunkservers
    so a file can be much larger than any one disk
  • each chunk in a Linux file
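
To make the capacity story concrete, here is a minimal Go sketch (assumed names, not code from GFS) of how a byte offset within a file maps to a chunk index, which is how the coordinator's per-file array of chunk handles is indexed.

```go
package main

import "fmt"

// 64 MB chunk size, as in the GFS paper.
const chunkSize = 64 << 20

// chunkIndex maps a byte offset within a file to the index of the chunk
// holding that byte; the coordinator's filename -> array-of-chunkhandle
// table is indexed by this value.
func chunkIndex(offset int64) int64 {
	return offset / chunkSize
}

func main() {
	// A 1 GB file spans 16 chunks; byte 200,000,000 lands in chunk 2.
	fmt.Println(chunkIndex(200_000_000)) // prints 2
}
```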


Throughput story?

  • clients talk directly to chunkservers to read/write data
  • if lots of clients access different chunks, huge parallel throughput read or write

Fault tolerance story?

  • each 64 MB chunk stored (replicated) on three chunkservers
  • client writes are sent to all of a chunk’s copies
  • a read just needs to consult one copy

GFS is not fully automatically fault tolerant, but it goes quite far toward fault tolerance.
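
A tiny Go sketch of the replication rule above (illustrative names only, not GFS's real RPC layer): a write has to reach all three copies of a chunk, while a read can be served by any single copy.

```go
package main

import "fmt"

// replica stands in for one chunkserver's on-disk copy of a chunk:
// a map from offset to data.
type replica map[int64]string

// writeAll pushes a mutation to every replica of the chunk; in GFS a
// write only succeeds once all copies have applied it.
func writeAll(replicas []replica, off int64, data string) {
	for _, r := range replicas {
		r[off] = data
	}
}

// readOne consults a single replica (e.g. the nearest chunkserver).
func readOne(replicas []replica, off int64) string {
	return replicas[0][off]
}

func main() {
	copies := []replica{{}, {}, {}} // the chunk's three chunkservers
	writeAll(copies, 0, "record")
	fmt.Println(readOne(copies, 0)) // "record"
}
```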

What are the steps when client C wants to read a file?

[figure: read path from client C through the coordinator to a chunkserver]

  1. C sends filename and offset to coordinator (CO) (if not cached)
    CO has a filename -> array-of-chunkhandle table
    and a chunkhandle -> list-of-chunkservers table
  2. CO finds chunk handle for that offset
  3. CO replies with chunkhandle + list of chunkservers
  4. C caches handle + chunkserver list
  5. C sends request to nearest chunkserver
    chunk handle, offset
  6. chunk server reads from chunk file on disk, returns to client

The purpose of the client cache in step 4: lower latency and less traffic to the master, which raises the CO's throughput; once the answer is cached, the client does not have to ask the CO again when it touches the same chunk (a Go sketch of the read path follows the next list).

Clients only ask coordinator where to find a file’s chunks

  • clients cache name -> chunkhandle info
  • coordinator does not handle data, so (hopefully) not heavily loaded
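
Below is a minimal Go sketch of the read path just described. All names (Coordinator, Client, chunkInfo, etc.) are my own stand-ins, and plain method calls replace the RPCs a real client would issue; it only shows the two coordinator tables, the offset-to-chunk-index step, and the client-side cache.

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB

type chunkHandle uint64

// Coordinator holds the two lookup tables named in the notes.
type Coordinator struct {
	fileToHandles   map[string][]chunkHandle // filename -> array of chunk handles
	handleToServers map[chunkHandle][]string // chunk handle -> chunkservers
}

// Lookup covers steps 1-3: given filename + offset, return the chunk
// handle for that offset and the chunkservers holding it.
func (co *Coordinator) Lookup(file string, offset int64) (chunkHandle, []string) {
	h := co.fileToHandles[file][offset/chunkSize]
	return h, co.handleToServers[h]
}

type chunkInfo struct {
	handle  chunkHandle
	servers []string
}

type Client struct {
	co    *Coordinator
	cache map[string]map[int64]chunkInfo // step 4: file -> chunk index -> cached info
}

func (c *Client) Read(file string, offset int64) {
	idx := offset / chunkSize
	info, ok := c.cache[file][idx]
	if !ok { // cache miss: ask the coordinator, then remember the answer
		h, servers := c.co.Lookup(file, offset)
		info = chunkInfo{handle: h, servers: servers}
		if c.cache == nil {
			c.cache = map[string]map[int64]chunkInfo{}
		}
		if c.cache[file] == nil {
			c.cache[file] = map[int64]chunkInfo{}
		}
		c.cache[file][idx] = info
	}
	// Steps 5-6: send (chunk handle, offset within chunk) to the nearest
	// chunkserver; here we only report which server would be contacted.
	fmt.Printf("read chunk %d from %s at chunk offset %d\n",
		info.handle, info.servers[0], offset%chunkSize)
}

func main() {
	co := &Coordinator{
		fileToHandles: map[string][]chunkHandle{"/crawl/pages": {10, 11, 12}},
		handleToServers: map[chunkHandle][]string{
			10: {"cs1", "cs2", "cs3"}, 11: {"cs4", "cs5", "cs6"}, 12: {"cs7", "cs8", "cs9"},
		},
	}
	c := &Client{co: co}
	c.Read("/crawl/pages", 100_000_000) // chunk index 1: asks the coordinator
	c.Read("/crawl/pages", 100_000_001) // same chunk: served from the cache
}
```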

Coordinator (master) state:

  • file name -> array of chunk handles
    kept in stable storage: if the master crashes, it is rebuilt from the log after restart
  • chunk handle -> version # && list of chunkservers (primary, secondaries); each primary has a lease expiration time
    everything except the version number can live in memory only
    the version number must be in stable storage, because the master needs to know whether a chunkserver's copy of a chunk is stale. Why can't the master just ask every chunkserver and treat the highest version it sees as the latest? Because a chunkserver that is down (crashed) at that moment might be the one holding the newest version.
  • log + checkpoints
    whenever the namespace changes, the log must be written to stable storage
    the master applies the change to stable storage first, and only then replies to the client
    the master's checkpoints are also written to stable storage
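
Here is a rough Go sketch of that state, with illustrative names of my own; the comments mark which fields the notes say must survive a crash (via the log/checkpoint on stable storage) and which can be rebuilt by asking the chunkservers.

```go
package main

import "fmt"

type chunkHandle uint64

type chunkMeta struct {
	version      uint64   // stable: needed to tell current replicas from stale ones
	servers      []string // volatile: re-learned from chunkserver reports after a restart
	primary      string   // volatile
	leaseExpires int64    // volatile: unix time at which the primary's lease runs out
}

type Coordinator struct {
	files  map[string][]chunkHandle  // filename -> chunk handles (stable, logged)
	chunks map[chunkHandle]chunkMeta // chunk handle -> per-chunk metadata
	log    []string                  // stand-in for the on-disk write-ahead log
}

// createFile shows the ordering rule from the notes: the namespace change
// is logged to stable storage first, and only then is the client answered.
func (co *Coordinator) createFile(name string) {
	co.log = append(co.log, "create "+name) // must be durable before the reply
	co.files[name] = nil                    // no chunks yet
	// ... reply to the client only after the log write has reached disk
}

func main() {
	co := &Coordinator{
		files:  map[string][]chunkHandle{},
		chunks: map[chunkHandle]chunkMeta{},
	}
	co.createFile("/logs/crawl-00001")
	fmt.Println(co.log) // [create /logs/crawl-00001]
}
```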

Plan: storage, GFS, consistency

GFS: performance -> replication + FT -> consistency
GFS is still unorthodox in two ways: (1) there is only one master, which cannot be replicated (a single point of failure); (2) inconsistencies, i.e. no strong consistency. The tension among fault tolerance, replication, performance, and consistency persists to this day.

Key properties of GFS:

  • Big: large data sets, e.g. a MapReduce job's input might be the entire web
  • Fast: achieved through automatic sharding
  • Global: all apps see the same fs
  • Fault tolerant: not fully automatic, but it goes quite far toward fault tolerance

Write a file: append
filename -> chunk handles
handles -> servers:

  • increase #ver
    both the master and the chunkservers keep the version number in stable storage
  • primary + secondaries
  • lease

On an append, the primary checks that the version # is current and that its lease is still valid.
If the primary fails to write to a secondary B, the append is retried, and the retry uses a different offset.
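
A minimal Go sketch (names are mine, details simplified) of the two checks above on the primary's side, and of why a retried append lands at a different offset: the primary hands out the next offset before forwarding to the secondaries, so a failed attempt still consumes its offset and the retry gets a new one.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// secondary is a stand-in for the RPC interface to a secondary replica.
type secondary interface {
	apply(off int64, rec []byte) error
}

type primary struct {
	version     uint64    // chunk version this primary believes is current
	leaseExpiry time.Time // when this primary's lease runs out
	nextOffset  int64     // next free offset in the chunk
	secondaries []secondary
}

// appendRecord checks the version # and the lease, picks an offset, and
// forwards the record to every secondary (the primary's own local write
// is omitted for brevity).
func (p *primary) appendRecord(reqVersion uint64, rec []byte) (int64, error) {
	if reqVersion != p.version {
		return 0, errors.New("stale chunk version")
	}
	if time.Now().After(p.leaseExpiry) {
		return 0, errors.New("lease expired: no longer primary")
	}
	off := p.nextOffset
	p.nextOffset += int64(len(rec)) // the offset is used up even if the append fails
	for _, s := range p.secondaries {
		if err := s.apply(off, rec); err != nil {
			// The client retries the append; the retry is assigned a new,
			// larger offset, so replicas can end up with duplicates or gaps.
			return 0, fmt.Errorf("secondary failed: %w", err)
		}
	}
	return off, nil
}

// okSecondary is a trivial secondary that always succeeds.
type okSecondary struct{}

func (okSecondary) apply(off int64, rec []byte) error { return nil }

func main() {
	p := &primary{
		version:     3,
		leaseExpiry: time.Now().Add(60 * time.Second),
		secondaries: []secondary{okSecondary{}, okSecondary{}},
	}
	off, _ := p.appendRecord(3, []byte("record-A"))
	fmt.Println(off) // 0
	off, _ = p.appendRecord(3, []byte("record-B"))
	fmt.Println(off) // 8
}
```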

GFS is used only inside Google, so every node is assumed to be trusted.

The version number is maintained by the coordinator, and it is incremented only when a new primary is designated (i.e. when a new lease is granted).
Serial numbers and version numbers are different things.

When reasoning about consistency, enumerate every possible failure and ask whether it can lead to an inconsistency.

If the master's pings to the primary get no response, it still has to wait until the lease expires before picking a new primary; otherwise there could be two primaries.

Split brain syndrome: ending up with two primaries.
The cure is leases: a new primary is chosen only after the current lease has expired.
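
A small Go sketch (assumed names) of that rule on the coordinator's side: even when the current primary stops answering pings, a new primary is granted only after the old lease must have expired, which is what rules out two primaries existing at once.

```go
package main

import (
	"fmt"
	"time"
)

type chunkState struct {
	primary     string
	leaseExpiry time.Time
	replicas    []string // other live chunkservers holding the chunk
}

// maybePickNewPrimary is called when pings to the current primary fail.
// It refuses to act while the old lease could still be valid, because the
// silent primary may merely be partitioned and still serving clients.
func maybePickNewPrimary(c *chunkState, now time.Time) bool {
	if now.Before(c.leaseExpiry) {
		return false // acting now could create two primaries (split brain)
	}
	c.primary = c.replicas[0] // promote some up-to-date replica
	c.leaseExpiry = now.Add(60 * time.Second)
	return true
}

func main() {
	c := &chunkState{
		primary:     "cs1",
		leaseExpiry: time.Now().Add(30 * time.Second),
		replicas:    []string{"cs2", "cs3"},
	}
	fmt.Println(maybePickNewPrimary(c, time.Now()))                     // false: lease still live
	fmt.Println(maybePickNewPrimary(c, time.Now().Add(61*time.Second))) // true: lease has expired
}
```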

To get stronger consistency: update all of the primary and secondaries, or none of them.
Is GFS tailored to MapReduce?