Course website: https://pdos.csail.mit.edu/6.824/schedule.html
The following comes from the course lecture notes; since the raw plain-text notes are hard to read, they have been reformatted here with some personal annotations added.

Today’s topic: storage + gfs + consistency

Why hard?

high performance -> shard data across servers
many servers -> constant faults
fault tolerance -> replication
replication -> potential inconsistencies
strong consistency -> lower performance

Ideal consistency: behave as if single system
Two sources of risk: concurrency and faults.

Why are storage systems important?

Once we have a fault-tolerant storage system, applications can be stateless: the storage system holds the persistent state.

GFS context

  • Many Google services needed a big fast unified storage system
    Mapreduce, crawler, indexer, log storage/analysis
  • Shared among multiple applications e.g. crawl, index, analyze
  • Huge capacity
  • Huge performance
  • Fault tolerant
  • But:
    • just for internal Google use
    • aimed at batch big-data performance, not interactive

Google actually used GFS in production; its successor Colossus, as well as other cluster file systems such as HDFS (used by Hadoop MapReduce), drew their inspiration from GFS.
GFS's design was unorthodox in two ways:

  1. There is only one master, and the master is not replicated, so this fault-tolerant file system still contains a single point of failure.
  2. Inconsistencies: GFS does not guarantee strong consistency.
    Even today there is a running tension among fault tolerance, replication, performance, and consistency.

GFS overview

  • 100s/1000s of clients (e.g. MapReduce worker machines)
  • 100s of chunkservers, each with its own disk
  • one coordinator

Capacity story?

  • big files split into 64 MB chunks
  • each file’s chunks striped/sharded over chunkservers
    so a file can be much larger than any one disk
  • each chunk in a Linux file
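
To make the capacity story concrete, here is a minimal Go sketch (assumed names, not code from GFS) of how a byte offset within a file maps to a chunk index, which is how the coordinator's per-file array of chunk handles is indexed.

```go
package main

import "fmt"

// 64 MB chunk size, as in the GFS paper.
const chunkSize = 64 << 20

// chunkIndex maps a byte offset within a file to the index of the chunk
// holding that byte; the coordinator's filename -> array-of-chunkhandle
// table is indexed by this value.
func chunkIndex(offset int64) int64 {
	return offset / chunkSize
}

func main() {
	// A 1 GB file spans 16 chunks; byte 200,000,000 lands in chunk 2.
	fmt.Println(chunkIndex(200_000_000)) // prints 2
}
```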


Throughput story?

  • clients talk directly to chunkservers to read/write data
  • if lots of clients access different chunks, huge parallel throughput read or write

Fault tolerance story?

  • each 64 MB chunk stored (replicated) on three chunkservers
  • client writes are sent to all of a chunk’s copies
  • a read just needs to consult one copy

GFS is not fully automatically fault tolerant, but it goes quite far toward fault tolerance.
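
A tiny Go sketch of the replication rule above (illustrative names only, not GFS's real RPC layer): a write has to reach all three copies of a chunk, while a read can be served by any single copy.

```go
package main

import "fmt"

// replica stands in for one chunkserver's on-disk copy of a chunk:
// a map from offset to data.
type replica map[int64]string

// writeAll pushes a mutation to every replica of the chunk; in GFS a
// write only succeeds once all copies have applied it.
func writeAll(replicas []replica, off int64, data string) {
	for _, r := range replicas {
		r[off] = data
	}
}

// readOne consults a single replica (e.g. the nearest chunkserver).
func readOne(replicas []replica, off int64) string {
	return replicas[0][off]
}

func main() {
	copies := []replica{{}, {}, {}} // the chunk's three chunkservers
	writeAll(copies, 0, "record")
	fmt.Println(readOne(copies, 0)) // "record"
}
```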

What are the steps when client C wants to read a file?

[figure: read path from client C through the coordinator to a chunkserver]

  1. C sends filename and offset to coordinator (CO) (if not cached)
    CO has a filename -> array-of-chunkhandle table
    and a chunkhandle -> list-of-chunkservers table
  2. CO finds chunk handle for that offset
  3. CO replies with chunkhandle + list of chunkservers
  4. C caches handle + chunkserver list
  5. C sends request to nearest chunkserver
    chunk handle, offset
  6. chunk server reads from chunk file on disk, returns to client

The purpose of the client cache in step 4: lower latency and less traffic to the master, which raises the CO's throughput; once the answer is cached, the client does not have to ask the CO again when it touches the same chunk (a Go sketch of the read path follows the next list).

Clients only ask coordinator where to find a file’s chunks

  • clients cache name -> chunkhandle info
  • coordinator does not handle data, so (hopefully) not heavily loaded
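
Below is a minimal Go sketch of the read path just described. All names (Coordinator, Client, chunkInfo, etc.) are my own stand-ins, and plain method calls replace the RPCs a real client would issue; it only shows the two coordinator tables, the offset-to-chunk-index step, and the client-side cache.

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB

type chunkHandle uint64

// Coordinator holds the two lookup tables named in the notes.
type Coordinator struct {
	fileToHandles   map[string][]chunkHandle // filename -> array of chunk handles
	handleToServers map[chunkHandle][]string // chunk handle -> chunkservers
}

// Lookup covers steps 1-3: given filename + offset, return the chunk
// handle for that offset and the chunkservers holding it.
func (co *Coordinator) Lookup(file string, offset int64) (chunkHandle, []string) {
	h := co.fileToHandles[file][offset/chunkSize]
	return h, co.handleToServers[h]
}

type chunkInfo struct {
	handle  chunkHandle
	servers []string
}

type Client struct {
	co    *Coordinator
	cache map[string]map[int64]chunkInfo // step 4: file -> chunk index -> cached info
}

func (c *Client) Read(file string, offset int64) {
	idx := offset / chunkSize
	info, ok := c.cache[file][idx]
	if !ok { // cache miss: ask the coordinator, then remember the answer
		h, servers := c.co.Lookup(file, offset)
		info = chunkInfo{handle: h, servers: servers}
		if c.cache == nil {
			c.cache = map[string]map[int64]chunkInfo{}
		}
		if c.cache[file] == nil {
			c.cache[file] = map[int64]chunkInfo{}
		}
		c.cache[file][idx] = info
	}
	// Steps 5-6: send (chunk handle, offset within chunk) to the nearest
	// chunkserver; here we only report which server would be contacted.
	fmt.Printf("read chunk %d from %s at chunk offset %d\n",
		info.handle, info.servers[0], offset%chunkSize)
}

func main() {
	co := &Coordinator{
		fileToHandles: map[string][]chunkHandle{"/crawl/pages": {10, 11, 12}},
		handleToServers: map[chunkHandle][]string{
			10: {"cs1", "cs2", "cs3"}, 11: {"cs4", "cs5", "cs6"}, 12: {"cs7", "cs8", "cs9"},
		},
	}
	c := &Client{co: co}
	c.Read("/crawl/pages", 100_000_000) // chunk index 1: asks the coordinator
	c.Read("/crawl/pages", 100_000_001) // same chunk: served from the cache
}
```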

Coordinator (master) state:

  • file name -> array of chunk handles
    kept in stable storage: if the master crashes, it is rebuilt from the log after restart
  • chunk handle -> version # && list of chunkservers (primary, secondaries); each primary has a lease expiration time
    everything except the version number can live in memory only
    the version number must be in stable storage, because the master needs to know whether a chunkserver's copy of a chunk is stale. Why can't the master just ask every chunkserver and treat the highest version it sees as the latest? Because a chunkserver that is down (crashed) at that moment might be the one holding the newest version.
  • log + checkpoints
    whenever the namespace changes, the log must be written to stable storage
    the master applies the change to stable storage first, and only then replies to the client
    the master's checkpoints are also written to stable storage
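
Here is a rough Go sketch of that state, with illustrative names of my own; the comments mark which fields the notes say must survive a crash (via the log/checkpoint on stable storage) and which can be rebuilt by asking the chunkservers.

```go
package main

import "fmt"

type chunkHandle uint64

type chunkMeta struct {
	version      uint64   // stable: needed to tell current replicas from stale ones
	servers      []string // volatile: re-learned from chunkserver reports after a restart
	primary      string   // volatile
	leaseExpires int64    // volatile: unix time at which the primary's lease runs out
}

type Coordinator struct {
	files  map[string][]chunkHandle  // filename -> chunk handles (stable, logged)
	chunks map[chunkHandle]chunkMeta // chunk handle -> per-chunk metadata
	log    []string                  // stand-in for the on-disk write-ahead log
}

// createFile shows the ordering rule from the notes: the namespace change
// is logged to stable storage first, and only then is the client answered.
func (co *Coordinator) createFile(name string) {
	co.log = append(co.log, "create "+name) // must be durable before the reply
	co.files[name] = nil                    // no chunks yet
	// ... reply to the client only after the log write has reached disk
}

func main() {
	co := &Coordinator{
		files:  map[string][]chunkHandle{},
		chunks: map[chunkHandle]chunkMeta{},
	}
	co.createFile("/logs/crawl-00001")
	fmt.Println(co.log) // [create /logs/crawl-00001]
}
```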

Plan: storage, GFS, consistency

GFS: performance -> replication + FT -> consistency
GFS is still unorthodox in two ways: (1) there is only one master, which cannot be replicated (a single point of failure); (2) inconsistencies, i.e. no strong consistency. The tension among fault tolerance, replication, performance, and consistency persists to this day.

Key properties of GFS:

  • Big: large data sets, e.g. a MapReduce job's input might be the entire web
  • Fast: achieved through automatic sharding
  • Global: all apps see the same fs
  • Fault tolerant: not fully automatic, but it goes quite far toward fault tolerance

Write a file: append
filename -> chunk handles
handles -> servers:

  • increase #ver
    both the master and the chunkservers keep the version number in stable storage
  • primary + secondaries
  • lease

On an append, the primary checks that the version # is current and that its lease is still valid.
If the primary fails to write to a secondary B, the append is retried, and the retry uses a different offset.
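
A minimal Go sketch (names are mine, details simplified) of the two checks above on the primary's side, and of why a retried append lands at a different offset: the primary hands out the next offset before forwarding to the secondaries, so a failed attempt still consumes its offset and the retry gets a new one.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// secondary is a stand-in for the RPC interface to a secondary replica.
type secondary interface {
	apply(off int64, rec []byte) error
}

type primary struct {
	version     uint64    // chunk version this primary believes is current
	leaseExpiry time.Time // when this primary's lease runs out
	nextOffset  int64     // next free offset in the chunk
	secondaries []secondary
}

// appendRecord checks the version # and the lease, picks an offset, and
// forwards the record to every secondary (the primary's own local write
// is omitted for brevity).
func (p *primary) appendRecord(reqVersion uint64, rec []byte) (int64, error) {
	if reqVersion != p.version {
		return 0, errors.New("stale chunk version")
	}
	if time.Now().After(p.leaseExpiry) {
		return 0, errors.New("lease expired: no longer primary")
	}
	off := p.nextOffset
	p.nextOffset += int64(len(rec)) // the offset is used up even if the append fails
	for _, s := range p.secondaries {
		if err := s.apply(off, rec); err != nil {
			// The client retries the append; the retry is assigned a new,
			// larger offset, so replicas can end up with duplicates or gaps.
			return 0, fmt.Errorf("secondary failed: %w", err)
		}
	}
	return off, nil
}

// okSecondary is a trivial secondary that always succeeds.
type okSecondary struct{}

func (okSecondary) apply(off int64, rec []byte) error { return nil }

func main() {
	p := &primary{
		version:     3,
		leaseExpiry: time.Now().Add(60 * time.Second),
		secondaries: []secondary{okSecondary{}, okSecondary{}},
	}
	off, _ := p.appendRecord(3, []byte("record-A"))
	fmt.Println(off) // 0
	off, _ = p.appendRecord(3, []byte("record-B"))
	fmt.Println(off) // 8
}
```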

GFS is used only inside Google, so every node is assumed to be trusted.

The version number is maintained by the coordinator, and it is incremented only when a new primary is designated (i.e. when a new lease is granted).
Serial numbers and version numbers are different things.

When reasoning about consistency, enumerate every possible failure and ask whether it can lead to an inconsistency.

If the master's pings to the primary get no response, it still has to wait until the lease expires before picking a new primary; otherwise there could be two primaries.

Split brain syndrome: ending up with two primaries.
The cure is leases: a new primary is chosen only after the current lease has expired.
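
A small Go sketch (assumed names) of that rule on the coordinator's side: even when the current primary stops answering pings, a new primary is granted only after the old lease must have expired, which is what rules out two primaries existing at once.

```go
package main

import (
	"fmt"
	"time"
)

type chunkState struct {
	primary     string
	leaseExpiry time.Time
	replicas    []string // other live chunkservers holding the chunk
}

// maybePickNewPrimary is called when pings to the current primary fail.
// It refuses to act while the old lease could still be valid, because the
// silent primary may merely be partitioned and still serving clients.
func maybePickNewPrimary(c *chunkState, now time.Time) bool {
	if now.Before(c.leaseExpiry) {
		return false // acting now could create two primaries (split brain)
	}
	c.primary = c.replicas[0] // promote some up-to-date replica
	c.leaseExpiry = now.Add(60 * time.Second)
	return true
}

func main() {
	c := &chunkState{
		primary:     "cs1",
		leaseExpiry: time.Now().Add(30 * time.Second),
		replicas:    []string{"cs2", "cs3"},
	}
	fmt.Println(maybePickNewPrimary(c, time.Now()))                     // false: lease still live
	fmt.Println(maybePickNewPrimary(c, time.Now().Add(61*time.Second))) // true: lease has expired
}
```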

To get stronger consistency: update all of the primary and secondaries, or none of them.
Is GFS tailored to MapReduce?