Building File System Semantics for an Exabyte Scale Object Storage System

Shane Mainali, Raji Easwaran (Microsoft)

2019 Storage Developer Conference. © Microsoft. All Rights Reserved.


Agenda

- Analytics Workloads (Access patterns & challenges)
- Azure Data Lake Storage overview
- Under the hood
- Q&A


Analytics Workloads: Access Patterns and Challenges


Analytics Workload Pattern

Data sources:
- Sensors and IoT (unstructured)
- Files (unstructured)
- Media (unstructured)
- Logs (unstructured)
- Business/custom apps (structured)

Pipeline stages: INGEST, PREP & TRAIN, MODEL & SERVE, EXPLORE

STORE: Azure Data Lake Storage Gen2

Services shown: Azure Data Factory, Azure Databricks, Azure SQL Data Warehouse, Azure Data Explorer, Azure Analysis Services, Cosmos DB, Power BI, Real-time Apps


Challenges

- Containers are mounted as filesystems on Analytics Engines like Hadoop and Databricks

- Client-side file system emulation impacts performance, semantics, and correctness

- Directory operations are expensive

- Coarse-grained Access Control

- Throughput is critical for Big Data


Storage for Analytics - Goals

- Address shortcomings of client-side design
- First-class hierarchical namespace
- Interoperability with Object Storage (Blobs)
- Object-level ACLs (POSIX)
- Platform for future filesystem-based protocols (e.g. NFS)


Azure Data Lake Storage: File System Semantics on Object Storage


Hierarchical Namespace (HNS)


Azure Data Lake Storage Architecture

Common Blob Storage Foundation
- Object Tiering and Lifecycle Policy Management
- AAD Integration, RBAC, Storage Account Security
- HA/DR support through ZRS and RA-GRS

Blob API (Unstructured Object Data)
- Server Backups, Archive Storage, Semi-structured Data

ADLS Gen2 API (File Data)
- Hadoop File System, File and Folder Hierarchy, Granular ACLs, Atomic File Transactions


Blobs and Flat Namespace

Diagram: Data is reached through the Blob API over a Flat Namespace. Example object name: /foo/bar/file.txt


Files and Folders in HNS

Diagram: Data is reached through both the ADLS API and the Blob API over a Hierarchical Namespace. Entries shown in the hierarchy: foo, bar, baz.txt


Mapping the concepts

Same Storage account
- URIs are the same except for the endpoint:
  http://account.dfs.core.windows.net/container/videos/movie.mp4
  http://account.blob.core.windows.net/container/videos/movie.mp4

Filesystem == Container
- Create File System and Create Container APIs do the same thing
- Exactly the same metadata and objects under the covers

Directory ~= Blob
- Directories are first-class entities; both implicit and explicit creation supported
- Implicit creation when blobs are created
- ACLs and Leases obeyed by both

File == Blob
- ADLS Gen 2 adds Append and Flush semantics
- Existing Blob semantics supported as is
- ACLs and Leases obeyed by both

Hierarchy (ADLS): Account > File System > Directory > File
Hierarchy (Blob): Account > Container > Blob
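As a small illustration of the "URIs are the same except for the endpoint" point, the sketch below rewrites a URI between the two endpoints using only the Python standard library. The helper name `swap_endpoint` is hypothetical and is not part of any Azure SDK; the host layout `account.<endpoint>.core.windows.net` is taken from the two example URIs on the slide.

```python
from urllib.parse import urlsplit, urlunsplit

def swap_endpoint(uri: str, target: str) -> str:
    """Rewrite a storage URI to the 'dfs' or 'blob' endpoint, leaving the path intact."""
    if target not in ("dfs", "blob"):
        raise ValueError("target must be 'dfs' or 'blob'")
    parts = urlsplit(uri)
    labels = parts.netloc.split(".")
    # Expect a host shaped like account.<endpoint>.core.windows.net
    if len(labels) < 3 or labels[1] not in ("dfs", "blob"):
        raise ValueError(f"not a recognised storage endpoint: {parts.netloc}")
    labels[1] = target
    return urlunsplit(parts._replace(netloc=".".join(labels)))

dfs_uri = "http://account.dfs.core.windows.net/container/videos/movie.mp4"
blob_uri = swap_endpoint(dfs_uri, "blob")
```

Only the host label changes; the container (file system) and object (file) path are shared by both views.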


API Interoperability

Either the Blob API or the ADLS Gen 2 API can be used to access the same data

Existing Blob applications work on the Data Lake account without code changes and without data movement


Under the Hood: Designing for Performance, Scale & Throughput


Blob Storage Architecture

Front End
- Stateless, front door for request handling (auth, request metering/throttling, validation)

Partition Layer
- Serves data in key-value fashion based on partitions, enables batch transactions and strong consistency

Stream Layer
- Stores multiple replicas of the data, deals with failures, bit rot, etc.

Diagram labels: FE 2, Partition 3 (F-J), Stream 2


Blob Storage with HNS Architecture

Front End
- Stateless, front door for request handling (auth, request metering/throttling, validation)

Hierarchical Namespace
- Serves metadata based on partitions, including file names, directory structure and ACLs

Partition Layer
- Serves data in key-value fashion based on partitions, enables batch transactions and strong consistency

Stream Layer
- Stores multiple replicas of the data, deals with the media/devices, handles failures, bit rot, etc.

Diagram labels: FE 2, Partition 3 (G3-G4), Stream 2


Hierarchical Namespace Topology

Paths:
/
/path1/
/path2/
/path2/file1
/path2/file2
/path2/path3/
/path2/path3/file3

Directory tree (each directory is identified by a GUID):
/ (GUID1)
  path1 (GUID2)
  path2 (GUID3)
    file1
    file2
    path3 (GUID4)
      file3

Parent -> child edges:
GUID1 -> GUID2
GUID1 -> GUID3
GUID3 -> GUID4

Parent, child <=> name:
--------, GUID1 <=> “/”
GUID1, GUID2 <=> “path1”
GUID1, GUID3 <=> “path2”
GUID3, GUID4 <=> “path3”
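The name mapping on this slide can be read as a lookup table from (parent GUID, name) to child GUID. The following is a minimal sketch of path resolution over that table, not the service's implementation; the dictionary layout and the function name `resolve_directory` are assumptions made for illustration.

```python
# (parent GUID, name) -> child GUID, copied from the slide's mapping.
DIRECTORY_EDGES = {
    (None, "/"): "GUID1",
    ("GUID1", "path1"): "GUID2",
    ("GUID1", "path2"): "GUID3",
    ("GUID3", "path3"): "GUID4",
}

def resolve_directory(path: str) -> str:
    """Walk the edge table one path component at a time and return the directory GUID."""
    current = DIRECTORY_EDGES[(None, "/")]
    for name in (part for part in path.split("/") if part):
        key = (current, name)
        if key not in DIRECTORY_EDGES:
            raise FileNotFoundError(f"{name!r} not found under {current}")
        current = DIRECTORY_EDGES[key]
    return current

root_guid = resolve_directory("/")
path3_guid = resolve_directory("/path2/path3/")
```

Each step needs only the parent's GUID and one name, so a directory's identity does not depend on the names of its ancestors.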


Partition Layer Schema

| # | Parent ID | Name | CT | Del | File | Metadata | Child ID |
| 1 | GUID-ROOT | . | 00001 | N | Y | … | GUID-BLOB1 |
| 2 | GUID-ROOT | path1 | 00100 | N | N | … | GUID-PATH1 |
| 3 | GUID-ROOT | path2 | 00200 | N | N | … | GUID-PATH2 |
| 4 | GUID-PATH1 | . | 00100 | N | Y | … | GUID-BLOB2 |
| 6 | GUID-PATH2 | . | 00200 | N | N | … | GUID-BLOB3 |
|   | GUID-PATH2 | file1 | 00300 | N | Y | … | GUID-BLOB4 |
| 7 | GUID-PATH2 | file1 | 00350 | N | Y | … | GUID-BLOB4 |
| 8 | GUID-PATH2 | file2 | 00400 | N | Y | … | GUID-BLOB5 |
|   | GUID-PATH2 | path3 | 00400 | N | N | … | GUID-PATH3 |
| 10 | GUID-PATH3 | . | 00400 | N | Y | … | GUID-BLOB6 |
| 11 | GUID-PATH3 | file3 | 00401 | N | N | … | GUID-BLOB7 |

Example key: Account;FileSystem;GUID-ROOT path1 00100 (slide labels: Partition Key, Row Key, Columns)

Paths represented: /, /path1/, /path2/, /path2/file1, /path2/file2, /path2/path3/, /path2/path3/file3
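The table lists file1 twice under GUID-PATH2, with CT values 00300 and 00350. One plausible reading, which is an assumption here and is not stated on the slide, is that CT is a commit timestamp and that a read at time T sees the newest row version with CT <= T. The sketch below illustrates that reading only; the `Row` type and `lookup` function are hypothetical names.

```python
from typing import NamedTuple, Optional

class Row(NamedTuple):
    parent_id: str
    name: str
    ct: int          # ASSUMPTION: commit timestamp of this row version
    deleted: bool    # the "Del" column
    is_file: bool    # the "File" column
    child_id: str

# A subset of the slide's rows for GUID-PATH2.
ROWS = [
    Row("GUID-PATH2", "file1", 300, False, True, "GUID-BLOB4"),
    Row("GUID-PATH2", "file1", 350, False, True, "GUID-BLOB4"),
    Row("GUID-PATH2", "file2", 400, False, True, "GUID-BLOB5"),
    Row("GUID-PATH2", "path3", 400, False, False, "GUID-PATH3"),
]

def lookup(parent_id: str, name: str, as_of: int) -> Optional[Row]:
    """Return the newest row version with ct <= as_of, or None if absent or deleted."""
    versions = [r for r in ROWS
                if r.parent_id == parent_id and r.name == name and r.ct <= as_of]
    if not versions:
        return None
    newest = max(versions, key=lambda r: r.ct)
    return None if newest.deleted else newest

older = lookup("GUID-PATH2", "file1", as_of=320)
newer = lookup("GUID-PATH2", "file1", as_of=400)
```

Under this reading, keeping every version keyed by (parent, name, CT) is what would make reads at an earlier timestamp possible without copying data.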


Hierarchical Namespace Flow

Create File: Staging\Oscars\Movie.mp4

Hierarchical Namespace entries:
| Parent | Name | Label |
| <Null> | Guid1 | Staging |
| Guid1 | Guid2 | Oscars |
| Guid2 | Guid3 | Movie.mp4 |

Diagram labels: FE 2, Hierarchical Namespace, Partition 3 (F-J), Stream 2


Hierarchical Namespace Flow

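Rename Directory: Staging becomes Master, so the file is now Master\Oscars\Movie.mp4

Hierarchical Namespace entries:
| Parent | Name | Label |
| <Null> | Guid1 | Master (was Staging) |
| Guid1 | Guid2 | Oscars |
| Guid2 | Guid3 | Movie.mp4 |

Diagram labels: FE 2, Hierarchical Namespace, Partition 3 (F-J), Stream 2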
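The rename flow can be sketched as a single-row update: because children refer to their parent by GUID rather than by name, changing one label changes every descendant's path. This is an illustrative sketch over the slide's three entries; the `entries` dictionary and both function names are assumptions, not the service's data model.

```python
from typing import Dict, Optional, Tuple

# Entry GUID -> (parent GUID, label), mirroring the slide's table.
entries: Dict[str, Tuple[Optional[str], str]] = {
    "Guid1": (None, "Staging"),
    "Guid2": ("Guid1", "Oscars"),
    "Guid3": ("Guid2", "Movie.mp4"),
}

def full_path(guid: str) -> str:
    """Rebuild a path by following parent pointers up to the root."""
    parts = []
    current: Optional[str] = guid
    while current is not None:
        parent, label = entries[current]
        parts.append(label)
        current = parent
    return "\\".join(reversed(parts))

def rename(guid: str, new_label: str) -> None:
    """Rename touches exactly one row; descendants are not rewritten."""
    parent, _ = entries[guid]
    entries[guid] = (parent, new_label)

before = full_path("Guid3")
rename("Guid1", "Master")
after = full_path("Guid3")
```

With a flat namespace, where the full path is the object name, the same rename would instead have to touch every object under the prefix, which is the cost the Challenges slide refers to.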


Scale Out & Load Balancing

- A Scale Unit contains hundreds of nodes
- Each node has many Namespace Processors (NP)
- Each NP manages a portion of the namespace (GUID range for each <Account, FileSystem>)
- Hot nodes are load balanced with other nodes in the Scale Unit by splitting managed GUID ranges among NPs
- An Azure region contains several Scale Units
- When a majority of nodes in a Scale Unit become hot, load balancing occurs across Scale Units

Scale Unit 1, Namespace Processors:
- NP1: GUID 1 to 100
- NP2: GUID 101 to 200
- NP3: GUID 201 to 300
- NP4: GUID 301 to 400
- NP5: GUID 401 to 500
- NP6: GUID 501 to 1000

Scale Unit 2, Namespace Processors:
- NP1: GUID 1001 to 1100
- NP2: GUID 1101 to 1200
- NP3: GUID 1201 to 1300
- NP4: GUID 1301 to 1400
- NP5: GUID 751 to 1000
- NP6: GUID 501 to 750
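Splitting a managed GUID range can be pictured as cutting an inclusive integer range into contiguous pieces. The sketch below is a simplification under stated assumptions: GUIDs are treated as small integers, as on the slide, and the even split and the name `split_range` are illustrative choices rather than the balancer's actual policy.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (low, high) bounds; GUIDs simplified to integers

def split_range(hot: Range, pieces: int = 2) -> List[Range]:
    """Split an inclusive range into `pieces` contiguous, near-equal sub-ranges."""
    low, high = hot
    size = high - low + 1
    if pieces < 1 or pieces > size:
        raise ValueError("pieces must be between 1 and the range size")
    base, extra = divmod(size, pieces)
    out: List[Range] = []
    start = low
    for i in range(pieces):
        # The first `extra` pieces absorb the remainder, one element each.
        length = base + (1 if i < extra else 0)
        out.append((start, start + length - 1))
        start += length
    return out

halves = split_range((501, 1000))
```

Splitting 501 to 1000 in two yields 501 to 750 and 751 to 1000, the two ranges the slide shows being handed to other Namespace Processors.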


Transaction Processing and Caching

Example path: /path2/file1

Namespace tree:
/ (GUID1)
  path1 (GUID2)
  path2 (GUID3)
    file1
    file2
    path3 (GUID4)
      file3

Namespace Processors: NP1 to NP9

Namespace rows:
| Parent ID | Name | File | Child ID |
| GUID-ROOT | path2 | N | GUID-PATH2 |
| GUID-PATH2 | file1 | Y | GUID-BLOB4 |

Ownership:
| ID | Name | Owning NP |
| GUID-ROOT | / | NP1 |
| GUID-PATH2 | path2 | NP7 |
| GUID-BLOB4 | file1 | NP6 |
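Combining the two tables on this slide shows which Namespace Processors a traversal of /path2/file1 would involve. This is a sketch under the assumption that resolving each component consults the NP that owns the entry; the dictionaries and the name `owners_on_path` are illustrative, and caching is left out.

```python
from typing import List, Tuple

# (parent ID, name) -> child ID, from the slide's namespace rows.
ROWS = {
    ("GUID-ROOT", "path2"): "GUID-PATH2",
    ("GUID-PATH2", "file1"): "GUID-BLOB4",
}

# Entry ID -> owning Namespace Processor, from the slide's ownership table.
OWNING_NP = {
    "GUID-ROOT": "NP1",
    "GUID-PATH2": "NP7",
    "GUID-BLOB4": "NP6",
}

def owners_on_path(path: str) -> Tuple[str, List[str]]:
    """Resolve `path` and list the owning NP of every entry visited, root first."""
    current = "GUID-ROOT"
    visited = [OWNING_NP[current]]
    for name in (part for part in path.split("/") if part):
        current = ROWS[(current, name)]
        visited.append(OWNING_NP[current])
    return current, visited

target, nps = owners_on_path("/path2/file1")
```

An operation that must update entries owned by more than one NP is where a distributed transaction would come in, which matches the later remark that such transactions are more expensive but less frequent.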


High Throughput

- A single object/file contains multiple blocks
- The block range is partitioned uniformly across partitions
- A single write can potentially be served by all partition nodes
- Support 100s of Gbps of Ingress/Egress for a single account or to a single file
- 2 layers of caching to enable high throughput read performance

Diagram labels: FE 2, Partition 3 (F-J), Stream 2; example file: Movie.mp4 (50 GB)
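To make "the block range is partitioned uniformly across partitions" concrete, the sketch below spreads the blocks of a 50 GB file over a handful of partitions. The 100 MB block size, the count of 8 partitions, and the modulo placement rule are all assumptions chosen for illustration; the slide does not give the real block size or placement function.

```python
from collections import Counter

def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // block_size)

def partition_for_block(block_index: int, num_partitions: int) -> int:
    """ASSUMED placement rule: round-robin by block index."""
    return block_index % num_partitions

def spread(file_size: int, block_size: int, num_partitions: int) -> Counter:
    """Count how many blocks of one file land on each partition."""
    return Counter(
        partition_for_block(i, num_partitions)
        for i in range(block_count(file_size, block_size))
    )

GB = 10**9
MB = 10**6
load = spread(file_size=50 * GB, block_size=100 * MB, num_partitions=8)
```

With these numbers the file has 500 blocks and every partition holds 62 or 63 of them, so reads and writes against a single large file can be served by all partitions in parallel.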


Performance & Scale Implications

- Hierarchical Namespace is only in the path of namespace traversal and metadata operations
  - Data reads and writes don’t go through Hierarchical Namespace
- Hierarchical Namespace leverages SSD for persistence and Memory for Caching to minimize latency overhead
- Separation of Distributed Cache and Persistent State (Partition) layers is critical
- Load Balancing is very efficient and fast
  - Leverage Partition Layer; distinct partitioning for Blobs and HNS
- While Distributed Transactions are more expensive, they are less frequent


Opportunities

- Snapshots at any level of the hierarchy
- Time travel operations with E2E built-in transaction timestamps
- Support a wide variety of File Systems
  - Interop across all
  - Zero data copying
- In-Place upgrade from Flat -> Hierarchical Namespace
- Cross-Entity Strongly Consistent Reads
- High-Fidelity On-Prem->Cloud Migration/Hybrid


Q & A