2
Tutorial: Deep Dive into Apache Cassandra: Theory, Design, and Application Lewis Tseng, Haochen Pan, Yingjian Wu Boston College Boston, USA {lewis.tseng, haochen.pan, wuit}@bc.edu TUTORIAL ABSTRACT Distributed storage systems are fundamental to many Internet-scale applications. To achieve high perfor- mance, modern storage systems often choose to provide only weaker guarantees (e.g., weak consistency and eventual correctness). Moreover, to deal with semi- or unstructured and unpredictable data, these systems mainly support a flexible data scheme and simple in- terface. As a result, it requires a new way of thinking about storage and data modeling. This tutorial consists of a series of hands-on and interactive exercises for beginners that have minimum or even no experience in distributed storage systems. We will first give an overview of distributed storages and new storage paradigms like NoSQL/NewSQL. Then we discuss their key properties (e.g., consistency, availabil- ity, fault-tolerance, etc.) and an important trade-off (i.e., CAP theorem). Finally, we will walk through exercises of installing and configuring Cassandra. If time permits, we will build a simple application on top of Cassandra. TUTORIAL OUTLINE The tutorial will follow the outline below: Brief introduction to distributed storage systems Brief introduction to NoSQL/NewSQL and their properties Brief introduction to CAP theorem Exercises: install, configure, and develop applica- tions on top of Cassandra AUDIENCE The tutorial will be accessible to anyone with a background of basic knowledge on algorithms and pro- gramming. We do not assume any background in cloud computing or distributed systems. Some knowledge of Linux would help. We welcome everyone to attend, but the tutorial is mainly designed for people who are interested in large- scale distributed storage systems and data-intensive stor- ages that require ultra-high availability and low latency. LEARNING GOALS After the tutorial, you are expected to learn: What is NoSQL/NewSQL? What is Cassandra? What are the trade-offs of large-scale distributed storage systems? How do I install, configure, and develop applica- tions on top of Cassandra? BIO Lewis Tseng is currently an assistant professor in Computer Science department at Boston College. Before that, he spent a year and a half as a researcher at Toyota InfoTechnology Center. He received a B.S. and a Ph.D. degree both in Computer Science from the University of Illinois at Urbana-Champaign (UIUC) in 2010 and 2016, respectively. His research broadly lies in the intersection of fault-tolerant computing and distributed computing. He won the best paper award in the International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS) 2017. Haochen (Roger) Pan is a junior at Boston College pursuing his dual degree in Computer Science and Mathematics. His research interests include distributed computing & systems, Blockchain-based systems, and vehicular ad-hoc networks. He received the Sophomore Scholar Award, the Advanced Study Grant, and the Undergraduate Research Fellowship from Boston College.

Tutorial: Deep Dive into Data-Intensive Distributed

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Tutorial: Deep Dive into Data-Intensive Distributed

Tutorial: Deep Dive into Apache Cassandra:Theory, Design, and Application

Lewis Tseng, Haochen Pan, Yingjian WuBoston CollegeBoston, USA

{lewis.tseng, haochen.pan, wuit}@bc.edu

TUTORIAL ABSTRACT

Distributed storage systems are fundamental to manyInternet-scale applications. To achieve high perfor-mance, modern storage systems often choose to provideonly weaker guarantees (e.g., weak consistency andeventual correctness). Moreover, to deal with semi-or unstructured and unpredictable data, these systemsmainly support a flexible data scheme and simple in-terface. As a result, it requires a new way of thinkingabout storage and data modeling.

This tutorial consists of a series of hands-on andinteractive exercises for beginners that have minimum oreven no experience in distributed storage systems. Wewill first give an overview of distributed storages andnew storage paradigms like NoSQL/NewSQL. Then wediscuss their key properties (e.g., consistency, availabil-ity, fault-tolerance, etc.) and an important trade-off (i.e.,CAP theorem). Finally, we will walk through exercisesof installing and configuring Cassandra. If time permits,we will build a simple application on top of Cassandra.

TUTORIAL OUTLINE

The tutorial will follow the outline below:• Brief introduction to distributed storage systems• Brief introduction to NoSQL/NewSQL and their

properties• Brief introduction to CAP theorem• Exercises: install, configure, and develop applica-

tions on top of Cassandra

AUDIENCE

The tutorial will be accessible to anyone with abackground of basic knowledge on algorithms and pro-gramming. We do not assume any background in cloudcomputing or distributed systems. Some knowledge ofLinux would help.

We welcome everyone to attend, but the tutorial ismainly designed for people who are interested in large-scale distributed storage systems and data-intensive stor-ages that require ultra-high availability and low latency.

LEARNING GOALS

After the tutorial, you are expected to learn:• What is NoSQL/NewSQL? What is Cassandra?

• What are the trade-offs of large-scale distributedstorage systems?

• How do I install, configure, and develop applica-tions on top of Cassandra?

BIO

Lewis Tsengis currentlyan assistantprofessor inComputer Sciencedepartment atBoston College.Before that, hespent a yearand a half asa researcherat Toyota

InfoTechnology Center. He received a B.S. and aPh.D. degree both in Computer Science from theUniversity of Illinois at Urbana-Champaign (UIUC)in 2010 and 2016, respectively. His research broadlylies in the intersection of fault-tolerant computing anddistributed computing. He won the best paper award inthe International Symposium on Stabilization, Safety,and Security of Distributed Systems (SSS) 2017.

Haochen (Roger) Panis a junior at BostonCollege pursuinghis dual degree inComputer Scienceand Mathematics.His research interestsinclude distributedcomputing & systems,Blockchain-basedsystems, and vehicularad-hoc networks.He received theSophomore Scholar

Award, the Advanced Study Grant, and theUndergraduate Research Fellowship from BostonCollege.

Page 2: Tutorial: Deep Dive into Data-Intensive Distributed

Yingjian (Steven) Wuis a senior at BostonCollege pursuinghis dual degree inComputer Scienceand Mathematics.His research interestsinclude distributedcomputing & systems,Blockchain-basedsystems, and Big

Data infrastructure internal design. He received IEEEComsoc Student Travel Grant for Globecom 2019. Hereceived the Advanced Study Grant for thesis researchand Undergraduate Research Fellowship from BostonCollege.