82
Introduction to Unicode & i18n in Rust Behnam Esfahbod Software Engineer

Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

  • Upload
    others

  • View
    20

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Introduction to Unicode & i18n in Rust

Behnam Esfahbod Software Engineer

Page 2: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

The Rust Programming Language has native support for Unicode Characters’ Unicode Scalar Values, to be exact. The language provides fast and compact string type with low-level control over memory consumption, while providing a high-level API and enforcing memory and data safety at compile time. The Rust Standard Library covers the basic Unicode functionalities, and third-party libraries – called Crates – are responsible for the rest. UNIC’s Unicode and Internationalization Crates for Rust is a project to develop a collection of crates for Unicode and internationalization data and algorithm, and tools to build them, designed to have reusable modules and easy-to-use and efficient API.

In this talk we will cover the basics of Rust's API for characters and strings, and look under the hood of how they are implemented in the compiler and the standard library. Afterwards, we look at UNIC's design model, how it implements various features, and lessons learned from building sharable organic micro components.

The talk is suitable for anyone new to Unicode, or Unicode experts who like to learn about how things are done in the Rust world.

Abstract

42nd

Internationalization &

Unicode Conference

September 2018

Santa Clara, CA, USA

!2

Page 3: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

The Rust Programming Language has native support for Unicode Characters’ Unicode Scalar Values, to be exact. The language provides fast and compact string type with low-level control over memory consumption, while providing a high-level API and enforcing memory and data safety at compile time. The Rust Standard Library covers the basic Unicode functionalities, and third-party libraries – called Crates – are responsible for the rest. UNIC’s Unicode and Internationalization Crates for Rust is a project to develop a collection of crates for Unicode and internationalization data and algorithm, and tools to build them, designed to have reusable modules and easy-to-use and efficient API.

In this talk we will cover the basics of Rust's API for characters and strings, and look under the hood of how they are implemented in the compiler and the standard library. Afterwards, we look at UNIC's design model, how it implements various features, and lessons learned from building sharable organic micro components.

The talk is suitable for anyone new to Unicode, or Unicode experts who like to learn about how things are done in the Rust world.

Abstract

42nd

Internationalization &

Unicode Conference

September 2018

Santa Clara, CA, USA

!3

Page 4: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Fluent 1.0 — Next Generation Localization System from Mozilla by Zibi Braniecki

Localization systems have been largely stagnant over the last 20 years. The last major innovation - ICU MessageFormat - has been designed before Unicode 3.0, targeting C++ and Java environments. Several attempts have been made since then to fit the API into modern programming environments with mixed results.

Fluent is a modern localization system designed over last 7 years by Mozilla. It builds on top of MessageFormat, ICU and CLDR, bringing integration with modern ICU features, bidirectionality, user friendly file format and bindings into modern programming environments like JavaScript, DOM, React, Rust, Python and others. The system comes with a full localization workflow cycle, command line tools and a CAT tool.

With the release of 1.0 we are ready to offer the new system to the wider community and propose it for standardization.

Looking for L10n in Rust?

Happening NOWon Track 3!

!4

Page 5: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

• Software Engineer @ Quora, Inc.

• Co-Chair of Arabic Layout Task Force @ W3C i18n Activity

• Virgule Typeworks

• Facebook, Inc.

• IRNIC Domain Registry

• Sharif FarsiWeb, Inc.

About me

!5

Page 6: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

• Quick Intro to Rust

• Characters & Strings

• It Gets Complicated!

• On Top of the Language

This talk

!6

Page 7: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Quick Intro to Rust

Page 8: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

XKCD | GOTO [CC BY-NC 2.5]!8

Page 9: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

History

!9

• 2006: The project started out of a personal project of Graydon Hoare

- OCaml compiler

• 2009: Mozilla began sponsoring

• 2011: Self-hosting compiler, using LLVM as backend

Page 10: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

History

!10

• 2006: The project started out of a personal project of Graydon Hoare

- OCaml compiler

• 2009: Mozilla began sponsoring

• 2011: Self-hosting compiler, using LLVM as backend

• Pre-2015: Many design changes

- Drop garbage collection

- Move memory allocation out of the compiler

• 2015: Rust 1.0, the first stable release

• 2018: First major new edition, Rust 2018

Page 11: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Build System & Tooling

!11

• Cargo

- Package manager

- Resolve dependencies

- Compile

- Build package and upload to crates.io

Page 12: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

!12

• Cargo

- Package manager

- Resolve dependencies

- Compile

- Build package and upload to crates.io

• Common tooling

- Rustup

- Rustfmt

- Clippy

- Bindgen

Build System & Tooling

Page 13: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Systems Language

!13

• Abstraction without overhead (ZCA)

- & without hidden costs

Page 14: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Systems Language

!14

• Abstraction without overhead (ZCA)

- & without hidden costs

• Compile to machine code

- & runs on microprocessors (no OS/malloc)

Page 15: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Systems Language

!15

• Abstraction without overhead (ZCA)

- & without hidden costs

• Compile to machine code

- & runs on microprocessors (no OS/malloc)

• Full control of memory usage

- Even where there’s no memory allocation

Page 16: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Systems Language

!16

• Abstraction without overhead (ZCA)

- & without hidden costs

• Compile to machine code

- & runs on microprocessors (no OS/malloc)

• Full control of memory usage

- Even where there’s no memory allocation

• Compiles to Web Assembly

- & runs in your favorite browser

Page 17: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Typed Language

!17

• Statically typed

- All types are known at compile-time

- Generics for data types and code blocks

Page 18: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Typed Language

!18

• Statically typed

- All types are known at compile-time

- Generics for data types and code blocks

• Strongly typed

- Harder to write incorrect programs

- No runtime null-pointer failures

Page 19: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Syntax

!19

Similar to C/C++ &

Java

Page 20: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Type System

!20

• Algebraic types

- First Systems PL

- Tuples, structs, enums, & unions

- Pattern matching (match) for selection and destructure

• Some basic types

- Option enum type: Some value, or None

- Result enum type: Ok value, or Err

Page 21: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Type System

!21

Option (example)

Page 22: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Type System

!22

Result (definition)

Page 23: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Type System

!23

Result (example)

Page 24: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Memory Management & Safety

!24

• No garbage collection

- Strict memory management

Page 25: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Memory Management & Safety

!25

• No garbage collection

- Strict memory management

• Ownership

- Memory parts are owned by exactly one variable

- Destruct memory when variable goes out of scope

Page 26: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Memory Management & Safety

!26

• No garbage collection

- Strict memory management

• Ownership

- Memory parts are owned by exactly one variable

- Destruct memory when variable goes out of scope

• Borrow checker

- Data-race free

- Similar to type checker

- Either read-only pointers or one read-write pointer

Page 27: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Memory Management & Safety

!27

• No garbage collection

- Strict memory management

• Ownership

- Memory parts are owned by exactly one variable

- Destruct memory when variable goes out of scope

• Borrow checker

- Data-race free

- Similar to type checker

- Either read-only pointers or one read-write pointer

• Lifetimes

- ≈ Position in the stack that owns the heap allocation

Page 28: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

!28

• Traits

- Define behavior (can’t own data)

- Inheritance

- Deref

Interfaces & Impl.s

Page 29: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

!29

• Traits

- Define behavior (can’t own data)

- Inheritance

- Deref

• Impl blocks

- Implement types and traits (can’t own data)

- Composition

Interfaces & Impl.s

Page 30: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

!30

• Traits

- Define behavior (can’t own data)

- Inheritance

- Deref

• Impl blocks

- Implement types and traits (can’t own data)

- Composition

• Code blocks

- Functions, methods and closures

Interfaces & Impl.s

Page 31: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

!31

• Traits

- Define behavior (can’t own data)

- Inheritance

- Deref

• Impl blocks

- Implement types and traits (can’t own data)

- Composition

• Code blocks

- Functions, methods and closures

• Macros

- assert!(), format!(), print!(), println!()

Interfaces & Impl.s

Page 32: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Characters & Strings

Page 33: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Numeric Types

!33

• Signed & unsigned integer types

Page 34: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Numeric Types

!34

• Signed & unsigned integer types

• Floating-point types

- f32, f64

Page 37: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Character Type

!37

Unicode scalar

values

• As defined by The Unicode Standard

- “Any Unicode code point, except high-surrogate and low-surrogate

code points.”

- U+0000 to U+D7FF (inclusive)

- U+E000 to U+10FFFF (inclusive)

- Total of 1,112,064 code points

Page 38: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Character Type

!38

Limited integer

type

• No numerical operations on the char type

- What would the result of `U+D7FF + 1`?

Page 39: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Character Type

!39

Algebraic types in

action

• Compiler knows that all values of the 4 bytes are not used!

Page 40: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

!40

Page 41: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Pointer Types

!41

• Narrow pointers

- Point to Sized types (size is known at compile-time)

- Single usize value

Page 42: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Pointer Types

!42

• Narrow pointers

- Point to Sized types (size is known at compile-time)

- Single usize value

• Fat Pointers

- Point to something with unknown size (at compile-time)

- Single usize value, plus more data

Page 43: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Arrays,Slices & Vectors

!43

• Arrays

- Sized sequence of elements

- [T; size] - Unsized sequence of elements

- [T]

Page 44: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Arrays,Slices & Vectors

!44

• Arrays

- Sized sequence of elements

- [T; size] - Unsized sequence of elements

- [T]

• Slice

- A view into a sequence of elements

- &[T]

- On arrays, vectors, …

Page 45: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Arrays,Slices & Vectors

!45

• Arrays

- Sized sequence of elements

- [T; size] - Unsized sequence of elements

- [T]

• Slice

- A view into a sequence of elements

- &[T]

- On arrays, vectors, …

• Vector

- A dynamic-length sequence of

elements

- Sit in the heap

Page 46: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

String Types

!46

• str

- A special [u8]

- Always a valid UTF-8 sequence

Page 47: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

String Types

!47

• str

- A special [u8]

- Always a valid UTF-8 sequence

• &str

- A special &[u8]

Page 48: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

String Types

!48

• str

- A special [u8]

- Always a valid UTF-8 sequence

• &str

- A special &[u8]

• String

- A dynamic UTF-8 sequence

- Return type of str functions that

cannot guarantee preserving bytes

length

Page 49: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

String Operations

!49

ASCII-only

• Apply to both &[u8] and &str

Page 50: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

String Operations

!50

Non-ASCII Unicode

• Apply only to &str

Page 51: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Iterating Strings

!51

• Iterating over characters of a string

Page 52: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

!52

Page 53: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

It Gets Complicated!

Page 54: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Cross-Platform Encoding Challenges

• OS & environment variables

- File Names

- Environment variables

- Command-line parameters

• Different per system

- Unix: bytes; commonly UTF-8 these days

- Windows: UTF-16, but not always well-formed

!54

Page 55: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Cross-Platform Encoding Challenges

• OS & environment variables

- File Names

- Environment variables

- Command-line parameters

• Different per system

- Unix: bytes; commonly UTF-8 these days

- Windows: UTF-16, but not always well-formed

• OsStr and OsString

- Internal data depends on OS

- &OsStr is to OsString as &str is to String

!55

Page 56: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Cross-PlatformData Types

• OsStr and OsString

- Internal data depends on OS

- &OsStr is to OsString as &str is to String

!56

Page 57: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Cross-PlatformData Types

• OsStr and OsString

- Internal data depends on OS

- &OsStr is to OsString as &str is to String

• Trait std::ffi::OsStr

- pub fn to_str(&self) -> Option<&str>

!57

Page 58: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Cross-PlatformData Types

• OsStr and OsString

- Internal data depends on OS

- &OsStr is to OsString as &str is to String

• Trait std::ffi::OsStr

- pub fn to_str(&self) -> Option<&str>

• Trait std::os::unix::ffi::OsStrExt

- fn from_bytes(slice: &[u8]) -> &Self

- fn as_bytes(&self) -> &[u8]

!58

Page 59: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Cross-PlatformData Types

• OsStr and OsString

- Internal data depends on OS

- &OsStr is to OsString as &str is to String

• Trait std::ffi::OsStr

- pub fn to_str(&self) -> Option<&str>

• Trait std::os::unix::ffi::OsStrExt

- fn from_bytes(slice: &[u8]) -> &Self

- fn as_bytes(&self) -> &[u8]

• Trait std::os::windows::ffi::OsStrExt

- fn encode_wide(&self) -> EncodeWide

!59

Page 60: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Working with C APIs

• CStr and CString

- A borrowed reference to a nul-terminated array of bytes

- CStr is to CString as &str is to String

!60

Page 61: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Working with C APIs

• CStr and CString

- A borrowed reference to a nul-terminated array of bytes

- CStr is to CString as &str is to String

• Trait std::ffi::CStr

- pub unsafe fn from_ptr<'a>(ptr: *const c_char) -> &'a CStr

- pub fn to_str(&self) -> Result<&str, Utf8Error>

!61

Page 62: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

On Top of the Language

Page 63: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Unicode & i18n Crates

!63

• Encoding/Charsets

- Firefox is already using a Rust component for that!

• Rust Project

- String algorithms needed for a compiler

• Servo Project

- Basic string algorithms needed for a rendering engine

• Locale-aware API

- Actually not much available yet

- WIP by Mozilla, et al.

Page 64: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Rust Compiler & Libraries

!64

UNIC Experiment

UNIC: Unicode and

i18n Crates for Rust

Core / core Standard Library / std

UNIC

Character Utilities / unic::char::*

Unicode Character Databaseunic::ucd::[ bidi, name, age, normal, ident, segment, … ]

Unicode Bidirectional

Algorithm unic::bidi

Unicode Normalization

Algorithm unic::bidi

Unicode Segmentation

Algorithm unic::segment

Unicode IDNA

unic::idna

Unicode Emoji

unic::emoji…

Page 65: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

What’s this?

!65

Hello سالم

Page 66: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

How about this?

!66

Hello سالم

Page 67: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

A Case of Missing Bidi Context

How about Locale

Context?

!67

Hello سالمHello سالم

Page 68: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Programming Languages

!68

• Machine language

Page 69: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Programming Languages

!69

• Machine language

• Procedural

• GOTO

Page 70: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Programming Languages

!70

• Machine language

• Procedural

• GOTO

• Functional

Page 71: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Programming Languages

!71

• Machine language

• Procedural

• GOTO

• Functional

• Garbage collection

Page 72: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Programming Languages

!72

• Machine language

• Procedural

• GOTO

• Functional

• Garbage collection

• Strict memory management

Page 73: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Unicode & i18n

!73

• Byte == Char

Page 74: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Unicode & i18n

!74

• Byte == Char

• Contextual Charset

Page 75: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Unicode & i18n

!75

• Byte == Char

• Contextual Charset

• Separation of text encoding & font encoding

Page 76: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Unicode & i18n

!76

• Byte == Char

• Contextual Charset

• Separation of text encoding & font encoding

• Unified encoding

Page 77: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Unicode & i18n

!77

• Byte == Char

• Contextual Charset

• Separation of text encoding & font encoding

• Unified encoding

• Contextual Local

Page 78: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Summary:Unicode & i18n

!78

• Byte == Char

• Contextual Charset

• Separation of text encoding & font encoding

• Unified encoding

• Contextual Local

• ???

Page 79: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

HOPE

Page 80: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Additional Resources

!80

• Rust Community

- rust-lang.org

- doc.rust-lang.org

- play.rust-lang.org

- users.rust-lang.org

- reddit.com/r/rust/

- rustup.rs

- crates.io

- unicode-rs.github.io

- newrustacean.com

• Servo, the Parallel Browser Engine Project

- servo.org

• UNIC: Unicode and Internationalization Crates for Rust

- https://github.com/open-i18n/rust-unic

Page 81: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment

Questions?

質問? ׁשְאֵלֹות?

प्रश्न?

질문?

سؤال؟پرسش؟

Page 82: Unicode Conference 44 - Behnam Esfahbod Software Engineer...Unicode Bidirectional Algorithm unic::bidi Unicode Normalization Algorithm unic::bidi Unicode Segmentation Algorithm unic::segment