PROGRAMMING USING AUTOMATA AND TRANSDUCERS
Loris D'Antoni, Margus Veanes
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
All features of a general-purpose language vs. features needed: replace, match, char
Slide 7
FOR EACH DOMAIN-SPECIFIC TASK
Design a language that:
- only has the features required by the task
- is simple to use
- enables automatic reasoning about what the programs do
- compiles into efficient code
Slide 8
OUTLINE
- Automata, transducers, and programs
- BEK and string sanitizers
- BEX and string encoders
- FAST and tree-manipulating programs
- What's next?
Slide 9
AUTOMATA, TRANSDUCERS, AND PROGRAMS 9
Slide 10
FOR EACH DOMAIN-SPECIFIC TASK
Design a language that:
- only has the features required by the task
- is simple to use
- enables automatic reasoning about what the programs do
- compiles into efficient code
Slide 11
Finite alphabet:

  type base = A | T | C | G

Languages of strings:

  (* all_TG: every base is T or G *)
  let rec all_TG (l : base list) : bool =
    match l with
    | [] -> true
    | h :: t -> (h = T || h = G) && all_TG t

  (* all_AC: every base is A or C *)
  let rec all_AC (l : base list) : bool =
    match l with
    | [] -> true
    | h :: t -> (h = A || h = C) && all_AC t

Transformations from strings to strings:

  (* map_base: swap A with T and G with C *)
  let rec map_base (l : base list) : base list =
    match l with
    | [] -> []
    | A :: t -> T :: map_base t
    | T :: t -> A :: map_base t
    | G :: t -> C :: map_base t
    | C :: t -> G :: map_base t

  (* filter_AC: keep only A and C *)
  let rec filter_AC (l : base list) : base list =
    match l with
    | [] -> []
    | A :: t -> A :: filter_AC t
    | T :: t -> filter_AC t
    | G :: t -> filter_AC t
    | C :: t -> C :: filter_AC t

Automata (one state q0 each):
- all_TG: q0 loops on T, G
- all_AC: q0 loops on A, C
Transducers (one state q0 each, input/output):
- map_base: A/T, T/A, G/C, C/G
- filter_AC: A/A, T/ε, G/ε, C/C
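Because each machine on this slide has a single state, its behavior is fully described by a per-symbol table. The following is a minimal Python sketch of that reading; the names `run`, `accepts`, `MAP_BASE`, and `FILTER_AC` are illustrative and do not come from the slides.

```python
# Each single-state transducer is a table: input symbol -> output string.
# An empty output string plays the role of an epsilon (deleting) transition.
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}


def run(table, word):
    """Apply a single-state transducer to a word, symbol by symbol."""
    return "".join(table[symbol] for symbol in word)


def accepts(allowed, word):
    """Single-state automaton: accept iff every symbol is in `allowed`."""
    return all(symbol in allowed for symbol in word)


print(run(MAP_BASE, "ATGC"))   # TACG
print(run(FILTER_AC, "ATGC"))  # AC
print(accepts("TG", "TGGT"))   # True
print(accepts("AC", "ACT"))    # False
```

The table form only works because there is one state; a multi-state machine would also need a transition function on states.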
Slide 12
FINITE AUTOMATA
[Diagram: automaton with transitions labeled a and b]
Examples: abab -> Yes; aba -> No; bb -> Yes; a -> No
Slide 13
FINITE STATE TRANSDUCERS
[Diagram: transducer with transitions labeled a/aa and b/bb, and a final output zz]
Examples: ab -> aabbzz; b -> bbzz; aba -> UNDEFINED
Slide 14
BENEFITS OF AUTOMATA AND TRANSDUCERS
Closure and decidability for automata:
- Intersection, union, complement
- Decidable emptiness
- Decidable equivalence
- Can be minimized
Slide 15
BENEFITS OF AUTOMATA AND TRANSDUCERS
Transducer composition:

  let m_f_DNA (l : base list) : base list = filter_AC (map_base l)

Transducers (one state q0 each, input/output):
- map_base: A/T, T/A, G/C, C/G
- filter_AC: A/A, T/ε, G/ε, C/C
- m_f_DNA (the composition): A/ε, T/A, G/C, C/ε
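For single-state transducers, composition can be computed table by table: feed each output of the first machine through the second. This is a small Python sketch under that assumption; `apply` and `compose` are hypothetical helper names, and general transducer composition needs a product construction over states instead.

```python
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}


def apply(table, word):
    """Run a single-state transducer given as a symbol -> output table."""
    return "".join(table[symbol] for symbol in word)


def compose(first, second):
    """Table of the transducer that runs `first`, then `second` on its output."""
    return {symbol: apply(second, out) for symbol, out in first.items()}


M_F_DNA = compose(MAP_BASE, FILTER_AC)
print(M_F_DNA)  # {'A': '', 'T': 'A', 'G': 'C', 'C': ''}

# The composed table agrees with running the two passes one after the other.
word = "GATTACA"
print(apply(M_F_DNA, word) == apply(FILTER_AC, apply(MAP_BASE, word)))  # True
```

The composed machine reads the input once and never materializes the intermediate list.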
Slide 16
BENEFITS OF AUTOMATA AND TRANSDUCERS
Type-checking
Goal: input in all_TG -> map_base -> output in all_AC
map_base ∘ (¬all_AC): map_base, only defined if the output is in (¬all_AC)
Slide 17
BENEFITS OF AUTOMATA AND TRANSDUCERS
Type-checking
Goal: input in all_TG -> map_base -> output in all_AC
dom(map_base ∘ (¬all_AC)): the inputs for which map_base does not output in all_AC
Slide 18
BENEFITS OF AUTOMATA AND TRANSDUCERS
Type-checking
Goal: input in all_TG -> map_base -> output in all_AC
Check: dom(map_base ∘ (¬all_AC)) ∩ all_TG = ∅
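The type-checking question asks whether any input in all_TG is mapped by map_base to an output outside all_AC. With automata this is an emptiness check that covers all inputs; the Python sketch below is only a bounded exhaustive search (all words up to a chosen length), so it illustrates the property rather than proving it. The function names are illustrative.

```python
from itertools import product

MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}


def map_base(word):
    return "".join(MAP_BASE[symbol] for symbol in word)


def all_tg(word):
    return all(symbol in "TG" for symbol in word)


def all_ac(word):
    return all(symbol in "AC" for symbol in word)


def counterexamples(max_len):
    """Words in all_TG (length <= max_len) whose image is not in all_AC."""
    bad = []
    for length in range(max_len + 1):
        for letters in product("ATCG", repeat=length):
            word = "".join(letters)
            if all_tg(word) and not all_ac(map_base(word)):
                bad.append(word)
    return bad


print(counterexamples(6))  # []
```

An empty result means no violation exists among the enumerated words; the transducer-based check removes the length bound.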
Slide 19
BENEFITS OF AUTOMATA AND TRANSDUCERS
Transducer equivalence:

  let m_f_DNA (l : base list) : base list = filter_AC (map_base l)
  let f_m_DNA (l : base list) : base list = map_base (filter_AC l)

Is m_f_DNA equivalent to f_m_DNA?
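One way to get a feel for the equivalence question is to search for a shortest input on which the two pipelines disagree. The Python sketch below does this by enumeration; it is a stand-in for the transducer equivalence algorithm, not an implementation of it, and the helper names are illustrative.

```python
from itertools import product

MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}


def apply(table, word):
    return "".join(table[symbol] for symbol in word)


def m_f_dna(word):
    return apply(FILTER_AC, apply(MAP_BASE, word))


def f_m_dna(word):
    return apply(MAP_BASE, apply(FILTER_AC, word))


def first_difference(max_len):
    """Shortest word (up to max_len) on which the two pipelines differ, or None."""
    for length in range(max_len + 1):
        for letters in product("ATCG", repeat=length):
            word = "".join(letters)
            if m_f_dna(word) != f_m_dna(word):
                return word, m_f_dna(word), f_m_dna(word)
    return None


print(first_difference(3))  # ('A', '', 'T')
```

On input A, mapping first gives T, which the filter drops, while filtering first keeps A and then maps it to T, so the two programs are not equivalent.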
Slide 20
FOR EACH DOMAIN-SPECIFIC TASK
Design a language that:
- only has the features required by the task
- is simple to use
- enables automatic reasoning about what the programs do
- compiles into efficient code
Slide 21
OUTLINE
- Automata, transducers, and programs
- BEK and string sanitizers
- BEX and string encoders
- FAST and tree-manipulating programs
- What's next?
Slide 22
BEK: ANALYSIS OF STRING SANITIZERS [USENIX11, POPL12]
P. Hooimeijer, M. Veanes, B. Livshits, D. Molnar, P. Saxena
Slide 30
FIRST LINE OF DEFENSE: SANITIZERS
Sanitizer: a string transformation function.
[Diagram: untrusted data -> sanitizer -> sanitized data; example strings im.png' and img.png']
PLDI'12 submission presentations, Dec 8, 2011
Slide 31
COMPARING SANITIZERS 31
Slide 32
32 ' ' single quote html entity
Slide 33
33 some untrusted input
Slide 34
Input: some untrusted input
Library A
  Name: HtmlEncode
  Around for: Years
  Availability: Readily available to C# developers
Slide 35
Input: some untrusted input
Library A
  Name: HtmlEncode
  Around for: Years
  Availability: Readily available to C# developers
Library B
  Name: HtmlEncode
  Around for: Years
  Availability: Readily available to C# developers
Slide 36
Library A
  Name: HtmlEncode
  Around for: Years
  Availability: Readily available to C# developers
Library B
  Name: HtmlEncode
  Around for: Years
  Availability: Readily available to C# developers
' '    ' '
Slide 37
.NET WebUtility:

  public static string HtmlEncode(string s) {
      if (s == null) return null;
      int num = IndexOfHtmlEncodingChars(s, 0);
      if (num == -1) return s;
      StringBuilder builder = new StringBuilder(s.Length + 5);
      int length = s.Length;
      int startIndex = 0;
  Label_002A:
      if (num > startIndex) {
          builder.Append(s, startIndex, num - startIndex);
      }
      char ch = s[num];
      if (ch > '>') {
          builder.Append("&#");
          builder.Append(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
          builder.Append(';');
      } else {
          char ch2 = ch;
          if (ch2 != '"') {
              switch (ch2) {
                  case '<': builder.Append("&lt;"); goto Label_00D5;
                  case '>': builder.Append("&gt;"); goto Label_00D5;
                  case '&': builder.Append("&amp;"); goto Label_00D5;
              }
          } else {
              builder.Append("&quot;");
          }
      }
  Label_00D5:
      startIndex = num + 1;
      if (startIndex < length) {
          num = IndexOfHtmlEncodingChars(s, startIndex);
          if (num != -1) { goto Label_002A; }
          builder.Append(s, startIndex, length - startIndex);
      }
      return builder.ToString();
  }

MS AntiXSS:

  private static string HtmlEncode(string input, bool useNamedEntities, MethodSpecificEncoder encoderTweak) {
      if (string.IsNullOrEmpty(input)) { return input; }
      if (characterValues == null) { InitialiseSafeList(); }
      if (useNamedEntities && namedEntities == null) { InitialiseNamedEntityList(); }

      // Setup a new character array for output.
      char[] inputAsArray = input.ToCharArray();
      int outputLength = 0;
      int inputLength = inputAsArray.Length;
      char[] encodedInput = new char[inputLength * 10];

      SyncLock.EnterReadLock();
      try {
          for (int i = 0; i < inputLength; i++) {
              char currentCharacter = inputAsArray[i];
              int currentCodePoint = inputAsArray[i];
              char[] tweekedValue;

              // Check for invalid values
              if (currentCodePoint == 0xFFFE || currentCodePoint == 0xFFFF) {
                  throw new InvalidUnicodeValueException(currentCodePoint);
              } else if (char.IsHighSurrogate(currentCharacter)) {
                  if (i + 1 == inputLength) {
                      throw new InvalidSurrogatePairException(currentCharacter, '\0');
                  }
                  // Now peek ahead and check if the following character is a low surrogate.
                  char nextCharacter = inputAsArray[i + 1];
                  char nextCodePoint = inputAsArray[i + 1];
                  if (!char.IsLowSurrogate(nextCharacter)) {
                      throw new InvalidSurrogatePairException(currentCharacter, nextCharacter);
                  }
                  // Look-ahead was good, so skip.
                  i++;
                  // Calculate the combined code point
                  long combinedCodePoint = 0x10000 + ((currentCodePoint - 0xD800) * 0x400) + (nextCodePoint - 0xDC00);
                  char[] encodedCharacter = SafeList.HashThenValueGenerator(combinedCodePoint);
                  encodedInput[outputLength++] = '&';
                  for (int j = 0; j < encodedCharacter.Length; j++) {
                      encodedInput[outputLength++] = encodedCharacter[j];
                  }
                  encodedInput[outputLength++] = ';';
              } else if (char.IsLowSurrogate(currentCharacter)) {
                  throw new InvalidSurrogatePairException('\0', currentCharacter);
              } else if (encoderTweak != null && encoderTweak(currentCharacter, out tweekedValue)) {
                  for (int j = 0; j < tweekedValue.Length; j++) {
                      encodedInput[outputLength++] = tweekedValue[j];
                  }
              } else if (useNamedEntities && namedEntities[currentCodePoint] != null) {
                  char[] encodedCharacter = namedEntities[currentCodePoint];
                  encodedInput[outputLength++] = '&';
                  for (int j = 0; j < encodedCharacter.Length; j++) {
                      encodedInput[outputLength++] = encodedCharacter[j];
                  }
                  encodedInput[outputLength++] = ';';
              } else if (characterValues[currentCodePoint] != null) {
                  // character needs to be encoded
                  char[] encodedCharacter = characterValues[currentCodePoint];
                  encodedInput[outputLength++] = '&';
                  for (int j = 0; j < encodedCharacter.Length; j++) {
                      encodedInput[outputLength++] = encodedCharacter[j];
                  }
                  encodedInput[outputLength++] = ';';
              } else {
                  // character does not need encoding
                  encodedInput[outputLength++] = currentCharacter;
              }
          }
      } finally {
          SyncLock.ExitReadLock();
      }
      return new string(encodedInput, 0, outputLength);
  }
Slide 38
.NET WebUtility vs. MS AntiXSS (the two HtmlEncode implementations shown on Slide 37)
- Same behavior on all inputs?
- If not, what is a differentiating input?
- Can it generate any known bad outputs?
Slide 39
PHP Trunk: changes to html.c, 1999-2011
Slide 40
PHP Trunk: changes to html.c, 1999-2011
- R7,841, April 1999: 135 loc
- R309,482, March 2011: 1693 loc
Slide 41
PHP Trunk: changes to html.c, 1999-2011
- R7,841, April 1999: 135 loc
- R32,564, September 2000: ENT_QUOTES introduced
- R309,482, March 2011: 1693 loc
Slide 42
PHP Trunk: changes to html.c, 1999-2011
- R7,841, April 1999: 135 loc
- R32,564, September 2000: ENT_QUOTES introduced
- R242,949, September 2007: $double_encode=true
- R309,482, March 2011: 1693 loc
Slide 43
PHP Trunk: changes to html.c, 1999-2011
- Safe to apply twice?
- Safe to combine with other sanitizers?
Slide 44
MOTIVATION
- Writing string sanitizers correctly is difficult
- There is no cheap way to identify problems with sanitizers
- Correctness is a moving target
What if we could say more about sanitizer behavior?
Slide 45
CONTRIBUTIONS
BEK
- Frontend: a small language for string manipulation; similar to how sanitizers are written today
- Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
Slide 46
CONTRIBUTIONS
BEK
- Frontend: a small language for string manipulation; similar to how sanitizers are written today
- Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
Evaluation
- Converted sanitizers from a variety of sources
- Checked properties like reversibility, idempotence, equivalence, and commutativity
Slide 47
BEK ARCHITECTURE
Bek program:

  s := iter(c in t)[b := false;] {
    case (!b && c in "[\"\\]"): b := false; yield('\\', c);
    case (c == '\\'): b := !b; yield(c);
    case (true): b := false; yield(c);
  };

Slide 48
BEK ARCHITECTURE
Bek program -> Transformation -> Symbolic Finite Transducers (Microsoft.Automata, Z3)
Slide 49
BEK ARCHITECTURE
Bek program -> Transformation -> Symbolic Finite Transducers (Microsoft.Automata, Z3) -> Analysis
Analysis: Does it do the right thing? Counterexample: \' vs. \\'
Slide 50
BEK ARCHITECTURE
Bek program -> Transformation -> Symbolic Finite Transducers (Microsoft.Automata, Z3) -> Analysis and Code Gen
Analysis: Does it do the right thing? Counterexample: \' vs. \\'
Code Gen: C#, JavaScript, C
Slide 51
BEK ARCHITECTURE
Bek program -> Transformation -> Symbolic Finite Transducers (Microsoft.Automata, Z3) -> Analysis and Code Gen
Analysis: Does it do the right thing? Counterexample: \' vs. \\'
Code Gen: C#, JavaScript, C
Slide 52
A BEK PROGRAM: ESCAPE QUOTES

  escape := iter(c in s)[b := false;] {
    case (!b && c in "['\"]"): b := false; yield('\\', c);
    case (c == '\\'): b := !b; yield(c);
    case (true): b := false; yield(c);
  };

Slide 53
A BEK PROGRAM: ESCAPE QUOTES

  escape := iter(c in s)[b := false;] {
    case (!b && c in "['\"]"): b := false; yield('\\', c);
    case (c == '\\'): b := !b; yield(c);
    case (true): b := false; yield(c);
  };

- iterate over the characters in string s
Slide 54
A BEK PROGRAM: ESCAPE QUOTES

  escape := iter(c in s)[b := false;] {
    case (!b && c in "['\"]"): b := false; yield('\\', c);
    case (c == '\\'): b := !b; yield(c);
    case (true): b := false; yield(c);
  };

- iterate over the characters in string s
- while updating one boolean variable b
Simple dedicated syntax
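To make the control flow of the escape program concrete, here is a direct Python transliteration. It is a sketch of the semantics as read from the slide (first matching case wins, one boolean register `b`), not code produced by BEK.

```python
def escape(s):
    """Escape unescaped single and double quotes with a backslash.

    b is True exactly when the previous character was an unescaped backslash.
    """
    b = False
    out = []
    for c in s:
        if not b and c in "'\"":
            b = False
            out.append("\\" + c)   # quote not preceded by a backslash: escape it
        elif c == "\\":
            b = not b              # toggle: a second backslash closes the first
            out.append(c)
        else:
            b = False
            out.append(c)          # already-escaped quote or ordinary character
    return "".join(out)


print(escape("it's"))     # it\'s
print(escape("it\\'s"))   # it\'s  (already escaped, left unchanged)
print(escape(escape("say \"hi\"")) == escape("say \"hi\""))  # True
```

The last line checks idempotence on one sample input; BEK's analysis answers the same question for all inputs.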
Slide 55
BEK ARCHITECTURE
Bek program (as on Slide 47) -> Transformation -> Symbolic Finite Transducers (Microsoft.Automata, Z3) -> Analysis and Code Gen
Analysis: Does it do the right thing? Counterexample: \' vs. \\'
Code Gen: C#, JavaScript, C
Slide 56
FINITE STATE TRANSDUCERS
Transitions: a/A, b/B, ..., z/Z, ..., &/&
Problem: the alphabet has 2^16 characters. TOO MANY TRANSITIONS
Slide 57
SYMBOLIC FINITE TRANSDUCERS
Only two transitions!!
- x in [a-z] / x-32
- x not in [a-z] / x
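The two symbolic transitions can be written as (predicate, output function) pairs over character codes. The Python sketch below runs such a one-state symbolic transducer; the representation as a list of lambdas is an illustrative choice, not the Microsoft.Automata API.

```python
# One state, two symbolic transitions: (guard predicate, output function).
TRANSITIONS = [
    (lambda x: ord("a") <= x <= ord("z"), lambda x: x - 32),        # x in [a-z] / x-32
    (lambda x: not (ord("a") <= x <= ord("z")), lambda x: x),       # x not in [a-z] / x
]


def to_upper(text):
    """Run the one-state symbolic transducer over the character codes of text."""
    out = []
    for ch in text:
        x = ord(ch)
        for guard, output in TRANSITIONS:
            if guard(x):
                out.append(chr(output(x)))
                break
    return "".join(out)


print(to_upper("hello & bye"))  # HELLO & BYE
```

The two guards cover the whole alphabet, however large it is, which is the point of the symbolic representation.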
Slide 58
SYMBOLIC FINITE TRANSDUCERS
Transitions have the form predicate / sequence of functions:
- x>5 / x+1, x
- x%2=1 / x-1, x, x+4
- true / 5
- true / x-4
Alphabet theory has to be DECIDABLE: we'll use Z3 to check predicate satisfiability
Slide 59
BEK ARCHITECTURE
Bek program (as on Slide 47) -> Transformation -> Symbolic Finite Transducers (Microsoft.Automata, Z3) -> Analysis and Code Gen
Analysis: Does it do the right thing? Counterexample: \' vs. \\'
Code Gen: C#, JavaScript, C
Slide 60
BEK ARCHITECTURE
Bek program (as on Slide 47) -> Transformation -> Symbolic Finite Transducers (Microsoft.Automata, Z3) -> Analysis and Code Gen
Analysis: Does it do the right thing? Counterexample: \' vs. \\'
Code Gen: C#, JavaScript, C
Now what?
Slide 61
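SFT ALGORITHMS
EQUIVALENCE CHECKING IS DECIDABLE!
Alphabet theory has to be DECIDABLE: we'll use Z3 to check predicate satisfiability
Slide 63
SFT ALGORITHMS
CLOSED UNDER COMPOSITION
[Diagram: in -> SFT A -> SFT B -> out is replaced by in -> a single SFT for the composition of A and B -> out]
Slide 64
SFT ALGORITHMS
COMPOSITION
[Diagram: in -> SFT A -> SFT B -> out is replaced by in -> a single SFT for the composition of A and B -> out]
JavaScriptEncode(HtmlEncode(w)) = HtmlEncode(JavaScriptEncode(w))
Slide 65
PRE-IMAGE COMPUTATION
[Diagram: SFT A maps in to out; from a regular language O of outputs, compute the regular language I of inputs]
Slide 66
PRE-IMAGE COMPUTATION
[Diagram: SFT A maps in to out; from a vulnerability signature on the outputs, compute the MALICIOUS INPUTS]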
Slide 67
CONTRIBUTIONS
BEK
- Frontend: a small language for string manipulation; similar to how sanitizers are written today
- Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
Evaluation
- Converted sanitizers from a variety of sources
- Checked properties like reversibility, idempotence, equivalence, and commutativity
Slide 68
QUESTIONS?
- Can BEK model existing sanitizers?
- Can we use it to check interesting properties on real sanitizers?
Slide 69
WHAT FEATURES ARE NEEDED?
Language features: inspect feature counts
Data:
- 1x OWASP HTMLencode
- 13x Google AutoEscape
- 21x IE 8 XSS Filter
- 7x Synthetic
Slide 70
WHAT FEATURES ARE NEEDED?
Language features:
- Majority (76%) of sanitizers can be ported without extending the language
- With multi-character lookahead: 90%
Slide 71
CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
Data:
- 4x MS internal HtmlEncode
- 3x for-hire HtmlEncode based on an English-language specification (C#)
Commutative? Equivalent?
Slide 72
CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
Short answer: Yes!
Slide 73
CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
Short answer: Yes!
EQ results take less than a minute to obtain:
[Table: pairwise equivalence results for implementations 1-7]
Slide 74
DOES IT SCALE?
[Plots: Commutativity, Self-Equivalence]
Slide 75
WERE ALL SANITIZERS BROKEN?
The Cheat Sheet: one out of seven implementations correctly encodes all strings for use in both HTML and attribute contexts
Slide 76
BEK IN A NUTSHELL
- BEK is a domain-specific language for writing string sanitizers
- BEK can model programs without approximation using symbolic finite transducers, enabling, e.g., equivalence checks
- BEK was evaluated using real-world sanitizers from a variety of different sources
Slide 77
OUTLINE
- Automata, transducers, and programs
- BEK and string sanitizers
- BEX and string encoders
- FAST and tree-manipulating programs
- What's next?
Slide 78
BEX: ANALYSIS OF STRING ENCODERS [VMCAI13, CAV13]
Loris D'Antoni, Margus Veanes
Slide 79
Plain text: Hi, I'm plain text! Nice to meet you!
Encoded: SGkgSSdtIHBsYWluIHRleHQsIG5pY2UgdG8gbWVldCB5b3Uh
[Diagram: plain text -> Encoder -> encoded text -> Decoder -> plain text]
Slide 80
NOT SO EASY TO GET RIGHT 80
Slide 81
WHEN ARE THEY CORRECT?
[Diagram: T -> Encoder -> Decoder -> T]
Slide 82
CAN WE USE TRANSDUCERS?
[Diagram: T -> Encoder -> Decoder -> T]
Encoder o Decoder = Identity
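The correctness condition says that decoding an encoded string gives back the original. As a rough illustration, the Python sketch below tests this round trip for the standard library's Base64 codec on every byte string of length at most 2. This is bounded testing with a stock codec chosen for illustration; the transducer-based approach on the slides aims to establish the identity for all inputs.

```python
import base64
from itertools import product


def first_roundtrip_failure(max_len):
    """Return the first byte string (length <= max_len) with decode(encode(x)) != x."""
    for length in range(max_len + 1):
        for values in product(range(256), repeat=length):
            raw = bytes(values)
            if base64.b64decode(base64.b64encode(raw)) != raw:
                return raw
    return None


print(first_roundtrip_failure(2))  # None
```

Exhaustive testing grows as 256^n, which is why a symbolic equivalence check against the identity transducer is attractive.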
Slide 83
BEK: WHAT FEATURES WERE NEEDED?
Language features:
- Majority (76%) of sanitizers can be ported without extending Bek
- With multi-character lookahead: 90%
Slide 84
BASE64 ENCODER: 3 bytes -> 4 Base64 characters

  Text content     M         a         n
  Bytes            77        97        110
  Bit pattern      01001101  01100001  01101110
  Index            19   22   5    46
  Base64 encoded   T    W    F    u
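The table's arithmetic can be reproduced by regrouping the 24 input bits into four 6-bit indices. The Python sketch below does this for one 3-byte block; `encode_block` and `ALPHABET` are illustrative names, and padding for inputs that are not a multiple of 3 bytes is deliberately left out.

```python
# Standard Base64 alphabet: index 0-25 = A-Z, 26-51 = a-z, 52-61 = 0-9, then + and /.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_block(b1, b2, b3):
    """Split three bytes (24 bits) into four 6-bit indices and look them up."""
    indices = [
        b1 >> 2,                          # top 6 bits of byte 1
        ((b1 & 0x3) << 4) | (b2 >> 4),    # low 2 bits of byte 1 + top 4 bits of byte 2
        ((b2 & 0xF) << 2) | (b3 >> 6),    # low 4 bits of byte 2 + top 2 bits of byte 3
        b3 & 0x3F,                        # low 6 bits of byte 3
    ]
    return indices, "".join(ALPHABET[i] for i in indices)


print(encode_block(77, 97, 110))  # ([19, 22, 5, 46], 'TWFu')
```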
Slide 85
85 HOW DO WE EXTEND BEK?
Slide 86
BEK ARCHITECTURE
Bek program (as on Slide 47) -> Transformation -> Symbolic Finite Transducers (Microsoft.Automata, Z3) -> Analysis and Code Gen
Analysis: Does it do the right thing? Counterexample: \' vs. \\'
Code Gen: C#, JavaScript, C
Symbolic finite transducers don't have registers
Slide 87
TRANSDUCERS WITH REGISTERS
Transitions (input / [outputs], register update):
- x / [ x>>2 ], r := (x&3)<<4
- x / [ r | (x>>4) ], r := (x&0xF)<<2
- x / [ r | (x>>6), x&0x3F ], r := 0
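A transducer with one register can compute Base64 indices while reading a single byte per step: the register carries the leftover bits to the next transition. The Python sketch below assumes the standard Base64 bit layout and a three-state cycle; the state numbering and the function name are illustrative, and padding is not modeled.

```python
def base64_indices(data):
    """Three-state transducer with one register r; reads one byte per step.

    Defined only when len(data) is a multiple of 3 (no padding is modeled).
    """
    state, r, out = 0, 0, []
    for x in data:
        if state == 0:
            out.append(x >> 2)            # emit top 6 bits
            r = (x & 3) << 4              # remember low 2 bits
            state = 1
        elif state == 1:
            out.append(r | (x >> 4))      # 2 remembered bits + top 4 bits
            r = (x & 0xF) << 2            # remember low 4 bits
            state = 2
        else:
            out.append(r | (x >> 6))      # 4 remembered bits + top 2 bits
            out.append(x & 0x3F)          # low 6 bits
            r = 0
            state = 0
    if state != 0:
        raise ValueError("undefined: input length is not a multiple of 3")
    return out


print(base64_indices(b"Man"))  # [19, 22, 5, 46]
```

The register is what keeps the machine small: without it, each of the leftover bit patterns would need its own state.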
Slide 93
EXTENDED SYMBOLIC FINITE TRANSDUCERS
Input: Man (read as x1, x2, x3)
Transition from p to q with lookahead 3:
  x1 <= 0xFF, x2 <= 0xFF, x3 <= 0xFF / [ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
Slide 94
EXTENDED SYMBOLIC FINITE TRANSDUCERS
Input: Man (read as x1, x2, x3); output: TWFu
Transition from p to q with lookahead 3:
  x1 <= 0xFF, x2 <= 0xFF, x3 <= 0xFF / [ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
Slide 95
MORE EXPRESSIVE THAN SYMBOLIC FINITE TRANSDUCERS
Example transition: x1 > x2 / [x1 + x2]
Do they still have nice properties?
Slide 96
WHAT DO WE NEED?
[Diagram: T -> Encoder -> Decoder -> T]
Encoder o Decoder = Identity
Needed: Composition, Equivalence
Slide 97
NEGATIVE RESULTS
ESFAs:
- equivalence is undecidable
- are not closed under intersection
- are not closed under complement
ESFTs:
- equivalence is undecidable
- are not closed under composition
Slide 98
A FRIENDLIER RESTRICTION
Slide 99
CARTESIAN EXTENDED SYMBOLIC FINITE TRANSDUCERS
The negative results use binary predicates, and encoders do not use this feature
Only allow conjunctions of unary predicates
Example transition from q to p: x1 5, x2 = 1 / [x1 + x2, x1]
Slide 100
CARTESIAN ESFA = SFA
Cartesian ESFAs are now equivalent to SFAs
[Diagram: an ESFA transition with guard x1 > 5, x2 = 1 next to an SFA with two consecutive transitions x > 5 and x = 1]
Slide 101
STILL MORE EXPRESSIVE THAN SFTS
Cartesian ESFTs are strictly more expressive than SFTs!!
Example transition: x1 > 5, x2 = 1 / [x1 + x2]
Slide 102
WHAT DO WE NEED?
[Diagram: T -> Encoder -> Decoder -> T]
Encoder o Decoder = Identity
Needed: Composition, Equivalence
Slide 103
RESULTS
Cartesian ESFTs:
- equivalence is decidable
- are not closed under composition
Slide 104
COMPOSITION IN PRACTICE
Slide 105
BEK WITH REGISTERS?
Slide 106
TRANSDUCERS WITH REGISTERS
Transitions (input / [outputs], register update):
- x / [ x>>2 ], r := (x&3)<<4
- x / [ r | (x>>4) ], r := (x&0xF)<<2
- x / [ r | (x>>6), x&0x3F ], r := 0
Slide 131
SYMBOLIC TREE TRANSDUCERS [PSI11]
Rule: q(a.a>3, (x1, x2)) -> a.a+1, (a.a-2, q1(x1))
[Diagram: the rule applied to a node labeled 5 with children x1 and x2, producing labels 5+1 and 5-2; the rule applies because 5>3 is true]
Decidable properties: type-checking, etc.
Domain expressiveness: infinite alphabets using predicates and functions
Structural expressiveness: can't delete a node without reading it first
Alphabet theory has to be DECIDABLE: we'll use Z3 to check predicate satisfiability
Slide 132
IMPROVING STRUCTURAL EXPRESSIVENESS
Transformation: delete the left child if it contains a script
If we delete the node, we can't check that the left child contained a script
[Diagram: state q applied to a div node]
Regular Look-Ahead (RLA)?
Slide 133
REGULAR LOOK-AHEAD
Transformation: delete the left child if it contains a script
Rules can ask whether the children are in particular languages:
- p1: the language of trees that contain a script node
- p2: the language of all trees
[Diagram: state q applied to a div node whose children are checked against p1 and p2]
The transformation is now safe
Decidable properties: type-checking, etc.
Domain expressiveness: infinite alphabets
Structural expressiveness: good enough to express our examples
Slide 134
  Model                                                           Decidability  Complexity  Structural expressiveness  Infinite alphabets
  Top Down Tree Transducers [Engelfriet75]                        V             V           X                          X
  Top Down Tree Transducers with Regular Look-ahead [Engelfriet76] V            V           ~                          X
  Streaming Tree Transducers [AlurD'Antoni12]                     V             X           V                          X
  Data Automata [Bojanczyk98]                                     ~             X           X                          V
  Symbolic Tree Transducers [VeanesBjoerner11]                    V             V           X                          V
  Symbolic Tree Transducers RLA                                   V             V           ~                          V
Slide 135
COMPOSITION OF STTR
[Diagram: T1 followed by T2, combined into a single transducer T1 o T2]
This is not always possible!!
Find the biggest class for which it is possible
Slide 136
WHEN CAN WE COMPOSE?
Theorem: T(x) = T2(T1(x)) is definable by a Symbolic Tree Transducer with RLA if T1 is deterministic
All our examples fall in this category
Alphabet theory has to be DECIDABLE: we'll use Z3 to check predicate satisfiability
Slide 137
FAST ARCHITECTURE
Fast program -> Transformation -> Symbolic Tree Transducers with RLA (Microsoft.Automata, Z3) -> Analysis and Code Gen
Analysis: Does it do the right thing? Counterexample: \' vs. \\'
Code Gen: C#, JavaScript, C
[The program box reuses the iter(...) program from Slide 47]
Slide 138
CASE STUDIES AND EXPERIMENTS 138
Slide 139
CASE STUDIES AND EXPERIMENTS
Program optimization:
- Deforestation of functional programs
Verification:
- HTML sanitization
- Analysis of functional programs
- Augmented reality app store
Infinite Alphabets: Integer Data types
Slide 140
DEFORESTATION
Removing intermediate data structures from programs

  alphabet IList [i : int] { nil(0), cons(1) }
  trans mapC : IList -> IList {
      nil() to nil [0]
    | cons(x) to cons [(i+5)%26] (mapC x)
  }
  def mapC2 : IList -> IList := compose mapC mapC

ADVANTAGE: the program is a single transducer that reads the input list only once, thanks to transducer composition
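The effect of composing mapC with itself can be mimicked on Python lists: the unfused version builds an intermediate list, while the fused version applies both per-element steps in one pass. This is a sketch of the idea with illustrative function names; in FAST the fused transducer is obtained automatically by `compose`.

```python
def map_c(xs):
    """One pass: shift every element by 5 modulo 26."""
    return [(i + 5) % 26 for i in xs]


def map_c2_unfused(xs):
    """Two passes with an intermediate list, like mapC applied twice."""
    return map_c(map_c(xs))


def map_c2_fused(xs):
    """Single pass: both shifts applied per element, no intermediate list."""
    return [(((i + 5) % 26) + 5) % 26 for i in xs]


sample = [0, 20, 25]
print(map_c2_unfused(sample))  # [10, 4, 9]
print(map_c2_fused(sample))    # [10, 4, 9]
```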
Slide 144
AR INTERFERENCE ANALYSIS
Recognizers output data that can be seen as a tree structure
[Diagram: skeleton tree with nodes Spine, Hip, Neck, Head, Knee, Ankle, Foot, ...]
Slide 145
APPS AS TREE TRANSFORMATIONS
Applications that use recognizers can be modeled as FAST programs

  trans addHat : STree -> STree
      Spine(x,y) to Spine(addHat(x), y)
    | Neck(h,l,r) to Neck(addHat(h), l, r)
    | Head(a) to Head(Hat(a))
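The addHat rules can be read as a recursive function on trees. The Python sketch below encodes a tree as a tuple `(label, child, ...)`, which is an assumption made for illustration; the node labels `Hip`, `Arm`, and `Face` in the sample tree are hypothetical, and inputs not covered by the three rules are treated as undefined.

```python
def add_hat(tree):
    """Follow the spine and neck down to the head and wrap its child in a Hat node."""
    label, *children = tree
    if label == "Spine":
        x, y = children
        return ("Spine", add_hat(x), y)
    if label == "Neck":
        h, l, r = children
        return ("Neck", add_hat(h), l, r)
    if label == "Head":
        (a,) = children
        return ("Head", ("Hat", a))
    raise ValueError(f"addHat is undefined on a {label} node")


skeleton = ("Spine", ("Neck", ("Head", ("Face",)), ("Arm",), ("Arm",)), ("Hip",))
print(add_hat(skeleton))
# ('Spine', ('Neck', ('Head', ('Hat', ('Face',))), ('Arm',), ('Arm',)), ('Hip',))
```

Only the path Spine -> Neck -> Head is rewritten; the other subtrees are copied unchanged, as in the rules.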
Slide 146
COMPOSITION OF PROGRAMS
Two FAST programs can be composed into a single FAST program
[Diagram: p1 followed by p2, combined into p1;p2]
Slide 148
INTERFERENCE ANALYSIS
Apps can be malicious: they may try to overwrite the outputs of other apps
Apps interfere when they annotate the same node of a recognizer's output
We can compose them and check statically whether they interfere!!
Put the checker in the App Store and analyze apps before approval
Interfering apps:
- Add cat ears / Add hat
- Add pin to a city / Blur a city
- Amazon Buy Now button / Malicious Buy Now button
Slide 149
INTERFERENCE ANALYSIS IN PRACTICE
- 100 generated FAST programs, up to 85 functions each
- Check statically whether they conflict pairwise for ANY possible input
- Checked 99% of program pairs in less than 0.5 sec!
- For an app store, these times are perfectly fine
Slide 150
TWO PENDING PATENTS 150
Slide 151
FAST IN A NUTSHELL
- FAST is a domain-specific language for writing tree-manipulating programs
- FAST can model programs without approximation using symbolic tree transducers with regular lookahead
- FAST was evaluated using real-world programs
Slide 152
OUTLINE
- Automata, transducers, and programs
- BEK and string sanitizers
- BEX and string encoders
- FAST and tree-manipulating programs
- What's next?
Slide 153
WHAT'S NEXT
Slide 154
FOR EACH DOMAIN-SPECIFIC TASK
Design a language that:
- only has the features required by the task
- is simple to use
- enables automatic reasoning about what the programs do
- compiles into efficient code
Slide 155
DREX: EFFICIENT STRING MANIPULATION
Loris D'Antoni, Mukund Raghothaman, Rajeev Alur
Here at POPL15!
Slide 156
DECLARATIVE LANGUAGE FOR STRING SCRIPTS (15/1, 2PM, SEC. 2B)
[Diagram: transducer with transitions a/a and b/b for the pattern (a|b)*b]
iterate(choice(a->a, b->b))
Execute this code in linear time, in a left-to-right pass on the input string!!
Slide 157
BEX 2.0: PARALLEL EXECUTION OF STRING ENCODERS
Margus Veanes, David Molnar, Ben Livshits, Todd Mytkowicz
Here at POPL15!!
Slide 158
FROM TRANSDUCERS TO PARALLEL EXECUTIONS (15/1, 2PM, SEC. 2B)
Efficient data-parallel code
[Diagram: transducer with a register; transitions x / [ r+x, x+1 ], r := 0 and x / [ x+4 ], r := (x-2)]
Slide 159
PROGRAM BOOSTING, OR CROWD-SOURCING FOR CORRECTNESS
Loris D'Antoni, David Molnar, Benjamin Livshits, Margus Veanes, Robert Cochran
Here at POPL15!!
Slide 160
CROWD-SOURCING PROGRAMS WITH AUTOMATA (17/1, 4PM, SEC. 9B)
Specification
Specification
Slide 161
YOU CAN HELP TOO! 161
Slide 162
INTERESTING DIRECTIONS
A transducer-based language for:
- Web scrapers
- Spreadsheet transformations
- Compiler optimizations
- XML processing
- HTML rendering
Slide 163
SUMMARIZING 163
Slide 164
OUR RECIPE FOR EACH TASK
DSL -> Transformation -> Transducer Model (Microsoft.Automata, Z3) -> Analysis and Code Gen
Analysis: Does it do the right thing? (analysis question)
Code Gen: C#, JavaScript, C
[The DSL box reuses the iter(...) program from Slide 47]
Slide 165
BEK
- Fast and precise sanitizer analysis with BEK. Hooimeijer, Livshits, Molnar, Saxena, Veanes. USENIX11
- Symbolic finite state transducers: algorithms and applications. Veanes, Hooimeijer, Livshits, Molnar, Bjorner. POPL12
BEX
- Static analysis of string encoders and decoders. D'Antoni, Veanes. VMCAI13
- Equivalence of extended symbolic finite transducers. D'Antoni, Veanes. CAV13
- Data parallel string manipulating programs. Veanes, Mytkowicz, Molnar, Livshits. POPL15
FAST
- Fast: a transducer based language for tree manipulation. D'Antoni, Veanes, Livshits, Molnar. PLDI14