35
Caradoc: a Pragmatic Approach to PDF Parsing and Validation IEEE Security & Privacy LangSec Workshop 2016 Guillaume Endignoux Olivier Levillain Jean-Yves Migeon École Polytechnique, France EPFL, Switzerland ANSSI, France Thursday 26 th May, 2016 1 / 29

Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Caradoc: a Pragmatic Approach to PDF Parsingand Validation

IEEE Security & Privacy LangSec Workshop 2016

Guillaume Endignoux Olivier Levillain Jean-Yves Migeon

École Polytechnique, FranceEPFL, Switzerland

ANSSI, France

Thursday 26th May, 2016

1 / 29

Page 2: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Portable Document Format ?

A commonly used format, but many security issues:500+ reported vulnerabilities in Adobe Reader1 (since 1999).Discrepancies between implementations.Syntax facilitates polymorphism2 (PDF+ZIP, PDF+JPEG,etc.).

In our work, we aim at verifying PDFs from syntactic level.

Two approaches to validate files:Blacklist: does not detect new malware...Whitelist: higher rejection rate, but accepted files are clean.

1http://www.cvedetails.com2See for example PoC||GTFO

2 / 29

Page 3: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Portable Document Format ?

A commonly used format, but many security issues:500+ reported vulnerabilities in Adobe Reader1 (since 1999).Discrepancies between implementations.Syntax facilitates polymorphism2 (PDF+ZIP, PDF+JPEG,etc.).

In our work, we aim at verifying PDFs from syntactic level.

Two approaches to validate files:Blacklist: does not detect new malware...Whitelist: higher rejection rate, but accepted files are clean.

1http://www.cvedetails.com2See for example PoC||GTFO

2 / 29

Page 4: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Table of contents

1 Syntactic and structural problems: a quick tour

2 Caradoc: a pragmatic solution

3 Application to real-world files

3 / 29

Page 5: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Table of contents

1 Syntactic and structural problems: a quick tour

2 Caradoc: a pragmatic solution

3 Application to real-world files

4 / 29

Page 6: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

PDF syntax 101

A PDF document is made of objects:null

booleans: true, falsenumbers: 123, -4.56strings: (foo)names: /bararrays: [1 2 3], [(foo) /bar]

dictionaries: << /key (value) /foo 123 >>

references: 1 0 obj ... endobj and 1 0 R

streams: << ... >> stream ... endstream

5 / 29

Page 7: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Structure of a PDF file

HeaderObject

Object...

Reference tableTrailer

End-of-file

%PDF-1.7

1 0 obj<< /Type /Catalog /Pages 2 0 R >>endobj

2 0 obj<< /Type /Pages /Count 1 /Kids [3 0 R] >>endobj

xref0 60000000000 65536 f0000000009 00000 n0000000060 00000 n...

trailer<< /Size 6 /Root 1 0 R >>

startxref428%%EOF

Organization of a simple PDF file.

6 / 29

Page 8: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Structure of a PDF file

More complex structures:incremental updates,object streams,linearization.

HeaderObjects

...Table + trailer #1

End-of-file #1

Objects...

Table + trailer #2

End-of-file #2

%PDF-1.7

xref0 60000000000 65536 f0000000009 00000 n0000000060 00000 n...trailer<< /Size 6 /Root 1 0 R >>

startxref428%%EOF

xref0 30000000002 65536 f0000000567 00001 n0000000000 00001 f6 10000001234 00000 ntrailer<< /Size 7 /Root 1 1 R /Prev 428 >>

startxref1347%%EOF

Original file

Incrementalupdate

Incremental update.

7 / 29

Page 9: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Logical structure of a PDF file

Document of 17 pages (about 1000 objects).

8 / 29

Page 10: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Graph organization

The graph of objects is organized into sub-structures, especiallytrees.

Page tree.Catalog Root of the page tree

Page 3Node Page 4

Page 1 Page 2

9 / 29

Page 11: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Graph organization

The table of contents uses doubly-linked lists.

Table of contents.

CatalogOutline root

ChapterChapter Chapter

SectionSection Section

10 / 29

Page 12: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Problematic structure

An attacker may write an invalid structure.

Invalid table of contents.

CatalogOutline root

ChapterChapter Chapter

SectionSection Section

11 / 29

Page 13: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Demonstration

Demonstration: two examples

Loop in the outline structurehttps://github.com/ANSSI-FR/caradoc/blob/master/test_files/negative/outlines/cycle.pdf

Polymorphic filehttps://github.com/ANSSI-FR/caradoc/blob/master/test_files/negative/polymorph/polymorph.pdf

These files were reported to software editors.

12 / 29

Page 14: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Demonstration

These problems may lead to several attacks:Attacks on the structure (denial of service).Evasion techniques (attacks taking advantage ofimplementation discrepancies).

13 / 29

Page 15: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Table of contents

1 Syntactic and structural problems: a quick tour

2 Caradoc: a pragmatic solution

3 Application to real-world files

14 / 29

Page 16: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Solution proposals

Caradoc verifies a document at three levels:File syntax.Objects consistency (type checking).Higher-level verifications (graph, etc.).

15 / 29

Page 17: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Syntax restriction

At syntax level, guarantee extraction of objects without ambiguity:Grammar formalization3 (BNF).Structure restrictions (no updates, no linearization, etc.).Systematic rejection of “corrupted” files.

When a conforming reader reads a PDF file with adamaged or missing cross-reference table, it mayattempt to rebuild the table by scanning all the objectsin the file.

— ISO 32000-1:2008, annex C.2

3https://github.com/ANSSI-FR/caradoc/tree/master/doc/grammar16 / 29

Page 18: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Syntax restriction

At syntax level, guarantee extraction of objects without ambiguity:Grammar formalization3 (BNF).Structure restrictions (no updates, no linearization, etc.).Systematic rejection of “corrupted” files.

When a conforming reader reads a PDF file with adamaged or missing cross-reference table, it mayattempt to rebuild the table by scanning all the objectsin the file.

— ISO 32000-1:2008, annex C.2

3https://github.com/ANSSI-FR/caradoc/tree/master/doc/grammar16 / 29

Page 19: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Type checking

At object level: guarantee semantic consistency.

For this purpose: type checking algorithm.

17 / 29

Page 20: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Type checking

trailer<< /Size 7

/Root 1 0 R/Info 6 0 R >>

1 0 obj<< /Type /Catalog /Pages 2 0 R >>endobj

2 0 obj<< /Type /Pages /Count 1 /Kids [3 0 R] >>endobj

3 0 obj <</Type /Page/MediaBox [0 0 700 200]/Parent 2 0 R/Contents 4 0 R/Resources << /Font << /F1 5 0 R >> >>

>> endobj

4 0 obj << /Length 35 >>streamBT /F1 100 Tf (Hello world !) Tj ETendstreamendobj

5 0 obj <</Name /F1/BaseFont /Helvetica/Type /Font/Subtype /Type1

>> endobj

6 0 obj <</Author (G. E.)

>> endobj

Example on a Hello World file.

18 / 29

Page 21: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Type checking

trailer<< /Size 7

/Root 1 0 R/Info 6 0 R >>

1 0 obj<< /Type /Catalog /Pages 2 0 R >>endobj

2 0 obj<< /Type /Pages /Count 1 /Kids [3 0 R] >>endobj

3 0 obj <</Type /Page/MediaBox [0 0 700 200]/Parent 2 0 R/Contents 4 0 R/Resources << /Font << /F1 5 0 R >> >>

>> endobj

4 0 obj << /Length 35 >>streamBT /F1 100 Tf (Hello world !) Tj ETendstreamendobj

5 0 obj <</Name /F1/BaseFont /Helvetica/Type /Font/Subtype /Type1

>> endobj

6 0 obj <</Author (G. E.)

>> endobj

Constraint propagation.

19 / 29

Page 22: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Type checking

trailer<< /Size 7

/Root 1 0 R/Info 6 0 R >>

1 0 obj<< /Type /Catalog /Pages 2 0 R >>endobj

2 0 obj<< /Type /Pages /Count 1 /Kids [3 0 R] >>endobj

3 0 obj <</Type /Page/MediaBox [0 0 700 200]/Parent 2 0 R/Contents 4 0 R/Resources << /Font << /F1 5 0 R >> >>

>> endobj

4 0 obj << /Length 35 >>streamBT /F1 100 Tf (Hello world !) Tj ETendstreamendobj

5 0 obj <</Name /F1/BaseFont /Helvetica/Type /Font/Subtype /Type1

>> endobj

6 0 obj <</Author (G. E.)

>> endobj

Constraint propagation.

19 / 29

Page 23: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Type checking

trailer<< /Size 7

/Root 1 0 R/Info 6 0 R >>

1 0 obj<< /Type /Catalog /Pages 2 0 R >>endobj

2 0 obj<< /Type /Pages /Count 1 /Kids [3 0 R] >>endobj

3 0 obj <</Type /Page/MediaBox [0 0 700 200]/Parent 2 0 R/Contents 4 0 R/Resources << /Font << /F1 5 0 R >> >>

>> endobj

4 0 obj << /Length 35 >>streamBT /F1 100 Tf (Hello world !) Tj ETendstreamendobj

5 0 obj <</Name /F1/BaseFont /Helvetica/Type /Font/Subtype /Type1

>> endobj

6 0 obj <</Author (G. E.)

>> endobj

Constraint propagation.

19 / 29

Page 24: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Type checking

trailer<< /Size 7

/Root 1 0 R/Info 6 0 R >>

1 0 obj<< /Type /Catalog /Pages 2 0 R >>endobj

2 0 obj<< /Type /Pages /Count 1 /Kids [3 0 R] >>endobj

3 0 obj <</Type /Page/MediaBox [0 0 700 200]/Parent 2 0 R/Contents 4 0 R/Resources << /Font << /F1 5 0 R >> >>

>> endobj

4 0 obj << /Length 35 >>streamBT /F1 100 Tf (Hello world !) Tj ETendstreamendobj

5 0 obj <</Name /F1/BaseFont /Helvetica/Type /Font/Subtype /Type1

>> endobj

6 0 obj <</Author (G. E.)

>> endobj

Constraint propagation.

19 / 29

Page 25: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Type checking

trailer<< /Size 7

/Root 1 0 R/Info 6 0 R >>

1 0 obj<< /Type /Catalog /Pages 2 0 R >>endobj

2 0 obj<< /Type /Pages /Count 1 /Kids [3 0 R] >>endobj

3 0 obj <</Type /Page/MediaBox [0 0 700 200]/Parent 2 0 R/Contents 4 0 R/Resources << /Font << /F1 5 0 R >> >>

>> endobj

4 0 obj << /Length 35 >>streamBT /F1 100 Tf (Hello world !) Tj ETendstreamendobj

5 0 obj <</Name /F1/BaseFont /Helvetica/Type /Font/Subtype /Type1

>> endobj

6 0 obj <</Author (G. E.)

>> endobj

Constraint propagation.

19 / 29

Page 26: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Type checking

Types of a 17-page document.

actionpagedestinationannotationresourceoutlinecontent streamfontname treeother

20 / 29

Page 27: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

More complex verifications

At a higher level:

Verification of tree structures (page tree, outlines, etc.).Other verifications easily integrable in the future (fonts,images, existing analyses, etc.).

21 / 29

Page 28: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Table of contents

1 Syntactic and structural problems: a quick tour

2 Caradoc: a pragmatic solution

3 Application to real-world files

22 / 29

Page 29: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Implementation

Implementation in OCaml from the PDF specification4.

Validation workflow.

PDF

strict parser

relaxed parser

objects

graph ofreferences

extraction ofspecific objects

typechecking

list oftypes

graphchecking

other checksto develop

no errordetectednormalization

4https://www.adobe.com/devnet/pdf/pdf_reference.html23 / 29

Page 30: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Real-world files

10K files collected from random queries on a web search engine.

Some files are directly accepted.

Direct validation.

PDF

10000 files

strictparser parsed

1465 files

typechecking

typechecked

536 files

graphchecking

no errorfound

536 files

24 / 29

Page 31: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Normalization

Many files do not pass the first stage... But they can be normalizedbeforehand.

The relaxed parser supports common structures: incrementalupdates, object streams, etc.

Normalization.

PDF

10000 files

relaxed parser parsed

8993 files

cleaning objects normalized

8993 files

Some files were not normalized: encryption, unrecoverable syntaxerrors, etc.

25 / 29

Page 32: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Normalization

Validation after normalization.

normalized

8993 files

type checking type checked

1429 filestype error

1391 files

graph checkingno errorfound

1427 files

Our type-checker detected typos:/Blackls1 instead of /BlackIs1,/XObjcect instead of /XObject.

We identified incorrect tree structures in the wild.

26 / 29

Page 33: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Future work

What remains to be done:Complete the set of types.Check compression filters.Check graphic content.Check fonts, images, etc.

27 / 29

Page 34: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Conclusion

Summary of our contributions:We identified novel issues in PDF parsers.We proposed and formalized a simplified syntax for PDF.We implemented Caradoc to parse and validate PDF files.

Project page: https://github.com/ANSSI-FR/caradoc

28 / 29

Page 35: Caradoc: aPragmaticApproachtoPDFParsing andValidation · PortableDocumentFormat? Acommonlyusedformat,butmanysecurityissues: 500+reportedvulnerabilitiesinAdobeReader1 (since1999)

Questions ?

https://xkcd.com/1301/

29 / 29