About Tokens and Lexemes

Ben Scholzen, Game Developer, Gameforge Productions GmbH

What we'll cover

- Definition of a compiler, tokenizer and parser
- Basic structure of a tokenizer and a parser
- Where to optimize things for PHP

What about parser generators?

- They are evil!
  - PHP_LexerGenerator, PHP_ParserGenerator, lemon-PHP
  - Create lots of function calls, like lemon parsers in C
  - Do not perform well
  - Will eat up all your memory
- Conclusion
  - Don't use them!

Let's get started

What a compiler is and how it works

- Acts as frontend for the application
- Converts human-readable data into machine-readable data
- Consists of two components:
  - The lexer:
    - Is a finite-state machine
    - Reads the input stream
    - Cleans up the input data
    - Creates a list of tokens
  - The parser:
    - Gets tokens from the tokenizer
    - Converts them into a data structure

[Diagram: Stream -> Lexer -> Tokens -> Parser -> Document Structure]

Sounds great, but where do I need it?

- Formatting languages
  - BB-Code
  - Wiki-Codes
- Description languages
  - iCalendar / vCalendar
  - XML
- Even programming languages
  - JavaScript
  - PHP
- Anything else you want your program to understand

The lexer (or tokenizer)

What are tokens?
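For PHP source code itself, the pairing of a token type with its lexeme can be inspected directly, because PHP ships its own tokenizer. The following minimal sketch assumes the tokenizer extension is enabled (it usually is) and uses the label 'character' purely for illustration; it lists the tokens PHP produces for the statement $value = 5 * 7;

```php
<?php
// token_get_all() is PHP's built-in tokenizer for PHP source code.
$tokens = [];

foreach (token_get_all('<?php $value = 5 * 7;') as $token) {
    if (is_array($token)) {
        // Array tokens have the shape [token id, lexeme, line number]
        $tokens[] = [token_name($token[0]), $token[1]];
    } else {
        // Single-character tokens such as "=" or ";" come back as plain strings;
        // 'character' is a made-up type label for this example
        $tokens[] = ['character', $token];
    }
}

// Print one "type lexeme" pair per line
foreach ($tokens as [$type, $lexeme]) {
    printf("%-14s %s\n", $type, var_export($lexeme, true));
}
```

Whitespace shows up as tokens of its own here (T_WHITESPACE), whereas a hand-written lexer will often drop it while cleaning up the input.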

- Categorized block of text
  - Token type
  - Corresponding block of text (lexeme)
- List of tokens represents an entire document
- Example in PHP: $value = 5 * 7;

How the tokenizer works
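The scanning technique this section relies on, preg_match() with the offset parameter and the \G assertion, can be sketched as follows. The input string and the variable names are made up for illustration.

```php
<?php
$input    = 'abc 123';
$position = 4; // assume the scanner has already consumed "abc "

// ^ still refers to the start of the whole string, not to the offset,
// so this pattern cannot match when scanning starts at position 4.
$caret = preg_match('/^[0-9]+/', $input, $match, 0, $position);

// \G anchors the pattern exactly at the offset passed as fifth argument.
$anchored = preg_match('/\G[0-9]+/', $input, $match, 0, $position);

// Store the new position so the next scan continues behind the lexeme.
$position += strlen($match[0]);

var_dump($caret, $anchored, $match[0], $position);
```

Without any anchor, preg_match() would silently skip ahead to the next place where the pattern fits, which is why the anchor matters for a lexer that has to account for every character.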

- Define possible states of the lexer
- Tokenize the input in a loop
  - Scan with preg_match()
    - strtok() is mostly too simple
    - Reading char-by-char is too slow
    - Use the offset parameter
    - Use the \G assertion (^ won't work)
  - Always store the current position
  - Use either a switch statement or a structured array
- Return the tokens

What we can optimize
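One way to keep memory usage low, sketched under the assumption that no token ever spans a line break: read the input line by line with fgets() and hand the parser one batch of tokens at a time through a generator. The function tokenBatches() and the word-splitting stand-in tokenizer are hypothetical names for this example, not code from the talk, and the syntax used needs PHP 7 or newer.

```php
<?php
/**
 * Yields the tokens of one line at a time, so only a single line of the
 * document has to be held in memory.
 * Assumption: a token never continues across a line break.
 */
function tokenBatches($handle, callable $tokenizeLine): Generator
{
    while (($line = fgets($handle)) !== false) {
        yield $tokenizeLine($line);
    }
}

// Stand-in line tokenizer for the demo: every whitespace-separated word is a token
$splitWords = function (string $line): array {
    return preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);
};

// An in-memory stream takes the place of a real file opened with fopen()
$handle = fopen('php://memory', 'w+');
fwrite($handle, "first line\nsecond line\n");
rewind($handle);

$batches = [];
foreach (tokenBatches($handle, $splitWords) as $batch) {
    $batches[] = $batch; // a parser would consume the batch here instead
}
fclose($handle);
```

Collecting all batches in an array, as the demo does, defeats the purpose in real code; it is only done here so the result can be inspected.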

- Use little memory
  - Always read only a part of the document into memory
    - Via fopen() and fgets() or fread()
    - Requires previous knowledge about when tokens end
  - Offer a method for the parser to get a partial batch of tokens
- Speed up execution time
  - Make no internal function calls where possible

Going into practice

The beginning
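The code from the talk itself is not part of this text, so what follows is only a sketch of how such a lexer could be put together, not the author's implementation. It assumes three states named after the slide titles (base, operator, value), a structured array of rules per state, and regular expressions invented for this example. It needs PHP 7.1 or newer for the array destructuring in foreach.

```php
<?php
function tokenize(string $input): array
{
    // Structured array: state => list of [pattern, token type, next state]
    $rules = [
        'base' => [
            ['/\G\$[a-zA-Z_][a-zA-Z0-9_]*/', 'variable', 'operator'],
        ],
        'operator' => [
            ['/\G;/', 'operator', 'base'],
            ['/\G[=*\/+-]/', 'operator', 'value'],
        ],
        'value' => [
            ['/\G[0-9]+/', 'number', 'operator'],
            ['/\G\$[a-zA-Z_][a-zA-Z0-9_]*/', 'variable', 'operator'],
        ],
    ];

    $state  = 'base';
    $offset = 0; // current position in the input
    $length = strlen($input);
    $tokens = [];

    while ($offset < $length) {
        // Clean up the input: whitespace between tokens is skipped
        if (preg_match('/\G\s+/', $input, $match, 0, $offset) === 1) {
            $offset += strlen($match[0]);
            continue;
        }

        // Try the rules of the current state at the current position
        foreach ($rules[$state] as [$pattern, $type, $nextState]) {
            if (preg_match($pattern, $input, $match, 0, $offset) === 1) {
                $tokens[] = [$type, $match[0]];
                $offset  += strlen($match[0]);
                $state    = $nextState;
                continue 2; // back to the outer while loop
            }
        }

        throw new UnexpectedValueException(
            sprintf('Unexpected input at offset %d in state "%s"', $offset, $state)
        );
    }

    return $tokens;
}

var_dump(tokenize('$value = 5 * 7;'));
```

For the input $value = 5 * 7; this returns six [type, lexeme] pairs: a variable, an operator, a number, an operator, a number and the closing operator ";".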

Throwing in a file

Preparing stuff

Base state

Operator state

Value state

Rounding it up

Some actual testing

And what we get

    array(6) {
      [0]=>
      array(2) {
        [0]=>
        string(8) "variable"
        [1]=>
        string(6) "$value"
      }
      [1]=>
      array(2) {
        [0]=>
        string(8) "operator"
        [1]=>
        string(1) "="
      }
      [2]=>
      array(2) {
        [0]=>
        string(6) "number"
        [1]=>
        string(1) "5"
      }
      [3]=>
      array(2) {
        [0]=>
        string(8) "operator"
        [1]=>
        string(1) "*"
      }
      [4]=>
      array(2) {
        [0]=>
        string(6) "number"
        [1]=>
        string(1) "7"
      }
      [5]=>
      array(2) {
        [0]=>
        string(8) "operator"
        [1]=>
        string(1) ";"
      }
    }

The parser

So we have a bunch of tokens, what now?
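As a small illustration of the trampoline idea recommended in this section, here is a sketch that interprets the right-hand side of the example statement. Instead of calling itself directly, each step returns a closure for the next step, and a plain loop keeps invoking these closures, so the call stack stays flat no matter how long the token list is. The function names trampoline() and product() and the restriction to multiplication are assumptions made for this example; it needs PHP 7.1 or newer.

```php
<?php
/** Keeps calling the returned closures in a flat loop until a plain value comes back. */
function trampoline($result)
{
    while ($result instanceof Closure) {
        $result = $result();
    }

    return $result;
}

/**
 * Multiplies a run of "number * number * ..." tokens starting at $pos.
 * Returns either the final product or a closure that performs the next step.
 */
function product(array $tokens, int $pos, int $acc)
{
    [$type, $lexeme] = $tokens[$pos] ?? [null, null];

    if ($type !== 'number') {
        throw new UnexpectedValueException("Expected a number at token $pos");
    }

    $acc *= (int) $lexeme;

    if (($tokens[$pos + 1] ?? null) === ['operator', '*']) {
        // Tail position: hand the next step back to the trampoline instead of recursing
        return function () use ($tokens, $pos, $acc) {
            return product($tokens, $pos + 2, $acc);
        };
    }

    return $acc;
}

$tokens = [
    ['variable', '$value'], ['operator', '='],
    ['number', '5'], ['operator', '*'], ['number', '7'],
    ['operator', ';'],
];

// Token 2 is the first token after "$value ="
$result = trampoline(product($tokens, 2, 1));
echo $result, "\n"; // prints 35
```

A parser that builds a tree instead of interpreting can use the same pattern: each step appends a node and returns a closure for the step that handles the next token.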

- Loop through the tokens and analyze them
  - Create an object-oriented tree structure or interpret
- Avoid non-tail recursion
  - Use tail recursion (trampoline) instead
  - Saves you from hitting the stack limit

That's it!

Summary

Questions?

Where to go from here

- Wikipedia:
  - http://en.wikipedia.org/wiki/Compiler
  - http://en.wikipedia.org/wiki/Parsing
- About tail-recursion in PHP: http://www.alternateinterior.com/2006/09/tail-recursion-in-php.html
- My blog: http://www.dasprids.de
- Rate this talk: http://joind.in/635
- Follow me on twitter: http://www.twitter.com/dasprid

Thank you!
