Text Manipulation with/without Parsec

DESCRIPTION

At Vancouver Haskell UnMeetup on Oct 11, 2011

Page 1: Text Manipulation with/without Parsec

Text manipulation with/without parsec

October 11, 2011 Vancouver Haskell UnMeetup

Tatsuhiro Ujihisa

Tuesday, October 11, 2011

Page 2: Text Manipulation with/without Parsec

• Tatsuhiro Ujihisa

• @ujm

• HootSuite Media Inc.

• Osaka, Japan

• Vim: 14

• Haskell: 5

Page 3: Text Manipulation with/without Parsec

Topics

• text manipulation functions with/without parsec

• parsec library

• texts in Haskell

• attoparsec library

Page 4: Text Manipulation with/without Parsec

Haskell for work

• Something academic

• Something mathematical

• Web app

• Better shell scripting

• (Improve yourself)

Page 5: Text Manipulation with/without Parsec

Text manipulation

• The concept of text

• String is [Char]

• lazy

• Pattern matching
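
Because a String is just a list of Char, ordinary list pattern matching and lazy list functions apply to text directly. The sketch below illustrates this; `capitalize` and `dropSep` are hypothetical helper names, not functions from the talk.

```haskell
import Data.Char (toUpper)

-- Capitalise the first character by matching on the list structure.
capitalize :: String -> String
capitalize ""     = ""
capitalize (c:cs) = toUpper c : cs

-- Strip a literal "<>" prefix, if present, by matching on the leading characters.
dropSep :: String -> Maybe String
dropSep ('<':'>':rest) = Just rest
dropSep _              = Nothing

main :: IO ()
main = do
  putStrLn (capitalize "haskell")   -- Haskell
  print (dropSep "<>bb<>c")         -- Just "bb<>c"
  -- Laziness: only the first five characters of the infinite string are built.
  putStrLn (take 5 (cycle "ab"))    -- ababa
```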

Page 6: Text Manipulation with/without Parsec

Example: split

• Ruby/Python example

• 'aaa<>bb<>c<><>d'.split('<>')
  ['aaa', 'bb', 'c', '', 'd']

• Vim script example

• split('aaa<>bb<>c<><>d', '<>')

Page 7: Text Manipulation with/without Parsec

split in Haskell

• split :: String -> String -> [String]

• split "aaa<>bb<>c<><>d" "<>"
  ["aaa", "bb", "c", "", "d"]

• "aaa<>bb<>c<><>d" `split` "<>"

Page 8: Text Manipulation with/without Parsec

Design of split

• split "aaa<>bb<>c<><>d" "<>"

• "aaa" : split "bb<>c<><>d" "<>"

• "aaa" : "bb" : split "c<><>d" "<>"

• "aaa" : "bb" : "c" : split "<>d" "<>"

• "aaa" : "bb" : "c" : "" : split "d" "<>"

• "aaa" : "bb" : "c" : "" : "d" : split "" "<>"

• "aaa" : "bb" : "c" : "" : "d" : []

Page 9: Text Manipulation with/without Parsec

Design of split

• split "aaa<>bb<>c<><>d" "<>"

• "aaa" : split "bb<>c<><>d" "<>"

Page 10: Text Manipulation with/without Parsec

Design of split

• split "aaa<>bb<>c<><>d" "<>"

• split' "aaa<>bb<>c<><>d" "" "<>"

• split' "aa<>bb<>c<><>d" "a" "<>"

• split' "a<>bb<>c<><>d" "aa" "<>"

• split' "<>bb<>c<><>d" "aaa" "<>"

• "aaa" : split "bb<>c<><>d" "<>"

Page 11: Text Manipulation with/without Parsec

• split "aaa<>bb<>c<><>d" "<>"

• split' "aaa<>bb<>c<><>d" "" "<>"

• split' "aa<>bb<>c<><>d" "a" "<>"

• split' "a<>bb<>c<><>d" "aa" "<>"

• split' "<>bb<>c<><>d" "aaa" "<>"

• "aaa" : split "bb<>c<><>d" "<>"

split :: String -> String -> [String]
str `split` pat = split' str pat ""

split' :: String -> String -> String -> [String]
split' "" _ memo = [reverse memo]
split' str pat memo = let (a, b) = splitAt (length pat) str in
                      if a == pat
                         then (reverse memo) : (b `split` pat)
                         else split' (tail str) pat (head str : memo)
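
The same accumulate-and-recurse idea can also be sketched with `stripPrefix` from Data.List, which tests for the separator and drops it in one step, avoiding `head`, `tail` and `length`. The name `split2` is hypothetical, and the sketch assumes a non-empty separator (an empty one would never make progress).

```haskell
import Data.List (stripPrefix)

-- split2 str pat: split str on every occurrence of pat (pat assumed non-empty).
split2 :: String -> String -> [String]
split2 str pat = go "" str
  where
    -- memo holds the current piece in reverse order.
    go memo "" = [reverse memo]
    go memo rest@(c:cs) =
      case stripPrefix pat rest of
        Just after -> reverse memo : go "" after  -- separator found: emit the piece
        Nothing    -> go (c : memo) cs            -- otherwise keep accumulating

main :: IO ()
main = print (split2 "aaa<>bb<>c<><>d" "<>")
-- ["aaa","bb","c","","d"]
```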

Page 12: Text Manipulation with/without Parsec

Another approach

• Text.Parsec: v3

• Text.ParserCombinators.Parsec: v2

• Real World Haskell Parsec chapter

• csv parser

Page 13: Text Manipulation with/without Parsec

Design of split

• split "aaa<>bb<>c<><>d" "<>"

• many of

• any char except for the string "<>"

• separated by "<>" or the end of the string

Page 14: Text Manipulation with/without Parsec

import qualified Text.Parsec as P

split :: String -> String -> [String]
str `split` pat = case P.parse (split' (P.string pat)) "split" str of
                    Right x -> x
                    Left err -> error (show err)

split' :: P.Parsec String () String -> P.Parsec String () [String]
split' pat = P.anyChar `P.manyTill` (P.eof P.<|> (P.try (P.lookAhead pat) >> return ())) `P.sepBy` pat

Page 15: Text Manipulation with/without Parsec

import qualified Text.Parsec as P

split :: String -> String -> [String]
str `split` pat = case P.parse (split' (P.string pat)) "split" str of
                    Right x -> x
                    Left err -> error (show err)

split' :: P.Parsec String () String -> P.Parsec String () [String]
split' pat = P.anyChar `P.manyTill` (P.eof P.<|> (P.try (P.lookAhead pat) >> return ())) `P.sepBy` pat

• P.anyChar: any char

• P.eof P.<|> P.try (P.lookAhead pat): stop at the end of the string or at the pattern to separate (without consuming text)

Page 16: Text Manipulation with/without Parsec

import qualified Text.Parsec as P

main = do
  print $ abc1 "abc"  -- True
  print $ abc1 "abcd" -- False
  print $ abc2 "abc"  -- True
  print $ abc2 "abcd" -- False

abc1, abc2 :: String -> Bool
abc1 str = str == "abc"
abc2 str = case P.parse (P.string "abc" >> P.eof) "abc" str of
  Right _ -> True
  Left _ -> False

Page 17: Text Manipulation with/without Parsec

import qualified Text.Parsec as P

main = do
  print $ parenthMatch1 "(a (b c))" -- True
  print $ parenthMatch1 "(a (b c)"  -- False
  print $ parenthMatch1 ")(a (b c)" -- False
  print $ parenthMatch2 "(a (b c))" -- True
  print $ parenthMatch2 "(a (b c)"  -- False
  print $ parenthMatch2 ")(a (b c)" -- False

-- Without Parsec: count the nesting depth while walking the string.
parenthMatch1 :: String -> Bool
parenthMatch1 str = f str 0
  where
    f "" 0 = True
    f "" _ = False
    f ('(':xs) n = f xs (n + 1)
    f (')':xs) 0 = False
    f (')':xs) n = f xs (n - 1)
    f (_:xs) n = f xs n

-- With Parsec: f parses plain characters or a parenthesized group g.
parenthMatch2 :: String -> Bool
parenthMatch2 str =
  case P.parse (f >> P.eof) "parenthMatch" str of
    Right _ -> True
    Left _ -> False
  where
    f = P.many (P.noneOf "()" P.<|> g)
    g = do
      P.char '('
      f
      P.char ')'

Page 18: Text Manipulation with/without Parsec

Parsec API

• anyChar

• char 'a'

• string "abc"
  == string ['a', 'b', 'c']
  == char 'a' >> char 'b' >> char 'c' (accepts the same input, but returns only 'c')

• oneOf ['a', 'b', 'c']

• noneOf "abc"

• eof
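
A minimal sketch of these primitives run against short inputs. The `accepts` helper and the "example" source name are assumptions made for illustration; everything else is exported by Text.Parsec.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- Hypothetical helper: does the parser succeed on this input?
accepts :: Parser a -> String -> Bool
accepts p input = either (const False) (const True) (P.parse p "example" input)

main :: IO ()
main = do
  print (P.parse P.anyChar "example" "xyz")          -- Right 'x'
  print (P.parse (P.char 'a') "example" "abc")       -- Right 'a'
  print (P.parse (P.string "abc") "example" "abcd")  -- Right "abc"
  print (P.parse (P.oneOf "abc") "example" "cab")    -- Right 'c'
  print (accepts (P.noneOf "abc") "cab")             -- False: 'c' is excluded
  print (accepts (P.string "abc" >> P.eof) "abcd")   -- False: input remains
```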

Page 19: Text Manipulation with/without Parsec

Parsec API (combinator)

• >>, >>=, return, and fail

• <|>

• many p

• p1 `manyTill` p2

• p1 `sepBy` p2

• chainl p op x (or p `chainl1` op)
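
A short sketch of how these combinators compose. The parser names (`number`, `numbers`, `difference`, `untilSemicolon`) are hypothetical and chosen only for this illustration.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- One or more digits read as an Int.
number :: Parser Int
number = read <$> P.many1 P.digit

-- p `sepBy` sep: comma-separated numbers.
numbers :: Parser [Int]
numbers = number `P.sepBy` P.char ','

-- p `chainl1` op: left-associative subtraction.
difference :: Parser Int
difference = number `P.chainl1` (P.char '-' >> return (-))

-- p `manyTill` end: everything up to the terminator, which is consumed.
untilSemicolon :: Parser String
untilSemicolon = P.anyChar `P.manyTill` P.char ';'

main :: IO ()
main = do
  print (P.parse numbers "example" "1,22,333")           -- Right [1,22,333]
  print (P.parse difference "example" "10-3-2")          -- Right 5
  print (P.parse untilSemicolon "example" "abc;def")     -- Right "abc"
  print (P.parse (P.many (P.char 'a')) "example" "aab")  -- Right "aa"
```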

Page 20: Text Manipulation with/without Parsec

Parsec API (etc)

• try

• lookAhead p

• notFollowedBy p
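
These three control how much input a parser consumes. The sketch below is one way to see the difference; `ab`, `ac`, `keywordLet` and `accepts` are hypothetical names introduced for the example.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- Two alternatives that share the prefix 'a'.
ab, ac :: Parser Char
ab = P.char 'a' >> P.char 'b'
ac = P.char 'a' >> P.char 'c'

-- The keyword "let", not followed by another letter or digit.
keywordLet :: Parser ()
keywordLet = P.string "let" >> P.notFollowedBy P.alphaNum

-- Hypothetical helper: does the parser succeed on this input?
accepts :: Parser a -> String -> Bool
accepts p input = either (const False) (const True) (P.parse p "example" input)

main :: IO ()
main = do
  print (accepts (ab P.<|> ac) "ac")        -- False: ab consumed 'a' before failing
  print (accepts (P.try ab P.<|> ac) "ac")  -- True: try rewinds the input
  -- lookAhead succeeds without consuming, so the next parser still sees "<>".
  print (P.parse (P.lookAhead (P.string "<>") >> P.string "<>d") "example" "<>d")
  print (accepts keywordLet "let x")        -- True
  print (accepts keywordLet "letter")       -- False
```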

Page 21: Text Manipulation with/without Parsec

texts in Haskell

Page 22: Text Manipulation with/without Parsec

Three types of text

• String

• ByteString

• Text

Page 23: Text Manipulation with/without Parsec

String

• [Char]

• Char: a Unicode character (code point)

• "aaa" is String

• List is lazy and slow
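
Since each Char holds one Unicode code point, list functions on a String count characters rather than encoded bytes. A small illustration, using the greeting that appears later in the talk (`greeting` is a hypothetical name):

```haskell
import Data.Char (ord)

greeting :: String
greeting = "こんにちは"

main :: IO ()
main = do
  print (length greeting)            -- 5 characters (15 bytes when encoded as UTF-8)
  print (map ord (take 2 greeting))  -- code points of the first two characters
```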

Page 24: Text Manipulation with/without Parsec

ByteString

• import Data.ByteString

• Base64

• Char8

• UTF8

• Lazy (Char8, UTF8)

• Fast. The default in Snap

Page 25: Text Manipulation with/without Parsec

ByteString (cont'd)

• OverloadedStrings with Char8

• Give the type explicitly or use with ByteString functions

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Char8 ()
import Data.ByteString (ByteString)

main = print ("hello" :: ByteString)

Page 26: Text Manipulation with/without Parsec

ByteString (cont'd)

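import Data.ByteString.UTF8 ()
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Codec.Binary.UTF8.String (encode)

-- encode turns the String into UTF-8 bytes; BC.putStrLn writes them unchanged.
main = BC.putStrLn (B.pack $ encode "こんにちは" :: B.ByteString)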

Page 27: Text Manipulation with/without Parsec

Text

• import Data.Text

• import Data.Text.IO

• always UTF8

• import Data.Text.Lazy

• Fast
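
As a small illustration of the Text API, Data.Text ships its own `splitOn`, which handles the running split example from earlier in the talk directly. This is a sketch assuming the text package is installed and OverloadedStrings is enabled.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  -- splitOn takes the separator first, then the text to split.
  let parts = T.splitOn "<>" "aaa<>bb<>c<><>d"
  print parts                              -- ["aaa","bb","c","","d"]
  TIO.putStrLn (T.intercalate ", " parts)  -- aaa, bb, c, , d
  print (T.length "こんにちは")            -- 5
```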

Page 28: Text Manipulation with/without Parsec

Text (cont'd)

• UTF-8 friendly

{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text.IO as T

main = T.putStrLn ("こんにちは" :: Text)

Page 29: Text Manipulation with/without Parsec

Parsec supports

• String

• ByteString

Page 30: Text Manipulation with/without Parsec

Attoparsec supports

• ByteString

• Text

Page 31: Text Manipulation with/without Parsec

Attoparsec

• cabal install attoparsec

• attoparsec-text

• attoparsec-enumerator

• attoparsec-iteratee

• attoparsec-text-enumerator

Page 32: Text Manipulation with/without Parsec

Attoparsec pros/cons

• Pros

    • fast

    • text support

    • enumerator/iteratee

• Cons

    • no lookAhead/notFollowedBy

Page 33: Text Manipulation with/without Parsec

Parsec and Attoparsec

-- Parsec version
import qualified Text.Parsec as P

main = print $ abc "abc"

abc str = case P.parse f "abc" str of
  Right _ -> True
  Left _ -> False
f = P.string "abc"

-- Attoparsec version
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.Text as P

main = print $ abc "abc"

abc str = case P.parseOnly f str of
  Right _ -> True
  Left _ -> False
f = P.string "abc"

Page 34: Text Manipulation with/without Parsec

return ()

Page 35: Text Manipulation with/without Parsec

Practice

• args "f(x, g())"
  -- ["x", "g()"]

• args "f(, aa(), bb(c))"
  -- ["", "aa()", "bb(c)"]
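
One possible approach to the exercise, sketched with Parsec. This is not the speaker's solution; the parser names `argument` and `call` are hypothetical, and the sketch assumes arguments are separated by a comma followed by optional whitespace and that parentheses are balanced.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- One argument: plain text and balanced parenthesized groups, no top-level comma.
argument :: Parser String
argument = concat <$> P.many (nested P.<|> plain)
  where
    plain  = P.many1 (P.noneOf "(),")
    -- Inside parentheses commas are kept, so g(a, b) stays one argument.
    nested = do
      _ <- P.char '('
      inner <- P.many (nested P.<|> P.many1 (P.noneOf "()"))
      _ <- P.char ')'
      return ("(" ++ concat inner ++ ")")

-- A whole call: name, '(', comma-separated arguments, ')', end of input.
call :: Parser [String]
call = do
  _ <- P.many1 (P.noneOf "(),")
  _ <- P.char '('
  as <- argument `P.sepBy` (P.char ',' >> P.spaces)
  _ <- P.char ')'
  P.eof
  return as

-- Returns [] when the input is not a well-formed call.
args :: String -> [String]
args input = either (const []) id (P.parse call "args" input)

main :: IO ()
main = do
  print (args "f(x, g())")         -- ["x","g()"]
  print (args "f(, aa(), bb(c))")  -- ["","aa()","bb(c)"]
```

Because an argument may be empty, this sketch returns [""] for "f()"; whether that is the desired answer is left open by the exercise.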
