At Vancouver Haskell UnMeetup on Oct 11, 2011
Text manipulation with/without parsec
October 11, 2011 Vancouver Haskell UnMeetup
Tatsuhiro Ujihisa
Tuesday, October 11, 2011
• Tatsuhiro Ujihisa
• @ujm
• HootSuite Media inc
• Osaka, Japan
• Vim: 14
• Haskell: 5
Topics
• text manipulation functions with/without parsec
• parsec library
• texts in Haskell
• attoparsec library
Haskell for work
• Something academic
• Something mathematical
• Web app
• Better shell scripting
• (Improve yourself)
Text manipulation
• The concept of text
• String is [Char]
• lazy
• Pattern matching
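A minimal sketch of the bullets above, using only the Prelude and Data.Char. The helper name capitalize is made up for illustration.

```haskell
import Data.Char (toUpper)

-- | Upper-case the first character. Pattern matching on (c:cs) works
--   because a String is just a list of Char.
capitalize :: String -> String
capitalize ""     = ""
capitalize (c:cs) = toUpper c : cs

main :: IO ()
main = do
  putStrLn (capitalize "haskell")   -- Haskell
  -- Laziness: only the first five characters of the infinite string are built.
  putStrLn (take 5 (cycle "ab"))    -- ababa
```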
Example: split
• Ruby/Python example
• 'aaa<>bb<>c<><>d'.split('<>')
  # => ['aaa', 'bb', 'c', '', 'd']
• Vim script example
• split('aaa<>bb<>c<><>d', '<>')
split in Haskell
• split :: String -> String -> [String]
• split "aaa<>bb<>c<><>d" "<>"
  -- ["aaa", "bb", "c", "", "d"]
• "aaa<>bb<>c<><>d" `split` "<>"
Design of split
• split "aaa<>bb<>c<><>d" "<>"
• "aaa" : split "bb<>c<><>d" "<>"
• "aaa" : "bb" : split "c<><>d" "<>"
• "aaa" : "bb" : "c" : split "<>d" "<>"
• "aaa" : "bb" : "c" : "" : split "d" "<>"
• "aaa" : "bb" : "c" : "" : "d" : split "" "<>"
• "aaa" : "bb" : "c" : "" : "d" : []
Design of split
• split "aaa<>bb<>c<><>d" "<>"
• "aaa" : split "bb<>c<><>d" "<>"
Design of split
• split "aaa<>bb<>c<><>d" "<>"
• split' "aaa<>bb<>c<><>d" "<>" ""
• split' "aa<>bb<>c<><>d" "<>" "a"
• split' "a<>bb<>c<><>d" "<>" "aa"
• split' "<>bb<>c<><>d" "<>" "aaa"
• "aaa" : split "bb<>c<><>d" "<>"
split :: String -> String -> [String]
str `split` pat = split' str pat ""

-- memo collects the current piece in reverse order
split' :: String -> String -> String -> [String]
split' "" _ memo = [reverse memo]
split' str pat memo = let (a, b) = splitAt (length pat) str in
                      if a == pat
                         -- separator found: emit the piece, restart after it
                         then (reverse memo) : (b `split` pat)
                         -- otherwise move one character into memo
                         else split' (tail str) pat (head str : memo)
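The same idea can be sketched with stripPrefix from Data.List, which avoids head, tail, and splitAt. This is a variant for illustration, not the code from the talk: the name splitWith is hypothetical, and a non-empty separator is assumed (an empty one would loop forever).

```haskell
import Data.List (stripPrefix)

-- | Variant of split with the same argument order (string first, separator second).
--   Assumes the separator is non-empty.
splitWith :: String -> String -> [String]
splitWith str pat = go str []
  where
    -- acc holds the current piece in reverse order
    go s acc = case stripPrefix pat s of
      Just rest -> reverse acc : go rest []   -- separator found: emit the piece
      Nothing   -> case s of
        []     -> [reverse acc]               -- end of input: emit the last piece
        (c:cs) -> go cs (c : acc)             -- ordinary character: keep scanning

main :: IO ()
main = print ("aaa<>bb<>c<><>d" `splitWith` "<>")
-- ["aaa","bb","c","","d"]
```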
Another approach
• Text.Parsec: v3
• Text.ParserCombinators.Parsec: v2
• Real World Haskell Parsec chapter
• csv parser
Design of split
• split "aaa<>bb<>c<><>d" "<>"
• many pieces, where each piece is
• any run of chars that does not contain "<>"
• and pieces are separated by "<>" or closed by the end of the string
import qualified Text.Parsec as P

split :: String -> String -> [String]
str `split` pat = case P.parse (split' (P.string pat)) "split" str of
  Right x  -> x
  Left err -> error (show err)  -- not expected: the parser accepts any input

split' :: P.Parsec String () String -> P.Parsec String () [String]
split' pat = P.anyChar `P.manyTill` (P.eof P.<|> (P.try (P.lookAhead pat) >> return ())) `P.sepBy` pat
• P.anyChar: any char
• P.eof P.<|> (P.try (P.lookAhead pat) >> return ()): stop at the end of the string or at the separator pattern, without consuming the separator
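To see that lookAhead leaves the separator in the input, here is a small sketch that parses one piece and then returns whatever input remains. The names piece and pieceAndRest are made up for this example, and the separator is fixed to "<>".

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- | Read characters up to, but not including, the separator "<>".
piece :: Parser String
piece = P.anyChar `P.manyTill` stop
  where
    -- stop succeeds at end of input or in front of "<>", consuming nothing
    stop = P.eof P.<|> (P.try (P.lookAhead (P.string "<>")) >> return ())

-- | Return the piece together with the input left after it.
pieceAndRest :: Parser (String, String)
pieceAndRest = do
  x    <- piece
  rest <- P.getInput
  return (x, rest)

main :: IO ()
main = print (P.parse pieceAndRest "demo" "aaa<>bb")
-- Right ("aaa","<>bb")
```

Because the separator is still there, sepBy can consume it before the next piece starts.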
import qualified Text.Parsec as P

main = do
  print $ abc1 "abc"  -- True
  print $ abc1 "abcd" -- False
  print $ abc2 "abc"  -- True
  print $ abc2 "abcd" -- False

-- plain equality
abc1 :: String -> Bool
abc1 str = str == "abc"

-- the same check as a parser: "abc" followed by the end of input
abc2 :: String -> Bool
abc2 str = case P.parse (P.string "abc" >> P.eof) "abc" str of
  Right _ -> True
  Left _ -> False
import qualified Text.Parsec as P

main = do
  print $ parenthMatch1 "(a (b c))" -- True
  print $ parenthMatch1 "(a (b c)"  -- False
  print $ parenthMatch1 ")(a (b c)" -- False
  print $ parenthMatch2 "(a (b c))" -- True
  print $ parenthMatch2 "(a (b c)"  -- False
  print $ parenthMatch2 ")(a (b c)" -- False

-- without parsec: count the currently open parentheses
parenthMatch1 :: String -> Bool
parenthMatch1 str = f str 0
  where
    f "" 0 = True
    f "" _ = False
    f ('(':xs) n = f xs (n + 1)
    f (')':xs) 0 = False
    f (')':xs) n = f xs (n - 1)
    f (_:xs) n = f xs n

-- with parsec: f is a run of plain chars and nested groups, g is one "(...)" group
parenthMatch2 :: String -> Bool
parenthMatch2 str =
  case P.parse (f >> P.eof) "parenthMatch" str of
    Right _ -> True
    Left _ -> False
  where
    f = P.many (P.noneOf "()" P.<|> g)
    g = do
      P.char '('
      f
      P.char ')'
Parsec API
• anyChar
• char 'a'
• string "abc"
  == string ['a', 'b', 'c']
  == char 'a' >> char 'b' >> char 'c'
  (same input matched; the last form returns only 'c')
• oneOf ['a', 'b', 'c']
• noneOf "abc"
• eof
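A quick way to try these primitives is a helper that reports whether a parser accepts the start of an input. The helper name accepts is hypothetical; everything else is from Text.Parsec.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- | Hypothetical helper: True when the parser succeeds on a prefix of the input.
accepts :: Parser a -> String -> Bool
accepts p s = either (const False) (const True) (P.parse p "demo" s)

main :: IO ()
main = do
  print (accepts P.anyChar "x")                     -- True
  print (accepts (P.char 'a') "abc")                -- True
  print (accepts (P.string "abc") "abd")            -- False
  print (accepts (P.oneOf "abc") "b")               -- True
  print (accepts (P.noneOf "abc") "b")              -- False
  print (accepts P.eof "")                          -- True
  print (accepts (P.string "abc" >> P.eof) "abcd")  -- False
```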
Parsec API (combinator)
• >>, >>=, return, and fail
• <|>
• many p
• p1 `manyTill` p2
• p1 `sepBy` p2
• p `chainl1` op (or chainl p op x, with x as the result when there is no p)
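A small sketch of sepBy, chainl1, and manyTill in use. The parsers number, numberList, subtraction, and commentBody are invented for this example and are not from the talk.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- | A non-negative integer literal.
number :: Parser Int
number = read <$> P.many1 P.digit

-- | Comma-separated numbers, e.g. "1,2,3".
numberList :: Parser [Int]
numberList = number `P.sepBy` P.char ','

-- | Left-associative subtraction: "10-3-2" is read as (10-3)-2.
subtraction :: Parser Int
subtraction = number `P.chainl1` (P.char '-' >> return (-))

-- | Everything up to the closing "*/"; the terminator itself is consumed.
commentBody :: Parser String
commentBody = P.anyChar `P.manyTill` P.try (P.string "*/")

main :: IO ()
main = do
  print (P.parse numberList "demo" "1,2,3")           -- Right [1,2,3]
  print (P.parse subtraction "demo" "10-3-2")         -- Right 5
  print (P.parse commentBody "demo" "hi * there */")  -- Right "hi * there "
```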
Parsec API (etc)
• try p
• lookAhead p
• notFollowedBy p
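A sketch of why try matters and what notFollowedBy adds. string consumes input on a partial match, so without try the alternative is never attempted. All parser names here are hypothetical.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- | Hypothetical helper: True when the parser succeeds on a prefix of the input.
accepts :: Parser a -> String -> Bool
accepts p s = either (const False) (const True) (P.parse p "demo" s)

-- | Without try: on "lambda", string "let" consumes the 'l' and then fails,
--   so <|> does not run the second branch.
keywordNoTry :: Parser String
keywordNoTry = P.string "let" P.<|> P.string "lambda"

-- | With try: the failed branch rewinds and the second branch gets its turn.
keywordTry :: Parser String
keywordTry = P.try (P.string "let") P.<|> P.string "lambda"

-- | "let" only when it is not the prefix of a longer word.
letKeyword :: Parser String
letKeyword = P.try (P.string "let" <* P.notFollowedBy P.alphaNum)

main :: IO ()
main = do
  print (accepts keywordNoTry "lambda")  -- False
  print (accepts keywordTry "lambda")    -- True
  print (accepts letKeyword "let x")     -- True
  print (accepts letKeyword "letter")    -- False
```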
texts in Haskell
three types of text
• String
• ByteString
• Text
String
• [Char]
• Char: a Unicode character (code point)
• "aaa" is String
• List is lazy and slow
ByteString
• import Data.ByteString
• Base64
• Char8
• UTF8
• Lazy (Char8, UTF8)
• Fast. The default in Snap
ByteString (cont'd)
• OverloadedStrings with Char8
• Give the type explicitly or use with ByteString functions
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Char8 ()
import Data.ByteString (ByteString)

main = print ("hello" :: ByteString)
ByteString (cont'd)
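import Data.ByteString.UTF8 ()
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Codec.Binary.UTF8.String (encode)

-- encode turns the String into UTF-8 bytes; B.pack builds the ByteString.
-- BC.putStrLn is used because Data.ByteString.putStrLn was deprecated in favour of it.
main = BC.putStrLn (B.pack $ encode "こんにちは" :: B.ByteString)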
Text
• import Data.Text
• import Data.Text.IO
• always Unicode (the internal encoding is an implementation detail)
• import Data.Text.Lazy
• Fast
Text (cont'd)
• UTF-8 friendly
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text.IO as T

main = T.putStrLn ("こんにちは" :: Text)
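For comparison with the hand-written split, Data.Text ships its own splitOn, and the character/byte distinction between Text and ByteString can be checked directly. This is a sketch assuming the text and bytestring packages; the source file is assumed to be saved as UTF-8.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

main :: IO ()
main = do
  -- splitOn takes the separator first, then the text to split.
  print (T.splitOn "<>" "aaa<>bb<>c<><>d")   -- ["aaa","bb","c","","d"]
  let greeting = "こんにちは" :: T.Text
  -- T.length counts characters.
  print (T.length greeting)                  -- 5
  -- B.length counts bytes: each of these characters takes 3 bytes in UTF-8.
  print (B.length (encodeUtf8 greeting))     -- 15
```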
Parsec supports
• String
• ByteString
Attoparsec supports
• ByteString
• Text
Attoparsec
• cabal install attoparsec
• attoparsec-text
• attoparsec-enumerator
• attoparsec-iteratee
• attoparsec-text-enumerator
Attoparsec pros/cons
• Pros
• fast
• text support
• enumerator/iteratee
• Cons
• no lookAhead/notFollowedBy
Parsec and Attoparsec
-- Parsec
import qualified Text.Parsec as P

main = print $ abc "abc"

abc :: String -> Bool
abc str = case P.parse f "abc" str of
  Right _ -> True
  Left _ -> False

f :: P.Parsec String () String
f = P.string "abc"

-- Attoparsec (a separate program)
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.Text as P
import Data.Text (Text)

main = print $ abc "abc"

abc :: Text -> Bool
abc str = case P.parseOnly f str of
  Right _ -> True
  Left _ -> False

f :: P.Parser Text
f = P.string "abc"
return ()
Practice
• args "f(x, g())"
  -- ["x", "g()"]
• args "f(, aa(), bb(c))"
  -- ["", "aa()", "bb(c)"]
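One possible answer to the exercise, sketched with Parsec. This is not the speaker's solution: the helper parsers arg and call are invented here, the function name is taken to be anything before the first "(", and a parse failure is mapped to an empty list as an arbitrary choice.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- | One argument: plain characters mixed with balanced "(...)" groups.
--   An argument may be empty, as in "f(, x)".
arg :: Parser String
arg = concat <$> P.many (plain P.<|> group)
  where
    plain = P.many1 (P.noneOf "(),")
    group = do
      _     <- P.char '('
      inner <- concat <$> P.many (P.many1 (P.noneOf "()") P.<|> group)
      _     <- P.char ')'
      return ("(" ++ inner ++ ")")

-- | A whole call: name, "(", arguments separated by "," plus optional spaces, ")".
call :: Parser [String]
call = do
  _  <- P.many1 (P.noneOf "(),")        -- function name, discarded
  _  <- P.char '('
  as <- arg `P.sepBy` (P.char ',' >> P.spaces)
  _  <- P.char ')'
  P.eof
  return as

-- | Returns [] when the input is not a well-formed call (an assumption of this sketch).
args :: String -> [String]
args str = either (const []) id (P.parse call "args" str)

main :: IO ()
main = do
  print (args "f(x, g())")         -- ["x","g()"]
  print (args "f(, aa(), bb(c))")  -- ["","aa()","bb(c)"]
```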