record-oriented grep mlr-grep ryo1kato @github @gmail @twitter @facebook

multi-line record grep


Page 1: multi-line record grep

record-oriented grep

mlr-grep

ryo1kato@github @gmail @twitter @facebook

Page 2: multi-line record grep

motivation

Want to "grep" multi-line entries in a file

✦ multi-line log files, *.ini files, etc.
✦ semi-structured text like ifconfig output


Page 3: multi-line record grep

for example...

$ cat data.txt
[one]
two
three
[foo]
bar
baz
[hoge]
piyo
huga

} Want to extract every line of a record that contains a pattern, where a record ...

Page 4: multi-line record grep

Typical way

✦ grep -A 12 -B 34 -C 56
✦ pcregrep --multiline
✦ awk -v RS='\n\n' "/$re/"
✦ perl -e …


Page 5: multi-line record grep

But
✦ pcregrep: You often need a very long regex.
✦ Note that this is NOT about finding a multi-line pattern (a pattern containing '\n'), but about extracting a multi-line record that contains a pattern.
✦ AWK: Possible using RS (needs gawk).
✦ Actually, it's difficult to do it right using pcregrep or awk.
✦ perl, python: well, if you go that far ...

Page 6: multi-line record grep

But, do you want to write a one-liner / X script for these?

✦ zgrep
✦ grep -c (--count)
✦ grep -i (--ignore-case)
✦ grep -v (--invert-match)
✦ grep --color


Page 7: multi-line record grep

So I wrote it for you!
✦ mlr-grep
✦ Multi-Line Record Grep
✦ AWK, Haskell, Python
✦ named amlgrep, hmlgrep, and pmlgrep
✦ They have almost identical features.


Page 8: multi-line record grep

e.g.

$ amlgrep 'ba' …
[foo]
bar
baz

} A whole record containing the pattern
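
The behavior shown on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of any of the mlr-grep tools: the names `split_records` and `record_grep` are hypothetical, and the sketch assumes a record starts at any line beginning with '['.

```python
import re

def split_records(text, rs=r"^\["):
    """Split text into records; a line matching `rs` starts a new record."""
    sep = re.compile(rs)
    records, current = [], []
    for line in text.splitlines():
        if sep.search(line) and current:
            records.append(current)
            current = []
        current.append(line)
    if current:
        records.append(current)
    return records

def record_grep(pattern, text, rs=r"^\["):
    """Return every record that has at least one line matching `pattern`."""
    pat = re.compile(pattern)
    return [r for r in split_records(text, rs)
            if any(pat.search(line) for line in r)]

# Same content as the data.txt example.
DATA = "[one]\ntwo\nthree\n[foo]\nbar\nbaz\n[hoge]\npiyo\nhuga\n"

for record in record_grep("ba", DATA):
    print("\n".join(record))   # prints the whole [foo] record
```

The key point is that the pattern selects records, not lines: one matching line is enough to print every line of its record.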

Page 9: multi-line record grep

✦ amlgrep - AWK implementation
  ✦ Needs gawk.
  ✦ Fastest
  ✦ --rs regex is slightly broken in RHEL5.
  ✦ Automatically decompresses *.gz, *.bz2, and *.xz files
  ✦ --color, --count, --invert-match
  ✦ AND, OR of multiple keywords.

✦ hmlgrep - Haskell implementation
  ✦ Has almost the same feature set as the AWK version.
  ✦ Sometimes 1.5x to 2x slower, on files with short lines and many matches.

✦ pymlgrep - Python implementation
  ✦ Slowest (4x the run time of the AWK version)
  ✦ Doesn't support multiple keywords
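
As an illustration of the automatic decompression feature listed above, here is a minimal Python sketch that picks an opener by file extension. It is an assumption about how such a feature could work, not the actual code of amlgrep; `open_text` and `OPENERS` are hypothetical names.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Map a file suffix to the standard-library opener that can read it.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}

def open_text(path):
    """Open `path` as text, decompressing transparently based on its suffix."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")

# Round trip: write a small gzip file, then read it back as plain text.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.txt.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("[foo]\nbar\nbaz\n")
    with open_text(path) as f:
        print(f.read(), end="")
```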


Page 10: multi-line record grep

Multiple Keywords


Page 11: multi-line record grep

$ amlgrep [--or] h t [FILE]
[one]
two
three
[hoge]
piyo
huga

≒ egrep 'h|t', but with fewer keystrokes.

Page 12: multi-line record grep

$ amlgrep --and h t [FILE]
[one]
two
three

≒ egrep 'h.*t|t.*h', but with fewer keystrokes.
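
The OR and AND semantics of the two preceding slides can be sketched as follows. This is a rough Python illustration, assuming each keyword is tested against the whole record; `match_keywords` and `grep_records` are hypothetical names, not part of mlr-grep.

```python
import re

# The three records from the data.txt example, already split.
RECORDS = [
    ["[one]", "two", "three"],
    ["[foo]", "bar", "baz"],
    ["[hoge]", "piyo", "huga"],
]

def match_keywords(record, keywords, mode="or"):
    """Test every keyword against the whole record, then combine the results."""
    body = "\n".join(record)
    hits = [re.search(k, body) is not None for k in keywords]
    return all(hits) if mode == "and" else any(hits)

def grep_records(records, keywords, mode="or"):
    """Keep the records selected by the keywords under the given mode."""
    return [r for r in records if match_keywords(r, keywords, mode)]

print(grep_records(RECORDS, ["h", "t"], mode="or"))   # [one] and [hoge] records
print(grep_records(RECORDS, ["h", "t"], mode="and"))  # only the [one] record
```

Under this assumption the keywords of an AND query may sit on different lines of the same record ('t' in "two", 'h' in "three"), which a line-based egrep cannot express.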


Page 13: multi-line record grep

--timestamp

Multi-line log files in which each entry begins with a timestamp

Page 14: multi-line record grep

$ cat datetime.log
2014-01-23 12:34:56 log 1
  foo
  bar
2014-01-24 12:34:57 log 2
  one
  two
2014-01-25 12:34:58 log 3
  hoge
  piyo


Page 15: multi-line record grep

$ amlgrep -t 'one' …
2014-01-24 12:34:57 log 2
  one
  two
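
A simplified sketch of the timestamp mode in Python is shown below. It recognizes only the YYYY-MM-DD form at the start of a line, which is far narrower than the pattern the tool uses, and it assumes continuation lines are indented; the names `TIMESTAMP`, `timestamp_records`, and `grep_log` are hypothetical.

```python
import re

# Simplified assumption: a record starts at a line beginning with YYYY-MM-DD.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}\b")

LOG = """\
2014-01-23 12:34:56 log 1
  foo
  bar
2014-01-24 12:34:57 log 2
  one
  two
2014-01-25 12:34:58 log 3
  hoge
  piyo
"""

def timestamp_records(text):
    """Group lines into records, each opened by a timestamped line."""
    records = []
    for line in text.splitlines():
        if TIMESTAMP.match(line) or not records:
            records.append([line])
        else:
            records[-1].append(line)
    return records

def grep_log(pattern, text):
    """Return the full text of every log entry that matches `pattern`."""
    pat = re.compile(pattern)
    entries = ["\n".join(r) for r in timestamp_records(text)]
    return [e for e in entries if pat.search(e)]

print("\n".join(grep_log("one", LOG)))  # prints the whole 2014-01-24 entry
```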


Page 16: multi-line record grep

$ amlgrep -t --dump foo

gawk -W re-interval -F \n -v RS='\n(((Mon|Tue|Wed|Thu|Fri|Sat),?[ \t]+)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Dec),?[ \t]*[0-9]{1,2},?[ \t][0-2][0-9]:[0-5][0-9](:[0-5][0-9])?(,?[ \t]20[0-9][0-9])?|20[0-9][0-9]-(0[0-9]|11|12)-(0[1-9]|[12][0-9]|3[01]))' '-v' 'ORS=' 'oldRT $0 ~ /foo/ {i++;if(substr(oldRT,1,1)=="\n"){h=substr(oldRT,2)}else{h=oldRT};;gsub(/foo/,"&",h);print h;gsub(/foo/, "&");print;if(RT != "")printf "\n"} {oldRT=RT} END{if (i>0){exit 0}else{exit 1}}'


Page 17: multi-line record grep

Change the record separator
✦ --rs '^$'
  ✦ Empty lines
✦ --rs '^----'
  ✦ Four or more dashes
✦ --rs '^[[:alnum:]]'
  ✦ Alphanumeric character in the first column (for ifconfig-like output)
✦ --rs '^\['
  ✦ A line beginning with '[' (for *.ini files)
✦ --timestamp
  ≒ --rs '^(((Mon|Tue|Wed|Thu|Fri|Sat),?[ \t]+)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Dec),?[ \t]*[0-9]{1,2},?[ \t][0-2][0-9]:[0-5][0-9](:[0-5][0-9])?(,?[ \t]20[0-9][0-9])?|20[0-9][0-9]-(0[0-9]|11|12)-(0[1-9]|[12][0-9]|3[01]))'
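
To make the effect of different separators concrete, here is a small Python sketch that splits text at every line matching a separator regex. It is not the tool's implementation: `split_by` is a hypothetical helper, the sample inputs are made up, and because Python's `re` module has no POSIX classes, `[0-9A-Za-z]` stands in for `[[:alnum:]]`.

```python
import re

def split_by(rs, text):
    """Start a new record at each line where the separator regex `rs` matches."""
    sep = re.compile(rs)
    records = []
    for line in text.splitlines():
        if sep.search(line) or not records:
            records.append([line])
        else:
            records[-1].append(line)
    return records

# Made-up ifconfig-like text: interface names start in the first column.
IFCONFIG_LIKE = """\
eth0  Link encap:Ethernet
      inet addr:192.0.2.10
lo    Link encap:Local Loopback
      inet addr:127.0.0.1
"""

# Paragraphs separated by an empty line.
PARAGRAPHS = "alpha\nbeta\n\ngamma\n"

print(split_by(r"^[0-9A-Za-z]", IFCONFIG_LIKE))  # one record per interface
print(split_by(r"^$", PARAGRAPHS))               # the empty line opens the next record
```

In this sketch the separator line stays with the record that follows it, so with '^$' the empty line becomes the first line of the next record.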


Page 18: multi-line record grep

http://github.com/ryo1kato/mlr-grep