Information Mining and Visualization of a Large Volume of Legal Texts

Flávio Codeço Coelho, Renato Rocha Souza and Pablo de Camargo Cerdeira

Applied Mathematics School – Getulio Vargas Foundation

August 22, 2011

Conquering text

Scraping and indexing the world’s web pages has changed the world...
Should PageRank be our main measure of information relevance?
What is possible if we go a little further?

It’s documents all the way down...

Luckily, we didn’t have to scan them...
We have to conquer an information mountain...

We had generous help...

Obtaining the Data

No API for access, so some heuristics were necessary
Scraping took more than 3 months
1.3 million cases

Example: Photos

Navigating with Mechanize

import mechanize
from mechanize import LinkNotFoundError

br = mechanize.Browser()
br.open("http://www.stf.jus.br/portal/ministro/ministro.asp?periodo=stf&tipo=antiguidade")
i = 0
link = br.find_link(url_regex=r'verMinistro.asp', nr=i)
while 1:
    br.follow_link(link)  # open the page behind the current verMinistro.asp link
    il = br.find_link(url_regex='imagem.asp')  # link to the photo
    url = "http://www.stf.jus.br/portal" + il.url.strip('..')
    nome = il.text
    # download_photo is a helper defined elsewhere (not shown on the slide)
    download_photo(url, nome.decode('latin1').split('[')[0])
    br.back()
    i += 1  # advance before the lookup so the first link is not visited twice
    try:
        link = br.find_link(url_regex=r'verMinistro.asp', nr=i)
    except LinkNotFoundError:
        break


HTML Parsing

Parsing scraped HTML

Beautiful Soup to the rescue!
Firebug helped analyze page structure.
Parsing was done during the scraping, to clean data for insertion into MySQL
Some parts of the page were stored in HTML for later parsing

sopa = BeautifulSoup(d['decisao'].strip('[]'), fromEncoding='ISO8859-1')
rs = sopa.findAll('strong', text=re.compile('^Legisla'))


Extracting Even more Information

With the data in a local DB, we started mining it:
Tried to use the best that SQL and Python had to offer
Pattern matching, aggregation, string matching, etc.

Read from DB → Process → Write to DB
SQL → Python → SQL
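The read, process, write loop above can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module in place of MySQL so that it runs anywhere, and the table names, columns, sample rows and the toy `process` rule are all hypothetical, not the project's schema.

```python
import re
import sqlite3

# Hypothetical schema: raw decision text in, one extracted field out.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE decisao (id INTEGER PRIMARY KEY, texto TEXT)")
conn.execute("CREATE TABLE lei_citada (decisao_id INTEGER, lei TEXT)")
conn.executemany(
    "INSERT INTO decisao (id, texto) VALUES (?, ?)",
    [(1, "Legislação: CF art. 5"), (2, "Sem citação"), (3, "Ver CF e CF art. 102")],
)


def process(texto):
    """Return the distinct abbreviations found in one text (toy rule: 'CF' only)."""
    return sorted(set(re.findall(r"\bCF\b", texto)))


# Read from DB -> process in Python -> write back to DB.
rows = conn.execute("SELECT id, texto FROM decisao ORDER BY id").fetchall()
for decisao_id, texto in rows:
    for lei in process(texto):
        conn.execute("INSERT INTO lei_citada VALUES (?, ?)", (decisao_id, lei))
conn.commit()

# Aggregation is left to SQL.
counts = conn.execute("SELECT lei, COUNT(*) FROM lei_citada GROUP BY lei").fetchall()
```

Keeping extraction in Python and aggregation in SQL mirrors the SQL → Python → SQL split named on the slide.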


Regular expressions

Regular Expressions

re module: great, but tricky for different encodings.
Kodos: visual debugging, indispensable!


rawstr = r""">*\s*([A-Z]{2,3}\s*-\s*.[A-Z0-9]*)|(CF)|("CAPUT")\s+"""
compile_obj = re.compile(rawstr, re.LOCALE)
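One way to sidestep the encoding trouble is to decode the scraped bytes first and match on Unicode text. The sketch below applies the pattern above in that way; the sample fragment is hypothetical, and `re.LOCALE` is dropped because Python 3 only accepts that flag with bytes patterns.

```python
import re

# Same pattern as above, compiled as a str (Unicode) pattern.
rawstr = r""">*\s*([A-Z]{2,3}\s*-\s*.[A-Z0-9]*)|(CF)|("CAPUT")\s+"""
compile_obj = re.compile(rawstr)

# Hypothetical scraped fragment, as ISO-8859-1 bytes.
raw = "Legislação: LEG-FED CF art. 5".encode("iso-8859-1")

# Decode once, at the boundary, then work only with str.
text = raw.decode("iso-8859-1")

# Each match fills exactly one of the three groups; keep whichever is set.
matches = [m.group(1) or m.group(2) or m.group(3) for m in compile_obj.finditer(text)]
```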

Structuring the Data


Reflect the original structure of the data
Store additional info coming from raw text
Design data model with future analytical needs in mind

Databases and Drivers

MySQL (MariaDB) was the relational DB of choice
MySQLdb’s cursor.execute('select * from ...')
Server-side cursors were essential.
MongoDB + PyMongo


What about ORMs?

Object-relational mappers are great but...
SQLAlchemy used mostly in table creation and data insertion.
For analytical purposes, server-side raw SQL, stored procs and views can’t be beaten.
We mostly used Elixir to design the tables.


Escaping from 2D data

Exploring MongoDB as an alternative for analytics

Benefits:
Auto-sharding + Map/reduce!
Escape costly joins in MySQL

Tips:
db.cursor(cursorclass=SSDictCursor)
Convert every string to UTF-8
PyMongo’s transparent conversion of dictionaries to BSON
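The string-conversion tip can be sketched as below. This is a minimal illustration, assuming the rows arrive as dictionaries holding ISO-8859-1 byte strings; the helper `to_text` and the sample row are hypothetical. In Python 3 a decoded `str` is Unicode and is stored as UTF-8 in BSON, so decoding at the boundary is all that is needed before handing the dictionary to PyMongo (that call is not shown).

```python
def to_text(value, encoding="iso-8859-1"):
    """Recursively decode bytes to str inside dicts and lists; leave other values alone."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, dict):
        return {to_text(k, encoding): to_text(v, encoding) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_text(v, encoding) for v in value]
    return value


# Hypothetical row, shaped like what a dict cursor might return.
row = {
    "relator": "MIN. SYDNEY SANCHES".encode("iso-8859-1"),
    "tipo": "Monocrática".encode("iso-8859-1"),
    "processo": 1606809,
}

# Plain dict of str and int values, ready to be used as a document.
doc = to_text(row)
```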

Understanding Text

Biggest challenge is extracting meaning from decisions
Is a given decision for or against the defendant?
What is the vote count on non-unanimous decisions?

Natural Language Toolkit

Lots of batteries included

Visualizing the Data

You can’t ask questions about what you don’t know...
Data-driven research

Standard Charting and Plotting: Matplotlib

Great for plotting summary statistics
Together with NetworkX, it can help visualize some small graphs

Large Graph Visualization: Ubigraph

Ubigraph Rocks!

Navigating huge graphs gave powerful insights
Takes advantage of multiple cores and GPU


Untangling Temporal patterns:

A bit of Python to create logs compatible with Gource

This:

import time
from matplotlib import cm
from matplotlib.colors import Normalize, rgb2hex

# dbdec (database connection) and ano (year) are defined elsewhere
Q = dbdec.execute("SELECT relator, processo, tipo, proc_classe, duracao, UF, data_dec FROM decisao WHERE DATE_FORMAT(data_dec, '%Y%')=" + "%s" % ano + " ORDER BY data_dec asc")
decs = Q.fetchall()
durations = [d[4] for d in decs]
cmap = cm.jet
norm = Normalize(min(durations), max(durations))  # normalizing durations
with open('decisoes_%s.log' % ano, 'w') as f:
    for d in decs:
        c = rgb2hex(cmap(norm(d[4]))[:3]).strip('#')  # duration mapped to a hex color
        path = "/%s/%s/%s/%s" % (d[5], d[2], d[3], d[1])  # /State/tipo/proc_classe/processo
        l = "%s|%s|%s|%s|%s\n" % (int(time.mktime(d[6].timetuple())), d[0], 'A', path, c)
        f.write(l)

Generates this:

885967200|MIN. SYDNEY SANCHES |A|/MG/Monocrática/INQUÉRITO/1606809|0000a4
885967200|MIN. SYDNEY SANCHES |A|/MG/Presidência/INQUÉRITO/1606809|0000a4


A snapshot of the Supreme Court activities: 1998

The Dynamics


Visual Python

It’s a Jungle Out There. . .

Division of labor in the Supreme Court
VPython is great for quickly creating complex animations.
Here judges are trees, branches are subjects and leaves are legal decisions

Detailed X-ray of the inner workings of the Supreme Court
92% of the cases are appeals of a non-constitutional nature
These results led to the proposal of an amendment to the constitution!
More questions than answers!
Python for data mining rocks!

To be continued...

Further automate and optimize
More explorations
Scale up the pipeline
Model the life history of a legal process

FGV - Direito Rio
FGV - EMAp
Brazilian Supreme Court
Asla Sá (for kindly lending us her server)
