Los Angeles R users group - Nov 17 2010 - Part 2

Introduction to the Future of R

Avram Aelony, November 2010

Wednesday, November 17, 2010

Talk Outline:

I. Strengths

II. Criticisms

III. Challenges

IV. Remedies and Solutions

V. The Future

Quick disclaimer:

- I don’t consider myself an R expert

- I don’t have a crystal ball informing of the Future

- This talk is about polite observations

- The future is dynamic

YMMD <- your-mileage-may-differ()


R’s Strengths

- Many good things, too many to mention individually

... but let’s try...

Strengths of R

- A high quality statistical platform, yielding reproducible results

- Open Source, free and available

- Large, active community

- Intuitive language structure

- Data as rows and columns

- Package plugin architecture - there are many packages, top packages in widespread use

- Distributed contributions written, offered, and controlled by many individuals

- Data processing for most individual needs.

- Emerging success and increasing corporate adoption for some corporate needs (often prototyping and ad hoc analytics)

Strengths of R

More succinctly... based on a paraphrasing of a post by Ted Dunning *

I. Library

II. Language

III. Community

* http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html

Criticisms of R

- Larger grievances: memory and inefficiency

“One of the most vexing issues in R is memory. For anyone who works with large datasets - even if you have 64-bit R running and lots (e.g., 18Gb) of RAM, memory can still confound, frustrate, and stymie even experienced R users.”

http://www.matthewckeller.com/html/memory.html

- Small grievances: syntax, elegance, and managing complexity

“Most packages are very good, but I regret to say some are pretty inefficient and others downright dangerous.”

-Bill Venables, quote from 2007 http://www.mail-archive.com/r-help@r-project.org/msg06853.html

“...R functions used to be lean and mean, and now they’re full of exception-handling and calls to other packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into syntax like “GOTO 120...”

- comment taken from Gelman blog on the future of R. http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html

However, greater challenges for R lie ahead

I. Big Data is coming...

II. Isn’t Big Data already here?

How can we imagine an ideal environment to address Big Data?

- What is Big Data?

"Every 2 Days We Create As Much Information As We Did Up To 2003" - Eric Schmidt, Chairman & CEO, Google.

http://techcrunch.com/2010/08/04/schmidt-data/

"Data is abundant, Information is useful, Knowledge is precious." http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html

- Freshness: this data will self-destruct in 5 seconds...!!

"How Much Time Do You Have Before Web‐Generated Leads Go Cold?" http://www.matrixintegratedmarketing.com/MIT.pdf

Get ready: “Web Scale Big Data - 100’s of Terabytes”

- John Sichi, Facebook, on intended usage with Hive. http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010 (slide #6)

What is Big Data?

Wikipedia - http://en.wikipedia.org/wiki/Big_data


Solving the “Big” Data problem

... as I see it,

there are 5 competing possible solution “avenues”

The “Big” Data problem:

Solution #1

Use R in conjunction with other specialized tools.

Examples:

- R remains a language for small datasets but has “hooks” and “bridges” that enable use with MapReduce style tools (Hadoop, Streaming, Hive, Pig, Cascading, others...)
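To make the “hooks and bridges” idea concrete, here is a minimal sketch in base R of the map and reduce steps that a MapReduce style tool would run as separate processes. The function names `map_words` and `reduce_counts` are hypothetical, and the example runs locally without Hadoop; in a streaming-style job the mapper and reducer would instead read lines from standard input and write tab-separated key/value lines to standard output.

```r
# Hypothetical mapper: turn each input line into "word<TAB>1" records.
map_words <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]          # drop empty tokens
  paste(words, 1L, sep = "\t")
}

# Hypothetical reducer: sum the counts for each key.
reduce_counts <- function(records) {
  parts  <- strsplit(records, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  values <- as.integer(vapply(parts, `[`, character(1), 2))
  vapply(split(values, keys), sum, integer(1))
}

# Local stand-in for the framework: map, sort by key, then reduce.
lines  <- c("Big Data is coming", "Isn't Big Data already here")
counts <- reduce_counts(sort(map_words(lines)))
print(counts)
```

The point of the split is that neither function needs the whole dataset in memory at once: the framework can run many mappers in parallel and hand each reducer only the records for its keys.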

Solution #2

Packages that enable new functionality for reading and processing very large data sets

Examples:

- Saptarshi Guha’s RHIPE (R and Hadoop Integrated Processing Environment)

- Kane & Emerson’s bigmemory

- Adler et al.’s ff package

- Henrik Bengtsson’s R.huge package (deprecated)

- (many new yet-to-be-developed possibilities here)

So.... enhance functions, but no enhancements to the core language

Solution #3

Same language but have R “do the right thing” under the hood.

Examples:

- Out of memory algorithms, think: “I see you’re trying to analyze a sizable amount of data...”

- Either seamlessly or after user approval to go ahead...

# perhaps, perhaps...
d <- read.table(file="s3://mybucket.name", enormous.data=TRUE)

or if possible, enhance core language as well as functionality!!!
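One way to read the “out of memory algorithms” idea is to process a file in fixed-size chunks and keep only running totals, so memory use depends on the chunk size rather than the file size. The sketch below uses base R only; `chunked_mean` and its arguments are hypothetical names, not an existing R function, and the input is a small temporary CSV created purely for illustration.

```r
# Hypothetical helper: mean of one numeric column, read chunk by chunk.
chunked_mean <- function(path, column, chunk_rows = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Parse the header line; scan() strips the quotes written by write.csv().
  header <- scan(text = readLines(con, n = 1L), what = character(),
                 sep = ",", quiet = TRUE)
  total <- 0
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_rows)   # next chunk of raw lines
    if (length(lines) == 0L) break            # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])     # keep only running totals
    n <- n + nrow(chunk)
  }
  total / n
}

# Illustration on a small temporary file.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10, y = (1:10)^2), path, row.names = FALSE)
result <- chunked_mean(path, "x", chunk_rows = 3L)
unlink(path)
```

A mean is easy because it decomposes into a sum and a count. Statistics that do not decompose this way (a median, for example) need a different out-of-memory strategy, which is part of why doing this seamlessly “under the hood” is hard.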

Solution #4 - Completely start over

http://www.stat.auckland.ac.nz/%7Eihaka/downloads/Compstat-2008.pdf

http://www.stat.auckland.ac.nz/%7Eihaka/downloads/JSM-2010.pdf

The Ihaka/Lang “Back to the Future” paper came out in 2008.

The Ihaka “Lessons Learned” 2010 paper mentions:

- the need for an “effective language for handling large-scale computations”

- nostalgia for Lisp

Have there been any Lisp-like advances since then?

What about Clojure?

Solution #5 - Does Clojure fit the bill?

H0: Clojure already has many of the things Ross Ihaka would ask for

H1: Really?

-Rich Hickey http://clojure.org/rationale

Clojure may be seen as a solution, or as an example path for R to follow, improve upon, or choose to differ...

Clojure

-Rich Hickey http://clojure.org

- Core Clojure

- Incanter: "a Clojure-based, R-like platform for statistical computing and graphics" http://incanter.org/

- Infer: "a (Clojure) library for machine learning and statistical inference, designed to be used in real production systems." https://github.com/bradford/infer

- Cascalog: “Data processing on Hadoop without the hassle” “a Clojure-based query language for Hadoop”

The problem with many new languages is that initially there are no libraries...

Clojure already has many, and can use any Java library directly as necessary.

What will the Future really hold for R?

Thanks for listening...

Appendix:

A few slides on Clojure, and three powerful Clojure libraries:

- Incanter

- Infer

- Cascalog

Clojure - a quick tour

-Rich Hickey http://clojure.org

Please see http://incanter.org/docs/data-sorcery-new.pdf for an excellent intro to Incanter.

David Edgar Liebke’s Incanter

Below are example snippets from Incanter

Bradford Cross’ Infer: "a (Clojure) library for machine learning and statistical inference, designed to be used in real production systems."

https://github.com/bradford/infer

Nathan Marz’s Cascalog: http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html
