"Data is the new oil," as the saying goes. Recent developments in IT have made it possible to collect, store, and analyze large amounts of data. Halevy, Norvig, and Pereira argue [1] that, given a large enough data set, naive algorithms outperform highly sophisticated ones. Bender and Good [2], on the other hand, suggest that we have to review our theories about language in light of the unprecedented amount of available empirical data. This approach parallels the so-called probabilistic linguistics research program [3]. Using the Internet as a source of data is exciting and challenging. Information is usually encoded in text files, and we have to employ natural language processing techniques to extract it. To cope with the sheer size of today's data sets, we have to adapt our algorithms to modern parallel and distributed processing systems.

[1] Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, March/April 2009.
[2] Emily M. Bender and Jeff Good. 2010. A Grand Challenge for Linguistics: Scaling Up and Integrating Models. White paper contributed to NSF's SBE 2020 initiative. http://www.nsf.gov/sbe/sbe_2020/submission_detail.cfm?upld_id=81 (06.06.2012)
[3] Rens Bod, Jennifer Hay, and Stefanie Jannedy (eds). 2003. Probabilistic Linguistics. MIT Press.
Data Deluge (Adataradat)
"The hard part is not solving problems, but how we pose them."
Varju Zoltan
Weblib Kft.
2012-06-23
Varju Zoltan (Weblib Kft.) Adataradat 2012-06-23 1 / 6
From search to the data deluge
Dean - Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
Halevy - Norvig - Pereira: The Unreasonable Effectiveness of Data
Hadoop
NoSQL (Couchbase, MongoDB, etc.)
statistics - data mining - machine learning - data science
Will big data solve everything?
On sufficiently large data sets, simple n-gram models perform better than their more sophisticated counterparts.
In linguistics, the generative school and the probabilistic approach are at odds with each other.
Bender - Good: A Grand Challenge for Linguistics: Scaling Up and Integrating Models
We have to radically rethink our existing theories.
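To make the "simple n-gram model" mentioned above concrete, here is a minimal sketch of a bigram model with maximum-likelihood estimates. It is an illustration only, not the method of any cited paper: the function names, the `<s>` and `</s>` boundary markers, and the toy corpus are all assumptions introduced for this example.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        # Assumed sentence boundary markers, not taken from the source.
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent token pairs
    return contexts, bigrams

def bigram_probability(contexts, bigrams, prev, word):
    """Maximum-likelihood estimate P(word | prev) = c(prev, word) / c(prev)."""
    context_count = contexts[prev]
    if context_count == 0:
        return 0.0  # unseen context: no estimate available without smoothing
    return bigrams[(prev, word)] / context_count

# Toy corpus, invented for illustration.
corpus = [
    ["data", "is", "the", "new", "oil"],
    ["data", "is", "everywhere"],
]
contexts, bigrams = train_bigram_model(corpus)
p_is_given_data = bigram_probability(contexts, bigrams, "data", "is")
p_the_given_is = bigram_probability(contexts, bigrams, "is", "the")
```

The model is nothing more than two count tables, which is why it scales well: adding more text only means adding more counts. A practical system would also need smoothing for unseen pairs, which this sketch leaves out.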
Old problems in a new guise
"In 1998, Merrill Lynch cited estimates that as much as 80% of all potentially usable business information originates in unstructured form."
— http://en.wikipedia.org/wiki/Unstructured_data
How can we extract information from unstructured data?
Reformulating text mining and text processing problems as MapReduce problems (Lin and Dyer: Data-Intensive Text Processing with MapReduce)
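As a hedged illustration of what recasting a text processing task as a MapReduce problem can look like, the sketch below runs a word count through explicit map, shuffle, and reduce steps in a single Python process. It does not use Hadoop or any real MapReduce API; the function names and sample documents are assumptions made for this example.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit a (token, 1) pair for every token in one document."""
    for token in text.lower().split():
        yield token, 1

def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the partial counts collected for one token."""
    return key, sum(values)

def run_word_count(documents):
    """Drive map, shuffle, and reduce sequentially over a dict of documents."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Sample documents, invented for illustration.
documents = {
    "d1": "big data big clusters",
    "d2": "data is the new oil",
}
counts = run_word_count(documents)
```

Because the mapper sees one document at a time and the reducer sees one key at a time, both steps could in principle run on many machines in parallel, which is the property a framework such as Hadoop exploits.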
Solutions in the Hadoop ecosystem
Mahout http://mahout.apache.org/ - scalable algorithms for machine learning on Hadoop
Integration with analytics tools (e.g. R): Cloudera, Greenplum, Revolution Analytics
Radoop http://signup.radoop.eu/ - offers solutions built on the RapidMiner visual analytics environment
InfoHarvester http://weblib.hu/termekeink/infoharvester - deals specifically with unstructured data: a focused crawler for collecting the data, plus integrated analytics and text mining solutions
Thank you for your attention
Kereso Vilag http://kereses.blog.hu/
Szamitogepes nyelveszet http://szamitogepesnyelveszet.blogspot.com/
Twitter: @zoltanvarju
Email: [email protected]