7th May 2013

Primary Research Team & Capabilities
Dept. of Parallel and Distributed Computing
Research and Development Areas:
- Large-scale HPCN, Grid and MapReduce applications
- Intelligent and Knowledge oriented Technologies
Experience from IST:
- 3 projects in FP5: ANFAS, CrossGRID, Pellucid
- 6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
- 4 projects in FP7: Commius, Admire, Secricom, EGEE III
- Several National Projects (SPVV, VEGA, APVT)
IKT Group Focus:
- Information Processing
- (Large Scale) Graph Processing
- Information Extraction and Retrieval
- Semantic Web
- Knowledge oriented Technologies
- Parallel and Distributed Information Processing
Solutions:
- SGDB: Simple Graph Database
- gSemSearch: Graph based Semantic Search
- Ontea: Pattern-based Semantic Annotation
- ACoMA: KM tool
- inEMBET: Recommendation System
- Experts on MapReduce and IR (Nutch, Solr, Lucene)
Director & leader of PDC: Dr. Ladislav Hluchý
URL:

Large scale Text and Graph data processing
Core Technology
- Web crawling: Nutch + plugins
- Full text indexing and search: Lucene, Solr
- Information Extraction: Ontea, GATE
- All of the above at large scale: Hadoop, S4
- Graph processing and Querying: Simple Graph Database (SGDB), gSemSearch, Neo4j, Blueprints
Technologies developed by IISAS (underlined on the original slide) include Ontea, SGDB and gSemSearch.

Relation to Business Intelligence
Old BI approaches
- Data Integration from RDBMS
- Data warehouses
- OLAP
- …
New BI approaches
- Other than RDBMS data structures: Networks, Semantics
  - Networks/Graphs in Telecom, Social Networks, Transactions, Linked Data …
  - NoSQL: key value (Tokyo Cabinet), column stores (HBase), Graph databases, RDF(s)
- In-Memory computing
- Commodity PC solutions for large data: MapReduce style - Hadoop, Pregel style - Giraph, Hama
- Big unstructured data processing (on Hadoop): Sentiment analysis, topic detection, named entity detection

Ontea: Information Extraction Tool
- Regex patterns
- Gazetteers
- Results
  - Key-value pairs
  - Structured into trees, graphs
- Transformers, Configuration
- Automatic loading of extractors
- Visual Annotation Tool
- Integration with external tools: GATE, Stemmers, Hadoop
- Multilingual tests: English, Slovak, Spanish, Italian
Figure panels: Text with annotations; Tree of annotations; Network/Graph of annotations

Named Entity Recognition (NER)
- Combination of existing NER tools: ANNIE (GATE), Apache OpenNLP, Illinois NER, Illinois Wikifier, LingPipe, Open Calais, Stanford NER, WikiMiner, Miscinator
- Machine Learning: Decision Trees models
- Our approach was ranked among the best 6 of 17 worldwide on MSM

gSemSearch: Graph based Semantic Search
- Entity relation search in semantic networks/graphs
- Search, Navigation, Data Interaction
- Aiming at data integration of
  - Structured data (Relational data, LinkedData)
  - Unstructured Data (text, documents, communication)
- Applications: Email, Web, Text documents, LinkedData

SemSets: Semantic Search
- Answering list type questions: astronauts who walked on the Moon
- Wikipedia as text and networks/graph
  - Text: IR methods, Lucene based
  - Graph/network: spreading activation and SemSets
- Winning solution on Semantic Search Challenge
Result list for the example question:
1. Eugene_Cernan
2. Alan_Bean
3. David_Scott
4. John_Young_(astronaut)
5. Neil_Armstrong
6. Pete_Conrad
7. Harrison_Schmitt
8. Alan_Shepard
9. Charles_Duke
10. Buzz_Aldrin
11. James_Irwin
12. Edgar_Mitchell

SGDB: Simple Graph Database
- Storage for graphs
- Optimized for graph traversing and spread of activation
- Faster than Neo4j for graph traversing operations
- Supports Blueprints API
- https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
Graph Database Benchmarks
- Graph Traversal Benchmark for Graph Databases: http://ups.savba.sk/~marek/gbench.html
- Blueprints API: possibility to test compliant Graph databases
Source:

Future Direction: Relations Discovery in Large Graph Data
Motivation
- Graph/Network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
- Text can also be converted to a graph.
- Interconnecting graph data and searching for relations is crucial.
Approach
- Forming semantic trees and graphs from text, web, communication, databases and LinkedData
- User interaction with graph data in order to achieve integration and data cleansing
- Users will do this if their effort has an immediate impact on search results
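
The SGDB and SemSets slides both name spreading activation as the graph technique for finding related entities. The slides give no algorithm details, so the following is only a minimal generic sketch: the function name `spread_activation`, the `decay` and `max_steps` parameters, and the toy graph are all assumptions made for illustration and are not taken from SGDB or gSemSearch.

```python
from collections import defaultdict


def spread_activation(graph, seeds, decay=0.5, max_steps=2):
    """Generic spreading activation (illustrative, not the SGDB code).

    graph: dict mapping a node to a list of neighbour nodes.
    seeds: dict mapping start nodes to their initial activation.
    Each step, every active node passes decay * energy to its
    neighbours, split evenly among them.
    """
    activation = defaultdict(float)
    activation.update(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = energy * decay / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


# Toy graph, invented for this example only.
TOY_GRAPH = {
    "Moon_landing": ["Neil_Armstrong", "Buzz_Aldrin"],
    "Neil_Armstrong": ["Moon_landing", "Apollo_11"],
    "Buzz_Aldrin": ["Moon_landing", "Apollo_11"],
    "Apollo_11": ["Neil_Armstrong", "Buzz_Aldrin"],
}

scores = spread_activation(TOY_GRAPH, {"Moon_landing": 1.0})
# Rank nodes by accumulated activation, highest first.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Nodes reached along several paths accumulate more activation, which is why this style of traversal suits relation search: an entity connected to the query through many short paths ranks above one connected through a single long path.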
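
The Ontea slide describes extraction with regex patterns and gazetteers, producing key-value pairs that are then structured into trees or graphs. As a rough sketch of that idea only: the patterns, the gazetteer entries, the sample text and the function names `extract` and `to_tree` below are hypothetical and do not reflect Ontea's actual configuration or API.

```python
import re
from collections import defaultdict

# Hypothetical regex patterns: annotation key -> compiled pattern.
PATTERNS = {
    "date": re.compile(r"\b\d{1,2}(?:st|nd|rd|th) [A-Z][a-z]+ \d{4}\b"),
    "programme": re.compile(r"\bFP\d\b"),
}

# Hypothetical gazetteer: known term -> annotation key.
GAZETTEER = {
    "Nutch": "technology",
    "Lucene": "technology",
    "Hadoop": "technology",
}


def extract(text):
    """Return (key, value) pairs in the order they occur in the text."""
    found = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), key, match.group(0)))
    for term, key in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            found.append((match.start(), key, term))
    found.sort(key=lambda item: item[0])
    return [(key, value) for _, key, value in found]


def to_tree(pairs):
    """Group the flat key-value pairs under their keys (a one-level tree)."""
    tree = defaultdict(list)
    for key, value in pairs:
        tree[key].append(value)
    return dict(tree)


SAMPLE = "7th May 2013: FP7 experience, crawling with Nutch, indexing with Lucene."
pairs = extract(SAMPLE)
tree = to_tree(pairs)
```

Keeping the extractor output as flat key-value pairs and building the tree in a separate step mirrors the split the slide suggests between extraction results and their later structuring.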