BIG DATA Satish A G

Bar camp bigdata

DESCRIPTION

Presentation used during Big Data session @ Barcamp Bangalore

  • 1. BIG DATA Satish A G
  • 2. What is Big Data? Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  • 3. General Trivia
  • 4. 4 Vs
  • 5. Characteristics
  • 6. Why?
  • 7. Differences
  • 8. Criteria
  • 9. USE CASE - Online Services
  • 10. What do they want to do? Audiences: Employees; Digital Nomads; Technology Professionals; Data Scientists; Students; Influencers. Goals: recruit sponsors and become an advocate; be the first human sensor and crowd-source knowledge; learn about how technology enables big data analytics; explore new, diverse data to improve human knowledge; explore how data is impacting mankind and be a data scientist for a day; share this with my network, and participate in the experience.
  • 11. Who is a Data Scientist? Data science incorporates varying elements and builds on techniques and theories from many fields, including mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing, with the goal of extracting meaning from data and creating data products. A practitioner of data science is called a data scientist.
  • 12. Profile of Data Scientist
  • 13. Big Data Ecosystem
  • 14. Data Analytics Lifecycle
  • 15. One Approach
  • 16. Techniques / Analytical Methods
  • 17. Text Analysis - Approach
  • 18. Text Analysis - Process
  • 19. MapReduce & HDFS MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster (a group of connected computers that work together so that in many respects they can be viewed as a single system). A MapReduce program comprises a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation. The "MapReduce System" orchestrates the processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, providing for redundancy and fault tolerance, and managing the whole process overall. HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework.
  • 20. Hadoop Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. Effectively, it implements MapReduce and provides a distributed file system (HDFS). It supports the running of applications on large clusters of commodity hardware. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
  • 21. Overview
  • 22. Enterprise Visualization Software
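
The Map() and Reduce() procedures described on the MapReduce & HDFS slide can be illustrated with a small word count. This is only a single-process sketch, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the `shuffle` step stands in for the grouping that a real MapReduce system performs across distributed servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map(): emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key. In a real cluster the framework does this
    between the Map and Reduce steps; here it is a local dictionary."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce(): summarize all values collected for one key (here, a sum)."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

Because each call to `map_phase` depends only on its own input record, and each call to `reduce_phase` only on the values for its own key, both steps could in principle run in parallel on separate machines, which is the property a MapReduce system relies on.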