
Big Data Hadoop Online Training And Certification




Content
- What Is Big Data
- What Is Hadoop
- Characteristics of Big Data
- Characteristics of Hadoop
- Big Data Storage Considerations
- Understanding Hadoop Technology and Storage
- Big Data Technologies
- Hadoop HDFS Architecture
- Why Big Data
- Why Hadoop
- Future of Big Data
- Future of Hadoop

What is Big Data Big data refers to collections of datasets so large that they cannot be processed using traditional computing techniques. Big data is not merely data; it has become a complete subject in its own right, involving various tools, techniques and frameworks.

What is Hadoop Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
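To make the "distributed processing" idea concrete, here is a minimal sketch of the mapper and reducer for a word-count job written against the standard org.apache.hadoop.mapreduce API. The class names and tokenization rule are illustrative choices, not part of the original slides; the driver that would submit this job is sketched later in the HDFS architecture section.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: splits each input line into words and emits (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts emitted for each word across all mappers.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```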

Characteristics of Big Data We are all familiar with the 3Vs of big data: Volume, Variety and Velocity. However, Inderpal Bhandari, Chief Data Officer at Express Scripts, noted in his presentation at the Big Data Innovation Summit in Boston that there are additional Vs that IT, business and data scientists should be concerned with, most notably data Veracity. The three main characteristics are Volume, Velocity and Variety.

Volume Volume refers to the vast amounts of data generated continuously. We are not talking terabytes but zettabytes or brontobytes. If we take all the data created in the world between the beginning of time and 2008, the same amount of data will soon be generated every minute. This makes most data sets far too large to store and analyze using traditional database technology. New big data tools use distributed systems, so that data can be stored and analyzed across databases dotted around the world.

Velocity

Velocity refers to the speed at which new data is generated and the pace at which data moves around. Just think of social media messages going viral in seconds. Technology now allows us to analyze data while it is being generated (sometimes referred to as in-memory analytics), without ever putting it into databases. Velocity is the speed at which data is created, stored, analyzed and visualized. In the past, when batch processing was common practice, it was normal to receive a database update every night or even every week; computers and servers needed substantial time to process the data and update the databases. In the big data era, data is created in real time or near real time. With the availability of Internet-connected devices, wireless or wired, machines and sensors can pass on their data the moment it is created.

Variety Variety refers to the different types of data we can now use. In the past we focused only on structured data that fitted neatly into tables or relational databases, such as financial data. In fact, 80% of the world's data is unstructured (text, images, video, voice, and so on). With big data technology we can now analyze and combine data of different types, such as messages, social media conversations, photos, sensor data, video or voice recordings. Previously, all the data that was created was structured data that fitted neatly into columns and rows, but those days are over; nowadays as much as 90% of the data generated by an organization is unstructured. Data today comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data. This wide variety of data requires a different approach and different techniques to store all the raw data.

Characteristics of Hadoop
- Hadoop provides a reliable shared storage system (HDFS) and an analysis system (MapReduce).
- Hadoop is highly scalable and, unlike relational databases, scales linearly. Because of this, a Hadoop cluster can contain tens, hundreds, or even thousands of servers.
- Hadoop is very cost effective because it runs on commodity hardware and does not require expensive high-end servers.
- Hadoop is highly flexible and can process both structured and unstructured data.
- Hadoop has built-in fault tolerance. Data is replicated across multiple nodes (the replication factor is configurable; see the configuration sketch below), and if a node goes down the required data can be read from another node that holds a copy. Hadoop also ensures that the replication factor is maintained by re-replicating data to other available nodes when a node fails.
- Hadoop works on the principle of write once, read many times.
- Hadoop is optimized for large and very large data sets. For instance, a small amount of data such as 10 MB generally takes longer to process in Hadoop than in traditional systems.
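As one example of the configurable replication mentioned in the list, the replication factor and block size can be set cluster-wide in hdfs-site.xml or per client through the Configuration object. A minimal sketch, assuming the standard dfs.replication and dfs.blocksize property names:

```java
import org.apache.hadoop.conf.Configuration;

public class ReplicationConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Keep three copies of every block (the HDFS default); raising this
        // increases fault tolerance at the cost of extra storage.
        conf.setInt("dfs.replication", 3);

        // Use a 128 MB block size; large blocks are one reason Hadoop favors
        // large files over many small ones.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        System.out.println("replication = " + conf.get("dfs.replication"));
        System.out.println("blocksize   = " + conf.get("dfs.blocksize"));
    }
}
```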

Big Data Storage Considerations Our experience building an industry-leading Big Data storage platform has taught us a few things about the storage challenges organizations face. Customers have shared with us some of the general pros and cons of the storage options they considered when choosing a storage platform.

Open Source
Pros:
- Free, with community support
- Scalable
- Runs on inexpensive commercial-off-the-shelf (COTS) hardware
Cons:
- Community support alone is not sufficient, so there is a reliance on outside consultancy
- Investment to build and maintain in-house competency
- In-house support, testing and tuning
- No guaranteed SLA
- Long lead time to get into production

Conventional Storage Systems
Pros:
- Enterprise-class support and quality
- Long-term lifecycle/release management
- Appliance-based model
Cons:
- Expensive license and support
- Locked-in/proprietary hardware
- Scalability and manageability issues (file system, namespace, data protection, disaster prevention, etc.)

Software-defined Storage
Pros:
- Enterprise-class support and quality
- Long-term lifecycle/release management
- Massively scalable, built for today's and emerging workloads
- Easy to manage: self-healing, non-disruptive upgrades
- Runs on inexpensive COTS hardware
Cons:
- Some solutions require additional software with a separate license
- Scalability varies between solutions
- Data migration is required with some solutions

Understanding Hadoop Technology and Storage Because Hadoop stores three copies of each piece of data by default, storage in a Hadoop cluster must be able to accommodate a large number of files. Traditional storage systems may not always work to support the Hadoop architecture. Hadoop clusters and HDFS can be combined with various storage systems, including network-attached storage (NAS), SANs and object storage.
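The sketch below, written against the standard org.apache.hadoop.fs.FileSystem client API, shows how an application sees HDFS storage: files under a path, each with a size and a replication count. The /data directory is a placeholder, and the client assumes a core-site.xml on the classpath pointing at the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListingSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // List the files under a (placeholder) directory and show how many
        // replicas HDFS keeps of each one.
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.printf("%s  %d bytes  replication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
        fs.close();
    }
}
```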

Software vendors have gotten the message that Hadoop is hot, and many are responding by releasing Hadoop connectors designed to make it easier for users to transfer information between traditional relational databases and the open source distributed processing system. Oracle, Microsoft and IBM are among the vendors that have begun offering Hadoop connector software as part of their overall big data management strategies. But it isn't just the relational database management system (RDBMS) market leaders that are getting in on the act.
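The vendor connectors themselves are proprietary, but the basic idea they package up can be sketched with plain JDBC plus the HDFS client API: read rows from a relational table and write them out as delimited text in HDFS. The database URL, table, columns and paths below are invented for illustration; a real connector (or a tool such as Apache Sqoop) adds parallelism, type mapping and fault handling on top of this.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JdbcToHdfsSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC settings; any RDBMS with a JDBC driver works the same way.
        String jdbcUrl = "jdbc:postgresql://dbhost:5432/sales";

        FileSystem fs = FileSystem.get(new Configuration());
        try (Connection db = DriverManager.getConnection(jdbcUrl, "user", "password");
             Statement stmt = db.createStatement();
             ResultSet rows = stmt.executeQuery("SELECT id, amount FROM orders");
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     fs.create(new Path("/staging/orders.csv")), StandardCharsets.UTF_8))) {
            // Write each relational row as one comma-separated line in HDFS.
            while (rows.next()) {
                out.write(rows.getLong("id") + "," + rows.getBigDecimal("amount"));
                out.newLine();
            }
        }
        fs.close();
    }
}
```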

Big Data Technologies Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Nine technologies commonly associated with big data are:
- Crowdsourcing
- Data fusion
- Data integration
- Genetic algorithms
- Machine learning
- Natural language processing
- Signal processing
- Time series
- Simulation

Crowdsourcing Crowdsourcing, a modern business term coined in 2005, is defined by Merriam-Webster as the process of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers. A portmanteau of "crowd" and "outsourcing", its more specific definitions are still heavily debated.

Data fusion Data fusion is the process of integrating multiple data sources representing the same real-world object into a consistent, accurate, and useful representation. Fusing the data from two sources (measurement #1 and measurement #2) can yield a classifier better than any classifier based on measurement #1 or measurement #2 alone. Data fusion techniques are often classified as low, intermediate or high, depending on the processing stage at which fusion takes place. Low-level data fusion combines several sources of raw data to produce new raw data, with the expectation that the fused data is more informative and synthetic than the original inputs.
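A minimal illustration of the low-level fusion just described: feature vectors from two measurement sources for the same object are concatenated into one combined vector that a downstream classifier can consume. The source names and values are made up for the example.

```java
import java.util.Arrays;

public class DataFusionSketch {
    // Low-level fusion: combine two raw measurement vectors for the same
    // real-world object into a single, richer feature vector.
    static double[] fuse(double[] measurement1, double[] measurement2) {
        double[] fused = Arrays.copyOf(measurement1, measurement1.length + measurement2.length);
        System.arraycopy(measurement2, 0, fused, measurement1.length, measurement2.length);
        return fused;
    }

    public static void main(String[] args) {
        double[] radar  = {0.8, 0.1};        // measurement #1 (placeholder values)
        double[] camera = {0.3, 0.9, 0.5};   // measurement #2 (placeholder values)

        // A classifier trained on the fused vector sees both views at once.
        System.out.println(Arrays.toString(fuse(radar, camera)));
    }
}
```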

Data integration Data integration involves combining data residing in different sources and providing users with a unified view of these data. This process becomes significant in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for instance). Data integration appears with increasing frequency as the volume of data and the need to share existing data explode. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved.
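A toy sketch of the "unified view" idea: records for the same customer live in two different source systems and are joined on a shared key into one view. The source systems, field names and values are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class DataIntegrationSketch {
    public static void main(String[] args) {
        // Source A: CRM system, keyed by customer id (placeholder data).
        Map<String, String> crmNames = new HashMap<>();
        crmNames.put("c42", "Ada Lovelace");

        // Source B: billing system, keyed by the same customer id.
        Map<String, Double> billingTotals = new HashMap<>();
        billingTotals.put("c42", 1250.0);

        // Unified view: join the two sources on the shared key.
        Map<String, String> unified = new HashMap<>();
        for (Map.Entry<String, String> e : crmNames.entrySet()) {
            Double total = billingTotals.getOrDefault(e.getKey(), 0.0);
            unified.put(e.getKey(), e.getValue() + " | lifetime spend: " + total);
        }
        unified.forEach((id, view) -> System.out.println(id + " -> " + view));
    }
}
```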

Genetic Algorithm In the field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural selection. This heuristic (also sometimes called a metaheuristic) is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.
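A compact sketch of the selection, crossover and mutation loop described above, applied to the classic OneMax toy problem (maximize the number of 1-bits in a string). Population size, mutation rate and generation count are arbitrary choices made for the example, not values from the slides.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class GeneticAlgorithmSketch {
    static final Random RNG = new Random(42);
    static final int GENOME = 32, POP = 20, GENERATIONS = 100;

    // Fitness: number of 1-bits (the OneMax problem).
    static int fitness(boolean[] g) {
        int f = 0;
        for (boolean b : g) if (b) f++;
        return f;
    }

    static boolean[] randomGenome() {
        boolean[] g = new boolean[GENOME];
        for (int i = 0; i < GENOME; i++) g[i] = RNG.nextBoolean();
        return g;
    }

    // Single-point crossover followed by per-bit mutation.
    static boolean[] offspring(boolean[] a, boolean[] b) {
        boolean[] child = new boolean[GENOME];
        int cut = RNG.nextInt(GENOME);
        for (int i = 0; i < GENOME; i++) {
            child[i] = i < cut ? a[i] : b[i];
            if (RNG.nextDouble() < 0.01) child[i] = !child[i];  // mutation
        }
        return child;
    }

    public static void main(String[] args) {
        boolean[][] pop = new boolean[POP][];
        for (int i = 0; i < POP; i++) pop[i] = randomGenome();

        for (int gen = 0; gen < GENERATIONS; gen++) {
            // Selection: keep the fitter half, then breed it to refill the population.
            Arrays.sort(pop, Comparator.<boolean[]>comparingInt(GeneticAlgorithmSketch::fitness).reversed());
            for (int i = POP / 2; i < POP; i++) {
                pop[i] = offspring(pop[RNG.nextInt(POP / 2)], pop[RNG.nextInt(POP / 2)]);
            }
        }

        int best = 0;
        for (boolean[] g : pop) best = Math.max(best, fitness(g));
        System.out.println("best fitness = " + best + " / " + GENOME);
    }
}
```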

Machine learning Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions. Machine learning is closely related to, and often overlaps with, computational statistics, a discipline that also specializes in prediction making. It has strong ties to mathematical optimization, which supplies methods, theory and application domains to the field. Machine learning is used in a range of computing tasks where designing and programming explicit algorithms is infeasible.
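To ground the "build a model from example inputs, then predict" idea, here is a minimal one-variable least-squares linear regression in plain Java. The training points are made up for the example; real machine-learning work would use a library and far more data.

```java
public class LinearRegressionSketch {
    public static void main(String[] args) {
        // Example inputs: (x, y) pairs the model learns from (placeholder data).
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 4.0, 6.2, 7.9, 10.1};

        // Fit y ~ slope * x + intercept by ordinary least squares.
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;

        // Data-driven prediction for an unseen input, rather than a hard-coded rule.
        double prediction = slope * 6 + intercept;
        System.out.printf("model: y = %.3f * x + %.3f, prediction(6) = %.3f%n",
                slope, intercept, prediction);
    }
}
```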

Natural language processing Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, while others involve natural language generation.

Signal processing Signal processing is an enabling technology that encompasses the fundamental theory, applications, algorithms, and implementations of processing or transferring information contained in many different physical, symbolic, or abstract formats broadly designated as signals. It uses mathematical, statistical, computational, heuristic, and linguistic representations, formalisms, and techniques for representation, modelling, analysis, synthesis, discovery, recovery, sensing, acquisition, extraction, learning, security, or forensics.

Time series A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Examples of time series are ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average. Time series are frequently plotted via line charts. They are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, intelligent transport and trajectory forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering that involves temporal measurements.
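As a small worked example of time-series processing, the sketch below smooths a sequence of daily closing values with a simple moving average, one of the most common first steps in time-series analysis. The closing values and window size are invented for the example.

```java
import java.util.Arrays;

public class MovingAverageSketch {
    // Simple moving average with the given window over successive measurements.
    static double[] movingAverage(double[] series, int window) {
        double[] out = new double[series.length - window + 1];
        for (int i = 0; i < out.length; i++) {
            double sum = 0;
            for (int j = i; j < i + window; j++) sum += series[j];
            out[i] = sum / window;
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder daily closing values, ordered in time.
        double[] dailyClose = {101.2, 102.8, 101.9, 103.4, 104.1, 103.7, 105.0};
        System.out.println(Arrays.toString(movingAverage(dailyClose, 3)));
    }
}
```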

Simulation Simulation is the imitation of the operation of a real-world process or system over time. The act of simulating something first requires that a model be developed; this model represents the key characteristics or behaviors/functions of the selected physical or abstract system or process. The model represents the system itself, whereas the simulation represents the operation of the system over time.
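A tiny time-stepped simulation in the spirit of that definition: the "model" is a single checkout counter with random customer arrivals, and the "simulation" runs that model forward minute by minute. The arrival and service probabilities are arbitrary example values.

```java
import java.util.Random;

public class QueueSimulationSketch {
    public static void main(String[] args) {
        Random rng = new Random(7);
        int queueLength = 0;

        // Model: each minute one customer arrives with probability 0.6,
        // and the counter finishes serving one customer with probability 0.5.
        for (int minute = 1; minute <= 60; minute++) {
            if (rng.nextDouble() < 0.6) queueLength++;                      // arrival
            if (queueLength > 0 && rng.nextDouble() < 0.5) queueLength--;   // service
            System.out.println("minute " + minute + ": queue length = " + queueLength);
        }
    }
}
```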

Hadoop HDFS Architecture Hadoop provides a distributed filesystem and a framework for the analysis and transformation of very large data sets using the MapReduce [DG04] paradigm. While the interface to HDFS is patterned after the Unix filesystem, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers. Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data, with the largest cluster being 4,000 servers. One hundred other organizations worldwide report using Hadoop.
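Tying the architecture back to the mapper and reducer sketched earlier, the driver below configures and submits a job so that the framework can run the map tasks in parallel, close to the HDFS blocks that hold the input. The input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // mapper sketched earlier
        job.setCombinerClass(SumReducer.class);      // local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input splits are derived from HDFS blocks, so map tasks run near their data.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```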

Why Big Data Data are now woven into every sector and function in the global economy, and, like other essential factors of production such as hard assets and human capital, much of modern economic activity simply could not take place without them. The use of Big Data (large pools of data that can be brought together and analyzed to discern patterns and make better decisions) will become the basis of competition and growth for individual firms, enhancing productivity and creating significant value for the world economy by reducing waste and increasing the quality of products and services.

Why HadoopApache Hadoop enables big data applications for both operations and analytics and is one of the fastest-growing technologies providing competitive advantage for businesses across industries. Hadoop is a key component of the next-generation data architecture, providing a massively scalable distributed storage and processing platform. Hadoop enables organizations to build new data-driven applications while freeing up resources from existing systems. MapR is a production-ready distribution for Apache Hadoop.

Future of Big Data Clearly Big Data is still in its beginnings, and there is much more to be discovered. For most organizations it is currently just a cool keyword, because it has great potential and few truly understand what it is all about. A clear sign that there is much more to big data than is currently on the market is that the big software companies either do not yet have, or do not showcase, their Big Data solutions, and those that do, like Google, do not use them in a commercial way. Organizations need to decide what kind of strategy to use to implement Big Data. They could take a more progressive approach and move all their data to the new Big Data environment, so that all reporting, modeling and querying is executed using the new business intelligence built on Big Data. This approach is already used by many analytics-driven organizations, which put all their data in a Hadoop environment and build business intelligence solutions on top of it.

Future of Hadoop
- Dynamic caching
- Multiple network interface support
- NVRAM support
- Hardware Security Modules

Dynamic caching
- Access-pattern-based caching of hot data (LRU, LRU2; see the sketch below)
- Caching of partial blocks
- Dynamic migration of data between storage tiers
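The LRU policy mentioned above can be sketched in a few lines with java.util.LinkedHashMap in access order; a real HDFS caching tier would of course operate on blocks and storage media rather than an in-memory map, so this is only an illustration of the eviction rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: access-ordered LinkedHashMap that evicts the
// least-recently-used entry once capacity is exceeded.
public class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCacheSketch(int capacity) {
        super(16, 0.75f, true);  // accessOrder = true gives LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> hotBlocks = new LruCacheSketch<>(2);
        hotBlocks.put("blk_1", "data");
        hotBlocks.put("blk_2", "data");
        hotBlocks.get("blk_1");          // touch blk_1 so it becomes most recent
        hotBlocks.put("blk_3", "data");  // evicts blk_2, the least recently used
        System.out.println(hotBlocks.keySet());  // [blk_1, blk_3]
    }
}
```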

Multiple network interface support
- Better aggregated bandwidth utilization
- Isolation of traffic

NVRAM support
- Better durability without write performance cost
- File system metadata in NVRAM for better throughput

Hardware Security Modules
- Better key management
- Processing that requires higher security runs only on these nodes
- An important requirement for financial services and healthcare