We are in the era of data explosion. By 2020, there will be more than 50 billion connected devices and 200 billion sensors, and they will generate tens of zettabytes of data. This unprecedented abundance of data presents transformative opportunities, yet it also brings significant challenges for businesses trying to exploit it.
Bringing the Information Together: Data Integration
In addition to its enormous volume, big data comes from a variety of sources. Bringing the data together and providing users with a unified view has therefore become critically important. Data integration addresses this need, and it is nothing new. Early architectures for integrating data from heterogeneous databases include the Integrated Public Use Microdata Series (IPUMS) designed by the University of Minnesota, mediated schemas, semantic integration, and more. With big data growing rapidly and traditional systems unable to keep up, organizations started adopting data lake solutions such as Apache Hadoop for data integration. A data lake helps create one repository for data in an organization, including structured data, semi-structured data, unstructured data, and binary data. The organization then expands its data lake with data processing and analytics capabilities to derive business insight. As data lakes are adopted, a new problem surfaces: the organization already has systems that it uses for insight and decision making, so how can it bring these insights together? A data hub tries to address this problem by delivering a unified platform that brings diverse data sources together and lets users gain insight through a collection of frameworks for data processing, interactive analytics, and real-time processing.
Getting the Right Information: Data Governance
Effective use of data integration for decision-making relies on more than having the best data lake or data hub architecture. In transportation, financial services, manufacturing, and other verticals, data arrives in real-time, continuous feeds and is naturally messy. It is challenging for organizations to identify the “relevant” and “trustworthy” information, and doing so requires continuous governance of the incoming data. The data needs to be well understood, noise and abnormalities need to be removed, and the data needs to become meaningful to the problem being analyzed. Enterprise-grade data governance addresses several aspects: auditability, metadata and taxonomy management, and data lifecycle management. Auditability ensures consistent ways to track data access, data origin, and data usage. Metadata describes other data, for example the attributes that define or describe a document. Taxonomy categorizes the data into hierarchical relationships. Successful metadata and taxonomy management ensures that data is managed consistently in the data lake or hub and provides a meaningful context for understanding its value. Data lifecycle management tracks the data from ingestion through storage, recovery, and backup to retirement, and ensures that the data is of high quality and accessible to the relevant users.
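The governance aspects above can be made concrete with a small sketch. The following Python example is illustrative only: the class name, field names, and lifecycle stage labels are assumptions made for this example, not part of any particular data lake or governance product. It shows one record carrying an origin and access log (auditability), descriptive metadata, a hierarchical taxonomy path, and a lifecycle stage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical stage labels, loosely following the lifecycle described in the text.
LIFECYCLE_STAGES = ("ingested", "stored", "backed_up", "retired")


@dataclass
class DatasetRecord:
    name: str
    origin: str                    # where the data came from (auditability)
    taxonomy: tuple                # hierarchical category path, broadest first
    metadata: dict = field(default_factory=dict)   # descriptive attributes
    stage: str = "ingested"        # current lifecycle stage
    audit_log: list = field(default_factory=list)  # who accessed the data, and why

    def record_access(self, user, purpose):
        """Append an audit entry so data access and usage can be traced later."""
        self.audit_log.append({
            "user": user,
            "purpose": purpose,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def advance(self):
        """Move to the next lifecycle stage; a retired dataset stays retired."""
        i = LIFECYCLE_STAGES.index(self.stage)
        if i + 1 < len(LIFECYCLE_STAGES):
            self.stage = LIFECYCLE_STAGES[i + 1]
        return self.stage


def in_category(record, prefix):
    """True if the record's taxonomy path starts with the given category path."""
    return record.taxonomy[:len(prefix)] == tuple(prefix)
```

A catalog built from such records could answer questions like "which telemetry datasets under transportation have been accessed, and by whom", which is the kind of context the governance functions are meant to provide.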
Generating Real Time Insight: Streaming Technologies
Great data means nothing if it does not deliver the needed business insight. Organizations have broadly used data analytics to gain insight for improving efficiency, outperforming competitors, and creating growth opportunities. However, with the fast growth of IoT and connected devices, organizations are no longer satisfied with back-end analytics alone. There is a clear trend toward end-to-end real-time analytics, which connects information from its sources to decisions almost instantly. Financial services, healthcare, transportation, and other sectors are accelerating investment in streaming analytics from client to cloud and the data center. Apache Spark, Apache Flink, and Apache Storm are open source examples that support streaming analytics. Streaming analytics allows continuous ingestion of live data from IoT or other connected devices, provides faster insights, and triggers actions promptly, before the data loses its value. For example, the financial industry, with its wide range of products, services, and customer interaction channels, has seen fraud risks increase tremendously. Financial institutions are turning to streaming technologies to analyze huge volumes of data instantly in order to identify and stop fraudulent behavior. Streaming analytics also gives healthcare providers the opportunity to monitor patients' health continuously, detect and diagnose disease at the earliest signs, and deliver prompt, personalized treatment. Companies like Intel have been working with healthcare providers on a collaborative cancer cloud and on genomics analytics.
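To illustrate the idea of acting on data as it arrives, here is a minimal sliding-window sketch in plain Python. It does not use the Spark, Flink, or Storm APIs, and the rule it applies (flag an account that produces more than a set number of events within a time window) is a hypothetical example chosen for simplicity, not a real fraud model. The class name, parameters, and thresholds are all assumptions.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Flags an account when it exceeds max_events within window_seconds.

    Assumes timestamps (in seconds) arrive in non-decreasing order per account.
    """

    def __init__(self, window_seconds=60, max_events=3):
        self.window_seconds = window_seconds
        self.max_events = max_events
        self._recent = defaultdict(deque)  # account id -> timestamps in window

    def observe(self, account_id, timestamp):
        """Process one event as it arrives; return True if it looks suspicious."""
        window = self._recent[account_id]
        window.append(timestamp)
        # Evict events that have fallen out of the time window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_events
```

Each event is evaluated the moment it is observed, using only a small amount of per-account state, rather than being stored first and analyzed later in a batch. A production system would typically express the same windowed logic in a streaming framework and distribute the state across many nodes.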
Making the Insight Adaptive: Webscale Artificial Intelligence
Big data analytics needs to analyze large collections of data, extract complex data representations, and infer patterns in order to generate insight. As data volume, variety, velocity, and veracity grow exponentially, discovering those patterns is becoming more complicated. Artificial Intelligence (AI), with its biologically inspired capacity for hierarchical learning and layered data abstraction, helps address these problems. AI has existed for decades, yet it gained significant momentum only in recent years, because compute power and data volume were previously inadequate.
Training performance has been the conventional focus for improving AI. However, the bigger challenge for AI in big data analytics is scalability: data scalability, model scalability, and node scalability. Organizations that apply AI to big data infrastructure find that resolving scalability is more urgent than improving training performance. There have been some good attempts at improving scalability. A few years ago, Google developed a software framework, DistBelief, to train deep learning models with billions of parameters using tens of thousands of CPU cores for speech recognition and computer vision. In addition to model and compute scalability, DistBelief also supported data parallelism. In recent years, Intel reported collaborating with customers to improve model and data scalability through Apache Spark, resulting in machine learning capability that runs on hundreds or even thousands of CPU servers, with tens of billions of unique features and/or graphs with billions of edges. Another need for these organizations is an end-to-end pipeline for AI, from data ingestion, data management, feature management, and feature engineering to model training and validation, to simplify AI implementation on big data infrastructure.
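Data parallelism, mentioned above, can be sketched in a few lines. The following toy example is an assumption-laden illustration and not a description of DistBelief or of any Intel or Spark-based system: it fits a one-variable linear model with synchronous gradient descent, splits the data into shards, computes a gradient per shard (the step that separate workers would perform in parallel), and averages the results before each update. All function names and parameter values are hypothetical.

```python
def shard_gradient(w, b, shard):
    """Summed squared-error gradient for y ~ w*x + b over one data shard."""
    gw = gb = 0.0
    for x, y in shard:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    return gw, gb, len(shard)


def data_parallel_step(w, b, shards, lr):
    """One synchronous update: per-shard gradients are combined, then applied."""
    # In a real cluster each call below would run on a different worker.
    results = [shard_gradient(w, b, shard) for shard in shards]
    n = sum(count for _, _, count in results)
    gw = sum(g for g, _, _ in results) / n
    gb = sum(g for _, g, _ in results) / n
    return w - lr * gw, b - lr * gb


def train(data, num_workers=4, lr=0.2, steps=2000):
    """Split the data across workers and run synchronous gradient descent."""
    shards = [data[i::num_workers] for i in range(num_workers)]
    w = b = 0.0
    for _ in range(steps):
        w, b = data_parallel_step(w, b, shards, lr)
    return w, b
```

Because the per-shard gradients are summed and divided by the total sample count, the update matches what a single machine would compute over the full data set, up to floating-point rounding. That equivalence is what lets the data grow across more nodes without changing the model being trained.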
Ninety percent of the data in the world today was generated in the past few years. The power of big data analytics is that it can process large and complex data sets very quickly and generate better insight than conventional methods. An effective analytics solution takes well-designed data integration, data governance, streaming, and AI, together with a holistically optimized solution stack. Although data analytics has not reached maturity, fast technology advancement has made real-time, meaningful insight possible, enabling efficient business decisions that improve our lives.