

Data Lake

A Data Lake is a pool of structured and unstructured data, stored as-is and without a specific purpose in mind, that can be “built on multiple technologies such as Hadoop, NoSQL, Amazon Simple Storage Service, a relational database, or various combinations thereof”. Data is transformed and a schema is applied only later, when an analysis calls for it.

A Data Lake allows multiple points of collection and multiple points of access for large volumes of data. A Data Lake is characterized by three key attributes:

Collect everything: A Data Lake contains all data, both raw sources over extended periods of time as well as any processed data.
Dive in anywhere: A Data Lake enables users across multiple business units to refine, explore and enrich data on their terms.
Flexible access: A Data Lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory, and other processing engines.

Analyze Data Forward and Backward in Time

The Data Lake allows collection of data for future needs before it’s possible to know what those needs are, so it has tremendous potential. Data is not limited by the scope of thinking present when the data is captured, but is free to answer questions we don’t yet know to ask: “Data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise”.

Data Lakes Born out of Social Media Giants

The basic concepts behind Hadoop were devised by Google to meet its need for a flexible, cost-effective data processing model that could scale as data volumes grew faster than ever. Yahoo, Facebook, Netflix, and others whose business models are also based on managing enormous data volumes quickly adopted similar methods. Cost was certainly a factor, as Hadoop can be 10 to 100 times less expensive to deploy than conventional data warehousing. Another driver of adoption has been the opportunity to defer labor-intensive schema development and data cleanup until an organization has identified a clear business need. Data Lakes are also better suited to the less-structured data these companies needed to process.

Data Lakes compared to Data Warehouses

Depending on its requirements, a typical organization will need both a data warehouse and a data lake, as they serve different needs and use cases.

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications. The data structure and schema are defined in advance to optimize for fast SQL queries, whose results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can act as the “single source of truth” that users can trust.

A data lake is different because it stores both relational data from line of business applications and non-relational data from mobile apps, IoT devices, and social media. The structure of the data, or schema, is not defined when data is captured. This means you can store all of your data without careful design and without knowing which questions you might need answered in the future. Different types of analytics, such as SQL queries, big data analytics, full text search, real-time analytics, and machine learning, can then be used to uncover insights.
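As a rough illustration of applying a schema at read time rather than at capture time, the sketch below keeps a handful of raw records exactly as they arrived and lets each analysis choose its own fields and types. The records, field names, and the `read_with_schema` helper are all hypothetical and exist only for this example; a real lake would hold files in object storage or HDFS rather than an in-memory buffer.

```python
import io
import json

# Raw events stored exactly as the sources produced them.
# The records and field names are invented for illustration.
RAW_EVENTS = io.StringIO(
    '{"source": "mobile", "user": "u1", "amount": "19.99", "ts": "2021-03-01"}\n'
    '{"source": "iot", "device": "d7", "temp_c": 21.5, "ts": "2021-03-01"}\n'
    '{"source": "web", "user": "u2", "amount": 5, "ts": "2021-03-02"}\n'
)


def read_with_schema(lines, schema):
    """Apply a schema at read time (hypothetical helper).

    Keeps only records that contain every field named in `schema`
    and casts each of those fields to the requested type.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append(
                {field: cast(record[field]) for field, cast in schema.items()}
            )
    return rows


# One analysis decides that it cares about purchases.
purchase_schema = {"user": str, "amount": float}
purchases = read_with_schema(RAW_EVENTS, purchase_schema)
total = sum(row["amount"] for row in purchases)

# A later analysis asks a different question of the same raw data.
RAW_EVENTS.seek(0)
sensor_schema = {"device": str, "temp_c": float}
readings = read_with_schema(RAW_EVENTS, sensor_schema)
```

Neither schema had to exist when the events were written, and the raw records stay untouched for whatever question comes next.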

Characteristics	Data Warehouse	Data Lake
Data	Relational from transactional systems, operational databases, and line of business applications	Non-relational and relational from IoT devices, web sites, mobile apps, social media, and corporate applications
Schema	Designed prior to the DW implementation (schema-on-write)	Written at the time of analysis (schema-on-read)
Performance	Fastest query results using higher cost storage	Query results are getting faster using low-cost storage
Data Quality	Highly curated data that serves as the central version of the truth	Any data that may or may not be curated (i.e. raw data)
Users	Business analysts	Data scientists, data developers, and business analysts (using curated data)
Agility	Less agile, fixed configuration - cumbersome and time-consuming to change the structure of a data warehouse due to the number of business processes tied to it	Highly agile, configure and reconfigure as needed - relatively easy to make changes to models and queries
Analytics	Batch reporting, BI, and visualizations	Machine learning, predictive analytics, data discovery, and profiling

Which Approach Should I Choose

The warehouse can continue to operate as it always has while you start filling your lake with new data sources. You can also use the lake as an archive repository for data that you roll off the warehouse, keeping that data available so that your users have access to more data than they have ever had before. As your warehouse ages, you may consider moving it into the data lake, or you may continue to offer a hybrid approach.
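The archive idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: an in-memory SQLite database stands in for the warehouse, a temporary directory stands in for the lake, and the `sales` table, its columns, and the `roll_off` helper are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def roll_off(conn, lake_dir, cutoff):
    """Copy warehouse rows older than `cutoff` into the lake as JSON lines,
    then delete them from the warehouse. Returns the number of rows moved.

    Dates are ISO strings (YYYY-MM-DD), so string comparison orders them.
    """
    conn.row_factory = sqlite3.Row
    old_rows = conn.execute(
        "SELECT * FROM sales WHERE sold_on < ? ORDER BY sold_on", (cutoff,)
    ).fetchall()

    archive = Path(lake_dir) / f"sales_before_{cutoff}.jsonl"
    with archive.open("w", encoding="utf-8") as out:
        for row in old_rows:
            out.write(json.dumps(dict(row)) + "\n")

    # Delete only after the archive file has been written and closed.
    with conn:
        conn.execute("DELETE FROM sales WHERE sold_on < ?", (cutoff,))
    return len(old_rows)


# Stand-in warehouse with a fixed, predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, sold_on TEXT, amount REAL)"
)
with conn:
    conn.executemany(
        "INSERT INTO sales (sold_on, amount) VALUES (?, ?)",
        [("2015-06-30", 120.0), ("2016-01-15", 80.5), ("2023-02-01", 42.0)],
    )

# Stand-in lake: a plain directory of files.
lake_dir = tempfile.mkdtemp()
moved = roll_off(conn, lake_dir, "2020-01-01")
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
archived = [
    json.loads(line)
    for line in (Path(lake_dir) / "sales_before_2020-01-01.jsonl")
    .read_text(encoding="utf-8")
    .splitlines()
]
```

The warehouse keeps only recent rows, while the older rows remain readable in the lake for anyone who needs the longer history.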
