Modern Data Lake
The data lake idea focuses on storing all analyzable data sets, in raw or lightly processed form, in easily expandable scale-out Hadoop infrastructure so that the fidelity of the data is preserved. The modern data lake will be a “managed” data lake, meaning one that uses a data lake management platform to manage data ingestion, apply metadata and enable data governance so that users know what’s in the lake and can use the data with confidence. These factors help keep the data lake from turning into a “data swamp”.
Modern managed data lakes allow businesses to explore their data quickly and easily, identify opportunities for business and process improvements across the organization, and better understand their customers. Cybersecurity and personal data protection play a vital role in every organization. This is a very big data problem, as threats come from sources both inside and outside an organization. The General Data Protection Regulation (GDPR) is intended to standardize expectations and protect personally identifiable information on employees, clients, and applicable data subjects. This means cybersecurity data collection and analysis has to become proactive and always on. Data governance is important in securing the data in the data lake through the organization’s procedures and policies, which manage the usage, availability, quality, privacy and security of data.
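As a rough illustration of the managed-lake idea described above, the sketch below stores a data set byte-for-byte and records descriptive metadata about it at ingestion time. It is a minimal sketch, not how any particular data lake management platform works: the function names (ingest, verify), the catalog layout and the "raw" folder are all hypothetical, and a local directory stands in for Hadoop storage.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest(lake_root: Path, catalog: dict, name: str, raw: bytes, source: str) -> dict:
    """Store raw bytes unchanged and record descriptive metadata in a catalog.

    Hypothetical helper: `catalog` is a plain dict standing in for a metadata store.
    """
    target = lake_root / "raw" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(raw)  # kept byte-for-byte so the fidelity of the data is preserved

    entry = {
        "path": str(target),
        "source": source,
        "size_bytes": len(raw),
        "sha256": hashlib.sha256(raw).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    catalog[name] = entry
    return entry


def verify(catalog: dict, name: str) -> bool:
    """Check that the stored file still matches the checksum recorded at ingestion."""
    entry = catalog[name]
    data = Path(entry["path"]).read_bytes()
    return hashlib.sha256(data).hexdigest() == entry["sha256"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        catalog = {}
        ingest(Path(tmp), catalog, "clicks.json", b'{"user": 1, "page": "/home"}', "web-logs")
        print(json.dumps(catalog["clicks.json"], indent=2))
        print("intact:", verify(catalog, "clicks.json"))
```

The catalog entry is what lets users know what is in the lake (source, size, ingestion time), and the checksum gives a simple way to confirm that the data has not changed since it was ingested.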
Data Lake Market Growth
Streaming data from high-velocity, high-volume sources, at companies small and large, is contributing to explosive growth of the data lake market. Streaming data can be:
Constantly generated in small bursts, as with IoT devices and product logs
Accumulated quickly – Billions of records occupying hundreds of petabytes
Unstructured or semi-structured in its original form
Data Lake growth drivers
Evolution of cloud-based data lakes, such as those built on Amazon S3 or Azure Data Lake Store, which offload maintenance effort and take care of auto-scaling, encryption and ease of management
Availability of choice and better accessibility with the emergence of data lake querying tools based on SQL and ad-hoc querying
Varied pricing options for agile businesses, especially with the separation of storage and compute in data lakes
Desire to give data scientists access to enterprise data for exploration, discovery, and insight creation
Effect of latest trends on Data Lake
Trends like IoT, AI and live streaming data have created an enormous amount of data being thrown into data lakes. Though managing a data lake or data hub to derive business insights used to be expensive, moving data lakes to the cloud, with its elastic infrastructure, has eased some of those pains. Cloud-based data lakes tend to offer better availability than what is guaranteed on-premises, faster time to value and time to deploy for new projects, and a natural fit for data sources and applications that are already cloud-based and scalable. Data lakes are evolving to handle different modes of data movement (streaming, for example) and different formats of data that are either consumed or generated by IoT or AI projects.
Data Lake Vs Data Warehouse
Data moves through a data lake much faster than through a data warehouse, reducing latency and shortening the time to analyzed data. Data lakes use a “schema-on-read” approach, accelerating insights and lowering the cost of acquiring new types of high-volume data such as sensor data, social media data and clickstream data. Data arrives in a data lake continuously and is not confined to periodic updates, as it is with data warehouses, which allows for real-time analysis of streaming data.
Though there are significant differences between data lakes and data warehouses, they complement each other in an enterprise environment. A data lake becomes a landing and staging area for the data warehouse and other downstream applications. Meanwhile, the data warehouse serves as a repository of integrated, historical data that is dimensionally modelled and can be queried in an ad hoc manner by concurrent business users.
Data Lakes provide flexibility in the ability to apply structure to data after it has arrived in the Data Lake. Data warehouses often demand well-defined structures up front into which data is fit.
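The schema-on-read approach described above can be sketched in a few lines. In this minimal example, raw JSON records are kept exactly as they arrived, and each reader applies its own schema only when it queries the data. The sample records, the SENSOR_SCHEMA mapping and the read_with_schema function are all made up for illustration; real data lake query engines are considerably more involved.

```python
import json

# Raw events as they might land in the lake: heterogeneous, no schema enforced on write.
RAW_LINES = [
    '{"device": "t-1", "temp_c": "21.5", "ts": "2024-01-01T00:00:00Z"}',
    '{"device": "t-2", "temp_c": 19, "extra": "ignored"}',
    '{"user": "u-9", "page": "/home"}',
]

# A schema chosen by the reader at query time: field name -> converter.
SENSOR_SCHEMA = {"device": str, "temp_c": float}


def read_with_schema(lines, schema):
    """Yield records that fit the schema; skip those missing fields or failing conversion."""
    for line in lines:
        record = json.loads(line)
        try:
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, TypeError, ValueError):
            continue  # record does not fit this reader's view; it stays in the lake untouched


if __name__ == "__main__":
    for row in read_with_schema(RAW_LINES, SENSOR_SCHEMA):
        print(row)
```

With SENSOR_SCHEMA, the two sensor records are returned with temp_c converted to a float, and the clickstream record is skipped. A different reader could pass a schema such as {"user": str, "page": str} over the same raw lines and get the clickstream record instead, without the stored data being rewritten. A data warehouse would instead reject or reshape the non-conforming records at load time.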
Limitations of Data Lake
Data lakes lack strong governance, metadata, security and other standard features of enterprise software, though managed data lakes are starting to address some of these limitations.
As companies move beyond piloting advanced analytics projects to running data lakes in production and at scale, they have found the software tools quite expensive. The tools also require investment in the people and products needed to manage the open source functionality and infrastructure.
Data Lake play in industry verticals
The basic nature of the data lake concept is industry agnostic. A data lake is a storage system that can store large amounts of data in its original format until required by advanced analytic and visualization applications to derive insights. Verticals simply differ by how they use the data, in other words, how the data is transformed to be used by a specific organization or industry.
For example, government initiatives like Smart Cities, which can manage traffic, optimize power grids, enhance education systems, track vehicle patterns and connected vehicles, and much more, generate massive amounts of data. Data lakes and analytics can assist cities in building more sustainable, livable environments.
The banking industry is also moving fast towards making the most of its growing data and monetizing it. Having entered the big data universe, banks now rely on data lakes to bring all their data together and accelerate the data-to-action cycle. Banks speed up the path from data to insightful reports by using a Hadoop-based data lake synced with visualization and analytics tools.
SAS & Data Lake
SAS is a leader in big data management and advanced analytics solutions that help customers make better decisions, faster. The following capabilities, all supported by SAS, can enable data management and analytics for the data lake and other valuable repositories:
Data integration - Data movement, in-database processing, and native data access to traditional and emerging data sources such as Hadoop.
Data quality - Cleanse, standardize, and enrich data in real time and batch with prebuilt rules.
Self-service big data preparation - Business users profile, cleanse, and transform data on Hadoop without writing code.
Business glossary and metadata management - Track lineage, business rules, descriptive details, and workflow for improved governance of your data assets.
Event stream processing - Analyze real-time streams of data in motion for better decisions.
Data virtualization - Provide blended, secure views of your data without moving it.
Hadoop support - Access, deliver, and process data inside Hadoop across both the data management and analytics life cycle.
Visualization and advanced analytics - Deliver cutting-edge visualization and analysis capabilities without requiring analytical skills.