Data Lake Failures: How to Avoid?
A data-lake is a single source of information for business users containing enterprise-level data that is used by various functional units. Therefore, whenever there’s a hindrance to the flow of the requisite information from an enterprise data-lake, the impact could be across the organization or just a few operational aspects, leading to business disruption that could be very costly.
Data-lake failures are generally of three types –
- A data-lake is totally down
- A data-lake is only partially available
- A data-lake is frequently unavailable
Regardless of the type of failure, a disruption to data flow has many effects.
- Business users’ functioning (decision-making/policy-making) gets affected
- BI / Reporting team’s work and delivery are affected
- Downstream business applications consuming the data are rendered ineffective
- People/users don’t get the data/information they are waiting for
The end result of all these failures, the impact on business, is clearly undesirable for the company. Organizations could face severe challenges, from operational logjams, reduced output and performance, and customer service or delivery failures to collateral damage, weakened investor sentiment, and dented top and bottom lines.
At MSRCosmos, we believe in empowering our customers to be proactive rather than resorting to remedial measures after a data-lake failure has occurred, although we do provide such measures when there is no viable alternative.
Accordingly, we propose to all our customers a multi-pronged approach for preventing data-lake failures.
The 5-pronged data-lake failure prevention strategy
Data-lake failures, as mentioned above, vary in the degree and magnitude of their impact. There are numerous reasons why data-lake failures and data-flow disruptions occur. Outages could stem from various factors: users, policies, infrastructure, lack of preparedness, lack of timely intervention, etc. Thus, our failure prevention strategy is closely tied to the various ways in which failures occur.
1. Data-lake security and policies
The first prong deals with data security and the policies that govern it.
Platform Access and Privileges
Who accesses the data platform, and the extent of the privileges granted with that access, needs to be clearly defined and constantly monitored. This is because some users, unintentionally or willfully, tamper with the data. For example, if someone accidentally deletes records that the data flow depends on, the results could be disastrous. Therefore, to nip such possibilities in the bud, you have to keep a check on user access.
A strong firewall around the enterprise network will not only make it difficult to breach but will also isolate it from intruders.
Data encryption will help ensure your enterprise data is protected and safe.
Have role-based, data-level security, which will ensure that only authorized personnel access documents, and only those documents that they have permission to access.
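As a minimal sketch of role-based, data-level security, the Python below maps roles to the datasets and actions they may use and denies everything else by default. The role names, dataset names, and the `is_allowed` helper are hypothetical and chosen only for illustration; a real platform would enforce this in its own access-control layer.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission map: role -> {dataset: allowed actions}.
ROLE_PERMISSIONS = {
    "analyst": {"sales": {"read"}},
    "engineer": {"sales": {"read", "write"}, "raw_events": {"read", "write"}},
    "admin": {
        "sales": {"read", "write", "delete"},
        "raw_events": {"read", "write", "delete"},
    },
}


@dataclass(frozen=True)
class User:
    name: str
    role: str


def is_allowed(user: User, dataset: str, action: str) -> bool:
    """Return True only if the user's role explicitly grants the action."""
    # Unknown roles and unknown datasets fall through to an empty set (deny).
    return action in ROLE_PERMISSIONS.get(user.role, {}).get(dataset, set())


analyst = User("asha", "analyst")
print(is_allowed(analyst, "sales", "read"))    # True
print(is_allowed(analyst, "sales", "delete"))  # False
```

Denying by default means an accidental delete by a read-only user is rejected before it can affect the data flow.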
2. Performance evaluation and scaling
If the data flow to the intended recipients, especially the downstream applications, is delayed because of insufficient speed, then the output and performance of those apps are severely affected. Therefore, it is crucial to arrive at an optimal speed and to maintain it consistently via a continuous, analytics-powered performance evaluation strategy.
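One simple way to make such an evaluation concrete is to track a high percentile of delivery latency and flag it when it exceeds an agreed target. The sketch below assumes latencies are collected in seconds; the 95th percentile, the 10-second target, and the function names are illustrative assumptions, not recommended values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def breaches_target(latencies_s, target_s, pct=95):
    """True when the chosen latency percentile is slower than the target."""
    return percentile(latencies_s, pct) > target_s


# Hypothetical delivery latencies (seconds) for one downstream application.
samples = [2, 3, 2, 4, 30, 3, 2, 3, 2, 3]
print(percentile(samples, 95))       # 30
print(breaches_target(samples, 10))  # True
```

A percentile is used rather than an average so that a few very slow deliveries are not hidden by many fast ones.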
Then comes the question of scalability, which is critically important for undertaking big-data analytics. As business operations grow, it is inevitable that the data size also grows. A data-lake that supports only a few TBs of data won’t help, and will collapse in the face of extensive data pouring in. Therefore, you need to obtain an optimal number of licenses as well as have a scalable infrastructure.
Big-data analytics frameworks such as Hadoop and Spark are designed to scale horizontally. Thus, as the data and/or processing grows, you can add more nodes to your cluster. This allows processing to continue without interruption. However, for this to work, the storage layer also has to scale linearly.
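The arithmetic behind horizontal scaling can be sketched as a rough capacity estimate. The example below assumes storage scales linearly with node count; the replication factor of 3, the 40 TB of usable disk per node, and the 25% free-space headroom are assumptions for illustration and should be replaced with your cluster's real figures.

```python
import math


def nodes_needed(data_tb, replication, usable_tb_per_node, headroom=0.25):
    """Estimate the node count needed to store data_tb, assuming linear storage scaling."""
    raw_tb = data_tb * replication        # every block is stored `replication` times
    required_tb = raw_tb / (1 - headroom)  # keep `headroom` of total capacity free
    return math.ceil(required_tb / usable_tb_per_node)


print(nodes_needed(100, 3, 40))  # 10
print(nodes_needed(200, 3, 40))  # 20: doubling the data doubles the nodes
```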
3. High Availability (HA) and Disaster Recovery (DR)
One of the most essential steps you’d have to take to prevent data-lake failures is to have the right HA measures. Having a spare server that is automatically invoked should there be any issues with the master server will also greatly reduce the chances of a data-lake failure.
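The idea of a spare server taking over automatically can be sketched as a priority-ordered health check. This is a conceptual illustration only: the server names and the `is_healthy` callback are hypothetical, and production clusters rely on the failover mechanisms built into their platform.

```python
def choose_active(servers, is_healthy):
    """Return the first healthy server in priority order (master first), or None."""
    for server in servers:
        if is_healthy(server):
            return server
    return None


# Hypothetical health state: the master has failed, the standby is up.
health = {"master": False, "standby": True}
print(choose_active(["master", "standby"], lambda s: health[s]))  # standby
```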
A few HA approaches that can be adopted:
- Metadata HA
Metadata HA is most helpful, almost critical, in the case of long-running cluster operations, as it includes critical information about the location of application data and the associated/related replicas.
- MapReduce HA
MapReduce HA helps job execution continue even when the related trackers and resource managers go down.
- NFS HA
Another effective HA measure is to mount the cluster via an HA-enabled NFS. This ensures continuous and undisrupted access both to the data that’s streaming in and to the applications that require random read/write operations.
- Rolling upgrades
Rolling upgrades are another good measure that helps minimize disruptions. Deploying updates to components incrementally avoids taking the whole system down at once. Further, by undertaking maintenance or software upgrades on the cluster a few nodes at a time, while the system continues to run, you can eliminate planned downtime.
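The "few nodes at a time" pattern described for rolling upgrades can be sketched as follows. The `upgrade` and `is_healthy` callbacks are hypothetical stand-ins for whatever your cluster tooling provides; the sketch only shows the batching and the stop-on-failure logic.

```python
def rolling_batches(nodes, batch_size):
    """Split the node list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def rolling_upgrade(nodes, batch_size, upgrade, is_healthy):
    """Upgrade one batch at a time and stop at the first batch that fails its health check.

    Returns (nodes upgraded and verified, the failed batch or an empty list).
    """
    done = []
    for batch in rolling_batches(nodes, batch_size):
        for node in batch:
            upgrade(node)
        if not all(is_healthy(node) for node in batch):
            return done, batch  # halt so the remaining nodes keep serving
        done.extend(batch)
    return done, []


nodes = ["n1", "n2", "n3", "n4", "n5"]
print(rolling_batches(nodes, 2))  # [['n1', 'n2'], ['n3', 'n4'], ['n5']]
```

Stopping at the first unhealthy batch keeps the untouched nodes on the known-good version while the problem is investigated.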
Another critical step towards data-lake failure prevention is to have a sturdy disaster recovery (DR) set-up.
Incorporate a Hadoop distribution into your DR strategy. It gives you the capacity to take a snapshot of a cluster at the volume level (all the data, including files and database tables). A snapshot is taken instantaneously and represents a consistent view of the data, because the state of a snapshot always remains the same.
Have a Converged Data Platform
Experience shows that backups alone may not be enough for disaster recovery. Therefore, it is prudent to set up a converged data platform for big-data disaster recovery. It will allow you to manage multiple big-data clusters across several locations and infrastructure types (cloud / on-premises) irrespective of who the service provider is, thus helping ensure that the data remains consistent and up to date between all clusters.
4. Effective data governance
Establish an effective data governance policy that covers how the data-lake is organized, what kind of recovery mechanisms are in place, and whether access is restricted to correct, authenticated users. These will help in easily regenerating information that has been, or could be, affected.
5. Semantic consistency
Semantic consistency is achieved when two data units satisfy strong consistency by having the same semantic meanings and data values. In practice, a semantic layer maintains the metadata that downstream apps need to check for any change in the data (for example, its columns), so that they can adjust accordingly before they start. Therefore, it is highly advisable to have a semantic layer on top of your raw data.
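A minimal sketch of the check a downstream app might run against the semantic layer before it starts is shown below. It assumes the metadata is available as a simple column-to-type mapping; the column names and the `schema_changes` helper are hypothetical.

```python
def schema_changes(registered, observed):
    """Compare the schema recorded in the semantic layer with the schema actually found."""
    added = {c: t for c, t in observed.items() if c not in registered}
    removed = {c: t for c, t in registered.items() if c not in observed}
    retyped = {
        c: (registered[c], observed[c])
        for c in registered
        if c in observed and registered[c] != observed[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical metadata: what the app expects versus what the raw data now contains.
registered = {"order_id": "int", "amount": "float", "region": "string"}
observed = {"order_id": "int", "amount": "string", "channel": "string"}

changes = schema_changes(registered, observed)
print(changes["added"])    # {'channel': 'string'}
print(changes["removed"])  # {'region': 'string'}
print(changes["retyped"])  # {'amount': ('float', 'string')}
```

If any of the three groups is non-empty, the app can adjust its column mapping or refuse to start instead of failing midway through a run.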
We believe that if these five steps are implemented properly, it is safe to say that there could be zero or minimal data-lake failures.