Data lake technology typically involves a single namespace with petabytes of storage. It may seem tempting to split off a section of the primary space and use it as backup storage and DR archiving, since this can be done without copying any data, which is always attractive, even more so with unstructured big data.
One has to remember the purpose of backup and DR before heading there. Backup is a way of taking a copy of data offline, but making it readily available if required. Archiving for DR means the data is not only offline, but it is geographically distant from the other working copies.
Simply using a shard of the data lake won't cut it. The data is still local and still online within the working systems. This would leave the data open to any hack that deleted or corrupted files and doesn't provide an umbrella during hurricanes. A separate data lake use case for backup and archiving is one solution, but this involves copying the data between zones.
Can we fix the issues within the lake or must we look outside? There are a couple of major data lake use cases. Let's assume the data lake is all in the same public or private cloud. If we create a container around a replica of the data we want to archive, we can manipulate the virtual LAN structure to take access out of the user space, effectively making it offline. This isn't as good as tape sent to the salt mine, but, with encryption of the data, it should keep out the hackers. Encryption might need to run as a background job after the separation is made.
Archiving demands geographic diversity in where data is stored. This is really a requirement for most of the active data, as well as the archives -- there should be a remote copy. This is certainly true if the data is worth archiving. How this is done is a function of the data protection used with the active data. If simple replication is used, two replicas should be stored in cloud zones far enough away from each other that a single disaster cannot hit both. Erasure coding is a bit more complex, since the fragments of an erasure set can be distributed across zones for protection, with the exact distribution dependent on the erasure coding choices.
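To make the zone-distribution point concrete, here is a minimal sketch of a placement check. It assumes an erasure set is described simply as a list of zone names, one per fragment, and that any `data_fragments` surviving fragments are enough to rebuild the data. The function name and zone labels are hypothetical, not part of any particular storage product. Simple replication fits the same check with `data_fragments=1`.

```python
from collections import Counter


def survives_zone_loss(placement, data_fragments, zones_lost=1):
    """Return True if the data can still be rebuilt after losing zones.

    placement      -- list of zone names, one entry per stored fragment
                      (data plus parity), or one entry per replica
    data_fragments -- minimum number of fragments needed to rebuild
                      (use 1 for simple replication)
    zones_lost     -- number of whole zones assumed to fail together
    """
    per_zone = Counter(placement)
    # Worst case: the zones holding the most fragments are the ones lost.
    worst_loss = sum(count for _, count in per_zone.most_common(zones_lost))
    return len(placement) - worst_loss >= data_fragments


# Hypothetical 6 data + 3 parity layout spread evenly over three zones.
spread = ["zone-a"] * 3 + ["zone-b"] * 3 + ["zone-c"] * 3
# The same nine fragments squeezed into two zones.
squeezed = ["zone-a"] * 5 + ["zone-b"] * 4

print(survives_zone_loss(spread, data_fragments=6))
print(survives_zone_loss(squeezed, data_fragments=6))
```

With the even three-zone spread, losing any one zone leaves six fragments, which is just enough. With only two zones, losing the larger one leaves four, so the archive would not survive. That is the sense in which the distribution depends on the erasure coding choices.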
Types of storage will affect the data lake use case. Mostly, DR goes to inexpensive bulk storage, which might mean a copy policy that "knows" storage profiles. Metadata and file system indices may be better off on SSD storage, where fast access can save a lot of restore time.
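A copy policy that "knows" storage profiles could look something like the sketch below. It is illustrative only: the `ArchiveObject` type, the `kind` labels and the tier names `ssd` and `bulk` are assumptions made for this example, not features of any specific data lake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveObject:
    path: str
    kind: str  # hypothetical labels: "metadata", "index" or "data"


def choose_tier(obj):
    """Pick a storage tier for one object being copied to the DR archive."""
    # Metadata and file system indices go to SSD so a restore can
    # locate files quickly; everything else goes to cheap bulk storage.
    if obj.kind in {"metadata", "index"}:
        return "ssd"
    return "bulk"


def plan_copy(objects):
    """Group object paths by the tier they should be copied to."""
    plan = {"ssd": [], "bulk": []}
    for obj in objects:
        plan[choose_tier(obj)].append(obj.path)
    return plan


objects = [
    ArchiveObject("/lake/_index/catalog.db", "index"),
    ArchiveObject("/lake/_meta/schema.json", "metadata"),
    ArchiveObject("/lake/raw/clickstream-0001.parquet", "data"),
]
print(plan_copy(objects))
```

In practice, the rule inside `choose_tier` would be driven by whatever storage profiles the platform exposes. The point is only that the policy decides placement per object rather than sending the whole archive to one class of storage.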