Data lake technology typically involves a single namespace with petabytes of storage. It may seem tempting to split off a section of the primary space and use it as backup storage and DR archiving, since this can be done without copying any data, which is always attractive, even more so with unstructured big data.
One has to remember the purpose of backup and DR before heading there. Backup takes a copy of the data offline while keeping it readily available if required. Archiving for DR means the data is not only offline, but also geographically distant from the other working copies.
Simply using a shard of the data lake won't cut it. The data is still local and still online within the working systems. That leaves it open to any hack that deletes or corrupts files, and it provides no umbrella during hurricanes. A separate data lake for backup and archiving is one solution, but this involves copying the data between zones.
Can we fix the issues within the lake, or must we look outside? There are a couple of major data lake scenarios; let's assume here that the data lake sits entirely within the same public or private cloud. If we create a container around a replica of the data we want to archive, we can manipulate the virtual LAN structure to take access out of the user space, effectively making the replica offline. This isn't as good as tape sent to the salt mine, but, with the data encrypted, it should keep out the hackers. Encryption might need to run as a background job after the separation is made.
Archiving demands geographic diversity in where data is stored. This is really a requirement for most of the active data as well as the archives: there should be a remote copy. This is certainly true if the data is worth archiving. How this is done depends on the data protection used with the active data. If simple replication is used, two replicas should be stored in cloud zones far enough away from each other that a single disaster cannot hit both. Erasure coding is a bit more complex, since the erasure set can be distributed across sites for protection, with the exact distribution dependent on the erasure coding choices.
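As a rough illustration of both placement checks, here is a minimal Python sketch. It assumes sites are described by latitude and longitude and that an erasure-coded object can be rebuilt from any `data_shards` of its shards. The function names, the distance threshold and the shard layouts are hypothetical, not taken from any particular storage product.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def replicas_far_enough(site_a, site_b, min_km):
    """Simple replication: are the two replica sites at least min_km apart?"""
    return distance_km(site_a, site_b) >= min_km


def survives_site_loss(shards_per_site, data_shards):
    """Erasure coding: does losing any single site still leave at least
    data_shards shards, so the data can be reconstructed?"""
    total = sum(shards_per_site.values())
    return all(total - lost >= data_shards for lost in shards_per_site.values())


if __name__ == "__main__":
    # Hypothetical layout: 8 data + 4 parity shards spread over three sites.
    layout = {"site-a": 4, "site-b": 4, "site-c": 4}
    print(survives_site_loss(layout, data_shards=8))
    # The same 12 shards over only two sites cannot lose either site.
    print(survives_site_loss({"site-a": 6, "site-b": 6}, data_shards=8))
```

The minimum separation is a policy decision that depends on the disasters being planned for, so it is left as a parameter rather than hard-coded.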
The types of storage involved will also affect the design. Mostly, DR data goes to inexpensive bulk storage, which might mean a copy policy that "knows" storage profiles. Metadata and file system indices may be better off on SSD storage, where fast access can save a lot of restore time.
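A copy policy that "knows" storage profiles could look something like the following Python sketch. The tier names, the object kinds and the `plan_copy` helper are hypothetical. The only idea taken from the text is that metadata and indices go to SSD while everything else defaults to inexpensive bulk storage.

```python
from collections import defaultdict

# Hypothetical storage profiles: small, restore-critical items on SSD,
# everything else on inexpensive bulk storage.
TIER_BY_KIND = {
    "metadata": "ssd",
    "index": "ssd",
    "data": "bulk",
}
DEFAULT_TIER = "bulk"


def pick_tier(kind):
    """Return the target tier for an object kind, defaulting to bulk storage."""
    return TIER_BY_KIND.get(kind, DEFAULT_TIER)


def plan_copy(objects):
    """Group (name, kind) pairs by the tier each should be copied to."""
    plan = defaultdict(list)
    for name, kind in objects:
        plan[pick_tier(kind)].append(name)
    return dict(plan)


if __name__ == "__main__":
    # Hypothetical objects selected for the DR archive.
    archive_set = [
        ("fs-index.db", "index"),
        ("catalog.json", "metadata"),
        ("sensor-logs-2017.tar", "data"),
        ("video-0001.bin", "unknown"),
    ]
    for tier, names in plan_copy(archive_set).items():
        print(tier, names)
```

Keeping the mapping in a plain table means the policy can be changed without touching the copy logic.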