Recovery options: Copy-on-write vs redirect-on-write snapshots

Snapshots are a very popular way to create virtual copies of an entire system in order to facilitate very quick (or even instant) recovery.  A properly designed snapshot-based recovery system can recover very large volumes in just minutes and can often do so to a point in time just minutes ago. In contrast, a typical restore of such size would likely take many hours and would typically lose at least a day’s worth of data.

There are two distinct approaches when it comes to creating snapshots: copy-on-write and redirect-on-write. Let’s talk about the advantages and disadvantages associated with each method, as they largely determine the impact on system performance, and therefore your ability to keep snapshots for a long time.

This article deviates from the commonly employed term “volume” and instead uses the phrase “protected entity” to refer to whatever a specific snapshot is safeguarding. Although the protected entity is typically a RAID volume, certain object-storage systems eschew RAID and may employ snapshots to protect alternative entities like object stores or NAS shares. In such cases, the protected entity may span several disk drives that do not necessarily reside within a RAID or LUN volume, hence the term “protected entity”.

All snapshot types share a common characteristic: they are virtual copies, not physical ones. If the protected entity itself is lost, the snapshot alone is useless. For instance, in the event of a triple disk failure within a RAID 6 volume, a snapshot would be of no assistance. An object storage system (or any storage system using erasure coding instead of RAID) may incorporate mechanisms to guard against a certain number of simultaneous failures, but more than that number of failures renders snapshots ineffective. To protect against media failures, it is therefore imperative to replicate or back up the snapshot onto another device, essentially creating a physical copy from a virtual one.

Note: because the term snapshot is now being used by cloud vendors, it is important to understand that here we are considering only traditional snapshots such as those found in NAS and storage arrays. When you use your IaaS vendor’s product and create a snapshot, you are actually creating a physical copy of that volume in object storage. It is really a backup, which should also be copied to another region and another account to protect it from a ransomware attack or natural disaster. These “snapshots” are very different from what we are discussing here.

When creating a snapshot, nothing happens on the array of hard drives housing the protected entity. The storage system simply takes note that the current state of the protected entity needs saving. The disparity between copy-on-write and redirect-on-write snapshots lies in how they store the previous versions of modified blocks, and these different methods have significant implications for performance.

How copy-on-write works.

Consider a copy-on-write system, which duplicates blocks before overwriting them with new data. In essence, when a block within the protected entity needs to be changed, the system copies that block to a separate snapshot area before it is overwritten. This approach uses three I/O operations for each write: one read and two writes. Prior to overwriting a block, its previous value must be read and written to a different location, followed by the write of the new data.
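The write path described above can be sketched as a toy model. This is a minimal illustration, not any vendor's implementation: the `CowVolume` class, its `io_ops` counter, and the single-snapshot design are all hypothetical. It also assumes a block is preserved only the first time it is overwritten after the snapshot, so the three-operation cost applies to that first overwrite.

```python
class CowVolume:
    """Toy copy-on-write volume holding at most one snapshot (hypothetical model)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)   # live blocks of the protected entity
        self.snapshot_area = None    # block index -> preserved old value
        self.io_ops = 0              # simulated I/O counter

    def take_snapshot(self):
        # No data moves at snapshot time; the system only starts tracking changes.
        self.snapshot_area = {}

    def write(self, index, data):
        snap = self.snapshot_area
        if snap is not None and index not in snap:
            old = self.blocks[index]   # I/O 1: read the old value
            self.io_ops += 1
            snap[index] = old          # I/O 2: write it to the snapshot area
            self.io_ops += 1
        self.blocks[index] = data      # I/O 3: write the new data in place
        self.io_ops += 1


vol = CowVolume(["a", "b", "c"])
vol.take_snapshot()
vol.write(1, "B")
print(vol.io_ops)          # 3
print(vol.snapshot_area)   # {1: 'b'}
```

In this model the live data stays in its original location, which is why the old value has to be moved out of the way before each first overwrite.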

Should a process attempt to access the snapshot at a later time, it does so through the snapshot system, which is aware of which blocks have been changed since the snapshot was created. Unmodified blocks are read from the original protected entity, while the snapshot system retrieves the previous versions of modified blocks from their respective storage locations. This decision-making process for each block can incur significant computational overhead.
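The per-block decision made when reading a copy-on-write snapshot might look like the following sketch. The function names and the dictionary used as the snapshot area are assumptions made for illustration only.

```python
def read_snapshot_block(live_blocks, snapshot_area, index):
    """Return the value a block had when the snapshot was taken."""
    if index in snapshot_area:
        # The block changed after the snapshot: use the preserved copy.
        return snapshot_area[index]
    # The block is unchanged: read it from the protected entity itself.
    return live_blocks[index]


def read_snapshot(live_blocks, snapshot_area):
    """Rebuild the full snapshot view, making one lookup decision per block."""
    return [
        read_snapshot_block(live_blocks, snapshot_area, i)
        for i in range(len(live_blocks))
    ]


live = ["a", "B", "c"]    # block 1 was overwritten after the snapshot
preserved = {1: "b"}      # old value saved by the copy-on-write step
snapshot_view = read_snapshot(live, preserved)
print(snapshot_view)      # ['a', 'b', 'c']
```

Every block read goes through the lookup, which is where the overhead described above comes from.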

How redirect-on-write works.

In contrast, a redirect-on-write system uses pointers to represent all protected entities. When a block needs to be changed, the storage system simply redirects the pointer associated with that block to another block and writes the data there. The snapshot system maintains a record of all block locations constituting a given snapshot, which is essentially a list of pointers that correspond to the block locations. If a process seeks to access a specific snapshot, it can use these pointers to retrieve the blocks from their original locations. The fact that some blocks have been replaced and are now represented by different pointers has no bearing on the snapshot. Consequently, reading a snapshot within a redirect-on-write system incurs no additional computational overhead.
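The pointer mechanism can be sketched with a similar toy model. The `RowVolume` class and its fields are hypothetical, and the model makes simplifying assumptions: every write goes to a fresh location even when no snapshot exists, and freed blocks are never reclaimed.

```python
class RowVolume:
    """Toy redirect-on-write volume (hypothetical model, no space reclamation)."""

    def __init__(self, blocks):
        self.store = list(blocks)                  # physical block locations
        self.live = list(range(len(self.store)))   # logical index -> location
        self.snapshots = {}                        # name -> saved pointer list
        self.io_ops = 0                            # simulated I/O counter

    def take_snapshot(self, name):
        # A snapshot is only a copy of the pointer list, not of the data.
        self.snapshots[name] = list(self.live)

    def write(self, index, data):
        self.store.append(data)                    # the only I/O: write new data
        self.io_ops += 1
        self.live[index] = len(self.store) - 1     # redirect the live pointer

    def read(self, index, snapshot=None):
        # Live reads and snapshot reads follow pointers in exactly the same way.
        pointers = self.live if snapshot is None else self.snapshots[snapshot]
        return self.store[pointers[index]]


vol = RowVolume(["a", "b", "c"])
vol.take_snapshot("s1")
vol.write(1, "B")
print(vol.io_ops)          # 1
print(vol.read(1))         # B
print(vol.read(1, "s1"))   # b
```

The old block is never touched on a write, so the snapshot's pointer to it stays valid without any copying.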

Redirect-on-write snapshots can cut the I/O operations needed to modify a protected block by two-thirds (one write vs. one read and two writes), all while eliminating any additional computational overhead associated with reading snapshots. Copy-on-write systems, in contrast, can significantly degrade the performance of the protected entity. The more snapshots are kept and the longer they are retained, the more pronounced the performance impact. This is precisely why copy-on-write snapshots are typically employed as temporary sources for backups. Making too many snapshots and keeping them around for long periods of time can reduce the performance of the protected entity by as much as 50%.
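As a quick check on the per-write comparison, the arithmetic can be worked through using the counts from the copy-on-write description earlier (one read plus two writes) against a single redirect-on-write write. The constant names are illustrative only.

```python
COW_OPS_PER_WRITE = 3   # one read + two writes, per the copy-on-write description
ROW_OPS_PER_WRITE = 1   # a single write to a new location

saved = COW_OPS_PER_WRITE - ROW_OPS_PER_WRITE
reduction = saved / COW_OPS_PER_WRITE

print(saved)                # 2
print(f"{reduction:.0%}")   # 67%
```

Put another way, a copy-on-write update issues three times the I/O of a redirect-on-write update, which corresponds to a reduction of about two-thirds per modified block.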

If you plan to use your storage system’s snapshot facility to create and hold many snapshots for long periods of time, you should really seek out a redirect-on-write system or something like it.  Whatever vendor you choose, make sure you communicate with them about how you plan to use snapshots and see if they think you will have performance issues. Then verify their claims with a proof-of-concept test. Snapshots are a great tool; just make sure to use them right.
