In this series we’ve focused on the cost impact of different design options for Cloud Volumes ONTAP (CVO) in AWS. Most recently we covered CVO licensing. In addition to licensing tiers, you will need to decide which license delivery option is the most cost-effective: BYOL (annual commitment) or Marketplace (pay as you go). As usual, there’s no one right answer, so a simple cost analysis is required to get it right. We discussed the basic inputs for that cost analysis in our last post. But there’s one more question that needs to be answered to complete the analysis. Should you deploy CVO on a single instance, or in a High Availability (HA) configuration?
The answer to this question is largely determined by use case. During deployment via Cloud Manager, NetApp provides some advice.
This advice is fairly straightforward: if the use case is backup or DR, or you can withstand an outage without business impact, choose a single instance. If not, choose HA. But is the decision really that straightforward? Not quite.
While it makes sense for NetApp to provide a quick info bubble to help admins get going quickly, there’s a bit more to the decision. The info provided implies that the difference between the two options is about availability only. That’s not the case. It’s also about durability.
In our second post in this series we covered the different EBS types available for CVO and the cost and performance implications of each. There’s another important EBS attribute to consider: durability.
Here’s an excerpt from Amazon on EBS durability:
“Each volume is designed to protect against failures by replicating within the Availability Zone (AZ), offering 99.999% availability and an annual failure rate (AFR) of between 0.1%-0.2%.”
So EBS is more durable than an HDD, but it’s not like storing your data in S3, which is designed for eleven 9s of durability. An EBS volume will not withstand an AZ failure. AWS recommends that you protect data stored on EBS using snapshots, which are stored in S3 and are durable across AZs. So the basic message from Amazon is that EBS is durable, but you need to plan for failure.
Enterprise storage systems typically use some type of error correction to achieve data durability. RAID, erasure coding, and forward error correction are common mechanisms employed to ensure that drive or node failures don’t result in data loss.
Volumes served from on-prem NetApp appliances are backed by aggregates made up of HDDs and SSDs. These aggregates are configured in RAID-DP (2 parity drives) or RAID-TEC (3 parity drives) sets. When drives fail, their contents are rebuilt from parity and the file system, volumes, and shares remain intact.
NetApp’s RAID-DP and RAID-TEC are not used in CVO. Instead, CVO uses RAID0, which is simply striping and offers no protection from failure. If an EBS volume fails on a single instance, the aggregate backed by that EBS volume fails with it. This will result in data loss. If you plan to use CVO for the only copy of your data, a single instance is not recommended.
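To put the RAID0 point in numbers, here is a minimal sketch of how the quoted EBS annual failure rate compounds across a striped aggregate. It assumes volume failures are independent, which is a simplification, and the volume counts are hypothetical examples rather than CVO defaults. The function name is ours, not a NetApp or AWS API.

```python
def aggregate_annual_loss_probability(volume_count: int, volume_afr: float) -> float:
    """Chance that at least one volume in a RAID0 stripe fails within a year.

    Assumption: each EBS volume fails independently with the same AFR.
    With RAID0, losing any one volume means losing the whole aggregate.
    """
    if volume_count < 1:
        raise ValueError("volume_count must be at least 1")
    if not 0.0 <= volume_afr <= 1.0:
        raise ValueError("volume_afr must be a fraction between 0 and 1")
    # Probability that every volume survives, subtracted from 1.
    return 1.0 - (1.0 - volume_afr) ** volume_count


if __name__ == "__main__":
    # 0.1% and 0.2% are the AFR bounds quoted by Amazon above.
    for afr in (0.001, 0.002):
        # Hypothetical aggregate sizes, for illustration only.
        for count in (1, 6, 12):
            risk = aggregate_annual_loss_probability(count, afr)
            print(f"AFR {afr:.1%}, {count:>2} volumes: {risk:.2%} annual risk")
```

Under this model, an aggregate striped across six volumes at a 0.2% AFR carries roughly a 1.2% chance of loss per year, about six times the risk of a single volume.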
As with single instance, RAID-DP and RAID-TEC are not used in CVO HA. Instead, NetApp’s SyncMirror technology is employed. SyncMirror creates a second plex in an aggregate and syncs the data between the two plexes. If one plex is lost due to sub-component failure, the aggregate remains intact.
The nodes in a CVO HA pair run in different AZs, as do the EBS volumes attached to them. SyncMirror builds one plex from EBS volumes in one AZ and a second plex from EBS volumes in another AZ. The result is highly durable aggregates backed by two plexes across AZs.
HA is more durable, but that durability does come at a cost. We recently conducted a detailed cost comparison between single instance and HA for a customer. We found that the additional instance, EBS storage, and CVO license required for HA resulted in a 35% cost increase when compared to single instance. The cost delta would have been higher had we not limited CVO running time to 8 hours per day and factored in license discounts.
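A comparison like this can be reduced to a few lines of arithmetic. The sketch below is not the model we used for the customer. It simply splits costs into hourly items (which scale with running time) and fixed monthly items, then reports the percentage difference. Every line-item name and price is a made-up placeholder, so substitute figures from your own AWS and NetApp quotes.

```python
def monthly_total(hourly_items: dict, monthly_items: dict,
                  running_hours_per_day: float, days_per_month: int = 30) -> float:
    """Sum hourly charges over the running time plus fixed monthly charges."""
    running_hours = running_hours_per_day * days_per_month
    return sum(hourly_items.values()) * running_hours + sum(monthly_items.values())


def percent_increase(baseline: float, comparison: float) -> float:
    """Percentage by which `comparison` exceeds `baseline`."""
    return (comparison - baseline) / baseline * 100.0


if __name__ == "__main__":
    # PLACEHOLDER prices only; none of these are real AWS or NetApp rates.
    single = monthly_total(
        hourly_items={"instance": 1.00, "license": 1.00},
        monthly_items={"ebs": 500.00, "other_fixed": 1000.00},
        running_hours_per_day=8,
    )
    ha = monthly_total(
        hourly_items={"instances": 2.00, "license": 1.50},
        monthly_items={"ebs": 1000.00, "other_fixed": 1000.00},
        running_hours_per_day=8,
    )
    print(f"Single: ${single:,.2f}  HA: ${ha:,.2f}  "
          f"Delta: {percent_increase(single, ha):.1f}%")
```

Separating hourly from fixed items makes it easy to rerun the comparison with a different `running_hours_per_day` and see how much of the delta comes from running time.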
So how does all of this apply to the DR use case we’ve been discussing throughout this series? Let’s go back to NetApp’s recommendation: “a single node system that is ideal for Disaster Recovery”. The reasoning here is that DR systems don’t store the primary or only copy of a data set. A DR system can go down or even suffer data loss without disrupting your business. This, coupled with the fact that single instance is the more cost-effective of the two options, makes it a logical choice for DR.
But it’s not the only choice. Losing your DR system might not disrupt your business but it does create a vulnerability. After a failure, it could take weeks or even months before replication initialization to CVO completes. Operating without a functional DR system for this long may be an unacceptable risk or even violate a regulatory requirement. If that’s the case, HA is a better option.
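A back-of-the-envelope estimate shows how re-initialization can stretch to months. The sketch below assumes a steady link with a fixed share of its bandwidth available to replication, and it ignores storage efficiency savings, protocol overhead, and data that changes during the transfer. The dataset size, link speed, and utilization are hypothetical inputs.

```python
SECONDS_PER_DAY = 86_400


def initialization_days(dataset_tb: float, link_mbps: float,
                        utilization: float = 0.5) -> float:
    """Days needed to send `dataset_tb` decimal terabytes over a link.

    `utilization` is the fraction of the link that replication can use.
    """
    if dataset_tb < 0 or link_mbps <= 0:
        raise ValueError("dataset_tb must be >= 0 and link_mbps must be > 0")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    total_bits = dataset_tb * 1e12 * 8            # TB -> bytes -> bits
    effective_bps = link_mbps * 1e6 * utilization  # Mbps -> usable bits/s
    return total_bits / effective_bps / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical example: 100 TB over a 200 Mbps link at 50% utilization.
    days = initialization_days(dataset_tb=100, link_mbps=200, utilization=0.5)
    print(f"Estimated baseline transfer: {days:.0f} days")
```

With those example inputs the baseline takes about 93 days, roughly three months without a working DR copy.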
One final variable to consider is the use of NetApp’s SnapMirror vault option, which keeps older snapshots (longer retention) on the replication target. Since these older snapshots no longer exist on the source, they could be lost entirely in the event of an EBS failure on a single instance. This is another reason to consider HA for the DR use case.
As mentioned above, there are multiple replication options to consider. You can create relationships at the SVM or Volume level and can use mirror or vault policies. In our next post we’ll discuss these options, the setup process, and some limitations to be aware of.