One of the constants of the storage business is that capacity per drive keeps increasing. Spinning hard-disk drives are approaching 20 TB, while solid-state drives range from 4 TB to 16 TB, or even more if you're willing to entertain an exotic implementation. Today at the Data Centre World conference in London, I was quite surprised to hear that, for reasons of risk management, we're unlikely to see much demand for drives over 16 TB.

Speaking with a few individuals at the show about expanding capacities, I heard that storage customers who need high density are starting to set maximum drive-size requirements based on their implementation needs. One message coming through is that deployments are managing risk through drive size: a large-capacity drive allows for high density, but the failure of a large drive means a lot of data is lost in one go, or is at least at risk while the array rebuilds.

If we consider how data is used in the datacentre, there are several tiers defined by how often the data is accessed. Long-term storage, known as cold storage, is accessed very infrequently and is populated with mechanical hard drives offering long-term data retention; a large drive failure at this level can lose substantial archival data or require long rebuild times. More regularly accessed storage, nearline or 'warm' storage, is accessed frequently but often acts as a localised cache of the long-term store. For this case, imagine Netflix keeping a good amount of its back catalogue available for users to access: losing a drive here means falling back to colder storage, and rebuild times come into play. For hot storage, the tier under constant read/write access, we are often dealing with DRAM or large database workloads with many operations per second. This is where a drive failure and rebuild can cause critical issues with server uptime and availability.
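As a rough illustration only, here is how those tiers might be sketched in Python; the tier names and impact notes simply restate the paragraph above, and the structure is my own, not any vendor's taxonomy:

```python
# Illustrative sketch of the storage tiers described above.
# The descriptions are a paraphrase of the article, not vendor definitions.
STORAGE_TIERS = {
    "cold": {
        "media": "high-capacity mechanical HDD",
        "access": "very infrequent (archival)",
        "failure_impact": "large archival loss or very long rebuild",
    },
    "warm": {  # nearline, e.g. a localised cache of a back catalogue
        "media": "HDD or high-capacity SSD",
        "access": "frequent reads, cache-like",
        "failure_impact": "fall back to colder storage while rebuilding",
    },
    "hot": {
        "media": "DRAM / NVMe SSD",
        "access": "constant read/write, many operations per second",
        "failure_impact": "rebuild directly threatens uptime and availability",
    },
}

if __name__ == "__main__":
    for tier, info in STORAGE_TIERS.items():
        print(f"{tier:>4}: {info['media']} -- {info['failure_impact']}")
```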

Ultimately, drive size and failure rate together determine the risk and potential downtime, and aside from engineering more reliable drives, the main variable left for risk management is capacity. Based on the conversations I had today, 16 TB seems to be the inflection point: no-one wants to lose 16 TB of data in one go, regardless of how often it is accessed or how much failover the storage array has built in.

I was told that drives above 16 TB do exist in the market, but aside from niche applications where the risk is an acceptable trade-off for the higher density, volumes are low. This inflection point, one would imagine, is subject to change as the nature of data and data analytics changes over time. Samsung's PM983 NF1 drive tops out at 16 TB, and while Intel has shown 8 TB samples of its long 'ruler' E1.L form factor, it has listed future QLC-based drives at up to 32 TB. Of course, 16 TB per drive puts no limit on the number of drives per system: we have seen 1U units with 36 of these drives in the past, and Intel has been promoting up to 1 PB in a 1U form factor. It is worth noting that the market for 8 TB SATA SSDs is relatively small; no-one wants to rebuild that large a drive at 500 MB/s, which would take a minimum of 4.44 hours, dragging server uptime down to 99.95% rather than the 99.999% metric of roughly five minutes of downtime per year.
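To sanity-check those figures, here is a minimal back-of-the-envelope sketch (my own, not from any of the vendors quoted) that assumes the rebuild runs at the full 500 MB/s interface speed and is the only downtime in the year:

```python
# Quick check of the rebuild-time and uptime figures quoted above.
# Best case: the rebuild runs flat out with no other load on the drive.

def rebuild_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Minimum time to rewrite a full drive, in hours (decimal TB and MB)."""
    bytes_total = capacity_tb * 1e12
    bytes_per_s = throughput_mb_s * 1e6
    return bytes_total / bytes_per_s / 3600

def uptime_pct(downtime_hours: float, hours_per_year: float = 365.25 * 24) -> float:
    """Uptime percentage if this much downtime is incurred once per year."""
    return 100 * (1 - downtime_hours / hours_per_year)

if __name__ == "__main__":
    t = rebuild_hours(8, 500)            # 8 TB SATA SSD at ~500 MB/s
    print(f"rebuild: {t:.2f} h")         # ~4.44 h
    print(f"uptime with that as yearly downtime: {uptime_pct(t):.3f}%")  # ~99.95%
    # Five-nines availability allows only about 5.3 minutes per year:
    print(f"five-nines budget: {0.00001 * 365.25 * 24 * 60:.1f} min/year")
```

Running it reproduces the roughly 4.4-hour rebuild and 99.95% uptime figures, against a five-nines budget of about 5.3 minutes per year.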

Comments

  • Karatedog - Wednesday, March 13, 2019

    And this is why RAID-Z3 exists, where you use three parity drives. We had a mail system where, if a 4 TB HDD died, rebuilding the RAID volume took 25 days under the usual load (and not 4.44 hours, come on ppl :). The chance that the other two parity disks die within those 25 days is extremely rare.
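The gap between a best-case 4.44-hour figure and a 25-day real-world rebuild is easy to reproduce once the rebuild only gets a small slice of the array's bandwidth. A rough sketch, with made-up throughput and utilisation numbers rather than anything measured on the commenter's system:

```python
# Rough illustration of why rebuilds under production load take so long.
# The 1.2% share below is a made-up placeholder, not a measured figure.
def rebuild_days(capacity_tb: float, drive_mb_s: float, rebuild_share: float) -> float:
    """Days to rebuild when only `rebuild_share` of drive throughput is spare."""
    seconds = (capacity_tb * 1e12) / (drive_mb_s * 1e6 * rebuild_share)
    return seconds / 86400

print(rebuild_days(4, 150, 1.0))    # idle 4 TB HDD at ~150 MB/s: ~0.3 days
print(rebuild_days(4, 150, 0.012))  # ~1.2% of throughput left for the rebuild: ~25 days
```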
  • cfenton - Wednesday, March 13, 2019

    There's a very good reason. The larger the array, the longer the resilvering time. Even if you don't lose any data, your array is still going to be nearly useless while it's resilvering.
  • CaedenV - Wednesday, March 13, 2019

    Really? Nearly useless? The IO actually used on a storage array is typically limited by the fabric or network connection of the end users. Sure, in some rare cases this becomes an issue, but typically the network connection is so ridiculously slow compared to the speed of the drives that the performance hit of the resilvering process would not even be noticed. These aren't HDDs limited to mere hundreds of IOPS; these are SSDs, multiple SSDs, each with hundreds of thousands of IOPS.
    Performance hit? Yes. Useless? Hardly.
    The only thing I see here is that they are expecting high failure rates on these large SSDs, be it the nature of QLC flash, or hammering the SSDs with minimal GC over time, etc. They are essentially saying that they are expecting downtime, and potentially enough failures that they would expect more than normal RAID protections can guard against.

    I mean... it's not like they are clamoring for 16TB HDDs as an alternative to a 16TB SSD for hot storage.
  • angry_spudger - Wednesday, March 13, 2019

    Degraded, yes; nearly useless, no.
  • Karatedog - Thursday, March 14, 2019

    Nitpicking: resilvering is for RAID 1, i.e. mirroring, as classic bathroom mirrors were fixed by resilvering their backside. Other RAID configurations (5, 6, Z) are 'rebuilt'.
    Sry, old habit 😃
  • Joseph Christopher Placak - Wednesday, March 13, 2019

    The answer to RAID 5 or RAID 6 is to use RAID 10. It is super fast because of striping, every striped drive is mirrored, and there are no parity calculations.
  • brshoemak - Wednesday, March 13, 2019

    RAID10 is superior to RAID5/6 in many ways, but keep in mind that with RAID10 you lose half of your gross array capacity to mirroring the stripes. With RAID5 you only lose the capacity of one drive to parity (two drives in RAID6). It's a balance of cost/performance/stability - pick two.

    That being said, I wouldn't use RAID5 or RAID6 with the current multi-TB drive sizes unless I didn't plan on accessing the data often and had both an onsite and offsite backup of the data. Even then there would be an 'ick' factor involved.
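To put rough numbers on the capacity trade-off described in the comment above, a minimal sketch (my own illustration, using an arbitrary eight-drive example); it ignores hot spares, controller overhead and filesystem formatting:

```python
# Usable capacity for a few common RAID levels (sketch, not a sizing tool).
def usable_tb(level: str, drives: int, drive_tb: float) -> float:
    if level == "raid10":
        return drives * drive_tb / 2          # everything mirrored: half the raw space
    if level == "raid5":
        return (drives - 1) * drive_tb        # one drive's worth of parity
    if level == "raid6":
        return (drives - 2) * drive_tb        # two drives' worth of parity
    raise ValueError(level)

for level in ("raid10", "raid5", "raid6"):
    print(level, usable_tb(level, drives=8, drive_tb=16), "TB usable from 128 TB raw")
```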
  • asmian - Saturday, March 16, 2019

    NO! That is no answer at all. RAID10 is not superior in any way. It might be faster, because of no parity calculations, but it is vastly less secure. Quite apart from the hideously expensive waste of disk space in duplicating everything without any parity beyond a four-disk array, it's a poor consumer-grade solution for people who don't really understand redundancy and just think faster is better. RAID10 is just a glorified RAID0 (striping), which is the most insecure way to store data, as any single drive failure is fatal - striping is not even RAID, really, since none of the disks are actually redundant as per the acronym; they are all essential.

    Assume one disk goes down in your RAID10. One "side" of the array is now compromised. The other side takes over duty, and the array rebuilds on a hot spare. But while it is rebuilding, the SAME disk on the other side fails, or just has a silent bit-error incident. Either you have lost the same disk on both sides, which are parts of a stripe, so you just lost the ENTIRE array, both sides failing catastrophically; or, if you are very lucky, you just rebuilt the array with a few bit errors in it, compromising your data. If it was just stray bit errors then you probably won't notice or be notified that you need to restore the compromised files from backup (if you have one...), so the next time you back up you'll overwrite the archived good versions of the files with the compromised ones made during the rebuild. What an EXCELLENT outcome.

    Remember too that random bit errors in a RAID10 or a RAID5 cannot be rectified. If one side of your RAID10 suddenly starts throwing bit errors, then the only way to confirm the data is to check the other side - but how can the controller know which holds the correct data now, and which is wrong? All it can tell is that they are different. The same is true with a single parity drive - is the error in one of the array drives or in the stored parity calculation? With a second parity or comparison the erroring drive can be uniquely identified and reported as bad.

    With RAID6 you have TWO parity drives. In the unlikely event that a drive fails or errors during a rebuild, the parity calculations on the second drive will kick in and ensure that your array doesn't fail. That second parity also ensures that there is no risk of storing bit errors during a rebuild, since there is still a parity to check against. You have a second line of defence that simply isn't there in RAID5 or 10 to protect you while your array rebuilds. It might be slower, but RAID6 is vastly superior.

    You might think that those second failure chances are unlikely, but the main problem is that the bigger the individual drives get in your array, the more chance there is of a bit error occurring while rebuilding, which is the worst time for it to happen. Having a second parity will protect you. Using drives that are actually designed for RAID environments (WD RE-class and similar) helps too, as they are guaranteed to have significantly better bit error rates than cheap consumer drives, by an order of magnitude.
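The scaling argument in the comment above can be made concrete with the unrecoverable-read-error (URE) rates typically quoted on spec sheets, roughly one error per 10^14 bits read for consumer hard drives and one per 10^15 for enterprise models. The sketch below assumes independent errors, so treat the exact percentages as an order-of-magnitude guide only:

```python
# Rough sketch of the "bit error during rebuild" risk described above.
# Assumes independent errors at the quoted URE rate; real drives are not this simple.
def p_read_error(capacity_tb: float, ure_per_bit: float) -> float:
    """Probability of at least one unrecoverable read error reading the whole drive."""
    bits = capacity_tb * 1e12 * 8
    return 1 - (1 - ure_per_bit) ** bits

for tb in (4, 8, 16):
    consumer = p_read_error(tb, 1e-14)    # typical consumer HDD spec: 1 error per 10^14 bits
    enterprise = p_read_error(tb, 1e-15)  # typical enterprise spec: 1 per 10^15 bits
    print(f"{tb:>2} TB  consumer ~{consumer:.0%}  enterprise ~{enterprise:.0%}")
```

The probability of hitting an error during a full-drive read grows quickly with capacity, which is exactly why bigger drives make rebuilds riskier and why the better-specced drives help by roughly an order of magnitude.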
  • ballsystemlord - Monday, March 18, 2019

    So they ought to use RAID 60.
    It's doable, and the amount of data we use is set to increase dramatically in the coming years without a similar increase - even if you have oodles of moolah - in network bandwidth (the cables are the limitation here, and most of us are affected and doomed), RAM bandwidth, HDD/SSD bandwidth, or GPU compute power (not to mention total GPU RAM).
    Nvidia has increased the size of their dies to humongous levels, and even if AMD follows suit, dies can't grow much bigger economically. Likewise with CPUs, but in their case it's core-to-core and core-to-RAM bandwidth. RAM is getting faster, but only if you go to non-spec modules. RAM latency is not decreasing, and DDR5 is supposed to include solid-state storage, which recreates the need for redundancy as RAM modules fail due to said storage. HDD and SSD bandwidth appear to be plateauing, although I'm really happy that HDD manufacturers are taking my long-time idea and using multiple heads on at least some HDDs; I'm confused as to why they did not do that before.
    Not that I'm trying to paint a depressing picture, mind, but there are a lot of bottlenecks in the current designs and technologies that need to be overcome.
  • malcolmh - Wednesday, March 13, 2019

    Well, the enterprise market may be satisfied, but consumer drives are still highly capacity-restricted. I find it pretty astonishing that capacities as small as 256GB, and even 128GB, are still mainstream consumer offerings for desktop internal and external drives, given modern data use.

    At the upper end, consumer SSDs effectively top out at 1TB (a few 2TB models may be available, but will cost you more than 8TB of spinning rust). Capacity and price per GB still have a long way to go in the consumer market.
