SSDs have taken the storage industry by storm by filling the ever-widening latency gap between other computing resources and hard drives. Every major storage vendor now has a flash product, but the differences in their approaches are striking. Many initially rushed to market with flash as a read cache for disks. Others have used gold-plated SLC flash or even PCIe flash cards. Yet others have put together a tray of SSDs with an open-source file system.
These early products have been unable to deliver the full benefits of flash because they do not address the hard problems of flash, or are simply too expensive for mainstream applications.
The Hard Problems of Flash
SSDs behave very differently from hard disks. Much of the complexity lies in the Flash Translation Layer (FTL), the firmware that makes a collection of flash chips usable in enterprise storage. The FTL handles wear leveling, ECC for data retention, page remapping, garbage collection (GC), write caching, and the management of internal mapping tables. These internal tasks, however, conflict with user requests and manifest as two main issues: latency spikes and limited durability.
The main appeal of SSD is its low latency, but that latency is not delivered consistently. And while write latency can be masked with write-back caching, read latency cannot be hidden. Typical SSD latencies are a couple of hundred microseconds, but some accesses are interrupted by the device's internal tasks, and their latency can stretch to tens of milliseconds or even seconds. That's slower than a hard disk.
There are myriad flash internal tasks that can contribute to latency, such as GC at inopportune times or stalling user IO to periodically persist internal metadata. What further complicates the situation is the lack of coordination across an array of devices.
The most common way to use SSDs is to configure a group of devices, typically in RAID 6. But because each device is its own ecosystem, completely unaware of the others, the performance of IOs to the array can become even more unpredictable: the devices' internal tasks are not coordinated.
Unless the storage subsystem understands the circumstances under which latency spikes occur and can manage or proactively schedule them across the entire array, the end result will be inconsistent and have widely varying latency characteristics.
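A quick back-of-the-envelope calculation illustrates why uncoordinated internal tasks matter so much at the tail. The numbers below are hypothetical, not measurements from any particular device:

```python
# Hypothetical distribution: 999 reads served at 200 us,
# plus 1 read stalled for 20 ms by device-internal GC.
latencies_us = [200] * 999 + [20_000]

mean_us = sum(latencies_us) / len(latencies_us)
p999_us = sorted(latencies_us)[int(0.999 * len(latencies_us))]

print(f"mean  = {mean_us:.1f} us")  # ~219.8 us: the average looks healthy
print(f"p99.9 = {p999_us} us")      # 20,000 us: the GC stall dominates the tail
```

A single stall per thousand requests barely moves the mean, yet it defines the tail latency an application actually experiences, which is why averages alone say little about SSD behavior.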
Although flash is great for IOPS, it has limited write cycles compared with a hard disk. And while SLC flash drives have higher endurance than MLC, they are too expensive for mainstream applications and may still require over-provisioning to control write amplification. MLC flash is much more cost-effective, but if used natively it will quickly wear out. Its lifetime is proportional to the total amount of data written to it, both by the user and by internal drive activity such as GC, page remapping, wear leveling, and data movement for retention.
The additional data written internally for each user write is referred to as write amplification, and it is usually highly dependent on device usage patterns. It is possible to nearly eliminate write amplification by using the device in a way that hardly ever triggers GC, but the techniques are not widely understood and may be drive-specific. Techniques for total data reduction such as dedupe and compression are more widely known, but hard to implement efficiently with low latency. Similarly, building a file system with a low metadata footprint and low IO overhead per user byte is also challenging, but yields high benefit.
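The interplay between data reduction and write amplification can be captured in two small formulas. This is a generic sketch with illustrative numbers, not a model of any specific drive:

```python
def write_amp_factor(nand_bytes_written: float, host_bytes_written: float) -> float:
    """Write amplification factor (WAF): bytes hitting NAND per byte of host data."""
    return nand_bytes_written / host_bytes_written

def effective_flash_writes(host_bytes: float, reduction_ratio: float, waf: float) -> float:
    """Bytes physically written after dedupe/compression (reduction_ratio:1)
    shrinks the data and drive-internal activity (waf) amplifies it."""
    return host_bytes / reduction_ratio * waf

# Example: 1 TB of host writes, 2:1 data reduction, WAF of 3
print(effective_flash_writes(1.0, 2.0, 3.0))  # 1.5 TB actually hits the NAND
```

The two factors pull in opposite directions: data reduction shrinks what must be stored, while write amplification multiplies what is physically written, so drive wear depends on the product of both.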
One of the key requirements for enterprise storage is reliability. Given that write endurance is a challenge for SSDs, and that suboptimal usage patterns can further reduce it, it is important to predict when data is at risk. SSDs have failure modes that are different from hard disks. As SSDs wear out, they experience more and more program failures, resulting in additional latency spikes. Furthermore, because efficient wear leveling keeps every cell at a similar level of wear, an SSD's cells tend to fail in quick succession as the device nears the end of its useful life.
It is important to understand when a device is vulnerable. This is not a simple matter of counting the number of bytes written to the device and comparing it with its rating. It means observing the device for various signs of failure and errors and taking action.
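The idea of combining several failure signals, rather than counting bytes alone, can be sketched as a simple decision rule. The attribute names and thresholds below are purely illustrative; real SMART attribute IDs and their meanings vary by vendor:

```python
def drive_at_risk(smart: dict) -> bool:
    """Flag a drive as vulnerable using several observed signals, not just a
    bytes-written counter. Attribute names and thresholds are hypothetical."""
    wear_pct = smart.get("media_wearout_pct", 0)       # 0-100; 100 = rated life consumed
    program_fail = smart.get("program_fail_count", 0)  # rising counts precede failures
    realloc = smart.get("reallocated_blocks", 0)       # blocks retired due to errors
    return wear_pct >= 90 or program_fail > 0 or realloc > 100
```

The point of the sketch is the shape of the logic: any one of several error signals should trigger action well before the nominal write-endurance rating is reached.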
The bottom line is that SSD is a game changer, but needs to be implemented correctly in any storage system, and won’t be as effective if it’s bolted onto an existing system. If you’re evaluating any SSD storage products, make sure you understand how the file system uses flash and manages performance, latency spikes, write endurance and reliability.
SSD Drive Modeling
Flash storage can deliver 400 times greater raw performance than spinning disk, but leveraging it introduces the need for fundamental architectural changes. For comparison, the speed of sound — 768 mph at sea level — is “only” 250 times faster than the average speed of walking. To travel at supersonic speeds, engineers designed sophisticated aircraft systems specifically for high speeds. It may be possible to strap a rocket motor to one’s back and attempt to travel at 768 mph, but the result would be less than ideal.
Flash poses similar challenges to existing storage systems. MLC SSDs are currently the most cost-effective approach and provide excellent random IO performance, but they have several idiosyncrasies that make them unsuitable as a simple drop-in replacement for rotating magnetic disks.
Disk-based systems were created more than 20 years ago to cope with a decidedly different set of problems. Adapting these systems to use flash efficiently is comparable to attempting to adapt an 8-bit single-threaded operating system to use today’s multicore 64-bit architectures.
The Tintri VMstore appliance is designed from scratch to fully exploit flash technology for virtual environments. The custom Tintri OS is specifically designed to ensure robust data integrity, reliability, and durability in flash. MLC flash is a key technology that enables Tintri to deliver the intense random IO required to aggregate hundreds or even thousands of VMs on a single appliance.
Flash drives are programmed at the page level (512B to 4KB) but can only be erased at the block level (512KB to 2MB), sizes much larger than the average IO request. This asymmetry between write and erase sizes leads to write amplification which, if not managed appropriately, creates latency spikes. Tintri employs sophisticated technology to eliminate both the write amplification and the latency spikes characteristic of MLC flash technology. This approach delivers consistent sub-millisecond latency from cost-effective MLC flash.
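The cost of the page/block asymmetry can be quantified with the standard GC approximation: to erase a block, every still-valid page in it must first be copied elsewhere. This is a generic model, not a description of Tintri's implementation:

```python
def gc_write_amp(block_bytes: int, page_bytes: int, valid_fraction: float) -> float:
    """Approximate write amplification from GC. To reclaim one erase block,
    every still-valid page must be copied out before the erase; the reclaimed
    pages then absorb new host writes. valid_fraction is the share of pages
    in the victim block that still hold live data."""
    pages_per_block = block_bytes // page_bytes
    copied = pages_per_block * valid_fraction          # internal copy writes
    reclaimed = pages_per_block * (1 - valid_fraction)  # space freed for host writes
    return (copied + reclaimed) / reclaimed             # NAND writes per host write

# 2 MB blocks, 4 KB pages, victim blocks still 50% valid:
print(gc_write_amp(2 * 1024 * 1024, 4 * 1024, 0.5))  # 2.0
```

Note how sharply this degrades: at 50% valid data each host write costs two NAND writes, but at 75% valid data it costs four, which is why steering GC toward mostly-empty blocks matters so much.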
Tintri VMstore leverages the strengths of MLC flash while circumventing its weaknesses, providing a highly reliable and durable storage system suitable for enterprise applications.
MLC flash, in particular, can be vulnerable to durability and reliability problems in the underlying flash technology. Each MLC cell can only be overwritten 5,000 to 10,000 times before wearing out, so the file system must account for this and write evenly across cells.
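Those per-cell endurance figures translate directly into a drive-lifetime estimate. The arithmetic below uses illustrative numbers and a simplified model (uniform wear, constant workload):

```python
def drive_lifetime_days(capacity_gb: float, pe_cycles: int,
                        host_gb_per_day: float, waf: float) -> float:
    """Rough endurance estimate: total NAND write budget (capacity times
    program/erase cycles, assuming perfect wear leveling) divided by the
    amplified daily write load."""
    nand_budget_gb = capacity_gb * pe_cycles
    return nand_budget_gb / (host_gb_per_day * waf)

# 400 GB MLC drive, 5,000 P/E cycles, 500 GB/day of host writes, WAF of 3:
print(drive_lifetime_days(400, 5000, 500, 3))  # ~1333 days, roughly 3.7 years
```

Halving the write amplification factor in this model doubles the drive's lifetime, which is why the file system's write behavior matters as much as the raw endurance rating.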
Tintri uses an array of technologies including deduplication, compression, advanced transactional and GC techniques, and SMART (Self-Monitoring, Analysis and Reporting Technology) monitoring of flash devices to intelligently maximize the durability of MLC flash. Tintri also employs RAID 6, which protects systems against the impact of potential latent manufacturing or internal software defects from this new class of storage devices.
Although MLC costs one-half to one-quarter as much as its cousin SLC, it is still about 20 times more expensive than SATA disk. To use flash cost-efficiently, technologies like inline deduplication and compression are critical.
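The effect of data reduction on that price gap is simple arithmetic. The per-gigabyte prices below are hypothetical placeholders, not actual market figures:

```python
# Illustrative pricing only (hypothetical numbers, not market prices):
flash_per_gb, disk_per_gb = 2.00, 0.10  # raw $/GB
reduction = 4.0                          # combined dedupe + compression ratio

effective_flash = flash_per_gb / reduction
print(f"raw flash premium:       {flash_per_gb / disk_per_gb:.0f}x")    # 20x
print(f"effective flash premium: {effective_flash / disk_per_gb:.0f}x")  # 5x
```

With a 4:1 combined reduction ratio, the effective cost per usable gigabyte of flash drops by the same factor, which is what makes an all-active-data-in-flash design economically viable.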
By design, nearly all active data will live exclusively in flash. To maximize flash usage, Tintri combines fast inline deduplication and compression with file system intelligence that automatically moves only cold data to slower media.
Inline deduplication and compression are also highly effective in virtualized environments where many VMs are deployed by cloning existing VMs, or have the same operating system and applications installed. Tintri VMstore flash is neither a pure read cache nor a separate pre-allocated storage tier. Instead, flash is intelligently utilized where its high performance will provide the most benefit.
FlashFirst design uses a variety of techniques to handle write amplification, ensure longevity and safeguard against failures, such as:
- Data reduction using deduplication and compression. This reduces data before it is stored in flash, resulting in fewer writes.
- Intelligent wear leveling and GC algorithms. This leverages not only information on flash devices, but also real-time active data from individual VMs for longer life and consistently low latency.
- SMART (Self-Monitoring, Analysis and Reporting Technology). This monitors flash devices for any potential problem, issuing alerts before they escalate.
- High-performance dual-parity RAID 6. This delivers higher availability than RAID 10 without the inefficiency of mirroring or the performance hit of traditional RAID 6 implementations.
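The first of these techniques, content-based deduplication, can be sketched in a few lines. This is a generic illustration of the idea, not Tintri's implementation, and a real system would also handle hash-collision verification and reference counting:

```python
import hashlib

def dedupe_blocks(blocks: list, store: dict) -> int:
    """Inline dedupe sketch: fingerprint each block before it reaches flash;
    only previously unseen content incurs a physical write. Returns the
    number of bytes actually stored."""
    written = 0
    for block in blocks:
        fp = hashlib.sha256(block).hexdigest()
        if fp not in store:
            store[fp] = block   # new content: one physical write
            written += len(block)
        # duplicate content: only a reference is recorded, no flash write
    return written

store = {}
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]  # cloned-VM style duplicates
print(dedupe_blocks(blocks, store))  # 8192: only the two unique blocks hit flash
```

Because cloned VMs share most of their content, this kind of fingerprint lookup eliminates the bulk of their writes before any flash wear occurs, which is why inline (rather than post-process) dedupe helps endurance as well as capacity.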