While RAID technology has existed for decades and most storage companies have built their own RAID subsystems, the existing technology was all designed for disk-based systems. The adoption of solid state media into the mainstream changed a lot of design considerations in the storage stack. In the early stages of the design, we knew that the RAID subsystem needed to be designed with SSDs in mind, and that it would need to be a lot more sophisticated than previous designs. A later section discusses the challenges of early SSDs as well as newer challenges as SSDs evolved.
The Tintri RAID subsystem is built into the user-space file system; it is not in the kernel. The main disadvantage of this choice is that the RAID module cannot be used to manage the OS disks and partitions, which instead rely on the Linux kernel ‘md’ RAID subsystem. The advantages, however, were faster development, a far more sophisticated module, tight integration with the rest of the file system, much greater control over performance, and a much shorter turnaround time for debugging and adding features.
The Tintri RAID module implements RAID 6, which is the default RAID level shipped on all Tintri systems. Its RAID 6 (dual parity) is implemented in a way that results in a very fast RAID subsystem. The parity is rotated across all the drives participating in the stripe so that IO load can be evenly distributed, and the module only ever performs full-stripe writes. This avoids errors from in-place updates and eliminates the write-amplification overhead that small random writes would otherwise incur by forcing the corresponding parity to be updated.
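The two ideas above, rotated parity placement and full-stripe writes, can be sketched as follows. This is an illustrative model, not Tintri's implementation: the rotation formula and chunk sizes are invented, and the Q parity of real RAID 6 is a Reed-Solomon syndrome, which is stubbed out here to keep the focus on placement.

```python
# Sketch of rotated dual-parity placement for a full-stripe write.
# All names and the rotation formula are illustrative assumptions.
from functools import reduce

def parity_positions(stripe_no: int, ndrives: int):
    """Rotate P and Q across drives so parity IO is evenly distributed."""
    p = (ndrives - 1 - stripe_no) % ndrives
    q = (p + 1) % ndrives
    return p, q

def xor_parity(chunks):
    """P parity: byte-wise XOR of all data chunks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def full_stripe_write(stripe_no, data_chunks, ndrives):
    """Lay out a complete stripe (data + P + Q); never a partial update."""
    p_idx, q_idx = parity_positions(stripe_no, ndrives)
    p = xor_parity(data_chunks)
    q = p  # placeholder: real Q parity is a Reed-Solomon syndrome
    layout = [None] * ndrives
    layout[p_idx], layout[q_idx] = ("P", p), ("Q", q)
    it = iter(data_chunks)
    for i in range(ndrives):
        if layout[i] is None:
            layout[i] = ("D", next(it))
    return layout
```

Because every write covers a whole stripe, parity is always computed from data already in hand; no read-modify-write of old parity is ever needed.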
The stripe itself may be smaller than the RAID group size: for example, a RAID group of 20 disks may have a stripe length of 10+2. The RAID group headers are updated using a Paxos-based consensus protocol, so the RAID group never loses its brains, no matter how many drive pushes, pulls, or mix-ups are performed.
In addition to storing data, the RAID module provides some auxiliary functions. It exposes APIs that let other interested subsystems store small blobs of metadata, with atomic updates guaranteed by the consensus protocol built into the RAID module. Many modules use this facility to store, update, and retrieve small, low-level metadata such as superblocks. There is also a rich API set, both internal and network-based, for querying internal state such as drive state, rebuild status, etc.
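The shape of such a blob API might look like the sketch below. The class and method names are invented for illustration; in the real system, atomicity comes from the Paxos-based consensus protocol, for which a simple version check stands in here.

```python
# Hypothetical sketch of a small-blob metadata store with atomic updates.
class BlobStore:
    def __init__(self):
        self._blobs = {}  # key -> (version, bytes)

    def get(self, key):
        """Return (version, data); version 0 means the blob does not exist yet."""
        return self._blobs.get(key, (0, b""))

    def atomic_update(self, key, expected_version, data: bytes) -> bool:
        """Commit only if nobody else updated the blob since it was read."""
        version, _ = self._blobs.get(key, (0, b""))
        if version != expected_version:
            return False  # lost the race; the caller re-reads and retries
        self._blobs[key] = (version + 1, data)
        return True
```

A subsystem storing its superblock would read the current version, write the new contents conditionally, and retry on conflict.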
The IO Stack
The IO stack of the RAID module is specifically designed to meet the needs of SSDs while also managing HDDs. The entire stack is designed for very low latency. A proportional scheduler is built into the stack to arbitrate among the many “principals” that submit IO to the RAID module: user reads and writes, metadata reads and writes, garbage-collection reads and writes, scrub, scanners, etc. Each of these principals has a different priority in the system and different goals; some need low latency, others high throughput.
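One common way to implement proportional sharing is a virtual-time (stride-style) scheme; the sketch below uses that approach with invented names and is an assumption about the mechanism, not a description of Tintri's scheduler. Each principal advances a virtual clock inversely to its weight, and the backlogged principal furthest behind is served next.

```python
# Illustrative proportional scheduler over per-principal IO queues.
class ProportionalScheduler:
    def __init__(self, weights):
        self.queues = {p: [] for p in weights}          # per-principal FIFO
        self.stride = {p: 1.0 / w for p, w in weights.items()}
        self.vtime = {p: 0.0 for p in weights}          # virtual clocks

    def submit(self, principal, io):
        self.queues[principal].append(io)

    def next_io(self):
        """Serve the backlogged principal with the smallest virtual time."""
        ready = [p for p, q in self.queues.items() if q]
        if not ready:
            return None
        p = min(ready, key=lambda p: self.vtime[p])
        self.vtime[p] += self.stride[p]                 # heavier weight => smaller step
        return self.queues[p].pop(0)
```

With weights 3:1 between user reads and garbage collection, a busy system dispatches roughly three user IOs for every GC IO.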
An independent subsystem may manage its own IO stack and queue to the RAID module to meet its performance goals. The RAID module then schedules between them to meet the overall performance goals, from the drives, at the system level.
The entire RAID stack is asynchronous; threads don’t block inside the RAID module. The IO pipeline is carefully managed for high throughput, and also for providing low latency for random reads (the low write latency comes from the use of NVRAM/NVDIMM).
As discussed in a later section, SSDs generally provide low latency to random reads, but can sometimes incur a high latency in the presence of writes. Tintri RAID associates a small timeout value (in ms) with every asynchronous read of high value (like user data read) and reconstructs the read from other devices if the current device being read from exceeds the latency threshold.
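The timeout-then-reconstruct pattern can be sketched with Python's asyncio. The threshold value and function names are illustrative; the real per-model timeouts come from the SSD profiles discussed later.

```python
# Sketch: bound read latency by reconstructing from peers on a slow device.
import asyncio

READ_TIMEOUT_MS = 20  # illustrative threshold; real values are per SSD model

async def read_with_reconstruct(read_chunk, reconstruct_from_peers, unit):
    """Answer a high-value read from parity + remaining stripe units
    if the target device exceeds the latency threshold."""
    try:
        return await asyncio.wait_for(read_chunk(unit), READ_TIMEOUT_MS / 1000)
    except asyncio.TimeoutError:
        # The device is stalled (e.g., internal GC); don't wait for it.
        return await reconstruct_from_peers(unit)
```

The caller gets predictable latency either way: either the device answers quickly, or the answer is rebuilt from the rest of the stripe.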
Additionally, the RAID subsystem performs device management at the group level to increase the possibility that no more than one device may incur latency spikes at a given time. The information needed to perform this kind of management is developed by characterizing the performance and behavior of each SSD model (discussed in more detail later). This combination of device- and group-level management of SSDs leads to consistently low latencies.
Tintri storage systems usually have one or more spares. An external module, called PlatMon, determines which spare should be used for a given degraded RAID group. While a device is in the ‘spare’ state, it is not managed by RAID. When a spare is handed to RAID, it is validated and then immediately used for rebuilding. Thus, a drive within a RAID group is always in one of four states: good, missing, failed, or rebuilding.
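The four states can be written down as a small state machine. The transition table below is an assumption inferred from the description above (e.g., a validated spare enters directly as rebuilding and becomes good once rebuilt); it is not Tintri's actual state diagram.

```python
# Hypothetical drive state machine for a drive inside a RAID group.
from enum import Enum, auto

class DriveState(Enum):
    GOOD = auto()
    MISSING = auto()
    FAILED = auto()
    REBUILDING = auto()

# Assumed legal transitions; a FAILED drive is replaced by a spare
# rather than recovering in place.
TRANSITIONS = {
    DriveState.GOOD:       {DriveState.MISSING, DriveState.FAILED},
    DriveState.MISSING:    {DriveState.GOOD},
    DriveState.FAILED:     set(),
    DriveState.REBUILDING: {DriveState.GOOD, DriveState.MISSING, DriveState.FAILED},
}

def can_transition(src: DriveState, dst: DriveState) -> bool:
    return dst in TRANSITIONS[src]
```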
Drive rebuilds are performed in a circular fashion. For example, suppose there are 1,000 stripes, one drive fails, and it has rebuilt the first 300 stripes when a second drive fails. The second drive then starts rebuilding at the same offset as the first: stripe #300. By the time all stripes are done rebuilding on the first drive, the second drive will have rebuilt stripes 300 to 1000; at that point the rebuild circles back to stripes 1 to 299 on the second drive. This optimization is present for two reasons:
- Independently rebuilding each drive from independent offsets has higher read overhead in terms of data read.
- HDDs are not efficient at random seeks, so it is better to coordinate the offsets being read for the two rebuilds. HDDs are present on the hybrid storage platforms, so it is important to optimize for them.
Another rebuild optimization: only previously written stripes are ever rebuilt. If an SSD happens to fail or is pulled out for testing on a newly deployed system with only a few GBs of data written, the failed SSD will rebuild in a matter of minutes.
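Both rebuild optimizations reduce to simple offset arithmetic. The sketch below (0-based stripe numbers, invented names) starts a new rebuild at the offset of the one already in progress, wraps around, and skips stripes that were never written.

```python
# Sketch: circular rebuild order restricted to previously written stripes.
def rebuild_order(written_stripes, start_offset, total_stripes):
    """Yield stripes in circular order from start_offset, wrapping at the end,
    visiting only stripes that have ever been written."""
    order = [(start_offset + i) % total_stripes for i in range(total_stripes)]
    return [s for s in order if s in written_stripes]
```

On a nearly empty system, `written_stripes` is tiny, so the failed drive rebuilds in minutes; with two overlapping rebuilds, both walk the same offsets, keeping HDD seeks sequential and reading each stripe only once.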
There is a profile stored for each SSD model type in the RAID subsystem. SSD products from different vendors vary in performance, endurance, write amplification, and latency profile. These characteristics also vary between models from the same vendor.
To account for these differences and deliver the highest availability and performance possible, Tintri studies each SSD and builds the profile that is most efficient for it. The profile specifies maximum IO sizes, write pattern, queue lengths, read and write timeouts, latency optimizations, etc. for a given model. The RAID subsystem detects the drive model at runtime and applies the matching profile. Dozens of profiles are stored in the RAID subsystem.
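A per-model profile table might look like the following sketch. The field names, model strings, and values are all invented; they illustrate the lookup-by-model idea, not Tintri's actual profile contents.

```python
# Hypothetical per-SSD-model profile table with a conservative fallback.
from dataclasses import dataclass

@dataclass(frozen=True)
class SSDProfile:
    max_io_kib: int        # largest IO issued to this model
    queue_depth: int       # per-device queue length
    read_timeout_ms: int   # latency threshold before reconstructing reads
    write_timeout_ms: int

PROFILES = {
    "VENDOR_A_MODEL_1": SSDProfile(256, 32, 20, 100),
    "VENDOR_B_MODEL_2": SSDProfile(128, 16, 15, 80),
}
DEFAULT_PROFILE = SSDProfile(128, 8, 50, 200)  # conservative for unknown models

def profile_for(model: str) -> SSDProfile:
    """Detect the drive model at runtime and apply the matching profile."""
    return PROFILES.get(model, DEFAULT_PROFILE)
```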
Each stripe unit header carries a checksum of the stripe unit payload. In addition, each stripe unit carries metadata relating it to other units of the stripe and to the RAID group itself. Together, this information can be used to detect errors like bit flips, corruption, lost writes, misplaced writes, wrong offset reads or swapped writes at the RAID level.
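The check splits into two questions: is the payload intact, and is this the unit we expected at this location? The sketch below uses CRC32 as a stand-in for whatever stronger checksum the real system uses, with invented header fields.

```python
# Sketch: self-identifying stripe unit headers for end-to-end verification.
import zlib

def make_unit_header(raid_uuid, stripe_no, unit_no, payload: bytes):
    """Record the unit's identity and a checksum of its payload."""
    return {"raid_uuid": raid_uuid, "stripe": stripe_no,
            "unit": unit_no, "crc": zlib.crc32(payload)}

def verify_unit(header, raid_uuid, stripe_no, unit_no, payload: bytes) -> str:
    """Classify what went wrong, if anything."""
    if zlib.crc32(payload) != header["crc"]:
        return "corruption"      # bit flip or torn/lost write
    if (header["raid_uuid"], header["stripe"], header["unit"]) != \
       (raid_uuid, stripe_no, unit_no):
        return "misplaced"       # misplaced/swapped write or wrong-offset read
    return "ok"
```

A good checksum with a bad identity is the signature of a misdirected or swapped write, which a checksum alone would never catch.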
The Tintri RAID groups are expandable: a RAID group may start out at size 10+2 and later be expanded to 20+2. The purpose of expansion is to add capacity without increasing the RAID parity overhead; instead of adding a new RAID group, the current RAID group is expanded.
Given the high density of modern SSDs, a single storage system with 24 slots can house more than 200 TB of raw SSD capacity; soon, with 16 TB SSDs, that number will increase to more than 400 TB.
In a high-density scenario, creating multiple RAID groups from multiple SSD shelves does not make a lot of sense; beyond a certain capacity the controller becomes a bottleneck and is unable to serve more data. The range of capacities that can be addressed with 24 slots of high density SSDs is very large. This is the reasoning behind the choice to design the RAID system to support dynamic expansion.
RAID expansion occurs without any overhead or downtime. After new disks are added, the expansion task is initiated, which instantly adds the new devices to the current RAID group. Expansion can occur one disk at a time, or with as many as 12 at once.
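The parity-overhead argument for expansion is simple arithmetic: the two parity drives stay fixed while the data width grows, so the relative overhead falls. For a 10+2 group the overhead is 2/12 ≈ 16.7%; expanded to 20+2 it drops to 2/22 ≈ 9.1%.

```python
# Worked example: parity overhead shrinks as a RAID group is expanded.
def parity_overhead(data_disks: int, parity_disks: int = 2) -> float:
    """Fraction of the group's raw capacity consumed by parity."""
    return parity_disks / (data_disks + parity_disks)
```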
Dual Controller Architecture
Tintri VMstore appliances have a dual-controller architecture to provide high availability to storage consumers, and Tintri OS incorporates a number of data integrity features in its high availability architecture. The Tintri file system syncs data from the NVRAM on the active controller to that on the standby controller to ensure file system contents match. A strong fingerprint is calculated on the contents of NVRAM on the primary controller as data is synced and then verified against the fingerprint calculated independently by the secondary controller.
When Tintri OS receives a write request, data is buffered into the NVRAM device on the primary controller and forwarded to the NVRAM device on the secondary controller. After acknowledgement from the secondary controller that data was safely persisted on its NVRAM, the primary controller acknowledges the write operation to the hypervisor. This, combined with techniques discussed in NVRAM for fast buffering and fail protection, ensures controller failures have no impact on data integrity.
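The ordering in that write path is the key property: the hypervisor is acknowledged only after both NVRAM copies are persisted and the independently computed fingerprints agree. The sketch below models this with SHA-256 as a stand-in for whatever fingerprint the real system uses; class and function names are invented.

```python
# Sketch: mirrored NVRAM write path with independent fingerprint verification.
import hashlib

class Controller:
    def __init__(self):
        self.nvram = []                      # stand-in for the NVRAM device

    def persist(self, data: bytes):
        self.nvram.append(data)

    def fingerprint(self) -> str:
        """Each controller fingerprints its own NVRAM contents independently."""
        h = hashlib.sha256()
        for d in self.nvram:
            h.update(d)
        return h.hexdigest()

def handle_write(primary: Controller, secondary: Controller, data: bytes) -> bool:
    primary.persist(data)
    secondary.persist(data)                  # sync over the controller interconnect
    if primary.fingerprint() != secondary.fingerprint():
        raise IOError("NVRAM mirror mismatch")  # fail the sync; never ack bad data
    return True                              # only now ack the hypervisor
```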
High Availability Deep Dive
For a deep dive on more design features that keep the system available and protect the integrity of data, see Appendix I: High Availability Deep Dive.