2012 LSF/MM Summit Summary - Day 2
Flash Media
The second day of the LSF/MM summit started with a session on flash media, led by Steven Sprouse from SanDisk. He began with an introduction to lifetime terabyte writes (LTW), which is defined as:
       physical capacity * write endurance
LTW =  -----------------------------------
               write amplification
Physical capacity is increasing, but write endurance is decreasing: every write cycle wears the NAND, so that it retains data for a shorter amount of time. Write amplification is affected by many factors, such as block size, over-provisioning, trim, and so on.
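As a rough illustration, with made-up but plausible numbers: a drive with 256GB of physical capacity and NAND rated for 3000 write cycles, running at a write amplification of 2, would give

       256GB * 3000 cycles
LTW =  -------------------  =  384TB of lifetime writes
                2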
Sprouse mentioned that some SSD vendors are starting to use hybrid SLC/MLC devices, where SLC holds frequently written journal blocks and MLC holds data blocks. Hence the proposal that the OS pass information down to the SSD to differentiate metadata writes from data writes; with such tagging information, the flash can make a better decision about where to store the data.
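In-kernel filesystems already have a coarse version of such a tag: metadata I/O can be flagged with REQ_META (ext4 does this for its metadata writes); the proposal is essentially about carrying a hint like that down to the device. A minimal sketch of a metadata buffer write with the flag set (2012-era buffer-head API, hypothetical function name, error handling elided):

#include <linux/fs.h>
#include <linux/buffer_head.h>

/* Write one metadata buffer, tagged so the block layer (and, with the
 * proposed extensions, the device) can tell it apart from data writes. */
static void write_metadata_buffer(struct buffer_head *bh)
{
        lock_buffer(bh);
        bh->b_end_io = end_buffer_write_sync;
        get_bh(bh);
        submit_bh(WRITE | REQ_META, bh);   /* REQ_META marks metadata I/O */
}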
Another point made by Sprouse concerned the definition of “random write”. Different hardware has different capabilities for handling random writes; for flash, anything smaller than the erase block is a random write. However, the erase block size keeps changing. It was 64KB back in 2004, but now most are 1MB. So it really is necessary for flash vendors to expose such information to OS developers.
To help the OS make better use of the NAND geometry, possible ways in which flash vendors and OS developers can cooperate include:
- The OS tells the flash which data is likely to be dropped at the same time, so that the flash can place that data in the same erase block. One example given by Ted Ts'o is rpm/deb package files, which are likely to vanish together during an upgrade.
- Flash vendors report geometry information to upper layers, such as block size, page size, stripe size, etc.; the block layer already exports some such hints via sysfs, as shown in the sketch after this list.
- Provide tagging mechanisms so that userspace or the filesystem can tag different data types, e.g. metadata versus data, hot data versus cold data, etc.
- Filesystems help with cleanup (trim).
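Some of this plumbing exists in embryonic form today: the block layer exports I/O topology and discard hints through sysfs, which tools such as mkfs can read. A small userspace sketch that dumps a few of those attributes (the device name is just an example):

#include <stdio.h>

/* Print one queue attribute exported by the block layer in sysfs. */
static void show(const char *dev, const char *attr)
{
        char path[256], buf[64];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        f = fopen(path, "r");
        if (!f)
                return;          /* attribute not present on this kernel */
        if (fgets(buf, sizeof(buf), f))
                printf("%-22s %s", attr, buf);
        fclose(f);
}

int main(void)
{
        const char *dev = "sda";     /* example device */

        show(dev, "minimum_io_size");
        show(dev, "optimal_io_size");
        show(dev, "discard_granularity");
        show(dev, "discard_max_bytes");
        return 0;
}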
It was suggested that flash vendors make a list of the information they are willing to provide, so that OS developers can look at it and decide what would be useful. However, even if there are standard ways to query this information, vendors are not forced to fill it in correctly.
Flash Cache in the Operating System
The discussion mainly revolved around integrating bcache. Flashcache was implemented by Facebook and is based on the device mapper (DM), while bcache was implemented at Google and performs better than flashcache. However, bcache needs so much of the information that is dropped or hidden by DM that it bypasses DM completely. Integrating bcache with DM would require a large amount of change to DM, and some changes to the block layer as well. The discussion didn't reach any real conclusion, though.
High IOPS and SCSI/Block
Roland Dreier introduced the two current modes of writing block drivers: registering the driver's own make_request() function, or using the block layer's normal request queue. The former is so low-level that it bypasses many useful block layer functions, while the latter is slow because of heavy lock contention. Jens Axboe described his multi-queue patches, which implement per-CPU queues coupled with a lightweight elevator. He promised to post these patches soon; they should help solve the performance problem.
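A heavily simplified sketch of the two registration styles (2012-era block API, hypothetical mydrv_* names, all real driver work elided):

#include <linux/blkdev.h>
#include <linux/bio.h>

static DEFINE_SPINLOCK(mydrv_lock);

/* Style 1: the driver supplies its own make_request function and sees
 * raw bios; the elevator, merging and most queue locking are bypassed. */
static void mydrv_make_request(struct request_queue *q, struct bio *bio)
{
        /* hand the bio to the hardware here ... */
        bio_endio(bio, 0);
}

/* Style 2: use the stock request path; the block layer queues and merges
 * bios, and the driver drains struct requests under the queue lock. */
static void mydrv_request_fn(struct request_queue *q)
{
        struct request *rq;

        while ((rq = blk_fetch_request(q)) != NULL)
                __blk_end_request_all(rq, 0);   /* complete immediately */
}

static struct request_queue *mydrv_alloc_queue(bool bypass_block_layer)
{
        if (bypass_block_layer) {
                struct request_queue *q = blk_alloc_queue(GFP_KERNEL);

                if (q)
                        blk_queue_make_request(q, mydrv_make_request);
                return q;
        }
        return blk_init_queue(mydrv_request_fn, &mydrv_lock);
}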
Another issue discussed was Dell's vision for accessing PCIe flash devices. Currently, PCIe flash devices still provide a block interface, but Dell hopes to move to a memory interface to get better performance. However, it was pointed out that a memory interface leaves no room for software error recovery: any error reported by the device is a fatal failure, so the device has to do all error handling itself and only report errors as a last resort.
LBA Hinting and New Storage Commands
Frederick Knight led the discussion of handling shingled drives. Shingled drives can greatly increase disk density, but they require the OS to write in bands. There are three options:
- transparent to OS
- banding: let the host manage geometry and expose new SCSI commands for handling bands
- transparent with hints: make it look like a normal disk, but develop new SCSI commands to hint in both directions between device and host what the data is and what the device characteristics are, to try to optimize data placement
The second option was dropped by the attendees immediately. The first option looks much like the current SSD situation, and the third is the same approach that flash vendors are pursuing.
Storage Manager
Lukas Czerner led the session, giving an update on his command-line storage manager. It mainly aims to be a generic storage manager, reducing the complexity of managing different storage devices and filesystems. However, since the underlying storage devices and filesystems may provide different type-dependent options, the new storage manager only reduces complexity if users don't need those miscellaneous options.
WRITE_SAME and UNMAP, FSTRIM
The session started with some complaints about the current trim command. The ATA TRIM command uses only two bytes for the length of a trim range, meaning a range can be at most 64K sectors, which is 32MB for each TRIM operation. Another problem is that the current block layer only allows a contiguous trim range. Since TRIM is not a queued command, the overhead goes up considerably when there are many distinct ranges to trim. Christoph Hellwig had a proof-of-concept demonstrating multi-range trim in XFS; it showed only ~1% overhead compared with the no-trim case.
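For reference, the FITRIM ioctl is the interface through which the fstrim tool asks a filesystem to issue those discards; a minimal userspace sketch (error handling kept short):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FITRIM, struct fstrim_range */

int main(int argc, char **argv)
{
        struct fstrim_range range;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        memset(&range, 0, sizeof(range));
        range.len = (__u64)-1;          /* trim the whole filesystem */
        range.minlen = 0;               /* no minimum extent size */

        if (ioctl(fd, FITRIM, &range) < 0)
                perror("FITRIM");
        else
                printf("trimmed %llu bytes\n", (unsigned long long)range.len);

        close(fd);
        return 0;
}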
Besides trim, scatter/gather SCSI commands have the same multi-range problem. There were two options for implementing multi-range commands in the block layer: allowing a single BIO to carry multiple ranges, or using linked BIOs for the ranges. After discussion, the latter was adopted.
NFS server issues
Bruce Fields led the session, covering the current status of knfsd with regard to features like change attributes, delegations/oplocks, share locks, delete-file recovery/server-side silly-rename, and lock recovery.
During the discussion, people asked about pNFS, and Boaz Harrosh made the point that an in-kernel pNFS server may not happen because there is no developer interest in making it happen.