Ceph started as research project in 2004 at UCSC. At the core of Ceph is a distributed object store called RADOS. The storage backend was implemented over an already mature filesystem. The filesystem helps with block allocation, metadata management, and crash recovery. Ceph team built their storage backend on an existing filesystem, because they didn’t want to write a storage layer from scratch. A complete filesystem takes a lot of time (10 years) to develop, stabilize, optimize, and mature.
However, having a filesystem in the path to the storage adds a lot of overhead. It creates problems for implementing efficient transactions. It introduces bottlenecks for metadata operations. A filesystem directory with millions of small files will be a metadata bottleneck forexample. Paging etc also creates problems. To circumvent these problems, Ceph team tried hooking into FS internals by implementing WAL in userspace, and use the NewStore database to perform transactions. But it was hard to wrestle with the filesystem. They had been patching problems for seven years since 2010. Abutalib likens this as the stages of grief: denial, anger, bargaining, …, and acceptance! Finally the Ceph team deserted the filesystem approach and started writing their own storage system BlueStore which doesn’t use a filesystem. They were able to finish and mature the storage level in just two years! This is because a small, custom backend matures faster than a POSIX filesystem.
The new storage layer, BlueStore, achieves a very high-performance compared to earlier versions. By avoiding data journaling, BlueStore is able to achieve higher throughput than FileStore/XFS. When using a filesystem the write-back of dirty meta/data interferes with WAL writes, and causes high tail latency. In contrast, by controlling writes, and using write-through policy, BlueStore ensures that no background writes to interfere with foreground writes. This way BlueStore avoids tail latency for writes. Finally, having full control of I/O stack accelerates new hardware adoption. For example, while filesystems have hard time adapting to the shingled magnetic recording storage, the authors were able to add metadata storage support to BlueStore for them, with data storage being in the works.
To sum up, the lesson learned was for distributed storage it was easier and better to implement a custom backend rather than trying to shoehorn a filesystem for this purpose.
Here is the architecture diagram of BlueStore, storage backend. All metadata is maintained in RocksDB, which layers on top of BlueFS, a minimal userspace filesystem.
Abutalib, the first author on the paper, did an excellent job presenting the paper. He is a final year PhD with a lot of experience and expertise on storage systems. He is on the job market.