MNEMOFS

Mnemofs is a NAND Flash File System built for NuttX.

Usage

If there’s a NAND flash available at a location, for example, /dev/nand, you can mount it with mnemofs to a location like /mydir using:

mount -t mnemofs /dev/nand /mydir

The above command will only work if the device was already formatted using mnemofs. For a brand new device, or if you want to switch from an existing file system, this won’t work, and would need a format:

mount -t mnemofs -o forceformat /dev/nand /mydir

Unsure of whether you need to do a format? This will help:

mount -t mnemofs -o autoformat /dev/nand /mydir

This will format the device only if it can not detect mnemofs being already formatted onto it. Do note this includes cases where mnemofs is formatted to the device, but it’s been mutilated to the point of being unrecognizable.

After this, use it like a regular file system. That’s the job of a file system after all…to hide the storage device’s pecularities behind an abstraction. A file system is considered good if you don’t have to think about its existence during regular usage.

NAND Flashes

Programmatically, the NAND flash has some problems. The whole device can be condensed into three layers: blocks, pages and cells.

Cells represent the smallest unit of storage in NAND flashes, but are often ignored, as direct access is not allowed. If a cell stores one bit, it’s a Single Level Cell. There are MLC, TLC, etc. for more bits per cell. Often, the more bits per cell, the lesser is the wear resilience. Thus, higher bits per cell are easier to wear out and become unreliable.

Pages are the smallest readable or writable unit of the NAND flash. It’s made up of several cells, and can be expected to have a size of the similar order of 512 B.

Blocks are the smallest erasable unit of NAND flash. They are made up of several pages. If a page is already written, it needs to be erased before it can be written again. And since blocks are the smallest erasable unit, the entire block needs to be erased if the user wants to update the contents of one page.

The erase operation is what causes a block to wear out. If a block is worn out too much, it will lose its ability to reliably store data. An unreliable block can not guarantee that the data read from the pages in it is the same as what was written to it. This state is called as a bad block.

A manufacturer can also deem a block to be unreliable from their testing, and can mark them as bad blocks right from manufacture.

A good file system will aim to level out the wear between blocks as much as it can.

Design

There are various layers and components in mnemofs, and they interact with various layers on abstraction over each other.

Mnemofs works on a Copy-On-Write (CoW) basis, which means, if a page needs to be updated, it is copied over in memory, and then the change is applied, and the new data is written to a new location.

R/W Layer

This works with the NAND flash device driver directly. It can write an entire page, read an entire page, erase an entire block, check if a block is bad (from it’s bad block marker), or set a block as bad. It’s the simplest layer.

Block Allocator

The block allocator contains two arrays. One is a bit mask is for tracking the free pages, while the other is an array of numbers, one number for each block, denoting the number of pages in that block that are ready to be erased.

The block allocator allocates pages or blocks in a sequential manner to keep it fair for all pages, thus, ensuring wear levelling. It also starts from a random offset to prevent bias to the front of the device in case of multiple power losses and reinitialization in such casses. If a block is required it skips pages to the start of the next block. Since block allocations happen only in the journal, they happen in bulk and the number of skipped pages is very minimal.

Once reaching the end of the device, it cycles back to the front. Thus any skipped pages get the chance to be allocated in the next cycle.

CTZ Layer

This works with the R/W Layer, and acts as an abstraction layer for other components in mnemofs. Mnemofs uses CTZ lists to represent both files and directories in flash. CTZ lists of files contain only the data of the file, while CTZ lists of directories contain directory entries (direntries) for each FS Object (file or directory) grouped into it.

This layer abstracts away the complex division of flash space that’s present in CTZ skip lists, and allows users of this layer to not worry about the complexities of a CTZ skip list, and infact, to feel that the data is like a contiguous space.

This layer allows the user to specify a data offset, which refers to the offset into the actual data stored in the CTZ skip list (ie. excluding the pointers), and the number of bytes, and perform operations on the CTZ list almost like if it were a single array.

In mnemofs, each CTZ block takes up the space of exactly 1 page, and each pointer takes up 4 bytes.

Littlefs design document shows how a CTZ list can be identified using the index of the last CTZ block, and the page number of that CTZ block.

Journal

The journal in mnemofs is made out of n + 2 blocks. The last two block concern the master node. These blocks are arranged in a singly linked list.

Due to CoW policy, when a CTZ list is updated, it now has a new location. The first n blocks of the journal is responsible for storing logs containing information about this very update. It will contain the old location of the CTZ skip list, and the new location.

Thus, when the user requires an updated location of a CTZ list, they will first find the old location by traversing the FS tree in the flash, and then will traverse the journal to find the latest location.

So, the FS tree on the flash acts like a “base” state with updates stored in the journal. Each log in journal is followed by a checksum to verify if all of it was written properly. This helps in making it power loss resilient.

The journal, when combined with CoW, plays another important role. In pure CoW, any update to a CTZ file will result in it having a new location. This new location wil need to be updated in the parent, which itself will have a new location after the update, and so on till it reaches the root. The journal stops this propagation immediately. When the journal is full above a certain limit, it will flush, and apply all of these changes to the FS tree in one go. This helps in wear reduction.

The journal mainly works with the CTZ layer, and any updates to a CTZ list using this layer automatically adds a log for it in the journal.

The journal starts with a magic sequence, then the number of blocks in the journal (excluding master blocks), and then follows an array with the block numbers of the blocks in the journal (including the master blocks). Following this, logs are stored in the blocks.

Master Node and Root

The root of the file system is treated like any directory as far as its storage on the flash is concerned. This is because the master node acts as a parent to the root, and contains information of the root in a way identical to direntries.

The master node is stored in the master blocks. There are two master blocks, and both are duplicated of each other for backup. Each master block is a block, and thus have multiple pages in them. Each page contains one revision of the master node. The master nodes are written sequentially, and have a timestamp on them as well.

When a CTZ list is moved to a new location, the obsolete pages of the old CTZ list are marked for deletion.

LRU Cache

Mnemofs has a Least Recently Used (LRU) Cache component. The main use of this component is to reduce wear on the flash at the expense of memory.

The LRU is a kernel list of nodes. Each node represents an FS object. Each node also contains a kernel list of deltas. Each delta contains information about an update or deletion from the user (which is what all of the VFS write operations can be condensed to).

There’s a pre-configured limit for both deltas per node and nodes in the LRU.

If the delta list in a node is full, and another is to be added, all the existing deltas in the list are clubbed together and written to the flash using the CTZ layer. The layer also automatically adds a log for this update. When a node receives a delta, it is bumped from its current location in the LRU to be at the front. This way, the last node in the LRU is always the least used node.

If the node limit is reached in the LRU, and a new node is to be added to the LRU, then the final node (which is also the least recently used node), is popped from the LRU to make space for the new node. This popped node is then written to the flash using the CTZ layer as well.

The LRU helps in clubbing updates to a single FS object and thus helps in reducing the wear of the flash.

Journal Flush

The latest master node revision is the most useful out of the revisions. As in CoW it’s prudent to update the FS tree from bottom up, the root is the last one to get updated in the case of a journal flush.

The logs are just location updates. So, when a journal flush occurs, it will update the locations of all the children in the parent. This updates the parent, and then this update goes through the same procedure as any other update.

This is why it’s best to start the flush operation when the journal is filled up over a certain limit, instead of waiting it to be full. Why? Any log of a parent makes any log of its children written before it useless, as the updated location of the parent can be read to get the updated location of the child till that point in the logs.

So, it will be best to move up the FS tree from bottom during update and update the root last, since the root is the parent of every FS object.

Once the root is updated, all other journal logs become useless, and can be erased. The root’s log is not written in the first n blocks of the journal, but written as a new master node entry in the master blocks.

Once the new root is written, the first n blocks can be erased, and reallocated (due to the rules of wear levelling). The master blocks however have some conditions for reallocation. This is called moving of the journal.

Every time the first n blocks are cleared, a new master node is added. The only time a master block needs to be erased is when it becomes full. Thus, if there are p pages in a block, the master blocks will be moved along with the rest of the journal for every p journal flushes.

Before the new master node is written, none of the old pages should be erased to allow rollback to the previous FS tree state. The moment the new master node is updated, any block which has all of the pages in it ready for deletion will be erased to make space.

FS Object Layer

This layer provides an abstraction for iterating, adding, deleting or reading direntries.

This works with the LRU and the journal to get the latest data and thus the user of this layer does not have to worry about these underlying mnemofs components.

VFS Method Layer

VFS method layer contains methods exposed to the VFS. This layer works with the FS Object layer for direntry related tasks, or the LRU for file level read/write tasks.