Copyright © 2006, 2007, 2008, 2009, 2010 The NetBSD Foundation
Published: 2011/07/21 17:27:57
$NetBSD: netbsd-internals.html,v 1.109 2025/03/30 14:57:25 maya Exp $
This book describes the NetBSD Operating System internals. The main idea behind it is to provide solid documentation for contributors that wish to develop extensions for NetBSD or want to improve its existing code. Ideally, there should be no need to reverse-engineer the system's code in order to understand how something works.
A lot of work is still required to finish this book: some chapters are not finished and some are not even started. Those parts that are planned but not yet written are already part of the book and are clearly marked as incomplete with an XXX marker.
This book is currently maintained by the NetBSD www team (<www@NetBSD.org>). Corrections, suggestions and extensions should be sent to that address.
XXX: This chapter is extremely incomplete. It currently contains supporting documentation for Chapter 2, File system internals but nothing else.
UVM is NetBSD's virtual memory manager.
A UVM object — also known as a uobj — is a contiguous region of virtual memory backed by a specific system facility. This can be a file (vnode), XXX What else?
In order to understand what "to be backed by" means, here is a review of some basic concepts of virtual memory management. In a system with virtual memory support, the system can manage an address space bigger than the physical amount of memory available to it. The address space is broken into chunks of fixed size, namely pages, as is the physical memory, which is divided into page frames.
When the system needs to access a memory address, it can either find the page it belongs to (page hit) or not (page fault). In the former case, the page is already stored in main memory so its data can be directly accessed. In the latter case, the page is not present in main memory.
When a page fault occurs, the processor's memory management unit (MMU) signals the kernel through an exception and asks it to handle the fault: this can either result in a resolved page fault or in an error. Assuming that all memory accesses are correct (and hence there are no errors), the kernel needs to bring the requested page into memory. But where is the requested page? Is it in the swap space? In a file? Should it be filled with zeros?
Here is where the backing mechanism enters the game. A backing object defines where pages should be read from and where they shall be stored after modification, if any. In terms of implementation, reading a page from the backing object is performed by a getpages function while writing to it is done by a putpages one.
Example: consider a 32-bit address space, a page size of 4096 bytes and an uobj of 40960 bytes (10 pages) starting at the virtual address 0x00010000; this uobj's backing object is a vnode that represents a text file in your file system. Assume that the file has not been read at all yet, so none of its pages are in main memory. Now, the user requests a read from offset 5000 and with a length of 4000. This offset falls into the uobj's second page and the ending address (9000) falls into the third page. The kernel converts these logical offsets into memory addresses (0x00011388 and 0x00012328) and reads all the data contained in between. So what happens? The MMU causes two page faults and the vnode's getpages method is called for each of them, which then reads the pages from the corresponding file, puts them into main memory and returns control to the caller. At this point, the read has been served.
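For illustration only, the offset arithmetic of this example can be spelled out as follows; every name in this sketch is made up and does not correspond to real UVM interfaces:

#define EG_PAGE_SIZE    4096
#define EG_UOBJ_START   0x00010000UL    /* virtual address where the uobj starts */

void
example_read_span(void)
{
        unsigned long start_off = 5000, len = 4000;
        unsigned long end_off = start_off + len;                /* 9000 */

        unsigned long start_page = start_off / EG_PAGE_SIZE;    /* 1: the uobj's second page */
        unsigned long end_page = end_off / EG_PAGE_SIZE;        /* 2: the uobj's third page */

        unsigned long start_va = EG_UOBJ_START + start_off;     /* 0x00011388 */
        unsigned long end_va = EG_UOBJ_START + end_off;         /* 0x00012328 */
}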
Similarly, pages can be modified in memory after they have been brought to it; at some point, these changes will need to be flushed to the backing store, which happens with the backing object's putpages operation. There are multiple reasons for the flush, including the need to reclaim the least recently used page frame from main memory, explicitly synchronizing the uobj with its backing store (think about synchronizing a file system), closing a file, etc.
The malloc(9) and free(9) functions provided by the NetBSD kernel are very similar to their userland counterparts. They are used to allocate and release wired memory, respectively.
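For instance, a temporary buffer could be allocated and released as shown below; M_TEMP is one of the standard malloc types described in the next section, and the function name and buffer size are arbitrary:

#include <sys/malloc.h>

void
example_temp_buffer(void)
{
        char *buf;

        /* Allocate 1024 bytes of wired kernel memory, sleeping if necessary. */
        buf = malloc(1024, M_TEMP, M_WAITOK);

        /* ... use the buffer ... */

        /* Release it, passing the same malloc type it was allocated with. */
        free(buf, M_TEMP);
}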
Malloc types are used to group different allocation blocks into logical clusters so that the kernel can manage them in a more efficient manner.
A malloc type can be defined in a static or dynamic fashion. Types are defined statically when they are embedded in a piece of code that is linked into the kernel at build time; if they are part of a standalone module, they are defined dynamically.
For static declarations, the MALLOC_DEFINE(9) macro is provided, which is then used somewhere in the global scope of a source file. It has the following signature:
MALLOC_DEFINE(type, short_desc, long_desc);
struct malloc_type *type;
const char *short_desc;
const char *long_desc;
The first parameter takes the name of the malloc type to be defined; do not let the type shown above confuse you, because it is an internal detail you ought not know. Malloc types are often named in uppercase, prefixed by M_. Some examples include M_TEMP for temporary data, M_SOFTINTR for soft-interrupt structures, etc.
The second and third parameters are a character string describing the type; the former is a short description while the later provides a longer one.
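As a hypothetical example — the type name and both descriptions are made up — a file system could define its own type like this:

/* Hypothetical malloc type for egfs' private data structures. */
MALLOC_DEFINE(M_EGFSDATA, "egfs data", "egfs file system private data");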
For a dynamic declaration, you must first define the type as static within the source file. Later on, the malloc_type_attach(9) and malloc_type_detach(9) functions are used to notify the kernel about the presence or removal of the type; this is usually done in the module's initialization and finalization routines, respectively.
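A minimal sketch of this, assuming the hypothetical M_EGFSDATA type from above and module routines named egfs_modinit and egfs_modfini:

static int
egfs_modinit(void)
{
        /* Make the statically defined malloc type known to the kernel. */
        malloc_type_attach(M_EGFSDATA);
        return 0;
}

static int
egfs_modfini(void)
{
        /* Remove the malloc type just before the module is unloaded. */
        malloc_type_detach(M_EGFSDATA);
        return 0;
}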
This chapter describes in great detail the concepts behind file system development under NetBSD. It presents some code examples under the name of egfs, a fictitious file system that stands for example file system.
Throughout this chapter, the word file is used to refer to any kind of object that may exist in a file system; this includes directories, regular files, symbolic links, special devices and named pipes. If there is a need to mention a file that stores data, the term regular file will be used explicitly.
Understanding a complex body of code like the storage subsystem can be difficult. This chapter begins with a structural overview, explaining how specific file systems and the virtual file system (VFS) code interact. It continues with a description of both the vnode interface (the interface to files; Section 2.2, “vnode interface overview”) and the VFS interface (the interface to whole file systems; Section 2.3, “VFS interface overview”) and then summarizes the existing file systems. These sections should be read in order; they provide a general outline for the whole storage subsystem and a foundation for reading and understanding existing code.
The subsequent sections of this chapter dig into specific issues and constructs in detail. These sections may be read in any order, and are heavily cross-linked to one another to ease navigation. These later sections should be considered a reference guide rather than an introduction.
At the very end there is a section that summarizes, based on ready-to-copy-and-paste code examples, how to write a file system driver from scratch. Note that this section does not contain explanations per se but only links to the appropriate sections where each point is described.
The storage subsystem is divided into four basic parts. First and highest level is the VFS-level code, file system independent code that performs common functions on behalf of the rest of the kernel. This portion sits on top of the second part, the individual file systems. The third part, common or generic implementations of file system level logic, sits at the same conceptual level as file systems themselves but is file system independent and shared rather than being part of a single file system. The fourth part is lower level support code that file systems call into. This code is also file system independent. (A fifth portion, device drivers for storage buses and hardware, is not discussed in this chapter.)
The interface between the VFS-level code and the file systems is very clearly defined. It is made up of two parts, the vnode interface and the VFS interface, described in more detail in the next two sections. The other interfaces are much less clear, as is the upper interface that the VFS-level code provides to the system call layer and the rest of the kernel. Work is ongoing to clarify these interfaces.
Confusingly, the VFS-level code, the combination of the VFS and vnode interfaces, and the VFS interface alone are all sometimes referred to as "the VFS layer".
A vnode is an abstract representation of an active file within the NetBSD kernel; it provides a generic way to operate on the real file it represents regardless of the file system it lives on. Thanks to this abstraction layer, all kernel subsystems only deal with vnodes. It is important to note that there is a unique vnode for each active file.
A vnode is described by the struct vnode
structure; its definition can be found in the
src/sys/sys/vnode.h
file and information about
its fields is available in the vnode(9) manual page. The
following analyzes the most important ideas related to this
structure.
As the rule says, abstract representations must be specialized before they can be instantiated. vnodes are not an exception: each file system extends both the static and dynamic parts of a vnode as follows:
The static part — the data fields that represent the
object — is extended by attaching a custom data structure
to a vnode instance during its creation. This is done through
the v_data
field as described in Section 2.2.1, “The vnode data field”.
The dynamic part — the operations applicable to the
object — is extended by attaching a vnode operations
vector to a vnode instance during its creation. This is done
through the v_op
field as described in Section 2.2.3, “The vnode operations vector”.
The v_data
field in the struct
vnode type is a pointer to an external data structure used
to represent a file within a concrete file system. This field
must be initialized after allocating a new vnode and must be set
to NULL
before releasing it (see Section 2.8.4, “Deallocation of a vnode”).
This external data structure contains any additional information to describe a specific file inside a file system. In an on-disk file system, this might include the file's initial cluster, its creation time, its size, etc. As an example, NetBSD's Fast File System (FFS) uses the in-core memory representation of an inode as the vnode's data field.
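As an illustrative sketch — struct egfs_node and these helper functions are hypothetical; only the v_data field itself belongs to the real struct vnode:

#include <sys/vnode.h>

struct egfs_node;       /* hypothetical per-file structure */

static void
egfs_attach_node(struct vnode *vp, struct egfs_node *node)
{
        vp->v_data = node;      /* right after the vnode has been allocated */
}

static void
egfs_detach_node(struct vnode *vp)
{
        vp->v_data = NULL;      /* before the vnode is released */
}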
A vnode operation is implemented by a function that follows a simple contract: it returns an integer describing the operation's exit status and takes a single void * parameter that carries a structure with the real operation's arguments.
Using an external structure to describe the operation's arguments instead of using a regular argument list has a reason: some file systems extend the vnode with additional, non-standard operations; having a common prototype makes this possible.
The following table summarizes the standard vnode operations. Keep in mind, though, that each file system is free to extend this set as it wishes. Also note that the operation's name is shown in the table as the macro used to call it (see Section 2.2.4, “Executing vnode operations”).
Table 2.1. vnode operations summary
| Operation | Description | See also |
|---|---|---|
| VOP_LOOKUP | Performs a path name lookup. | See Section 2.10, “Path name resolution procedure”. |
| VOP_CREATE | Creates a new file. | See Section 2.11.1, “Creation of regular files”. |
| VOP_MKNOD | Creates a new special file (a device or a named pipe). | See Section 2.14, “Special nodes”. |
| VOP_LINK | Creates a new hard link for a file. | See Section 2.11.2, “Creation of hard links”. |
| VOP_RENAME | Renames a file. | See Section 2.11.4, “Rename of a file”. |
| VOP_REMOVE | Removes a file. | See Section 2.11.3, “Removal of a file”. |
| VOP_OPEN | Opens a file. | |
| VOP_CLOSE | Closes a file. | |
| VOP_ACCESS | Checks access permissions on a file. | See Section 2.11.8, “Access control”. |
| VOP_GETATTR | Gets a file's attributes. | See Section 2.11.6.1, “Getting file attributes”. |
| VOP_SETATTR | Sets a file's attributes. | See Section 2.11.6.2, “Setting file attributes”. |
| VOP_READ | Reads a chunk of data from a file. | See Section 2.11.5.4, “The read and write operations”. |
| VOP_WRITE | Writes a chunk of data to a file. | See Section 2.11.5.4, “The read and write operations”. |
| VOP_IOCTL | Performs an ioctl(2) on a file. | |
| VOP_FCNTL | Performs a fcntl(2) on a file. | |
| VOP_POLL | Performs a poll(2) on a file. | |
| VOP_KQFILTER | XXX | |
| VOP_REVOKE | Revokes access to a vnode and all of its aliases. | |
| VOP_MMAP | Maps a file on a memory region. | See Section 2.11.5.3, “Memory-mapping a file”. |
| VOP_FSYNC | Synchronizes the file with on-disk contents. | |
| VOP_SEEK | Tests and informs the file system of a seek. | |
| VOP_MKDIR | Creates a new directory. | See Section 2.13.1, “Creation of directories”. |
| VOP_RMDIR | Removes a directory. | See Section 2.13.2, “Removal of directories”. |
| VOP_READDIR | Reads directory entries from a directory. | See Section 2.13.3, “Reading directories”. |
| VOP_SYMLINK | Creates a new symbolic link for a file. | See Section 2.12.1, “Creation of symbolic links”. |
| VOP_READLINK | Reads the contents of a symbolic link. | See Section 2.12.2, “Read of symbolic link's contents”. |
| VOP_TRUNCATE | Truncates a file. | See Section 2.11.6.2, “Setting file attributes”. |
| VOP_UPDATE | Updates a file's times. | See Section 2.11.7, “Time management”. |
| VOP_ABORTOP | Aborts an in-progress operation. | |
| VOP_INACTIVE | Marks the vnode as inactive. | See Section 2.8.1, “vnode's life cycle”. |
| VOP_RECLAIM | Reclaims the vnode. | See Section 2.8.1, “vnode's life cycle”. |
| VOP_LOCK | Locks the vnode. | See Section 2.8.5, “vnode's locking protocol”. |
| VOP_UNLOCK | Unlocks the vnode. | See Section 2.8.5, “vnode's locking protocol”. |
| VOP_ISLOCKED | Checks whether the vnode is locked or not. | See Section 2.8.5, “vnode's locking protocol”. |
| VOP_BMAP | Maps a logical block number to a physical block number. | See Section 2.11.5.5, “Reading and writing pages”. |
| VOP_STRATEGY | Performs a file transfer between the file system's backing store and memory. | See Section 2.11.5.5, “Reading and writing pages”. |
| VOP_PATHCONF | Returns pathconf(2) information. | |
| VOP_ADVLOCK | XXX | |
| VOP_BWRITE | Writes a system buffer. | |
| VOP_GETPAGES | Reads memory pages from the file. | See Section 2.11.5.2, “Getting and putting pages”. |
| VOP_PUTPAGES | Writes memory pages to the file. | See Section 2.11.5.2, “Getting and putting pages”. |
The v_op
field in the struct
vnode type is a pointer to the vnode operations vector,
which maps logical operations to real functions (as seen in Section 2.2.2, “vnode operations”). This vector is file system specific as
the actions taken by each operation depend heavily on the file
system where the file resides (consider reading a file, setting
its attributes, etc.).
As an example, consider the following snippet; it defines
the open
operation and retrieves two
parameters from its arguments structure:
int
egfs_open(void *v)
{
        struct vnode *vp = ((struct vop_open_args *)v)->a_vp;
        int mode = ((struct vop_open_args *)v)->a_mode;

        ...
}
The whole set of vnode operations defined by the file system
is added to a vector of struct
vnodeopv_entry_desc-type entries, with each entry being the
description of a single operation. The purpose of this vector is
to define a mapping from logical operations such as
vop_open
or vop_read
to real
functions such as egfs_open and egfs_read. It is not directly used
by the system under normal operation. This vector is
not tied to a specific layout: it only lists operations available
in the file system it describes, in any order it wishes. It can
even list non-standard (and unknown) operations as well as lack
some of the most basic ones. (The reason is, again, extensibility
by third parties.)
There are two minor restrictions, though:
The first item always points to an operation used in
case a non-existent one is called. For example, if the file
system does not implement the vop_bmap
operation but some code calls it, the call will be redirected
to this default-catch function. As such, it is often used to
provide a generic error routine but it is also useful in
different scenarios. E.g., layered file systems use it to
pass the call down the stack.
It is important to note that there are two standard
error routines available that implement this functionality:
vn_default_error
and
genfs_eopnotsupp
. The latter correctly
cleans up vnode references and locks while the former is the
traditional error routine. New code should only use the
latter.
The last item is always a pair of null pointers.
Consider the following vector as an example:
const struct vnodeopv_entry_desc egfs_vnodeop_entries[] = {
        { vop_default_desc, vn_default_error },
        { vop_open_desc, egfs_open },
        { vop_read_desc, egfs_read },
        ... more operations here ...
        { NULL, NULL }
};
As stated above, this vector is not directly used by the system; in fact, it only serves to construct a secondary vector that follows strict ordering rules. This secondary vector is automatically generated by the kernel during file system initialization, so the code only needs to instruct it to do the conversion.
This secondary vector is defined as a pointer to an array of function pointers of type int (**vops)(void *). To tell the kernel where this vector is, a mapping between the two vectors is established through a third vector of struct vnodeopv_desc-type items. This is easier to understand with an example:
int (**egfs_vnodeop_p)(void *);

const struct vnodeopv_desc egfs_vnodeop_opv_desc =
        { &egfs_vnodeop_p, egfs_vnodeop_entries };
Outside the file system's scope, users of the vnode layer
will only deal with the egfs_vnodeop_p
and
egfs_vnodeop_opv_desc
vectors.
All vnode operations are subject to a very strict locking protocol among several other call and return contracts. Furthermore, their prototype makes their call rather complex (remember that they receive a structure with the real arguments). These are some of the reasons why they cannot be called directly (with a few exceptions that will not be discussed here).
The NetBSD kernel provides a set of macros and functions
that make the execution of vnode operations trivial; please note
that they are the standard call procedure. These macros are named
after the operation they refer to, all in uppercase, prefixed by
the VOP_
string. They take the same arguments that will be passed to the
operation.
For example, consider the following implementation for the access operation:
int
egfs_access(void *v)
{
        struct vnode *vp = ((struct vop_access_args *)v)->a_vp;
        int mode = ((struct vop_access_args *)v)->a_mode;
        struct ucred *cred = ((struct vop_access_args *)v)->a_cred;
        struct proc *p = ((struct vop_access_args *)v)->a_p;

        ...
}
A call to the previous method could look like this:
result = VOP_ACCESS(vp, mode, cred, p);
For more information, see the vnodeops(9) manual page, which describes all the mappings between vnode operations and their corresponding macros.
The kernel's Virtual File System (VFS) subsystem provides access to all available file systems in an abstract fashion, just as vnodes do with active files. Each file system is described by a list of well-defined operations that can be applied to it together with a data structure that keeps its status.
File systems are attached to the virtual directory tree by
means of mount points. A mount point is a redirection from a
specific directory[1] to a different file
system's root directory and is represented by the generic
struct mount type, which is defined in
src/sys/sys/mount.h
.
A file system extends the static part of a struct
mount object by attaching a custom data structure to its
mnt_data
field. As with vnodes, this happens
when allocating the structure.
The kind of information that a file system stores in its mount structure heavily depends on its implementation. It will typically include a pointer (either physical or logical) to the file system's root node, used as the starting point for further accesses. It may also include several accounting variables as well as other information whose context is the whole file system attached to a mount point.
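As a sketch of what such a structure might look like — all names below are made up for illustration:

struct egfs_mount {
        struct egfs_node *em_root;      /* the file system's root node */
        unsigned long em_nfiles;        /* accounting: number of live files */
};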
A file system driver exposes a well-known interface to the kernel by means of a set of public operations. The following table summarizes them all; note that they are sorted according to the order that they take in the VFS operations vector (see Section 2.3.3, “The VFS operations structure”).
Table 2.2. VFS operations summary
| Operation | Description | Considerations | See also |
|---|---|---|---|
| fs_mount | Mounts a new instance of the file system. | Must be defined. | See Section 2.6, “Mounting and unmounting”. |
| fs_start | Makes the file system operational. | Must be defined. | |
| fs_unmount | Unmounts an instance of the file system. | Must be defined. | See Section 2.6, “Mounting and unmounting”. |
| fs_root | Gets the file system root vnode. | Must be defined. | See Section 2.9, “The root vnode”. |
| fs_quotactl | Queries or modifies space quotas. | Must be defined. | |
| fs_statvfs | Gets file system statistics. | Must be defined. | See Section 2.7, “File system statistics”. |
| fs_sync | Flushes file system buffers. | Must be defined. | |
| fs_vget | Gets a vnode from a file identifier. | Must be defined. | See Section 2.8.3, “Allocation of a vnode”. |
| fs_fhtovp | Converts an NFS file handle to a vnode. | Must be defined. | See Section 2.15, “NFS support”. |
| fs_vptofh | Converts a vnode to an NFS file handle. | Must be defined. | See Section 2.15, “NFS support”. |
| fs_init | Initializes the file system driver. | Must be defined. | See Section 2.5, “Initialization and cleanup”. |
| fs_reinit | Reinitializes the file system driver. | May be undefined (i.e., null). | See Section 2.5, “Initialization and cleanup”. |
| fs_done | Finalizes the file system driver. | Must be defined. | See Section 2.5, “Initialization and cleanup”. |
| fs_mountroot | Mounts an instance of the file system as the root file system. | May be undefined (i.e., null). | |
| fs_extattrctl | Controls extended attributes. | The generic vfs_stdextattrctl function is provided as a simple hook for file systems that do not support this operation. | |
The list of VFS operations may eventually change. When that happens, the kernel version number is bumped.
Regardless of mount points, a file system provides a
struct vfsops structure as defined in
src/sys/sys/mount.h
that describes itself. Basically, it contains:
A public identifier, usually named after the file system's name suffixed by the fs string. As this identifier is used in multiple places — especially both in kernel space and in userland — it is typically defined as a macro in src/sys/sys/mount.h. For example: #define MOUNT_EGFS "egfs".
A set of function pointers to file system operations. As opposed to vnode operations, VFS ones have different prototypes because the set of possible VFS operations is well known and cannot be extended by third party file systems. Please see Section 2.3.2, “VFS operations” for more details on the exact contents of this vector.
A pointer to a null-terminated vector of struct vnodeopv_desc * const items. These objects are listed here because, as stated in Section 2.2.3, “The vnode operations vector”, the system uses them to construct the real vnode operations vectors upon file system startup.
It is interesting to note that this field may contain more than one pointer. Some file systems may provide more than a single set of vnode operations; e.g., a vector for the normal operations, another one for operations related to named pipes and another one for operations that act on special devices. See the FFS code for an example of this and Section 2.14, “Special nodes” for details on these special vectors.
Consider the following code snippet that illustrates the previous items:
const struct vnodeopv_desc * const egfs_vnodeopv_descs[] = {
        &egfs_vnodeop_opv_desc,
        ... more pointers may appear here ...
        NULL
};

struct vfsops egfs_vfsops = {
        MOUNT_EGFS,
        egfs_mount,
        egfs_start,
        egfs_unmount,
        egfs_root,
        egfs_quotactl,
        egfs_statvfs,
        egfs_sync,
        egfs_vget,
        egfs_fhtovp,
        egfs_vptofh,
        egfs_init,
        NULL,                   /* fs_reinit: optional */
        egfs_done,
        NULL,                   /* fs_mountroot: optional */
        vfs_stdextattrctl,
        egfs_vnodeopv_descs
};
The kernel needs to know where each instance of this
structure is located in order to keep track of the live file
systems. For file systems built inside the kernel's core, the
VFS_ATTACH
macro adds the given VFS
operations structure to the appropriate link set. See GNU ld's
info manual for more details on this feature.
VFS_ATTACH(egfs_vfsops);
Standalone file system modules need not do this because the kernel will explicitly get a pointer to the information structure after the module is loaded.
On-disk file systems are those that store their contents on a physical drive.
Fast File System (ffs): XXX
Log-structured File System (lfs): XXX
Extended 2 File System (ext2fs): XXX
FAT (msdosfs): XXX
ISO 9660 (cd9660): XXX
NTFS (ntfs): XXX
Memory File System (mfs): XXX
Kernel File System (kernfs): XXX
Portal File System (portalfs): XXX
Pseudo-terminal File System (ptyfs): XXX
Temporary File System (tmpfs): XXX