Understanding the Linux File and Directory System

Having studied how operating systems virtualize the CPU and memory, I now turn to another core part of the system: files and directories. This introductory post gives an overview of how file systems work from a program's point of view. It is a synthesis of my reading, and the main source is listed in the Resources section at the end.

Files

At its core, a file is a linear array of bytes that can be read from or written to. Each file is associated with a low-level name, often referred to as its inode number. This low-level name is typically not visible to the user. The primary responsibility of the file system is to store data persistently on disk and ensure that the data retrieved matches what was originally stored.

In most operating systems, the file system does not concern itself with the content or structure of the file. It treats all files as collections of bytes, regardless of whether they contain text, images, or code. This abstraction allows for flexibility and interoperability between different file types. When a file is accessed, the file system locates the corresponding data on disk and provides it to the requesting process.

Directories

In addition to files, the file system also employs the concept of directories to organize and manage the storage of files and other directories. Like files, directories have low-level names (inode numbers). However, their contents are specific to their purpose. A directory consists of a list of (user-readable name, low-level name) pairs. This mapping allows users to associate user-friendly names with the low-level names of files.

By nesting directories within other directories, users can create a hierarchical structure known as a directory tree or directory hierarchy. This tree-like structure provides a convenient way to organize and locate files within the file system. Each entry in a directory can refer to either a file or another directory, enabling the construction of a flexible and expandable storage system.

The Power of Naming

Naming plays a crucial role in computer systems, and the file system in UNIX-like systems exemplifies this. Virtually everything in such systems, including files, devices, pipes, and even processes, is named through the file system. This uniform naming approach simplifies the conceptual model of the system and enhances its modularity, and it is often summarized as “everything is a file.” When designing systems or interfaces, careful consideration should be given to the choice of names to ensure clarity and consistency.

Harnessing the Convenience of File Names

One of the key advantages provided by the file system is the ability to assign meaningful names to files. Names serve as a crucial first step in accessing any resource, allowing users to easily identify and refer to specific files. Whether it’s a document, a multimedia file, or a program source code, the file system provides a convenient and consistent way to name and locate the files we need.

Exploring File Creation, Access, and Deletion in the Linux File System

In this section, we will dive into the fundamental operations of creating, accessing, and deleting files in the Linux file system. Although these operations may seem straightforward, we will uncover some intriguing aspects along the way, including the mysterious unlink() call used for file removal.

`open()` and File Descriptors

When a file is created or accessed in a UNIX system, the open() system call is used. One notable feature of open() is that it returns a file descriptor: a small integer that is private to the calling process. The same number in another process may refer to a different file, or to nothing at all. The file descriptor serves as a handle that the process uses to perform operations on the file.

Think of a file descriptor as a capability — a powerful tool that provides access to specific file-related operations. With a file descriptor, you can read from or write to the file, assuming you have the necessary permissions. Additionally, the file descriptor acts as a pointer to a file object, allowing you to invoke other methods like read() and write() to interact with the file’s content.

The Anatomy of `cat` and File Operations

Let’s take a closer look at the cat command as an example to understand how file operations work in practice. Consider the following sequence of commands; the steps described afterwards are the system calls that a tracing tool such as strace reveals when cat runs:

prompt> echo hello > foo
prompt> cat foo
hello
prompt>

When cat is executed on the file foo, it performs the following steps:

Opening the file: cat uses the open() system call to open the file foo for reading. In the system-call trace, we observe that the call to open() succeeds and returns a file descriptor of 3. The file is opened in read-only mode (O_RDONLY flag) and with support for 64-bit offsets (O_LARGEFILE flag).
Reading the file: After successfully opening the file, cat employs the read() system call to read a certain number of bytes from the file. The first argument passed to read() is the file descriptor, which informs the file system about the target file. In our example, the result of the read operation is the string “hello”, which is stored in a buffer.
Writing to standard output: The system-call trace also reveals a call to the write() system call with file descriptor 1. File descriptor 1 corresponds to standard output, which is typically the screen. Here, the program writes the word “hello” to the screen using file descriptor 1. (Sidenote: file descriptor 0 corresponds to standard input, and file descriptor 2 corresponds to standard error.)
Closing the file: Once the file has been fully read, cat attempts to read more bytes. However, since there are no remaining bytes in the file, the read() system call returns 0. This indicates to the program that it has reached the end of the file. To signify that it is done with the file, cat invokes the close() system call, passing in the file descriptor associated with the file. As a result, the file is closed, and the reading operation is considered complete.
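
The four steps above can be sketched in a few lines of C. This is a minimal illustration rather than the real cat implementation; the function name cat_file and the 4096-byte buffer size are assumptions made for the example.

```c
#include <fcntl.h>   // open, O_RDONLY
#include <unistd.h>  // read, write, close, STDOUT_FILENO

// Copy the contents of `path` to standard output, the way cat does.
// Returns the number of bytes copied, or -1 on error.
// (cat_file is a hypothetical helper name, not a system call.)
long cat_file(const char *path) {
    int fd = open(path, O_RDONLY);          // step 1: open for reading
    if (fd == -1)
        return -1;

    char buf[4096];
    long total = 0;
    ssize_t n;
    // step 2: read() returns 0 at end of file, which ends the loop
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        ssize_t done = 0;
        // step 3: write to fd 1; write() may accept fewer bytes than asked
        while (done < n) {
            ssize_t w = write(STDOUT_FILENO, buf + done, (size_t)(n - done));
            if (w == -1) {
                close(fd);
                return -1;
            }
            done += w;
        }
        total += n;
    }

    close(fd);                              // step 4: done with the file
    return n == -1 ? -1 : total;
}
```

Calling cat_file("foo") on the file created above prints hello followed by a newline and returns 6, the number of bytes in the file.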

Writing Files and the Similar Workflow

Writing files follows a similar sequence of steps. First, a file is opened for writing using the open() function. Then, the write() system call is used to write data to the file, potentially in multiple iterations for larger files. Finally, the close() system call is employed to indicate that the writing process is finished.

For a hands-on experience with file writes, you can use the strace command to trace writes to a file, either in a program you have written or by tracing utilities like dd.
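
As a small program to trace, here is a sketch of the open, write, close sequence. The helper name write_text_file and the 0644 permission bits are choices made for this example; the loop is there because write() is allowed to write fewer bytes than requested.

```c
#include <fcntl.h>   // open, O_WRONLY, O_CREAT, O_TRUNC
#include <string.h>  // strlen
#include <unistd.h>  // write, close

// Write the string `text` to `path`, creating or truncating the file.
// Returns 0 on success, -1 on error.
// (write_text_file is a hypothetical helper name.)
int write_text_file(const char *path, const char *text) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    size_t left = strlen(text);
    while (left > 0) {
        ssize_t w = write(fd, text, left);  // may be a partial write
        if (w == -1) {
            close(fd);
            return -1;
        }
        text += w;
        left -= (size_t)w;
    }
    return close(fd);                       // 0 on success, -1 on error
}
```

Wrapping a call such as write_text_file("foo", "hello\n") in a main() and running the result under strace should show the same three system calls described above.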

Reading and Writing at Specific Offsets (Random Access)

In the previous sections, we focused on the sequential reading and writing of files, where the data is processed from the beginning to the end. However, there are scenarios where it’s beneficial to read from or write to specific offsets within a file, especially when dealing with tasks like building an index for text documents or looking up specific information.

To accomplish this, we can utilize the lseek() system call, which allows us to set the file offset to a particular location within the file.

off_t lseek(int fildes, off_t offset, int whence);

Let’s explore the arguments of lseek():

The first argument (fildes) is the file descriptor, which identifies the file being accessed. We pass the file descriptor obtained from the open() system call to indicate which file to manipulate.
The second argument (offset) determines the position in the file where the file offset should be set. It specifies the number of bytes from a particular reference point.
The third argument (whence) determines how the seek operation is performed. There are three possible values for whence:
- SEEK_SET: Sets the offset to the specified number of bytes.
- SEEK_CUR: Sets the offset to the current location plus the specified number of bytes.
- SEEK_END: Sets the offset to the size of the file plus the specified number of bytes.

Each process maintains a current offset for each file it opens. This offset represents the position from which the next read or write operation will start within the file. When a read or write operation is performed, the offset is implicitly updated based on the number of bytes read or written. Additionally, the lseek() system call can explicitly change the offset based on the specified reference point.

The operating system stores the file offset in a per-open-file structure. In the xv6 teaching operating system this is struct file; here is a simplified version of its definition:

struct file {
  int ref;
  char readable;
  char writable;
  struct inode *ip;
  uint off;
};

In this structure, we can observe that the current offset (off) is maintained alongside other information such as the reference count (ref), readability (readable), writability (writable), and a pointer to the underlying file (ip).

These file structures collectively represent all the currently opened files in the system and are often referred to as the “open file table.” In the xv6 operating system, this table is implemented as follows:

struct {
  struct spinlock lock;
  struct file file[NFILE];
} ftable;

Let’s illustrate the behavior of file access and offsets through a few examples:

1. Sequential Read: Consider a process that opens a file named “file” (with a size of 300 bytes) and reads it by calling the read() system call repeatedly, reading 100 bytes at a time. The following table shows the relevant system calls, their return codes, and the value of the current offset for each operation:

System Calls                      Return Code     Current Offset
---------------------------------------------------------------------
fd = open("file", O_RDONLY);      3               0
read(fd, buffer, 100);            100             100
read(fd, buffer, 100);            100             200
read(fd, buffer, 100);            100             300
read(fd, buffer, 100);            0               300
close(fd);                        0               –

In this example, we can observe how the current offset gets initialized to zero when the file is opened. With each read() call, the offset is incremented by the number of bytes read. When the process attempts to read beyond the end of the file, the read() call returns zero, indicating that the entire file has been read.

2. Multiple File Descriptors: Let’s trace a process that opens the same file twice and performs a read operation on each file descriptor:

System Calls                       Return Code     OFT[10] Current Offset   OFT[11] Current Offset
--------------------------------------------------------------------------------------------------
fd1 = open("file", O_RDONLY);      3               0                        –
fd2 = open("file", O_RDONLY);      4               0                        0
read(fd1, buffer1, 100);           100             100                      0
read(fd2, buffer2, 100);           100             100                      100
close(fd1);                        0               –                        100
close(fd2);                        0               –                        –

In this example, two file descriptors (fd1 and fd2) are allocated, each referring to a different entry in the open file table (entries 10 and 11, as shown in the table heading). The current offset for each file descriptor is updated independently.

3. Random Access: Suppose a process uses lseek() to reposition the current offset before performing a read operation. In this case, only a single open file table entry is needed:

System Calls                      Return Code     Current Offset
---------------------------------------------------------------------
fd = open("file", O_RDONLY);      3               0
lseek(fd, 200, SEEK_SET);         200             200
read(fd, buffer, 50);             50              250
close(fd);                        0               –

Here, the lseek() call sets the current offset to 200. The subsequent read() operation reads the next 50 bytes and updates the current offset accordingly.

In summary, random access allows us to read from or write to specific offsets within a file by utilizing the lseek() system call to modify the current offset associated with a file descriptor. This flexibility enables us to handle various file operations efficiently, such as indexing, searching, and manipulating specific sections of a file without the need to process the entire file sequentially.
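
To make the three whence values more concrete, here is a short sketch that uses SEEK_SET to read from a chosen position and SEEK_END to find a file's size. The helper names read_at and file_size are invented for this example and are not part of any standard API.

```c
#include <fcntl.h>   // open, O_RDONLY
#include <unistd.h>  // lseek, read, close

// Read up to `len` bytes starting `offset` bytes into the file at `path`.
// Returns the number of bytes read (0 at or past end of file), or -1 on error.
// (read_at is a hypothetical helper name.)
ssize_t read_at(const char *path, off_t offset, char *buf, size_t len) {
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    // SEEK_SET: the new offset is exactly `offset` bytes from the start
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    ssize_t n = read(fd, buf, len);
    close(fd);
    return n;
}

// Return the size of the file in bytes, or -1 on error.
// SEEK_END with an offset of 0 moves to the end, and lseek() returns
// the resulting offset, which equals the file size.
off_t file_size(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    off_t size = lseek(fd, 0, SEEK_END);
    close(fd);
    return size;
}
```

For the 300-byte file in the table above, read_at("file", 200, buffer, 50) would return 50 and file_size("file") would return 300. Note that lseek() only changes a number kept in memory by the operating system; it does not itself cause any disk I/O.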

Shared File Table Entries with fork() and dup()

In most cases, each file descriptor in a process is associated with a unique entry in the open file table. This one-to-one mapping means that each process has independent access to the file, with its own current offset maintained in the open file table. However, there are scenarios where file table entries can be shared between processes, leading to shared access and shared offsets.

One such case occurs when a parent process creates a child process using the fork() system call. During a fork(), the child receives a copy of the parent’s file descriptor table, but each copied descriptor refers to the same open file table entry as the parent’s descriptor. The reference count of each shared entry is incremented, indicating that it is now used by both the parent and the child.

Consider the following code snippet:

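#include <fcntl.h>     // open, O_RDONLY
#include <stdio.h>     // printf, perror
#include <stdlib.h>    // exit
#include <sys/wait.h>  // wait
#include <unistd.h>    // fork, lseek, close

int main(void) {
  int fd = open("file.txt", O_RDONLY);
  if (fd == -1) {
    perror("open");
    return 1;
  }
  off_t offset;

  if (fork() == 0) {
    // Child process: move the shared offset, then exit
    offset = lseek(fd, 10, SEEK_SET);
    printf("child: offset %ld\n", (long) offset);
    exit(0);
  } else {
    // Parent process: wait for the child, then read the offset back
    wait(NULL);
    offset = lseek(fd, 0, SEEK_CUR);
    printf("parent: offset %ld\n", (long) offset);
  }

  close(fd);
  return 0;
}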

In this example, the parent process opens the file “file.txt” and then creates a child process using fork(). The child process adjusts the current offset of the shared file descriptor using lseek() and then exits. The parent process waits for the child to complete and then checks the current offset of the shared file descriptor.

When we run this program, the output will be:

child: offset 10
parent: offset 10

Here, we can see that the child process’s adjustment of the current offset affects the shared file descriptor. Both the parent and the child processes have access to the same file table entry and share the same offset.

It’s worth noting that when a file table entry is shared, its reference count is incremented. The entry will only be removed when both processes close the file descriptor or exit. This ensures that the entry remains valid as long as it is being used by any of the processes sharing it.

Another scenario where file table entries can be shared is through the dup() system call, which duplicates an existing file descriptor. The duplicated file descriptor refers to the same file table entry as the original file descriptor, resulting in shared access and shared offsets.
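
The difference between dup() and a second open() can be checked with a small experiment. The sketch below reads 4 bytes through one descriptor and then asks where a second descriptor's offset is; the function name offset_seen_by_second_fd and the 4-byte read size are assumptions made for the example.

```c
#include <fcntl.h>   // open, O_RDONLY
#include <unistd.h>  // dup, read, lseek, close

// Open `path`, obtain a second descriptor either with dup() (use_dup != 0)
// or with a second open() (use_dup == 0), read 4 bytes through the first
// descriptor, and return the current offset seen through the second one.
// Returns -1 on error or if the file has fewer than 4 bytes.
// (offset_seen_by_second_fd is a hypothetical helper name.)
off_t offset_seen_by_second_fd(const char *path, int use_dup) {
    char buf[4];
    int fd1 = open(path, O_RDONLY);
    if (fd1 == -1)
        return -1;

    int fd2 = use_dup ? dup(fd1) : open(path, O_RDONLY);
    if (fd2 == -1) {
        close(fd1);
        return -1;
    }

    off_t result = -1;
    if (read(fd1, buf, sizeof buf) == (ssize_t)sizeof buf)
        result = lseek(fd2, 0, SEEK_CUR);   // query fd2's offset without moving it

    close(fd1);
    close(fd2);
    return result;
}
```

With dup() the function returns 4, because both descriptors refer to one open file table entry and therefore one offset. With a second open() it returns 0, because each open() creates its own entry, as in the “Multiple File Descriptors” trace earlier.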

Shared file table entries provide a mechanism for processes to interact and manipulate the same file concurrently. However, caution must be exercised to ensure proper synchronization and coordination between the processes to avoid data corruption or unexpected behavior.

Immediate Write with fsync()

In many cases, when a program calls the write() system call to write data to a file, the file system buffers the writes in memory for a certain period of time before actually committing them to persistent storage. This buffering improves performance by reducing the number of disk writes and optimizing the order of writes.

However, some applications require stronger guarantees about data durability and need to ensure that the data is immediately written to disk. This is particularly important for applications like database management systems (DBMS) that rely on correct recovery protocols.

To support such applications, most file systems provide additional control APIs, and in the UNIX world, this interface is known as fsync(int fd). When a process calls fsync() for a specific file descriptor, the file system responds by forcing all the dirty (i.e., not yet written) data associated with that file descriptor to be written to disk. The fsync() call ensures that the data is durably stored on disk before the function returns.

Here’s an example of using fsync():

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("file.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    const char *data = "Hello, world!";
    ssize_t bytesWritten = write(fd, data, strlen(data));
    if (bytesWritten == -1) {
        perror("write");
        close(fd);
        return 1;
    }

    if (fsync(fd) == -1) {
        perror("fsync");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}

In this example, the program opens a file “file.txt” for writing and writes the string “Hello, world!” to the file using the write() system call. After that, it calls fsync() to ensure that the data is immediately written to disk before proceeding. Finally, it closes the file.

The fsync() call ensures that any modifications made to the file descriptor are durably stored on disk, providing a guarantee that the data will survive even if a system crash occurs shortly after the call. This is crucial for applications that require strong data durability guarantees.

It’s important to note that fsync() can be an expensive operation, as it involves synchronizing disk I/O. Therefore, it should be used judiciously and only when necessary to meet the application’s requirements for durability.

Renaming Files

The rename() system call provides an atomic way to rename files in most cases. This atomicity ensures that if a system crash occurs during the renaming process, the file will either be named with the old name or the new name, avoiding any inconsistent or intermediate states.

The ability to atomically rename files is crucial for certain types of applications that require consistent and atomic updates to file states.

Let’s consider a specific scenario to illustrate the use of rename(). Imagine you are using a file editor, such as Emacs, and you want to insert a line into the middle of a file. The file is named foo.txt. The editor can achieve this by following these steps:

int fd = open("foo.txt.tmp", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
write(fd, buffer, size); // write out the new version of the file
fsync(fd);
close(fd);
rename("foo.txt.tmp", "foo.txt");

In this example, the file editor first opens a new temporary file named foo.txt.tmp using the open() system call. It opens the file with the O_WRONLY flag to write to it, the O_CREAT flag to create the file if it doesn’t exist, and the O_TRUNC flag to truncate it if it already exists. The file is created with read and write permissions for the owner (S_IRUSR | S_IWUSR).

Next, the editor writes the new version of the file to the temporary file using the write() system call. The buffer contains the content to be written, and size represents the size of the content.

To ensure the data is durably written to disk, the editor calls fsync() to synchronize the file’s content and metadata with the underlying storage device.

After the synchronization is complete, the editor closes the temporary file using close().

Finally, the editor uses the rename() system call to atomically replace the original file foo.txt with the temporary file foo.txt.tmp. After the call, the name foo.txt refers to the new contents and the name foo.txt.tmp no longer exists. This atomic operation guarantees that foo.txt is always either the complete old version or the complete new version with the inserted line. There will be no inconsistent or intermediate states in the event of a system crash during the renaming process.

By using this approach, the file editor achieves an atomic update to the file’s content, ensuring data consistency and integrity.

It’s worth noting that the atomicity of the rename() operation depends on the underlying file system implementation. Most modern file systems provide atomic renames, but there might be exceptions in certain specialized file systems or rare circumstances. Therefore, it’s important to consult the documentation or specifications of the specific file system being used to ensure the atomicity guarantee.
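
The editor's sequence above omits error handling for brevity. Here is one way it might be wrapped into a reusable function with each step checked. The name atomic_replace, the ".tmp" suffix, and the 4096-byte path buffer are assumptions for this sketch, not something prescribed by the rename() interface.

```c
#include <fcntl.h>     // open, O_WRONLY, O_CREAT, O_TRUNC
#include <stdio.h>     // rename, snprintf
#include <sys/stat.h>  // S_IRUSR, S_IWUSR
#include <unistd.h>    // write, fsync, close, unlink

// Write all `left` bytes at `p` to `fd`, retrying after partial writes.
static int write_all(int fd, const char *p, size_t left) {
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w == -1)
            return -1;
        p += w;
        left -= (size_t)w;
    }
    return 0;
}

// Replace the contents of `path` with `data` so that readers see either
// the old file or the new one, never a partial mix.
// Returns 0 on success, -1 on error.
// (atomic_replace is a hypothetical helper name.)
int atomic_replace(const char *path, const void *data, size_t size) {
    char tmp[4096];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (len < 0 || (size_t)len >= sizeof tmp)
        return -1;                          // path too long for the buffer

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    // Write the new version and force it to disk before renaming.
    int ok = write_all(fd, data, size) == 0 && fsync(fd) == 0;
    if (close(fd) == -1)
        ok = 0;

    if (ok && rename(tmp, path) == 0)
        return 0;

    unlink(tmp);                            // clean up the temporary file
    return -1;
}
```

A call such as atomic_replace("foo.txt", buffer, size) then performs the whole update. On any failure the original foo.txt is left untouched, because it is only affected by the final rename().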

Getting Information About Files

In addition to providing file access, file systems also store metadata about each file they contain. Metadata refers to the data that describes the characteristics and properties of a file, such as its size, permissions, timestamps, and other attributes. To retrieve the metadata of a file, we can use the stat() or fstat() system calls, which provide information about a file based on its pathname or file descriptor, respectively. These system calls populate a structure called stat with the relevant metadata.

Underlying file systems store this metadata in a data structure called an inode. Each file in the file system has a corresponding inode that holds its metadata. Inodes are typically stored on disk, and the file system maintains a cache of active inodes in memory to improve access performance.

By using the stat() or fstat() system calls, we can obtain information about files, including but not limited to:

File size: The size of the file in bytes.
Permissions: The access permissions for the file, such as read, write, and execute permissions for the owner, group, and others.
Owner and group: The user and group identifiers associated with the file.
Timestamps: The timestamps indicate when the file was last modified, accessed, and when its metadata (inode) was last changed.
File type: The type of file, such as a regular file, directory, symbolic link, etc.
File system ID: The identifier of the file system where the file is located.
Inode number: The unique identifier assigned to the file’s inode.
Device ID: The identifier of the device or partition where the file is stored.

By retrieving this metadata, applications can obtain detailed information about files, make decisions based on file properties, and perform various file-related operations.

It’s important to note that the specific fields and information available in the stat structure may vary depending on the operating system and file system in use. It’s recommended to refer to the documentation or man pages specific to your system for detailed information on the stat structure and the supported metadata fields.
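
As an illustration, the sketch below calls stat() and prints a handful of the fields listed above. The function name print_file_info is made up for this example; the st_* members it reads are the standard POSIX ones, though their exact integer types differ between systems, which is why each value is cast before printing.

```c
#include <stdio.h>     // printf, perror
#include <sys/stat.h>  // stat, struct stat, S_ISDIR, S_ISREG

// Print some of the metadata stored in the inode of `path`.
// Returns 0 on success, -1 if stat() fails.
// (print_file_info is a hypothetical helper name.)
int print_file_info(const char *path) {
    struct stat sb;
    if (stat(path, &sb) == -1) {
        perror("stat");
        return -1;
    }

    const char *type = S_ISDIR(sb.st_mode) ? "directory"
                     : S_ISREG(sb.st_mode) ? "regular file"
                     : "other";

    printf("file:        %s\n", path);
    printf("type:        %s\n", type);
    printf("size:        %lld bytes\n", (long long)sb.st_size);
    printf("inode:       %llu\n", (unsigned long long)sb.st_ino);
    printf("link count:  %lu\n", (unsigned long)sb.st_nlink);
    printf("permissions: %o\n", (unsigned)(sb.st_mode & 07777));
    printf("owner uid:   %lu\n", (unsigned long)sb.st_uid);
    printf("modified:    %lld (seconds since the epoch)\n",
           (long long)sb.st_mtime);
    return 0;
}
```

Calling print_file_info("foo") should report much the same information that the stat command-line tool shows for the same file.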

Making Directories

To create a directory in a file system, the mkdir command or system call is used. When the mkdir command is invoked, it creates a new directory with the specified name and permissions. The following is an example of using the mkdir command:

prompt> mkdir foo

In this example, a directory named “foo” is created. The mkdir command returns 0 on success, indicating that the directory creation was successful.

When a directory is created, it is initially considered “empty,” meaning it does not contain any files or subdirectories. However, it does have two entries that are automatically created: “.” and “..”.

The “.” (dot) entry refers to the current directory itself. It allows programs to refer to the current directory explicitly.
The “..” (dot-dot) entry refers to the parent directory of the current directory. It allows programs to navigate to the parent directory easily.

You can view these special entries by using the ls command with the “-a” flag, which displays all files, including hidden files and directories:

prompt> ls -a
./  ../

In the output, “./” represents the current directory (dot), and “../” represents the parent directory (dot-dot).

When using the ls command with the “-l” flag, you can obtain detailed information about the directories, such as permissions, ownership, size, and timestamps:

prompt> ls -al
total 8
drwxr-x---  2 remzi remzi  6 Apr 30 16:17 ./
drwxr-x--- 26 remzi remzi 4096 Apr 30 16:17 ../

In the output, the first column indicates the file type and permissions. In this case, “d” indicates a directory, the first “rwx” gives the owner read, write, and execute permissions, “r-x” gives the group read and execute permissions, and the final “---” means that others have no access. The subsequent columns provide information about the directory, such as the number of links, owner, group, size, last modification timestamp, and the directory name (“./” and “../”).

Creating directories is essential for organizing files and maintaining a hierarchical structure within a file system. Directories allow for the grouping and categorization of related files and provide a means for efficient file organization and retrieval.
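
From a program, the same operation is available as the mkdir() system call. The sketch below creates a directory, counts its entries to confirm that a new directory holds only “.” and “..”, and then removes it again. The function names and the 0750 mode are choices made for this example. The count of 2 is what common Linux file systems report; POSIX does not strictly require readdir() to return the dot entries.

```c
#include <dirent.h>    // opendir, readdir, closedir
#include <sys/stat.h>  // mkdir
#include <unistd.h>    // rmdir

// Count the entries in the directory at `path`, including "." and "..".
// Returns the count, or -1 if the directory cannot be opened.
int count_entries(const char *path) {
    DIR *dp = opendir(path);
    if (dp == NULL)
        return -1;
    int n = 0;
    while (readdir(dp) != NULL)
        n++;
    closedir(dp);
    return n;
}

// Create a directory, count its entries, and remove it again.
// Returns the entry count of the fresh directory, or -1 if mkdir() fails
// (for example because `path` already exists).
// (make_and_inspect is a hypothetical helper name.)
int make_and_inspect(const char *path) {
    if (mkdir(path, 0750) == -1)   // rwxr-x---, before the umask is applied
        return -1;
    int n = count_entries(path);
    rmdir(path);                   // rmdir() only succeeds on an empty directory
    return n;
}
```

On a typical Linux system, make_and_inspect("foo") returns 2 when no entry named foo exists yet.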

Reading Directories

To read the contents of a directory, programs use the opendir(), readdir(), and closedir() library functions rather than reading the directory with read(). The readdir() function allows you to iterate over the entries in a directory and retrieve information about each entry.

In the provided code snippet, a program similar to ls is implemented to read the current directory and print information about each file and subdirectory:

#include <stdio.h>
#include <dirent.h>
#include <assert.h>

int main(int argc, char *argv[]) {
    DIR *dp = opendir(".");
    assert(dp != NULL);
    struct dirent *d;
    while ((d = readdir(dp)) != NULL) {
        printf("%lu %s\n", (unsigned long) d->d_ino, d->d_name);
    }
    closedir(dp);
    return 0;
}

In this program, the opendir function is used to open the current directory ("."). It returns a pointer to a DIR structure, which represents the open directory, or NULL on failure; the assert stops the program if the directory could not be opened.

The program then uses a loop to iterate over the directory entries using the readdir function. The readdir function returns a pointer to a struct dirent structure, which contains information about each directory entry. The d_name field of the structure represents the name of the file or subdirectory.

The program prints the inode number (d_ino) and the name of each entry using printf. You can modify the output format to display additional information or customize it according to your requirements.

After processing all the directory entries, the program closes the directory using the closedir function to release the resources associated with it.

The struct dirent structure provides additional fields for retrieving information about each directory entry, such as the file type (d_type) and the record length (d_reclen). If you need more detailed information about a file, you can use the stat function or related system calls to obtain additional metadata.

Overall, reading directories allows you to enumerate the files and subdirectories contained within a directory and perform various operations on them, such as displaying their names, accessing their metadata, or performing further processing based on the directory contents.
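
Combining readdir() with stat() gives something closer to ls -l. The following sketch prints the size and name of each entry and marks directories with a trailing slash. The function name list_with_sizes and the 4096-byte path buffer are assumptions for the example, and entries that cannot be examined with stat() are simply skipped.

```c
#include <dirent.h>    // opendir, readdir, closedir
#include <stdio.h>     // printf, snprintf
#include <sys/stat.h>  // stat, S_ISDIR

// Print "<size> <name>" for each entry of `dirpath`.
// Returns the number of entries printed, or -1 if the directory
// cannot be opened.
// (list_with_sizes is a hypothetical helper name.)
int list_with_sizes(const char *dirpath) {
    DIR *dp = opendir(dirpath);
    if (dp == NULL)
        return -1;

    int count = 0;
    struct dirent *d;
    while ((d = readdir(dp)) != NULL) {
        // readdir() yields bare names, so build "dirpath/name" for stat()
        char path[4096];
        int len = snprintf(path, sizeof path, "%s/%s", dirpath, d->d_name);
        if (len < 0 || (size_t)len >= sizeof path)
            continue;                       // name too long for the buffer

        struct stat sb;
        if (stat(path, &sb) == -1)
            continue;                       // e.g. entry vanished meanwhile

        printf("%10lld %s%s\n", (long long)sb.st_size, d->d_name,
               S_ISDIR(sb.st_mode) ? "/" : "");
        count++;
    }
    closedir(dp);
    return count;
}
```

The extra stat() call per entry is the reason a long listing is noticeably more expensive than a plain list of names.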

Hard Links

In a file system, a hard link is a mechanism that allows multiple filenames to be associated with the same underlying file. The link() system call is used to create a hard link, and it takes two arguments: an existing file name (old pathname) and a new file name (new pathname). By creating a hard link, you essentially create another entry in the file system that points to the same inode (i.e., the low-level representation of the file).

For example, using the ln command-line program, you can create a hard link to an existing file:

prompt> echo hello > file
prompt> cat file
hello
prompt> ln file file2
prompt> cat file2
hello

In this example, we first create a file named file with the content “hello”. Then, we use the ln command to create a hard link named file2 that points to the same file. Both file and file2 now refer to the same underlying data.

When you examine the files using the ls command with the -i option, you can see that both files have the same inode number:

prompt> ls -i file file2
67158084 file
67158084 file2
prompt>

The unlink() system call is used to remove a file from the file system. It works by removing the link between the human-readable file name and the inode number associated with the file. When unlink() is called, it decrements the reference count (also known as the link count) within the inode. The reference count keeps track of how many hard links are associated with the inode. Only when the reference count reaches zero, indicating that no more links exist, will the file system free the inode and related data blocks, effectively deleting the file.

In the example above, even if we remove the file named file, we can still access the file through the remaining hard link, file2:

prompt> rm file
removed 'file'
prompt> cat file2
hello

The file system recognizes that the reference count for the inode associated with the file has not reached zero because there is still one remaining link (file2). Therefore, the file’s contents can still be accessed using the remaining link.

Hard links provide a way to have multiple names for the same file, and removing any one of the links does not affect the actual file data until all links are removed and the reference count reaches zero.
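
The shell session above can be reproduced with the link() and unlink() system calls directly. In this sketch the link count is read from the st_nlink field of struct stat; the function names print_link_info and hard_link_demo are invented for the example.

```c
#include <stdio.h>     // printf
#include <sys/stat.h>  // stat, struct stat
#include <unistd.h>    // link, unlink

// Print the inode number and link count of `path`.
// Returns the link count, or -1 if the name does not exist.
// (print_link_info is a hypothetical helper name.)
long print_link_info(const char *path) {
    struct stat sb;
    if (stat(path, &sb) == -1)
        return -1;
    printf("%s: inode %llu, link count %lu\n", path,
           (unsigned long long)sb.st_ino, (unsigned long)sb.st_nlink);
    return (long)sb.st_nlink;
}

// Give the existing file `name` a second name `alias`, then remove the
// original name. The data stays reachable through `alias`.
// Returns 0 on success, -1 on error.
// (hard_link_demo is a hypothetical helper name.)
int hard_link_demo(const char *name, const char *alias) {
    if (link(name, alias) == -1)   // like: ln name alias
        return -1;
    print_link_info(name);         // both names now show the same inode
    print_link_info(alias);        // and a link count one higher than before

    if (unlink(name) == -1)        // like: rm name
        return -1;
    print_link_info(alias);        // link count has dropped by one
    return 0;
}
```

Starting from a file with a single name, hard_link_demo("file", "file2") shows the link count going to 2 and then back to 1, after which the contents are still readable through file2.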

Conclusion

In this discussion of file systems, we explored several important concepts related to file management and operations. Here are the key points we covered:

File systems are responsible for managing files on storage devices and providing an organized structure for accessing and storing data.
File operations such as opening, reading, writing, and closing files are performed through system calls like open(), read(), write(), and close().
The file system maintains an open file table to keep track of open files, their current offsets, and other relevant information.
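Processes can share file table entries when a parent process creates a child process using fork() or when a descriptor is duplicated with dup(). Descriptors that share an entry also share a single current offset, whereas separate open() calls on the same file each get their own offset.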
To ensure data durability, the fsync() system call is used to force writes to disk, particularly important for applications like database management systems.
File metadata, such as file attributes and permissions, can be obtained using system calls like stat() and fstat(), providing information about the files stored in the file system.
Directories are special types of files that contain entries mapping filenames to inode numbers. When a directory is created, it initially contains entries for itself (“.”) and its parent directory (“..”).
The opendir() and readdir() functions allow programs to read the contents of a directory, retrieving information about the files and directories it contains.
Hard links provide a mechanism for creating multiple names (links) for the same file. The link() system call creates a hard link, associating another filename with an existing file’s inode.
Removing a file is done using the unlink() system call, which removes a link between a filename and the associated inode. The file is only deleted when the reference count (link count) reaches zero.

Understanding these fundamental concepts of file systems is crucial for developing efficient and reliable applications that interact with files and directories. By leveraging the various system calls and techniques available, developers can manipulate files, manage file metadata, read directory contents, and create multiple links to the same file, providing flexibility and control over file operations.

Resources
Operating Systems: Three Easy Pieces – Files and Directories
