The Design of the NetBSD I/O Subsystems


Textbook, 2016

287 Pages



Contents

Preface

Part I  Basics to Learn Filesystem

1  Welcome to the World of Kernel !
   1.1  How Does a System Call Dive into Kernel from User Program ?
        1.1.1  Example: write system call
        1.1.2  Ultra SPARC 0x7c CPU Trap
        1.1.3  Jump to the File Descriptor Layer
        1.1.4  Arriving at Virtual Filesystem Operations
   1.2  General Data Structures in Kernel such as List, Hash, Queue, ...
        1.2.1  Linked-Lists
        1.2.2  Tail Queues
        1.2.3  Hash
   1.3  Waiting and Sleeping in Kernel
   1.4  Kernel Lock Manager
        1.4.1  simplelock and lock
        1.4.2  Simplelock Interfaces
        1.4.3  Lock Interfaces
   1.5  Kernel Memory Allocation
   1.6  Resource Pool Manager
        1.6.1  Design of Resource-Pool Manager
        1.6.2  Initializing a pool
        1.6.3  Destroying a Pool
        1.6.4  Allocating Items from a Pool
        1.6.5  Returning Items to a Pool
        1.6.6  Using Cache to Speed Up
        1.6.7  Other Resource-Pool Manager API

2  I/O System
   2.1  I/O Mapping from User to Device
        2.1.1  Device Drivers
        2.1.2  I/O Queueing
        2.1.3  Interrupt Handling
   2.2  Block Devices
        2.2.1  Entry Points for Block-Device Drivers
        2.2.2  Disk Labels
   2.3  Character Devices
        2.3.1  Raw Devices and Physical I/O
        2.3.2  Entry Points for Character-Device Drivers
   2.4  Descriptor Management
        2.4.1  File Descriptor, Descriptor Table, and File Entry
        2.4.2  What does the File Entry Points ?
        2.4.3  Movement of Data Inside the Kernel: uiomove function

3  Virtual File System
   3.1  Architecture of Virtual File System
        3.1.1  Why VFS is needed ?
        3.1.2  What Is in the Vnode ?
        3.1.3  How to Call Vnode Operations ?
   3.2  Virtual Filesystem Initialization
        3.2.1  Initializing the namei pathname buffer pool
        3.2.2  Initializing the vnode table
        3.2.3  Initializing the Name Cache Buffer
        3.2.4  Initialize the Special Vnode Operations
   3.3  Attaching Available Static File System
        3.3.1  Set vnode attribute to empty
        3.3.2  How is vfs_list_initial initialized ?
        3.3.3  Establish a filesystem and initialize it
        3.3.4  Fast Filesystem Initialization
        3.3.5  Soft Dependency Module Initialization
        3.3.6  UFS Initialization
   3.4  Virtual Filesystem Operations
   3.5  References to Source Code
        3.5.1  kern/vfs_init.c - 334 lines, 7 functions

4  Buffer Cache
   4.1  Buffer Cache Header
   4.2  Buffer Cache Contents
        4.2.1  Allocation Virtual Memory to Buffer Cache
        4.2.2  Identifying Buffer
   4.3  Buffer Hash
   4.4  Buffer Cache Free Lists
        4.4.1  LOCKED List
        4.4.2  LRU List
        4.4.3  AGE List
        4.4.4  EMPTY List
   4.5  Buffer Cache Initialization
        4.5.1  Physical Memory Allocation
        4.5.2  Initialization of Hash and Free List
   4.6  Buffer Cache Operation
        4.6.1  Finding a Buffer Cache from Hash: incore function
   4.7  Managing Buffer Cache Free Lists
        4.7.1  Reusing Old Buffer: bremfree function
        4.7.2  Allocating a New Buffer: getnewbuf function
        4.7.3  Adjusting Buffer Size: allocbuf function
        4.7.4  Getting a Buffer: getblk function
   4.8  Allocating and Reading Filesystem with Buffer Cache
        4.8.1  Just Read: bread function
        4.8.2  Read Ahead Multiple Buffers: breadn function
        4.8.3  Read Ahead a Single Buffer: breada function
   4.9  Releasing Buffer Cache
        4.9.1  Just Release: brelse function
        4.9.2  Delayed Write: bdwrite function
        4.9.3  Asynchronous Write: bawrite function
        4.9.4  Synchronous Write: bwrite function
   4.10 References to Source Code
        4.10.1  kern/vfs_bio.c - 334 lines, 21 functions

5  Vnode
   5.1  Introduction
   5.2  Vnode Management Function
        5.2.1  Vnode Flag
        5.2.2  Reference Counts
        5.2.3  Vnode Identifier
        5.2.4  Links to Virtual File System Information
        5.2.5  Vnode Cache
        5.2.6  Type of Object
        5.2.7  Vnode Lock
        5.2.8  Private Area
        5.2.9  Other Vnode-Manipulating Functions
   5.3  Vnode Attributes
   5.4  Vnode Operation about Filesystem Hierarchy
        5.4.1  Overview
        5.4.2  componentname structure
        5.4.3  Pathname Searching
        5.4.4  Name Creation
        5.4.5  Name Change/Deletion
        5.4.6  Attribute Manipulation
        5.4.7  Object Interpretation
        5.4.8  Process Control
        5.4.9  Object Management
   5.5  Vnode Operation about Storage
        5.5.1  Object Creation and Deletion
        5.5.2  Attribute Update
        5.5.3  Object Read and Write
        5.5.4  Change in Space Allocation
   5.6  High-Level Vnode Convenient Function
        5.6.1  Filesystem Hierarchy
        5.6.2  General File I/O
        5.6.3  Advanced I/O
   5.7  References to Source Code
        5.7.1  vfs_subr.c - 2846 lines, 57 functions
        5.7.2  vfs_vnops.c - 808 lines, 19 functions
        5.7.3  vfs_syscalls.c - 3116 lines, 65 functions

6  UVM
   6.1  Introduction
   6.2  UVM Overview
        6.2.1  Virtual Memory Space
        6.2.2  Memory Map
        6.2.3  Memory Object
        6.2.4  Pager
        6.2.5  Page
   6.3  UVM External Interface
   6.4  Memory Mapping Files and Devices
        6.4.1  Attaching a Memory Object to Vnode: uvn_attach
        6.4.2  Setting Vnode Size: uvn_vnp_setsize
        6.4.3  Clearing a Vnode: uvn_vnp_zerorange
   6.5  Management of Physical Memory
        6.5.1  Lock Management for Page Queue: uvm_(un)lock_pageq
        6.5.2  Activating Physical Page: uvm_pageactivate
        6.5.3  Making Unbusy a Page: uvm_page_unbusy
        6.5.4  Looking up a Page: uvm_pagelookup

7  UBC
   7.1  Introduction
   7.2  Traditional Accesses to File
        7.2.1  I/O Subsystem: read() and write()
        7.2.2  Virtual Memory Subsystem: mmap()
   7.3  File Access with Unified Buffer Cache
   7.4  VFS Support for UVM
        7.4.1  VOP_GETPAGES Operation
        7.4.2  VOP_PUTPAGES Operation
   7.5  UVM Support for I/O
        7.5.1  ubc_alloc Function
        7.5.2  ubc_release Function
   7.6  Example
        7.6.1  Reading from Disk to Buffer with UBC
        7.6.2  Writing from Buffer to Disk with UBC

Part II  Analyzing Fast Filesystem

8  Naming
   8.1  Directories
        8.1.1  Chunk
        8.1.2  Modification of Directory
   8.2  Finding of Names in Directories
        8.2.1  Match Algorithm
        8.2.2  Search Performance Improvement
   8.3  Pathname Translation
   8.4  The Name Cache
        8.4.1  Vnode's Capability
        8.4.2  Negative Caching
        8.4.3  Special Device Handling
   8.5  Links
        8.5.1  Hard Links
        8.5.2  Soft Links
        8.5.3  The Differences
   8.6  References to Source Code
        8.6.1  vfs_cache.c - 537 lines, 17 functions
        8.6.2  vfs_lookup.c - 777 lines, 4 functions

9  Inode
   9.1  The Structures of an Inode
        9.1.1  File Flags
        9.1.2  Inode Flags
        9.1.3  Inode for Root Directory
   9.2  Inode Management
        9.2.1  Opening a File
        9.2.2  Closing a File
   9.3  Quotas
        9.3.1  Soft and Hard Quota
        9.3.2  Quota Imposing Mechanism
        9.3.3  Quota Records
        9.3.4  Active Quota Entry: dquot
        9.3.5  Consistency Maintenance
   9.4  References to Source Code
        9.4.1  ufs_bmap.c - 325 lines, 3 functions
        9.4.2  ufs_ihash.c - 194 lines, 7 functions
        9.4.3  ufs_inode.c - 268 lines, 3 functions
        9.4.4  ufs_lookup.c - 1216 lines, 9 functions
        9.4.5  ufs_quota.c - 960 lines, 20 functions
        9.4.6  ufs_readwrite.c - 481 lines, 4 functions
        9.4.7  ufs_vfsops.c - 262 lines, 8 functions
        9.4.8  ufs_vnops.c - 2074 lines, 30 functions

10  Berkeley Fast File System
   10.1  Filestore Services
        10.1.1  Allocating and Freeing Objects
        10.1.2  Updating Inode Attribute
        10.1.3  Manipulating Existing Objects
        10.1.4  Changing in Space Allocation
        10.1.5  Virtual Memory System Support
   10.2  Organization of the FFS
        10.2.1  Superblock
        10.2.2  Cylinder Group
        10.2.3  Fragment
   10.3  Reading a File
        10.3.1  Regular File Reading Algorithm: using UBC
        10.3.2  Non-regular File Reading Algorithm: without UBC
        10.3.3  Implementation
   10.4  Writing a File
        10.4.1  Regular File Writing Algorithm
        10.4.2  Non-regular File Writing Algorithm
        10.4.3  Implementation
   10.5  Layout Policies
        10.5.1  Inode Layout Policy
        10.5.2  Data Block Layout Policy
   10.6  Data Block Allocation Mechanisms
        10.6.1  Work Flow
        10.6.2  Main Function that Does Allocation: ffs_balloc
        10.6.3  Cylinder Overflow Algorithm: ffs_hashalloc
        10.6.4  Global Policy 1 - Extending a Fragment: ffs_realloccg
        10.6.5  Global Policy 2 - Get a New Block: ffs_alloc
        10.6.6  Local Policy - Allocate a Block or Fragment: ffs_alloccg
        10.6.7  Searching Fragment Descriptor Table: ffs_mapsearch
        10.6.8  Rotational Layout Table
   10.7  Inode Allocation Mechanism
        10.7.1  Global Policy: ffs_valloc
        10.7.2  Local Policy 1: ffs_dirpref
        10.7.3  Local Policy 2: ffs_nodealloccg
   10.8  Synchronous Operations
   10.9  Filesystem Semantics
        10.9.1  Large File Sizes
   10.10 References to Source Code
        10.10.1  fs.h - 574 lines
        10.10.2  ffs_vfsops.c - 1518 lines, 18 functions
        10.10.3  ffs_vnops.c - 580 lines, 6 functions
        10.10.4  ffs_alloc.c - 1895 lines, 18 functions
        10.10.5  ffs_balloc.c - 552 lines, 2 functions
        10.10.6  ffs_inode.c - 582 lines, 3 functions
        10.10.7  ffs_subr.c - 295 lines, 7 functions
        10.10.8  ffs_tables.c - 147 lines, 0 functions
        10.10.9  ffs_swap.c - 158 lines, 3 functions

11  Mounting Root File System
   11.1  System Bootstrapping
   11.2  Before Mounting
        11.2.1  Creating stopped init process
        11.2.2  Finding Where is the Root File System
        11.2.3  Executing Root Mount Hook
   11.3  Let's Mount the Root File System !
        11.3.1  Telling the VFS to Mount the Root Filesystem
        11.3.2  Getting Vnode for Root Device
        11.3.3  Allocating Mount Structure
        11.3.4  Reading Superblock
        11.3.5  Mount !
   11.4  What Must Be Done after Mount ?
        11.4.1  Find vnode for '/' -- root directory
        11.4.2  Set current working directory of init process
        11.4.3  Check File System Time
        11.4.4  Create Kernel Threads about File System
        11.4.5  Start Up init process

Part III  Storage Systems

12  Storage Device
   12.1  Generic Disk Framework
        12.1.1  disk Structure
        12.1.2  Disk Interfaces
        12.1.3  Using the Framework
   12.2  Disk Label
        12.2.1  What does it have ?
        12.2.2  disklabel structure
        12.2.3  Where is the Disk Label ?
        12.2.4  General Disk Label Interfaces
        12.2.5  Reading Disk Label: DIOCGDINFO
        12.2.6  Writing In-Core Disk Label: DIOCSDINFO
        12.2.7  Writing On-Disk Disk Label: DIOCWDINFO
        12.2.8  Restrictions of Disk Label in sparc64
   12.3  Concatenated Disk Driver
        12.3.1  Structure
        12.3.2  Global Variables
        12.3.3  Functions

13  Logical Volume Manager
   13.1  RAIDframe
        13.1.1  Introduction
        13.1.2  Component Labels
        13.1.3  Hot Spares
        13.1.4  Hierarchical Organization
        13.1.5  Kernel Configuration
   13.2  VERITAS Volume Manager
        13.2.1  Introduction
        13.2.2  Volume Manager Overview
        13.2.3  Physical Objects
        13.2.4  Volumes and Virtual Objects

Appendix
   A. NetBSD Kernel Sources

Bibliography


SungWon Chung
Busan, Korea
27 November 2002
Preface
Source Code Copyright
The NetBSD Foundation
The NetBSD specific code contains the following copyright notice.
/*-
 * Copyright (c) 1998, 2000 The NetBSD Foundation, Inc.
 * All rights reserved.
 *
 * This code is derived from software contributed to The NetBSD Foundation
 * by Jason R. Thorpe of the Numerical Aerospace Simulation Facility,
 * NASA Ames Research Center.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. All advertising materials mentioning features or use of this software
 *    must display the following acknowledgement:
 *        This product includes software developed by the NetBSD
 *        Foundation, Inc. and its contributors.
 * 4. Neither the name of The NetBSD Foundation nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
 * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
 * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */
University of California at Berkeley
All the source code in this book that is taken from the 4.4BSD-Lite release contains
the following copyright notice.
/*
 * Copyright (c) 1989, 1993
 *      The Regents of the University of California.  All rights reserved.
 *
 * This code is derived from software contributed
 * to Berkeley by John Heidemann of the UCLA Ficus project.
 *
 * Source: * @(#)i405_init.c 2.10 92/04/27 UCLA Ficus project
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. All advertising materials mentioning features or use of this software
 *    must display the following acknowledgement:
 *      This product includes software developed by the University of
 *      California, Berkeley and its contributors.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *      @(#)vfs_init.c  8.5 (Berkeley) 5/11/95
 */
Washington University
UVM code contains the following copyright notice.
/*
 *
 * Copyright (c) 1997 Charles D. Cranor and Washington University.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. All advertising materials mentioning features or use of this software
 *    must display the following acknowledgement:
 *      This product includes software developed by Charles D. Cranor and
 *      Washington University.
 * 4. The name of the author may not be used to endorse or promote products
 *    derived from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
 * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
 * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

Part I
Basics to Learn Filesystem


Chapter 1
Welcome to the World of Kernel !
In this chapter, the procedure involved in mounting the root filesystem is described.
1.1 How Does a System Call Dive into Kernel from User Program ?
In this section, we present how a system call works, using a filesystem-related system call as an example.
1.1.1 Example: write system call
Let's see how a system call such as write, used in the user application program below, is processed.
------------------------------------------------------ hello.c
#include <string.h>
#include <unistd.h>

int main(void) {
        const char *string = "Hello World ?";
        write(STDOUT_FILENO, string, strlen(string));
        return 0;
}
------------------------------------------------------ hello.c
The write function is provided by the C library, libc. For the sparc64 platform, the libc source code of the write function is shown below.
---------------------------------------- src/lib/libc/obj/write.S
1 #include "SYS.h"
2 RSYSCALL(write)
---------------------------------------- src/lib/libc/obj/write.S
The RSYSCALL macro used in the platform-independent write.S code above is defined in a machine-dependent way in the libc source, src/lib/libc/arch/sparc64/SYS.h.
---------------------------------- src/lib/libc/arch/sparc64/SYS.h
87 /*
88  * RSYSCALL is used when the system call should just return.  Here
89  * we use the SYSCALL_G7RFLAG to put the `success' return address in %g7
90  * and avoid a branch.
91  */
92 #define RSYSCALL(x) \
93         ENTRY(x); mov (_CAT(SYS_,x))|SYSCALL_G7RFLAG,%g1; add %o7,8,%g7; \
94         t ST_SYSCALL; ERROR()
---------------------------------- src/lib/libc/arch/sparc64/SYS.h
Thus, the write system call used in hello.c executes the assembly code of line 93, where the argument x is replaced with write. As a result, the system call number is stored in the %g1 register, and an Ultra SPARC CPU trap is raised by the t instruction.
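The same call can be made from user level through the generic syscall(2) stub, which makes it clear that the libc write stub does little more than load the system call number and trap. The snippet below is only an illustration and assumes a Unix-like environment providing syscall(2) and <sys/syscall.h>; it is not taken from the NetBSD source tree.

------------------------------------------------------
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

int main(void) {
        const char *msg = "Hello World ?\n";

        /* Equivalent to write(1, msg, strlen(msg)): the libc stub only
           supplies the system call number (SYS_write) and traps. */
        syscall(SYS_write, 1, msg, strlen(msg));
        return 0;
}
------------------------------------------------------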
1.1.2 Ultra SPARC 0x7c CPU Trap
The Ultra SPARC CPU trap number ST_SYSCALL is defined as
--------------------------------------- arch/sparc64/include/trap.h
106 #define T_SUN_SYSCALL   0x100           /* system call */
...
150 #define ST_SYSCALL      (T_SUN_SYSCALL & 0x7f)
--------------------------------------- arch/sparc64/include/trap.h
How an Ultra SPARC CPU trap is processed is determined at the CPU initialization stage during bootstrap, according to arch/sparc64/sparc64/locore.s. This part of the kernel source code is listed below.
--------------------------------------- arch/sparc64/sparc64/locore.s
 636 #define SYSCALL         VTRAP(0x100, syscall_setup)
 ...
 805         .globl  _C_LABEL(trapbase)
 806 _C_LABEL(trapbase):
 807         b dostart; nop; TA8     ! 000 = reserved -- Use it to boot
 808         /* We should not get the next 5 traps */
 809         UTRAP(0x001)            ! 001 = POR Reset -- ROM should get this
 810         UTRAP(0x002)            ! 002 = WDR -- ROM should get this
 811         UTRAP(0x003)            ! 003 = XIR -- ROM should get this
 ...
1010         UTRAP(0x0fc); TA32      ! 0x0fc fill_7_other
1011 TABLE(syscall):
1012         SYSCALL                 ! 0x100 = sun syscall
--------------------------------------- arch/sparc64/sparc64/locore.s
Remember that the write function defined in the libc library requests the CPU trap ST_SYSCALL, which is defined as 0x7c. So, according to line 1012 of arch/sparc64/sparc64/locore.s, a jump to the syscall_setup label is made.
Source code from syscall_setup is shown below.

--------------------------------------- arch/sparc64/sparc64/locore.s
3905 /*
3906  * syscall_setup() builds a trap frame and calls syscall().
3907  * sun_syscall is same but delivers sun system call number
3908  * XXX should not have to save&reload ALL the registers just for
3909  *      ptrace...
3910  */
3911 syscall_setup:
3912 #ifdef TRAPS_USE_IG
3913         wrpr    %g0, PSTATE_KERN|PSTATE_IG, %pstate     ! DEBUG
3914 #endif
3915         TRAP_SETUP(-CC64FSZ-TF_SIZE)
3916
3917 #ifdef DEBUG
3918         rdpr    %tt, %o1                                ! debug
3919         sth     %o1, [%sp + CC64FSZ + STKB + TF_TT]! debug
3920 #endif
3921
3922         wrpr    %g0, PSTATE_KERN, %pstate       ! Get back to normal globals
3923         stx     %g1, [%sp + CC64FSZ + STKB + TF_G + ( 1*8)]
3924         mov     %g1, %o1                        ! code
3925         rdpr    %tpc, %o2                       ! (pc)
3926         stx     %g2, [%sp + CC64FSZ + STKB + TF_G + ( 2*8)]
3927         rdpr    %tstate, %g1
3928         stx     %g3, [%sp + CC64FSZ + STKB + TF_G + ( 3*8)]
3929         rdpr    %tnpc, %o3
3930         stx     %g4, [%sp + CC64FSZ + STKB + TF_G + ( 4*8)]
3931         rd      %y, %o4
3932         stx     %g5, [%sp + CC64FSZ + STKB + TF_G + ( 5*8)]
3933         stx     %g6, [%sp + CC64FSZ + STKB + TF_G + ( 6*8)]
3934         CHKPT(%g5,%g6,0x31)
3935         wrpr    %g0, 0, %tl                     ! return to tl=0
3936         stx     %g7, [%sp + CC64FSZ + STKB + TF_G + ( 7*8)]
3937         add     %sp, CC64FSZ + STKB, %o0        ! (&tf)
3938
3939         stx     %g1, [%sp + CC64FSZ + STKB + TF_TSTATE]
3940         stx     %o2, [%sp + CC64FSZ + STKB + TF_PC]
3941         stx     %o3, [%sp + CC64FSZ + STKB + TF_NPC]
3942         st      %o4, [%sp + CC64FSZ + STKB + TF_Y]
3943
3944         rdpr    %pil, %g5
3945         stb     %g5, [%sp + CC64FSZ + STKB + TF_PIL]
3946         stb     %g5, [%sp + CC64FSZ + STKB + TF_OLDPIL]
3947
3948         !! In the EMBEDANY memory model %g4 points to the start of the data segment.
3949         !! In our case we need to clear it before calling any C-code
3950         clr     %g4
3951         wr      %g0, ASI_PRIMARY_NOFAULT, %asi  ! Restore default ASI
3952
3953         call    _C_LABEL(syscall)               ! syscall(&tf, code, pc)
3954          wrpr   %g0, PSTATE_INTR, %pstate       ! turn on interrupts
3955
3956         /* see `proc_trampoline' for the reason for this label */
3957 return_from_syscall:
3958         wrpr    %g0, PSTATE_KERN, %pstate       ! Disable intterrupts
3959         CHKPT(%o1,%o2,0x32)
3960         wrpr    %g0, 0, %tl                     ! Return to tl==0
3961         CHKPT(%o1,%o2,4)
3962         ba,a,pt %icc, return_from_trap
3963          nop
3964         NOTREACHED
--------------------------------------- arch/sparc64/sparc64/locore.s
Notice that line 3953 jumps to the syscall function defined in arch/sparc64/sparc64/trap.c.
trap.c is somewhat complex since it supports system call emulation, for example for Solaris or Linux binaries. The critical part of trap.c that handles native NetBSD system calls is shown below.
--------------------------------------- arch/sparc64/sparc64/trap.c
1721 void
1722 syscall(tf, code, pc)
1723         register_t code;
1724         struct trapframe64 *tf;
1725         register_t pc;
1726 {
1727         int i, nsys, nap;
1728         int64_t *ap;
1729         const struct sysent *callp;
1730         struct proc *p;
1731         int error = 0, new;
1732         union args {
1733                 register32_t i[8];
1734                 register64_t l[8];
1735         } args;
 ...
1766         p = curproc;
 ...
1780         callp = p->p_emul->e_sysent;
1781         nsys = p->p_emul->e_nsysent;
 ...
1851         callp += code;
 ...
1876         error = copyin((caddr_t)(u_long)tf->tf_out[6] + BIAS +
1877             offsetof(struct frame64, fr_argx),
1878             &args.l[nap],
1879             (i - nap) * sizeof(register64_t));
 ...
1997         error = (*callp->sy_call)(p, &args, rval);
--------------------------------------- arch/sparc64/sparc64/trap.c
At line 1997, the function designated by a function pointer is called. That function pointer is set up by lines 1780 and 1851. To explain what this function pointer means, the kernel structure that describes a process should be briefly described.
The kernel structure describing a process is struct proc, and it contains the so-called per-process emulation information in its const struct emul *p_emul member.

-------------------------------------------------- sys/proc.h
149 struct proc {
...
173         pid_t           p_pid;          /* Process identifier */
...
218         const struct emul *p_emul;      /* Emulation information */
219         void            *p_emuldata;    /*
...
264 };
-------------------------------------------------- sys/proc.h
This member is used to run a Sun Solaris or Linux binary on NetBSD/sparc64. For a native NetBSD/sparc64 binary, however, this member is initialized by kern/init_main.c to point to the const struct emul emul_netbsd structure defined in kern/kern_exec.c. The source for this initialization is
-------------------------------------------------- kern/init_main.c
165 /*
166  * System startup; initialize the world, create process 0, mount root
167  * filesystem, and fork to create init and pagedaemon.  Most of the
168  * hard work is done in the lower-level initialization routines including
169  * startup(), which does memory initialization and autoconfiguration.
170  */
171 void
172 main(void)
173 {
174         struct proc *p;
...
188         /*
189          * Initialize the current process pointer (curproc) before
190          * any possible traps/probes to simplify trap processing.
191          */
192         simple_lock_init(&proc0.p_raslock);
193         p = &proc0;
194         curproc = p;
...
274         p->p_emul = &emul_netbsd;
...
545         /*
546          * Okay, now we can let init(8) exec!  It's off to userland!
547          */
548         start_init_exec = 1;
549         wakeup((void *)&start_init_exec);
550
551         /* The scheduler is an infinite loop. */
552         uvm_scheduler();
553         /* NOTREACHED */
554 }
-------------------------------------------------- kern/init_main.c
and,

------------------------------------------------- kern/kern_exec.c
128 const struct emul emul_netbsd = {
129         "netbsd",
130         NULL,                   /* emulation path */
131 #ifndef __HAVE_MINIMAL_EMUL
132         EMUL_HAS_SYS___syscall,
133         NULL,
134         SYS_syscall,
135         SYS_NSYSENT,
136 #endif
137         sysent,
138 #ifdef SYSCALL_DEBUG
139         syscallnames,
140 #else
141         NULL,
142 #endif
143         sendsig,
144         trapsignal,
145         sigcode,
146         esigcode,
147         setregs,
148         NULL,
149         NULL,
150         NULL,
151 #ifdef __HAVE_SYSCALL_INTERN
152         syscall_intern,
153 #else
154         syscall,
155 #endif
156         NULL,
157         NULL,
158 };
------------------------------------------------- kern/kern_exec.c
where the struct emul structure is defined as
------------------------------------------------- kern/kern_exec.h
 94 struct emul {
 95         const char      *e_name;        /* Symbolic name */
 96         const char      *e_path;        /* Extra emulation path (NULL if none)*/
 97 #ifndef __HAVE_MINIMAL_EMUL
 98         int             e_flags;        /* Miscellaneous flags, see above */
 99                                         /* Syscall handling function */
100         const int       *e_errno;       /* Errno array */
101         int             e_nosys;        /* Offset of the nosys() syscall */
102         int             e_nsysent;      /* Number of system call entries */
103 #endif
104         const struct sysent *e_sysent;  /* System call array */
105         const char * const *e_syscallnames; /* System call name array */
106                                         /* Signal sending function */
107         void            (*e_sendsig) __P((int, sigset_t *, u_long));
108         void            (*e_trapsignal) __P((struct proc *, int, u_long));
109         char            *e_sigcode;     /* Start of sigcode */
110         char            *e_esigcode;    /* End of sigcode */
111                                         /* Set registers before execution */
112         void            (*e_setregs) __P((struct proc *, struct exec_package *,
113                                 u_long));
114
115         /* Per-process hooks */
116         void            (*e_proc_exec) __P((struct proc *,
117                                 struct exec_package *));
118         void            (*e_proc_fork) __P((struct proc *p,
119                                 struct proc *parent));
120         void            (*e_proc_exit) __P((struct proc *));
121
122 #ifdef __HAVE_SYSCALL_INTERN
123         void            (*e_syscall_intern) __P((struct proc *));
124 #else
125         void            (*e_syscall) __P((void));
126 #endif
127         /* Emulation specific sysctl */
128         int             (*e_sysctl) __P((int *, u_int , void *, size_t *,
129                                 void *, size_t, struct proc *p));
130         int             (*e_fault) __P((struct proc *, vaddr_t, int, int));
131 };
------------------------------------------------- kern/kern_exec.h
The emul_netbsd structure has a member declared as const struct sysent *e_sysent, and by the initialization in kern/kern_exec.c this member points to the struct sysent sysent[] array, which is defined in kern/init_sysent.c as
------------------------------------------------- kern/init_sysent.c
  71 struct sysent sysent[] = {
  72         { 0, 0, 0,
  73             sys_nosys },                /* 0 = syscall (indir) */
  74         { 1, s(struct sys_exit_args), 0,
  75             sys_exit },                 /* 1 = exit */
  76         { 0, 0, 0,
  77             sys_fork },                 /* 2 = fork */
  78         { 3, s(struct sys_read_args), 0,
  79             sys_read },                 /* 3 = read */
  80         { 3, s(struct sys_write_args), 0,
  81             sys_write },                /* 4 = write */
  82         { 3, s(struct sys_open_args), 0,
  83             sys_open },                 /* 5 = open */
 ...
1227             sys_nosys },                /* 511 = filler */
1228 };
------------------------------------------------- kern/init_sysent.c
where struct sysent is defined as
------------------------------------------------- kern/init_sysent.h
131 extern struct sysent {          /* system call table */
132         short   sy_narg;        /* number of args */
133         short   sy_argsize;     /* total size of arguments */
134         int     sy_flags;       /* flags. see below */
135         sy_call_t *sy_call;     /* implementing function */
136 } sysent[];
------------------------------------------------- kern/init_sysent.h
Now, based on the description so far, we can understand exactly what line 1997 means: it is a jump to the sys_write function.
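To summarize the dispatch path in one place, the following sketch shows, in reduced form, what syscall() effectively does with e_sysent and the system call code. All of the *_sketch names are invented for illustration only; the real code in arch/sparc64/sparc64/trap.c additionally copies in arguments, handles emulations, ktrace, and signal delivery.

------------------------------------------------------
typedef long register_t_sketch;

struct proc_sketch;                     /* stands in for struct proc */

struct sysent_sketch {                  /* stands in for struct sysent */
        int sy_narg;
        int (*sy_call)(struct proc_sketch *, void *, register_t_sketch *);
};

struct emul_sketch {                    /* stands in for struct emul */
        const struct sysent_sketch *e_sysent;
        int e_nsysent;
};

struct proc_sketch {
        const struct emul_sketch *p_emul;
};

int
dispatch_sketch(struct proc_sketch *p, int code, void *args,
    register_t_sketch *rval)
{
        const struct sysent_sketch *callp = p->p_emul->e_sysent;

        if (code < 0 || code >= p->p_emul->e_nsysent)
                return 78;              /* the real code substitutes sys_nosys */
        callp += code;                  /* code 4 selects the sys_write entry */
        return (*callp->sy_call)(p, args, rval);
}
------------------------------------------------------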
1.1.3 Jump to the File Descriptor Layer
The sys_write function is defined in kern/sys_generic.c as
------------------------------------------------- kern/sys_generic.c
278 /*
279  * Write system call
280  */
281 int
282 sys_write(struct proc *p, void *v, register_t *retval)
283 {
284         struct sys_write_args /* {
285                 syscallarg(int)                 fd;
286                 syscallarg(const void *)        buf;
287                 syscallarg(size_t)              nbyte;
288         } */ *uap = v;
289         int             fd;
290         struct file     *fp;
291         struct filedesc *fdp;
292
293         fd = SCARG(uap, fd);
294         fdp = p->p_fd;
295
296         if ((fp = fd_getfile(fdp, fd)) == NULL)
297                 return (EBADF);
298
299         if ((fp->f_flag & FWRITE) == 0)
300                 return (EBADF);
301
302         FILE_USE(fp);
303
304         /* dofilewrite() will unuse the descriptor for us */
305         return (dofilewrite(p, fd, fp, SCARG(uap, buf), SCARG(uap, nbyte),
306             &fp->f_offset, FOF_UPDATE_OFFSET, retval));
307 }
308
309 int
310 dofilewrite(struct proc *p, int fd, struct file *fp, const void *buf,
311         size_t nbyte, off_t *offset, int flags, register_t *retval)
312 {
313         struct uio      auio;
314         struct iovec    aiov;
315         size_t          cnt;
316         int             error;
317 #ifdef KTRACE
318         struct iovec    ktriov;
319 #endif
320
321         error = 0;
322         aiov.iov_base = (caddr_t)buf;           /* XXX kills const */
323         aiov.iov_len = nbyte;
324         auio.uio_iov = &aiov;
325         auio.uio_iovcnt = 1;
326         auio.uio_resid = nbyte;
327         auio.uio_rw = UIO_WRITE;
328         auio.uio_segflg = UIO_USERSPACE;
329         auio.uio_procp = p;
330
331         /*
332          * Writes return ssize_t because -1 is returned on error.  Therefore
333          * we must restrict the length to SSIZE_MAX to avoid garbage return
334          * values.
335          */
336         if (auio.uio_resid > SSIZE_MAX) {
337                 error = EINVAL;
338                 goto out;
339         }
340
341 #ifdef KTRACE
342         /*
343          * if tracing, save a copy of iovec
344          */
345         if (KTRPOINT(p, KTR_GENIO))
346                 ktriov = aiov;
347 #endif
348         cnt = auio.uio_resid;
349         error = (*fp->f_ops->fo_write)(fp, offset, &auio, fp->f_cred, flags);
350         if (error) {
351                 if (auio.uio_resid != cnt && (error == ERESTART ||
352                     error == EINTR || error == EWOULDBLOCK))
353                         error = 0;
354                 if (error == EPIPE)
355                         psignal(p, SIGPIPE);
356         }
357         cnt -= auio.uio_resid;
358 #ifdef KTRACE
359         if (KTRPOINT(p, KTR_GENIO) && error == 0)
360                 ktrgenio(p, fd, UIO_WRITE, &ktriov, cnt, error);
361 #endif
362         *retval = cnt;
363  out:
364         FILE_UNUSE(fp, p);
365         return (error);
366 }
------------------------------------------------- kern/sys_generic.c

Do you think this is all the kernel source code needed to process the write system call? Unfortunately, there is still a fairly long way to walk before we reach the realm of the fast filesystem code. :)
See line 349 of kern/sys_generic.c. You may wonder how the f_ops member of the fp structure is set. It is initialized when the open system call is executed, as
------------------------------------------------- kern/vfs_syscalls.c
 986 /*
 987  * Check permissions, allocate an open file structure,
 988  * and call the device open routine if any.
 989  */
 990 int
 991 sys_open(p, v, retval)
 992         struct proc *p;
 993         void *v;
 994         register_t *retval;
 995 {
 996         struct sys_open_args /* {
 997                 syscallarg(const char *) path;
 998                 syscallarg(int) flags;
 999                 syscallarg(int) mode;
1000         } */ *uap = v;
1001         struct cwdinfo *cwdi = p->p_cwdi;
1002         struct filedesc *fdp = p->p_fd;
1003         struct file *fp;
1004         struct vnode *vp;
1005         int flags, cmode;
1006         int type, indx, error;
1007         struct flock lf;
1008         struct nameidata nd;
1009
1010         flags = FFLAGS(SCARG(uap, flags));
1011         if ((flags & (FREAD | FWRITE)) == 0)
1012                 return (EINVAL);
1013         /* falloc() will use the file descriptor for us */
1014         if ((error = falloc(p, &fp, &indx)) != 0)
1015                 return (error);
1016         cmode = ((SCARG(uap, mode) &~ cwdi->cwdi_cmask) & ALLPERMS) &~ S_ISTXT;
1017         NDINIT(&nd, LOOKUP, FOLLOW, UIO_USERSPACE, SCARG(uap, path), p);
1018         p->p_dupfd = -indx - 1;                 /* XXX check for fdopen */
1019         if ((error = vn_open(&nd, flags, cmode)) != 0) {
1020                 FILE_UNUSE(fp, p);
1021                 ffree(fp);
1022                 if ((error == ENODEV || error == ENXIO) &&
1023                     p->p_dupfd >= 0 &&          /* XXX from fdopen */
1024                     (error =
1025                         dupfdopen(p, indx, p->p_dupfd, flags, error)) == 0) {
1026                         *retval = indx;
1027                         return (0);
1028                 }
1029                 if (error == ERESTART)
1030                         error = EINTR;
1031                 fdremove(fdp, indx);
1032                 return (error);
1033         }
1034         p->p_dupfd = 0;
1035         vp = nd.ni_vp;
1036         fp->f_flag = flags & FMASK;
1037         fp->f_type = DTYPE_VNODE;
1038         fp->f_ops = &vnops;
1039         fp->f_data = (caddr_t)vp;
1040         if (flags & (O_EXLOCK | O_SHLOCK)) {
1041                 lf.l_whence = SEEK_SET;
1042                 lf.l_start = 0;
1043                 lf.l_len = 0;
1044                 if (flags & O_EXLOCK)
1045                         lf.l_type = F_WRLCK;
1046                 else
1047                         lf.l_type = F_RDLCK;
1048                 type = F_FLOCK;
1049                 if ((flags & FNONBLOCK) == 0)
1050                         type |= F_WAIT;
1051                 VOP_UNLOCK(vp, 0);
1052                 error = VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &lf, type);
1053                 if (error) {
1054                         (void) vn_close(vp, fp->f_flag, fp->f_cred, p);
1055                         FILE_UNUSE(fp, p);
1056                         ffree(fp);
1057                         fdremove(fdp, indx);
1058                         return (error);
1059                 }
1060                 vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
1061                 fp->f_flag |= FHASLOCK;
1062         }
1063         VOP_UNLOCK(vp, 0);
1064         *retval = indx;
1065         FILE_SET_MATURE(fp);
1066         FILE_UNUSE(fp, p);
1067         return (0);
1068 }
------------------------------------------------- kern/vfs_syscalls.c
You can check that this code segment is described on pages 205-207 of the book titled "The Design and Implementation of the 4.4BSD Operating System".
More importantly, see line 1038 of kern/vfs_syscalls.c. Do you see what it means? By this code line, the f_ops member of the fp structure in line 349 of kern/sys_generic.c points to the vnops global variable, which is defined as,
------------------------------------------------- kern/vfs_vnops.c
82 struct  fileops vnops = {
83         vn_read, vn_write, vn_ioctl, vn_fcntl, vn_poll,
84         vn_statfile, vn_closefile, vn_kqfilter
85 };
------------------------------------------------- kern/vfs_vnops.c

where the definition of struct fileops is embedded in the definition of the file structure as,
-------------------------------------------------- sys/file.h
53 /*
54  * Kernel descriptor table.
55  * One entry for each open kernel vnode and socket.
56  */
57 struct file {
58         LIST_ENTRY(file) f_list;        /* list of active files */
59         int             f_flag;         /* see fcntl.h */
60         int             f_iflags;       /* internal flags */
61 #define DTYPE_VNODE     1               /* file */
62 #define DTYPE_SOCKET    2               /* communications endpoint */
63 #define DTYPE_PIPE      3               /* pipe */
64 #define DTYPE_KQUEUE    4               /* event queue */
65 #define DTYPE_MISC      5               /* misc file descriptor type */
66         int             f_type;         /* descriptor type */
67         u_int           f_count;        /* reference count */
68         u_int           f_msgcount;     /* references from message queue */
69         int             f_usecount;     /* number active users */
70         struct ucred    *f_cred;        /* creds associated with descriptor */
71         struct fileops {
72                 int     (*fo_read)      (struct file *fp, off_t *offset,
73                                          struct uio *uio,
74                                          struct ucred *cred, int flags);
75                 int     (*fo_write)     (struct file *fp, off_t *offset,
76                                          struct uio *uio,
77                                          struct ucred *cred, int flags);
78                 int     (*fo_ioctl)     (struct file *fp, u_long com,
79                                          caddr_t data, struct proc *p);
80                 int     (*fo_fcntl)     (struct file *fp, u_int com,
81                                          caddr_t data, struct proc *p);
82                 int     (*fo_poll)      (struct file *fp, int events,
83                                          struct proc *p);
84                 int     (*fo_stat)      (struct file *fp, struct stat *sp,
85                                          struct proc *p);
86                 int     (*fo_close)     (struct file *fp, struct proc *p);
87                 int     (*fo_kqfilter)  (struct file *fp, struct knote *kn);
88         } *f_ops;
89         off_t           f_offset;
90         caddr_t         f_data;         /* descriptor data, e.g. vnode/socket */
91 };
-------------------------------------------------- sys/file.h
Based on the above code, line 349 of kern/sys_generic.c makes a jump to vn_write.
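The point of the f_ops table is that the descriptor layer never needs to know what kind of object it is writing to; the operations table installed at open time decides that. The following self-contained userland sketch imitates this dispatch; every *_sketch name is invented for illustration and is not a kernel identifier.

------------------------------------------------------
#include <stdio.h>

struct file_sketch;

struct fileops_sketch {
        int (*fo_write)(struct file_sketch *, const void *, unsigned long);
};

struct file_sketch {
        const struct fileops_sketch *f_ops;     /* set at open time */
        void *f_data;                           /* vnode, socket, ... */
};

static int
vn_write_sketch(struct file_sketch *fp, const void *buf, unsigned long n)
{
        printf("vnode write of %lu bytes\n", n);
        return 0;
}

static const struct fileops_sketch vnops_sketch = { vn_write_sketch };

int
main(void)
{
        struct file_sketch f = { &vnops_sketch, NULL };

        /* This mirrors line 349 of kern/sys_generic.c:
           (*fp->f_ops->fo_write)(fp, offset, &auio, fp->f_cred, flags); */
        return (*f.f_ops->fo_write)(&f, "hello", 5);
}
------------------------------------------------------

A socket or a pipe would simply install a different fileops table in f_ops, and the very same line 349 would reach its own write routine instead.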
1.1.4 Arriving at Virtual Filesystem Operations
The vn_write function is defined in kern/vfs_vnops.c as,
------------------------------------------------- kern/vfs_vnops.c
526 /*
527  * File table vnode write routine.
528  */
529 static int
530 vn_write(fp, offset, uio, cred, flags)
531         struct file *fp;
532         off_t *offset;
533         struct uio *uio;
534         struct ucred *cred;
535         int flags;
536 {
537         struct vnode *vp = (struct vnode *)fp->f_data;
538         int count, error, ioflag = IO_UNIT;
539
540         if (vp->v_type == VREG && (fp->f_flag & O_APPEND))
541                 ioflag |= IO_APPEND;
542         if (fp->f_flag & FNONBLOCK)
543                 ioflag |= IO_NDELAY;
544         if (fp->f_flag & FFSYNC ||
545             (vp->v_mount && (vp->v_mount->mnt_flag & MNT_SYNCHRONOUS)))
546                 ioflag |= IO_SYNC;
547         else if (fp->f_flag & FDSYNC)
548                 ioflag |= IO_DSYNC;
549         if (fp->f_flag & FALTIO)
550                 ioflag |= IO_ALTSEMANTICS;
551         VOP_LEASE(vp, uio->uio_procp, cred, LEASE_WRITE);
552         vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
553         uio->uio_offset = *offset;
554         count = uio->uio_resid;
555         error = VOP_WRITE(vp, uio, ioflag, cred);
556         if (flags & FOF_UPDATE_OFFSET) {
557                 if (ioflag & IO_APPEND)
558                         *offset = uio->uio_offset;
559                 else
560                         *offset += count - uio->uio_resid;
561         }
562         VOP_UNLOCK(vp, 0);
563         return (error);
564 }
------------------------------------------------- kern/vfs_vnops.c
The functions used in lines 551 and 555, VOP_LEASE and VOP_WRITE, are calls to virtual filesystem operations. Before describing the jump into the virtual filesystem code made by these calls, we should explain the architecture and source code of the virtual filesystem layer in NetBSD/sparc64. Therefore, we postpone further description to the next chapter, which describes the virtual filesystem layer implementation.
Starting from the write system call in hello.c, we have arrived just before the filesystem code. Isn't it interesting ?
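As a rough preview of what such a call means: every vnode carries an operations vector filled in by its filesystem, and a VOP_* macro dispatches through that vector. The sketch below only illustrates the idea with invented *_sketch types; the actual NetBSD mechanism (vnode_if and struct vnodeop_desc) is the subject of the virtual filesystem chapter.

------------------------------------------------------
struct vnode_sketch;
struct uio_sketch;
struct ucred_sketch;

struct vnodeops_sketch {
        int (*vop_write)(struct vnode_sketch *, struct uio_sketch *,
            int ioflag, struct ucred_sketch *);
};

struct vnode_sketch {
        const struct vnodeops_sketch *v_op;     /* set by the filesystem */
        void *v_data;                           /* e.g. an FFS inode */
};

static int
VOP_WRITE_SKETCH(struct vnode_sketch *vp, struct uio_sketch *uio,
    int ioflag, struct ucred_sketch *cred)
{
        /* For an FFS-backed vnode this would end up in the FFS write code. */
        return (*vp->v_op->vop_write)(vp, uio, ioflag, cred);
}
------------------------------------------------------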

1.2 General Data Structures in Kernel such as List, Hash, Queue, ...
In NetBSD, general data structure manipulation macros are provided. Conceptually, they are equivalent to C++ templates. To use these macros, the only thing to do is to include the sys/queue.h header file.
These built-in macros support five types of data structures: singly-linked lists, linked lists, simple queues, tail queues, and circular queues. They are used by various parts of the kernel; for example, the buffer cache uses lists and tail queues.
All five data structures support the following functionality:
· Insertion of a new entry at the head of the list
· Insertion of a new entry after any element in the list
· Removal of an entry from the head of the list
· Forward traversal through the list
All doubly linked types of data structures (lists, tail queues, and circular queues) additionally allow:
· Insertion of a new entry before any element in the list.
· O(1) removal of any entry in the list.
However, code size and execution time of operations (except for removal) is about twice that of the singly-linked data structures.
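The reason removal is O(1) in the doubly linked variants is the le_prev field, which stores the address of the pointer that points to the element rather than a pointer to the previous element itself. The following minimal re-implementation of that idea, with explicit fields instead of macros, is a sketch only; the real LIST_* macros are quoted in the subsections below.

------------------------------------------------------
struct node {
        struct node  *next;     /* next element */
        struct node **prevp;    /* address of the pointer that points to us */
};

void
node_remove(struct node *n)
{
        if (n->next != NULL)
                n->next->prevp = n->prevp;
        *n->prevp = n->next;    /* works for the head and the middle alike */
}
------------------------------------------------------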
1.2.1 Linked-Lists
Linked lists are the simplest of the doubly linked data structures.
Here is an example using linked lists.
An Example
------------------------------------------------------
LIST_HEAD(listhead, entry) head;
struct listhead *headp;                 /* List head. */
struct entry {
        ...
        LIST_ENTRY(entry) entries;      /* List. */
        ...
} *n1, *n2, *np;

LIST_INIT(&head);                       /* Initialize the list. */

n1 = malloc(sizeof(struct entry));      /* Insert at the head. */
LIST_INSERT_HEAD(&head, n1, entries);

n2 = malloc(sizeof(struct entry));      /* Insert after. */
LIST_INSERT_AFTER(n1, n2, entries);

n2 = malloc(sizeof(struct entry));      /* Insert before. */
LIST_INSERT_BEFORE(n1, n2, entries);
                                        /* Forward traversal. */
LIST_FOREACH(np, &head, entries)
        np-> ...
                                        /* Delete. */
while (LIST_FIRST(&head) != NULL)
        LIST_REMOVE(LIST_FIRST(&head), entries);

if (LIST_EMPTY(&head))                  /* Test for emptiness. */
        printf("nothing to do\n");
------------------------------------------------------
From now on, with this example, we will describe how to use the built-in macros for linked lists.
List Definition
A list is headed by a structure defined by the LIST_HEAD macro. This macro is defined in sys/queue.h as,
------------------------------------------------------ sys/queue.h
87 /*
88  * List definitions.
89  */
90 #define LIST_HEAD(name, type)                                           \
91 struct name {                                                           \
92         struct type *lh_first;  /* first element */                     \
93 }
------------------------------------------------------ sys/queue.h
This structure contains a single pointer to the first element on the list. The elements
are doubly linked so that an arbitrary element can be removed without traversing
the list. New elements can be added to the list after an existing element, before an
existing element, or at the head of the list. A LIST HEAD structure is declared as
follows:
------------------------------------------------------
LIST_HEAD(HEADNAME, TYPE) head;
------------------------------------------------------
where HEADNAME is the name of the structure to be defined, and TYPE is the type of
the elements to be linked into the list. A pointer to the head of the list can later
be declared as:
------------------------------------------------------
struct HEADNAME *headp;
------------------------------------------------------
(The names head and headp are user selectable.)

Declaring Entry
The macro LIST_ENTRY declares a structure that connects the elements in the list.
This macro is defined in sys/queue.h as,
------------------------------------------------------ sys/queue.h
 98 #define LIST_ENTRY(type)                                                \
 99 struct {                                                                \
100         struct type *le_next;   /* next element */                      \
101         struct type **le_prev;  /* address of previous next element */  \
102 }
------------------------------------------------------ sys/queue.h
------------------------------------------------------ sys/queue.h
List Initialization
The macro LIST_INIT initializes the list referenced by head. This macro is defined as,
------------------------------------------------------ sys/queue.h
128 #define LIST_INIT(head) do {                                            \
129         (head)->lh_first = NULL;                                        \
130 } while (/*CONSTCOND*/0)
------------------------------------------------------ sys/queue.h
------------------------------------------------------ sys/queue.h
Entry Insertion
The LIST_INSERT_HEAD macro inserts the new element elm at the head of the list. LIST_INSERT_AFTER inserts the new element elm after the element listelm. LIST_INSERT_BEFORE inserts the new element elm before the element listelm.
These macros are defined in sys/queue.h as,
------------------------------------------------------ sys/queue.h
132 #define LIST_INSERT_AFTER(listelm, elm, field) do {                     \
133         QUEUEDEBUG_LIST_OP((listelm), field)                            \
134         if (((elm)->field.le_next = (listelm)->field.le_next) != NULL)  \
135                 (listelm)->field.le_next->field.le_prev =               \
136                     &(elm)->field.le_next;                              \
137         (listelm)->field.le_next = (elm);                               \
138         (elm)->field.le_prev = &(listelm)->field.le_next;               \
139 } while (/*CONSTCOND*/0)
140
141 #define LIST_INSERT_BEFORE(listelm, elm, field) do {                    \
142         QUEUEDEBUG_LIST_OP((listelm), field)                            \
143         (elm)->field.le_prev = (listelm)->field.le_prev;                \
144         (elm)->field.le_next = (listelm);                               \
145         *(listelm)->field.le_prev = (elm);                              \
146         (listelm)->field.le_prev = &(elm)->field.le_next;               \
147 } while (/*CONSTCOND*/0)
148
149 #define LIST_INSERT_HEAD(head, elm, field) do {                         \
150         QUEUEDEBUG_LIST_INSERT_HEAD((head), (elm), field)               \
151         if (((elm)->field.le_next = (head)->lh_first) != NULL)          \
152                 (head)->lh_first->field.le_prev = &(elm)->field.le_next;\
153         (head)->lh_first = (elm);                                       \
154         (elm)->field.le_prev = &(head)->lh_first;                       \
155 } while (/*CONSTCOND*/0)
------------------------------------------------------ sys/queue.h
From the definition, the macros beginning with QUEUEDEBUG are assertion macros. They are meaningful only if the QUEUEDEBUG macro is defined. These macros are defined as
------------------------------------------------------ sys/queue.h
107 #if defined(_KERNEL) && defined(QUEUEDEBUG)
108 #define QUEUEDEBUG_LIST_INSERT_HEAD(head, elm, field)			\
109 	if ((head)->lh_first &&						\
110 	    (head)->lh_first->field.le_prev != &(head)->lh_first)	\
111 		panic("LIST_INSERT_HEAD %p %s:%d", (head), __FILE__, __LINE__);
112 #define QUEUEDEBUG_LIST_OP(elm, field)					\
113 	if ((elm)->field.le_next &&					\
114 	    (elm)->field.le_next->field.le_prev !=			\
115 	    &(elm)->field.le_next)					\
116 		panic("LIST_* forw %p %s:%d", (elm), __FILE__, __LINE__);\
117 	if (*(elm)->field.le_prev != (elm))				\
118 		panic("LIST_* back %p %s:%d", (elm), __FILE__, __LINE__);
119 #define QUEUEDEBUG_LIST_POSTREMOVE(elm, field)				\
120 	(elm)->field.le_next = (void *)1L;				\
121 	(elm)->field.le_prev = (void *)1L;
122 #else
123 #define QUEUEDEBUG_LIST_INSERT_HEAD(head, elm, field)
124 #define QUEUEDEBUG_LIST_OP(elm, field)
125 #define QUEUEDEBUG_LIST_POSTREMOVE(elm, field)
126 #endif
------------------------------------------------------ sys/queue.h
Entry Removal
The macro LIST_REMOVE removes the element elm from the list.
This macro is defined as
------------------------------------------------------ sys/queue.h
157 #define LIST_REMOVE(elm, field) do {					\
158 	QUEUEDEBUG_LIST_OP((elm), field)				\
159 	if ((elm)->field.le_next != NULL)				\
160 		(elm)->field.le_next->field.le_prev = 			\
161 		    (elm)->field.le_prev;				\
162 	*(elm)->field.le_prev = (elm)->field.le_next;			\
163 	QUEUEDEBUG_LIST_POSTREMOVE((elm), field)			\
164 } while (/*CONSTCOND*/0)
------------------------------------------------------ sys/queue.h

List Access
The LIST_EMPTY macro returns true if the list head has no elements.
The LIST_FIRST macro returns the first element of the list head.
The LIST_FOREACH macro traverses the list referenced by head in the forward
direction, assigning each element in turn to var.
The LIST_NEXT macro returns the element after the element elm.
Those macros are defined as
------------------------------------------------------ sys/queue.h
166 #define LIST_FOREACH(var, head, field)					\
167 	for ((var) = ((head)->lh_first);				\
168 	    (var);							\
169 	    (var) = ((var)->field.le_next))
170
171 /*
172  * List access methods.
173  */
174 #define LIST_EMPTY(head)		((head)->lh_first == NULL)
175 #define LIST_FIRST(head)		((head)->lh_first)
176 #define LIST_NEXT(elm, field)		((elm)->field.le_next)
------------------------------------------------------ sys/queue.h
Now, if you look at the previous example again, you should fully understand how it
works !
1.2.2
Tail Queues
Tail queues add one additional capability:
· Entries can be added at the end of a list.
However,
· All list insertions and removals, except insertion before another
  element, must specify the head of the list.
· Code size is about 15% greater, and operations run about 20% slower,
  than for linked-lists.
An Example
------------------------------------------------------
TAILQ_HEAD(tailhead, entry) head;
struct tailhead *headp;			/* Tail queue head. */
struct entry {
	...
	TAILQ_ENTRY(entry) entries;	/* Tail queue. */
	...
} *n1, *n2, *np;

TAILQ_INIT(&head);			/* Initialize the queue. */

n1 = malloc(sizeof(struct entry));	/* Insert at the head. */
TAILQ_INSERT_HEAD(&head, n1, entries);

n1 = malloc(sizeof(struct entry));	/* Insert at the tail. */
TAILQ_INSERT_TAIL(&head, n1, entries);

n2 = malloc(sizeof(struct entry));	/* Insert after. */
TAILQ_INSERT_AFTER(&head, n1, n2, entries);

n2 = malloc(sizeof(struct entry));	/* Insert before. */
TAILQ_INSERT_BEFORE(n1, n2, entries);

					/* Forward traversal. */
TAILQ_FOREACH(np, &head, entries)
	np-> ...
					/* Reverse traversal. */
TAILQ_FOREACH_REVERSE(np, &head, tailhead, entries)
	np-> ...
					/* Delete. */
while (TAILQ_FIRST(&head) != NULL)
	TAILQ_REMOVE(&head, TAILQ_FIRST(&head), entries);
if (TAILQ_EMPTY(&head))			/* Test for emptiness. */
	printf("nothing to do\n");
------------------------------------------------------
If you read the previous subsection about linked lists, you will have no problem
understanding the above example. Therefore, instead of describing the usage of
each macro, we simply show their definitions.
------------------------------------------------------ sys/queue.h
310 /*
311  * Tail queue definitions.
312  */
313 #define TAILQ_HEAD(name, type)						\
314 struct name {								\
315 	struct type *tqh_first;	/* first element */			\
316 	struct type **tqh_last;	/* addr of last next element */		\
317 }
318
319 #define TAILQ_HEAD_INITIALIZER(head)					\
320 	{ NULL, &(head).tqh_first }
321
322 #define TAILQ_ENTRY(type)						\
323 struct {								\
324 	struct type *tqe_next;	/* next element */			\
325 	struct type **tqe_prev;	/* address of previous next element */	\
326 }
327
328 /*
329  * Tail queue functions.
330  */
331 #if defined(_KERNEL) && defined(QUEUEDEBUG)
332 #define QUEUEDEBUG_TAILQ_INSERT_HEAD(head, elm, field)			\
333 	if ((head)->tqh_first &&					\
334 	    (head)->tqh_first->field.tqe_prev != &(head)->tqh_first)	\
335 		panic("TAILQ_INSERT_HEAD %p %s:%d", (head), __FILE__, __LINE__);
336 #define QUEUEDEBUG_TAILQ_INSERT_TAIL(head, elm, field)			\
337 	if (*(head)->tqh_last != NULL)					\
338 		panic("TAILQ_INSERT_TAIL %p %s:%d", (head), __FILE__, __LINE__);
339 #define QUEUEDEBUG_TAILQ_OP(elm, field)					\
340 	if ((elm)->field.tqe_next &&					\
341 	    (elm)->field.tqe_next->field.tqe_prev !=			\
342 	    &(elm)->field.tqe_next)					\
343 		panic("TAILQ_* forw %p %s:%d", (elm), __FILE__, __LINE__);\
344 	if (*(elm)->field.tqe_prev != (elm))				\
345 		panic("TAILQ_* back %p %s:%d", (elm), __FILE__, __LINE__);
346 #define QUEUEDEBUG_TAILQ_PREREMOVE(head, elm, field)			\
347 	if ((elm)->field.tqe_next == NULL &&				\
348 	    (head)->tqh_last != &(elm)->field.tqe_next)			\
349 		panic("TAILQ_PREREMOVE head %p elm %p %s:%d",		\
350 		    (head), (elm), __FILE__, __LINE__);
351 #define QUEUEDEBUG_TAILQ_POSTREMOVE(elm, field)				\
352 	(elm)->field.tqe_next = (void *)1L;				\
353 	(elm)->field.tqe_prev = (void *)1L;
354 #else
355 #define QUEUEDEBUG_TAILQ_INSERT_HEAD(head, elm, field)
356 #define QUEUEDEBUG_TAILQ_INSERT_TAIL(head, elm, field)
357 #define QUEUEDEBUG_TAILQ_OP(elm, field)
358 #define QUEUEDEBUG_TAILQ_PREREMOVE(head, elm, field)
359 #define QUEUEDEBUG_TAILQ_POSTREMOVE(elm, field)
360 #endif
361
362 #define TAILQ_INIT(head) do {						\
363 	(head)->tqh_first = NULL;					\
364 	(head)->tqh_last = &(head)->tqh_first;				\
365 } while (/*CONSTCOND*/0)
366
367 #define TAILQ_INSERT_HEAD(head, elm, field) do {			\
368 	QUEUEDEBUG_TAILQ_INSERT_HEAD((head), (elm), field)		\
369 	if (((elm)->field.tqe_next = (head)->tqh_first) != NULL)	\
370 		(head)->tqh_first->field.tqe_prev =			\
371 		    &(elm)->field.tqe_next;				\
372 	else								\
373 		(head)->tqh_last = &(elm)->field.tqe_next;		\
374 	(head)->tqh_first = (elm);					\
375 	(elm)->field.tqe_prev = &(head)->tqh_first;			\
376 } while (/*CONSTCOND*/0)
377
378 #define TAILQ_INSERT_TAIL(head, elm, field) do {			\
379 	QUEUEDEBUG_TAILQ_INSERT_TAIL((head), (elm), field)		\
380 	(elm)->field.tqe_next = NULL;					\
381 	(elm)->field.tqe_prev = (head)->tqh_last;			\
382 	*(head)->tqh_last = (elm);					\
383 	(head)->tqh_last = &(elm)->field.tqe_next;			\
384 } while (/*CONSTCOND*/0)
385
386 #define TAILQ_INSERT_AFTER(head, listelm, elm, field) do {		\
387 	QUEUEDEBUG_TAILQ_OP((listelm), field)				\
388 	if (((elm)->field.tqe_next = (listelm)->field.tqe_next) != NULL)\
389 		(elm)->field.tqe_next->field.tqe_prev =			\
390 		    &(elm)->field.tqe_next;				\
391 	else								\
392 		(head)->tqh_last = &(elm)->field.tqe_next;		\
393 	(listelm)->field.tqe_next = (elm);				\
394 	(elm)->field.tqe_prev = &(listelm)->field.tqe_next;		\
395 } while (/*CONSTCOND*/0)
396
397 #define TAILQ_INSERT_BEFORE(listelm, elm, field) do {			\
398 	QUEUEDEBUG_TAILQ_OP((listelm), field)				\
399 	(elm)->field.tqe_prev = (listelm)->field.tqe_prev;		\
400 	(elm)->field.tqe_next = (listelm);				\
401 	*(listelm)->field.tqe_prev = (elm);				\
402 	(listelm)->field.tqe_prev = &(elm)->field.tqe_next;		\
403 } while (/*CONSTCOND*/0)
404
405 #define TAILQ_REMOVE(head, elm, field) do {				\
406 	QUEUEDEBUG_TAILQ_PREREMOVE((head), (elm), field)		\
407 	QUEUEDEBUG_TAILQ_OP((elm), field)				\
408 	if (((elm)->field.tqe_next) != NULL)				\
409 		(elm)->field.tqe_next->field.tqe_prev =			\
410 		    (elm)->field.tqe_prev;				\
411 	else								\
412 		(head)->tqh_last = (elm)->field.tqe_prev;		\
413 	*(elm)->field.tqe_prev = (elm)->field.tqe_next;			\
414 	QUEUEDEBUG_TAILQ_POSTREMOVE((elm), field);			\
415 } while (/*CONSTCOND*/0)
416
417 /*
418  * Tail queue access methods.
419  */
420 #define TAILQ_EMPTY(head)		((head)->tqh_first == NULL)
421 #define TAILQ_FIRST(head)		((head)->tqh_first)
422 #define TAILQ_NEXT(elm, field)		((elm)->field.tqe_next)
423
424 #define TAILQ_LAST(head, headname) \
425 	(*(((struct headname *)((head)->tqh_last))->tqh_last))
426 #define TAILQ_PREV(elm, headname, field) \
427 	(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))
428
429 #define TAILQ_FOREACH(var, head, field)					\
430 	for ((var) = ((head)->tqh_first);				\
431 	    (var);							\
432 	    (var) = ((var)->field.tqe_next))
433
434 #define TAILQ_FOREACH_REVERSE(var, head, headname, field)		\
435 	for ((var) = (*(((struct headname *)((head)->tqh_last))->tqh_last));	\
436 	    (var);							\
437 	    (var) = (*(((struct headname *)((var)->field.tqe_prev))->tqh_last)))
------------------------------------------------------ sys/queue.h

1.2.3
Hash
The hash implementation in the NetBSD kernel is so simple that it consists of only two
functions: hashinit and hashdone. Just reading the source code is an adequate
description of how to use a hash table in the kernel.
------------------------------------------------------ kern/kern_subr.c
314 /*
315  * General routine to allocate a hash table.
316  * Allocate enough memory to hold at least `elements' list-head pointers.
317  * Return a pointer to the allocated space and set *hashmask to a pattern
318  * suitable for masking a value to use as an index into the returned array.
319  */
320 void *
321 hashinit(elements, htype, mtype, mflags, hashmask)
322 	u_int elements;
323 	enum hashtype htype;
324 	int mtype, mflags;
325 	u_long *hashmask;
326 {
327 	u_long hashsize, i;
328 	LIST_HEAD(, generic) *hashtbl_list;
329 	TAILQ_HEAD(, generic) *hashtbl_tailq;
330 	size_t esize;
331 	void *p;
332
333 	if (elements == 0)
334 		panic("hashinit: bad cnt");
335 	for (hashsize = 1; hashsize < elements; hashsize <<= 1)
336 		continue;
337
338 	switch (htype) {
339 	case HASH_LIST:
340 		esize = sizeof(*hashtbl_list);
341 		break;
342 	case HASH_TAILQ:
343 		esize = sizeof(*hashtbl_tailq);
344 		break;
345 #ifdef DIAGNOSTIC
346 	default:
347 		panic("hashinit: invalid table type");
348 #endif
349 	}
350
351 	if ((p = malloc(hashsize * esize, mtype, mflags)) == NULL)
352 		return (NULL);
353
354 	switch (htype) {
355 	case HASH_LIST:
356 		hashtbl_list = p;
357 		for (i = 0; i < hashsize; i++)
358 			LIST_INIT(&hashtbl_list[i]);
359 		break;
360 	case HASH_TAILQ:
361 		hashtbl_tailq = p;
362 		for (i = 0; i < hashsize; i++)
363 			TAILQ_INIT(&hashtbl_tailq[i]);
364 		break;
365 	}
366 	*hashmask = hashsize - 1;
367 	return (p);
368 }
369
370 /*
371  * Free memory from hash table previously allocated via hashinit().
372  */
373 void
374 hashdone(hashtbl, mtype)
375 	void *hashtbl;
376 	int mtype;
377 {
378
379 	free(hashtbl, mtype);
380 }
------------------------------------------------------ kern/kern_subr.c
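To see how these two functions are typically used, here is a minimal sketch of a
hypothetical kernel subsystem that keeps objects on hash chains. The structure and
names (mycache, mychashtbl, mychash) and the M_TEMP malloc type chosen here are
assumptions for illustration, not code from the kernel sources.
------------------------------------------------------
/* A hypothetical object kept on a hash chain. */
struct mycache {
	LIST_ENTRY(mycache) mc_hash;	/* hash chain linkage */
	u_long		    mc_key;
};

LIST_HEAD(mychashhead, mycache) *mychashtbl;	/* array of list heads */
u_long mychash;					/* mask set by hashinit */

void
mycache_init(void)
{
	/* Room for at least 64 LIST heads, rounded up to a power of two. */
	mychashtbl = hashinit(64, HASH_LIST, M_TEMP, M_WAITOK, &mychash);
}

struct mychashhead *
mycache_bucket(u_long key)
{
	/* Mask the key to index a bucket, as the hashinit comment suggests. */
	return (&mychashtbl[key & mychash]);
}

void
mycache_done(void)
{
	hashdone(mychashtbl, M_TEMP);
}
------------------------------------------------------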
1.3
Waiting and Sleeping in Kernel
1.4
Kernel Lock Manager
The lock functions provide synchronisation in the kernel by preventing multiple
threads from simultaneously executing critical sections of code accessing shared
data. A number of different locks are available:
1.4.1
simplelock and lock
struct simplelock
Provides a simple spinning mutex.
A processor will busy-wait
while trying to acquire a simplelock.
The simplelock operations
are implemented with machine-dependent locking primitives.
Simplelocks are usually used only by the high-level lock manager
and to protect short, critical sections of code.
Simplelocks
are the only locks that can be used inside an interrupt handler.
For a simplelock to be used in an interrupt handler, care
must be taken to disable the interrupt, acquire the lock, do any
processing, release the simplelock and re-enable the interrupt.
This procedure is necessary to avoid deadlock between the inter-
rupt handler and other threads executing on the same processor.
struct lock
             Provides a high-level lock supporting sleeping or spinning until
             the lock can be acquired.  The lock manager supplies both
             exclusive-access and shared-access locks, with recursive
             exclusive-access locks within a single thread.  It also allows
             upgrading a shared-access lock to an exclusive-access lock, as
             well as downgrading an exclusive-access lock to a shared-access
             lock.
If the kernel option LOCKDEBUG is enabled, additional facilities are provided
to record lock information and to assist in determining deadlock occurrences.
1.4.2
Simplelock Interfaces
simple_lock_init(slock)
The simplelock slock is initialised to the unlocked state.
A
statically allocated simplelock also can be initialised with the
macro SIMPLELOCK_INITIALIZER.
The effect is the same as the dy-
namic initialisation by a call to simple_lock_init.
For exam-
ple,
struct simplelock slock = SIMPLELOCK_INITIALIZER;
simple_lock(slock)
The simplelock slock is locked.
If the simplelock is held then
execution will spin until the simplelock is acquired.
Care must
be taken that the calling thread does not already hold the sim-
plelock.
In this case, the simplelock can never be acquired.
If kernel option LOCKDEBUG is enabled, a "locking against my-
self" panic will occur.
simple_lock_try(slock)
Try to acquire the simplelock slock without spinning.
If the
simplelock is held by another thread then the return value is 0.
If the simplelock was acquired successfully then the return val-
ue is 1.
simple_unlock(slock)
The simplelock slock is unlocked.
The simplelock must be locked
and the calling thread must be the one that last acquired the
simplelock.
If the calling thread does not hold the simplelock,
the simplelock will be released but the kernel behaviour is un-
defined.
simple_lock_freecheck(start, end)
Check that all simplelocks in the address range start to end are
not held.
If a simplelock within the range is found, the kernel
enters the debugger.
This function is available only with ker-
nel option LOCKDEBUG.
It provides a mechanism for basic simple-
lock consistency checks.
simple_lock_dump(void)
Dump the state of all simplelocks in the kernel.
This function
is available only with kernel option LOCKDEBUG.
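As a minimal sketch of the interrupt-handler discipline described above (disable the
interrupt, acquire the simplelock, do the work, release it, re-enable the interrupt),
a top-half routine that shares a queue with an interrupt handler might look like
the following; sc_slock, the queue, and the splbio level are illustrative assumptions.
------------------------------------------------------
struct simplelock sc_slock = SIMPLELOCK_INITIALIZER;

void
top_half_enqueue(struct buf *bp)
{
	int s;

	s = splbio();			/* block the bottom half first */
	simple_lock(&sc_slock);		/* then take the spin mutex */
	/* ... put bp on the queue shared with the interrupt handler ... */
	simple_unlock(&sc_slock);
	splx(s);			/* re-enable the interrupt */
}
------------------------------------------------------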
1.4.3
Lock Interfaces
lockinit(lock, prio, wmesg, timo, flags)
The lock lock is initialised according to the parameters provid-
ed.
Arguments are as follows:

lock
The lock.
prio
The thread priority when it is woken up after sleeping
on the lock.
wmesg
A sleep message used when a thread goes to sleep wait-
ing for the lock, so that the exact reason it is sleep-
ing can easily be identified.
timo
The maximum sleep time.
Used by tsleep(9).
flags
Flags to specify the lock behaviour permanently over
the lifetime of the lock.
Valid lock flags are:
LK_NOWAIT
Threads should not sleep when attempting to
acquire the lock.
LK_SLEEPFAIL
Threads should sleep, then return failure when
acquiring the lock.
LK_CANRECURSE
Threads can acquire the lock recursively.
lockmgr(lock, flags, slock)
Set, change or release a lock according to the parameters pro-
vided.
Arguments are as follows:
lock
The lock.
slock
Simplelock interlock.
If the flag LK_INTERLOCK is set
in flags, slock is a simplelock held by the caller.
When the lock lock is acquired, the simplelock is re-
leased.
If the flag LK_INTERLOCK is not set, slock is
ignored.
flags
Flags to specify the lock request type.
In addition to
the flags specified above, the following flags are
valid:
LK_SHARED
Get one of many possible shared-access locks.
If a thread holding an exclusive-access lock
requests a shared-access lock, the exclusive-
access lock is downgraded to a shared-access
lock.
LK_EXCLUSIVE
Stop further shared-access locks, when they
are cleared, grant a pending upgrade if it ex-
ists, then grant an exclusive-access lock.
Only one exclusive-access lock may exist at a

time, except that a thread holding an exclu-
sive-access lock may get additional exclusive-
access locks if it explicitly sets the LK_CAN-
RECURSE flag in the lock request, or if the
LK_CANRECURSE flag was set when the lock was
initialised.
LK_UPGRADE
The thread must hold a shared-access lock that
it wants to have upgraded to an exclusive-ac-
cess lock.
Other threads may get exclusive
access to the protected resource between the
time that the upgrade is requested and the
time that it is granted.
LK_EXCLUPGRADE
The thread must hold a shared-access lock that
it wants to have upgraded to an exclusive-ac-
cess lock.
If the request succeeds, no other
threads will have acquired exclusive access to
the protected resource between the time that
the upgrade is requested and the time that it
is granted.
However, if another thread has
already requested an upgrade, the request will
fail.
LK_DOWNGRADE
The thread must hold an exclusive-access lock
that it wants to have downgraded to a shared-
access lock.
If the thread holds multiple
(recursive) exclusive-access locks, they will
all be downgraded to shared-access locks.
LK_RELEASE
Release one instance of a lock.
LK_DRAIN
Wait for all activity on the lock to end, then
mark it decommissioned.
This feature is used
before freeing a lock that is part of a piece
of memory that is about to be freed.
LK_REENABLE
Lock is to be re-enabled after drain.
The
LK_REENABLE flag may be set only at the re-
lease of a lock obtained by a drain.
LK_SETRECURSE
Other locks while we have it OK.
LK_RECURSEFAIL
Attempt at recursive lock fails.
LK_SPIN
Lock spins instead of sleeping.

LK_INTERLOCK
Unlock the simplelock slock when the lock is
acquired.
lockstatus(lock)
Determine the status of lock lock.
Returns LK_EXCLUSIVE or
LK_SHARED for exclusive-access and shared-access locks respec-
tively.
lockmgr_printinfo(lock)
Print out information about state of lock lock.
spinlockinit(lock, wmesg, flags)
The lock lock is initialised as a spinlock according to the pa-
rameters provided.
Arguments are as follows:
lock
The lock.
wmesg
This is a simple name for lock.
flags
Flags to specify the lock behaviour.
Valid lock flags
are the same as outlined above.
spinlockmgr(lock, flags, slock)
Set, change or release a lock according to the parameters pro-
vided.
Arguments are as follows:
lock
The spin lock.
flags
Flags to specify the lock request type.
Valid lock
flags are the same as outlined above.
slock
Simplelock interlock.
The simplelock slock is set by
the caller.
When the lock lock is acquired, the sim-
plelock is released.
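A rough usage sketch of the lock manager, assuming a subsystem that wants shared
readers and a single exclusive writer, follows. The priority PINOD, the sleep message
"mylock", and the function names are illustrative choices, not code from the
filesystem sources.
------------------------------------------------------
struct lock my_lock;

void
my_init(void)
{
	/* Sleep at priority PINOD, message "mylock", no timeout, no flags. */
	lockinit(&my_lock, PINOD, "mylock", 0, 0);
}

void
my_read_path(void)
{
	lockmgr(&my_lock, LK_SHARED, NULL);	/* many readers may hold this */
	/* ... read the shared structure ... */
	lockmgr(&my_lock, LK_RELEASE, NULL);
}

void
my_write_path(void)
{
	lockmgr(&my_lock, LK_EXCLUSIVE, NULL);	/* single writer */
	/* ... modify the shared structure ... */
	lockmgr(&my_lock, LK_RELEASE, NULL);
}
------------------------------------------------------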
1.5
Kernel Memory Allocation
1.6
Resource Pool Manager
Before describing virtual filesystem initialization, the resource-pool manager needs a
brief mention, since it is used as a base library in implementing (1) namei
pathname buffers, (2) vnode management data structures, (3) file structures, (4)
current working directory structures, (5) file descriptor structures, and so forth.
The resource-pool manager is implemented in kern/subr_pool.c. Instead of
investigating the details of this code, we present the API of the pool allocator defined
in sys/pool.h, which is the minimum background needed to follow the essentials of the
fast filesystem code.
These utility routines provide management of pools of fixed-sized areas of memory.
A resource pool sets aside an amount of memory for exclusive use by the resource
pool owner.

1.6.1
Design of Resource-Pool Manager
Memory is allocated in pages, which are split into pieces according to the pool item
size. Each page is kept on a list headed by `pr_pagelist' in the pool structure. The
individual pool items are on a linked list headed by `ph_itemlist' in each page header.
------------------------------------------------------ sys/pool.h
208 void		pool_init(struct pool *, size_t, u_int, u_int,
209 		    int, const char *, struct pool_allocator *);
210 void		pool_destroy(struct pool *);
211
212 void		pool_set_drain_hook(struct pool *,
213 		    void (*)(void *, int), void *);
214
215 void		*pool_get(struct pool *, int);
216 void		pool_put(struct pool *, void *);
217 int		pool_reclaim(struct pool *);
218
	...
231 int		pool_prime(struct pool *, int);
232 void		pool_setlowat(struct pool *, int);
233 void		pool_sethiwat(struct pool *, int);
234 void		pool_sethardlimit(struct pool *, int, const char *, int);
235 void		pool_drain(void *);
236
237 /*
238  * Debugging and diagnostic aides.
239  */
240 void		pool_print(struct pool *, const char *);
241 void		pool_printit(struct pool *, const char *,
242 		    void (*)(const char *, ...));
243 int		pool_chk(struct pool *, const char *);
244
245 /*
246  * Pool cache routines.
247  */
248 void		pool_cache_init(struct pool_cache *, struct pool *,
249 		    int (*ctor)(void *, void *, int),
250 		    void (*dtor)(void *, void *),
251 		    void *);
252 void		pool_cache_destroy(struct pool_cache *);
253 void		*pool_cache_get(struct pool_cache *, int);
254 void		pool_cache_put(struct pool_cache *, void *);
255 void		pool_cache_destruct_object(struct pool_cache *, void *);
256 void		pool_cache_invalidate(struct pool_cache *);
------------------------------------------------------ sys/pool.h
1.6.2
Initializing a pool
The pool_init function declared in line 208 initializes a resource pool. This
function allows other kernel parts to declare static pools that must be initialized
before the kernel malloc function is available.
The arguments are, in order,

struct pool *pp
	is the handle identifying the pool resource instance.
size_t size
	specifies the size of the memory items managed by the pool.
u_int align
	specifies the memory address alignment of the items returned by the
	pool_get function. This argument must be a power of two. If zero, the
	alignment defaults to an architecture-specific natural alignment.
u_int ioff
int flags
const char *wchan
	sets the wait channel passed on to the tsleep function, a kernel
	function similar to sleep in the C library, if pool_get must wait for
	items to be returned to the pool. If you run the top program, you may
	see this string in the STATE field.
struct pool_allocator *palloc
	is called to add additional memory if the pool is depleted.
1.6.3
Destroying a Pool
The pool_destroy function declared in line 210 destroys a resource pool. It takes a
single argument identifying the pool resource instance.
1.6.4
Allocating Items from a Pool
The pool_get function allocates an item from the pool and returns a pointer to it. The
arguments are, in order,
struct pool *pp
	is the handle identifying the pool resource instance.
int flags
	defines the behavior in case the pooled resources are depleted. If no
	resources are available and PR_WAITOK is given, this function will
	wait until items are returned to the pool. If both PR_LIMITFAIL
	and PR_WAITOK are specified, and the pool has reached its hard limit,
	pool_get will return NULL without waiting, allowing the caller to do
	its own garbage collection.
1.6.5
Returning Items to a Pool
The pool_put function declared in line 216 returns the pool item to the resource pool.
If the number of available items in the pool exceeds the maximum pool size set by
the pool_sethiwat function and there are no outstanding requests for pool items, the
excess items will be returned to the system.
The arguments are, in order,
struct pool *pp
	is the handle identifying the pool resource instance.
void *item
	is a pointer to a pool item previously obtained by the pool_get
	function.
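Putting pool_init, pool_get and pool_put together, a kernel subsystem typically
declares one pool per object type, roughly as in the sketch below. The object type
struct myobj and the pool name "myobjpl" are assumptions made for illustration,
not code from the kernel sources.
------------------------------------------------------
struct myobj {
	int	mo_field;
	/* ... */
};

static struct pool myobj_pool;

void
myobj_subsystem_init(void)
{
	/* Items of sizeof(struct myobj), default alignment, default allocator. */
	pool_init(&myobj_pool, sizeof(struct myobj), 0, 0, 0, "myobjpl", NULL);
}

struct myobj *
myobj_alloc(void)
{
	/* Sleep until an item becomes available. */
	return (pool_get(&myobj_pool, PR_WAITOK));
}

void
myobj_free(struct myobj *mo)
{
	pool_put(&myobj_pool, mo);
}
------------------------------------------------------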
1.6.6
Using Cache to Speed Up
Pool caches provide a way for constructed objects to be cached. This can improve
performance by deferring needless object construction and destruction until it is
absolutely necessary.
Caches are grouped into cache groups. Each cache group references up to 16
constructed objects. A pool cache is initialized by the pool_cache_init function.
When a cache allocates an object from the pool, it calls the object's constructor
and places it into a cache group. When a cache group frees an object back to
the pool, it first calls the object's destructor. This allows the object to persist in
constructed form while freed to the cache.
Although a pool cache is initialized by the virtual filesystem, it is not otherwise used.
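A corresponding sketch for a pool cache layered on the pool from the previous
example might look as follows; the constructor and destructor bodies are
placeholders supplied for the illustration.
------------------------------------------------------
static struct pool	 myobj_pool;	/* backing pool, as above */
static struct pool_cache myobj_cache;

static int
myobj_ctor(void *arg, void *obj, int flags)
{
	/* construct the object; return 0 on success */
	return (0);
}

static void
myobj_dtor(void *arg, void *obj)
{
	/* release anything the constructor acquired */
}

void
myobj_cache_setup(void)
{
	pool_cache_init(&myobj_cache, &myobj_pool,
	    myobj_ctor, myobj_dtor, NULL /* ctor/dtor argument */);
}

struct myobj *
myobj_cache_alloc(void)
{
	/* Returns an already-constructed object when one is cached. */
	return (pool_cache_get(&myobj_cache, PR_WAITOK));
}

void
myobj_cache_free(struct myobj *mo)
{
	pool_cache_put(&myobj_cache, mo);
}
------------------------------------------------------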
1.6.7
Other Resource-Pool Manager API
There is some resource-pool manager API that is not described here, such as pool_prime,
pool_sethiwat, pool_setlowat, pool_set_drain_hook, pool_reclaim, pool_drain,
pool_sethardlimit, and so forth.
Although this API is designed and implemented by the resource-pool manager
designer, it is not used in the filesystem code.

Chapter 2
I/O System
2.1
I/O Mapping from User to Device
There are four main kinds of I/O in 4.4BSD.
· filesystem
· character-device interface
· block-device interface
· socket interface
The character-device interface provides unstructured access to the underlying
hardware, whereas the block-device interface provides structured access to the
underlying hardware.
All I/O executed through the block-device interface is done to or from I/O buffers that
reside in the kernel's address space, the buffer cache. This approach requires at
least one memory-to-memory copy operation to satisfy a user request.
For the character-device interface, I/O operations do not go through the buffer
cache; instead, they are made directly between the device and buffers in the appli-
cation's virtual address space.
2.1.1
Device Drivers
A device driver is divided into three main sections:
1. Autoconfiguration and initialization routines
2. The top half: routines for servicing I/O requests
3. The bottom half: interrupt service routines
2.1.2
I/O Queueing
Queue Processing Procedure
The I/O queues are the primary means of communication between the top and
bottom halves of a device driver. When an input or output request is received by
the top half of the driver,
1. it is recorded in a data structure that is placed on a per-device queue for pro-
cessing.
2. When an input or output operation completes, the device driver receives an
interrupt from the controller.
3. The interrupt service routine removes the appropriate request from the de-
vice's queue,
4. notifies the requester that the command has completed, and then
5. starts the next request from the queue.
Maintaining Queue Consistency
Because I/O queues are shared among asynchronous routines, access to the queues
must be synchronized. Routines that make up the top half of a device driver must
raise the processor priority level using splbio(), spltty(), etc. to prevent the
bottom half from being entered as a result of an interrupt while a top-half routine
is manipulating an I/O queue.
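A minimal sketch of this discipline in a top-half routine, with invented names, is
shown below; the actual queueing and device start-up are only indicated by comments.
------------------------------------------------------
void
mydev_strategy(struct buf *bp)
{
	int s;

	s = splbio();		/* block bottom-half (I/O completion) interrupts */
	/* ... append bp to the per-device queue ... */
	/* ... start the transfer if the device is idle ... */
	splx(s);		/* restore the previous priority level */
}
------------------------------------------------------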
2.1.3
Interrupt Handling
The system arranges for the unit-number parameter to be passed to the interrupt
service routine for each device by installing the address of an auxiliary glue routine
in the interrupt-vector table. This glue routine, rather than the actual interrupt
service routine, is invoked to service the interrupt.
2.2
Block Devices
The task of the block-device interface is to convert from the user abstraction of
a disk as an array of bytes to the structure imposed by the underlying physical
medium. This operation of converting random access to an array of bytes into reads
and writes of disk sectors is known as block I/O.
2.2.1
Entry Points for Block-Device Drivers
open commonly verifies the integrity of the associated medium. The open entry point
	is called for each open or mount system call on a block special device file.
strategy starts a read or write operation and returns immediately. Block I/O
	routines such as bread or bwrite call the device's strategy routine
	to read or write data not in the buffer cache. If the request is synchronous,
	the caller must sleep until the I/O completes.
close Disk devices have nothing to do when a device is closed.
dump writes all physical memory to the device.
psize returns the size of a disk-drive partition. This entry point is used during the
	bootstrap procedure to calculate the location at which a crash dump should
	be placed and to determine the sizes of the swap devices.
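A hypothetical skeleton of these entry points, with all names invented for
illustration and argument lists following the conventional block-device driver
signatures of this era, might look like this:
------------------------------------------------------
int
mydiskopen(dev_t dev, int flag, int fmt, struct proc *p)
{
	/* verify that the unit exists and read the disklabel if necessary */
	return (0);
}

void
mydiskstrategy(struct buf *bp)
{
	/* validate bp, queue it, start the hardware, and return immediately;
	   completion is signalled later from the interrupt handler */
}

int
mydiskdump(dev_t dev, daddr_t blkno, caddr_t va, size_t size)
{
	/* write physical memory to the dump area, polling, without interrupts */
	return (0);
}

int
mydisksize(dev_t dev)
{
	/* return the size of the swap partition, or -1 if there is none */
	return (-1);
}
------------------------------------------------------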
2.2.2
Disk Labels
What is in it ?
A disk label contains information about the geometry of the disk and about the
layout of the partitions.

How is it used ?
When the machine is powered up or the reset button is pressed, the CPU executes
the hardware bootstrap code from the ROM. The hardware bootstrap code typically
reads the first few sectors on the disk into the main memory, then branches to the
address of the first location that it read. The program stored in these first few
sectors is the second-level bootstrap. Having the disk label stored in the part of the
disk read as part of the hardware bootstrap allows the second-level bootstrap to
have the disk-label information. This information gives it the ability to find the
root filesystem and hence the files, such as the kernel, needed to bring up 4.4BSD.
Format of Disk Label
The size and location of the second-level bootstrap are dependent on the require-
ments of the hardware bootstrap code. Since there is no standard for disk-label for-
mats and the hardware bootstrap code usually understands only the vendor label,
it is often necessary to support both the vendor and the 4.4BSD disk labels. Here,
the vendor label must be placed where the hardware bootstrap ROM code expects
it; the 4.4BSD label must be placed out of the way of the vendor label but within
the area that is read in by the hardware bootstrap code, so that it will be available
to the second-level bootstrap.
2.3
Character Devices
A character device usually maps the hardware interface into a byte stream. The
character interface for disks and tapes is also called the raw device interface. Its
primary task is to arrange for direct I/O to and from the device. It also handles the
asynchronous nature of I/O by maintaining and ordering an active queue of pending
transfers.
2.3.1
Raw Devices and Physical I/O
Most raw devices differ from block devices only in the way that they do I/O.
Whereas block devices read and write to and from the system buffer cache, raw
devices bypass the buffer cache. This eliminates the memory-to-memory copy, but
forgoes the benefits of data caching. To preserve consistency between data in the
buffer cache and data written directly to the device via the character-device interface,
the raw device should be used only when the block device is idle.
Buffers for Character-Device Interface
Because raw devices bypass the buffer cache, they are responsible for managing their
own buffer structures. The read and write entry points of a raw-device driver use the
physio function to start a raw I/O operation. The physio function manages a buffer
used to map the user data buffer. This buffer is completely different and separate
from the buffer cache used by the block-device driver.
The physio function from kern/kern_physio.c is shown below. You will understand
the details after you read the chapters about UVM and the buffer cache. For now,
just follow the algorithm described in the comments !
------------------------------------------------------ kern/kern_physio.c
 69 /*
 70  * Do "physical I/O" on behalf of a user.  "Physical I/O" is I/O directly
 71  * from the raw device to user buffers, and bypasses the buffer cache.
 72  *
 73  * Comments in brackets are from Leffler, et al.'s pseudo-code implementation.
 74  */
 75 int
 76 physio(strategy, bp, dev, flags, minphys, uio)
 77 	void (*strategy) __P((struct buf *));
 78 	struct buf *bp;
 79 	dev_t dev;
 80 	int flags;
 81 	void (*minphys) __P((struct buf *));
 82 	struct uio *uio;
 83 {
 84 	struct iovec *iovp;
 85 	struct proc *p = curproc;
 86 	int error, done, i, nobuf, s;
 87 	long todo;
 88
 89 	error = 0;
 90 	flags &= B_READ | B_WRITE;
 91
 92 	/* Make sure we have a buffer, creating one if necessary. */
 93 	if ((nobuf = (bp == NULL)) != 0) {
 94
 95 		bp = getphysbuf();
 96 		/* bp was just malloc'd so can't already be busy */
 97 		bp->b_flags |= B_BUSY;
 98
 99 	} else {
100
101 		/* [raise the processor priority level to splbio;] */
102 		s = splbio();
103
104 		/* [while the buffer is marked busy] */
105 		while (bp->b_flags & B_BUSY) {
106 			/* [mark the buffer wanted] */
107 			bp->b_flags |= B_WANTED;
108 			/* [wait until the buffer is available] */
109 			tsleep((caddr_t)bp, PRIBIO+1, "physbuf", 0);
110 		}
111
112 		/* Mark it busy, so nobody else will use it. */
113 		bp->b_flags |= B_BUSY;
114
115 		/* [lower the priority level] */
116 		splx(s);
117 	}
118
119 	/* [set up the fixed part of the buffer for a transfer] */
120 	bp->b_dev = dev;
121 	bp->b_error = 0;
122 	bp->b_proc = p;
123 	LIST_INIT(&bp->b_dep);
124
125 	/*
126 	 * [while there are data to transfer and no I/O error]
127 	 * Note that I/O errors are handled with a 'goto' at the bottom
128 	 * of the 'while' loop.
129 	 */
130 	for (i = 0; i < uio->uio_iovcnt; i++) {
131 		iovp = &uio->uio_iov[i];
132 		while (iovp->iov_len > 0) {
133
134 			/*
135 			 * [mark the buffer busy for physical I/O]
136 			 * (i.e. set B_PHYS (because it's an I/O to user
137 			 * memory, and B_RAW, because B_RAW is to be
138 			 * "Set by physio for raw transfers.", in addition
139 			 * to the "busy" and read/write flag.)
140 			 */
141 			bp->b_flags = B_BUSY | B_PHYS | B_RAW | flags;
142
143 			/* [set up the buffer for a maximum-sized transfer] */
144 			bp->b_blkno = btodb(uio->uio_offset);
145 			bp->b_bcount = iovp->iov_len;
146 			bp->b_data = iovp->iov_base;
147
148 			/*
149 			 * [call minphys to bound the transfer size]
150 			 * and remember the amount of data to transfer,
151 			 * for later comparison.
152 			 */
153 			(*minphys)(bp);
154 			todo = bp->b_bcount;
155 #ifdef DIAGNOSTIC
156 			if (todo <= 0)
157 				panic("todo(%ld) <= 0; minphys broken", todo);
158 			if (todo > MAXPHYS)
159 				panic("todo(%ld) > MAXPHYS; minphys broken",
160 				    todo);
161 #endif
162
163 			/*
164 			 * [lock the part of the user address space involved
165 			 *    in the transfer]
166 			 * Beware vmapbuf(); it clobbers b_data and
167 			 * saves it in b_saveaddr.  However, vunmapbuf()
168 			 * restores it.
169 			 */
170 			PHOLD(p);
171 			error = uvm_vslock(p, bp->b_data, todo,
172 			    (flags & B_READ) ?
173 			    VM_PROT_WRITE : VM_PROT_READ);
174 			if (error) {
175 				bp->b_flags |= B_ERROR;
176 				bp->b_error = error;
177 				goto after_vsunlock;
178 			}
179 			vmapbuf(bp, todo);
180
181 			/* [call strategy to start the transfer] */
182 			(*strategy)(bp);
183
184 			/*
185 			 * Note that the raise/wait/lower/get error
186 			 * steps below would be done by biowait(), but
187 			 * we want to unlock the address space before
188 			 * we lower the priority.
189 			 *
190 			 * [raise the priority level to splbio]
191 			 */
192 			s = splbio();
193
194 			/* [wait for the transfer to complete] */
195 			while ((bp->b_flags & B_DONE) == 0)
196 				tsleep((caddr_t) bp, PRIBIO + 1, "physio", 0);
197
198 			/* Mark it busy again, so nobody else will use it. */
199 			bp->b_flags |= B_BUSY;
200
201 			/* [lower the priority level] */
202 			splx(s);
203
204 			/*
205 			 * [unlock the part of the address space previously
206 			 *    locked]
207 			 */
208 			vunmapbuf(bp, todo);
209 			uvm_vsunlock(p, bp->b_data, todo);
210  after_vsunlock:
211 			PRELE(p);
212
213 			/* remember error value (save a splbio/splx pair) */
214 			if (bp->b_flags & B_ERROR)
215 				error = (bp->b_error ? bp->b_error : EIO);
216
217 			/*
218 			 * [deduct the transfer size from the total number
219 			 *    of data to transfer]
220 			 */
221 			done = bp->b_bcount - bp->b_resid;
222 			KASSERT(done >= 0);
223 			KASSERT(done <= todo);
224
225 			iovp->iov_len -= done;
226 			iovp->iov_base = (caddr_t)iovp->iov_base + done;
227 			uio->uio_offset += done;
228 			uio->uio_resid -= done;
229
230 			/*
231 			 * Now, check for an error.
232 			 * Also, handle weird end-of-disk semantics.
233 			 */
234 			if (error || done < todo)
235 				goto done;
236 		}
237 	}
238
239 done:
240 	/*
241 	 * [clean up the state of the buffer]
242 	 * Remember if somebody wants it, so we can wake them up below.
243 	 * Also, if we had to steal it, give it back.
244 	 */
245 	s = splbio();
246 	bp->b_flags &= ~(B_BUSY | B_PHYS | B_RAW);
247 	if (nobuf)
248 		putphysbuf(bp);
249 	else {
250 		/*
251 		 * [if another process is waiting for the raw I/O buffer,
252 		 *    wake up processes waiting to do physical I/O;
253 		 */
254 		if (bp->b_flags & B_WANTED) {
255 			bp->b_flags &= ~B_WANTED;
256 			wakeup(bp);
257 		}
258 	}
259 	splx(s);
260
261 	return (error);
262 }
263
264 /*
265  * allocate a buffer structure for use in physical I/O.
266  */
267 struct buf *
268 getphysbuf()
269 {
270 	struct buf *bp;
271 	int s;
272
273 	s = splbio();
274 	bp = pool_get(&bufpool, PR_WAITOK);
275 	splx(s);
276 	memset(bp, 0, sizeof(*bp));
277 	return(bp);
278 }
279
280 /*
281  * get rid of a swap buffer structure which has been used in physical I/O.
282  */
283 void
284 putphysbuf(bp)
285 	struct buf *bp;
286 {
287 	int s;
288
289 	if (__predict_false(bp->b_flags & B_WANTED))
290 		panic("putphysbuf: private buf B_WANTED");
291 	s = splbio();
292 	pool_put(&bufpool, bp);
293 	splx(s);
294 }
295
296 /*
297  * Leffler, et al., says on p. 231:
298  * "The minphys() routine is called by physio() to adjust the
299  *  size of each I/O transfer before the latter is passed to
300  *  the strategy routine..."
301  *
302  * so, just adjust the buffer's count accounting to MAXPHYS here,
303  * and return the new count;
304  */
305 void
306 minphys(bp)
307 	struct buf *bp;
308 {
309
310 	if (bp->b_bcount > MAXPHYS)
311 		bp->b_bcount = MAXPHYS;
312 }
------------------------------------------------------ kern/kern_physio.c
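For context, the read and write entry points of a raw-device driver usually do little
more than hand the request to physio. A hypothetical driver (mydevread, mydevwrite
and mydev_strategy are invented names) would wrap it roughly like this:
------------------------------------------------------
int
mydevread(dev_t dev, struct uio *uio, int flags)
{
	/* Let physio lock the user pages and drive mydev_strategy. */
	return (physio(mydev_strategy, NULL, dev, B_READ, minphys, uio));
}

int
mydevwrite(dev_t dev, struct uio *uio, int flags)
{
	return (physio(mydev_strategy, NULL, dev, B_WRITE, minphys, uio));
}
------------------------------------------------------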
2.3.2
Entry Points for Character-Device Drivers
open
close
ioctl
mmap Map a device offset into a memory address. This entry point is called by the
virtual-memory system to convert a logical mapping to a physical address.
read
reset
select
stop
write
2.4
Descriptor Management
2.4.1
File Descriptor, Descriptor Table, and File Entry
For user processes, all I/O is done through file descriptors. System calls that refer to
open files take a file descriptor as an argument to specify the file. The file descriptor
is used by the kernel to index into the descriptor table for the current process to
locate a file entry.

2.4.2
What Does the File Entry Point To ?
A file entry can point to a vnode structure or to a socket.
vnode
The file entry provides a file type and a pointer to an underlying
object for the descriptor. For data files, the file entry points to a
vnode structure.
Special files do not have data blocks allocated on the disk; they
are handled by the special-device filesystem, which calls the appropriate
drivers to handle I/O for them.
The virtual-memory system supports the mapping of files into a
process's address space. Here, the file descriptor must reference a
vnode that will be partially or completely mapped into the user's
address space.
socket
The file entry may also reference a socket. Sockets have a
different file type, and the file entry points to a system block that
is used in doing interprocess communication.
2.4.3
Movement of Data Inside the Kernel: uiomove function
Within the kernel, I/O data are described by an array of vectors. Each I/O vector
or iovec has a base address and a length. The kernel maintains another structure,
called a uio structure. All I/O within the kernel is described with iovec and uio
structures. Movement of data proceeds in the following steps.
1. System calls such as read and write that are not passed an iovec create a
uio structure to describe their arguments.
2. The uio structure reaches the part of the kernel responsible for moving the
data to or from the process address space: the filesystem, the network, or a
device driver.
3. In general, these parts of the kernel arrange a kernel buffer to hold the data,
then use the uiomove function to copy the data to or from the buffer or buffers
described by the uio structure.
4. uiomove function is called with a pointer to kernel data area, a data count,
and a uio structure. As it moves data, it updates the counters and pointers
of the iovec and uio structures by a corresponding amount.
5. If the kernel buffer is not as large as the area described by the uio structure,
the uio structure will point to the part of the process address space just
beyond the location completed most recently. Thus, while servicing a request,
the kernel may call uiomove function multiple times, each time giving a pointer
to a new kernel buffer for the next block of data.
The source for the definition of iovec, uio structure is
------------------------------------------------------ sys/uio.h
54 struct iovec {
55 	void	*iov_base;	/* Base address. */
56 	size_t	 iov_len;	/* Length. */
57 };
58
59 #if !defined(_POSIX_C_SOURCE) && !defined(_XOPEN_SOURCE)
60 #include <sys/ansi.h>
61
62 #ifndef	off_t
63 typedef	__off_t		off_t;	/* file offset */
64 #define	off_t		__off_t
65 #endif
66
67 enum	uio_rw { UIO_READ, UIO_WRITE };
68
69 /* Segment flag values. */
70 enum uio_seg {
71 	UIO_USERSPACE,		/* from user data space */
72 	UIO_SYSSPACE		/* from system space */
73 };
74
75 struct uio {
76 	struct	iovec *uio_iov;	/* pointer to array of iovecs */
77 	int	uio_iovcnt;	/* number of iovecs in array */
78 	off_t	uio_offset;	/* offset into file this uio corresponds to */
79 	size_t	uio_resid;	/* residual i/o count */
80 	enum	uio_seg uio_segflg; /* see above */
81 	enum	uio_rw uio_rw;	/* see above */
82 	struct	proc *uio_procp;/* process if UIO_USERSPACE */
83 };
------------------------------------------------------ sys/uio.h
The source for uiomove function is
------------------------------------------------------ kern/kern_subr.c
140 int
141 uiomove(buf, n, uio)
142 	void *buf;
143 	size_t n;
144 	struct uio *uio;
145 {
146 	struct iovec *iov;
147 	u_int cnt;
148 	int error = 0;
149 	char *cp = buf;
150 	struct proc *p = uio->uio_procp;
151
152 #ifdef DIAGNOSTIC
153 	if (uio->uio_rw != UIO_READ && uio->uio_rw != UIO_WRITE)
154 		panic("uiomove: mode");
155 #endif
156 	while (n > 0 && uio->uio_resid) {
157 		iov = uio->uio_iov;
158 		cnt = iov->iov_len;
159 		if (cnt == 0) {
160 			uio->uio_iov++;
161 			uio->uio_iovcnt--;
162 			continue;
163 		}
164 		if (cnt > n)
165 			cnt = n;
166 		switch (uio->uio_segflg) {
167
168 		case UIO_USERSPACE:
169 			if (curproc->p_cpu->ci_schedstate.spc_flags &
170 			    SPCF_SHOULDYIELD)
171 				preempt(NULL);
172 			if (__predict_true(p == curproc)) {
173 				if (uio->uio_rw == UIO_READ)
174 					error = copyout(cp, iov->iov_base, cnt);
175 				else
176 					error = copyin(iov->iov_base, cp, cnt);
177 			} else {
178 				if (uio->uio_rw == UIO_READ)
179 					error = copyout_proc(p, cp,
180 					    iov->iov_base, cnt);
181 				else
182 					error = copyin_proc(p, iov->iov_base,
183 					    cp, cnt);
184 			}
185 			if (error)
186 				return (error);
187 			break;
188
189 		case UIO_SYSSPACE:
190 			if (uio->uio_rw == UIO_READ)
191 				error = kcopy(cp, iov->iov_base, cnt);
192 			else
193 				error = kcopy(iov->iov_base, cp, cnt);
194 			if (error)
195 				return (error);
196 			break;
197 		}
198 		iov->iov_base = (caddr_t)iov->iov_base + cnt;
199 		iov->iov_len -= cnt;
200 		uio->uio_resid -= cnt;
201 		uio->uio_offset += cnt;
202 		cp += cnt;
203 		KDASSERT(cnt <= n);
204 		n -= cnt;
205 	}
206 	return (error);
207 }
------------------------------------------------------ kern/kern_subr.c
where

int copyin(const void *uaddr, void *kaddr, size_t len);
	Copies len bytes of data from the user-space address uaddr
	to the kernel-space address kaddr.

int copyout(const void *kaddr, void *uaddr, size_t len);
	Copies len bytes of data from the kernel-space address kaddr
	to the user-space address uaddr.
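As a small illustration of steps 1-5 above, a character-driver read routine that
exposes a kernel buffer to the caller could use uiomove as sketched below. The softc
structure, its sc_buf and sc_buflen fields, and mydevread are assumptions made for
the example, not kernel code.
------------------------------------------------------
int
mydevread(dev_t dev, struct uio *uio, int flags)
{
	struct mydev_softc *sc = &mydev_sc[minor(dev)];	/* hypothetical softc */
	size_t len;

	/* Copy as much of the device buffer as the caller asked for. */
	len = MIN(uio->uio_resid, sc->sc_buflen);
	if (len == 0)
		return (0);

	/*
	 * uiomove copies between sc_buf and the (possibly several) iovecs
	 * described by uio, updating uio_resid and uio_offset as it goes.
	 */
	return (uiomove(sc->sc_buf, len, uio));
}
------------------------------------------------------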
Character device drivers that do not copy data from the process generally do
not interpret the uio structure. Instead, there is one low-level kernel routine that
arranges a direct transfer to or from the address space of the process. Here, a
separate I/O operation is done for each iovec element.
Block-device drivers do not use uio structures. User operations on block devices
are done through the buffer cache.

Chapter 3
Virtual File System
The virtual filesystem, VFS, is the kernel interface to filesystems. The interface
specifies the calls for the kernel to access filesystems. It also specifies the core
functionality that a filesystem must provide to the kernel. The focus of VFS activity
is the vnode, which is discussed in a later chapter.
3.1
Architecture of Virtual File System
3.1.1
Why VFS is needed ?
In earlier BSD, the file entries directly referenced the local filesystem inode. An
inode is a data structure that describes the contents of a file. However, with the
advent of multiple filesystem types, the architecture had to be generalized. Thus, it
was easier and more logical to add a new layer to the system below the file entry and
above the inode. This new layer was first implemented by Sun Microsystems, which
called it the virtual-node, or vnode, later. A vnode used by a local filesystem would
refer to an inode. A vnode used by a remote filesystem would refer to a protocol
control block that described the location and naming information necessary to access
the remote file.
3.1.2
What Is in the Vnode ?
The vnode is an extensible object-oriented interface. It contains information that is
generically useful independent of the underlying filesystem object that it represents.
The vnode structure is defined as,
------------------------------------------------------ sys/vnode.h
 86 struct vnode {
 87 	struct uvm_object v_uobj;		/* the VM object		*/
 88 #define	v_usecount	v_uobj.uo_refs
 89 #define	v_interlock	v_uobj.vmobjlock
 90 	voff_t		v_size;			/* size of file			*/
 91 	int		v_flag;			/* flags			*/
 92 	int		v_numoutput;		/* number of pending writes	*/
 93 	long		v_writecount;		/* reference count of writers	*/
 94 	long		v_holdcnt;		/* page & buffer references	*/
 95 	u_long		v_id;			/* capability identifier	*/
 96 	struct mount	*v_mount;		/* ptr to vfs we are in		*/
 97 	int		(**v_op) __P((void *));	/* vnode operations vector	*/
 98 	TAILQ_ENTRY(vnode) v_freelist;		/* vnode freelist		*/
 99 	LIST_ENTRY(vnode) v_mntvnodes;		/* vnodes for mount point	*/
100 	struct buflists	v_cleanblkhd;		/* clean blocklist head		*/
101 	struct buflists	v_dirtyblkhd;		/* dirty blocklist head		*/
102 	LIST_ENTRY(vnode) v_synclist;		/* vnodes with dirty buffers	*/
103 	union {
104 		struct mount	*vu_mountedhere;/* ptr to mounted vfs (VDIR)	*/
105 		struct socket	*vu_socket;	/* unix ipc (VSOCK)		*/
106 		struct specinfo	*vu_specinfo;	/* device (VCHR, VBLK)		*/
107 		struct fifoinfo	*vu_fifoinfo;	/* fifo (VFIFO)			*/
108 	} v_un;
109 	struct nqlease	*v_lease;		/* Soft reference to lease	*/
110 	enum vtype	v_type;			/* vnode type			*/
111 	enum vtagtype	v_tag;			/* type of underlying data	*/
112 	struct lock	v_lock;			/* lock for this vnode		*/
113 	struct lock	*v_vnlock;		/* pointer to lock		*/
114 	void		*v_data;		/* private data for fs		*/
115 	struct klist	v_klist;		/* knotes attached to vnode	*/
116 #ifdef VERIFIED_EXEC
117 	char		fp_status;		/* fingerprint status
118 						   (see below)			*/
119 #endif
120 };
121 #define	v_mountedhere	v_un.vu_mountedhere
122 #define	v_socket	v_un.vu_socket
123 #define	v_specinfo	v_un.vu_specinfo
124 #define	v_fifoinfo	v_un.vu_fifoinfo
------------------------------------------------------ sys/vnode.h
The information stored in a vnode includes the following:
v_usecount
	is the number of file entries that are open for reading and/or
	writing that reference the vnode.
v_numoutput
	is the number of buffer write operations in progress. To speed
	the flushing of dirty data, the kernel does this operation by
	doing asynchronous writes on all the dirty buffers at once. For
	a local filesystem, this simultaneous push causes all the buffers
	to be put into the disk queue, so that they can be sorted into an
	optimal order to minimize seeking. System calls that cannot return
	until the data are on stable store, such as the fsync system call,
	can sleep on the count of pending output operations, waiting for
	the count to reach zero.
v_writecount
	is the number of file entries that are open for writing that
	reference the vnode.
v_holdcnt
	is the number of pages and buffers that are associated with the
	vnode.
v_mount
	describes the filesystem that contains the object represented by
	the vnode.
v_op
	is a pointer to the set of vnode operations defined for the object.
v_freelist
	is a list linking together all the vnodes in the system that are
	not being used actively. The free list is used when a filesystem
	needs to allocate a new vnode.
v_mntvnodes
	is a list linking together all the vnodes associated with a
	specific mount point. Thus, when the sync system call is executed
	for a filesystem, the kernel can traverse this list to visit all
	the files active within that filesystem.
v_cleanblkhd
	is the head of the vnode clean-buffer list. This list stores all
	the buffers of the vnode that have not been modified, or that have
	been written back since they were last modified.
	This list is used to free buffers when a file is deleted. Since the
	file will never be read again, the kernel can immediately cancel any
	pending I/O on its dirty buffers, and reclaim all its clean and dirty
	buffers and place them at the head of the buffer free list, ready
	for immediate reuse.
v_dirtyblkhd
	is the head of the vnode dirty-buffer list. This list stores all
	the buffers of the vnode that have been modified but not yet
	written back.
v_un
	is a reference to state about special devices, sockets, and FIFOs.
v_lease
	is used with NFS. You need not regard it if you are only interested
	in the local filesystem code.
v_type
	is the type of the underlying object, such as regular file,
	directory, or character device. This type information is not
	strictly necessary, since a vnode client could always call a vnode
	operation to get the type of the underlying object. However, because
	the type is often needed, the type of the underlying object does not
	change, and it takes time to call through the vnode interface, the
	object type is cached in the vnode.
	This field has one of the values VNON, VREG, VDIR, VBLK, VCHR,
	VLNK, VSOCK, VFIFO, VBAD.
v_lock
	is used for locking the vnode.
v_data
	is a pointer to private information needed for the underlying
	object. For the local filesystem, this pointer will reference an inode.
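For the local filesystem this private pointer is recovered with a simple cast;
ufs/ufs/inode.h provides VTOI and ITOV macros essentially like the ones sketched
here, and a vnode operation uses them as shown (example_getprivate is an invented
name used only for illustration).
------------------------------------------------------
/* ufs/ufs/inode.h provides accessors essentially like these: */
#define	VTOI(vp)	((struct inode *)(vp)->v_data)	/* vnode -> inode */
#define	ITOV(ip)	((ip)->i_vnode)			/* inode -> vnode */

/* A local-filesystem routine can therefore reach its in-core inode as: */
int
example_getprivate(struct vnode *vp)
{
	struct inode *ip = VTOI(vp);	/* v_data points to the in-core inode */

	/* ... operate on the filesystem-private inode ip ... */
	(void)ip;
	return (0);
}
------------------------------------------------------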
3.1.3
How to Call Vnode Operations ?
The kernel manipulates vnodes by passing requests to the underlying object through a
set of defined operations.
As part of the booting process, each filesystem registers the set of vnode opera-
tions that it is able to support. The kernel then builds a table that lists the union of
all operations supported by any filesystem.
Supported operations are filled in with the entry point registered by the filesys-
tem. Filesystems may opt to have unsupported operations filled in with either a
default routine, or a routine that returns the characteristic error.
When a filesystem is mounted on a directory, the previous contents of the direc-
tory are hidden; only the contents of the root of the newly mounted filesystem are
visible. The mount command pushes a new layer onto a vnode stack; an unmount
command removes a layer.
When a file access such as open, read, stat, or close occurs to a vnode in the
stack, that vnode has several options:
· Do the requested operation and return a result.
· Pass the operation without change to the next-lower vnode on the stack.

· Modify the operations provided with the request, then pass it to the next-lower
vnode. When the operation returns from the lower vnode, it may modify the
results, or simply return them.
If an operation is passed to the bottom of the stack without any layer taking
action on it, then the interface will return the error "operation not supported."
To make passing operations efficient, the kernel places the vnode operation name
and its arguments into an argument structure. This structure is then passed as a
single parameter to the vnode operation. Thus, every call on a vnode operation has
exactly one parameter, which is the pointer to the argument structure.
Let's see how this design policy is implemented in NetBSD. When the user-level
write system call is executed, the kernel executes the vn_write function of
kern/vfs_vnops.c, as we saw in the previous chapter. The code of vn_write is listed
here again for easy reference.
------------------------------------------------------ kern/vfs_vnops.c
526 /*
527  * File table vnode write routine.
528  */
529 static int
530 vn_write(fp, offset, uio, cred, flags)
531 	struct file *fp;
532 	off_t *offset;
533 	struct uio *uio;
534 	struct ucred *cred;
535 	int flags;
536 {
537 	struct vnode *vp = (struct vnode *)fp->f_data;
538 	int count, error, ioflag = IO_UNIT;
539
540 	if (vp->v_type == VREG && (fp->f_flag & O_APPEND))
541 		ioflag |= IO_APPEND;
542 	if (fp->f_flag & FNONBLOCK)
543 		ioflag |= IO_NDELAY;
544 	if (fp->f_flag & FFSYNC ||
545 	    (vp->v_mount && (vp->v_mount->mnt_flag & MNT_SYNCHRONOUS)))
546 		ioflag |= IO_SYNC;
547 	else if (fp->f_flag & FDSYNC)
548 		ioflag |= IO_DSYNC;
549 	if (fp->f_flag & FALTIO)
550 		ioflag |= IO_ALTSEMANTICS;
551 	VOP_LEASE(vp, uio->uio_procp, cred, LEASE_WRITE);
552 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
553 	uio->uio_offset = *offset;
554 	count = uio->uio_resid;
555 	error = VOP_WRITE(vp, uio, ioflag, cred);
556 	if (flags & FOF_UPDATE_OFFSET) {
557 		if (ioflag & IO_APPEND)
558 			*offset = uio->uio_offset;
559 		else
560 			*offset += count - uio->uio_resid;
561 	}
562 	VOP_UNLOCK(vp, 0);
563 	return (error);
564 }
------------------------------------------------------ kern/vfs_vnops.c
The second virtual filesystem operation in this function is VOP_WRITE.
VOP_WRITE(vp, uio, ioflag, cred) writes to a file. The argument vp
is the vnode of the file to write to, uio is the location of the data
to write, ioflag is a set of flags, and cred is the credentials of the
calling process.
The ioflag argument is used to give directives and hints to the
file system. The low 16 bits are a bit mask which can contain the
same flags as VOP_READ().
Zero is returned on success, otherwise an error is returned. The
vnode should be locked on entry and remains locked on exit.
This function is defined twice, in kern/vnode_if.c and sys/vnode_if.h ! In one
form it is defined as an inline function, equivalent to a macro; in the other it is
defined as an ordinary function. Loadable Kernel Modules (LKMs) use the ordinary
function, and the normal kernel uses the fast inline function. The code listed below
is the inline-function version.
-------------------------------------------------- sys/vnode_if.h
374 static __inline int VOP_WRITE(vp, uio, ioflag, cred)
375 	struct vnode *vp;
376 	struct uio *uio;
377 	int ioflag;
378 	struct ucred *cred;
379 {
380 	struct vop_write_args a;
381 	a.a_desc = VDESC(vop_write);
382 	a.a_vp = vp;
383 	a.a_uio = uio;
384 	a.a_ioflag = ioflag;
385 	a.a_cred = cred;
386 	return (VCALL(vp, VOFFSET(vop_write), &a));
387 }
-------------------------------------------------- sys/vnode_if.h
VCALL and VOFFSET macros are defined as,
-------------------------------------------------- sys/vnode.h
469 /*
470  * VOCALL calls an op given an ops vector.  We break it out because BSD's
471  * vclean changes the ops vector and then wants to call ops with the old
472  * vector.
473  */
474 /*
475  * actually, vclean doesn't use it anymore, but nfs does,
476  * for device specials and fifos.
477  */
478 #define VOCALL(OPSV,OFF,AP) (( *((OPSV)[(OFF)])) (AP))
479
480 /*
481  * This call works for vnodes in the kernel.
482  */
483 #define VCALL(VP,OFF,AP) VOCALL((VP)->v_op,(OFF),(AP))
484 #define VDESC(OP) (& __CONCAT(OP,_desc))
485 #define VOFFSET(OP) (VDESC(OP)->vdesc_offset)
-------------------------------------------------- sys/vnode.h
The VOFFSET macro used in line 386 of sys/vnode_if.h expands as,
VOFFSET(vop_write)
-->
(VDESC(vop_write)->vdesc_offset)
-->
((& __CONCAT(vop_write,_desc))->vdesc_offset)
-->
(vop_write_desc->vdesc_offset)
Therefore the VCALL macro used in line 386 expands as,
VCALL(vp, VOFFSET(vop_write), &a)
-->
VCALL(vp, vop_write_desc->vdesc_offset, &a)
-->
VOCALL((vp)->v_op,(vop_write_desc->vdesc_offset),(&a))
-->
(( *(((vp)->v_op)[((vop_write_desc->vdesc_offset))])) (&a))
Thus line 386,

	return (VCALL(vp, VOFFSET(vop_write), &a));

is equivalent to

	return ( *( (vp->v_op) [vop_write_desc->vdesc_offset] ) ) (&a);
vop_write_desc is defined in kern/vnode_if.c in the same form as the other operation
descriptors; for example, vop_bwrite_desc is defined as,
------------------------------------------------------ kern/vnode_if.c
109 const struct vnodeop_desc vop_bwrite_desc = {
110 	2,
111 	"vop_bwrite",
112 	0,
113 	vop_bwrite_vp_offsets,
114 	VDESC_NO_OFFSET,
115 	VDESC_NO_OFFSET,
116 	VDESC_NO_OFFSET,
117 	VDESC_NO_OFFSET,
118 	NULL,
119 };
------------------------------------------------------ kern/vnode_if.c
where struct vnodeop_desc is defined in sys/vnode.h as,
-------------------------------------------------- sys/vnode.h
389 /*
390  * This structure describes the vnode operation taking place.
391  */
392 struct vnodeop_desc {
393         int     vdesc_offset;           /* offset in vector--first for speed */
394         const char *vdesc_name;         /* a readable name for debugging */
395         int     vdesc_flags;            /* VDESC_* flags */
396
397         /*
398          * These ops are used by bypass routines to map and locate arguments.
399          * Creds and procs are not needed in bypass routines, but sometimes
400          * they are useful to (for example) transport layers.
401          * Nameidata is useful because it has a cred in it.
402          */
403         const int *vdesc_vp_offsets;    /* list ended by VDESC_NO_OFFSET */
404         int     vdesc_vpp_offset;       /* return vpp location */
405         int     vdesc_cred_offset;      /* cred location, if any */
406         int     vdesc_proc_offset;      /* proc location, if any */
407         int     vdesc_componentname_offset; /* if any */
408         /*
409          * Finally, we've got a list of private data (about each operation)
410          * for each transport layer.  (Support to manage this list is not
411          * yet part of BSD.)
412          */
413         caddr_t *vdesc_transports;
414 };
-------------------------------------------------- sys/vnode.h
Therefore line 386 is equivalent to

    return ( *( (vp->v_op) [vop_write_desc->vdesc_offset] ) ) (&a);
    <====>
    return ( *( (vp->v_op) [2] ) ) (&a);
    <====>
    return ffs_write (&a);

In the next section, we will explain how the vp->v_op pointer is initialized.
3.2 Virtual Filesystem Initialization
Virtual filesystem initialization is initiated from the main function of kern/init_main.c,
which is practically the first function executed after machine bootstrap. There, the
vfsinit function of kern/vfs_init.c is called.
------------------------------------------------------ kern/vfs_init.c
320 /*
321  * Initialize the vnode structures and initialize each file system type.
322  */
323 void
324 vfsinit()
325 {
326         extern struct vfsops * const vfs_list_initial[];
327         int i;
328
329         /*
330          * Initialize the namei pathname buffer pool and cache.
331          */
332         pool_init(&pnbuf_pool, MAXPATHLEN, 0, 0, 0, "pnbufpl",
333             &pool_allocator_nointr);
334         pool_cache_init(&pnbuf_cache, &pnbuf_pool, NULL, NULL, NULL);
335
336         /*
337          * Initialize the vnode table
338          */
339         vntblinit();
340
341         /*
342          * Initialize the vnode name cache
343          */
344         nchinit();
345
346 #ifdef DEBUG
347         /*
348          * Check the list of vnode operations.
349          */
350         vfs_op_check();
351 #endif
352
353         /*
354          * Initialize the special vnode operations.
355          */
356         vfs_opv_init(vfs_special_vnodeopv_descs);
357
358         /*
359          * Establish each file system which was statically
360          * included in the kernel.
361          */
362         vattr_null(&va_null);
363         for (i = 0; vfs_list_initial[i] != NULL; i++) {
364                 if (vfs_attach(vfs_list_initial[i])) {
365                         printf("multiple `%s' file systems",
366                             vfs_list_initial[i]->vfs_name);
367                         panic("vfsinit");
368                 }
369         }
370 }
------------------------------------------------------ kern/vfs_init.c
Now we describe each portion of the vfsinit function together with the related source code.
3.2.1 Initializing the namei pathname buffer pool
At line 332, the vfsinit function initializes the namei pathname buffer pool. MAXPATHLEN
specifies the size of the memory items managed by the pool and is defined to 1024 in
sys/param.h and sys/syslimits.h. The next three zero parameters mean that there is
no alignment constraint on the pool items and that the logging facility is not used. When
the top program runs, "pnbufpl" appears in the state field whenever the kernel is waiting
for an item from this pool to become available.
At line 334, the vfsinit function initializes the pool cache for the namei pathname
buffer pool. This cache, however, is not used anywhere else; it is only set up here, but
it may become useful in a future release of the NetBSD operating system.
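To make the role of this pool concrete, the fragment below shows how a pathname
buffer would typically be obtained and released through the resource-pool interface
described in Chapter 1. This is a minimal illustration written for this book, not a
quote from the kernel source; only pool_get, pool_put, and the pnbuf_pool variable
defined above are taken from the real code.

/* Grab a MAXPATHLEN-sized pathname buffer from the pool initialized at
 * line 332, use it, and return it.  PR_WAITOK tells pool_get() that it
 * may sleep until an item becomes available, which is why "pnbufpl"
 * can show up as a wait channel in top(1). */
static void
pathname_buffer_example(void)
{
        char *pnbuf;

        pnbuf = pool_get(&pnbuf_pool, PR_WAITOK);   /* allocate one item */
        /* ... copy a pathname into pnbuf, at most MAXPATHLEN bytes ... */
        pool_put(&pnbuf_pool, pnbuf);               /* return it to the pool */
}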
3.2.2 Initializing the vnode table
The vnode table is initialized by calling the vntblinit function at line 339 of
kern/vfs_init.c; vntblinit itself is defined in kern/vfs_subr.c.
------------------------------------------------------ kern/vfs_subr.c
190 void
191 vntblinit()
192 {
193
194         pool_init(&vnode_pool, sizeof(struct vnode), 0, 0, 0, "vnodepl",
195             &pool_allocator_nointr);
196
197         /*
198          * Initialize the filesystem syncer.
199          */
200         vn_initialize_syncerd();
201 }
------------------------------------------------------ kern/vfs_subr.c
In just the same way as the namei pathname buffer pool, the pool for the vnode
management data structures is initialized at line 194 of kern/vfs_subr.c.
At line 200, the vn_initialize_syncerd function sets up the filesystem syncer,
which flushes cached data to disk at regular intervals.
Surprisingly, the filesystem syncer daemon uses a special kind of virtual filesystem
called the syncfs filesystem ! The syncfs virtual filesystem is implemented in the
miscfs/syncfs directory. When a filesystem is mounted, syncfs is installed as a virtual
filesystem layer on top of that filesystem, by creating a new filesystem syncer vnode
for the specified mount point through the vfs_allocate_syncvnode function of
miscfs/syncfs/sync_vnops.c.
The source code of the vn_initialize_syncerd function is
------------------------------------------------ miscfs/syncfs/sync_subr.c
 70 void
 71 vn_initialize_syncerd()
 72 {
 73         int i;
 74
 75         syncer_last = SYNCER_MAXDELAY + 2;
 76
 77         syncer_workitem_pending = malloc(syncer_last * sizeof (struct synclist),
 78             M_VNODE, M_WAITOK);
 79
 80         for (i = 0; i < syncer_last; i++)
 81                 LIST_INIT(&syncer_workitem_pending[i]);
 82
 83         lockinit(&syncer_lock, PVFS, "synclk", 0, 0);
 84 }
------------------------------------------------ miscfs/syncfs/sync_subr.c

In line 75, SYNCER_MAXDELAY is the maximum delay interval between syncer daemon
work passes, and it is defined to 32 seconds by default.
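To see how the syncer_workitem_pending buckets just allocated are meant to be used,
the sketch below shows the calendar arithmetic by which a vnode would be scheduled
for a flush delay seconds in the future. It is only an illustration of the data
structure: the real code is vn_syncer_add_to_worklist in miscfs/syncfs/sync_subr.c,
and the names syncer_delayno (the bucket currently being processed) and v_synclist
(the list hook inside struct vnode) are assumed here rather than quoted.

/* Illustration only: schedule vp to be synced roughly "delay" seconds from
 * now by placing it on one of the synclist buckets.  The syncer daemon walks
 * one bucket per second, treating the array as a circular calendar. */
static void
example_add_to_worklist(struct vnode *vp, int delay)
{
        int slot;

        if (delay > SYNCER_MAXDELAY - 2)
                delay = SYNCER_MAXDELAY - 2;

        /* "delay" buckets ahead of the bucket being processed right now */
        slot = (syncer_delayno + delay) % syncer_last;
        LIST_INSERT_HEAD(&syncer_workitem_pending[slot], vp, v_synclist);
}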
3.2.3 Initializing the Name Cache Buffer
------------------------------------------------------ kern/vfs_cache.c
401 void
402 nchinit(void)
403 {
404
405         TAILQ_INIT(&nclruhead);
406         nchashtbl =
407             hashinit(desiredvnodes, HASH_LIST, M_CACHE, M_WAITOK, &nchash);
408         ncvhashtbl =
409 #ifdef NAMECACHE_ENTER_REVERSE
410             hashinit(desiredvnodes, HASH_LIST, M_CACHE, M_WAITOK, &ncvhash);
411 #else
412             hashinit(desiredvnodes/8, HASH_LIST, M_CACHE, M_WAITOK, &ncvhash);
413 #endif
414         pool_init(&namecache_pool, sizeof(struct namecache), 0, 0, 0,
415             "ncachepl", &pool_allocator_nointr);
416 }
------------------------------------------------------ kern/vfs_cache.c
3.2.4 Initialize the Special Vnode Operations
This initialization process is started by line 356 of kern/vfs_init.c, which we have
already listed. We list it here again for easy reference.
------------------------------------------------------ kern/vfs_init.c
356         vfs_opv_init(vfs_special_vnodeopv_descs);
------------------------------------------------------ kern/vfs_init.c
The vfs_special_vnodeopv_descs argument used in line 356 of kern/vfs_init.c
is defined in kern/vfs_init.c as
------------------------------------------------------ kern/vfs_init.c
119 const struct vnodeopv_desc * const vfs_special_vnodeopv_descs[] = {
120         &dead_vnodeop_opv_desc,
121         &fifo_vnodeop_opv_desc,
122         &spec_vnodeop_opv_desc,
123         &sync_vnodeop_opv_desc,
124         NULL,
125 };
------------------------------------------------------ kern/vfs_init.c
where struct vnodeopv_desc is defined in sys/vnode.h as
------------------------------------------------------ sys/vnode.h
449 struct vnodeopv_desc {
450         /* ptr to the ptr to the vector where op should go */
451         int (***opv_desc_vector_p)(void *);
452         const struct vnodeopv_entry_desc *opv_desc_ops; /* null terminated list */
453 };
------------------------------------------------------ sys/vnode.h
For example, one of the four members of the vfs_special_vnodeopv_descs array,
sync_vnodeop_opv_desc, is defined in miscfs/syncfs/sync_vnops.c as
------------------------------------------ miscfs/syncfs/sync_vnops.c
 47 int (**sync_vnodeop_p) __P((void *));
 48 const struct vnodeopv_entry_desc sync_vnodeop_entries[] = {
 49         { &vop_default_desc, vn_default_error },
 50         { &vop_close_desc, sync_close },                /* close */
 51         { &vop_fsync_desc, sync_fsync },                /* fsync */
 52         { &vop_inactive_desc, sync_inactive },          /* inactive */
 53         { &vop_reclaim_desc, sync_reclaim },            /* reclaim */
 54         { &vop_lock_desc, sync_lock },                  /* lock */
 55         { &vop_unlock_desc, sync_unlock },              /* unlock */
 56         { &vop_print_desc, sync_print },                /* print */
 57         { &vop_islocked_desc, sync_islocked },          /* islocked */
 58         { &vop_putpages_desc, sync_putpages },          /* islocked */
 59         { NULL, NULL }
 60 };
 61
 62 const struct vnodeopv_desc sync_vnodeop_opv_desc =
 63         { &sync_vnodeop_p, sync_vnodeop_entries };
------------------------------------------ miscfs/syncfs/sync_vnops.c
where struct vnodeopv_entry_desc is defined in sys/vnode.h as,
-------------------------------------------------- sys/vnode.h
445 struct vnodeopv_entry_desc {
446         const struct vnodeop_desc *opve_op;     /* which operation this is */
447         int (*opve_impl)(void *);       /* code implementing this operation */
448 };
-------------------------------------------------- sys/vnode.h
where struct vnodeop_desc is defined in sys/vnode.h; we showed this definition earlier.
To prevent confusion, we summarize below the relation between the vnodeopv_desc,
vnodeopv_entry_desc, and vnodeop_desc structures.
//////////////////////////////////////////////////////////////////////
// Vnode Operation Vector Description Table (vfs_init.c)
//
const struct vnodeopv_desc * const vfs_special_vnodeopv_descs[] = {
...
&sync_vnodeop_opv_desc,
...

};
//////////////////////////////////////////////////////////////////////
// Vnode Operation Vector Description (miscfs/syncfs/sync_vnops.c)
//
int (**sync_vnodeop_p) __P((void *));
const struct vnodeopv_desc sync_vnodeop_opv_desc = {
&sync_vnodeop_p, sync_vnodeop_entries
};
//////////////////////////////////////////////////////////////////////
// Vnode Operation Vector Entry Description Table (miscfs/syncfs/sync_vnops.c)
//
const struct vnodeopv_entry_desc sync_vnodeop_entries[] = {
...
{ &vop_fsync_desc, sync_fsync },                /* fsync */
...
{ NULL, NULL }
};
//////////////////////////////////////////////////////////////////////
// Vnode Operation Vector Entry Description (kern/vnode_if.c)
//
const int vop_fsync_vp_offsets[] = {
VOPARG_OFFSETOF(struct vop_fsync_args,a_vp),
VDESC_NO_OFFSET
};
const struct vnodeop_desc vop_fsync_desc = {
19,
"vop_fsync",
0,
vop_fsync_vp_offsets,
VDESC_NO_OFFSET,
VOPARG_OFFSETOF(struct vop_fsync_args, a_cred),
VOPARG_OFFSETOF(struct vop_fsync_args, a_p),
VDESC_NO_OFFSET,
NULL,
};
//////////////////////////////////////////////////////////////////////
// Vnode Operation (miscfs/syncfs/sync_vnops.c)
//
int
sync_fsync(v)
void *v;
{
...
}
Before going on with this book, you should clearly understand, from the above summary,
the relation between the vnode operation vector description table, the vnode operation
vector description, the vnode operation vector entry description table, the vnode
operation vector entry description, and the vnode operation. Only when you know this
relation without confusion can you clearly understand how the vfs_opv_init function
works.
The vfs_opv_init function is defined in kern/vfs_init.c as,
------------------------------------------------------ kern/vfs_init.c
240 void
241 vfs_opv_init(vopvdpp)
242         const struct vnodeopv_desc * const *vopvdpp;
243 {
244         int (**opv_desc_vector) __P((void *));
245         int i;
246
247         /*
248          * Allocate the vectors.
249          */
250         for (i = 0; vopvdpp[i] != NULL; i++) {
251                 /* XXX - shouldn't be M_VNODE */
252                 opv_desc_vector =
253                     malloc(VNODE_OPS_COUNT * sizeof(PFI), M_VNODE, M_WAITOK);
254                 memset(opv_desc_vector, 0, VNODE_OPS_COUNT * sizeof(PFI));
255                 *(vopvdpp[i]->opv_desc_vector_p) = opv_desc_vector;
256                 DODEBUG(printf("vector at %p allocated\n",
257                     opv_desc_vector_p));
258         }
259
260         /*
261          * ...and fill them in.
262          */
263         for (i = 0; vopvdpp[i] != NULL; i++)
264                 vfs_opv_init_explicit(vopvdpp[i]);
265
266         /*
267          * Finally, go back and replace unfilled routines
268          * with their default.
269          */
270         for (i = 0; vopvdpp[i] != NULL; i++)
271                 vfs_opv_init_default(vopvdpp[i]);
272 }
------------------------------------------------------ kern/vfs_init.c
The for loop in lines 250-258 executes four times, with vopvdpp[i] set in turn to
dead_vnodeop_opv_desc, fifo_vnodeop_opv_desc, spec_vnodeop_opv_desc, and
sync_vnodeop_opv_desc. Lines 252-255 make room for an array of function pointers,
one for each available vnode operation.
VNODE_OPS_COUNT and PFI are defined as,
------------------------------------------------------ sys/vnode_if.h
1640 #define VNODE_OPS_COUNT 50
------------------------------------------------------ sys/vnode_if.h
and

------------------------------------------------------ kern/vfs_init.c
127 /*
128  * This code doesn't work if the defn is **vnodop_defns with cc.
129  * The problem is because of the compiler sometimes putting in an
130  * extra level of indirection for arrays.  It's an interesting
131  * "feature" of C.
132  */
133 typedef int (*PFI) __P((void *));
------------------------------------------------------ kern/vfs_init.c
After completion of this loop, for example, the value of sync_vnodeop_p used in line 63
of miscfs/syncfs/sync_vnops.c changes from NULL to memory allocated by lines 252-253
of kern/vfs_init.c.
Now we will analyze the vfs_opv_init_explicit and vfs_opv_init_default functions,
which fill the allocated array with pointers to the vnode operation functions. The
code for these functions is,
------------------------------------------------------ kern/vfs_init.c
176 vfs_opv_init_explicit(vfs_opv_desc)
177         const struct vnodeopv_desc *vfs_opv_desc;
178 {
179         int (**opv_desc_vector) __P((void *));
180         const struct vnodeopv_entry_desc *opve_descp;
181
182         opv_desc_vector = *(vfs_opv_desc->opv_desc_vector_p);
183
184         for (opve_descp = vfs_opv_desc->opv_desc_ops;
185             opve_descp->opve_op;
186             opve_descp++) {
187                 /*
188                  * Sanity check:  is this operation listed
189                  * in the list of operations?  We check this
190                  * by seeing if its offest is zero.  Since
191                  * the default routine should always be listed
192                  * first, it should be the only one with a zero
193                  * offset.  Any other operation with a zero
194                  * offset is probably not listed in
195                  * vfs_op_descs, and so is probably an error.
196                  *
197                  * A panic here means the layer programmer
198                  * has committed the all-too common bug
199                  * of adding a new operation to the layer's
200                  * list of vnode operations but
201                  * not adding the operation to the system-wide
202                  * list of supported operations.
203                  */
204                 if (opve_descp->opve_op->vdesc_offset == 0 &&
205                     opve_descp->opve_op->vdesc_offset != VOFFSET(vop_default)) {
206                         printf("operation %s not listed in %s.\n",
207                             opve_descp->opve_op->vdesc_name, "vfs_op_descs");
208                         panic ("vfs_opv_init: bad operation");
209                 }
210
211                 /*
212                  * Fill in this entry.
213                  */
214                 opv_desc_vector[opve_descp->opve_op->vdesc_offset] =
215                     opve_descp->opve_impl;
216         }
217 }
218
219 static void
220 vfs_opv_init_default(vfs_opv_desc)
221         const struct vnodeopv_desc *vfs_opv_desc;
222 {
223         int j;
224         int (**opv_desc_vector) __P((void *));
225
226         opv_desc_vector = *(vfs_opv_desc->opv_desc_vector_p);
227
228         /*
229          * Force every operations vector to have a default routine.
230          */
231         if (opv_desc_vector[VOFFSET(vop_default)] == NULL)
232                 panic("vfs_opv_init: operation vector without default routine.");
233
234         for (j = 0; j < VNODE_OPS_COUNT; j++)
235                 if (opv_desc_vector[j] == NULL)
236                         opv_desc_vector[j] =
237                             opv_desc_vector[VOFFSET(vop_default)];
238 }
------------------------------------------------------ kern/vfs_init.c
If you keep in mind the summary of the structures related to vnode operations, reading
the source code should be sufficient to understand how the vfs_opv_init_explicit and
vfs_opv_init_default functions fill in the operations vector pointed to by the
opv_desc_vector_p member of the vnode operation vector description structure.
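As a concrete picture of the result, the fragment below states, in illustrative form,
what holds for the syncer operations vector once vfs_opv_init has processed
sync_vnodeop_opv_desc. These KASSERT lines are written for this book and are not
part of the kernel source; they only restate the effect of the two functions above.

/* After vfs_opv_init(vfs_special_vnodeopv_descs):
 *  - sync_vnodeop_p points at a malloc'ed array of VNODE_OPS_COUNT entries;
 *  - every operation listed in sync_vnodeop_entries[] sits at the offset
 *    recorded in its vnodeop_desc;
 *  - every slot that was not listed falls back to the default entry. */
KASSERT(sync_vnodeop_p[VOFFSET(vop_fsync)]   == sync_fsync);
KASSERT(sync_vnodeop_p[VOFFSET(vop_default)] == vn_default_error);
/* vop_read is not listed for syncfs, so it gets the default routine: */
KASSERT(sync_vnodeop_p[VOFFSET(vop_read)]    == vn_default_error);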
3.3 Attaching Available Static File System
Lines 362-369 of kern/vfs_init.c attach the available statically configured filesystems.
3.3.1 Set vnode attribute to empty
Line 362 of kern/vfs_init.c initializes the va_null global variable, defined in
kern/vfs_init.c, as a null vnode attribute structure.
------------------------------------------------------ kern/vfs init.c
318 struct vattr va_null;
------------------------------------------------------ kern/vfs init.c
This variable is not used directly, but through the VATTR_NULL macro, which clears a
vnode attribute structure. This macro is defined in sys/vnode.h as,
------------------------------------------------------ sys/vnode.h
281 #define VATTR_NULL(vap) (*(vap) = va_null)      /* initialize a vattr */
------------------------------------------------------ sys/vnode.h
Now we list the source code of the vattr_null function, which clears a vnode attribute
structure. The reason the kernel source uses the VATTR_NULL macro instead of calling
the vattr_null function directly is simple: the macro, being a single structure
assignment, is faster.
------------------------------------------------------ kern/vfs_subr.c
372 void
373 vattr_null(vap)
374         struct vattr *vap;
375 {
376
377         vap->va_type = VNON;
378
379         /*
380          * Assign individually so that it is safe even if size and
381          * sign of each member are varied.
382          */
383         vap->va_mode = VNOVAL;
384         vap->va_nlink = VNOVAL;
385         vap->va_uid = VNOVAL;
386         vap->va_gid = VNOVAL;
387         vap->va_fsid = VNOVAL;
388         vap->va_fileid = VNOVAL;
389         vap->va_size = VNOVAL;
390         vap->va_blocksize = VNOVAL;
391         vap->va_atime.tv_sec =
392             vap->va_mtime.tv_sec =
393             vap->va_ctime.tv_sec = VNOVAL;
394         vap->va_atime.tv_nsec =
395             vap->va_mtime.tv_nsec =
396             vap->va_ctime.tv_nsec = VNOVAL;
397         vap->va_gen = VNOVAL;
398         vap->va_flags = VNOVAL;
399         vap->va_rdev = VNOVAL;
400         vap->va_bytes = VNOVAL;
401         vap->va_vaflags = 0;
402 }
------------------------------------------------------ kern/vfs_subr.c
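A typical use of VATTR_NULL, sketched below, is to prepare a vattr structure before a
vnode operation such as VOP_CREATE or VOP_SETATTR: every field is first marked "no
value" and only the fields the caller cares about are then filled in. This fragment is
an illustration written for this book, not a quote from the kernel.

/* Prepare attributes for creating a regular file with mode 0644.
 * Fields left at VNOVAL are ignored by the filesystem code. */
static void
example_prepare_vattr(struct vattr *vap)
{
        VATTR_NULL(vap);        /* *vap = va_null: everything becomes VNOVAL */
        vap->va_type = VREG;    /* regular file */
        vap->va_mode = 0644;    /* only these two fields carry real values */
}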
3.3.2 How is vfs_list_initial initialized ?
vfs_list_initial is an array of pointers to virtual filesystem operations structures.
The source code initializing this variable is not included in the kernel sources: the
needed code is generated automatically when we compile the kernel.
Before compiling the kernel, we run the config program. For example, we generate a
new kernel as
# cd /usr/src/syssrc/sys/arch/sparc64/conf
# config MY_KERNEL
# cd ../compile/MY_KERNEL
# make depend; make
In the above sample session, the config program generates a Makefile and many header
files under the ../compile/MY_KERNEL directory. However, only five C source files are
generated: devsw.c, ioconf.c, param.c, swapnetbsd.c, and vers.c.
Of these automatically generated C source files, ioconf.c contains the definition of
the vfs_list_initial variable.
For instance, if the kernel configuration file contains,
----------------------------------------- arch/sparc64/conf/GENERIC32
152 ## File systems.  You probably need at least one of FFS or NFS.
153 file-system     FFS             # Berkeley Fast Filesystem
154 file-system     NFS             # Sun NFS-compatible filesystem client
155 file-system     KERNFS          # kernel data-structure filesystem
156 file-system     NULLFS          # NULL layered filesystem
157 file-system     OVERLAY         # overlay file system
158 file-system     MFS             # memory-based filesystem
159 file-system     FDESC           # user file descriptor filesystem
160 file-system     UMAPFS          # uid/gid remapping filesystem
161 file-system     LFS             # Log-based filesystem (still experimental)
162 file-system     PORTAL          # portal filesystem (still experimental)
163 file-system     PROCFS          # /proc
164 file-system     CD9660          # ISO 9660 + Rock Ridge file system
165 file-system     UNION           # union file system
166 file-system     MSDOSFS         # MS-DOS FAT filesystem(s).
----------------------------------------- arch/sparc64/conf/GENERIC32
then, ioconf.c would contain
------------------------------- arch/sparc64/compile/MY_KERNEL/ioconf.c
1643 struct vfsops * const vfs_list_initial[] = {
1644         &ffs_vfsops,
1645         &nfs_vfsops,
1646         &kernfs_vfsops,
1647         &nullfs_vfsops,
1648         &overlay_vfsops,
1649         &mfs_vfsops,
1650         &fdesc_vfsops,
1651         &umapfs_vfsops,
1652         &lfs_vfsops,
1653         &portal_vfsops,
1654         &procfs_vfsops,
1655         &cd9660_vfsops,
1656         &union_vfsops,
1657         &msdosfs_vfsops,
1658         NULL,
1659 };
------------------------------- arch/sparc64/compile/MY_KERNEL/ioconf.c
where struct vfsops is defined in sys/mount.h as,
-------------------------------------------------- sys/mount.h
344 struct vfsops {
345         const char *vfs_name;
346         int     (*vfs_mount)    __P((struct mount *mp, const char *path,
347                                     void *data, struct nameidata *ndp,
348                                     struct proc *p));
349         int     (*vfs_start)    __P((struct mount *mp, int flags,
350                                     struct proc *p));
351         int     (*vfs_unmount)  __P((struct mount *mp, int mntflags,
352                                     struct proc *p));
353         int     (*vfs_root)     __P((struct mount *mp, struct vnode **vpp));
354         int     (*vfs_quotactl) __P((struct mount *mp, int cmds, uid_t uid,
355                                     caddr_t arg, struct proc *p));
356         int     (*vfs_statfs)   __P((struct mount *mp, struct statfs *sbp,
357                                     struct proc *p));
358         int     (*vfs_sync)     __P((struct mount *mp, int waitfor,
359                                     struct ucred *cred, struct proc *p));
360         int     (*vfs_vget)     __P((struct mount *mp, ino_t ino,
361                                     struct vnode **vpp));
362         int     (*vfs_fhtovp)   __P((struct mount *mp, struct fid *fhp,
363                                     struct vnode **vpp));
364         int     (*vfs_vptofh)   __P((struct vnode *vp, struct fid *fhp));
365         void    (*vfs_init)     __P((void));
366         void    (*vfs_reinit)   __P((void));
367         void    (*vfs_done)     __P((void));
368         int     (*vfs_sysctl)   __P((int *, u_int, void *, size_t *, void *,
369                                     size_t, struct proc *));
370         int     (*vfs_mountroot) __P((void));
371         int     (*vfs_checkexp) __P((struct mount *mp, struct mbuf *nam,
372                                     int *extflagsp, struct ucred **credanonp));
373         const struct vnodeopv_desc * const *vfs_opv_descs;
374         int     vfs_refcount;
375         LIST_ENTRY(vfsops) vfs_list;
376 };
-------------------------------------------------- sys/mount.h
For example, the ffs_vfsops variable appearing in line 1644 of
arch/sparc64/compile/MY_KERNEL/ioconf.c is initialized in ufs/ffs/ffs_vfsops.c as
---------------------------------------------- ufs/ffs/ffs_vfsops.c
 97 struct vfsops ffs_vfsops = {
 98         MOUNT_FFS,
 99         ffs_mount,
100         ufs_start,
101         ffs_unmount,
102         ufs_root,
103         ufs_quotactl,
104         ffs_statfs,
105         ffs_sync,
106         ffs_vget,
107         ffs_fhtovp,
108         ffs_vptofh,
109         ffs_init,
110         ffs_reinit,
111         ffs_done,
112         ffs_sysctl,
113         ffs_mountroot,
114         ufs_check_export,
115         ffs_vnodeopv_descs,
116 };
---------------------------------------------- ufs/ffs/ffs_vfsops.c
You may wonder how the Makefile for the kernel build knows where the filesystem
related files are. The Makefile in the ufs directory specifies the location of the
filesystem related kernel sources recursively, with the help of the system-wide
makefile script /usr/share/mk/bsd.kinc.mk.
With this information, we can plant our own filesystem, with a different name, onto
the NetBSD kernel !
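For instance, such a filesystem would have to provide its own vfsops instance,
mirroring the ffs_vfsops definition above, so that config and vfs_attach can find it.
The sketch below is purely hypothetical: every myfs_* identifier is invented for
illustration, and only the shape of the structure comes from sys/mount.h.

/* Hypothetical vfsops table for a new filesystem called "myfs". */
struct vfsops myfs_vfsops = {
        "myfs",                 /* vfs_name, used by mount(2) lookups */
        myfs_mount,
        myfs_start,
        myfs_unmount,
        myfs_root,
        myfs_quotactl,
        myfs_statfs,
        myfs_sync,
        myfs_vget,
        myfs_fhtovp,
        myfs_vptofh,
        myfs_init,              /* called once by vfs_attach() */
        myfs_reinit,
        myfs_done,
        myfs_sysctl,
        myfs_mountroot,
        myfs_checkexp,
        myfs_vnodeopv_descs,    /* table of vnodeopv_desc pointers */
};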
3.3.3 Establish a filesystem and initialize it
Lines 363-369 of kern/vfs_init.c attach and initialize all available virtual
filesystem layers, such as FFS, NFS, and LFS, by calling the vfs_attach function.
------------------------------------------------------ kern/vfs_subr.c
2634 int
2635 vfs_attach(vfs)
2636         struct vfsops *vfs;
2637 {
2638         struct vfsops *v;
2639         int error = 0;
2640
2641
2642         /*
2643          * Make sure this file system doesn't already exist.
2644          */
2645         LIST_FOREACH(v, &vfs_list, vfs_list) {
2646                 if (strcmp(vfs->vfs_name, v->vfs_name) == 0) {
2647                         error = EEXIST;
2648                         goto out;
2649                 }
2650         }
2651
2652         /*
2653          * Initialize the vnode operations for this file system.
2654          */
2655         vfs_opv_init(vfs->vfs_opv_descs);
2656
2657         /*
2658          * Now initialize the file system itself.
2659          */
2660         (*vfs->vfs_init)();
2661
2662         /*
2663          * ...and link it into the kernel's list.
2664          */
2665         LIST_INSERT_HEAD(&vfs_list, vfs, vfs_list);
2666
2667         /*
2668          * Sanity: make sure the reference count is 0.
2669          */
2670         vfs->vfs_refcount = 0;
2671
2672 out:
2673         return (error);
2674 }
------------------------------------------------------ kern/vfs_subr.c
In the case of FFS, line 2660 of kern/vfs_subr.c calls the ffs_init function, since
the ffs_vfsops variable used in line 1644 of arch/sparc64/compile/MY_KERNEL/ioconf.c
is initialized so that its vfs_init member is set to ffs_init by lines 97-116 of
ufs/ffs/ffs_vfsops.c.
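Once attached, a filesystem can be found again by name on vfs_list. The sketch below
shows the lookup loop in the style of vfs_getopsbyname in kern/vfs_subr.c, which the
mount(2) path uses to map a name such as "ffs" to its vfsops; treat it as an
illustration of how the list built by vfs_attach is used, not as a verbatim copy of
that function.

/* Look up an attached filesystem type by name (illustration only). */
struct vfsops *
example_getopsbyname(const char *name)
{
        struct vfsops *v;

        LIST_FOREACH(v, &vfs_list, vfs_list) {
                if (strcmp(v->vfs_name, name) == 0)
                        return (v);     /* e.g. &ffs_vfsops for "ffs" */
        }
        return (NULL);                  /* not configured into this kernel */
}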
3.3.4 Fast Filesystem Initialization
The Fast Filesystem, as a virtual filesystem layer, is initialized by the ffs_init
function. This ffs_init function, called by the vfs_attach function that we just
described, is shown below.
---------------------------------------------- ufs/ffs/ffs_vfsops.c
1368 void
1369 ffs_init()
1370 {
1371         if (ffs_initcount++ > 0)
1372                 return;
1373
1374         softdep_initialize();
1375         ufs_init();
1376
1377         pool_init(&ffs_inode_pool, sizeof(struct inode), 0, 0, 0, "ffsinopl",
1378             &pool_allocator_nointr);
1379 }
1380
1381 void
1382 ffs_reinit()
1383 {
1384         softdep_reinitialize();
1385         ufs_reinit();
1386 }
---------------------------------------------- ufs/ffs/ffs_vfsops.c
3.3.5 Soft Dependency Module Initialization
To use soft dependency support, 14 additional caches for metadata structures are
needed. Only 5 caches are needed, however, for the Fast File System (FFS) without soft
dependency support. For simplicity, we do not consider the caches for soft dependency.
To avoid using soft dependency support on NetBSD sparc64, it is sufficient to remove
the line saying "options SOFTDEP" from the arch/sparc64/conf/GENERIC32 kernel
configuration file.
Be aware that even if you turn off this switch, the 14 additional caches for soft
dependency support are still initialized; they are simply never used, since every call
to a soft dependency related function is avoided by checking mnt_flag in the mount
structure.
The soft dependency module initialization function, softdep_initialize, is shown below.
---------------------------------------------- ufs/ffs/ffs_softdep.c
1050 /*
1051  * Executed during filesystem system initialization before
1052  * mounting any file systems.
1053  */
1054 void
1055 softdep_initialize()
1056 {
1057         int i;
1058
1059         LIST_INIT(&mkdirlisthd);
1060         LIST_INIT(&softdep_workitem_pending);
1061         max_softdeps = desiredvnodes * 4;
1062         pagedep_hashtbl = hashinit(desiredvnodes / 5, HASH_LIST, M_PAGEDEP,
1063             M_WAITOK, &pagedep_hash);
1064         sema_init(&pagedep_in_progress, "pagedep", PRIBIO, 0);
1065         inodedep_hashtbl = hashinit(desiredvnodes, HASH_LIST, M_INODEDEP,
1066             M_WAITOK, &inodedep_hash);
1067         sema_init(&inodedep_in_progress, "inodedep", PRIBIO, 0);
1068         newblk_hashtbl = hashinit(64, HASH_LIST, M_NEWBLK, M_WAITOK,
1069             &newblk_hash);
1070         sema_init(&newblk_in_progress, "newblk", PRIBIO, 0);
1071         pool_init(&sdpcpool, sizeof(struct buf), 0, 0, 0, "sdpcpool",
1072             &pool_allocator_nointr);
1073         for (i = 0; i < PCBPHASHSIZE; i++) {
1074                 LIST_INIT(&pcbphashhead[i]);
1075         }
1076
1077         pool_init(&pagedep_pool, sizeof(struct pagedep), 0, 0, 0,
1078             "pagedeppl", &pool_allocator_nointr);
1079         pool_init(&inodedep_pool, sizeof(struct inodedep), 0, 0, 0,
1080             "inodedeppl", &pool_allocator_nointr);
1081         pool_init(&newblk_pool, sizeof(struct newblk), 0, 0, 0,
1082             "newblkpl", &pool_allocator_nointr);
1083         pool_init(&bmsafemap_pool, sizeof(struct bmsafemap), 0, 0, 0,
1084             "bmsafemappl", &pool_allocator_nointr);
1085         pool_init(&allocdirect_pool, sizeof(struct allocdirect), 0, 0, 0,
1086             "allocdirectpl", &pool_allocator_nointr);
1087         pool_init(&indirdep_pool, sizeof(struct indirdep), 0, 0, 0,
1088             "indirdeppl", &pool_allocator_nointr);
1089         pool_init(&allocindir_pool, sizeof(struct allocindir), 0, 0, 0,
1090             "allocindirpl", &pool_allocator_nointr);
1091         pool_init(&freefrag_pool, sizeof(struct freefrag), 0, 0, 0,
1092             "freefragpl", &pool_allocator_nointr);
1093         pool_init(&freeblks_pool, sizeof(struct freeblks), 0, 0, 0,
1094             "freeblkspl", &pool_allocator_nointr);
1095         pool_init(&freefile_pool, sizeof(struct freefile), 0, 0, 0,
1096             "freefilepl", &pool_allocator_nointr);
1097         pool_init(&diradd_pool, sizeof(struct diradd), 0, 0, 0,
1098             "diraddpl", &pool_allocator_nointr);
1099         pool_init(&mkdir_pool, sizeof(struct mkdir), 0, 0, 0,
1100             "mkdirpl", &pool_allocator_nointr);
1101         pool_init(&dirrem_pool, sizeof(struct dirrem), 0, 0, 0,
1102             "dirrempl", &pool_allocator_nointr);
1103         pool_init(&newdirblk_pool, sizeof (struct newdirblk), 0, 0, 0,
1104             "newdirblkpl", &pool_allocator_nointr);
1105 }
---------------------------------------------- ufs/ffs/ffs_softdep.c
All jumps into the soft dependency code live in the ffs_mount, ffs_reload,
ffs_mountfs, ffs_unmount, and ffs_vget functions of ufs/ffs/ffs_vfsops.c.
The ffs_mount and ffs_unmount functions are called when the mount and umount system
calls, respectively, are executed. ffs_mountfs is a subroutine of the ffs_mount
function and is also used by the ffs_mountroot function. The ffs_reload function
reloads all incore data for a filesystem after running fsck on the root filesystem and
finding things to fix. The ffs_vget function is called to look up an FFS dinode number
and find its incore vnode.
The relevant code is shown in the following list. You do not need to understand it
all; the point is that the soft dependency functions are not called when the kernel
configuration is set up that way.
---------------------------------------------- ufs/ffs/ffs_vfsops.c
 177 int
 178 ffs_mount(mp, path, data, ndp, p)
 179         struct mount *mp;
 180         const char *path;
 181         void *data;
 182         struct nameidata *ndp;
 183         struct proc *p;
 184 {
...
 307         if (mp->mnt_flag & MNT_SOFTDEP)
 308                 error = softdep_flushfiles(mp, flags, p);
 309         else
 310                 error = ffs_flushfiles(mp, flags, p);
...
 338         if ((fs->fs_flags & FS_DOSOFTDEP) &&
 339             !(mp->mnt_flag & MNT_SOFTDEP) && fs->fs_ronly == 0) {
 340 #ifdef notyet
 341                 flags = WRITECLOSE;
 342                 if (mp->mnt_flag & MNT_FORCE)
 343                         flags |= FORCECLOSE;
 344                 error = softdep_flushfiles(mp, flags, p);
 345                 if (error == 0 && ffs_cgupdate(ump, MNT_WAIT) == 0)
 346                         fs->fs_flags &= ~FS_DOSOFTDEP;
 347                 (void) ffs_sbupdate(ump, MNT_WAIT);
 348 #elif defined(SOFTDEP)
 349                 mp->mnt_flag |= MNT_SOFTDEP;
 350 #endif
 351         }
...
 382         if ((fs->fs_flags & FS_DOSOFTDEP)) {
 383                 error = softdep_mount(devvp, mp, fs,
 384                     p->p_ucred);
 385                 if (error)
 386                         return (error);
 387         }
...
 427 }
...
...
...
 442 int
 443 ffs_reload(mountp, cred, p)
 444         struct mount *mountp;
 445         struct ucred *cred;
 446         struct proc *p;
 447 {
...
 586         if ((fs->fs_flags & FS_DOSOFTDEP))
 587                 softdep_mount(devvp, mountp, fs, cred);
...
 646 }
...
...
...
 651 int
 652 ffs_mountfs(devvp, mp, p)
 653         struct vnode *devvp;
 654         struct mount *mp;
 655         struct proc *p;
 656 {
...
 894         if (ronly == 0 && (fs->fs_flags & FS_DOSOFTDEP)) {
 895                 error = softdep_mount(devvp, mp, fs, cred);
 896                 if (error) {
 897                         free(fs->fs_csp, M_UFSMNT);
 898                         goto out;
 899                 }
 900         }
...
 916 }
...
...
...
 950 int
 951 ffs_unmount(mp, mntflags, p)
 952         struct mount *mp;
 953         int mntflags;
 954         struct proc *p;
 955 {
...
 964         if (mp->mnt_flag & MNT_SOFTDEP) {
 965                 if ((error = softdep_flushfiles(mp, flags, p)) != 0)
 966                         return (error);
 967         } else {
 968                 if ((error = ffs_flushfiles(mp, flags, p)) != 0)
 969                         return (error);
 970         }
...
1006 }
...
...
...
1191 int
1192 ffs_vget(mp, ino, vpp)
1193         struct mount *mp;
1194         ino_t ino;
1195         struct vnode **vpp;
1196 {
...
1286         if (DOINGSOFTDEP(vp))
1287                 softdep_load_inodeblock(ip);
1288         else
1289                 ip->i_ffs_effnlink = ip->i_ffs_nlink;
1290         brelse(bp);
...
1319 }
---------------------------------------------- ufs/ffs/ffs_vfsops.c
The DOINGSOFTDEP() macro used in the above list is defined in ufs/ufs/inode.h as

    #define DOINGSOFTDEP(vp)    ((vp)->v_mount->mnt_flag & MNT_SOFTDEP)
As you have seen from the code, there is no need to worry about the operation of the
soft dependency facility if you remove the kernel option from the kernel configuration
file, even though the 14 caches for soft dependency are still initialized.
3.3.6 UFS Initialization
At line 1375 of ufs/ffs/ffs_vfsops.c, the ffs_init function calls ufs_init, the UFS
initialization function, which is defined as,
---------------------------------------------- ufs/ufs/ufs_vfsops.c
225 /*
226  * Initialize UFS filesystems, done only once.
227  */
228 void
229 ufs_init()
230 {
231         if (ufs_initcount++ > 0)
232                 return;
233
234         ufs_ihashinit();
235 #ifdef QUOTA
236         dqinit();
237 #endif
238 }
---------------------------------------------- ufs/ufs/ufs_vfsops.c
where the ufs_ihashinit function, which initializes the inode hash table, is shown below.
---------------------------------------------- ufs/ufs/ufs_ihash.c
61 /*
62  * Initialize inode hash table.
63  */
64 void
65 ufs_ihashinit()
66 {
67         lockinit(&ufs_hashlock, PINOD, "ufs_hashlock", 0, 0);
68         ihashtbl =
69             hashinit(desiredvnodes, HASH_LIST, M_UFSMNT, M_WAITOK, &ihash);
70         simple_lock_init(&ufs_ihash_slock);
71 }
---------------------------------------------- ufs/ufs/ufs_ihash.c
Note line 69, which creates a hash table that can store `desiredvnodes' elements. The
desiredvnodes global variable is also defined in param.c, source code autogenerated by
the config program.
------------------------------- arch/sparc64/compile/MY_KERNEL/param.c
107 int     hz = HZ;
108 int     tick = 1000000 / HZ;
109 int     tickadj = 240000 / (60 * HZ);   /* can adjust 240ms in 60s */
110 int     rtc_offset = RTC_OFFSET;
111 int     maxproc = NPROC;
112 int     desiredvnodes = NVNODE;
113 int     maxfiles = MAXFILES;
114 int     ncallout = 16 + NPROC;          /* size of callwheel (rounded to ^2) */
115 u_long  sb_max = SB_MAX;                /* maximum socket buffer size */
116 int     fscale = FSCALE;        /* kernel uses `FSCALE', user uses `fscale' */
------------------------------- arch/sparc64/compile/MY_KERNEL/param.c
where the NVNODE macro is defined in sys/param.h as,
----------------------------------------------- sys/param.h
128 #ifndef NPROC
129 #define NPROC   (20 + 16 * MAXUSERS)
130 #endif
131 #ifndef NTEXT
132 #define NTEXT   (80 + NPROC / 8)        /* actually the object cache */
133 #endif
134 #ifndef NVNODE
135 #define NVNODE  (NPROC + NTEXT + 100)
136 #define NVNODE_IMPLICIT
137 #endif
----------------------------------------------- sys/param.h
where line 134 means that the NVNODE parameter can be tuned in the kernel
configuration file using the options command. Since the default MAXUSERS is 64 unless
tuned by the kernel configuration file,
NPROC = 20 + 16 * MAXUSERS
= 20 + 16 * 64
= 1044
NTEXT = 80 + NPROC / 8
= 80 + 1044 / 8
= 210
NVNODE = NPROC + NTEXT + 100
= 1044 + 210 + 100
= 1354
So, if you want to change the default value of desiredvnodes from 1354 to something
else, you can do so by tuning the MAXUSERS parameter in the kernel configuration file.
Up to now, we have shown how the virtual file system layer is initialized. In a later
chapter, we will describe how a file system is mounted, exploring the mount process of
the root file system !
3.4 Virtual Filesystem Operations
In a similar fashion to the vnode interface, all operations that are done on a file
system are conducted through a single interface that allows the system to carry out
operations on a file system without knowing its construction or type.
As we described earlier, all supported file systems in the kernel have an entry in the
vfs_list_initial table. This table is generated by the config program and is a
NULL-terminated list of vfsops structures. The vfsops structure describes the
operations that can be done to a specific file system type. The vfsops structure is
shown below.
-------------------------------------------------- sys/mount.h
344 struct vfsops {
345         const char *vfs_name;
346         int     (*vfs_mount)    __P((struct mount *mp, const char *path,
347                                     void *data, struct nameidata *ndp,
348                                     struct proc *p));
349         int     (*vfs_start)    __P((struct mount *mp, int flags,
350                                     struct proc *p));
351         int     (*vfs_unmount)  __P((struct mount *mp, int mntflags,
352                                     struct proc *p));
353         int     (*vfs_root)     __P((struct mount *mp, struct vnode **vpp));
354         int     (*vfs_quotactl) __P((struct mount *mp, int cmds, uid_t uid,
355                                     caddr_t arg, struct proc *p));
356         int     (*vfs_statfs)   __P((struct mount *mp, struct statfs *sbp,
357                                     struct proc *p));
358         int     (*vfs_sync)     __P((struct mount *mp, int waitfor,
359                                     struct ucred *cred, struct proc *p));
360         int     (*vfs_vget)     __P((struct mount *mp, ino_t ino,
361                                     struct vnode **vpp));
362         int     (*vfs_fhtovp)   __P((struct mount *mp, struct fid *fhp,
363                                     struct vnode **vpp));
364         int     (*vfs_vptofh)   __P((struct vnode *vp, struct fid *fhp));
365         void    (*vfs_init)     __P((void));
366         void    (*vfs_reinit)   __P((void));
367         void    (*vfs_done)     __P((void));
368         int     (*vfs_sysctl)   __P((int *, u_int, void *, size_t *, void *,
369                                     size_t, struct proc *));
370         int     (*vfs_mountroot) __P((void));
371         int     (*vfs_checkexp) __P((struct mount *mp, struct mbuf *nam,
372                                     int *extflagsp, struct ucred **credanonp));
373         const struct vnodeopv_desc * const *vfs_opv_descs;
374         int     vfs_refcount;
375         LIST_ENTRY(vfsops) vfs_list;
376 };
-------------------------------------------------- sys/mount.h
The following table lists the elements of the vfsops vector, the corresponding
invocation macro, and a description of each element.

    Vector element            Macro           Description
    int  (*vfs_mount)()       VFS_MOUNT       Mount a file system
    int  (*vfs_start)()       VFS_START       Make operational
    int  (*vfs_unmount)()     VFS_UMOUNT      Unmount a file system
    int  (*vfs_root)()        VFS_ROOT        Get the file system root vnode
    int  (*vfs_quotactl)()    VFS_QUOTACTL    Query/modify space quotas
    int  (*vfs_statfs)()      VFS_STATFS      Get file system statistics
    int  (*vfs_sync)()        VFS_SYNC        Flush file system buffers
    int  (*vfs_vget)()        VFS_VGET        Get vnode from file ID
    int  (*vfs_fhtovp)()      VFS_FHTOVP      NFS file handle to vnode lookup
    int  (*vfs_vptofh)()      VFS_VPTOFH      Vnode to NFS file handle lookup
    void (*vfs_init)()        -               Initialise file system
    void (*vfs_reinit)()      -               Reinitialise file system
    void (*vfs_done)()        -               Cleanup unmounted file system
    int  (*vfs_sysctl)()      -               Query/modify kernel state
    int  (*vfs_mountroot)()   -               Mount the root file system
    int  (*vfs_checkexp)()    VFS_CHECKEXP    Check if fs is exported
Some additional non-function members of the vfsops structure are the file system name
vfs_name and a reference count vfs_refcount. It is not mandatory for a filesystem type
to support a particular operation, but it must assign each member function pointer to
a suitable function that does the minimum required of it. In most cases, such
functions either do nothing or return an error value to the effect that the operation
is not supported.
At system boot, each filesystem with an entry in vfs_list_initial is established and
initialised. Each initialised file system is recorded by the kernel in the list
vfs_list, and the file system specific initialisation function vfs_init in its vfsops
vector is invoked. When the filesystem is no longer needed, vfs_done is invoked to run
file system specific cleanups, and the file system is removed from the kernel list.
At system boot, the root filesystem is mounted by invoking the file system type
specific vfs_mountroot function in the vfsops vector. All filesystems that can be
mounted as a root file system must define this function. It is responsible for
initialising the list of mount structures for all future mounted file systems.

Kernel state which affects a specific filesystem type can be queried and modified
using the sysctl interface. The vfs_sysctl member of the vfsops structure is invoked
by filesystem-independent code.
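The invocation macros in the table are thin wrappers that call through the vfsops
vector attached to a mount point. For example, VFS_MOUNT and VFS_STATFS in sys/mount.h
are essentially of the form sketched below; mnt_op is the vfsops pointer stored in
struct mount, and the exact text in the header may differ slightly from this sketch.

/* Filesystem-independent code never calls e.g. ffs_mount() directly; it
 * goes through the vfsops vector hanging off the mount structure: */
#define VFS_MOUNT(MP, PATH, DATA, NDP, P) \
        (*(MP)->mnt_op->vfs_mount)(MP, PATH, DATA, NDP, P)

#define VFS_STATFS(MP, SBP, P) \
        (*(MP)->mnt_op->vfs_statfs)(MP, SBP, P)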
3.5 References to Source Code
3.5.1 kern/vfs_init.c - 334 lines, 7 functions
Global Variables
const struct vnodeopv_desc * const vfs_special_vnodeopv_descs[] = {
&dead_vnodeop_opv_desc,
&fifo_vnodeop_opv_desc,
&spec_vnodeop_opv_desc,
&sync_vnodeop_opv_desc,
NULL,
};
struct vattr va_null;
Functions
vn_default_error()
vfs_opv_init_explicit()
vfs_opv_init_default()
vfs_opv_init()
vfs_opv_free()
vfs_op_check()
vfsinit()

Chapter 4
Buffer Cache
The buffer cache manages the memory that buffers data being transferred to and from
the network or disk, and acts as a cache of recently used blocks.
Since we are planning to replace the buffer cache, it is essential for us to know the
details of the buffer cache and the interaction between vnode operations and the
buffer cache.
The architecture of the buffer cache is best described by [1], but the details of how
the buffer cache is implemented are best described by [2].
The buffer cache is composed of two parts. The first part is the buffer header and the
second part is the actual buffer contents.
4.1 Buffer Cache Header
The buffer header of NetBSD release 1.6 is defined in sys/buf.h as,
------------------------------------------------------ sys/buf.h
151 /*
152  * The buffer header describes an I/O operation in the kernel.
153  */
154 struct buf {
155         LIST_ENTRY(buf) b_hash;         /* Hash chain. */
156         LIST_ENTRY(buf) b_vnbufs;       /* Buffer's associated vnode. */
157         TAILQ_ENTRY(buf) b_freelist;    /* Free list position if not active. */
158         TAILQ_ENTRY(buf) b_actq;        /* Device driver queue when active. */
159         struct  proc *b_proc;           /* Associated proc if B_PHYS set. */
160         volatile long   b_flags;        /* B_* flags. */
161         int     b_error;                /* Errno value. */
162         long    b_bufsize;              /* Allocated buffer size. */
163         long    b_bcount;               /* Valid bytes in buffer. */
164         long    b_resid;                /* Remaining I/O. */
165         dev_t   b_dev;                  /* Device associated with buffer. */
166         struct {
167                 caddr_t b_addr;         /* Memory, superblocks, indirect etc. */
168         } b_un;
169         void    *b_saveaddr;            /* Original b_addr for physio. */
170         daddr_t b_lblkno;               /* Logical block number. */
171         daddr_t b_blkno;                /* Underlying physical block number
172                                            (partition relative) */
173         daddr_t b_rawblkno;             /* Raw underlying physical block
174                                            number (not partition relative) */
175                                         /* Function to call upon completion. */
176         void    (*b_iodone) __P((struct buf *));
177         struct  vnode *b_vp;            /* File vnode. */
178         void    *b_private;             /* Private data for owner */
179         off_t   b_dcookie;              /* Offset cookie if dir block */
180         struct  workhead b_dep;         /* List of filesystem dependencies. */
181 };
------------------------------------------------------ sys/buf.h
where
    b_vnbufs is a pointer to the vnode whose data the buffer holds.
    b_flags tracks status information about the buffer, such as whether
    the buffer contains useful data, whether the buffer is in use, and
    whether the data must be written back to the file before the buffer
    can be reused.
    b_bufsize indicates the size of the allocated buffer contents, without
    regard to the validity of the data contained.
    b_bcount contains the number of valid bytes contained in the buffer.
The possible values of the b_flags member are,
------------------------------------------------------ sys/buf.h
192 /*
193  * These flags are kept in b_flags.
194  */
195 #define B_AGE           0x00000001      /* Move to age queue when I/O done. */
196 #define B_NEEDCOMMIT    0x00000002      /* Needs committing to stable storage */
197 #define B_ASYNC         0x00000004      /* Start I/O, do not wait. */
198 #define B_BAD           0x00000008      /* Bad block revectoring in progress. */
199 #define B_BUSY          0x00000010      /* I/O in progress. */
200 #define B_SCANNED       0x00000020      /* Block already pushed during sync */
201 #define B_CALL          0x00000040      /* Call b_iodone from biodone. */
202 #define B_DELWRI        0x00000080      /* Delay I/O until buffer reused. */
203 #define B_DIRTY         0x00000100      /* Dirty page to be pushed out async. */
204 #define B_DONE          0x00000200      /* I/O completed. */
205 #define B_EINTR         0x00000400      /* I/O was interrupted */
206 #define B_ERROR         0x00000800      /* I/O error occurred. */
207 #define B_GATHERED      0x00001000      /* LFS: already in a segment. */
208 #define B_INVAL         0x00002000      /* Does not contain valid info. */
209 #define B_LOCKED        0x00004000      /* Locked in core (not reusable). */
210 #define B_NOCACHE       0x00008000      /* Do not cache block after use. */
211 #define B_CACHE         0x00020000      /* Bread found us in the cache. */
212 #define B_PHYS          0x00040000      /* I/O to user memory. */
213 #define B_RAW           0x00080000      /* Set by physio for raw transfers. */
214 #define B_READ          0x00100000      /* Read buffer. */
215 #define B_TAPE          0x00200000      /* Magnetic tape I/O. */
216 #define B_WANTED        0x00800000      /* Process wants this buffer. */
217 #define B_WRITE         0x00000000      /* Write buffer (pseudo flag). */
218 #define B_XXX           0x02000000      /* Debugging flag. */
219 #define B_VFLUSH        0x04000000      /* Buffer is being synced. */
------------------------------------------------------ sys/buf.h
To set, clear, and test these flags, convenient macros are provided:
------------------------------------------------- kern/vfs_bio.c
70 /* Macros to clear/set/test flags. */
71 #define SET(t, f)       (t) |= (f)
72 #define CLR(t, f)       (t) &= ~(f)
73 #define ISSET(t, f)     ((t) & (f))
------------------------------------------------- kern/vfs_bio.c
As we analyze the buffer cache, we will gradually learn the meaning of each of these flags.
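The fragment below illustrates how these macros are used on b_flags throughout
kern/vfs_bio.c; it is a simplified illustration of the bookkeeping written for this
book, not code copied from the kernel.

/* Claim a buffer for a read, and on completion check for errors and
 * release it (illustration only). */
SET(bp->b_flags, B_BUSY | B_READ);      /* mark it busy for a read */
/* ... start the I/O and wait for biodone() ... */
if (ISSET(bp->b_flags, B_ERROR))
        printf("read error %d\n", bp->b_error);
CLR(bp->b_flags, B_BUSY | B_WANTED);    /* release the buffer */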
4.2 Buffer Cache Contents
Buffer contents are maintained separately from the header to allow easy manipulation
of buffer sizes via the page-mapping hardware.
4.2.1 Allocating Virtual Memory to the Buffer Cache
The kernel allocates MAXBSIZE bytes of virtual memory to each buffer, but the address
space is not fully populated with physical memory. Initially, each buffer is assigned
4096 bytes of physical memory. As smaller buffers are allocated, they give up their
unused physical memory to buffers that need to hold more than 4096 bytes.
MAXBSIZE is machine-dependent, since it is defined as
------------------------------------------------------ sys/param.h
211 /*
212  * File system parameters and macros.
213  *
214  * The file system is made out of blocks of at most MAXBSIZE units, with
215  * smaller units (fragments) only in the last direct block.  MAXBSIZE
216  * primarily determines the size of buffers in the buffer pool.  It may be
217  * made larger without any effect on existing file systems; however making
218  * it smaller may make some file systems unmountable.
219  */
220 #ifndef MAXBSIZE                        /* XXX */
221 #define MAXBSIZE        MAXPHYS
222 #endif
223 #define MAXFRAG         8
------------------------------------------------------ sys/param.h
and
--------------------------------------- arch/sparc64/include/param.h
128 #define DEV_BSIZE       512
129 #define DEV_BSHIFT      9               /* log2(DEV_BSIZE) */
130 #define BLKDEV_IOSIZE   2048
131 #define MAXPHYS         (64 * 1024)
--------------------------------------- arch/sparc64/include/param.h

4.2.2 Identifying a Buffer
How can we identify a buffer? Is there a unique ID?
4.4BSD identifies a buffer by its logical block number within the filesystem, held in
the b_lblkno member of the buffer header.
Since it is difficult to detect aliases between a block belonging to a local file and
the same block accessed through the block device for the disk, the kernel prevents
this case from occurring: the kernel does not allow the block device of a partition to
be opened while that partition is mounted, nor does it allow a partition to be mounted
if the block device of that partition is already open.
4.3 Buffer Hash
A buffer with valid contents is contained on exactly one bufhash hash chain. The
kernel uses the hash chains to determine quickly whether a block is in the buffer
pool, and if it is, to locate it.
A buffer is removed from the buffer hash only when its contents become invalid or it
is reused for different data. Thus, even if the buffer is in use by one process, it
can still be found by another process, although the B_BUSY flag will be set so that it
will not be used until the buffer is released.
The buffer hash is defined in kern/vfs_bio.c as,
------------------------------------------------- kern/vfs_bio.c
75 /*
76  * Definitions for the buffer hash lists.
77  */
78 #define BUFHASH(dvp, lbn)       \
79         (&bufhashtbl[(((long)(dvp) >> 8) + (int)(lbn)) & bufhash])
80 LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash;
81 u_long  bufhash;
------------------------------------------------- kern/vfs_bio.c
If you are unfamiliar with the LIST_HEAD macro, review the section describing kernel
list data structures in chapter 1. If you know how to use the linked-list macros, you
will see that the definition on line 80 is equivalent to
------------------------------------------------------
struct bufhashhdr {
        struct buf *lh_first;           /* first element */
} *bufhashtbl, invalhash;
------------------------------------------------------
bufhashtbl points to a hash table composed of an array of linked-lists. However,
invalhash is simply a linked-list, not an array.
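The sketch below shows how the hash is consulted to decide whether a block already has
a buffer, in the style of the incore() routine in kern/vfs_bio.c. It is an
illustration of how BUFHASH, the b_hash chain, and the flags are meant to be used, not
a verbatim copy of that routine.

/* Is the block (vp, blkno) already present in the buffer cache?
 * (Illustration only.) */
struct buf *
example_incore(struct vnode *vp, daddr_t blkno)
{
        struct buf *bp;

        /* Search the hash chain selected by BUFHASH(vp, blkno). */
        LIST_FOREACH(bp, BUFHASH(vp, blkno), b_hash) {
                if (bp->b_lblkno == blkno && bp->b_vp == vp &&
                    !ISSET(bp->b_flags, B_INVAL))
                        return (bp);    /* found: it may still be B_BUSY */
        }
        return (NULL);
}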
4.4 Buffer Cache Free Lists
In addition to appearing on the hash list, each unlocked buffer appears on exactly one
free list. There are four kinds of free lists. They are defined in kern/vfs_bio.c as,
------------------------------------------------- kern/vfs_bio.c

 92 /*
 93  * Definitions for the buffer free lists.
 94  */
 95 #define BQUEUES         4       /* number of free buffer queues */
 96
 97 #define BQ_LOCKED       0       /* super-blocks &c */
 98 #define BQ_LRU          1       /* lru, useful buffers */
 99 #define BQ_AGE          2       /* rubbish */
100 #define BQ_EMPTY        3       /* buffer headers with no memory */
101
102 TAILQ_HEAD(bqueues, buf) bufqueues[BQUEUES];
103 int needbuffer;
------------------------------------------------- kern/vfs_bio.c
If you are unfamiliar with the TAILQ_HEAD macro, review the section describing kernel
list data structures in chapter 1.
If you have read that section, you will see that line 102 defines four tail queues
whose elements are of type struct buf, and that the definition is exactly equivalent
to
------------------------------------------------------
struct bqueues {
        struct buf  *tqh_first;         /* first element */
        struct buf **tqh_last;          /* addr of last next element */
} bufqueues[BQUEUES];
------------------------------------------------------
4.4.1 LOCKED List
Buffers on this list cannot be flushed from the cache.
4.4.2 LRU List
After a buffer is used, the buffer is then returned to the end of the LRU list. When
new buffers are needed, they are taken from the front of the LRU list. As its name
suggests, this list implements a least recently used (LRU) algorithm.
4.4.3 AGE List
The AGE list holds two kinds of buffers:
    Blocks of unlinked files: these buffers are not likely to be reused. They
    are placed at the front of the AGE list, where they will be reclaimed
    quickly.
    Read-ahead blocks: these buffers have not yet proven their usefulness.
    They are placed at the end of the AGE list, where they might remain long
    enough to be used again.
The AGE list is used for two purposes. First, if a block is requested and it is found
in a buffer that lives on the AGE list, the buffer is returned to the end of the LRU
list, not the AGE list, because it has proved its usefulness. Second, when a new
buffer is needed, the front of the AGE list is searched first; only when the AGE list
is empty is the LRU list used.

4.4.4 EMPTY List
The EMPTY list contains buffers that have no physical memory. They are held on
this list waiting for another buffer to be reused for a smaller block and thus give up
its extra physical memory.
4.5 Buffer Cache Initialization
The buffer cache is initialized in the early stage of system bootstrap. The
initialization process consists of two stages.
First, the cpu_startup function does machine-dependent memory allocation. Second, the
bufinit function, called by cpu_startup, initializes the buffer cache hash and free
lists using the memory allocated by the preceding machine-dependent stage.
4.5.1 Physical Memory Allocation
The main function of kern/init_main.c is the machine-independent bootstrap routine,
and it calls the machine-dependent startup routine, the cpu_startup function of
arch/sparc64/sparc64/machdep.c, defined as
--------------------------------------- arch/sparc64/sparc64/machdep.c
166 /*
167  * Machine-dependent startup code
168  */
169 void
170 cpu_startup()
171 {
172         caddr_t v;
173         long sz;
174         u_int i, base, residual;
175 #ifdef DEBUG
176         extern int pmapdebug;
177         int opmapdebug = pmapdebug;
178 #endif
179         vaddr_t minaddr, maxaddr;
180         vsize_t size;
181         extern struct user *proc0paddr;
182         char pbuf[9];
183
184 #ifdef DEBUG
185         pmapdebug = 0;
186 #endif
187
188         proc0.p_addr = proc0paddr;
189
190         /*
191          * Good {morning,afternoon,evening,night}.
192          */
193         printf(version);
194         /*identifycpu();*/
195         format_bytes(pbuf, sizeof(pbuf), ctob((u_int64_t)physmem));
196         printf("total memory = %s\n", pbuf);
197
198         /*
199          * Find out how much space we need, allocate it,
200          * and then give everything true virtual addresses.
201          */
202         sz = (long)allocsys(NULL, NULL);
203         if ((v = (caddr_t)uvm_km_alloc(kernel_map, round_page(sz))) == 0)
204                 panic("startup: no room for %lx bytes of tables", sz);
205         if (allocsys(v, NULL) - v != sz)
206                 panic("startup: table size inconsistency");
207
208         /*
209          * allocate virtual and physical memory for the buffers.
210          */
211         size = MAXBSIZE * nbuf;         /* # bytes for buffers */
212
213         /* allocate VM for buffers... area is not managed by VM system */
214         if (uvm_map(kernel_map, (vaddr_t *) &buffers, round_page(size),
215             NULL, UVM_UNKNOWN_OFFSET, 0,
216             UVM_MAPFLAG(UVM_PROT_NONE, UVM_PROT_NONE, UVM_INH_NONE,
217             UVM_ADV_NORMAL, 0)) != 0)
218                 panic("cpu_startup: cannot allocate VM for buffers");
219
220         minaddr = (vaddr_t) buffers;
221         if ((bufpages / nbuf) >= btoc(MAXBSIZE)) {
222                 bufpages = btoc(MAXBSIZE) * nbuf; /* do not overallocate RAM */
223         }
224         base = bufpages / nbuf;
225         residual = bufpages % nbuf;
226
227         /* now allocate RAM for buffers */
228         for (i = 0 ; i < nbuf ; i++) {
229                 vaddr_t curbuf;
230                 vsize_t curbufsize;
231                 struct vm_page *pg;
232
233                 /*
234                  * each buffer has MAXBSIZE bytes of VM space allocated.  of
235                  * that MAXBSIZE space we allocate and map (base+1) pages
236                  * for the first "residual" buffers, and then we allocate
237                  * "base" pages for the rest.
238                  */
239                 curbuf = (vaddr_t) buffers + (i * MAXBSIZE);
240                 curbufsize = NBPG * ((i < residual) ? (base+1) : base);
241
242                 while (curbufsize) {
243                         pg = uvm_pagealloc(NULL, 0, NULL, 0);
244                         if (pg == NULL)
245                                 panic("cpu_startup: "
246                                     "not enough RAM for buffer cache");
247                         pmap_kenter_pa(curbuf, VM_PAGE_TO_PHYS(pg),
248                             VM_PROT_READ | VM_PROT_WRITE);
249                         curbuf += PAGE_SIZE;
250                         curbufsize -= PAGE_SIZE;
251                 }
252         }
253         pmap_update(kernel_map->pmap);
254
255         /*
256          * Allocate a submap for exec arguments.  This map effectively
257          * limits the number of processes exec'ing at any time.
258          */
259         exec_map = uvm_km_suballoc(kernel_map, &minaddr, &maxaddr,
260             16*NCARGS, VM_MAP_PAGEABLE, FALSE, NULL);
261
262         /*
263          * Finally, allocate mbuf cluster submap.
264          */
265         mb_map = uvm_km_suballoc(kernel_map, &minaddr, &maxaddr,
266             nmbclusters * mclbytes, VM_MAP_INTRSAFE, FALSE, NULL);
267
268 #ifdef DEBUG
269         pmapdebug = opmapdebug;
270 #endif
271         format_bytes(pbuf, sizeof(pbuf), ptoa(uvmexp.free));
272         printf("avail memory = %s\n", pbuf);
273         format_bytes(pbuf, sizeof(pbuf), bufpages * NBPG);
274         printf("using %u buffers containing %s of memory\n", nbuf, pbuf);
275
276         /*
277          * Set up buffers, so they can be used to read disk labels.
278          */
279         bufinit();
280
281 #if 0
282         pmap_redzone();
283 #endif
284 }
--------------------------------------- arch/sparc64/sparc64/machdep.c
The buffers global variable used in the above function is defined in
arch/sparc64/compile/MY_KERNEL/param.c, code generated automatically at compile time
by the config program, as
------------------------------- arch/sparc64/compile/MY_KERNEL/param.c
194 /*
195  * These have to be allocated somewhere; allocating
196  * them here forces loader errors if this file is omitted
197  * (if they've been externed everywhere else; hah!).
198  */
199 struct  buf *buf;
200 char    *buffers;
------------------------------- arch/sparc64/compile/MY_KERNEL/param.c
and the NBPG macro is defined in the machine-dependent header
arch/sparc64/include/param.h as
--------------------------------------- arch/sparc64/include/param.h
298 #define PGSHIFT         13              /* log2(NBPG) */
299 #define NBPG            (1<<PGSHIFT)    /* bytes/page */
300 #define PGOFSET         (NBPG-1)        /* byte offset into page */
--------------------------------------- arch/sparc64/include/param.h
Global variables used in the cpu_startup function, such as nbuf, bufpages, and
bufcache, are defined in kern/kern_allocsys.c as
------------------------------------------------- kern/kern_allocsys.c
 94 /*
 95  * Declare these as initialized data so we can patch them.
 96  */
 97 #ifndef NBUF
 98 # define NBUF 0
 99 #endif
100
101 #ifndef BUFPAGES
102 # define BUFPAGES 0
103 #endif
104
105 #ifdef BUFCACHE
106 # if (BUFCACHE < 5) || (BUFCACHE > 95)
107 #  error BUFCACHE is not between 5 and 95
108 # endif
109 #else
110   /* Default to 10% of first 2MB and 5% of remaining. */
111 # define BUFCACHE 0
112 #endif
113
114 u_int   nbuf = NBUF;
115 u_int   nswbuf = 0;
116 u_int   bufpages = BUFPAGES;    /* optional hardwired count */
117 u_int   bufcache = BUFCACHE;    /* % of RAM to use for buffer cache */
------------------------------------------------- kern/kern_allocsys.c
If you specify the NBUF or BUFPAGES macro in your kernel configuration file, the kernel uses a fixed amount of buffer cache. By default, however, NBUF and BUFPAGES are zero, and current NetBSD releases calculate the amount of memory for the buffer cache automatically: 10 percent of the first 2 MB of physical memory (that is, 0.2 MB) plus 5 percent of the remaining memory. You can change this percentage by setting the BUFCACHE macro in your kernel configuration file with the options command.
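To make the default formula concrete, the following small user-space sketch (our own example, not kernel code) reproduces the calculation that allocsys performs when bufpages and bufcache are both zero; the 128 MB memory size and the 8192-byte page size are example inputs.

------------------------------------- example: default buffer sizing (sketch)
#include <stdio.h>

#define NBPG_EXAMPLE 8192                       /* sparc64 page size, for the demo */
#define btoc(x) (((x) + NBPG_EXAMPLE - 1) / NBPG_EXAMPLE)   /* bytes to pages */

int
main(void)
{
    /* example machine: 128 MB of physical memory, expressed in pages */
    unsigned long physmem = btoc(128UL * 1024 * 1024);
    unsigned long bufpages;

    if (physmem < btoc(2 * 1024 * 1024))
        bufpages = physmem / 10;                            /* 10% of a tiny machine */
    else
        bufpages = (btoc(2 * 1024 * 1024) + physmem) / 20;  /* 10% of 2MB + 5% of rest */

    printf("bufpages = %lu pages (%lu KB)\n",
        bufpages, bufpages * NBPG_EXAMPLE / 1024);
    return 0;
}
------------------------------------- example: default buffer sizing (sketch)

For 128 MB and 8 KB pages this prints 832 pages (6656 KB), which matches the "using 832 buffers containing 6656 KB of memory" line of the bootstrap message quoted later in this section.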
This calculation is done by the allocsys function, which is called from line 202-206 of the cpu_startup function. The source code of the allocsys function, in kern/kern_allocsys.c, is
------------------------------------------------- kern/kern_allocsys.c
119 /*
120  * Allocate space for system data structures.  We are given
121  * a starting virtual address and we return a final virtual
122  * address; along the way we set each data structure pointer.
123  *
124  * We call allocsys() with 0 to find out how much space we want,
125  * allocate that much and fill it with zeroes, and then call
126  * allocsys() again with the correct base virtual address.
127  *
128  */
129
130 caddr_t
131 allocsys(caddr_t v, caddr_t (*mdcallback)(caddr_t))
132 {
133
134     /* Calculate the number of callwheels if necessary. */
135     if (callwheelsize == 0)
136         callout_setsize();
137
138     ALLOCSYS(v, callwheel, struct callout_queue, callwheelsize);
139 #ifdef CALLWHEEL_STATS
140     ALLOCSYS(v, callwheel_sizes, int, callwheelsize);
141 #endif
142 #ifdef SYSVSHM
143     ALLOCSYS(v, shmsegs, struct shmid_ds, shminfo.shmmni);
144 #endif
145 #ifdef SYSVSEM
146     ALLOCSYS(v, sema, struct semid_ds, seminfo.semmni);
147     ALLOCSYS(v, sem, struct __sem, seminfo.semmns);
148     /* This is pretty disgusting! */
149     ALLOCSYS(v, semu, int, (seminfo.semmnu * seminfo.semusz) / sizeof(int));
150 #endif
151 #ifdef SYSVMSG
152     ALLOCSYS(v, msgpool, char, msginfo.msgmax);
153     ALLOCSYS(v, msgmaps, struct msgmap, msginfo.msgseg);
154     ALLOCSYS(v, msghdrs, struct __msg, msginfo.msgtql);
155     ALLOCSYS(v, msqids, struct msqid_ds, msginfo.msgmni);
156 #endif
157
158     /*
159      * Determine how many buffers to allocate.
160      *
161      *  - If bufcache is specified, use that % of memory
162      *    for the buffer cache.
163      *
164      *  - Otherwise, we default to the traditional BSD
165      *    formula of 10% of the first 2MB and 5% of
166      *    the remaining.
167      */
168     if (bufpages == 0) {
169         if (bufcache != 0) {
170             if (bufcache < 5 || bufcache > 95)
171                 panic("bufcache is out of range (%d)",
172                     bufcache);
173             bufpages = physmem / 100 * bufcache;
174         } else {
175             if (physmem < btoc(2 * 1024 * 1024))
176                 bufpages = physmem / 10;
177             else
178                 bufpages = (btoc(2 * 1024 * 1024) + physmem) /
179                     20;
180         }
181     }
182
183 #ifdef DIAGNOSTIC
184     if (bufpages == 0)
185         panic("bufpages = 0");
186 #endif
187
188     /*
189      * Call the mdcallback now; it may need to adjust bufpages.
190      */
191     if (mdcallback != NULL)
192         v = mdcallback(v);
193
194     /*
195      * Ensure a minimum of 16 buffers.
196      */
197     if (nbuf == 0) {
198         nbuf = bufpages;
199         if (nbuf < 16)
200             nbuf = 16;
201     }
202
203 #ifdef VM_MAX_KERNEL_BUF
204     /*
205      * XXX stopgap measure to prevent wasting too much KVM on
206      * the sparsely filled buffer cache.
207      */
208     if (nbuf > VM_MAX_KERNEL_BUF / MAXBSIZE)
209         nbuf = VM_MAX_KERNEL_BUF / MAXBSIZE;
210 #endif
211
212     /*
213      * We allocate 1/2 as many swap buffer headers as file I/O buffers.
214      */
215     if (nswbuf == 0) {
216         nswbuf = (nbuf / 2) &~ 1;   /* force even */
217         if (nswbuf > 256)
218             nswbuf = 256;           /* sanity */
219     }
220     ALLOCSYS(v, buf, struct buf, nbuf);
221
222     return (v);
223 }
------------------------------------------------- kern/kern_allocsys.c
where the ALLOCSYS macro is defined in sys/systm.h as
-------------------------------------------------- sys/systm.h
325 #define ALLOCSYS(base, name, type, num) \
326     (name) = (type *)(base); (base) = (caddr_t)ALIGN((name)+(num))
-------------------------------------------------- sys/systm.h
where the ALIGN macro is defined in the machine-dependent header arch/sparc64/include/param.h as
--------------------------------------- arch/sparc64/include/param.h
 95 /*
 96  * Round p (pointer or byte index) up to a correctly-aligned value for
 97  * the machine's strictest data type.  The result is u_int and must be
 98  * cast to any desired pointer type.
 99  *
100  * ALIGNED_POINTER is a boolean macro that checks whether an address
101  * is valid to fetch data elements of type t from on this architecture.
102  * This does not reflect the optimal alignment, just the possibility
103  * (within reasonable limits).
104  *
105  */
106 #define ALIGNBYTES32            0x7
107 #define ALIGNBYTES64            0xf
108 #ifdef __arch64__
109 #define ALIGNBYTES              ALIGNBYTES64
110 #else
111 #define ALIGNBYTES              ALIGNBYTES32
112 #endif
113 #define ALIGN(p)                (((u_long)(p) + ALIGNBYTES) & ~ALIGNBYTES)
114 #define ALIGN32(p)              (((u_long)(p) + ALIGNBYTES32) & ~ALIGNBYTES32)
115 #define ALIGNED_POINTER(p,t)    ((((u_long)(p)) & (sizeof(t)-1)) == 0)
--------------------------------------- arch/sparc64/include/param.h
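To see how ALLOCSYS and ALIGN carve one contiguous chunk into per-subsystem arrays, here is a user-space sketch that reproduces the two macros with the 64-bit alignment shown above. The struct size, array lengths, and the two consumer names are made up for the example, and the two-pass use mirrors the comment at lines 124-126 of allocsys.

------------------------------------- example: ALLOCSYS/ALIGN carving (sketch)
#include <stdio.h>
#include <stdlib.h>

typedef char *caddr_t;

#define ALIGNBYTES 0xfUL                    /* 64-bit (sparc64) flavour */
#define ALIGN(p)   (((unsigned long)(p) + ALIGNBYTES) & ~ALIGNBYTES)
#define ALLOCSYS(base, name, type, num) \
    (name) = (type *)(base); (base) = (caddr_t)ALIGN((name) + (num))

struct buf { char pad[608]; };              /* stand-in for the real struct buf */

static struct buf *buf;                     /* plays the role of the kernel's buf array */
static int *callwheel_sizes;                /* a second, hypothetical consumer */

int
main(void)
{
    int nbuf = 832, nwheel = 256;
    caddr_t chunk, v;
    size_t need;

    /* first pass: start at 0 so the final value of v is the total size needed */
    v = (caddr_t)0;
    ALLOCSYS(v, callwheel_sizes, int, nwheel);
    ALLOCSYS(v, buf, struct buf, nbuf);
    need = (size_t)v;

    /* second pass: carve a real allocation the same way */
    chunk = malloc(need);
    v = chunk;
    ALLOCSYS(v, callwheel_sizes, int, nwheel);
    ALLOCSYS(v, buf, struct buf, nbuf);

    printf("total %zu bytes; callwheel_sizes at %p, buf at %p\n",
        need, (void *)callwheel_sizes, (void *)buf);
    free(chunk);
    return 0;
}
------------------------------------- example: ALLOCSYS/ALIGN carving (sketch)

Note how each array starts on a 16-byte boundary, because ALIGN rounds the running pointer up after every allocation.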
Strictly speaking, you may not fully understand the cpu_startup function until we describe the UVM memory management system. What is worth remembering is that you now know
· how the physical memory for the buffer cache is allocated,
· how you can change the amount of buffer cache,
· that during the machine-dependent physical memory allocation for the buffer cache, the nbuf and bufpages variables are set to appropriate values on the basis of BUFCACHE, which represents what portion of the available physical memory should be used for the buffer cache, and
· that the buffers global variable points to the virtual memory chunk allocated by UVM for the whole buffer cache.
4.5.2   Initialization of Hash and Free List
The bufinit function is called from line 279 of the machine-dependent cpu_startup function. It initializes the buffer cache hash and the four free lists.
------------------------------------------------- kern/vfs_bio.c
146 /*
147  * Initialize buffers and hash links for buffers.
148  */
149 void
150 bufinit()
151 {
152     struct buf *bp;
153     struct bqueues *dp;
154     u_int i, base, residual;
155
156     /*
157      * Initialize the buffer pool.  This pool is used for buffers
158      * which are strictly I/O control blocks, not buffer cache
159      * buffers.
160      */
161     pool_init(&bufpool, sizeof(struct buf), 0, 0, 0, "bufpl", NULL);
162
163     for (dp = bufqueues; dp < &bufqueues[BQUEUES]; dp++)
164         TAILQ_INIT(dp);
165     bufhashtbl = hashinit(nbuf, HASH_LIST, M_CACHE, M_WAITOK, &bufhash);
166     base = bufpages / nbuf;
167     residual = bufpages % nbuf;
168     for (i = 0; i < nbuf; i++) {
169         bp = &buf[i];
170         memset((char *)bp, 0, sizeof(*bp));
171         bp->b_dev = NODEV;
172         bp->b_vnbufs.le_next = NOLIST;
173         LIST_INIT(&bp->b_dep);
174         bp->b_data = buffers + i * MAXBSIZE;
175         if (i < residual)
176             bp->b_bufsize = (base + 1) * PAGE_SIZE;
177         else
178             bp->b_bufsize = base * PAGE_SIZE;
179         bp->b_flags = B_INVAL;
180         dp = bp->b_bufsize ? &bufqueues[BQ_AGE] : &bufqueues[BQ_EMPTY];
181         binsheadfree(bp, dp);
182         binshash(bp, &invalhash);
183     }
184 }
------------------------------------------------- kern/vfs_bio.c
line 161
    initializes the buffer pool. Notice that this buffer pool is a completely different thing from the buffer cache. The buffer cache holds data blocks identified by logical file block numbers. The buffer pool, in contrast, supplies buffer headers used to transfer data between raw devices and user buffers, bypassing the buffer cache. The buffer pool is used for physical I/O by device-driver layers such as the SCSI controller. When we describe the ccd device driver in another chapter, we will explain how the buffer pool is used; a short usage sketch follows below.
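As a hedged illustration (not a quote from the NetBSD source), a driver that needs a plain I/O control block would typically borrow one from this bufpool with the resource-pool interfaces introduced in chapter 1:

------------------------------------- example: using the buffer pool (sketch)
/* Hedged in-kernel fragment; the surrounding driver context is assumed. */
struct buf *bp;

bp = pool_get(&bufpool, PR_WAITOK);   /* may sleep until an item is available */
/* ... fill in bp and hand it to the device's strategy routine ... */
pool_put(&bufpool, bp);               /* return the header once the I/O is finished */
------------------------------------- example: using the buffer pool (sketch)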
line 164
    Did you review chapter 1 on linked lists and tail queues? Then you will know that this line is equivalent to
        dp->tqh_first = NULL;
        dp->tqh_last = &dp->tqh_first;
    Together with the loop on line 163, this initializes the four buffer cache free lists: the LOCKED, LRU, AGE, and EMPTY lists.
line 165
    initializes the buffer cache hash table and stores the mask value in the bufhash variable. If you are not certain what this line does, review the subsection describing the kernel hash implementation in chapter 1.
line 166-167
    nbuf is the total number of buffer cache headers. bufpages is the number of physical memory pages available for the buffer cache. nbuf is equal to bufpages unless the NBUF kernel configuration option is explicitly set. Therefore, by default, these two lines are equivalent to
        base = 1;
        residual = 0;
    A small sketch of how base and residual distribute pages over buffers follows below.
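The following user-space sketch (example values only) shows how this base/residual split hands (base+1) pages to the first residual buffers and base pages to the rest, exactly as lines 175-178 and the cpu_startup loop shown earlier do.

------------------------------------- example: base/residual split (sketch)
#include <stdio.h>

int
main(void)
{
    /* example values: 10 pages of buffer memory spread over 4 buffers */
    unsigned int bufpages = 10, nbuf = 4, pagesize = 8192;
    unsigned int base = bufpages / nbuf;        /* 2 */
    unsigned int residual = bufpages % nbuf;    /* 2 */
    unsigned int i;

    for (i = 0; i < nbuf; i++) {
        unsigned int pages = (i < residual) ? base + 1 : base;
        printf("buffer %u: %u pages (%u bytes)\n", i, pages, pages * pagesize);
    }
    return 0;
}
------------------------------------- example: base/residual split (sketch)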
line 168
    Remember that nbuf is the total number of buffer cache headers in the kernel. This number is displayed in the kernel bootstrap message, such as

console is keyboard/display
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002
    The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
    The Regents of the University of California.  All rights reserved.

NetBSD 1.6K (KERNEL) #0: Sat Nov  9 22:09:36 KST 2002
    cimon@ultra1:/usr/src/syssrc/sys/arch/sparc64/compile/KERNEL
total memory = 128 MB
avail memory = 108 MB
using 832 buffers containing 6656 KB of memory
bootpath: /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0
mainbus0 (root): SUNW,Ultra-1
cpu0 at mainbus0: SUNW,UltraSPARC @ 170 MHz, version 0 FPU
cpu0: 32K instruction (32 b/l), 16K data (32 b/l), 512K external (64 b/l)

    The nbuf variable is set to 832 for a system with the above bootstrap message.
line 169
    Do you remember where the buf variable appeared? We described earlier in this section that buf appears in arch/sparc64/compile/MY_KERNEL/param.c as

------------------------- arch/sparc64/compile/MY_KERNEL/param.c
199 struct  buf *buf;
200 char    *buffers;
------------------------- arch/sparc64/compile/MY_KERNEL/param.c

    The buf global variable points to the memory chunk that holds all the buffer cache headers. It is initialized at line 220 of kern/kern_allocsys.c; that line is equivalent to

        ALLOCSYS(v, buf, struct buf, nbuf);
                ====>
        buf = (struct buf *) v;
        v   = (caddr_t) ALIGN (buf + nbuf);

    Therefore, buf points to a memory chunk that can hold every buffer cache header in the system. If you are not certain about this, review this section.
    The buffers global variable points to the memory chunk that holds the contents of all the buffer caches. We already explained in the previous subsection how buffers is initialized.

line 170
    clears the i-th buffer cache header.
line 171
    Since the buffer pointed to by bp has just been initialized and is empty, it is not associated with any physical storage device. Therefore the b_dev member of the buffer cache header is set to NODEV. We showed the full definition of the buffer cache header in a previous section. The NODEV macro is defined in sys/param.h as

--------------------------------------------- sys/param.h
203 #define NODEV       (dev_t)(-1)    /* non-existent device */
--------------------------------------------- sys/param.h
line 172
    This is a somewhat ad-hoc way of handling a list entry that has no list head. The b_vnbufs member links the buffer into the list of buffers associated with a vnode, but at this point the buffer belongs to no vnode, so there is no list head to hand to the LIST_INIT macro. Instead, this line simply marks the entry as being on no list at all (NOLIST).
line 173
    You may disregard this line, since b_dep is only used by the soft dependency facility.
line 174
    sets the b_data member of the buffer cache header to point to its MAXBSIZE-byte virtual memory chunk.
line 175-178
    by default (residual == 0), only line 178 takes effect.
line 179
    sets the status of the buffer cache header. Because the buffer is not yet associated with any vnode or valid data, the status is set to B_INVAL.
line 180-181
    by default, these two lines are equivalent to
        binsheadfree(bp, &bufqueues[BQ_AGE]);
    meaning that the buffer cache header pointed to by bp is inserted at the head of the AGE list, since binsheadfree is a macro defined as

------------------------------------------------- kern/vfs_bio.c
110 /*
111  * Insq/Remq for the buffer free lists.
112  */
113 #define binsheadfree(bp, dp)    TAILQ_INSERT_HEAD(dp, bp, b_freelist)
114 #define binstailfree(bp, dp)    TAILQ_INSERT_TAIL(dp, bp, b_freelist)
------------------------------------------------- kern/vfs_bio.c
line 182
    places the buffer cache header pointed to by bp on the invalid hash list. It is natural that the buffer does not go on a regular hash chain, since it does not yet contain any contents.

------------------------------------------------- kern/vfs_bio.c
 86 /*
 87  * Insq/Remq for the buffer hash lists.
 88  */
 89 #define binshash(bp, dp)    LIST_INSERT_HEAD(dp, bp, b_hash)
 90 #define bremhash(bp)        LIST_REMOVE(bp, b_hash)
------------------------------------------------- kern/vfs_bio.c
4.6     Buffer Cache Operation
In this section, we show the list of buffer cache operations that are used by filesystems. Buffer cache operations are defined in kern/vfs_bio.c and declared in sys/buf.h as
------------------------------------------------------ sys/buf.h
260 void    allocbuf __P((struct buf *, int));
261 void    bawrite __P((struct buf *));
262 void    bdirty __P((struct buf *));
263 void    bdwrite __P((struct buf *));
264 void    biodone __P((struct buf *));
265 int     biowait __P((struct buf *));
266 int     bread __P((struct vnode *, daddr_t, int,
267             struct ucred *, struct buf **));
268 int     breada __P((struct vnode *, daddr_t, int, daddr_t, int,
269             struct ucred *, struct buf **));
270 int     breadn __P((struct vnode *, daddr_t, int, daddr_t *, int *, int,
271             struct ucred *, struct buf **));
272 void    brelse __P((struct buf *));
273 void    bremfree __P((struct buf *));
274 void    bufinit __P((void));
275 int     bwrite __P((struct buf *));
276 void    cluster_callback __P((struct buf *));
277 int     cluster_read __P((struct vnode *, u_quad_t, daddr_t, long,
278             struct ucred *, struct buf **));
279 void    cluster_write __P((struct buf *, u_quad_t));
280 struct buf *getblk __P((struct vnode *, daddr_t, int, int, int));
281 struct buf *geteblk __P((int));
282 struct buf *getnewbuf __P((int slpflag, int slptimeo));
283 struct buf *incore __P((struct vnode *, daddr_t));
284
285 void    minphys __P((struct buf *bp));
286 int     physio __P((void (*strategy)(struct buf *), struct buf *bp, dev_t dev,
287             int flags, void (*minphys)(struct buf *), struct uio *uio));
288
289 void    brelvp __P((struct buf *));
290 void    reassignbuf __P((struct buf *, struct vnode *));
291 void    bgetvp __P((struct vnode *, struct buf *));
------------------------------------------------------ sys/buf.h
4.6.1   Finding a Buffer Cache from Hash: incore function
The incore function below is used to find the buffer cache associated with a given vnode and a specific logical file block number (note line 596-597).
------------------------------------------------- kern/vfs_bio.c
580 /*
581  * Determine if a block is in the cache.
582  * Just look on what would be its hash chain.  If it's there, return
583  * a pointer to it, unless it's marked invalid.  If it's marked invalid,
584  * we normally don't return the buffer, unless the caller explicitly
585  * wants us to.
586  */
587 struct buf *
588 incore(vp, blkno)
589     struct vnode *vp;
590     daddr_t blkno;
591 {
592     struct buf *bp;
593
594     /* Search hash chain */
595     LIST_FOREACH(bp, BUFHASH(vp, blkno), b_hash) {
596         if (bp->b_lblkno == blkno && bp->b_vp == vp &&
597             !ISSET(bp->b_flags, B_INVAL))
598             return (bp);
599     }
600
601     return (NULL);
602 }
------------------------------------------------- kern/vfs_bio.c
line 595
    Since the buffer cache hash is actually an array of linked lists, this access is logical: BUFHASH selects one linked list from the array. If you are not certain about this operation, review the section describing kernel data structures in chapter 1.
line 596
    shows that a buffer cache entry is identified by its associated vnode and logical file block number.
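As a hedged sketch of typical use (ours, not from the NetBSD source), a caller checks the cache with incore before deciding whether a disk read is needed; the getblk function shown later follows exactly this pattern.

------------------------------------- example: checking the cache (sketch)
/* Hedged in-kernel fragment; vp and lblkno are assumed to come from the caller. */
struct buf *bp;

bp = incore(vp, lblkno);
if (bp != NULL) {
    /* the block is cached and valid, though it may still be busy */
} else {
    /* not cached: a buffer must be allocated and the block read from disk */
}
------------------------------------- example: checking the cache (sketch)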
In the following sections, we explain what each of the functions in the above list does.
4.7     Managing Buffer Cache Free Lists
4.7.1   Reusing an Old Buffer: bremfree function
The bremfree function removes a buffer cache header from a free list.
------------------------------------------------- kern/vfs_bio.c
120 void
121 bremfree(bp)
122     struct buf *bp;
123 {
124     int s = splbio();
125
126     struct bqueues *dp = NULL;
127
128     /*
129      * We only calculate the head of the freelist when removing
130      * the last element of the list as that is the only time that
131      * it is needed (e.g. to reset the tail pointer).
132      *
133      * NB: This makes an assumption about how tailq's are implemented.
134      */
135     if (TAILQ_NEXT(bp, b_freelist) == NULL) {
136         for (dp = bufqueues; dp < &bufqueues[BQUEUES]; dp++)
137             if (dp->tqh_last == &bp->b_freelist.tqe_next)
138                 break;
139         if (dp == &bufqueues[BQUEUES])
140             panic("bremfree: lost tail");
141     }
142     TAILQ_REMOVE(dp, bp, b_freelist);
143     splx(s);
144 }
------------------------------------------------- kern/vfs_bio.c
line 124
    The splbio function blocks hardware interrupts from disks and other storage devices so that buffer cache coherency is not disturbed.
line 135
    Recall the definitions of TAILQ_NEXT and of the b_freelist member of struct buf:

#define TAILQ_ENTRY(type)                                               \
struct {                                                                \
    struct type *tqe_next;      /* next element */                      \
    struct type **tqe_prev;     /* address of previous next element */  \
}
...
#define TAILQ_NEXT(elm, field)      ((elm)->field.tqe_next)

and

...
struct buf {
    LIST_ENTRY(buf) b_hash;         /* Hash chain. */
    LIST_ENTRY(buf) b_vnbufs;       /* Buffer's associated vnode. */
    TAILQ_ENTRY(buf) b_freelist;    /* Free list position if not active. */
    TAILQ_ENTRY(buf) b_actq;        /* Device driver queue when active. */
...

line 135
    checks whether the buffer cache header pointed to by bp is the last element of any one of the four free lists.
line 136-140
    find which free list contains the buffer. If the buffer being removed is not the last element of its free list, there is no need to know the pointer to the list head; but if it is the last element, the head pointer is needed. You may wonder why. As the comment on line 133 says, we have to look at how tail queues are implemented to answer that.
line 142
    removes the buffer from its free list. From the definition of TAILQ_REMOVE below, we can see why the pointer to the head of the free list is needed when the buffer is the last element of the list: the head's tqh_last pointer must be reset.
#define TAILQ_REMOVE(head, elm, field) do {                             \
    QUEUEDEBUG_TAILQ_PREREMOVE((head), (elm), field)                    \
    QUEUEDEBUG_TAILQ_OP((elm), field)                                   \
    if (((elm)->field.tqe_next) != NULL)                                \
        (elm)->field.tqe_next->field.tqe_prev =                         \
            (elm)->field.tqe_prev;                                      \
    else                                                                \
        (head)->tqh_last = (elm)->field.tqe_prev;                       \
    *(elm)->field.tqe_prev = (elm)->field.tqe_next;                     \
    QUEUEDEBUG_TAILQ_POSTREMOVE((elm), field);                          \
} while (/*CONSTCOND*/0)
Ok ?
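The point is easiest to see by running the tail-queue macros in user space. The standalone sketch below (element names are made up) shows that removing the last element is exactly the case in which the head's tqh_last pointer must be rewritten, which is why bremfree has to find the list head first.

------------------------------------- example: removing the tail of a TAILQ (sketch)
#include <sys/queue.h>
#include <stdio.h>

struct node {
    int value;
    TAILQ_ENTRY(node) link;
};

TAILQ_HEAD(nodehead, node);

int
main(void)
{
    struct nodehead head;
    struct node a, b, *n;

    a.value = 1;
    b.value = 2;

    TAILQ_INIT(&head);
    TAILQ_INSERT_TAIL(&head, &a, link);
    TAILQ_INSERT_TAIL(&head, &b, link);

    /* b is the last element: TAILQ_NEXT() is NULL, the same test that
     * line 135 of bremfree performs. */
    printf("b is last: %s\n", TAILQ_NEXT(&b, link) == NULL ? "yes" : "no");

    /* Removing the last element forces TAILQ_REMOVE to reset head.tqh_last. */
    TAILQ_REMOVE(&head, &b, link);

    TAILQ_FOREACH(n, &head, link)
        printf("remaining: %d\n", n->value);
    return 0;
}
------------------------------------- example: removing the tail of a TAILQ (sketch)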
line 143
    restores the interrupt priority level changed at line 124. For your reference, we show the definitions of the splbio and splx functions from arch/sparc64/include/psl.h:
----------------------------------- arch/sparc64/include/psl.h
 79 /* Interesting spl()s */
 80 #define PIL_SCSI    3
 81 #define PIL_FDSOFT  4
 82 #define PIL_AUSOFT  4
 83 #define PIL_BIO     5
...
355 #define SPLHOLD(name, newpil) \
356 static __inline int name __P((void)); \
357 static __inline int name() \
358 { \
359     int oldpil; \
360     __asm __volatile("rdpr %%pil,%0" : "=r" (oldpil)); \
361     if (newpil <= oldpil) \
362         return oldpil; \
363     __asm __volatile("wrpr %%g0,%0,%%pil" : : "n" (newpil)); \
364     return (oldpil); \
365 }
366 #endif
...
382 /* Block devices */
383 SPLHOLD(splbio, PIL_BIO)
...
448 static __inline void splx(newpil)
449     int newpil;
450 #endif
451 {
452 #ifdef SPLDEBUG
...
457 #endif
458     __asm __volatile("wrpr %%g0,%0,%%pil" : : "rn" (newpil));
459 }
----------------------------------- arch/sparc64/include/psl.h
4.7.2   Allocating a New Buffer: getnewbuf function
When a process wants to read data from a file, the kernel determines which filesystem contains the file and which block in that filesystem contains the data. Before reading a particular disk block, the kernel checks whether the block is already in the buffer cache and, if it is not, assigns a new free buffer using the getnewbuf function. Up to now we have presented a detailed description; from now on we give briefer explanations to keep the size of this report down. The algorithm of this function is
start:
    if (there is a buffer on the AGE free list) {
        remove the buffer from the AGE free list;
    } else if (there is a buffer on the LRU free list) {
        remove the buffer from the LRU free list;
    } else {
        // There is no buffer on any free list. Oops !
        //
        sleep (until any buffer appears on a free list);
        return NULL;
    }

    if (the buffer is being flushed to storage) {
        // Note that the buffer under flush has
        // just been removed from the LRU list
        //
        set the buffer to go to the AGE list when the flush is done;
        // check whether there is another free buffer
        //
        goto start;
    }

    mark the buffer cache busy;

    if (the buffer is marked for delayed write) {
        // The kernel must write the ``delayed write'' buffer
        // to storage and allocate another buffer !
        //
        set the buffer to go to the AGE list when the flush is done;
        start an asynchronous write of the buffer to disk;
        return NULL;
    }

    // The buffer cache does not yet have a filesystem
    // logical block number associated with its data.
    // Since the logical block number is the hash key,
    // the buffer no longer belongs on the hash.
    //
    disassociate the buffer cache from its vnode;
    remove the buffer from its old hash entry;
    return the buffer;
When getnewbuf returns a NULL pointer, its caller generally just tries again, calling getnewbuf in a loop, as sketched below.
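A minimal sketch of that retry pattern (it mirrors the loop allocbuf itself uses, shown in the next subsection); the full getnewbuf source follows.

------------------------------------- example: retrying getnewbuf (sketch)
/* Hedged in-kernel fragment.  With slpflag = 0 and slptimeo = 0 the call
 * sleeps inside getnewbuf() whenever no buffer is free, so this loop does
 * not busy-wait. */
struct buf *nbp;

while ((nbp = getnewbuf(0, 0)) == NULL)
    ;
/* nbp is now busy, off the free lists, and off its old hash chain */
------------------------------------- example: retrying getnewbuf (sketch)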
------------------------------------------------- kern/vfs_bio.c
768 /*
769  * Find a buffer which is available for use.
770  * Select something from a free list.
771  * Preference is to AGE list, then LRU list.
772  */
773 struct buf *
774 getnewbuf(slpflag, slptimeo)
775     int slpflag, slptimeo;
776 {
777     struct buf *bp;
778     int s;
779
780 start:
781     s = splbio();
782     if ((bp = TAILQ_FIRST(&bufqueues[BQ_AGE])) != NULL ||
783         (bp = TAILQ_FIRST(&bufqueues[BQ_LRU])) != NULL) {
784         bremfree(bp);
785     } else {
786         /* wait for a free buffer of any kind */
787         needbuffer = 1;
788         tsleep(&needbuffer, slpflag|(PRIBIO+1), "getnewbuf", slptimeo);
789         splx(s);
790         return (NULL);
791     }
792
793     if (ISSET(bp->b_flags, B_VFLUSH)) {
794         /*
795          * This is a delayed write buffer being flushed to disk.  Make
796          * sure it gets aged out of the queue when it's finished, and
797          * leave it off the LRU queue.
798          */
799         CLR(bp->b_flags, B_VFLUSH);
800         SET(bp->b_flags, B_AGE);
801         splx(s);
802         goto start;
803     }
804
805     /* Buffer is no longer on free lists. */
806     SET(bp->b_flags, B_BUSY);
807
808     /*
809      * If buffer was a delayed write, start it and return NULL
810      * (since we might sleep while starting the write).
811      */
812     if (ISSET(bp->b_flags, B_DELWRI)) {
813         splx(s);
814         /*
815          * This buffer has gone through the LRU, so make sure it gets
816          * reused ASAP.
817          */
818         SET(bp->b_flags, B_AGE);
819         bawrite(bp);
820         return (NULL);
821     }
822
823     /* disassociate us from our vnode, if we had one... */
824     if (bp->b_vp)
825         brelvp(bp);
826     splx(s);
827
828     if (LIST_FIRST(&bp->b_dep) != NULL && bioops.io_deallocate)
829         (*bioops.io_deallocate)(bp);
830
831     /* clear out various other fields */
832     bp->b_flags = B_BUSY;
833     bp->b_dev = NODEV;
834     bp->b_blkno = bp->b_lblkno = bp->b_rawblkno = 0;
835     bp->b_iodone = 0;
836     bp->b_error = 0;
837     bp->b_resid = 0;
838     bp->b_bcount = 0;
839
840     bremhash(bp);
841     return (bp);
842 }
------------------------------------------------- kern/vfs_bio.c
The source code is an exact implementation of the algorithm we just described. The only exception is line 828-829, which can be ignored since those two lines are only effective when the soft dependency facility is enabled.
4.7.3   Adjusting Buffer Size: allocbuf function
The task of allocbuf is to ensure that the buffer has enough physical memory allocated to it. The data area of each buffer was given MAXBSIZE bytes of virtual address space by the bufinit function.

allocbuf compares the size of the intended data block with the amount of physical memory already allocated to the buffer.
· If there is excess physical memory,
  – and there is a buffer available on the EMPTY list, the excess memory is put into the empty buffer, and that buffer is then inserted at the front of the AGE list;
  – if there are no buffers on the EMPTY list, the excess physical memory is retained in the original buffer.
· If the buffer has insufficient memory, it takes memory from other buffers. allocbuf does this by calling the getnewbuf function described in the previous subsection to get a second buffer and transferring the physical memory of the second buffer to the buffer under construction.
  – If there is memory remaining in the second buffer, the second buffer is released to the front of the AGE list; otherwise the second buffer is released to the EMPTY list.
  – If the new buffer still does not have enough physical memory, the process is repeated.
------------------------------------------------- kern/vfs_bio.c
677 /*
678  * Expand or contract the actual memory allocated to a buffer.
679  *
680  * If the buffer shrinks, data is lost, so it's up to the
681  * caller to have written it out *first*; this routine will not
682  * start a write.  If the buffer grows, it's the callers
683  * responsibility to fill out the buffer's additional contents.
684  */
685 void
686 allocbuf(bp, size)
687     struct buf *bp;
688     int size;
689 {
690     struct buf *nbp;
691     vsize_t desired_size;
692     int s;
693
694     desired_size = round_page((vsize_t)size);
695     if (desired_size > MAXBSIZE)
696         panic("allocbuf: buffer larger than MAXBSIZE requested");
697
698     if (bp->b_bufsize == desired_size)
699         goto out;
700
701     /*
702      * If the buffer is smaller than the desired size, we need to snarf
703      * it from other buffers.  Get buffers (via getnewbuf()), and
704      * steal their pages.
705      */
706     while (bp->b_bufsize < desired_size) {
707         int amt;
708
709         /* find a buffer */
710         while ((nbp = getnewbuf(0, 0)) == NULL)
711             ;
712
713         SET(nbp->b_flags, B_INVAL);
714         binshash(nbp, &invalhash);
715
716         /* and steal its pages, up to the amount we need */
717         amt = min(nbp->b_bufsize, (desired_size - bp->b_bufsize));
718         pagemove((nbp->b_data + nbp->b_bufsize - amt),
719             bp->b_data + bp->b_bufsize, amt);
720         bp->b_bufsize += amt;
721         nbp->b_bufsize -= amt;
722
723         /* reduce transfer count if we stole some data */
724         if (nbp->b_bcount > nbp->b_bufsize)
725             nbp->b_bcount = nbp->b_bufsize;
726
727 #ifdef DIAGNOSTIC
728         if (nbp->b_bufsize < 0)
729             panic("allocbuf: negative bufsize");
730 #endif
731
732         brelse(nbp);
733     }
734
735     /*
736      * If we want a buffer smaller than the current size,
737      * shrink this buffer.  Grab a buf head from the EMPTY queue,
738      * move a page onto it, and put it on front of the AGE queue.
739      * If there are no free buffer headers, leave the buffer alone.
740      */
741     if (bp->b_bufsize > desired_size) {
742         s = splbio();
743         if ((nbp = TAILQ_FIRST(&bufqueues[BQ_EMPTY])) == NULL) {
744             /* No free buffer head */
745             splx(s);
746             goto out;
747         }
748         bremfree(nbp);
749         SET(nbp->b_flags, B_BUSY);
750         splx(s);
751
752         /* move the page to it and note this change */
753         pagemove(bp->b_data + desired_size,
754             nbp->b_data, bp->b_bufsize - desired_size);
755         nbp->b_bufsize = bp->b_bufsize - desired_size;
756         bp->b_bufsize = desired_size;
757         nbp->b_bcount = 0;
758         SET(nbp->b_flags, B_INVAL);
759
760         /* release the newly-filled buffer and leave */
761         brelse(nbp);
762     }
763
764 out:
765     bp->b_bcount = size;
766 }
------------------------------------------------- kern/vfs_bio.c
The only additional information needed to understand the details of the above code, we think, is:
· the b_bcount member of struct buf, adjusted at line 724-725, is the transfer (valid byte) count of the buffer, which must not exceed b_bufsize, the amount of memory actually allocated to it;
· the reason the brelse function is called at line 761, instead of the buffer being put directly on the AGE list, is to wake up any process waiting for a new buffer to become available;
· the round_page macro is defined in uvm/uvm_param.h as
----------------------------------------- uvm/uvm_param.h
151 /*
152  * Round off or truncate to the nearest page.  These will work
153  * for either addresses or counts (i.e., 1 byte rounds to 1 page).
154  */
155 #define round_page(x)   (((x) + PAGE_MASK) & ~PAGE_MASK)
156 #define trunc_page(x)   ((x) & ~PAGE_MASK)
----------------------------------------- uvm/uvm_param.h
where PAGE_MASK is defined as
------------------------------------------ uvm/uvm_param.h
 96 /*
 97  * All references to the size of a page should be done with PAGE_SIZE
 98  * or PAGE_SHIFT.  The fact they are variables is hidden here so that
 99  * we can easily make them constant if we so desire.
100  */
101 #define PAGE_SIZE   uvmexp.pagesize     /* size of page */
102 #define PAGE_MASK   uvmexp.pagemask     /* size of page - 1 */
103 #define PAGE_SHIFT  uvmexp.pageshift    /* bits to shift for pages */
------------------------------------------ uvm/uvm_param.h
where uvmexp.pagesize is set up in arch/sparc64/sparc64/pmap.c as
----------------------------------- arch/sparc64/sparc64/pmap.c
467 void
468 pmap_bootstrap(kernelstart, kernelend, maxctx)
469     u_long kernelstart, kernelend;
470     u_int maxctx;
471 {
...
491     /*
492      * set machine page size
493      */
494     uvmexp.pagesize = NBPG;
495     uvmexp.ncolors = pmap_calculate_colors();
496     uvm_setpagesize();
----------------------------------- arch/sparc64/sparc64/pmap.c
where NBPG is defined to be 8192, as
--------------------------------- arch/sparc64/include/param.h
298 #define PGSHIFT     13              /* log2(NBPG) */
299 #define NBPG        (1<<PGSHIFT)    /* bytes/page */
300 #define PGOFSET     (NBPG-1)        /* byte offset into page */
--------------------------------- arch/sparc64/include/param.h
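A quick user-space check of round_page and trunc_page with the 8192-byte sparc64 page (the mask is hard-coded only for this example; in the kernel it comes from uvmexp.pagemask):

------------------------------------- example: page rounding (sketch)
#include <stdio.h>

#define PGSHIFT   13
#define NBPG      (1 << PGSHIFT)                /* 8192 bytes per page */
#define PAGE_MASK (NBPG - 1)

#define round_page(x) (((x) + PAGE_MASK) & ~(unsigned long)PAGE_MASK)
#define trunc_page(x) ((x) & ~(unsigned long)PAGE_MASK)

int
main(void)
{
    unsigned long sizes[] = { 1, 8192, 8193, 20000 };
    int i;

    for (i = 0; i < 4; i++)
        printf("size %5lu -> round_page %5lu, trunc_page %5lu\n",
            sizes[i], round_page(sizes[i]), trunc_page(sizes[i]));
    return 0;
}
------------------------------------- example: page rounding (sketch)

As the comment in uvm_param.h promises, even a 1-byte request rounds up to one full page, which is why allocbuf works in whole pages.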
Consistency of Physical Memory Mapping
The allocbuf function ensures that each physical-memory page is mapped into exactly one buffer at all times. The kernel therefore maintains consistency by purging old buffers when files are shortened or removed, as follows.
· Whenever a file is removed,
  1. the kernel traverses its list of dirty buffers;
  2. for each buffer, the kernel cancels its write request and
  3. marks the buffer invalid, so that the buffer cannot be found in the buffer pool again;
  4. each invalid buffer is put at the front of the AGE list, so that it will be reused before any buffer with potentially useful data.
· For a file that is partially truncated, only the buffers following the truncation point are invalidated.
4.7.4   Getting a Buffer: getblk function
This function is the essential step in reading a logical file block into a buffer cache. The algorithm of the getblk function is

input:  logical file block number, and vnode
output: a locked buffer that can now be used for reading the block,
        but whose contents have not been read from the filesystem yet !
{
start:
    if (the block is in the hash)
    {
        if (the buffer cache is busy ?)
        {
            if (UVM is using the block ?)
            {
                return NULL;
            }
            sleep (until the buffer becomes free);
            goto start;
        }
        mark the buffer cache busy;
        remove the buffer from its free list;
    }
    else
    {
        if (trying to get a new buffer cache failed ?)
        {
            goto start;
        }
        place the buffer cache into the hash;
        associate the buffer cache with the vnode;
    }
    ensure the buffer cache has the desired amount of physical memory;
    return the buffer cache;
}
The source code of the getblk function, in kern/vfs_bio.c, is
------------------------------------------------- kern/vfs_bio.c
604 /*
605  * Get a block of requested size that is associated with
606  * a given vnode and block offset. If it is found in the
607  * block cache, mark it as having been found, make it busy
608  * and return it. Otherwise, return an empty block of the
609  * correct size. It is up to the caller to insure that the
610  * cached blocks be of the correct size.
611  */
612 struct buf *
613 getblk(vp, blkno, size, slpflag, slptimeo)
614     struct vnode *vp;
615     daddr_t blkno;
616     int size, slpflag, slptimeo;
617 {
618     struct buf *bp;
619     int s, err;
620
621 start:
622     bp = incore(vp, blkno);
623     if (bp != NULL) {
624         s = splbio();
625         if (ISSET(bp->b_flags, B_BUSY)) {
626             if (curproc == uvm.pagedaemon_proc) {
627                 splx(s);
628                 return NULL;
629             }
630             SET(bp->b_flags, B_WANTED);
631             err = tsleep(bp, slpflag | (PRIBIO + 1), "getblk",
632                 slptimeo);
633             splx(s);
634             if (err)
635                 return (NULL);
636             goto start;
637         }
638 #ifdef DIAGNOSTIC
639         if (ISSET(bp->b_flags, B_DONE|B_DELWRI) &&
640             bp->b_bcount < size && vp->v_type != VBLK)
641             panic("getblk: block size invariant failed");
642 #endif
643         SET(bp->b_flags, B_BUSY);
644         bremfree(bp);
645         splx(s);
646     } else {
647         if ((bp = getnewbuf(slpflag, slptimeo)) == NULL)
648             goto start;
649
650         binshash(bp, BUFHASH(vp, blkno));
651         bp->b_blkno = bp->b_lblkno = bp->b_rawblkno = blkno;
652         s = splbio();
653         bgetvp(vp, bp);
654         splx(s);
655     }
656     allocbuf(bp, size);
657     return (bp);
658 }
------------------------------------------------- kern/vfs_bio.c
For your reference, we show the code of the bgetvp function, which associates the buffer cache with a vnode.
------------------------------------------------- kern/vfs_subr.c
135 /*
136  * Insq/Remq for the vnode usage lists.
137  */
138 #define bufinsvn(bp, dp)    LIST_INSERT_HEAD(dp, bp, b_vnbufs)
139 #define bufremvn(bp) {                          \
140     LIST_REMOVE(bp, b_vnbufs);                  \
141     (bp)->b_vnbufs.le_next = NOLIST;            \
142 }
...
857 /*
858  * Associate a buffer with a vnode.
859  */
860 void
861 bgetvp(vp, bp)
862     struct vnode *vp;
863     struct buf *bp;
864 {
865     int s;
866
867     if (bp->b_vp)
868         panic("bgetvp: not free, bp %p", bp);
869     VHOLD(vp);
870     s = splbio();
871     bp->b_vp = vp;
872     if (vp->v_type == VBLK || vp->v_type == VCHR)
873         bp->b_dev = vp->v_rdev;
874     else
875         bp->b_dev = NODEV;
876     /*
877      * Insert onto list for new vnode.
878      */
879     bufinsvn(bp, &vp->v_cleanblkhd);
880     splx(s);
881 }
------------------------------------------------- kern/vfs_subr.c
In line 871-873, note that the b_vp and b_dev members of the buffer cache header are set up to associate the vnode with the buffer. The buffer is inserted into the vnode's clean list by line 879. The VHOLD vnode operation used in line 869 is defined in sys/vnode.h as
-------------------------------------------------- sys/vnode.h
260 #define HOLDRELE(vp)    holdrele(vp)
261 #define VHOLD(vp)       vhold(vp)
262 #define VREF(vp)        vref(vp)
...
302 /*
303  * increase buf or page ref
304  */
305 static __inline void
306 vhold(struct vnode *vp)
307 {
308
309     simple_lock(&vp->v_interlock);
310     if ((vp->v_freelist.tqe_prev != (struct vnode **)0xdeadb) &&
311         vp->v_holdcnt == 0 && vp->v_usecount == 0) {
312         simple_lock(&vnode_free_list_slock);
313         TAILQ_REMOVE(&vnode_free_list, vp, v_freelist);
314         TAILQ_INSERT_TAIL(&vnode_hold_list, vp, v_freelist);
315         simple_unlock(&vnode_free_list_slock);
316     }
317     vp->v_holdcnt++;
318     simple_unlock(&vp->v_interlock);
319 }
-------------------------------------------------- sys/vnode.h
The VHOLD vnode operation marks the vnode as active by incrementing vp->v_holdcnt and moving the vnode from the free list to the hold list. Once on the hold list, the vnode will not be recycled until it is released with the holdrele function.
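As a hedged illustration of the pairing (ours, not from the NetBSD source), a subsystem that stashes a reference to a vnode brackets its use with VHOLD and HOLDRELE, both shown in the listing above:

------------------------------------- example: holding a vnode (sketch)
/* Hedged in-kernel fragment: keep vp off the free list while we reference it. */
VHOLD(vp);          /* v_holdcnt++; the vnode moves to the hold list */
/* ... the buffer (or page) now refers to vp ... */
HOLDRELE(vp);       /* drop the hold; the vnode may be recycled again */
------------------------------------- example: holding a vnode (sketch)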

4.8     Allocating and Reading Filesystem with Buffer Cache
From this section on, the description is kept as brief as possible, so that we can improve the speed of the analysis.
------------------------------------------------- kern/vfs_bio.c
186 static __inline struct buf *
187 bio_doread(vp, blkno, size, cred, async)
188     struct vnode *vp;
189     daddr_t blkno;
190     int size;
191     struct ucred *cred;
192     int async;
193 {
194     struct buf *bp;
195     struct proc *p = (curproc != NULL ? curproc : &proc0);  /* XXX */
196
197     bp = getblk(vp, blkno, size, 0, 0);
198
199     /*
200      * If buffer does not have data valid, start a read.
201      * Note that if buffer is B_INVAL, getblk() won't return it.
202      * Therefore, it's valid if it's I/O has completed or been delayed.
203      */
204     if (!ISSET(bp->b_flags, (B_DONE | B_DELWRI))) {
205         /* Start I/O for the buffer. */
206         SET(bp->b_flags, B_READ | async);
207         VOP_STRATEGY(bp);
208
209         /* Pay for the read. */
210         p->p_stats->p_ru.ru_inblock++;
211     } else if (async) {
212         brelse(bp);
213     }
214
215     return (bp);
216 }
------------------------------------------------- kern/vfs_bio.c
line 197
    gets the buffer containing the block, or a new buffer, from the buffer cache hash. This buffer is locked and on the hash list, but not on a free list.
line 204
    checks whether the buffer already contains valid data for the desired block.
line 207
    calls the filesystem strategy routine. If the target filesystem is the Fast Filesystem, ufs_strategy is called. VOP_STRATEGY is defined as
------------------------------------------ sys/vnode_if.h
1606 static __inline int VOP_STRATEGY(bp)
1607     struct buf *bp;
1608 {
1609     struct vop_strategy_args a;
1610     a.a_desc = VDESC(vop_strategy);
1611     a.a_bp = bp;
1612     return (VCALL(bp->b_vp, VOFFSET(vop_strategy), &a));
1613 }
------------------------------------------ sys/vnode_if.h
line 211-212
    Why should this buffer be returned to a free list? Because an asynchronous read was requested for a block that is already in the buffer cache, the caller will not wait on the returned buffer, so these lines release it back to a free list.
----------------------------------------------- ufs/ufs/ufs_vnops.c
1655 /*
1656  * Calculate the logical to physical mapping if not done already,
1657  * then call the device strategy routine.
1658  */
1659 int
1660 ufs_strategy(void *v)
1661 {
1662     struct vop_strategy_args /* {
1663         struct buf *a_bp;
1664     } */ *ap = v;
1665     struct buf      *bp;
1666     struct vnode    *vp;
1667     struct inode    *ip;
1668     int             error;
1669
1670     bp = ap->a_bp;
1671     vp = bp->b_vp;
1672     ip = VTOI(vp);
1673     if (vp->v_type == VBLK || vp->v_type == VCHR)
1674         panic("ufs_strategy: spec");
1675     KASSERT(bp->b_bcount != 0);
1676     if (bp->b_blkno == bp->b_lblkno) {
1677         error = VOP_BMAP(vp, bp->b_lblkno, NULL, &bp->b_blkno,
1678             NULL);
1679         if (error) {
1680             bp->b_error = error;
1681             bp->b_flags |= B_ERROR;
1682             biodone(bp);
1683             return (error);
1684         }
1685         if ((long)bp->b_blkno == -1) /* no valid data */
1686             clrbuf(bp);
1687     }
1688     if ((long)bp->b_blkno < 0) { /* block is not on disk */
1689         biodone(bp);
1690         return (0);
1691     }
1692     vp = ip->i_devvp;
1693     bp->b_dev = vp->v_rdev;
1694     VOCALL (vp->v_op, VOFFSET(vop_strategy), ap);
1695     return (0);
1696 }
----------------------------------------------- ufs/ufs/ufs_vnops.c
line 1672
    obtains a pointer to the inode. The definition of VTOI is

----------------------------------------- ufs/ufs/inode.h
191 /* Convert between inode pointers and vnode pointers. */
192 #define VTOI(vp)    ((struct inode *)(vp)->v_data)
193 #define ITOV(ip)    ((ip)->i_vnode)
----------------------------------------- ufs/ufs/inode.h

line 1677
    converts the logical block number of the file, relative to the beginning of the file, into the physical block number of the filesystem, relative to the beginning of the partition. The b_lblkno member contains the logical block number of the file associated with the vnode; the b_blkno member contains the physical block number within the filesystem.
line 1686
    clears the buffer's data area. The clrbuf macro is defined in sys/buf.h.
line 1692-1694
    obtains the vnode for the device underlying the filesystem, such as the ccd pseudo-device driver or the generic SCSI layer. The strategy function of that driver or layer is then called via the specfs virtual filesystem layer ! If the strategy function of the generic SCSI layer is called, it in turn calls the start function of a SCSI device driver, such as an Adaptec Fast-Wide or LSI Logic controller, using the physio function of kern/kern_physio.c.

The biodone function is called by a device driver to mark I/O complete on a buffer that has just been read or written.
------------------------------------------------- kern/vfs_bio.c
869 /*
870  * Mark I/O complete on a buffer.
871  *
872  * If a callback has been requested, e.g. the pageout
873  * daemon, do so. Otherwise, awaken waiting processes.
874  *
875  * [ Leffler, et al., says on p.247:
876  *    "This routine wakes up the blocked process, frees the buffer
877  *    for an asynchronous write, or, for a request by the pagedaemon
878  *    process, invokes a procedure specified in the buffer structure" ]
879  *
880  * In real life, the pagedaemon (or other system processes) wants
881  * to do async stuff to, and doesn't want the buffer brelse()'d.
882  * (for swap pager, that puts swap buffers on the free lists (!!!),
883  * for the vn device, that puts malloc'd buffers on the free lists!)
884  */
885 void
886 biodone(bp)
887     struct buf *bp;
888 {
889     int s = splbio();
890
891     if (ISSET(bp->b_flags, B_DONE))
892         panic("biodone already");
893     SET(bp->b_flags, B_DONE);           /* note that it's done */
894
895     if (LIST_FIRST(&bp->b_dep) != NULL && bioops.io_complete)
896         (*bioops.io_complete)(bp);
897
898     if (!ISSET(bp->b_flags, B_READ))    /* wake up reader */
899         vwakeup(bp);
900
901     if (ISSET(bp->b_flags, B_CALL)) {   /* if necessary, call out */
902         CLR(bp->b_flags, B_CALL);       /* but note callout done */
903         (*bp->b_iodone)(bp);
904     } else {
905         if (ISSET(bp->b_flags, B_ASYNC))    /* if async, release */
906             brelse(bp);
907         else {                              /* or just wakeup the buffer */
908             CLR(bp->b_flags, B_WANTED);
909             wakeup(bp);
910         }
911     }
912
913     splx(s);
914 }
------------------------------------------------- kern/vfs_bio.c
line 898-899
    Recall that the B_READ flag is set by the bio_doread function. If this flag is not set, biodone has been called after a write, so the number of pending writes recorded in the vnode structure is decreased. The definition of the vwakeup function is

--------------------------------------- kern/vfs_subr.c
630 /*
631  * Update outstanding I/O count and do wakeup if requested.
632  */
633 void
634 vwakeup(bp)
635     struct buf *bp;
636 {
637     struct vnode *vp;
638
639     if ((vp = bp->b_vp) != NULL) {
640         if (--vp->v_numoutput < 0)
641             panic("vwakeup: neg numoutput, vp %p", vp);
642         if ((vp->v_flag & VBWAIT) && vp->v_numoutput <= 0) {
643             vp->v_flag &= ~VBWAIT;
644             wakeup((caddr_t)&vp->v_numoutput);
645         }
646     }
647 }
--------------------------------------- kern/vfs_subr.c

line 905-906
    returns a buffer that has just finished an asynchronous read to a free list, since it is not used immediately. Remember that the brelse function clears the B_BUSY flag, which was set by the getblk function.
line 907-910
    simply wakes up any process waiting in the biowait function for the completion of the I/O.
4.8.1   Just Read: bread function
The filesystem allocates and fills buffers by calling the bread function. The bread function
· takes a vnode, a logical block number, and a size, and
· returns a pointer to a locked buffer.
It is important to remember that any other process that tries to obtain the buffer will be put to sleep until the buffer is released.
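As a hedged sketch of how a filesystem typically calls it (the variable names are ours; NOCRED means "no credential", and the error handling mirrors the classic BSD pattern rather than any specific NetBSD routine):

------------------------------------- example: reading a block with bread (sketch)
/* Hedged in-kernel fragment: read one logical block of a file, use it, release it. */
struct buf *bp;
int error;

error = bread(vp, lblkno, blocksize, NOCRED, &bp);
if (error) {
    brelse(bp);             /* release the buffer even on failure */
    return (error);
}
/* ... inspect or copy bp->b_data, e.g. with uiomove() ... */
brelse(bp);                 /* unlock the buffer and return it to a free list */
------------------------------------- example: reading a block with bread (sketch)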
------------------------------------------------- kern/vfs_bio.c
218 /*
219  * Read a disk block.
220  * This algorithm described in Bach (p.54).
221  */
222 int
223 bread(vp, blkno, size, cred, bpp)
224     struct vnode *vp;
225     daddr_t blkno;
226     int size;
227     struct ucred *cred;
228     struct buf **bpp;
229 {
230     struct buf *bp;
231
232     /* Get buffer for block. */
233     bp = *bpp = bio_doread(vp, blkno, size, cred, 0);
234
235     /* Wait for the read to complete, and return result. */
236     return (biowait(bp));
237 }
------------------------------------------------- kern/vfs_bio.c
line 236
    Remember that this is a synchronous read, not an asynchronous one. The difference is that a synchronous read waits for the read operation from the filesystem to complete, whereas an asynchronous read does not wait. For the asynchronous variant, see the breadn function.
------------------------------------------------- kern/vfs_bio.c
844 /*
845  * Wait for operations on the buffer to complete.
846  * When they do, extract and return the I/O's error value.
847  */
848 int
849 biowait(bp)
850     struct buf *bp;
851 {
852     int s;
853
854     s = splbio();
855     while (!ISSET(bp->b_flags, B_DONE | B_DELWRI))
856         tsleep(bp, PRIBIO + 1, "biowait", 0);
857     splx(s);
858
859     /* check for interruption of I/O (e.g. via NFS), then errors. */
860     if (ISSET(bp->b_flags, B_EINTR)) {
861         CLR(bp->b_flags, B_EINTR);
862         return (EINTR);
863     } else if (ISSET(bp->b_flags, B_ERROR))
864         return (bp->b_error ? bp->b_error : EIO);
865     else