Let's build an OS: Booting Something

Let’s go on an adventure. I’ve learnt a lot more Rust over the last year, and I want to get back into writing properly, so my plan is to write a Linux Operating System. While writing it, I’ll be taking notes in my repo - https://github.com/sinkingpoint/qos/tree/main/notes - and every now and then formalising them into more structured blog posts over here, once I’ve learnt enough to make something interesting.

Welcome to the first of such formalisations: Getting something booting.

I’m a bit of a rabbit hole learner. I start one place and easily get distracted into others. It helps to keep an end goal in mind in order to act as a north star, so getting something booting seems like a noble first goal to work towards. Let’s see what rabbit holes we can find.

What to boot?

When one wants to boot something, it helps to have something to boot. Well that’s easy - I’m writing this on my laptop, and that’s already booted something. But what has it booted? Now there’s a good question. Let’s ask:

$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.6.8-200.fc39.x86_64 root=UUID=e2cd75ff-3ee9-41ce-b23f-28f7d78f4a4f ro rootflags=subvol=root rd.luks.uuid=luks-f44121b1-5be9-47d8-a206-72c52309e7dd rhgb quiet

/proc/cmdline contains all the kernel parameters that my laptop booted with. BOOT_IMAGE= there looks promising!

🐇 Rabbithole: Kernel Parameters

While we’re really only interested in one parameter (BOOT_IMAGE) there, it can’t hurt to enumerate the rest.

root=UUID=e2cd75ff-3ee9-41ce-b23f-28f7d78f4a4f

root allows us to specify what disk should be mounted as our root filesystem. The structure of this one is interesting to note however - UUID= isn’t something that the kernel specifically understands. Instead, the variable is read and interpreted by the initramfs (Spoilers!) in order to find the disk. This in particular is telling systemd to boot the disk identified by UUID e2cd75ff-3ee9-41ce-b23f-28f7d78f4a4f. Other valid options include LABEL, PARTLABEL, PARTUUID, and ID (see: https://man7.org/linux/man-pages/man8/mount.8.html).

ro

ro tells the kernel to mount the root file system as read-only when we boot.

rootflags=subvol=root

rootflags allows us to send specific options when mounting the filesystem. In particular, it’s the data argument to the mount syscall. subvol=root tells the call what BTRFS subvolume to mount.

rd.luks.uuid

rd.luks.uuid isn’t interpreted by the kernel at all, and is instead understood by systemd-cryptsetup-generator to indicate which LUKS device to activate when booting. The rd. at the front indicates that it’s only handled by the initramfs (rd=“ram disk”).

rhgb & quiet

While technically two separate arguments, these work hand in hand. rhgb enables the “Red Hat Graphical Boot” mode, and quiet suppresses most of the kernel’s log messages during boot. rhgb is Red Hat/Fedora specific, while quiet is a standard kernel parameter; together they allow for a nice splash screen when booting, rather than 👻spooky👻 kernel messages.
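
To make the shape of that command line a bit more concrete, here’s a rough sketch in Python (my choice purely for illustration) that pulls it apart into name/value pairs. The function name parse_cmdline is made up, and shlex’s quoting rules are only an approximation of what the kernel does with double quotes, so treat this as a toy rather than a reimplementation.

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into (name, value) pairs.

    Parameters without an '=' (like `ro` or `quiet`) get a value of None.
    A list of pairs is used instead of a dict because a parameter can
    appear more than once (e.g. two `console=` entries).
    """
    params = []
    for word in shlex.split(cmdline):
        # Only the first '=' separates name from value, so
        # `root=UUID=...` keeps `UUID=...` intact as the value.
        name, sep, value = word.partition("=")
        params.append((name, value if sep else None))
    return params


example = "BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.6.8-200.fc39.x86_64 ro rootflags=subvol=root rhgb quiet"
for name, value in parse_cmdline(example):
    print(name, "->", value)
```

Feeding it the contents of /proc/cmdline instead of the example string should give one pair per parameter described above.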

BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.6.8-200.fc39.x86_64 - so my laptop is booting something called “vmlinuz-6.6.8-200.fc39.x86_64” from a disk ((hd0,gpt2): hard drive 0, partition 2 of its “GUID Partition Table” (GPT)). Let’s see if we can find that.

$ find / -name vmlinuz-6.6.8-200.fc39.x86_64
/boot/vmlinuz-6.6.8-200.fc39.x86_64

$ file /boot/vmlinuz-6.6.8-200.fc39.x86_64
/boot/vmlinuz-6.6.8-200.fc39.x86_64: Linux kernel x86 boot executable bzImage, version 6.6.8-200.fc39.x86_64 (mockbuild@f2936e05dca94a129acf79933fec484d) #1 SMP PREEMPT_DYNAMIC Thu Dec 21 04:01:49 UTC 2023, RO-rootFS, swap_dev 0XD, Normal VGA

A “boot executable”? Nice. That sounds like something we can boot! And, ah! That “6.6.8-200.fc39” indicates the Linux kernel version that I’m running (kernel 6.6.8, Fedora package release 200, built for Fedora 39). Now… how do we boot it?

🐇 Rabbithole: Other boot files

If we look in /boot for other “6.6.8-200.fc39.x86_64” related files, we get a few. Let’s take a look at them.

/boot/config-6.6.8-200.fc39.x86_64

$ file /boot/config-6.6.8-200.fc39.x86_64
/boot/config-6.6.8-200.fc39.x86_64: Linux make config build file, ASCII text

$ head /boot/config-6.6.8-200.fc39.x86_64 
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 6.6.8-200.fc39.x86_64 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (GCC) 13.2.1 20231205 (Red Hat 13.2.1-6)"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=130201
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=24000

This contains the Kernel Configuration that my Kernel was built with.

/boot/initramfs-6.6.8-200.fc39.x86_64.img

$ sudo file /boot/initramfs-6.6.8-200.fc39.x86_64.img
/boot/initramfs-6.6.8-200.fc39.x86_64.img: ASCII cpio archive (SVR4 with no CRC)

A CPIO archive? Never heard of that one before. I do know what initramfs is though. This contains the “Initial Ram File system” (initramfs) that my system booted with. That makes sense! We found in our other rabbithole that our Kernel parameters contained a few flags that the documentation said were interpreted by an initramfs - this must be that!

From https://wiki.gentoo.org/wiki/Initramfs:

An initramfs (initial ram file system) is used to prepare Linux systems during boot before the init process starts.

That sounds like something to do with the boot process though, so let’s return once we’ve got something booting.

/boot/symvers-6.6.8-200.fc39.x86_64.xz

$ file /boot/symvers-6.6.8-200.fc39.x86_64.xz
/boot/symvers-6.6.8-200.fc39.x86_64.xz: symbolic link to /lib/modules/6.6.8-200.fc39.x86_64/symvers.xz

$ unxz ./symvers-6.6.8-200.fc39.x86_64.xz

$ file symvers-6.6.8-200.fc39.x86_64 
symvers-6.6.8-200.fc39.x86_64: ASCII text

$ head symvers-6.6.8-200.fc39.x86_64 
0x00000000      system_state    vmlinux EXPORT_SYMBOL
0x00000000      static_key_initialized  vmlinux EXPORT_SYMBOL_GPL
0x00000000      reset_devices   vmlinux EXPORT_SYMBOL
0x00000000      loops_per_jiffy vmlinux EXPORT_SYMBOL
0x00000000      init_uts_ns     vmlinux EXPORT_SYMBOL_GPL
0x00000000      wait_for_initramfs      vmlinux EXPORT_SYMBOL_GPL
0x00000000      init_task       vmlinux EXPORT_SYMBOL
0x00000000      cc_platform_has vmlinux EXPORT_SYMBOL_GPL
0x00000000      cc_mkdec        vmlinux EXPORT_SYMBOL_GPL
0x00000000      tdx_kvm_hypercall       vmlinux EXPORT_SYMBOL_GPL

From https://www.kernel.org/doc/html/latest/kbuild/modules.html:

Module.symvers contains a list of all exported symbols from a kernel build.

Module.symvers contains all exported symbols from the kernel and compiled modules. For each symbol, the corresponding CRC value is also stored.

According to the above link, the structure is:

<CRC>       <Symbol>         <Module>                         <Export Type>

Let’s take one:

0x00000000      cc_platform_has vmlinux EXPORT_SYMBOL_GPL

That line means a CRC of 0x00000000, a symbol name of cc_platform_has, a module of vmlinux (i.e. the kernel itself), and an export type of EXPORT_SYMBOL_GPL, so it can only be used from GPL licensed modules. We can even find where that comes from! https://github.com/torvalds/linux/blob/fbafc3e621c3f4ded43720fdb1d6ce1728ec664e/arch/x86/coco/core.c#L111

Could have fooled me. My guess at the first field was a memory address. Weird. My CRCs are all 0x00000000? Ah:

For a kernel build without CONFIG_MODVERSIONS enabled, the CRC would read 0x00000000.

And indeed:

$ grep CONFIG_MODVERSIONS /boot/config-6.6.8-200.fc39.x86_64 
# CONFIG_MODVERSIONS is not set
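
For the curious, the four-column layout above is easy enough to pick apart. This is a small Python sketch, with SymversEntry and parse_symvers_line being names I’ve invented for it. It splits on any whitespace and only looks at the first four fields, on the assumption that anything after the export type (newer kernels can add a namespace column) can be ignored here.

```python
from typing import NamedTuple


class SymversEntry(NamedTuple):
    crc: int
    symbol: str
    module: str
    export_type: str

    @property
    def gpl_only(self):
        # EXPORT_SYMBOL_GPL symbols are only usable from GPL licensed modules.
        return self.export_type.endswith("_GPL")


def parse_symvers_line(line):
    """Parse one `<CRC> <Symbol> <Module> <Export Type>` line."""
    crc, symbol, module, export_type = line.split()[:4]
    return SymversEntry(int(crc, 16), symbol, module, export_type)


entry = parse_symvers_line("0x00000000      cc_platform_has vmlinux EXPORT_SYMBOL_GPL")
print(entry, entry.gpl_only)
```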

/boot/System.map-6.6.8-200.fc39.x86_64

$ sudo file /boot/System.map-6.6.8-200.fc39.x86_64
/boot/System.map-6.6.8-200.fc39.x86_64: ASCII text

$ sudo head /boot/System.map-6.6.8-200.fc39.x86_64 
0000000000000000 D __per_cpu_start
0000000000000000 D fixed_percpu_data
0000000000001000 D cpu_debug_store
0000000000002000 D irq_stack_backing_store
0000000000006000 D cpu_tss_rw
000000000000b000 D gdt_page
000000000000c000 d exception_stacks
0000000000018000 d entry_stack_storage
0000000000019000 D espfix_waddr
0000000000019008 D espfix_stack

Of all the files here, this one’s the one with a Wikipedia page. It must be important. In fact, this does what I thought at first glance the symvers file did - it provides a mapping from symbols to addresses in kernel space. The middle column there is the “type” of the symbol. This allows me to say, if I wanted to call the soft_restart_cpu function, that I should jump to address 0xffffffff81000400. Useful!
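
Here’s a minimal Python sketch of that lookup, assuming the three-column “address type name” layout shown above. parse_system_map is a made-up name, and if a symbol name appears more than once this keeps the last one, which is a simplification.

```python
def parse_system_map(text):
    """Map symbol name -> (address, type letter) from System.map text."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that isn't "address type name"
        address, kind, name = parts
        symbols[name] = (int(address, 16), kind)
    return symbols


sample = """\
0000000000000000 D __per_cpu_start
000000000000b000 D gdt_page
000000000000c000 d exception_stacks
"""
address, kind = parse_system_map(sample)["gdt_page"]
print(hex(address), kind)
```

The type letters follow the nm convention: uppercase means a global symbol and lowercase a local one, with D/d marking the initialised data section.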

Well, my laptop has already booted that boot image, and while I could start fuzzing around in those files to start building something, having to restart my laptop every time I wanted to test a change sounds like a real bear. To avoid that, I can boot things in a Virtual Machine - much nicer. We manage Virtual Machines with a hypervisor, and I’ll need to pick one. But there are so many! The one I’m most familiar with is qemu, so for familiarity’s sake more than anything else, let’s use that. It’s a decent choice - it has a nice Command Line Interface, and it supports KVM which allows it to be 🏃fast🏃 .

Let’s try this:

$ qemu-system-x86_64 -kernel ./vmlinuz-6.6.8-200.fc39.x86_64
...
[    1.101571] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    1.105553] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.8-200.fc39.x86_64 #1
[    1.107942] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
[    1.111619] Call Trace:
...

Fantastic! We just booted a kernel. It immediately crashed (or “panicked”, because we didn’t give it a drive, or an initramfs, or… anything really), but by golly it actually booted.

🐇 Rabbithole: Useful qemu flags

While the above command works well enough, there’s a few more flags we can chuck on the end to make it a bit nicer to work with.

-display none allows us to stop the extra window that qemu opens.
-serial stdio -append "console=ttyAMA0 console=ttyS0" allows us to redirect the output of our virtual machine back to the terminal we ran qemu on.
--enable-kvm enables using KVM, which gives us a notable speed increase.
-m 2G gives our VM 2G of memory, rather than the default 128MB. We probably won’t need all that memory to start with, but we might as well have it to prepare for the future.

Those give us an actual final command of:

qemu-system-x86_64 -kernel ./vmlinuz-6.6.8-200.fc39.x86_64 -display none -serial stdio -append "console=ttyAMA0 console=ttyS0" --enable-kvm -m 2G
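
Since that command is only going to grow, here’s one way it could be assembled programmatically. This is a hypothetical Python helper (qemu_command and its parameters are my own invention) that uses nothing but the flags listed above, and it only builds the argument list rather than running anything.

```python
import shlex


def qemu_command(kernel, initrd=None, memory="2G", kvm=True):
    """Build the qemu argument list from the flags described above."""
    argv = ["qemu-system-x86_64", "-kernel", kernel]
    if initrd is not None:
        argv += ["-initrd", initrd]
    argv += [
        "-display", "none",
        "-serial", "stdio",
        # One argv element: qemu passes the whole string to the kernel.
        "-append", "console=ttyAMA0 console=ttyS0",
    ]
    if kvm:
        argv.append("--enable-kvm")
    argv += ["-m", memory]
    return argv


# shlex.join quotes the -append value so the line can be pasted into a shell.
print(shlex.join(qemu_command("./vmlinuz-6.6.8-200.fc39.x86_64")))
```

Keeping it as a list means it could be handed straight to something like subprocess.run without worrying about shell quoting.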

Now what?

An initramfs

Let’s go back to those files in /boot. Another one stood out there, /boot/initramfs-6.6.8-200.fc39.x86_64.img. Now, I happen to know that after the kernel starts it needs something to mount, and an initial ram file system (“initramfs”) is a perfect candidate for that. Let’s give it a try:

qemu-system-x86_64 -kernel ./vmlinuz-6.6.8-200.fc39.x86_64 -initrd ./initramfs-6.6.8-200.fc39.x86_64.img -display none -serial stdio -append "console=ttyAMA0 console=ttyS0" --enable-kvm -m 2G

Note here the new -initrd flag pointing to our initramfs. Running that, we get presented with a bunch of systemd start logs. My gosh it works! We just booted my laptop, on my laptop. Meta.

But I don’t want to just run someone else’s initramfs, I want to make my own. So what does one look like?

$ file initramfs-6.6.8-200.fc39.x86_64.img 
initramfs-6.6.8-200.fc39.x86_64.img: ASCII cpio archive (SVR4 with no CRC)

🐇 Rabbithole: The structure of an initramfs

A “CPIO archive”, huh? Never heard of them. They seem very similar to tar files - a collection of files bundled into one. I even seem to have a tool installed for extracting them!

$ cpio -i < initramfs-6.6.8-200.fc39.x86_64.img 
416 blocks

$ tree
.
├── early_cpio
└── kernel
    └── x86
        └── microcode
            └── GenuineIntel.bin

4 directories, 2 files

Wait what? Where’s our file system? And also, my initramfs file is 39 megabytes - that bin file is only 207 kilobytes. Is CPIO really that wasteful to have a 180x overhead? Something’s not quite right here. Let’s at least look at what we do have.

We’ve got two files here:

$ file early_cpio kernel/x86/microcode/GenuineIntel.bin 
early_cpio:                            ASCII text
kernel/x86/microcode/GenuineIntel.bin: data

$ cat early_cpio 
1

So, our “early_cpio” file contains a single “1” in it, and our “GenuineIntel.bin” contains some random junk (microcode by the folder name, and the fact that it says “Intel”). What does that early_cpio file do? There’s no reference to it in the Linux source, but we can find earlycpio.c that seems to be called from the microcode loader. We can even find where it comes from, but as far as I can tell this file is purely informational - please correct me!

So this cpio is loaded before the real initramfs in order to load the microcode onto my CPU. So where’s the real one?

Let’s take a closer look:

$ binwalk initramfs-6.6.8-200.fc39.x86_64.img 

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             ASCII cpio archive (SVR4 with no CRC), file name: ".", file name length: "0x00000002", file size: "0x00000000"
112           0x70            ASCII cpio archive (SVR4 with no CRC), file name: "early_cpio", file name length: "0x0000000B", file size: "0x00000002"
240           0xF0            ASCII cpio archive (SVR4 with no CRC), file name: "kernel", file name length: "0x00000007", file size: "0x00000000"
360           0x168           ASCII cpio archive (SVR4 with no CRC), file name: "kernel/x86", file name length: "0x0000000B", file size: "0x00000000"
484           0x1E4           ASCII cpio archive (SVR4 with no CRC), file name: "kernel/x86/microcode", file name length: "0x00000015", file size: "0x00000000"
616           0x268           ASCII cpio archive (SVR4 with no CRC), file name: "kernel/x86/microcode/GenuineIntel.bin", file name length: "0x00000026", file size: "0x00033C00"
212732        0x33EFC         ASCII cpio archive (SVR4 with no CRC), file name: "TRAILER!!!", file name length: "0x0000000B", file size: "0x00000000"
212992        0x34000         gzip compressed data, maximum compression, from Unix, last modified: 1970-01-01 00:00:00 (null date)
8109851       0x7BBF1B        gzip compressed data, from Unix, last modified: 1970-01-01 00:00:00 (null date)
11000563      0xA7DAF3        xz compressed data
13377833      0xCC2129        xz compressed data
13884598      0xD3DCB6        xz compressed data
13907055      0xD4346F        xz compressed data
13912976      0xD44B90        xz compressed data
13966468      0xD51C84        xz compressed data
24097481      0x16FB2C9       Certificate in DER format (x509 v3), header length: 4, sequence length: 17280
25197943      0x1807D77       CRC32 polynomial table, little endian

Ah hah! Our initramfs is being sneaky 😏. It’s not just a CPIO archive but a CPIO archive plus a gzipped file bolted on the end! What’s that doing there?

$ (cpio -i; cat > second.gz) < initramfs-6.6.8-200.fc39.x86_64.img 
416 blocks

$ file second.gz
second.gz: gzip compressed data, max compression, from Unix, original size modulo 2^32 82102784

$ gunzip second.gz   

$ file second 
second: ASCII cpio archive (SVR4 with no CRC)

It’s a second CPIO file, this time compressed!
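
The same trick can be sketched by hand. The “SVR4 with no CRC” (newc) format that file reports is a 110 byte ASCII header (the magic 070701 followed by 13 fields of 8 hex digits), then the NUL-terminated file name, then the file data, with name and data each padded out to a 4 byte boundary. Below is a rough Python sketch that walks the first archive up to its TRAILER!!! entry and then looks for a gzip stream after it. The function names are mine, and I’m assuming the gap between the two archives is just NUL padding.

```python
import zlib

HEADER_LEN = 110  # 6-byte magic + 13 fields of 8 hex digits each


def align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def walk_newc(data):
    """Return ([(name, size), ...], end_offset) for a newc archive at offset 0."""
    entries = []
    offset = 0
    while True:
        header = data[offset:offset + HEADER_LEN]
        if header[:6] != b"070701":
            raise ValueError(f"no newc header at offset {offset}")
        fields = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
        file_size, name_size = fields[6], fields[11]
        name_start = offset + HEADER_LEN
        # name_size counts the trailing NUL, which we drop.
        name = data[name_start:name_start + name_size - 1].decode()
        data_start = align4(name_start + name_size)
        offset = align4(data_start + file_size)
        if name == "TRAILER!!!":
            return entries, offset
        entries.append((name, file_size))


def split_initramfs(data):
    """Walk the first archive, then decompress whatever gzip stream follows it."""
    entries, end = walk_newc(data)
    while end < len(data) and data[end] == 0:
        end += 1  # skip the NUL padding after the trailer
    rest = data[end:]
    if rest[:2] == b"\x1f\x8b":  # gzip magic bytes
        # wbits=31 tells zlib to expect a gzip wrapper.
        rest = zlib.decompressobj(wbits=31).decompress(rest)
    return entries, rest
```

Running split_initramfs over the bytes of the initramfs image should give the early microcode entries plus the decompressed second archive, which walk_newc can then walk in turn.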

Let’s extract them:

$ (cpio -i; gunzip | cpio -i) < ../initramfs-6.6.8-200.fc39.x86_64.img
416 blocks
cpio: dev/console: Cannot mknod: Operation not permitted
cpio: dev/kmsg: Cannot mknod: Operation not permitted
cpio: dev/null: Cannot mknod: Operation not permitted
cpio: dev/random: Cannot mknod: Operation not permitted
cpio: dev/urandom: Cannot mknod: Operation not permitted
160357 blocks

$ ls
bin  dev  early_cpio  etc  init  kernel  lib  lib64  proc  root  run  sbin  shutdown  sys  sysroot  tmp  usr  var

A bunch of “Operation not permitted” errors because we can’t mknod, but wahey! We have an honest to god file system in here! Hey, what happens once the file system is mounted? Booting it above started systemd - how did that happen? Let’s look at the code.

The first line there seems promising: if (ramdisk_execute_command) { - we’re in a ram disk! Where does that come from? Well, apparently it comes from two places - the default of "/init", or dynamically from the rdinit kernel parameter. We learnt before that we don’t have an rdinit in our kernel parameters so we must be using the default! What’s that?

$ file init 
init: symbolic link to usr/lib/systemd/systemd

Well shucks, it’s systemd! Exactly what we saw when we booted it with qemu.

So, our initramfs is a CPIO archived file system, with /init being the executable that gets run. Shall we make one? Let’s start simple - booting into a shell. We can assemble a tree, and turn that into a CPIO archive:

$ mkdir tree
mkdir: created directory 'tree'

$ cp /bin/sh tree/init

tree$ find . | cpio -c -o > initramfs
cpio: File ./initramfs grew, 1439744 new bytes not copied
5625 blocks

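(That “File ./initramfs grew” warning is cpio noticing that the archive it’s writing lives inside the tree it’s archiving, so find hands it a partial copy of itself. Writing the output somewhere outside the tree would avoid that.) Let’s try and boot our new initramfs: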

$ qemu-system-x86_64 -kernel ./vmlinuz-6.6.8-200.fc39.x86_64 -initrd ./tree/initramfs -display none -serial stdio -append "console=ttyAMA0 console=ttyS0" --enable-kvm
...
[    1.150096] Run /init as init process
[    1.150782] Failed to execute /init (error -2)
[    1.151436] Run /sbin/init as init process
[    1.152056] Run /etc/init as init process
[    1.152663] Run /bin/init as init process
[    1.153301] Run /bin/sh as init process
[    1.153947] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.

Oh 🙁 errno says a -2 is ENOENT. Our init process can’t be found? But it found it to run it? Ah! Shared libraries. We need a few of those - let’s add them.

$ ldd tree/init
    linux-vdso.so.1 (0x00007ffcb718e000)
    libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007fed6c3a9000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fed6c1c7000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fed6c563000)
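
That last line of the ldd output is the interesting one: it’s the dynamic loader, and its path is baked into the binary itself as a PT_INTERP program header. If the kernel can’t find that file, execve fails with ENOENT, which would explain the confusing “Failed to execute /init (error -2)”. As a sketch (assuming a little-endian 64-bit ELF, with elf_interpreter being my own name for it), we can read that path out in Python:

```python
import struct
from typing import Optional

PT_INTERP = 3  # program header type that names the dynamic loader


def elf_interpreter(data: bytes) -> Optional[str]:
    """Return the PT_INTERP path of a little-endian ELF64 image, or None."""
    # e_ident[4] == 2 means 64-bit, e_ident[5] == 1 means little-endian.
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    # e_phoff lives at byte 32; e_phentsize and e_phnum at bytes 54 and 56.
    (phoff,) = struct.unpack_from("<Q", data, 32)
    phentsize, phnum = struct.unpack_from("<HH", data, 54)
    for i in range(phnum):
        base = phoff + i * phentsize
        p_type, _flags, p_offset = struct.unpack_from("<IIQ", data, base)
        (p_filesz,) = struct.unpack_from("<Q", data, base + 32)
        if p_type == PT_INTERP:
            return data[p_offset:p_offset + p_filesz].rstrip(b"\0").decode()
    return None  # no interpreter, e.g. a statically linked binary
```

Passing it the bytes of tree/init should return the same /lib64/ld-linux-x86-64.so.2 path that ldd printed, while a statically linked shell would return None and not need any of this.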

$ tree
.
├── init
└── lib64
    ├── ld-linux-x86-64.so.2
    ├── libc.so.6
    └── libtinfo.so.6

$ find . | cpio -c -o > initramfs
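
There’s no magic in what cpio writes there, either. Here’s a hedged Python sketch of producing the same newc layout by hand: a 110 byte ASCII header, the NUL-terminated name, then the data, each padded to 4 bytes, ending with a TRAILER!!! entry. The names newc_entry and build_initramfs are mine, everything is owned by uid/gid 0 with an mtime of 0, and I haven’t tried booting the result, so take it as an illustration of the format rather than a cpio replacement.

```python
import stat


def newc_entry(name, mode, body=b"", ino=0):
    """Encode one archive member: header, NUL-terminated name, data."""
    name_bytes = name.encode() + b"\0"
    # Field order: ino, mode, uid, gid, nlink, mtime, filesize,
    # devmajor, devminor, rdevmajor, rdevminor, namesize, check.
    fields = [ino, mode, 0, 0, 1, 0, len(body), 0, 0, 0, 0, len(name_bytes), 0]
    out = b"070701" + b"".join(b"%08X" % f for f in fields) + name_bytes
    out += b"\0" * (-len(out) % 4)  # pad header + name to 4 bytes
    out += body
    out += b"\0" * (-len(out) % 4)  # pad data to 4 bytes
    return out


def build_initramfs(files):
    """files: {path: bytes}. Every file is written as an executable regular file."""
    archive = b""
    seen_dirs = set()
    ino = 1
    for path in sorted(files):
        parts = path.split("/")
        # Emit each parent directory once, before anything inside it.
        for depth in range(1, len(parts)):
            directory = "/".join(parts[:depth])
            if directory not in seen_dirs:
                seen_dirs.add(directory)
                archive += newc_entry(directory, stat.S_IFDIR | 0o755, ino=ino)
                ino += 1
        archive += newc_entry(path, stat.S_IFREG | 0o755, files[path], ino=ino)
        ino += 1
    return archive + newc_entry("TRAILER!!!", 0)
```

Something like build_initramfs({"init": ..., "lib64/libc.so.6": ...}), fed with the bytes of the files in the tree above, should produce an archive of the same shape as the one cpio just wrote.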

And run it:

$ qemu-system-x86_64 -kernel ./vmlinuz-6.6.8-200.fc39.x86_64 -initrd ./tree/initramfs -display none -serial stdio -append "console=ttyAMA0 console=ttyS0" --enable-kvm
...
init: cannot set terminal process group (-1): Inappropriate ioctl for device
init: no job control in this shell
init-5.2#

😮 We have a shell! That means we’ve successfully made an initramfs from scratch! If we Ctrl-D, the kernel panics (“Attempted to kill init”), but still! Progress!

And that’s where I’ll leave off here. It’s worthwhile looking back at what we’ve learnt:

The structure of the /boot directory
Using qemu to boot a kernel
The structure of an initramfs
And making our own one!

Where do we go from here? Well, the world’s our oyster. I don’t like packaging /bin/sh so maybe we’ll start with our own shell?

Let me know what you think of these - I’m planning to run this as a bit of a series, meandering my way around constructing what I feel like and sharing along the way. Let’s see how far we can get!

I'm on BlueSky: @colindou.ch. Come yell at me!
