Linux Kernel: How to calculate the value of _text(%rip)?



Description:

This article uses Linux kernel version v3.10

When studying the Linux Kernel source code, I encountered the following code snippet:

// file: arch/x86/kernel/head_64.S
startup_64:
/*
 * Compute the delta between the address I am compiled to run at and the
 * address I am actually running at.
 */
leaq    _text(%rip), %rbp
subq    $_text - __START_KERNEL_map, %rbp

So how do we calculate the value of _text(%rip)?

1. RIP-Relative Addressing

Before calculating the value, let's first look at the instruction format. Operands that use %rip as a base register rely on an addressing mode introduced with the x86_64 architecture, called RIP-Relative Addressing. The effective address is the start address of the next instruction plus an offset. The offset is a 32-bit signed integer, so the reachable range is ±2GB around the next instruction.

A new addressing form, RIP-relative (relative instruction-pointer) addressing, is implemented in 64-bit mode. An effective address is formed by adding displacement to the 64-bit RIP of the next instruction.

Note: Cited from Intel 64 and IA-32 Architectures Software Developer Manuals Volume 2A Chapter 2 Instruction Format 2.2.1.6 RIP-Relative Addressing

There are two usage modes:

  • Constant offset, for example 1234(%rip)
  • Symbol offset, for example symbol(%rip)

For the first case, calculation is very simple: just add the constant to the address of the next instruction.

The second case works differently. The address of the symbol is not added to the instruction address. Instead, the assembler and linker compute the offset from RIP (the address of the next instruction) to the symbol and encode that offset in the instruction. At runtime, adding the offset to RIP yields the actual address of the symbol. In other words, symbol(%rip) refers to the symbol itself, addressed relative to RIP.

The x86-64 architecture adds an RIP (instruction pointer relative) addressing. This addressing mode is specified by using ‘rip’ as a base register. Only constant offsets are valid. For example:

AT&T: ‘1234(%rip)’, Intel: ‘[rip + 1234]’

Points to the address 1234 bytes past the end of the current instruction.

AT&T: ‘symbol(%rip)’, Intel: ‘[rip + symbol]’

Points to the symbol in RIP relative way, this is shorter than the default absolute addressing.

Note: Cited from the official GNU as (assembler) documentation: 9.16.7 Memory References
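The two usage modes can be illustrated with a minimal Python sketch. It is not assembler code, only the arithmetic described above. The function names, the 7-byte instruction length, and the addresses 0x400000 and 0x400100 are hypothetical values chosen for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def rip_relative_target(insn_addr, insn_len, disp32):
    """Effective address = address of the next instruction + signed 32-bit displacement."""
    assert -2**31 <= disp32 < 2**31, "displacement must fit in a signed 32-bit field"
    next_insn = insn_addr + insn_len
    return (next_insn + disp32) & MASK64

def disp_for_symbol(insn_addr, insn_len, symbol_addr):
    """Displacement encoded for symbol(%rip): symbol address minus next-instruction address."""
    disp = symbol_addr - (insn_addr + insn_len)
    assert -2**31 <= disp < 2**31, "symbol is outside the +/-2GB range"
    return disp

# Constant offset: 1234(%rip) in a hypothetical 7-byte instruction at 0x400000.
const_target = rip_relative_target(0x400000, 7, 1234)

# Symbol offset: symbol(%rip), with a hypothetical symbol located at 0x400100.
disp = disp_for_symbol(0x400000, 7, 0x400100)
sym_target = rip_relative_target(0x400000, 7, disp)

print(hex(const_target))      # next instruction (0x400007) + 1234
print(disp, hex(sym_target))  # encoded displacement, and the symbol's own address
```

In the symbol case the result is always the symbol's address, whatever the instruction's position, because the encoded displacement already accounts for it.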

2. Value of the Symbol

Now that we have covered the instruction format, let's look at the value of the _text symbol.

The _text symbol is defined in the linker script arch/x86/kernel/vmlinux.lds.S:

// file: arch/x86/kernel/vmlinux.lds.S
#ifdef CONFIG_X86_32
        ...
#else
        . = __START_KERNEL;
        phys_startup_64 = startup_64 - LOAD_OFFSET;
#endif

/* Text and read-only data */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
_text = .;



} :text = 0x9090

In the linker script, the special symbol . refers to the location counter, which always points to the current output position; assigning a value to . will move the location counter. For more information about the location counter, refer to the online ld documentation: 3.10.5 The Location Counter.

In the code above, the symbol _text is set to the value of the location counter; on x86_64 systems, the location counter has been set to __START_KERNEL just before the .text section. The __START_KERNEL macro is defined in the file arch/x86/include/asm/page_64_types.h:

// file: arch/x86/include/asm/page_64_types.h
#define __PHYSICAL_START    ((CONFIG_PHYSICAL_START + \
                              (CONFIG_PHYSICAL_ALIGN - 1)) & \
                             ~(CONFIG_PHYSICAL_ALIGN - 1))

#define __START_KERNEL      (__START_KERNEL_map + __PHYSICAL_START)
#define __START_KERNEL_map  _AC(0xffffffff80000000, UL)

Among these, CONFIG_PHYSICAL_START and CONFIG_PHYSICAL_ALIGN are Linux Kernel configuration options. Their default values are as follows:

// file: include/generated/autoconf.h
#define CONFIG_PHYSICAL_START 0x1000000   // 16M
#define CONFIG_PHYSICAL_ALIGN 0x1000000   // 16M

After calculation, the macro __START_KERNEL evaluates to 0xffffffff81000000. This value is also the (virtual) address of the _text symbol, which can also be verified from vmlinux:

$ nm vmlinux|grep _text
...
ffffffff81000000 T _text
...
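The arithmetic behind this value can be reproduced in a few lines. The following is a minimal Python sketch, not kernel code. The Python names mirror the macros above, and the assumption that LOAD_OFFSET equals __START_KERNEL_map on x86_64 comes from the kernel's linker script rather than from the excerpt shown in this article.

```python
CONFIG_PHYSICAL_START = 0x1000000      # default value from autoconf.h (16 MiB)
CONFIG_PHYSICAL_ALIGN = 0x1000000      # default value from autoconf.h (16 MiB)
START_KERNEL_MAP = 0xFFFFFFFF80000000  # __START_KERNEL_map

def align_up(value, align):
    """Round value up to a multiple of align (align must be a power of two)."""
    return (value + align - 1) & ~(align - 1)

# __PHYSICAL_START: CONFIG_PHYSICAL_START rounded up to CONFIG_PHYSICAL_ALIGN
PHYSICAL_START = align_up(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)

# __START_KERNEL: link-time virtual address of _text
START_KERNEL = START_KERNEL_MAP + PHYSICAL_START

# phys_startup_64 = startup_64 - LOAD_OFFSET
# (assumption: LOAD_OFFSET is __START_KERNEL_map on x86_64)
PHYS_STARTUP_64 = START_KERNEL - START_KERNEL_MAP

print(hex(PHYSICAL_START), PHYSICAL_START // 2**20, "MiB")
print(hex(START_KERNEL))
print(hex(PHYS_STARTUP_64))
```

Under these assumptions, the last value is the physical counterpart of startup_64 that the linker script stores in phys_startup_64.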

Additionally, we can see another macro, __PHYSICAL_START, which defines the physical address where the kernel code (starting from protected mode) is loaded, with a value of 0x1000000 (16MB, not 1MB). Low memory, in particular the range below 1MB, is used by the BIOS, bootloader, boot sectors, kernel boot code, etc. The __START_KERNEL_map macro is the base virtual address of the kernel mapping, which corresponds to physical address 0.

Since the address of the _text symbol is 0xffffffff81000000, does that mean the result of _text(%rip) is exactly this address? Let's not jump to conclusions. First, let's look at how the instruction leaq _text(%rip), %rbp is actually encoded.

Before disassembly, we need to find the address of the startup_64 symbol and the next symbol after it:

$ nm vmlinux|sort -k1|grep -A 1 startup_64
0000000001000000 A phys_startup_64
ffffffff81000000 T startup_64
ffffffff81000000 T _text
ffffffff81000110 T secondary_startup_64
ffffffff810001b0 T start_cpu0

As we can see, the link-time address of startup_64 is 0xffffffff81000000, and the link-time address of the next symbol secondary_startup_64 is 0xffffffff81000110. We can then use the objdump command to disassemble vmlinux:

$ objdump -d vmlinux --start-address=0xffffffff81000000 --stop-address=0xffffffff81000110

vmlinux: file format elf64-x86-64

Disassembly of section .text:

ffffffff81000000 <_text>:
ffffffff81000000: 48 8d 2d f9 ff ff ff lea -0x7(%rip),%rbp # ffffffff81000000 <_text>
ffffffff81000007: 48 81 ed 00 00 00 01 sub $0x1000000,%rbp
ffffffff8100000e: 48 89 e8 mov %rbp,%rax
ffffffff81000011: 25 ff ff 1f 00 and $0x1fffff,%eax
ffffffff81000016: 85 c0 test %eax,%eax
ffffffff81000018: 0f 85 a7 01 00 00 jne ffffffff810001c5 <bad_address>
ffffffff8100001e: 48 8d 05 db ff ff ff lea -0x25(%rip),%rax # ffffffff81000000 <_text>
ffffffff81000025: 48 c1 e8 2e shr $0x2e,%rax



We can see that leaq _text(%rip), %rbp is assembled to lea -0x7(%rip),%rbp, with an encoded offset of -0x7. Why -0x7? The instruction sits at the link-time address of _text, 0xffffffff81000000, and is 7 bytes long, so the next instruction starts at 0xffffffff81000007. Getting from there back to _text requires an offset of -0x7.
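The offset can also be read directly from the machine code bytes in the objdump listing. The Python sketch below is only an illustration: decode_lea_rip is a hypothetical helper that handles just the 7-byte form seen here (REX.W prefix, opcode 0x8d, ModRM byte, 32-bit little-endian displacement), not a general x86 decoder.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF

def decode_lea_rip(insn_addr, insn_bytes):
    """Decode a 7-byte 'lea disp32(%rip), reg' and return (disp, target)."""
    assert len(insn_bytes) == 7 and insn_bytes[1] == 0x8D, "not the expected lea form"
    modrm = insn_bytes[2]
    # mod == 00 and rm == 101 select RIP-relative addressing in 64-bit mode
    assert modrm & 0xC7 == 0x05, "operand is not RIP-relative"
    (disp,) = struct.unpack("<i", insn_bytes[3:7])  # signed 32-bit, little-endian
    next_insn = insn_addr + len(insn_bytes)
    return disp, (next_insn + disp) & MASK64

# First instruction of startup_64, bytes taken from the objdump output
disp1, target1 = decode_lea_rip(0xFFFFFFFF81000000, bytes.fromhex("48 8d 2d f9 ff ff ff"))

# The later 'lea -0x25(%rip),%rax' at 0xffffffff8100001e
disp2, target2 = decode_lea_rip(0xFFFFFFFF8100001E, bytes.fromhex("48 8d 05 db ff ff ff"))

print(disp1, hex(target1))
print(disp2, hex(target2))
```

Both instructions carry different displacements, yet both resolve to the same link-time address of _text, because each displacement is measured from its own next instruction.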

So _text(%rip) becomes -0x7(%rip): the encoded instruction no longer contains the symbol's address, only the symbol's offset relative to %rip. The final result therefore depends solely on the value of %rip at runtime, since %rip is the only variable in -0x7(%rip).

So what is the value of %rip at this point?

3. Debugging and Conclusion

We will use qemu + gdb to debug this issue. First set a breakpoint at the startup_64 symbol, then run, and the result is as follows:

(gdb) file vmlinux
Reading symbols from vmlinux...
(gdb) target remote :1234
Remote debugging using :1234
0x000000000000fff0 in perf_throttled_count ()
(gdb) b startup_64
Breakpoint 1 at 0xffffffff81000000: file arch/x86/kernel/head_64.S, line 72.
(gdb) c
Continuing.

As we can see, even though we set a breakpoint at startup_64 (0xffffffff81000000), the kernel did not stop there and ran all the way through. This means the CPU never executed an instruction at this virtual address.

Restart qemu and gdb, set a breakpoint at the physical address 0x1000000, run, and the result is as follows:

(gdb) b *0x1000000
Breakpoint 1 at 0x1000000
(gdb) c
Continuing.

Breakpoint 1, 0x0000000001000000 in ?? ()

This time, the kernel stopped at the breakpoint. Let's check the currently executing instruction:

(gdb) display /3i $pc
2: x/3i $pc
=> 0x1000000:	lea    -0x7(%rip),%rbp        # 0x1000000
   0x1000007:	sub    $0x1000000,%rbp
   0x100000e:	mov    %rbp,%rax

As we can see, the instruction at this address is exactly the instruction at startup_64 that we obtained from disassembly earlier. Here %rip (the address of the next instruction) is 0x1000007, and 0x1000007 - 0x7 = 0x1000000.

Conclusion:

The value of _text(%rip) is 0x1000000, not 0xffffffff81000000.
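This result also explains the "delta" mentioned in the source comment. The Python sketch below replays the first two instructions of startup_64 for a given runtime address. The function name and the second load address (0x2000000) are hypothetical and only serve to show a non-zero delta. The constants -0x7, 0x1000000 and 0x1fffff are taken from the disassembly.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def startup_64_delta(runtime_addr):
    """Replay 'lea -0x7(%rip),%rbp' and 'sub $0x1000000,%rbp' at runtime_addr."""
    rip = runtime_addr + 7                    # %rip holds the next instruction's address
    text_runtime = (rip - 0x7) & MASK64       # lea -0x7(%rip), %rbp
    delta = (text_runtime - 0x1000000) & MASK64  # sub $(_text - __START_KERNEL_map), %rbp
    aligned = (delta & 0x1FFFFF) == 0         # the 'and $0x1fffff' check: 2MB alignment
    return text_runtime, delta, aligned

# Case from the gdb session: the code runs at physical address 0x1000000
text_a, delta_a, ok_a = startup_64_delta(0x1000000)

# Hypothetical case: the same code loaded 16MB higher, at 0x2000000
text_b, delta_b, ok_b = startup_64_delta(0x2000000)

print(hex(text_a), hex(delta_a), ok_a)
print(hex(text_b), hex(delta_b), ok_b)
```

With the default load address the delta is zero, meaning the kernel runs exactly where it was compiled to run in physical terms. A relocated kernel would see a non-zero delta.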

4. Cause Analysis

Why is this the case?

When execution reaches startup_64, we are already in 64-bit mode, and paging has been enabled:

// file: arch/x86/boot/compressed/head_64.S
/* Enter paged protected Mode, activating Long Mode */
movl    $(X86_CR0_PG | X86_CR0_PE), %eax /* Enable Paging and Protected mode */
movl    %eax, %cr0

And the early page table has already been constructed:

// file: arch/x86/boot/compressed/head_64.S
/*
 * Build early 4G boot pagetable
 */
/* Initialize Page tables to 0 */
leal    pgtable(%ebx), %edi
xorl    %eax, %eax
movl    $((4096*6)/4), %ecx
rep     stosl

/* Build Level 4 */
leal    pgtable + 0(%ebx), %edi
leal    0x1007 (%edi), %eax
movl    %eax, 0(%edi)

/* Build Level 3 */
leal    pgtable + 0x1000(%ebx), %edi
leal    0x1007(%edi), %eax
movl    $4, %ecx
1:      movl    %eax, 0x00(%edi)
addl    $0x00001000, %eax
addl    $8, %edi
decl    %ecx
jnz     1b

/* Build Level 2 */
leal    pgtable + 0x2000(%ebx), %edi
movl    $0x00000183, %eax
movl    $2048, %ecx
1:      movl    %eax, 0(%edi)
addl    $0x00200000, %eax
addl    $8, %edi
decl    %ecx
jnz     1b

/* Enable the boot page tables */
leal    pgtable(%ebx), %eax
movl    %eax, %cr3

However, we are still in the early boot stage, and this is only a temporary page table. It covers just the first 4GB of physical memory with an identity mapping, so every mapped virtual address equals its physical address. At this point, 0x1000000 is therefore both the physical and the virtual address of startup_64. The mapping of the kernel to virtual addresses based on __START_KERNEL_map is set up at a later stage and is not in place yet.
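The effect of this early page table can be modeled in a short Python sketch. This is a simplified model under stated assumptions, not kernel code: it ignores the physical location of pgtable and only keeps what the assembly above establishes, namely one Level 4 entry, four Level 3 entries, and 2048 Level 2 entries that each map a 2MB page with flags 0x183. The names level2 and translate are hypothetical.

```python
PAGE_2M = 0x200000
FLAGS = 0x183  # flag bits written into every Level 2 entry by the assembly above

# Level 2: 2048 consecutive entries, entry i maps the 2MB page at physical i * 2MB
level2 = [FLAGS + i * PAGE_2M for i in range(2048)]

def translate(vaddr):
    """Return the physical address for vaddr, or None if the early table has no mapping."""
    l4_idx = (vaddr >> 39) & 0x1FF  # index into Level 4 (only entry 0 is filled)
    l3_idx = (vaddr >> 30) & 0x1FF  # index into Level 3 (only entries 0..3 are filled)
    l2_idx = (vaddr >> 21) & 0x1FF  # index into one Level 2 page directory
    if l4_idx != 0 or l3_idx >= 4:
        return None
    entry = level2[l3_idx * 512 + l2_idx]
    page_base = entry & ~(PAGE_2M - 1)  # strip the flag bits
    return page_base + (vaddr & (PAGE_2M - 1))

mapped_bytes = len(level2) * PAGE_2M

print(hex(mapped_bytes))                   # total size covered by the table
print(hex(translate(0x1000000)))           # runtime address of startup_64
print(translate(0xFFFFFFFF81000000))       # link-time address of startup_64
```

In this model the link-time address 0xffffffff81000000 selects Level 4 index 511, which is empty, which is consistent with the breakpoint at that address never being hit.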

5. References

1. Intel 64 and IA-32 Architectures Software Developer Manuals Volume 2A Chapter 2 Instruction Format 2.2.1.6 RIP-Relative Addressing

2. Online GNU as documentation: 9.16.7 Memory References

3. Online ld documentation: 3.10.5 The Location Counter

4. How do RIP-relative variable references like "[RIP + _a]" in x86-64 GAS Intel-syntax work?


This is a discussion topic separated from the original thread at https://juejin.cn/post/7368753329668128831