
RISC-V Linux startup page table creation analysis

嵌入式Linux充电站
2023-08-01 15:39:36

The previous article analyzed the assembly startup process of RISC-V Linux and mentioned that the relocate step requires the MMU to be turned on. This article analyzes how RISC-V Linux creates its page tables.

Note: This article is based on the Linux 5.10.111 kernel.

##sv39 mmu

RISC-V Linux supports virtual address formats such as sv32, sv39 and sv48, which use 32-bit, 39-bit and 48-bit virtual addresses respectively. RISC-V Linux uses the sv39 format by default. The virtual address, physical address and PTE formats of sv39 are as follows:

Virtual address format:

(figure: sv39 virtual address format)

Physical address format:

(figure: sv39 physical address format)

PTE format:

(figure: sv39 PTE format)

The virtual address is 39 bits wide. The low 12 bits are the page offset, and the high bits are split into three fields, VPN[0], VPN[1] and VPN[2], which are the indexes of the virtual address VA into the PTE, PMD and PGD tables respectively.
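
The field split described above can be sketched in a few lines of C. This is only an illustration: the SKETCH_ macros and the two helper functions are names invented for this example and are not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths of an sv39 virtual address. The SKETCH_ names are made up
 * for this example and are not kernel macros. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_VPN_BITS   9u
#define SKETCH_VPN_MASK   ((1u << SKETCH_VPN_BITS) - 1u)   /* 0x1ff */

/* Low 12 bits: offset inside the 4K page. */
static inline uint64_t sv39_page_offset(uint64_t va)
{
    return va & ((1ull << SKETCH_PAGE_SHIFT) - 1ull);
}

/* level 0 -> VPN[0] (PTE index), 1 -> VPN[1] (PMD index),
 * 2 -> VPN[2] (PGD index). */
static inline unsigned sv39_vpn(uint64_t va, unsigned level)
{
    assert(level <= 2);
    return (unsigned)((va >> (SKETCH_PAGE_SHIFT + SKETCH_VPN_BITS * level))
                      & SKETCH_VPN_MASK);
}
```

For the address 0x12345678, for instance, the page offset is 0x678, VPN[0] is 0x145, VPN[1] is 0x91 and VPN[2] is 0.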

The physical address is 56 bits wide. The low 12 bits are the page offset, and the high bits are the physical page number fields PPN[0], PPN[1] and PPN[2].

A PTE stores the physical page number fields PPN[0], PPN[1] and PPN[2], which correspond to the PPN fields of the physical address. The low 10 bits of the PTE hold the access permissions and other flags. When R, W and X are all 0, the address stored in the PTE is the physical address of the next-level page table; otherwise the entry is a last-level (leaf) entry.
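
As a rough sketch of the rule above, the following helpers classify a PTE value and recover the physical address it holds. The PTE_ names are local to this example (the kernel uses its own _PAGE_* macros); the bit positions are those of the sv39 PTE layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits in the low 10 bits of an sv39 PTE (names local to this sketch). */
#define PTE_V (1ull << 0)   /* valid */
#define PTE_R (1ull << 1)   /* readable */
#define PTE_W (1ull << 2)   /* writable */
#define PTE_X (1ull << 3)   /* executable */

#define PTE_PPN_SHIFT 10u
#define PTE_PPN_BITS  44u   /* PPN[2] + PPN[1] + PPN[0] */
static_assert(PTE_PPN_BITS == 26 + 9 + 9, "sv39 PPN field widths");

/* Valid and at least one of R/W/X set: last-level (leaf) entry. */
static inline bool pte_is_leaf(uint64_t pte)
{
    return (pte & PTE_V) && (pte & (PTE_R | PTE_W | PTE_X));
}

/* Valid and R/W/X all zero: entry points to the next-level table. */
static inline bool pte_is_table_pointer(uint64_t pte)
{
    return (pte & PTE_V) && !(pte & (PTE_R | PTE_W | PTE_X));
}

/* Physical address stored in the PPN fields (page or next-level table). */
static inline uint64_t pte_to_phys(uint64_t pte)
{
    return ((pte >> PTE_PPN_SHIFT) & ((1ull << PTE_PPN_BITS) - 1ull)) << 12;
}
```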

Now look at the sv39 page table layout. sv39 uses a three-level page table: PGD, PMD and PTE. Each level is indexed by 9 bits, so a table at any level has 512 entries.

In the code, a page table is represented by an array of 512 elements. Each entry occupies 8 bytes, and 512 * 8 = 4096 bytes, so one table fills exactly one 4K page. One PTE entry maps a 4K page. One PMD entry can point to a PTE table with 512 entries, 512 * 4K = 2M, so one PMD entry covers 2M. By analogy, one PGD entry covers 512 * 2M = 1G.

Important conclusion: a PGD entry covers 1G, a PMD entry covers 2M, and a PTE entry covers 4K. The default page size of sv39 is 4K.
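
The arithmetic behind this conclusion can be checked mechanically. The sketch below uses invented SKETCH_ names; in the kernel the corresponding sizes are PAGE_SIZE, PMD_SIZE and PGDIR_SIZE.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_ENTRY_BYTES 8ull          /* one page table entry */
#define SKETCH_ENTRIES     512ull        /* entries per table (9 index bits) */
#define SKETCH_PAGE_SIZE   (1ull << 12)  /* 4K */

static_assert(SKETCH_ENTRIES * SKETCH_ENTRY_BYTES == SKETCH_PAGE_SIZE,
              "one table fills exactly one 4K page");

/* Memory covered by a single entry at the given level:
 * level 0 = PTE entry, 1 = PMD entry, 2 = PGD entry. */
static inline uint64_t sv39_entry_span(unsigned level)
{
    assert(level <= 2);
    return SKETCH_PAGE_SIZE << (9u * level);
}
```

Each level up multiplies the span by 512, which is where 4K, 2M and 1G come from.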

(figure: schematic of the sv39 three-level translation from a virtual address to a physical address)

The sv39 three-level translation from a virtual address to a physical address works as follows:

The MMU obtains the physical address of the PGD from the satp register and combines it with the PGD index (VPN[2]) to find the PMD. It then combines the PMD with the PMD index (VPN[1]) to find the PTE table, and uses the PTE index (VPN[0]) to read the entry for VA from that table.

Finally, it takes PPN[2], PPN[1] and PPN[0] from the PTE and combines them with the low 12-bit offset of the virtual address to form the final physical address.
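
The walk can be modeled in software. The following is a toy simulation, not how the hardware or the kernel is implemented: toy_mem stands in for physical memory (page n represents physical address n * 4096), and all TOY_ names and both functions are invented for this sketch. It also handles a leaf found at the PMD or PGD level by keeping the corresponding low bits of the virtual address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGES 8u                      /* toy physical memory: 8 pages */
#define TOY_V     (1ull << 0)             /* valid */
#define TOY_RWX   (7ull << 1)             /* R, W and X bits */
#define TOY_PPN(pte) (((pte) >> 10) & ((1ull << 44) - 1ull))

/* Page n of toy_mem stands for physical address n * 4096. */
static uint64_t toy_mem[TOY_PAGES][512];

/* Walk from the root table (what satp would point to) down to a leaf. */
static bool sv39_walk(uint64_t root_ppn, uint64_t va, uint64_t *pa_out)
{
    uint64_t table_ppn = root_ppn;

    assert(pa_out != NULL);
    for (int level = 2; level >= 0; level--) {
        unsigned shift = 12u + 9u * (unsigned)level;

        if (table_ppn >= TOY_PAGES)
            return false;                 /* outside the toy memory */
        uint64_t pte = toy_mem[table_ppn][(va >> shift) & 0x1ffu];
        if (!(pte & TOY_V))
            return false;                 /* not mapped */
        if (pte & TOY_RWX) {
            /* Leaf: keep the low bits of va that this level does not translate. */
            uint64_t span_mask = (1ull << shift) - 1ull;
            *pa_out = ((TOY_PPN(pte) << 12) & ~span_mask) | (va & span_mask);
            return true;
        }
        table_ppn = TOY_PPN(pte);         /* pointer to the next-level table */
    }
    return false;                         /* a level-0 entry must be a leaf */
}

/* Build PGD (page 1) -> PMD (page 2) -> PTE table (page 3) and translate. */
static uint64_t sv39_walk_demo(void)
{
    uint64_t va = (2ull << 30) | (3ull << 21) | (4ull << 12) | 0x56ull;
    uint64_t pa = 0;

    toy_mem[1][2] = (2ull << 10) | TOY_V;
    toy_mem[2][3] = (3ull << 10) | TOY_V;
    toy_mem[3][4] = (0x1234ull << 10) | TOY_V | TOY_RWX;
    return sv39_walk(1, va, &pa) ? pa : UINT64_MAX;
}
```

In the demo, the leaf holds page number 0x1234 and the virtual address has offset 0x56, so the translated address is 0x1234056.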

Temporary page table analysis

Before the MMU is turned on, page tables must be created for the kernel, the dtb, the trampoline and so on, so that after the MMU is on and before the memory management module runs, the kernel can initialize normally and parse the dtb. These are temporary page tables; the final page tables are created in setup_vm_final().

Temporary page table creation sequence:

1. First create the early PGD and PMD for fixmap. At this point the PGD is early_pg_dir.
2. Then create a two-level page table for the first 2M of memory starting at the kernel. At this point the PGD is trampoline_pg_dir. The mapping created for this 2M is also called a superpage.
3. Then create a two-level page table for the whole kernel. At this point the PGD is early_pg_dir.
4. Finally, reserve 4M for the dtb and create a two-level page table for it.

Page table creation functions

##create_pgd_mapping()
void __init create_pgd_mapping(pgd_t *pgdp,
          uintptr_t va, phys_addr_t pa,
          phys_addr_t sz, pgprot_t prot)

pgdp: PGD page table

va: virtual address

pa: physical address

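sz: mapping size, one of PGDIR_SIZE, PMD_SIZE or PAGE_SIZE

prot: permissions. PAGE_KERNEL_EXEC or PAGE_KERNEL means the entry is a last-level entry; otherwise (PAGE_TABLE) pa is the physical address of the next-level page table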

create_pmd_mapping()

static void __init create_pmd_mapping(pmd_t *pmdp,
          uintptr_t va, phys_addr_t pa,
          phys_addr_t sz, pgprot_t prot)

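pmdp: PMD page table

va: virtual address

pa: physical address

sz: mapping size, PMD_SIZE or PAGE_SIZE

prot: permissions. PAGE_KERNEL_EXEC or PAGE_KERNEL means the entry is a last-level entry; otherwise (PAGE_TABLE) pa is the physical address of the next-level page table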

create_pte_mapping()

static void __init create_pte_mapping(pte_t *ptep,
          uintptr_t va, phys_addr_t pa,
          phys_addr_t sz, pgprot_t prot)

ptep: PTE page table

va: virtual address

pa: physical address

sz: mapping size, PAGE_SIZE

prot: permissions. A PTE is always the last level, so this is PAGE_KERNEL_EXEC or PAGE_KERNEL

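Usage example

For example, to map the virtual address PAGE_OFFSET to the physical address pa with a mapping size of 4K, create the three page table levels PGD, PMD and PTE:

create_pgd_mapping(early_pg_dir, PAGE_OFFSET,
                   (uintptr_t)early_pmd, PGDIR_SIZE, PAGE_TABLE);
create_pmd_mapping(early_pmd, PAGE_OFFSET,
                   (uintptr_t)early_pte, PMD_SIZE, PAGE_TABLE);
create_pte_mapping(early_pte, PAGE_OFFSET,
                   (uintptr_t)pa, PAGE_SIZE, PAGE_KERNEL_EXEC);

Once these entries exist, the MMU uses PAGE_OFFSET to find the PMD in the PGD, then the PTE in the PMD, and finally reads out the physical address.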
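
To make the effect of the three calls concrete, here is a user-space toy model of what each call writes. It is a simplified sketch, not the kernel implementation: the toy_ names are invented, the protection values are reduced to the V/R/W/X bits (the real PAGE_KERNEL_EXEC carries more flags), and the addresses of ordinary arrays stand in for the physical addresses of the tables.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define TOY_PROT_TABLE       0x1ull  /* V only: entry points to the next level */
#define TOY_PROT_KERNEL_EXEC 0xfull  /* V|R|W|X: simplified leaf permissions */

/* Tables must be 4K aligned so that (address >> 12) loses no bits. */
static alignas(4096) uint64_t toy_pgd[512];
static alignas(4096) uint64_t toy_pmd[512];
static alignas(4096) uint64_t toy_pte[512];

/* Write one entry: index by the VPN field of this level, store PFN and prot. */
static void toy_set_entry(uint64_t *table, uint64_t va, unsigned level,
                          uint64_t pa, uint64_t prot)
{
    assert(level <= 2 && (pa & 0xfffull) == 0);
    unsigned idx = (unsigned)((va >> (12u + 9u * level)) & 0x1ffu);

    if (table[idx] == 0)              /* do not overwrite an existing entry */
        table[idx] = ((pa >> 12) << 10) | prot;
}

/* Same three steps as the usage example: PGD -> PMD -> PTE -> 4K page. */
static void toy_map_4k(uint64_t va, uint64_t pa)
{
    toy_set_entry(toy_pgd, va, 2, (uint64_t)(uintptr_t)toy_pmd, TOY_PROT_TABLE);
    toy_set_entry(toy_pmd, va, 1, (uint64_t)(uintptr_t)toy_pte, TOY_PROT_TABLE);
    toy_set_entry(toy_pte, va, 0, pa, TOY_PROT_KERNEL_EXEC);
}
```

The first two entries carry only the valid bit, which marks them as pointers to the next level; only the last entry carries permissions, which makes it the leaf.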

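Page table creation source code analysis

During startup, RISC-V Linux creates page tables twice. The first time, the C function setup_vm() creates the temporary page tables; the second time, the C function setup_vm_final() creates the final page tables.

See the comments in the code for details. The listings below omit some less important parts.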

setup_vm()

asmlinkage void __init setup_vm(uintptr_t dtb_pa)
{
 uintptr_t va, pa, end_va;
 uintptr_t load_pa = (uintptr_t)(&_start);
 uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
 uintptr_t map_size;
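 // load_pa is the physical address at which the kernel is loaded
 // load_sz is the actual size of the kernel image

 // PAGE_OFFSET is the virtual address that corresponds to the kernel's starting physical address; va_pa_offset is the offset between the two
 va_pa_offset = PAGE_OFFSET - load_pa;

 // Compute the page frame number of the kernel's starting physical address. PFN_DOWN shifts the physical address right by 12 bits, because the low 12 bits of an sv39 physical address are the page offset
 pfn_base = PFN_DOWN(load_pa);

 map_size = PMD_SIZE; // PMD_SIZE is 2M. At this point map_size can only be PGDIR_SIZE or PMD_SIZE; the kernel does not allow PTE-level mappings here

 // Check that PAGE_OFFSET is 1G aligned and that the kernel entry address is 2M aligned
 BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
 BUG_ON((load_pa % map_size) != 0);

 // alloc_pte_early() just calls BUG(): the kernel does not allow PTE tables to be created for the temporary page tables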
 pt_ops.alloc_pte = alloc_pte_early;
 pt_ops.get_pte_virt = get_pte_virt_early;
#ifndef __PAGETABLE_PMD_FOLDED
 pt_ops.alloc_pmd = alloc_pmd_early;
 pt_ops.get_pmd_virt = get_pmd_virt_early;
#endif
 /* Set up the early PGD for fixmap */
 create_pgd_mapping(early_pg_dir, FIXADDR_START,
      (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);

 /* Set up the fixmap PMD */
 create_pmd_mapping(fixmap_pmd, FIXADDR_START,
      (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
 /* Set up the trampoline PGD and PMD */
 create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
      (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
 create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
      load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);

 /*
  * Set up an early PGD covering the entire kernel, which lets us reach paging_init().
  * All memory is mapped later in setup_vm_final() below.
  */
 end_va = PAGE_OFFSET + load_sz;
 for (va = PAGE_OFFSET; va < end_va; va += map_size)
  create_pgd_mapping(early_pg_dir, va,
       load_pa + (va - PAGE_OFFSET),
       map_size, PAGE_KERNEL_EXEC);

 /* Create the early PMD for the dtb */
 create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
      (uintptr_t)early_dtb_pmd, PGDIR_SIZE, PAGE_TABLE);
 /* Create two consecutive PMD mappings for the early FDT scan */
 pa = dtb_pa & ~(PMD_SIZE - 1);
 create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA,
      pa, PMD_SIZE, PAGE_KERNEL);
 create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA + PMD_SIZE,
      pa + PMD_SIZE, PMD_SIZE, PAGE_KERNEL);
 dtb_early_va = (void *)DTB_EARLY_BASE_VA + (dtb_pa & (PMD_SIZE - 1));
 ......

}
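
The dtb address arithmetic at the end of setup_vm() can be sketched as follows. The helper names and SKETCH_PMD_SIZE are invented for this example. A likely reason for mapping two consecutive PMDs (an inference from the code, not something the source states) is that the dtb may start anywhere inside a 2M window and could otherwise run past its end.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PMD_SIZE (1ull << 21)   /* 2M, stands in for PMD_SIZE */
static_assert((SKETCH_PMD_SIZE & (SKETCH_PMD_SIZE - 1)) == 0,
              "masking below assumes a power of two");

/* 2M-aligned physical base that the first PMD entry maps. */
static inline uint64_t dtb_window_base(uint64_t dtb_pa)
{
    return dtb_pa & ~(SKETCH_PMD_SIZE - 1);
}

/* Offset of the dtb inside that window; added to DTB_EARLY_BASE_VA. */
static inline uint64_t dtb_window_offset(uint64_t dtb_pa)
{
    return dtb_pa & (SKETCH_PMD_SIZE - 1);
}

/* Bytes reachable from the start of the dtb through the two mapped PMDs. */
static inline uint64_t dtb_window_room(uint64_t dtb_pa)
{
    return 2 * SKETCH_PMD_SIZE - dtb_window_offset(dtb_pa);
}
```

Whatever the offset, at least 2M stays reachable from the start of the dtb, because the offset is always smaller than 2M.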

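setup_vm() checks the alignment of the kernel entry address right at the start and requires it to be 2M aligned. Assuming memory starts at 0x80000000, the kernel can only be placed at 2M-aligned addresses such as 0x80000000 or 0x80200000. Why is there such an alignment requirement?

My guess is that it simply reserves 2M for OpenSBI: OpenSBI runs before the kernel, and when it finishes, its default jump address is at a 2M offset, so the kernel uses 2M alignment just to match OpenSBI. Note that the check in the code is made against map_size, which is PMD_SIZE here, so the kernel must also start on a boundary that 2M mappings can cover.

Does OpenSBI really need as much as 2M? In practice it only needs a few hundred KB, so there is a stretch of unused memory between OpenSBI and the kernel. We will cover this in the next article.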
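
The two BUG_ON() checks in setup_vm() are plain modulo tests. A minimal sketch, with invented names standing in for PMD_SIZE and PGDIR_SIZE:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PMD_SIZE   (1ull << 21)   /* 2M */
#define SKETCH_PGDIR_SIZE (1ull << 30)   /* 1G */

/* True when addr is a multiple of size, i.e. the BUG_ON() would not fire. */
static inline bool is_aligned(uint64_t addr, uint64_t size)
{
    assert(size != 0);
    return (addr % size) == 0;
}
```

With memory starting at 0x80000000, a load address of 0x80200000 passes the 2M check, while 0x80100000 would trigger the BUG_ON().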

setup_vm_final()

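This function maps the whole of physical memory, managed through the swapper page table, and clears the page tables left over from the assembly stage.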

static void __init setup_vm_final(void)
{
 uintptr_t va, map_size;
 phys_addr_t pa, start, end;
 u64 i;

 /**
  * The MMU is already on at this point, but the page tables are not fully set up yet.
  */
 pt_ops.alloc_pte = alloc_pte_fixmap;
 pt_ops.get_pte_virt = get_pte_virt_fixmap;
#ifndef __PAGETABLE_PMD_FOLDED
 pt_ops.alloc_pmd = alloc_pmd_fixmap;
 pt_ops.get_pmd_virt = get_pmd_virt_fixmap;
#endif
 /* Setup swapper PGD for fixmap */
 create_pgd_mapping(swapper_pg_dir, FIXADDR_START,
      __pa_symbol(fixmap_pgd_next),
      PGDIR_SIZE, PAGE_TABLE);

 /* Create page tables for the whole of physical memory */
 for_each_mem_range(i, &start, &end) {
  if (start >= end)
   break;
  if (start <= __pa(PAGE_OFFSET) &&
      __pa(PAGE_OFFSET) < end)
   start = __pa(PAGE_OFFSET);

  // best_map_size() picks a suitable mapping size: if both the start address and the size of the region are 2M aligned, map_size is 2M; otherwise it is 4K
  map_size = best_map_size(start, end - start);
  for (pa = start; pa < end; pa += map_size) {
   va = (uintptr_t)__va(pa);
   create_pgd_mapping(swapper_pg_dir, va, pa,
        map_size, PAGE_KERNEL_EXEC);
  }
 }

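 /* Clear the fixmap PMD and PTE mappings */
 clear_fixmap(FIX_PTE);
 clear_fixmap(FIX_PMD);

 /* Switch to the swapper page table, which is the final page table. The relocate code in the assembly stage turns on the MMU with the same kind of satp write. */
 csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
 local_flush_tlb_all(); // flush the TLB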

 ......
}
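
The choice made by best_map_size() can be sketched as below. This is written from memory of the 5.10 sources rather than copied from them, so treat the exact condition as an assumption: 2M mappings are used only when both the start address and the size of the region are multiples of 2M. The function and macro names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE (1ull << 12)   /* 4K, stands in for PAGE_SIZE */
#define SKETCH_PMD_SIZE  (1ull << 21)   /* 2M, stands in for PMD_SIZE */
static_assert(SKETCH_PMD_SIZE == 512 * SKETCH_PAGE_SIZE,
              "a PMD entry covers 512 pages");

/* Assumed behaviour: fall back to 4K pages unless base and size are both
 * 2M aligned. */
static uint64_t sketch_best_map_size(uint64_t base, uint64_t size)
{
    if ((base & (SKETCH_PMD_SIZE - 1)) || (size & (SKETCH_PMD_SIZE - 1)))
        return SKETCH_PAGE_SIZE;
    return SKETCH_PMD_SIZE;
}
```

Under this assumption, a 1G region starting at 0x80000000 is mapped with 2M entries, while a region whose size is not a multiple of 2M is mapped page by page.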

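Notes:

In setup_vm_final(), access to the whole of physical memory is managed through the swapper_pg_dir page table. The function also clears the temporary mappings that were only needed while the tables were being built, which in the code above are the FIX_PTE and FIX_PMD fixmap slots. (Clearing essentially means zeroing the page table entry, i.e. setting it to 0.)

Finally, the physical address of swapper_pg_dir is written to the SATP register, so that the CPU can access all of physical memory through this page table.

The page table switch is done as follows:

csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);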
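
The value written to satp can be reproduced with ordinary integer arithmetic. On RV64 the register holds the translation mode in bits 63:60 (8 selects sv39), an ASID in bits 59:44 and the root table's page frame number in bits 43:0. The sketch below leaves the ASID at 0, as the write above does; the names are invented for this example and the function does not touch any CSR.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SATP_MODE_SV39 (8ull << 60)          /* MODE field = 8 */
#define SKETCH_SATP_PPN_MASK  ((1ull << 44) - 1ull) /* PPN field, bits 43:0 */

/* Equivalent of PFN_DOWN(root_pa) | SATP_MODE for sv39, with ASID 0. */
static inline uint64_t make_satp_sv39(uint64_t root_pa)
{
    assert((root_pa & 0xfffull) == 0);   /* the root table is page aligned */
    return ((root_pa >> 12) & SKETCH_SATP_PPN_MASK) | SKETCH_SATP_MODE_SV39;
}

/* Recover the physical address of the root table from a satp value. */
static inline uint64_t satp_root_pa(uint64_t satp)
{
    return (satp & SKETCH_SATP_PPN_MASK) << 12;
}
```

For a root table at the hypothetical physical address 0x80a00000, the resulting value is 0x8000000000080a00.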

In the kernel space managed by swapper_pg_dir, the offset between virtual and physical addresses is fixed: it is va_pa_offset, a global variable defined in arch/riscv/mm/init.c.
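
Because the offset is fixed, converting between kernel virtual and physical addresses is a single addition or subtraction. A minimal sketch with invented function names; the offset is computed the same way as in setup_vm(), as PAGE_OFFSET minus the load address:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PGDIR_SIZE (1ull << 30)   /* 1G */

/* va_pa_offset = PAGE_OFFSET - load_pa (unsigned wrap-around is intended). */
static inline uint64_t sketch_va_pa_offset(uint64_t page_offset, uint64_t load_pa)
{
    assert((page_offset % SKETCH_PGDIR_SIZE) == 0);   /* as in setup_vm() */
    return page_offset - load_pa;
}

/* Physical -> kernel virtual address in the linear mapping. */
static inline uint64_t sketch_pa_to_va(uint64_t pa, uint64_t va_pa_offset)
{
    return pa + va_pa_offset;
}

/* Kernel virtual -> physical address in the linear mapping. */
static inline uint64_t sketch_va_to_pa(uint64_t va, uint64_t va_pa_offset)
{
    return va - va_pa_offset;
}
```

The example values used to exercise it (a kernel loaded at 0x80200000 and mapped at 0xffffffe000000000) are illustrative assumptions, not values taken from the article.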

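Note: swapper_pg_dir manages the page tables for kernel space, so the virtual address range it maps physical memory into can only be accessed by the kernel. User space cannot access it; for user-space access, separate page tables must be built that map physical addresses into the user-space virtual address range. Kernel threads share the swapper_pg_dir page table.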

Summary

The page table creation at RISC-V Linux startup is relatively easy to follow: it is all done in C and there is not much code. The two main functions are setup_vm() and setup_vm_final(). Once you understand the sv39 address formats, the source is easier to analyze. The code differs between kernel versions, however, so each version has to be analyzed on its own.

This article mentioned that setup_vm() checks whether the kernel entry address is 2M aligned; if it is not, the kernel cannot start. In fact, this 2M alignment restriction can be lifted so that this part of memory can be used. The next article will show how to optimize this part of memory.


Statement:
This article is reproduced from 嵌入式Linux充电站. If there is any infringement, please contact admin@php.cn for removal.