Empathy List Archives

gem5-users@gem5.org

The gem5 Users mailing list

Re: BUG: kernel NULL pointer dereference occurs when restoring a checkpoint generated by KVM core in FS mode

YongjieHuang

Fri, Aug 16, 2024 8:38 AM

Btw, I also tried running without writing a checkpoint: booting with the KVM CPU and then running lulesh to completion works fine:

/mydata/gem5/build/X86/gem5.opt --outdir=m5out_p16 -r -e fs.py --disk-image=./x86-64-system/disks/base.img --kernel=./x86-64-system/binaries/vmlinux-5.4.49 --num-cpus=16 --cpu-type=X86KvmCPU --mem-size=8GB --script=./test.rcS

test.rcS:
/bin/lulesh2.0 -p -s 100
m5 exit

According to its output, lulesh2.0 runs to completion, and gem5 exits normally.

Original

From:"YongjieHuang via gem5-users"< gem5-users@gem5.org >;

Date:2024/8/16 15:57

To:"gem5-users"< gem5-users@gem5.org >;

CC:"YongjieHuang"< 876167080@qq.com >;

Subject:[gem5-users] BUG: kernel NULL pointer dereference occurs when restoring a checkpoint generated by KVM core in FS mode

Dear all,

I want to write checkpoints with the KVM CPU and restore them with the O3 CPU, but I hit a kernel BUG.

My gem5 version is v23.0.0.1. The image and kernel were downloaded from https://www.gem5.org/project/2020/03/09/boot-tests.html.

The kernel is 'vmlinux-5.4.49' and the image is 'disk image (GZIPPED)'.

I used X86KvmCPU to write a checkpoint while lulesh2.0 was running with OpenMP. Below is the command for booting the system and writing a checkpoint upon reaching 9 billion instructions.
/mydata/gem5/build/X86/gem5.opt --outdir=m5out_p16 -r -e fs.py --disk-image=./x86-64-system/disks/disk.img --kernel=./x86-64-system/binaries/vmlinux-5.4.49 --num-cpus=16 --cpu-type=X86KvmCPU --mem-size=8GB --checkpoint-dir=ckptest --at-instruction --take-checkpoints 9000000000 --script=./test.rcS

test.rcS is the script that runs lulesh2.0, which I had copied manually into /bin of the disk image after mounting it with sudo mount:

/bin/lulesh2.0 -p -s 100
m5 exit
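The checkpoint run above and the restore run described below share most of their options. A minimal sketch of assembling both command lines from one shared option list follows, so that the options describing the simulated system cannot drift apart between the two runs. The helper name build_cmd and the SYSTEM_OPTS grouping are hypothetical; every flag is copied from the commands in this post, and the script only prints the commands rather than running gem5.

```python
import shlex

GEM5 = "/mydata/gem5/build/X86/gem5.opt"

# Options shared by both runs, copied verbatim from the commands in this post.
SYSTEM_OPTS = [
    "--disk-image=./x86-64-system/disks/disk.img",
    "--kernel=./x86-64-system/binaries/vmlinux-5.4.49",
    "--num-cpus=16",
    "--mem-size=8GB",
    "--checkpoint-dir=ckptest",
]


def build_cmd(cpu_type, extra):
    """Hypothetical helper: return a gem5 fs.py invocation as an argv list."""
    return [GEM5, "--outdir=m5out_p16", "-r", "-e", "fs.py",
            *SYSTEM_OPTS, f"--cpu-type={cpu_type}", *extra]


# Checkpoint creation with the KVM CPU.
take_ckpt = build_cmd("X86KvmCPU", [
    "--at-instruction", "--take-checkpoints", "9000000000",
    "--script=./test.rcS",
])

# Restore with the O3 CPU and the cache hierarchy used in this post.
restore = build_cmd("X86O3CPU", [
    "--caches", "--cpu-clock=2.4GHz",
    "--l1i_size=32kB", "--l1i_assoc=8",
    "--l1d_size=64kB", "--l1d_assoc=8",
    "--l2cache", "--l2_size=1MB", "--l2_assoc=16",
    "--l3cache", "--l3_size=16MB", "--l3_assoc=16",
    "-r", "1",
])

print(shlex.join(take_ckpt))
print(shlex.join(restore))
```

The option order differs from the commands as typed in this post, which should not matter for option parsing but is worth keeping in mind when comparing them.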

However, when I restore the checkpoint written above with the O3 CPU using the command line below, I see a kernel BUG in the system.pc.com_1.device file instead of the lulesh process continuing.

command line: /mydata/gem5/build/X86/gem5.opt --outdir=m5out_p16 -r -e fs.py --disk-image=./x86-64-system/disks/disk.img --kernel=./x86-64-system/binaries/vmlinux-5.4.49 --num-cpus=16 --cpu-type=X86O3CPU --caches --cpu-clock=2.4GHz --l1i_size=32kB --l1i_assoc=8 --l1d_size=64kB --l1d_assoc=8 --l2cache --l2_size=1MB --l2_assoc=16 --l3cache --l3_size=16MB --l3_assoc=16 --mem-size=8GB --checkpoint-dir=ckptest -r 1

BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.4.49 #8
Hardware name: , BIOS 06/08/2008
Workqueue: 0x0 (events)
RIP: 0010:set_next_entity+0x9/0x65
Code: 48 89 df 5b 5d 41 5c e9 fb a0 ff ff 59 48 89 df 31 d2 5b 5d 41 5c e9 35 a4 ff ff 58 5b 5d 41 5c c3 41 55 41 54 55 53 48 89 fd <83> 7e 40 00 48 89 f3 74 35 4c 8d 66 18 4c 3b 67 40 4c 8d 6f 38 75
RSP: 0018:ffffc90000037e30 EFLAGS: 0000006e
RAX: 0000000000000000 RBX: ffff888238a26440 RCX: ffffffff81a19d80
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888238a26480
RBP: ffff888238a26480 R08: 00000003e5663c00 R09: 00000000000000ff
R10: 00000000fffbad80 R11: 0000000000000800 R12: 0000000000000000
R13: ffff888238a26480 R14: ffff8882379749b0 R15: 0000000000000000
FS: 00007faa47824700(0000) GS:ffff888238a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000b0 CR3: 00000002358e8000 CR4: 00000000000006f0
Call Trace:
pick_next_task_fair+0xe5/0x18c
__schedule+0x1e3/0x40a
? do_raw_spin_lock+0x2b/0x52
? create_worker+0x16a/0x16a
schedule+0x75/0x9f
worker_thread+0x1e7/0x22f
kthread+0xf0/0xf5
? kthread_destroy_worker+0x39/0x39
ret_from_fork+0x22/0x40
Modules linked in:
CR2: 0000000000000040
---[ end trace 25f0872c331972c4 ]---
BUG: kernel NULL pointer dereference, address: 0000000000000040
RIP: 0010:set_next_entity+0x9/0x65
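As a sanity check on the trace above, the faulting instruction (the bytes starting at the <83> marker in the Code: line) can be decoded by hand. The sketch below assumes the standard x86-64 ModRM encoding and takes the RSI value from the register dump; the table names REG64 and GROUP1 are only illustrative.

```python
# Bytes of the faulting instruction, starting at the <83> marker in "Code:".
faulting = bytes.fromhex("83 7e 40 00")
opcode, modrm, disp8, imm8 = faulting

mod = modrm >> 6           # 0b01: memory operand with an 8-bit displacement
reg = (modrm >> 3) & 0x7   # opcode extension; for opcode 0x83, /7 is CMP
rm = modrm & 0x7           # base register (no REX prefix on this instruction)

REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]

# This sketch only handles the simple form: opcode 0x83, disp8, no SIB byte.
assert opcode == 0x83 and mod == 0b01 and rm != 0b100

rsi = 0x0000000000000000   # RSI as shown in the register dump
fault_addr = rsi + disp8

print(f"{GROUP1[reg]} dword ptr [{REG64[rm]}+{disp8:#x}], {imm8}")
print(f"faulting address: {fault_addr:#018x}")
```

The instruction decodes to a compare of the 32-bit value at [rsi+0x40] against 0. With RSI equal to zero, the access lands at 0x40, which matches the address in the BUG line. RSI holds the second function argument in the System V x86-64 calling convention, so set_next_entity appears to have been called with a NULL second argument.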

In addition, I am sure that lulesh2.0 was running successfully in the guest system while the checkpoint was being generated, according to the output produced by lulesh's -p option.

Can anyone tell me what I should do?
I would really appreciate your help!

Best,
Yongjie
