Beser, Nicholas D.
Sat, Oct 5, 2024 3:48 PM
I am teaching an advanced computer architecture class and had the class run the GPU example that was run in the 2024 bootcamp:
docker run --volume $(pwd):$(pwd) -w $(pwd) ghcr.io/gem5/gcn-gpu:v24-0 gem5/build/VEGA_X86/gem5.opt gem5/configs/example/apu_se.py -n 3 --gfx-version=gfx902 -c gem5-resources/src/gpu/square/bin/square
The example ran; however, the stats.txt file contained two sets of simulation statistics. The second did not appear to have any activity on the CUs. Can someone tell me why the simulation had two runs? We are using the first run as the GPU simulation statistics.
We also ran the simulation while varying the number of CUs and did not see much change in performance. I thought this was due to the benchmark that was run. One of my students modified the benchmark to use more threads, but we still did not see much change. My thought was that this was again due to the benchmark: the resources it requires do not stress the 4 CUs, and increasing the CU count to a larger number did not stress them either.
Nick
Matt Sinclair
Sun, Oct 6, 2024 12:57 AM
Hi Nicholas,
Really glad to hear these GPU tests are useful for your class! I am not in
front of a terminal, so I can't confirm every single thing, but here is
what I think is happening:
- You mention there are 2 sets of stats. This is potentially because a recent commit (https://github.com/gem5/gem5/pull/1217) added support for each GPU kernel to dump and reset the stats.
- Why are there 2 sets of stats if only 1 kernel seems to be launched? Well, GPUs have special kernels that are not visible to users and that do things like DMA operations (e.g., hipMemcpys), copying kernel code, etc. This is what is happening in your case. Specifically, there is a Blit/SDMA kernel happening (probably doing a DMA operation). The above commit made it so these special kernels keep stats separately, because otherwise the hits and misses would not line up for the "real" kernels (e.g., DMA operations would affect the L2, but would not have any activity on the CUs).
- However, you mentioned that the second set of stats was the "empty" one (with no activity on the CUs). This is slightly surprising, as I would have expected the first one to be empty (e.g., because it was copying the kernel to the GPU). But perhaps in your case the second set of stats is for a hipMemcpy after the "real" kernel... or it's just the CPU portion of the program after the kernel completes (e.g., https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L79). Based on the context you provided, the latter sounds more likely. In any event, the stats for the phase without CU activity can effectively be ignored if you only want to look at the GPU phase. You could also consider putting m5_work_begin and m5_work_end markers in the code to help ensure stats from outside the ROI are not included (e.g., https://github.com/gem5/gem5-resources/blob/stable/src/gpu/pannotia/color/coloring_maxmin.cpp#L183); a rough sketch of this follows below. Also, to verify how many GPU phases are actually happening, you could run with the GPUKernelInfo debug flag -- this prints basically only for each new GPU kernel ("real" GPU kernel or Blit/SDMA kernel). If there is a Blit/SDMA kernel, there should be (at least) 2 kernels launched.
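  Here is a rough sketch of what I mean by the ROI markers. This is not the exact square.cpp code, just an illustration; it assumes you build inside the gcn-gpu docker image and link against gem5's m5ops library so that m5_work_begin/m5_work_end are available:

  #include <hip/hip_runtime.h>
  #include <gem5/m5ops.h>   // m5_work_begin / m5_work_end (requires linking gem5's libm5)
  #include <vector>
  #include <cstdio>

  // Roughly the square kernel from square.cpp (grid-stride loop).
  __global__ void vector_square(float* C_d, const float* A_d, size_t N) {
      size_t offset = blockIdx.x * blockDim.x + threadIdx.x;
      size_t stride = blockDim.x * gridDim.x;
      for (size_t i = offset; i < N; i += stride)
          C_d[i] = A_d[i] * A_d[i];
  }

  int main() {
      const size_t N = 1024 * 1024;            // hypothetical size, not square.cpp's default
      const unsigned blocks = 512, threadsPerBlock = 256;

      std::vector<float> A(N, 2.0f), C(N, 0.0f);
      float *A_d, *C_d;
      hipMalloc(&A_d, N * sizeof(float));
      hipMalloc(&C_d, N * sizeof(float));
      hipMemcpy(A_d, A.data(), N * sizeof(float), hipMemcpyHostToDevice);

      m5_work_begin(0, 0);                      // ROI start: just before the kernel launch
      hipLaunchKernelGGL(vector_square, dim3(blocks), dim3(threadsPerBlock), 0, 0,
                         C_d, A_d, N);
      hipDeviceSynchronize();                   // make sure the kernel has finished
      m5_work_end(0, 0);                        // ROI end: everything after this is CPU-only

      hipMemcpy(C.data(), C_d, N * sizeof(float), hipMemcpyDeviceToHost);
      std::printf("C[0] = %f\n", C[0]);
      hipFree(A_d);
      hipFree(C_d);
      return 0;
  }

  With markers like these, the stats for everything outside the ROI (boot, copies, CPU cleanup) can be separated from the kernel itself, depending on how your config handles the work-begin/work-end events.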
- Finally, in terms of the input size and baseline GPU configuration: you are right that the baseline GPU configuration is not a particularly large GPU. My group has an artifact we're releasing with an upcoming MICRO paper that models something more substantive, which I can point you to, but in the meantime let me explain what is happening. Increasing the number of threads by itself is not going to increase the amount of work being done in square. Instead, the GPU kernel's work depends on the size of the input array (https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L45), which gets set here (https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L54). Increasing the number of threads without increasing the work the kernel does will just result in threads that almost immediately exit the GPU kernel, because they attempt to access indices that are out of bounds, which the loop I linked above ignores. Also, square in its default state is really sized for running very quick validation tests in gem5's daily regression. So, instead, you'd need to change line 54 to increase N (and then increase the number of work groups) if you want to make square run something larger. Probably we could also update the code that determines the work groups (https://github.com/gem5/gem5-resources/blob/stable/src/gpu/square/square.cpp#L72) to be configured more directly from the input size; a sketch of that idea follows below.
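  For example, something along these lines. Again, this is not the current square.cpp code, just a sketch; the default N, the command-line handling, and the work-group math are illustrative assumptions:

  #include <cstdio>
  #include <cstdlib>

  int main(int argc, char* argv[]) {
      // Hypothetical command-line override, e.g. "./square 16777216" for a larger run.
      // The default here stands in for "a small regression-sized problem"; it is not
      // the exact value currently hard-coded in square.cpp.
      size_t N = (argc > 1) ? std::strtoull(argv[1], nullptr, 10) : (1024 * 1024);

      const size_t threadsPerBlock = 256;
      // Derive the number of work-groups from N (rounding up) instead of hard-coding
      // it, so a bigger input automatically launches more work-groups.
      const size_t blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

      std::printf("N = %zu, blocks = %zu, threadsPerBlock = %zu\n",
                  N, blocks, threadsPerBlock);
      // ... hipMalloc A_d/C_d, hipMemcpy the input, then launch vector_square with
      //     dim3(blocks), dim3(threadsPerBlock) exactly as square.cpp already does ...
      return 0;
  }

  The point is just that once N drives the number of work-groups, making the input larger gives the additional CUs something to do, which should make the CU-count sweep your class ran more interesting.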
- One last thing you may consider: when I have run the GPU portion of the gem5 tutorials and bootcamps in the past, I've used other architectural features such as register allocation to demonstrate performance impact for applications with very short runtimes. For example, you may consider the example here (https://youtu.be/1a9Yj-QaQoo?t=5388) from the 2022 bootcamp with square, which I subsequently updated to run with MFMA (AMD's equivalent to TensorCore) operations in the 2024 bootcamp (https://github.com/gem5bootcamp/2024/blob/main/slides/04-GPU-model/gpu-slides.pdf, slide 58).
Hope this helps,
Matt
Poremba, Matthew
Tue, Oct 8, 2024 3:03 PM
Re: the two GPU stats sections. I believe square running on an APU will only have one kernel, since it does not need to DMA. The two stats sections are most likely (1) system boot until the end of the first (and only) kernel, via the path Matt pointed out, and (2) the stat dump gem5 does at exit. The second section only captures the application between the last kernel ending and gem5 exiting. Usually this is just application cleanup/teardown and maybe a verification step, all of which run on the CPU, so the GPU stats would be zeros.
-Matt