gem5-users@gem5.org

The gem5 Users mailing list


Sanity check on HIP execution on GCN3

Anoop Mysore
Tue, Jul 11, 2023 6:40 PM

In the default configuration of apu_se.py:

  1. There are 4 CUs

Are the following statements correct?
2a. Each CU can hold 40 waveform contexts (10 per SIMD unit)
2b. Each CU has 4 SIMD units
3a. Each waveform has 64 threads -- to be executed in lockstep
3b. Each SIMD unit has 16 lanes
4. Combining 3a and 3b -- a waveform will invoke all 4 SIMD units (= 64
lanes) in lockstep. So all SIMD units will necessarily always execute the
same instruction at any given cycle (unless there's a scalar instruction,
or in case of a divergence where some threads are masked off).

If so, what's the use of having 4 distinct 16-lane SIMD units over 1
64-lane SIMD unit?

Figure from
http://old.gem5.org/wiki/images/1/19/AMD_gem5_APU_simulator_isca_2018_gem5_wiki.pdf
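
(As a quick sanity check on the arithmetic those defaults imply -- a minimal Python sketch; the variable names are illustrative, not the actual apu_se.py option names:)

# Back-of-the-envelope occupancy numbers for the defaults described above
# (values taken from statements 1-3b; names are illustrative, not gem5 options).
num_cus = 4          # compute units (statement 1)
simds_per_cu = 4     # SIMD units per CU (statement 2b)
wfs_per_simd = 10    # wavefront contexts per SIMD (statement 2a)
wf_size = 64         # work-items per wavefront (statement 3a)
simd_width = 16      # lanes per SIMD unit (statement 3b)

wf_contexts_per_cu = simds_per_cu * wfs_per_simd              # 40 per CU
lanes_per_cu = simds_per_cu * simd_width                      # 64 lanes per CU
max_resident_work_items = num_cus * wf_contexts_per_cu * wf_size  # 10240

print(wf_contexts_per_cu, lanes_per_cu, max_resident_work_items)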

Matt Sinclair
Tue, Jul 11, 2023 8:26 PM

Hi Anoop,

Broadly I would say the answer to all of your questions 1-4 is "yes" --
this is how the default/baseline gem5 GPU configuration works.  Except they
are "wavefronts", not "waveforms".

In terms of why 4 16-wide units instead of 1 64-wide unit, this comes down
to design decisions (also would be interested to see if Brad or Matt P have
a different opinion here).  One could have a 64-wide unit instead.  But in
terms of why that is not how it's done, vectors are not guaranteed to be
fully occupied.  So, the longer/wider we make a vector, the harder it will
be to fully utilize the entire vector all the time.  And in vector
processing, to get efficiency we want to fill all the lanes all the time if
possible.  I believe at a hardware level having a wider vector also has
some area implications, although I'm not an expert on that part.

Now, by having 4 16-wide units we reduce pressure on this -- we only need
to fill 16-wide units instead of a 64-wide unit.  Moreover, if a given
wavefront does not have enough work to fill all 4 16-wide units, we can
also run a different wavefront on them at the same time.
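
A toy illustration of that last point (purely illustrative Python, not gem5 code): assume four ready wavefronts that each, because of divergence, only have 16 active work-items at the moment. Four independent 16-lane units can issue one wavefront each and keep all 64 lanes busy, whereas a single 64-lane unit can issue only one of them per cycle:

# Toy model: lanes kept busy in one cycle, where each unit issues at most
# one wavefront per cycle and a wavefront cannot be split across units.
# (Illustrative only; real GCN3 issue/arbitration is more involved.)
def busy_lanes(active_per_wavefront, unit_width, num_units):
    issued = active_per_wavefront[:num_units]
    return sum(min(active, unit_width) for active in issued)

ready = [16, 16, 16, 16]          # four wavefronts, 16 active lanes each
print(busy_lanes(ready, 64, 1))   # one 64-wide unit   -> 16 lanes busy
print(busy_lanes(ready, 16, 4))   # four 16-wide units -> 64 lanes busy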

So ultimately I think logically you can think of them behaving the way you
are assuming.  But the reasons why are related more to how real hardware
works than what is simplest logically.

Hope this helps,
Matt S.

Poremba, Matthew
Tue, Jul 11, 2023 9:46 PM


Hi Anoop,

One small correction for #3a/#4: GCN3 hardware executes one wavefront (= 64 work-items) over 4 cycles per SIMD. That is 16 work-items per cycle per SIMD.  I think gem5 emulates this by having instructions take 4 cycles even though it reads/writes all 64 work-items’ data values at once in the instruction implementations.
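
(A minimal sketch of that arithmetic, purely illustrative:)

# One wavefront is issued to a single 16-lane SIMD over several cycles
# (illustrative arithmetic only, not gem5 code).
wf_size = 64                            # work-items per wavefront
simd_width = 16                         # lanes per SIMD unit
issue_cycles = wf_size // simd_width    # = 4 cycles per instruction per SIMD
print(issue_cycles)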

-Matt

Anoop Mysore
Thu, Jul 13, 2023 12:51 PM

Very cool! Thanks for the explanation.

Anoop Mysore
Tue, Sep 19, 2023 7:44 PM

Followup:
The scalar (data) cache within the CU is of the same type as the SQC
(which is an instruction cache with no code to process writes).
Are writes to the scalar cache not handled? The ISA defines instructions
like s_store_dword (though I have not come across any in the disassembly
of the kernel code).

TIA
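
(A rough, hedged sketch of how one might scan a kernel disassembly for scalar stores -- the dump file name and format below are hypothetical:)

# Scan a kernel disassembly dump for scalar store mnemonics
# (dump produced e.g. with llvm-objdump / roc-obj; "kernel.disasm" is a
# hypothetical file name).
import re

scalar_store = re.compile(r"\bs_store_dword(x2|x4)?\b")

with open("kernel.disasm") as f:
    hits = [line.rstrip() for line in f if scalar_store.search(line)]

print(f"{len(hits)} scalar store instruction(s) found")
for line in hits:
    print(line)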

Matt Sinclair
Tue, Sep 19, 2023 7:48 PM

Yes, we have a separate instantiation for the scalar cache that handles
scalar instructions.

Matt

Anoop Mysore
Tue, Sep 19, 2023 8:17 PM

I'm sorry, could you elaborate?
I do recognize that there are two instantiations of SQC_Controller -- one
for the SQC and one for the scalar cache. Both of these seem to instantiate
the same read-only cache defined in
src/mem/ruby/protocol/GPU_VIPER-SQC.sm. What do you mean by a separate
instantiation?
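
(One rough way to check what actually got instantiated -- a sketch assuming a finished run has written m5out/config.ini; the exact controller type strings may differ:)

# Count the controller types instantiated in a finished gem5 run by
# scanning m5out/config.ini (path and exact type strings may vary).
from collections import Counter

types = Counter()
with open("m5out/config.ini") as f:
    for line in f:
        line = line.strip()
        if line.startswith("type="):
            types[line.split("=", 1)[1]] += 1

for name, count in sorted(types.items()):
    if "Controller" in name:
        print(name, count)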

Matt Sinclair
Wed, Sep 20, 2023 4:35 PM

Just wanted to acknowledge that I now understand your issue.  I've been
trying to write a test to validate this, but unfortunately I cannot find an
existing test that uses s_store_dword (or s_store_dwordx2), and when I write
a microbenchmark for it with inline assembly for s_store_dword, it fails
on the real GPU ... do you have a test that actually uses s_store_dword?

Matt

Anoop Mysore
Thu, Sep 21, 2023 7:48 AM

No, I have not come across that instruction in any of the benchmarks I've
run so far; I learned about it from the GCN3 ISA document
https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf
which describes the scalar data cache as a read-write cache in GCN3 (it was
read-only in GCN2, so SQC_Controller would have had no issues there).

On Wed, Sep 20, 2023 at 6:35 PM Matt Sinclair mattdsinclair.wisc@gmail.com
wrote:

Just wanted to acknowledge that I now understand your issue.  I've been
trying to write a test to validate, but unfortunately I cannot find a test
we have that uses s_store_dword (or s_store_dwordx2) and when I am writing
a microbenchmark for this with inlined assembly for s_store_dword, it fails
on the real GPU ... do you have a test that actually uses s_store_dword?

Matt

On Tue, Sep 19, 2023 at 3:18 PM Anoop Mysore mysanoop@gmail.com wrote:

I'm sorry, could you elaborate?
I do recognize that there are two instantiations of SQC_Controller -- one
for the SQC cache and one for the Scalar cache. Both of these seem to be
instantiating the same read-only cache that's in
src/mem/ruby/protocol/GPU_VIPER-SQC.sm. What do you mean by a separate
instantiation?

On Tue, Sep 19, 2023 at 9:48 PM Matt Sinclair <
mattdsinclair.wisc@gmail.com> wrote:

Yes, we have a separate instantiation for the scalar cache that handles
scalar instructions.

Matt

On Tue, Sep 19, 2023 at 2:44 PM Anoop Mysore mysanoop@gmail.com wrote:

Followup:
The scalar (data) cache within the CU -- is of the same type as the SQC
(which is an instruction cache with no code in it to process writes).
Are writes to the scalar cache not handled? I see instructions like
s_store_dword being possible in the ISA (though I have not come across any
in the disassembly of the kernel codes).

TIA

On Thu, Jul 13, 2023 at 2:51 PM Anoop Mysore mysanoop@gmail.com
wrote:

Very cool! Thanks for the explanation.

On Tue, Jul 11, 2023 at 11:46 PM Poremba, Matthew <
Matthew.Poremba@amd.com> wrote:


Hi Anoop,

One small correction for #3a/#4: GCN3 hardware executes one wavefront
(= 64 work-items) over 4 cycles per SIMD. That is 16 work-items per cycle
per SIMD.  I think gem5 emulates this by having instructions take 4 cycles
even though it reads/writes all 64 work-items’ data values at once in the
instruction implementations.

-Matt

From: Matt Sinclair mattdsinclair.wisc@gmail.com
Sent: Tuesday, July 11, 2023 1:27 PM
To: The gem5 Users mailing list gem5-users@gem5.org
Cc: Anoop Mysore mysanoop@gmail.com; Poremba, Matthew <
Matthew.Poremba@amd.com>
Subject: Re: [gem5-users] Sanity check on HIP execution on GCN3


Hi Anoop,

Broadly I would say the answer to all of your questions 1-4 are "yes"
-- this is how the default/baseline gem5 GPU configuration works.  Except
they are "wavefronts", not "waveforms".

In terms of why 4 16-wide units instead of 1 64-wide unit, this comes
down to design decisions (also would be interested to see if Brad or Matt P
have a different opinion here).  One could have a 64-wide unit instead.
But in terms of why that is not how it's done, vectors are not guaranteed
to be fully occupied.  So, the longer/wider we make a vector, the harder it
will be to fully utilize the entire vector all the time.  And in vector
processing, to get efficiency we want to fill all the lanes all the time if
possible.  I believe at a hardware level having a wider vector also has
some area implications too, although I'm not an expert at that part.

Now, by having 4 16-wide units we reduce pressure on this -- we only
need to fill 16-wide units instead of a 64-wide unit.  Moreover, if a given
wavefront does not have enough work to fill all 4 16-wide units, we can
also run a different wavefront on them at the same time.

So ultimately I think logically you can think of them behaving the
way you are assuming.  But the reasons why are related more to how real
hardware works than what is simplest logically.

Hope this helps,

Matt S.

On Tue, Jul 11, 2023 at 1:44 PM Anoop Mysore via gem5-users <
gem5-users@gem5.org> wrote:

In the default configuration of apu_se.py:

  1. There are 4 CUs

Are the following statements correct?

2a. Each CU can hold 40 waveform contexts (10 per SIMD unit)

2b. Each CU has 4 SIMD units

3a. Each waveform has 64 threads -- to be executed in lockstep

3b. Each SIMD unit has 16 lanes

  4. Combining 3a and 3b -- a waveform will invoke all 4 SIMD units (=
    64 lanes) in lockstep. So all SIMD units will necessarily always execute
    the same instruction at any given cycle (unless there's a scalar
    instruction, or in case of a divergence where some threads are masked off).

If so, what's the use of having 4 distinct 16-lane SIMD units over 1
64-lane SIMD unit?

Figure from
http://old.gem5.org/wiki/images/1/19/AMD_gem5_APU_simulator_isca_2018_gem5_wiki.pdf


gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-leave@gem5.org

MS
Matt Sinclair
Thu, Sep 21, 2023 3:38 PM

Ok, glad you are not blocked on this.  I'll keep trying in the background
to figure out how to write a program that uses these instructions -- else
we can't really test that any support works -- but thanks for bringing this
to our attention.

Matt

On Thu, Sep 21, 2023 at 2:48 AM Anoop Mysore mysanoop@gmail.com wrote:

No, I have not come across that instruction in any of the benchmarks I've
run so far; I learned about the instruction from the GCN3 ISA document
https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf
which describes the scalar data cache as a read-write cache in GCN3 (it was
read-only in GCN2, so SQC_Controller would have had no issues there).

On Wed, Sep 20, 2023 at 6:35 PM Matt Sinclair <
mattdsinclair.wisc@gmail.com> wrote:

Just wanted to acknowledge that I now understand your issue.  I've been
trying to write a test to validate, but unfortunately I cannot find a test
we have that uses s_store_dword (or s_store_dwordx2) and when I am writing
a microbenchmark for this with inlined assembly for s_store_dword, it fails
on the real GPU ... do you have a test that actually uses s_store_dword?

Matt

On Tue, Sep 19, 2023 at 3:18 PM Anoop Mysore mysanoop@gmail.com wrote:

I'm sorry, could you elaborate?
I do recognize that there are two instantiations of SQC_Controller --
one for the SQC cache and one for the Scalar cache. Both of these seem to
be instantiating the same read-only cache that's in
src/mem/ruby/protocol/GPU_VIPER-SQC.sm. What do you mean by a separate
instantiation?

On Tue, Sep 19, 2023 at 9:48 PM Matt Sinclair <
mattdsinclair.wisc@gmail.com> wrote:

Yes, we have a separate instantiation for the scalar cache that handles
scalar instructions.

Matt

On Tue, Sep 19, 2023 at 2:44 PM Anoop Mysore mysanoop@gmail.com
wrote:

Followup:
The scalar (data) cache within the CU -- is of the same type as the
SQC (which is an instruction cache with no code in it to process writes).
Are writes to the scalar cache not handled? I see instructions like
s_store_dword being possible in the ISA (though I have not come across any
in the disassembly of the kernel codes).

TIA

On Thu, Jul 13, 2023 at 2:51 PM Anoop Mysore mysanoop@gmail.com
wrote:

Very cool! Thanks for the explanation.

On Tue, Jul 11, 2023 at 11:46 PM Poremba, Matthew <
Matthew.Poremba@amd.com> wrote:


Hi Anoop,

One small correction for #3a/#4: GCN3 hardware executes one
wavefront (= 64 work-items) over 4 cycles per SIMD. That is 16 work-items
per cycle per SIMD.  I think gem5 emulates this by having instructions take
4 cycles even though it reads/writes all 64 work-items’ data values at once
in the instruction implementations.

-Matt

From: Matt Sinclair mattdsinclair.wisc@gmail.com
Sent: Tuesday, July 11, 2023 1:27 PM
To: The gem5 Users mailing list gem5-users@gem5.org
Cc: Anoop Mysore mysanoop@gmail.com; Poremba, Matthew <
Matthew.Poremba@amd.com>
Subject: Re: [gem5-users] Sanity check on HIP execution on GCN3


Hi Anoop,

Broadly I would say the answer to all of your questions 1-4 are
"yes" -- this is how the default/baseline gem5 GPU configuration works.
Except they are "wavefronts", not "waveforms".

In terms of why 4 16-wide units instead of 1 64-wide unit, this
comes down to design decisions (also would be interested to see if Brad or
Matt P have a different opinion here).  One could have a 64-wide unit
instead.  But in terms of why that is not how it's done, vectors are not
guaranteed to be fully occupied.  So, the longer/wider we make a vector,
the harder it will be to fully utilize the entire vector all the time.  And
in vector processing, to get efficiency we want to fill all the lanes all
the time if possible.  I believe at a hardware level having a wider vector
also has some area implications too, although I'm not an expert at that
part.

Now, by having 4 16-wide units we reduce pressure on this -- we only
need to fill 16-wide units instead of a 64-wide unit.  Moreover, if a given
wavefront does not have enough work to fill all 4 16-wide units, we can
also run a different wavefront on them at the same time.

So ultimately I think logically you can think of them behaving the
way you are assuming.  But the reasons why are related more to how real
hardware works than what is simplest logically.

Hope this helps,

Matt S.

On Tue, Jul 11, 2023 at 1:44 PM Anoop Mysore via gem5-users <
gem5-users@gem5.org> wrote:

In the default configuration of apu_se.py:

  1. There are 4 CUs

Are the following statements correct?

2a. Each CU can hold 40 waveform contexts (10 per SIMD unit)

2b. Each CU has 4 SIMD units

3a. Each waveform has 64 threads -- to be executed in lockstep

3b. Each SIMD unit has 16 lanes

  4. Combining 3a and 3b -- a waveform will invoke all 4 SIMD units (=
    64 lanes) in lockstep. So all SIMD units will necessarily always execute
    the same instruction at any given cycle (unless there's a scalar
    instruction, or in case of a divergence where some threads are masked off).

If so, what's the use of having 4 distinct 16-lane SIMD units over 1
64-lane SIMD unit?

Figure from
http://old.gem5.org/wiki/images/1/19/AMD_gem5_APU_simulator_isca_2018_gem5_wiki.pdf


gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-leave@gem5.org
