Empathy List Archives


Javed Osmany

Fri, Jul 29, 2022 10:22 AM

Hello

I am modelling the following system:

a) Three clusters: big (1 x CPU), middle (3 x CPU), little (4 x CPU)

b) All CPUs have private L1I and L1D caches.

c) Each cluster has a shared and unified L2$.

d) Model a shared and unified L3$, shared between [middle, little] clusters. The L3$ is modelled as a CHI_Node.

e) 4 x HNF/LLC/Directory

f) 1 x SNF

I am using gem5-21.2.1.0.

An example of the command used to run the lu_ncb benchmark:
./build/ARM/gem5.opt --outdir=m5out_parsec_lu_ncb_134_8rnf_1snf_4hnf_3_clust_all_shr_l2_sincl_sincl_mincl_debug_ruby_cache --debug-flag=RubyCache configs/example/se_kirin_custom.py --ruby --topology=Crossbar --cpu-type=m1 --num-cpus=8 --num-dirs=1 --num-llc-caches=4 --num-cpu-bigclust=1 --num-cpu-middleclust=3 --num-cpu-littleclust=4 --num-clusters=3 --cpu-type-bigclust=m1 --cpu-type-middleclust=m1 --cpu-type-littleclust=a76 --bigclust-l2cache=shared --middleclust-l2cache=shared --littleclust-l2cache=shared --l1i-size-big=64kB --l1d-size-big=64kB --l1i-assoc-big=4 --l1d-assoc-big=4 --l1i-size-middle=64kB --l1d-size-middle=64kB --l1i-assoc-middle=4 --l1d-assoc-middle=4 --l1i-size-little=64kB --l1d-size-little=64kB --l1i-assoc-little=4 --l1d-assoc-little=4 --l2-size-big=2048kB --l2-assoc-big=8 --l2-size-middle=8192kB --l2-assoc-middle=16 --l2-size-little=8192kB --l2-assoc-little=16 --l3-size=2048kB --l3-assoc=16 --num-bigclust-subclust=1 --num-middleclust-subclust=1 --num-littleclust-subclust=1 --num-cpu-bigclust-subclust2=1 --num-cpu-middleclust-subclust2=1 --num-cpu-littleclust-subclust2=1 --bp-type-littleclust=LTAGE --bp-typemiddleclust=LTAGE --bp-type-bigclust=LTAGE --l2-big-clusivity=sincl --l2-middle-clusivity=sincl --l2-little-clusivity=sincl --l3-clusivity=sincl --l2-big-data-latency=12 --l2-middle-data-latency=12 --l2-little-data-latency=12 --l2-big-tag-latency=5 --l2-middle-tag-latency=5 --l2-little-tag-latency=5 --sc-size=1024kB --sc-assoc=16 --l3-data-latency=45 --l3-tag-latency=10 --sc-data-latency=60 --sc-tag-latency=20 --sc-clusivity=mincl --little-mid-clust-add-l3=true --big-cpu-clock=3GHz --middle-cpu-clock=2.6GHz --little-cpu-clock=2GHz --sys-clock=1.1GHz --ruby-clock=2GHz --cacheline_size=64 --verbose=true --cmd=tests/parsec/splash2/lu_ncb/splash2x.lu_ncb.hooks -o ' -p4 -n512 -b16'

I am running the Parsec/Splash2 benchmark suite.

Extracting the stats from the stats.txt file, I have the following:

Benchmark     L2$ demand misses   L2$ demand accesses   L3$ demand accesses
              (little cluster)    (little cluster)      (total)
Blackscholes               7019                 13506                  7165
Canneal                 9605353              33101031              10359847
Swaptions                  7656               1207307                  9992
Cholesky                2724902               6206252               2686126
FFT                     2930037               3511657               2929728
Fmm                     1365976               3199668               1321580
Lu_cb                     58955                794479                 54026
Lu_ncb                  1026556               4665754                 51745
Raytrace                 594351               2471593                131095
Volrend                   93401               1039411                 22744
Water_sq                  24063                393792                 12840
Water_sp                  11435                166955                  8843

Comparing the L2$ demand misses of the little cluster with the total L3$ demand accesses, the number of L3$ demand accesses is lower than the number of L2$ demand misses for the Splash2 benchmarks, and considerably lower for some of them (e.g. Lu_ncb, Raytrace and Volrend). The little cluster is the main compute cluster.
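The comparison above can be reproduced with a few lines of Python. This is only a sketch: the counts are copied from the figures in this message, and the names (BENCHMARKS, l3_to_l2_miss_ratio, below_l2_misses) are made up for illustration and are not part of gem5.

```python
# Counts copied from the stats quoted in this message (little-cluster L2$
# demand misses and total L3$ demand accesses, one entry per benchmark).
BENCHMARKS = ["Blackscholes", "Canneal", "Swaptions", "Cholesky", "FFT", "Fmm",
              "Lu_cb", "Lu_ncb", "Raytrace", "Volrend", "Water_sq", "Water_sp"]
L2_DEMAND_MISSES = [7019, 9605353, 7656, 2724902, 2930037, 1365976,
                    58955, 1026556, 594351, 93401, 24063, 11435]
L3_DEMAND_ACCESSES = [7165, 10359847, 9992, 2686126, 2929728, 1321580,
                      54026, 51745, 131095, 22744, 12840, 8843]


def l3_to_l2_miss_ratio():
    """Return {benchmark: L3$ demand accesses / L2$ demand misses}."""
    return {name: l3 / miss
            for name, miss, l3 in zip(BENCHMARKS, L2_DEMAND_MISSES,
                                      L3_DEMAND_ACCESSES)}


def below_l2_misses():
    """Benchmarks where the L3$ sees fewer accesses than the L2$ misses."""
    return [name for name, ratio in l3_to_l2_miss_ratio().items()
            if ratio < 1.0]


if __name__ == "__main__":
    for name, ratio in l3_to_l2_miss_ratio().items():
        print(f"{name:<13} L3 accesses / L2 misses = {ratio:.3f}")
    print("L3 accesses below L2 misses:", ", ".join(below_l2_misses()))
```

A ratio well below 1.0 marks the benchmarks where a large share of the L2$ misses does not show up in the L3$ access count.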

Question: Why don't all the L2$ misses make their way to the L3$?

In the attachment, I have included my versions of CHI.py, CHI_config.py, config.ini, stats.txt.

Any insight greatly appreciated.

Best regards
JO



Tiago Muck

Fri, Jul 29, 2022 7:06 PM

Hi Javed,

It seems there is a bug in handling CleanUnique requests. From the code (src/mem/ruby/protocol/chi/CHI-cache-transitions.sm):

transition({I, SC, UC, SD, UD, RU, RSC, RSD, RUSD, RUSC,
            SC_RSC, SD_RSD, SD_RSC, UC_RSC, UC_RU, UD_RU, UD_RSD, UD_RSC}, CleanUnique, BUSY_BLKD) {
  Initiate_Request;
  Initiate_CleanUnique;
  Pop_ReqRdyQueue;
  ProcessNextState;
}

Profile_Miss/Profile_Hit are not being called so the stats are not being incremented for a CleanUnique arriving at the L3.
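The effect of the missing Profile_Miss/Profile_Hit calls can be illustrated with a toy Python model. This is not gem5 code: ToyCache, its methods and its counters are hypothetical names, and the only assumption carried over from the transition above is that the CleanUnique path does no profiling.

```python
from collections import Counter


class ToyCache:
    """Toy stand-in for a cache controller; illustrative only, not gem5."""

    def __init__(self):
        self.lines = set()         # addresses currently cached
        self.stats = Counter()     # what the hit/miss statistics would report
        self.received = Counter()  # requests that actually arrived

    def _profile(self, addr):
        # Plays the role of Profile_Hit / Profile_Miss.
        self.stats["hit" if addr in self.lines else "miss"] += 1

    def read_shared(self, addr):
        self.received["ReadShared"] += 1
        self._profile(addr)        # profiled: shows up in the statistics
        self.lines.add(addr)

    def clean_unique(self, addr):
        self.received["CleanUnique"] += 1
        # No _profile() call here, mirroring the transition quoted above,
        # so this request never reaches the hit/miss statistics.

    def demand_accesses(self):
        return self.stats["hit"] + self.stats["miss"]


if __name__ == "__main__":
    l3 = ToyCache()
    l3.read_shared(0x40)
    l3.clean_unique(0x80)
    l3.clean_unique(0x40)
    print("requests received:", sum(l3.received.values()))
    print("accesses counted: ", l3.demand_accesses())
```

In this sketch three requests arrive but only one is counted, which is the kind of gap that would make the reported L3 accesses smaller than the L2 misses feeding it.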

Could you create a JIRA ticket to track this bug?

Also note that some requests that miss in the L2 never go to the L3. E.g.: if the line is UC/UD in one of the other cores' L1, it will always count as a miss in the L2 because you have to get the copy from the other core's L1, but no request is generated to the L3.
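The second effect can also be sketched in Python. Again this is a toy model under stated assumptions, not the CHI protocol implementation: ToyL2 and its fields are hypothetical, and the model only distinguishes "line held unique (UC/UD) in another core's L1" from "line not present in the cluster".

```python
class ToyL2:
    """Toy shared L2 that tracks which L1 holds a line in a unique state."""

    def __init__(self):
        self.present = set()       # lines the L2 itself holds
        self.l1_unique_owner = {}  # addr -> core whose L1 holds it UC/UD
        self.hits = 0
        self.misses = 0
        self.snoops = 0
        self.l3_requests = 0

    def set_unique_owner(self, addr, core):
        self.l1_unique_owner[addr] = core

    def read(self, addr, core):
        owner = self.l1_unique_owner.get(addr)
        if owner is not None and owner != core:
            # Data must come from the other core's L1: counted as an L2 miss,
            # but nothing is sent to the L3.
            self.misses += 1
            self.snoops += 1
            del self.l1_unique_owner[addr]
            self.present.add(addr)
            return "snoop"
        if addr in self.present:
            self.hits += 1
            return "hit"
        # Line is nowhere in the cluster: miss that does go to the L3.
        self.misses += 1
        self.l3_requests += 1
        self.present.add(addr)
        return "l3"


if __name__ == "__main__":
    l2 = ToyL2()
    l2.set_unique_owner(0x100, core=0)
    print(l2.read(0x100, core=1))  # served by snooping core 0's L1
    print(l2.read(0x200, core=1))  # forwarded to the L3
    print("L2 misses:", l2.misses, "L3 requests:", l2.l3_requests)
```

Here two L2 misses produce only one L3 request, so a workload with a lot of sharing between cores in a cluster can legitimately show fewer L3 accesses than L2 misses.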

Thanks,
Tiago

From: Javed Osmany javed.osmany@huawei.com
Sent: Friday, July 29, 2022 5:22 AM
To: gem5 users mailing list gem5-users@gem5.org
Cc: Javed Osmany javed.osmany@huawei.com
Subject: [gem5-users] CHI protocol - Adding an intermediate L3$ between L2$ and LLC (in HNF)



Javed Osmany

Mon, Aug 1, 2022 6:31 AM

Hello Tiago

Thank you for the explanation.

I will create a Jira ticket for this.

Best regards
J.Osmany
