Hello
I am modelling the following system:
a) Three clusters - big (1 x CPU), Middle (3 x CPU), Little (4 x CPU)
b) All CPUs have private L1I and L1D caches.
c) Each cluster has a shared and unified L2$.
d) A shared and unified L3$, shared between the [middle, little] clusters. The L3$ is modelled as a CHI_Node.
e) 4 x HNF/LLC/Directory
f) 1 x SNF
I am using gem5-21.2.1.0.
An example of the command used to run the lu_ncb benchmark is:
./build/ARM/gem5.opt \
  --outdir=m5out_parsec_lu_ncb_134_8rnf_1snf_4hnf_3_clust_all_shr_l2_sincl_sincl_mincl_debug_ruby_cache \
  --debug-flag=RubyCache \
  configs/example/se_kirin_custom.py \
  --ruby --topology=Crossbar --cpu-type=m1 --num-cpus=8 --num-dirs=1 --num-llc-caches=4 \
  --num-cpu-bigclust=1 --num-cpu-middleclust=3 --num-cpu-littleclust=4 --num-clusters=3 \
  --cpu-type-bigclust=m1 --cpu-type-middleclust=m1 --cpu-type-littleclust=a76 \
  --bigclust-l2cache=shared --middleclust-l2cache=shared --littleclust-l2cache=shared \
  --l1i-size-big=64kB --l1d-size-big=64kB --l1i-assoc-big=4 --l1d-assoc-big=4 \
  --l1i-size-middle=64kB --l1d-size-middle=64kB --l1i-assoc-middle=4 --l1d-assoc-middle=4 \
  --l1i-size-little=64kB --l1d-size-little=64kB --l1i-assoc-little=4 --l1d-assoc-little=4 \
  --l2-size-big=2048kB --l2-assoc-big=8 \
  --l2-size-middle=8192kB --l2-assoc-middle=16 \
  --l2-size-little=8192kB --l2-assoc-little=16 \
  --l3-size=2048kB --l3-assoc=16 \
  --num-bigclust-subclust=1 --num-middleclust-subclust=1 --num-littleclust-subclust=1 \
  --num-cpu-bigclust-subclust2=1 --num-cpu-middleclust-subclust2=1 --num-cpu-littleclust-subclust2=1 \
  --bp-type-littleclust=LTAGE --bp-type-middleclust=LTAGE --bp-type-bigclust=LTAGE \
  --l2-big-clusivity=sincl --l2-middle-clusivity=sincl --l2-little-clusivity=sincl --l3-clusivity=sincl \
  --l2-big-data-latency=12 --l2-middle-data-latency=12 --l2-little-data-latency=12 \
  --l2-big-tag-latency=5 --l2-middle-tag-latency=5 --l2-little-tag-latency=5 \
  --sc-size=1024kB --sc-assoc=16 \
  --l3-data-latency=45 --l3-tag-latency=10 \
  --sc-data-latency=60 --sc-tag-latency=20 --sc-clusivity=mincl \
  --little-mid-clust-add-l3=true \
  --big-cpu-clock=3GHz --middle-cpu-clock=2.6GHz --little-cpu-clock=2GHz --sys-clock=1.1GHz --ruby-clock=2GHz \
  --cacheline_size=64 --verbose=true \
  --cmd=tests/parsec/splash2/lu_ncb/splash2x.lu_ncb.hooks -o ' -p4 -n512 -b16'
I am running the Parsec/Splash2 benchmark suite.
Extracting the stats from the stats.txt file, I have the following:
Benchmark                            Blackscholes      Canneal    Swaptions     Cholesky          FFT          Fmm
Demand L2$ miss, little cluster              7019      9605353         7656      2724902      2930037      1365976
Demand L2$ accesses, little cluster         13506     33101031      1207307      6206252      3511657      3199668
Demand L3$ accesses, total                   7165     10359847         9992      2686126      2929728      1321580

Benchmark                                   Lu_cb       Lu_ncb     Raytrace      Volrend     Water_sq     Water_sp
Demand L2$ miss, little cluster             58955      1026556       594351        93401        24063        11435
Demand L2$ accesses, little cluster        794479      4665754      2471593      1039411       393792       166955
Demand L3$ accesses, total                  54026        51745       131095        22744        12840         8843
If I compare row 1 and row 3, then for the Splash2 benchmarks the number of demand L3$ accesses is lower (in some benchmarks considerably lower) than the number of demand L2$ misses for the little cluster. The little cluster is the main compute cluster.
Question: Why don't all the L2$ misses make their way to the L3$?
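To make the size of the gap concrete, the comparison of row 1 and row 3 can be sketched as a ratio per benchmark. This is only an illustration: the numbers are copied from the table above, and the function names are made up for this example. Note that row 3 is the total over all L3$ requesters while row 1 covers the little cluster only, so a ratio above 1 is possible.

```python
# (little-cluster demand L2$ misses, total demand L3$ accesses),
# copied from the table in this message.
COUNTS = {
    "Blackscholes": (7019, 7165),
    "Canneal": (9605353, 10359847),
    "Swaptions": (7656, 9992),
    "Cholesky": (2724902, 2686126),
    "FFT": (2930037, 2929728),
    "Fmm": (1365976, 1321580),
    "Lu_cb": (58955, 54026),
    "Lu_ncb": (1026556, 51745),
    "Raytrace": (594351, 131095),
    "Volrend": (93401, 22744),
    "Water_sq": (24063, 12840),
    "Water_sp": (11435, 8843),
}


def l3_to_l2miss_ratio(counts):
    """Return {benchmark: L3$ accesses / little-cluster L2$ misses}."""
    return {name: l3 / l2_miss for name, (l2_miss, l3) in counts.items()}


def benchmarks_below(counts, threshold=1.0):
    """Benchmarks whose ratio is below threshold, smallest ratio first."""
    ratios = l3_to_l2miss_ratio(counts)
    return sorted((n for n, r in ratios.items() if r < threshold),
                  key=ratios.get)


if __name__ == "__main__":
    ratios = l3_to_l2miss_ratio(COUNTS)
    for name in sorted(ratios, key=ratios.get):
        print(f"{name:<13}{ratios[name]:8.3f}")
```

Sorting by ratio puts Lu_ncb first, where only about 5% of the little-cluster L2$ misses show up as L3$ accesses.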
In the attachment, I have included my versions of CHI.py, CHI_config.py, config.ini, stats.txt.
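For anyone wanting to repeat the extraction from stats.txt, a minimal sketch of a parser follows. It assumes the usual stats.txt layout of one "name value # description" entry per line between Begin/End marker lines, and it reads only the first dump. The stat names in the sample are hypothetical placeholders, not the names my configuration produces.

```python
def parse_stats(text):
    """Parse stats.txt content into {stat_name: float}, first dump only."""
    stats = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing description
        if not line:
            continue
        if line.startswith("-"):             # Begin/End marker line
            if stats:                        # End of the first dump reached
                break
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            stats[fields[0]] = float(fields[1])
        except ValueError:
            continue                         # skip non-numeric entries
    return stats


# Hypothetical stat names, used here only to exercise the parser.
SAMPLE = """
---------- Begin Simulation Statistics ----------
system.hypothetical_l2_little.demand_misses      1026556    # (hypothetical name)
system.hypothetical_l3.demand_accesses             51745    # (hypothetical name)

---------- End Simulation Statistics   ----------
"""

if __name__ == "__main__":
    for name, value in parse_stats(SAMPLE).items():
        print(name, value)
```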
Any insight greatly appreciated.
Best regards
JO
Hi Javed,
It seems there is a bug in handling CleanUnique requests. From the code (src/mem/ruby/protocol/chi/CHI-cache-transitions.sm):
    transition({I, SC, UC, SD, UD, RU, RSC, RSD, RUSD, RUSC,
                SC_RSC, SD_RSD, SD_RSC, UC_RSC, UC_RU, UD_RU, UD_RSD, UD_RSC}, CleanUnique, BUSY_BLKD) {
      Initiate_Request;
      Initiate_CleanUnique;
      Pop_ReqRdyQueue;
      ProcessNextState;
    }
Profile_Miss/Profile_Hit are not being called, so the stats are not being incremented for a CleanUnique arriving at the L3.
Could you create a JIRA ticket to track this bug?
Also note that some requests that miss in the L2 never go to the L3. For example, if the line is UC/UD in another core's L1, the access always counts as a miss in the L2 because the copy has to be fetched from that core's L1, but no request is generated to the L3.
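The two effects above can be illustrated with a toy accounting model. This is a sketch, not gem5 code: the event labels are invented for this example, and it assumes each L2 miss is resolved in exactly one of three ways (a read sent to the L3, a CleanUnique sent to the L3, or a snoop of another core's L1 within the cluster).

```python
# Hypothetical labels for how an L2 miss is resolved (not gem5 identifiers).
READ_TO_L3 = "read_to_l3"        # miss forwarded to the L3 as a read
CLEAN_UNIQUE = "clean_unique"    # miss forwarded to the L3 as a CleanUnique
PEER_L1_SNOOP = "peer_l1_snoop"  # line is UC/UD in another core's L1


def count_l2_and_l3(l2_miss_events, profile_clean_unique=False):
    """Return (L2 misses, L3 accesses that show up in the L3 stats)."""
    l2_misses = 0
    l3_counted = 0
    for event in l2_miss_events:
        l2_misses += 1
        if event == PEER_L1_SNOOP:
            continue  # served inside the cluster, no request reaches the L3
        if event == CLEAN_UNIQUE and not profile_clean_unique:
            continue  # reaches the L3, but no Profile_Hit/Profile_Miss runs
        l3_counted += 1
    return l2_misses, l3_counted


if __name__ == "__main__":
    events = [READ_TO_L3] * 5 + [PEER_L1_SNOOP] * 3 + [CLEAN_UNIQUE] * 2
    print(count_l2_and_l3(events))                             # (10, 5)
    print(count_l2_and_l3(events, profile_clean_unique=True))  # (10, 7)
```

Even with CleanUnique profiled, the L3 count stays below the L2 miss count whenever some misses are served by a peer L1.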
Thanks,
Tiago
From: Javed Osmany javed.osmany@huawei.com
Sent: Friday, July 29, 2022 5:22 AM
To: gem5 users mailing list gem5-users@gem5.org
Cc: Javed Osmany javed.osmany@huawei.com
Subject: [gem5-users] CHI protocol - Adding an intermediate L3$ between L2$ and LLC (in HNF)
Hello Tiago
Thank you for the explanation.
I will create a Jira ticket for this.
Best regards
J.Osmany
From: Tiago Muck [mailto:Tiago.Muck@arm.com]
Sent: 29 July 2022 20:06
To: gem5 users mailing list gem5-users@gem5.org
Subject: [gem5-users] Re: CHI protocol - Adding an intermediate L3$ between L2$ and LLC (in HNF)