Hi
I am trying to understand the 'mla' and 'sdot' instructions in the ARM SVE ISA. I
am using the gem5 pipeline view with the O3CPUAll debug flag to generate the
trace that Konata needs to create the pipeline view.
I see that mla sometimes takes too many fetch cycles, but sdot almost
always takes the same number of fetch cycles.
Here are the screenshots for reference. It's from an execution where I am
trying to do matrix multiplication.
As you can see in the mla pipeline, the instruction spends about 155 cycles in
the fetch stage, whereas sdot spends only 19 cycles there.
The two kernels are functionally similar.
I am also ruling out an icache miss here (there doesn't seem to be one, and
the program's assembly is very small).
Any idea what is happening, and why?
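To put numbers on the comparison above instead of reading them off the screenshots, the fetch-to-decode gap can be computed per instruction from the trace itself. This is a minimal sketch, not part of gem5 or Konata: it assumes the trace contains O3PipeView records in the form `O3PipeView:fetch:<tick>:<pc>:<upc>:<seq_num>:<disassembly>` followed by `O3PipeView:decode:<tick>`, and that the core runs at 2 GHz with 1 ps ticks (500 ticks per cycle). The function names are hypothetical.

```python
from collections import defaultdict


def fetch_to_decode_cycles(lines, ticks_per_cycle=500):
    """Group fetch-to-decode latency (in cycles) by instruction mnemonic.

    Assumed record layout (verify against your own trace):
      O3PipeView:fetch:<tick>:<pc>:<upc>:<seq_num>:<disassembly>
      O3PipeView:decode:<tick>
    ticks_per_cycle=500 assumes a 2 GHz clock and 1 ps per tick.
    """
    latencies = defaultdict(list)
    pending = None  # (fetch_tick, mnemonic) waiting for its decode record
    for line in lines:
        fields = line.strip().split(":")
        if len(fields) < 3 or fields[0] != "O3PipeView":
            continue  # skip unrelated debug output
        stage = fields[1]
        if stage == "fetch" and len(fields) >= 7:
            # Re-join in case the disassembly itself contains a colon.
            disasm = ":".join(fields[6:]).split()
            mnemonic = disasm[0] if disasm else "?"
            pending = (int(fields[2]), mnemonic)
        elif stage == "decode" and pending is not None:
            decode_tick = int(fields[2])
            fetch_tick, mnemonic = pending
            pending = None
            # A decode tick of 0 is treated as "never reached decode".
            if decode_tick > 0 and decode_tick >= fetch_tick:
                cycles = (decode_tick - fetch_tick) // ticks_per_cycle
                latencies[mnemonic].append(cycles)
    return dict(latencies)


def summarize(latencies):
    """Return {mnemonic: (count, min_cycles, max_cycles)}."""
    return {m: (len(v), min(v), max(v)) for m, v in latencies.items()}
```

Passing an open trace file to `fetch_to_decode_cycles` and the result to `summarize` gives the count, minimum and maximum per mnemonic, which shows whether the long mla fetch is a one-off or repeats on every loop iteration.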
GS Nitesh Narayana https://nitesh8998.gitlab.io/
Department of Computer Architecture
Polytechnic University of Catalonia Barcelona 2021-2025
Webpage: nitesh8998.gitlab.io
Hello Nitesh.
To me, this looks like it is probably an icache miss. Did you check whether there was a cache miss, or did you just assume there was none because the assembly is small?
Could you send me the binary so that I can run it locally and check for myself what the cause could be?
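One way to check for instruction cache misses directly, rather than inferring from code size, is to look at the miss counters in the stats.txt file that gem5 writes to its output directory. This is a minimal sketch under assumptions: each stat line is taken to look like `<name> <value> # <description>`, and the instruction cache object is assumed to have "icache" in its name. Exact stat names differ between gem5 versions and configurations, so the match is deliberately loose, and the function name is hypothetical.

```python
def icache_miss_stats(stats_lines):
    """Collect instruction-cache miss counters from gem5 stats.txt lines.

    Assumes lines of the form '<name> <value> # <description>' and an
    I-cache whose stat names contain 'icache' (an assumption; adjust the
    filter to match your configuration).
    """
    found = {}
    for line in stats_lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue  # blank line or line without a value
        name, value = tokens[0], tokens[1]
        lowered = name.lower()
        if "icache" in lowered and "misses" in lowered:
            try:
                found[name] = float(value)
            except ValueError:
                pass  # not a numeric stat line
    return found
```

A small total does not by itself rule out a miss on the one mla fetch in question, so the counters are best read together with the pipeline trace.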
Best regards
Francisco Carlos Silva Junior
PhD student at the University of Brasília
From: Nitesh Narayana GS nitesh@ac.upc.edu
Sent: Tuesday, August 16, 2022 07:50
To: gem5 users mailing list gem5-users@gem5.org
Subject: [gem5-users] Fetch stage too long for some instructions