gem5-users@gem5.org

The gem5 Users mailing list


Can't explain timing result for flush and fence in classical cache hierarchy

Khan Shaikhul Hadi
Thu, Jul 6, 2023 5:47 PM

In my configuration I used CPUTypes.O3 and PrivateL1SharedL2CacheHierarchy
to check how clflush and fence impact the timing of a workload. The
workload runs 10,000 iterations of updating an array value, 200 updates
per thread. In the workload I have:

for (; index < end_index - 1; index++) {
    ARR[index] = thread_ID;
    ARR[index + 1] = thread_ID;
    FENCE;
}

to simulate two consecutive localized write operations and see the
impact of the fence. Insertion of FENCE (a macro that inserts mfence)
increases execution time by 24%. In the second scenario, I have:

for (; index < end_index - 1; index++) {
    ARR[index] = thread_ID;
    FLUSH(&ARR[index]);
    FENCE;
}

FLUSH (a macro for _mm_clflush) should take more time to complete than
ARR[index+1]=thread_ID, because the plain write is highly localized
while the flush needs acknowledgement from all levels of the cache
hierarchy before it completes. So FENCE should pay a much larger penalty
after the flush than after the write, and I expected a much larger
execution time increase from inserting fences in the second scenario.
But inserting the fence only increases execution time by 2%, which is
counterintuitive.
Can anyone explain why I'm seeing this behaviour? As far as I understand,
a memory fence should only let the following instructions execute after
all previous instructions have completed and been removed from the store
buffer, in which case clflush should take more time than a regular write
operation.
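
The macro definitions are not spelled out above; a minimal sketch of how
the two scenarios might look with the intrinsics I mentioned
(_mm_clflush and _mm_mfence) is below. The function names and signatures
here are just for illustration, not the exact code of my workload:

#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */

#define FENCE        _mm_mfence()
#define FLUSH(addr)  _mm_clflush(addr)

/* Scenario 1: two consecutive localized writes, then a fence. */
static void scenario1(volatile long *ARR, long thread_ID,
                      long index, long end_index)
{
    for (; index < end_index - 1; index++) {
        ARR[index]     = thread_ID;
        ARR[index + 1] = thread_ID;
        FENCE;
    }
}

/* Scenario 2: write, flush the same cache line, then a fence. */
static void scenario2(volatile long *ARR, long thread_ID,
                      long index, long end_index)
{
    for (; index < end_index - 1; index++) {
        ARR[index] = thread_ID;
        FLUSH((void *)&ARR[index]);
        FENCE;
    }
}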

Best
Shaikhul

Eliot Moss
Wed, Jul 12, 2023 10:31 PM

On 7/6/2023 1:47 PM, Khan Shaikhul Hadi via gem5-users wrote:

> In my configuration I used CPUTypes.O3 and PrivateL1SharedL2CacheHierarchy
> to check how clflush and fence impact the timing of a workload. The
> workload runs 10,000 iterations of updating an array value, 200 updates
> per thread. In the workload I have:
>
> for (; index < end_index - 1; index++) {
>     ARR[index] = thread_ID;
>     ARR[index + 1] = thread_ID;
>     FENCE;
> }
>
> to simulate two consecutive localized write operations and see the
> impact of the fence. Insertion of FENCE (a macro that inserts mfence)
> increases execution time by 24%. In the second scenario, I have:
>
> for (; index < end_index - 1; index++) {
>     ARR[index] = thread_ID;
>     FLUSH(&ARR[index]);
>     FENCE;
> }
>
> FLUSH (a macro for _mm_clflush) should take more time to complete than
> ARR[index+1]=thread_ID, because the plain write is highly localized
> while the flush needs acknowledgement from all levels of the cache
> hierarchy before it completes. So FENCE should pay a much larger penalty
> after the flush than after the write, and I expected a much larger
> execution time increase from inserting fences in the second scenario.
> But inserting the fence only increases execution time by 2%, which is
> counterintuitive.
> Can anyone explain why I'm seeing this behaviour? As far as I understand,
> a memory fence should only let the following instructions execute after
> all previous instructions have completed and been removed from the store
> buffer, in which case clflush should take more time than a regular write
> operation.

Sorry I am only now seeing this ...

IIRC from my work on improving cache write back / flush behavior,
the gem5 implementation considers the flush complete when the
operation reaches the L1 cache - similar to what happens with
stores.  I agree that from a timing standpoint this is wrong,
which is why I undertook some substantial surgery.  I need to
forward port to more recent releases, do testing, etc., but in
principle have a solution that:

  • Gives line flush instructions timing where they are not complete
    until any write back makes it to the memory bus.

  • Deals with the weaker ordering of clwb and clflushopt (which
    required retooling the store unit queue processing order).

  • Supports invd, wbinvd, and wbnoinvd in addition to the line
    flush operations.

Not sure when I will be able to accomplish putting these together
as patches for the powers that be to review ...
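
In the meantime, one way to check whether flush completion latency is
being modeled at all is to time the two variants directly in the
simulated guest. A rough, hypothetical sketch (using the __rdtscp
intrinsic; this is not part of my changes, just a way to observe the
behavior) would be:

/* Time one store+fence iteration vs. one store+flush+fence iteration.
 * If flushes were modeled as complete only when the write back reaches
 * the memory bus, the second number should be noticeably larger. */
#include <stdio.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <x86intrin.h>   /* __rdtscp */

static volatile long ARR[4096];

int main(void)
{
    unsigned int aux;
    unsigned long long t0, t1, store_cycles = 0, flush_cycles = 0;
    enum { N = 1000 };

    for (long i = 0; i < N; i++) {
        t0 = __rdtscp(&aux);
        ARR[i] = 1;
        _mm_mfence();
        t1 = __rdtscp(&aux);
        store_cycles += t1 - t0;

        t0 = __rdtscp(&aux);
        ARR[i] = 2;
        _mm_clflush((void *)&ARR[i]);
        _mm_mfence();
        t1 = __rdtscp(&aux);
        flush_cycles += t1 - t0;
    }

    printf("store+fence: %llu cycles/iter, store+flush+fence: %llu cycles/iter\n",
           store_cycles / N, flush_cycles / N);
    return 0;
}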

Regards - Eliot Moss

Khan Shaikhul Hadi
Tue, Aug 13, 2024 9:01 PM

Hi,
Does the new version of gem5 address this issue (considering a flush
complete when the operation reaches the L1 cache instead of when the data
has been completely flushed from the system) for the classic cache?

On Wed, Jul 12, 2023 at 6:31 PM Eliot Moss <moss@cs.umass.edu> wrote:

> On 7/6/2023 1:47 PM, Khan Shaikhul Hadi via gem5-users wrote:
>
> > In my configuration I used CPUTypes.O3 and PrivateL1SharedL2CacheHierarchy
> > to check how clflush and fence impact the timing of a workload. The
> > workload runs 10,000 iterations of updating an array value, 200 updates
> > per thread. In the workload I have:
> >
> > for (; index < end_index - 1; index++) {
> >     ARR[index] = thread_ID;
> >     ARR[index + 1] = thread_ID;
> >     FENCE;
> > }
> >
> > to simulate two consecutive localized write operations and see the
> > impact of the fence. Insertion of FENCE (a macro that inserts mfence)
> > increases execution time by 24%. In the second scenario, I have:
> >
> > for (; index < end_index - 1; index++) {
> >     ARR[index] = thread_ID;
> >     FLUSH(&ARR[index]);
> >     FENCE;
> > }
> >
> > FLUSH (a macro for _mm_clflush) should take more time to complete than
> > ARR[index+1]=thread_ID, because the plain write is highly localized
> > while the flush needs acknowledgement from all levels of the cache
> > hierarchy before it completes. So FENCE should pay a much larger penalty
> > after the flush than after the write, and I expected a much larger
> > execution time increase from inserting fences in the second scenario.
> > But inserting the fence only increases execution time by 2%, which is
> > counterintuitive.
> > Can anyone explain why I'm seeing this behaviour? As far as I understand,
> > a memory fence should only let the following instructions execute after
> > all previous instructions have completed and been removed from the store
> > buffer, in which case clflush should take more time than a regular write
> > operation.
>
> Sorry I am only now seeing this ...
>
> IIRC from my work on improving cache write back / flush behavior,
> the gem5 implementation considers the flush complete when the
> operation reaches the L1 cache - similar to what happens with
> stores.  I agree that from a timing standpoint this is wrong,
> which is why I undertook some substantial surgery.  I need to
> forward port to more recent releases, do testing, etc., but in
> principle have a solution that:
>
>   • Gives line flush instructions timing where they are not complete
>     until any write back makes it to the memory bus.
>
>   • Deals with the weaker ordering of clwb and clflushopt (which
>     required retooling the store unit queue processing order).
>
>   • Supports invd, wbinvd, and wbnoinvd in addition to the line
>     flush operations.
>
> Not sure when I will be able to accomplish putting these together
> as patches for the powers that be to review ...
>
> Regards - Eliot Moss
