gem5-users@gem5.org

The gem5 Users mailing list


CXL (Compute Express Link) in gem5?

EM
Eliot Moss
Sat, Mar 25, 2023 7:49 PM

I'm wondering what work has been done to model CXL in gem5.

Is it something that can be modeled with existing gem5
components by adjusting their timing and other parameters,
or would modeling it well require new components?

From a quick high-level review of what CXL is (Wikipedia),
I think I'm most interested in CXL.cache (giving a device
high-performance coherent access to memory) and possibly
CXL.mem.  I'm more interested in modeling the performance
than in modeling all the parameter read-out and setup that
would be in CXL.io, as I understand it.

Regards - Eliot Moss

GB
gabriel.busnot@arteris.com
Mon, Mar 27, 2023 10:13 AM

Hi Eliot,

I can’t give you a definitive answer, but I’ve also been looking at CXL recently, so here is what I understand so far.

From a functional perspective, the classic cache system seems able to support the hierarchical coherency aspects just fine with the coherent Xbar of each chip connected to a CPU side port of the other chip’s Xbar. The performance will probably be quite off, though. You could improve on it by implementing a kind of throttle adapter SimObject that would model the CXL link layer between the two Xbars. Snoop performance modeling will remain atomic/blocking just as with any classic cache configuration.
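
To make that concrete, here is a rough, untested sketch of that wiring in a gem5 Python config. The stock Bridge is only a placeholder for the throttle adapter I mean: it adds delay and buffering but does not forward snoops, so a proper CXL link-layer model would need a custom SimObject in its place. All latencies and address ranges below are made up.

    from m5.objects import AddrRange, Bridge, CoherentXBar

    # One coherent crossbar per chip; latencies (in cycles) are guesses.
    host_xbar = CoherentXBar(width=64, frontend_latency=3,
                             forward_latency=4, response_latency=2,
                             snoop_response_latency=4)
    dev_xbar = CoherentXBar(width=64, frontend_latency=3,
                            forward_latency=4, response_latency=2,
                            snoop_response_latency=4)

    # Placeholder for the CXL link layer: a Bridge adding a fixed delay.
    # NOTE: a Bridge does not forward snoops, so this path is not
    # snoop-coherent; swap in a custom adapter for a real model.
    cxl_link = Bridge(delay='25ns',
                      ranges=[AddrRange(0x100000000, size='4GiB')])

    host_xbar.mem_side_ports = cxl_link.cpu_side_port
    cxl_link.mem_side_port = dev_xbar.cpu_side_ports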

As for Ruby, the goal is further away. AFAIK, no protocol supports hierarchical coherency (home node to home node requests, snoopable home node, etc.). If you don’t care too much about these details, then I would argue that configuring any Ruby protocol as usual and configuring your topology to force traffic through a single link could get you closer to a CXL-style configuration. You could also implement a link adapter/bridge component to model the CXL link layer better.
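
If you do go the Ruby route, the topology part might look something like the sketch below, following the shape of the stock classes in configs/topologies/. The class name, the is_cxl_device() predicate, and all latencies are made up; the only point is the single slow IntLink pair standing in for the CXL link.

    from topologies.BaseTopology import SimpleTopology

    class CXLLikeTopology(SimpleTopology):
        description = "CXLLikeTopology"

        def makeTopology(self, options, network, IntLink, ExtLink, Router):
            # One router per side, joined by a single (slow) link pair.
            host_r = Router(router_id=0)
            dev_r = Router(router_id=1)
            network.routers = [host_r, dev_r]

            # self.nodes holds the Ruby controllers; how to tell host
            # controllers from device-side ones is config-specific, so
            # is_cxl_device() here is a hypothetical helper.
            ext_links = []
            for i, node in enumerate(self.nodes):
                r = dev_r if is_cxl_device(node) else host_r
                ext_links.append(ExtLink(link_id=i, ext_node=node,
                                         int_node=r, latency=1))
            network.ext_links = ext_links

            # IntLinks are unidirectional, so one per direction; the
            # latency (cycles) stands in for the CXL link cost.
            network.int_links = [
                IntLink(link_id=0, src_node=host_r, dst_node=dev_r,
                        latency=40),
                IntLink(link_id=1, src_node=dev_r, dst_node=host_r,
                        latency=40),
            ]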

Regards,

Gabriel

EM
Eliot Moss
Tue, Apr 25, 2023 1:56 AM

On 3/27/2023 6:13 AM, gabriel.busnot--- via gem5-users wrote:

Thanks, Gabriel, for your response, now a month ago.  I want to turn my
attention back to this ... :-)

> I can’t give you a definitive answer, but I’ve also been looking at
> CXL recently, so here is what I understand so far.
>
> From a functional perspective, the classic cache system seems able to
> support the hierarchical coherency aspects just fine with the coherent Xbar
> of each chip connected to a CPU side port of the other chip’s Xbar. The
> performance will probably be quite off, though. You could improve on it by
> implementing a kind of throttle adapter SimObject that would model the CXL
> link layer between the two Xbars. Snoop performance modeling will remain
> atomic/blocking just as with any classic cache configuration.

I'm trying to envision doing this in a way that would work.  First, I
interpret you as saying that each component that plays this CXL "game" has its
own coherent Xbar.  You seemed to say that they would be cross connected.
Suppose we have two devices, X and Y.  The mem side of X would be connected to
the cpu side of Y, and mem side of Y to the cpu side of X.  What confuses me
about this is that it seems it would lead to infinite forwarding to mem sides.
It also seems to make it difficult to offer a single point of coherence.

A second arrangement I thought of is that CXL memories could be "level
infinity" caches, i.e., act like caches though the set of lines they hold is
fixed, and their lines are always valid.  Their mem sides would go to a final
coherency Xbar that would serve as the point-of-coherence of the system.  A
CXL memory would always fast-route requests having to do with things outside its
address space to this coherency bus, so that some other memory could
respond.

A third arrangement would be a variation on the second one: CXL memories are
level-infinity caches on the other side of a coherent Xbar "memory bus" with
routing such that each CXL memory gets requests pertaining only to its part of
the physical address space.  A CXL device that has its own cache would connect
like a cpu+cache to the memory bus.  A CXL device that has no cache could
connect directly to the coherent Xbar memory bus.  It is not clear to me how
that is different from the current sort of arrangement.
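
In classic-caches terms, I picture this third arrangement wiring up roughly as below (untested; a stock Cache stands in for the "level infinity" cache, whose fixed, always-valid line set would of course need new C++, and all sizes and latencies are placeholders):

    from m5.objects import AddrRange, Cache, SystemXBar

    membus = SystemXBar()  # the coherent "memory bus" / point of coherence

    # Stand-in for a level-infinity cache covering one CXL memory's
    # slice of the physical address space.
    cxl_mem = Cache(size='4GiB', assoc=16,
                    tag_latency=20, data_latency=20, response_latency=20,
                    mshrs=64, tgts_per_mshr=8,
                    addr_ranges=[AddrRange(0x100000000, size='4GiB')])
    membus.mem_side_ports = cxl_mem.cpu_side

    # A CXL device with its own cache connects like a cpu+cache:
    dev_cache = Cache(size='256KiB', assoc=8,
                      tag_latency=4, data_latency=4, response_latency=4,
                      mshrs=16, tgts_per_mshr=8)
    dev_cache.mem_side = membus.cpu_side_ports
    # A cache-less CXL device would instead connect its request port
    # straight to membus.cpu_side_ports.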

The setup I would like to be able to assemble is this (a rough gem5 sketch
follows the list):

  • Regular cpu cores with a regular L1/L2/L3 cache hierarchy
  • A highly parallel memory system like the Smart Memory Cube [SMC]
  • A processor-in-memory [PIM] that:
    • has more-direct access to the SMC, but that access is still coherent
    • has a private scratch pad memory (non-coherent)
    • has its own cache that is coherent with the regular cores' memory
      hierarchy
    • has its own DMA units that transport data between coherent memory and the
      private scratchpad
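
Roughly, in gem5 Python, I picture the skeleton below. It is untested and
incomplete (no cores, System, or Root); the SMC stand-in is just several DRAM
channels behind the coherent bus, and every size and latency is a placeholder.

    from m5.objects import (AddrRange, Cache, DDR4_2400_8x8, MemCtrl,
                            SimpleMemory, SystemXBar)

    membus = SystemXBar()  # point of coherence shared by cores and PIM

    # SMC stand-in: several independent channels for parallelism.
    channels = []
    for i in range(8):
        mc = MemCtrl(dram=DDR4_2400_8x8(
            range=AddrRange(i * 0x40000000, size='1GiB')))
        mc.port = membus.mem_side_ports
        channels.append(mc)

    # The PIM's cache sits on the same coherent crossbar as the cores'
    # hierarchy, which is what would keep it coherent with them.
    pim_cache = Cache(size='128KiB', assoc=8,
                      tag_latency=2, data_latency=2, response_latency=2,
                      mshrs=16, tgts_per_mshr=8)
    pim_cache.mem_side = membus.cpu_side_ports

    # Private, non-coherent scratchpad: deliberately NOT on membus.
    scratchpad = SimpleMemory(range=AddrRange(0x800000000, size='64MiB'),
                              latency='10ns')
    # The PIM core and its DMA engines (omitted) would reach the
    # scratchpad through a private port and coherent memory through
    # membus.cpu_side_ports.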

I have previously built most of this, but the PIM's cache and DMA were not
coherent, and going through extra protocols to deal with that dragged
performance down.

> As for Ruby, the goal is further away. AFAIK, no protocol supports
> hierarchical coherency (home node to home node requests, snoopable home
> node, etc.). If you don’t care too much about these details, then I would
> argue that configuring any Ruby protocol as usual and configuring your
> topology to force traffic through a single link could get you closer to a
> CXL-style configuration. You could also implement a link adapter/bridge
> component to model the CXL link layer better.

I'm not really interested in Ruby - I've generally "rolled my own", so to
speak.

Maybe it would be useful to set up a Zoom meeting where we can sketch system
diagrams or something!

Best wishes - Eliot
