Flexible Memory Access
async_mmap is TAPA's flexible interface for accessing external memory through the AXI protocol. It exposes the five AXI channels (AR, R, AW, W, B) in C++, giving users maximal control over how they access external memory. The async_mmap feature has several advantages:

- Richer expressiveness and more flexible memory access patterns.
- Runtime burst detection.
- Smaller area overhead.

A successful case is the SPLAG accelerator for single-source shortest path (SSSP), which achieves up to a 4.9× speedup over state-of-the-art SSSP accelerators, up to a 2.6× speedup over a 32-thread CPU running at 4.4 GHz, and up to a 0.9× speedup over an A100 GPU (which has a 4.1× power budget and 3.4× HBM bandwidth).
That being said, async_mmap is relatively more difficult to use than the conventional approach (abstracting external memory as a simple array) because of the extra expressiveness. This tutorial shows examples of how to bring out the full potential of async_mmap.
The Programming Model of async_mmap
The following code shows the definition of async_mmap:
template <typename T>
struct async_mmap {
  using addr_t = int64_t;
  using resp_t = uint8_t;
  tapa::ostream<addr_t> read_addr;
  tapa::istream<T> read_data;
  tapa::ostream<addr_t> write_addr;
  tapa::ostream<T> write_data;
  tapa::istream<resp_t> write_resp;
};
async_mmap abstracts an external memory as an interface consisting of 5 channels/streams/FIFOs, as shown in the following figure.

- On the read side, if we send one address to the read_addr channel, the data of type T stored at that address will appear later in the read_data channel.
- If we send multiple addresses to the read_addr channel, the corresponding data (i.e., the read responses) will appear in order in the read_data channel.
- On the write side, if we (1) send an address to the write_addr channel, and (2) send the corresponding data to the write_data channel, then the data will be written to that address.
- If there are multiple outstanding write requests, they will be committed in order.
- The write_resp channel receives values that indicate how many write transactions have succeeded.
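As a minimal illustration, the following sketch performs one read and one write through these channels. The task name and the literal address are hypothetical, and blocking read()/write() calls are used for brevity; the examples later in this tutorial use the non-blocking variants to stay fully pipelined.

// Hypothetical sketch: touch one element through the five async_mmap channels.
void touch_one_element(tapa::async_mmap<data_t>& mem) {
  // Read the element stored at address 42.
  mem.read_addr.write(42);          // issue the read request
  data_t v = mem.read_data.read();  // wait for the read response

  // Write the (unchanged) element back to address 42.
  mem.write_addr.write(42);         // issue the write address
  mem.write_data.write(v);          // issue the write data
  mem.write_resp.read();            // consume the write acknowledgment
}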
Basic Usage of async_mmap
async_mmap is a special implementation of mmap that should be used only as a formal parameter in lower-level tasks [1]. An async_mmap can be constructed from an mmap, so we can pass an mmap argument to an async_mmap parameter. Due to limitations of the Vitis HLS compiler, async_mmap must be passed by reference, i.e., with &. In contrast, mmap must be passed by value, i.e., without &.
void task1(tapa::async_mmap<data_t>& mem);
void task2(tapa::mmap<data_t> mem);

// Note the &
void task1(tapa::async_mmap<data_t>& mem) {
  // ...
  mem.read_addr.write(...);
  mem.read_data.read();
  // ...
}

// Note no &
void task2(tapa::mmap<data_t> mem) {
  // ...
  mem[i] = foo;
  bar = mem[j];
  // ...
}

void top(tapa::mmap<data_t> mem1, tapa::mmap<data_t> mem2) {
  tapa::task()
      .invoke(task1, mem1)
      .invoke(task2, mem2)
      ;
}
Runtime Burst Detection
mmap (which uses the Vitis HLS #pragma HLS interface m_axi under the hood) is a synchronous memory interface that relies heavily on memory bursts. Without memory bursts, the access pattern looks like the following:

An obvious problem is that the long memory latency (typically 100~200 ns) can result in very low memory throughput. To solve this problem, memory bursts are used extensively, which allows the kernel to receive many pieces of data with a single memory request:

However, memory bursts are only available when the memory access pattern is consecutive. To solve this problem, TAPA async_mmap takes a different approach: it issues multiple outstanding requests at the same time:

Multi-outstanding asynchronous requests are much more efficient than single-outstanding synchronous requests, but for sequential access patterns, accessing memory in large bursts is still significantly more efficient than issuing many small individual transactions to the external memory. For example, reading 4 KB of data in one AXI transaction is much faster than issuing 512 separate 8-byte AXI transactions. Existing HLS tools (e.g., Vitis HLS) generally rely on static analysis to infer bursts, which may generate unpredictable and limited hardware.

Instead, TAPA infers burst transactions at runtime. Users only need to issue individual read/write transactions, and TAPA provides optimized modules that combine and merge sequential transactions into burst transactions at runtime, as sketched below.
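The following fragment is a simplified, hypothetical illustration (not a complete task): the user logic issues one read address per cycle, and because the addresses are consecutive, TAPA's runtime burst logic can coalesce them into AXI burst transactions without any change to the user code. Example 2 below shows the complete pattern, including how the responses are collected.

// Hypothetical fragment: issue one read address per cycle.
// Consecutive addresses can be merged into AXI bursts at runtime by TAPA.
for (int i = 0; i < n; ) {
#pragma HLS pipeline II=1
  if (mem.read_addr.try_write(i)) {  // non-blocking; succeeds if the channel is not full
    ++i;
  }
}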
With asynchronous memory interfaces and runtime burst detection, async_mmap makes it possible to achieve high memory throughput for both sequential and random memory accesses.
Smaller Area Overhead
When interacting with the AXI interface, Vitis HLS buffers entire burst transactions in on-chip memories. For a 512-bit AXI interface, the AXI buffers generated by Vitis HLS cost 15 BRAM_18K each for the read channel and the write channel. This becomes a huge problem for HBM devices, where the bottom SLR is packed with 32 HBM channels and the AXI buffers alone take away more than 900 BRAM_18K from the bottom SLR.

With async_mmap, the read responses are passed directly to the user logic through a stream interface, so the AXI interface has a much smaller area.

The following table shows quantitative results from a microbenchmark:
Memory Interface | Clock/MHz | LUT | FF | BRAM | URAM | DSP
---|---|---|---|---|---|---
mmap | 300 | 1189 | 3740 | 15 | 0 | 0
async_mmap | 300 | 1466 | 162 | 0 | 0 | 0
Example 1: Multi-Outstanding Random Memory Accesses
This example shows how to implement efficient random memory accesses using TAPA. The key point is to allow multiple outstanding memory operations. Even though random memory accesses cannot be merged into bursts, it is still more effective to allow multiple outstanding transactions. In the following example, the issue_read_addr task keeps issuing read requests as long as the AXI interface is ready to accept them, while the receive_read_resp task is only responsible for receiving and processing the responses.
void issue_read_addr(
    tapa::async_mmap<data_t>& mem,
    int n) {
  addr_t random_addr[N];
  for (int i = 0; i < n; ) {
#pragma HLS pipeline II=1
    if (!mem.read_addr.full()) {
      mem.read_addr.try_write(random_addr[i]);
      i++;
    }
  }
}

void receive_read_resp(
    tapa::async_mmap<data_t>& mem,
    int n) {
  for (int i = 0; i < n; ) {
#pragma HLS pipeline II=1
    if (!mem.read_data.empty()) {
      data_t d = mem.read_data.read();
      i++;
      // ...
    }
  }
}

void top(tapa::mmap<data_t> mem, int n) {
  tapa::task()
      // ...
      .invoke(issue_read_addr, mem, n)
      .invoke(receive_read_resp, mem, n)
      // ...
      ;
}
This simple design is actually very hard or infeasible to implement in Vitis HLS. Consider the following Vitis HLS counterpart: the generated hardware will issue one read request, then wait for its response before issuing another read request, so there will be only one outstanding transaction.
// Inferior Vitis HLS code
for (int i = 0; i < n; i++) {
#pragma HLS pipeline II=1
  data_t d = mem[ random_addr[i] ];
  // ... process d
}
Example 2: Sequential Read from async_mmap into an Array
Since the outbound read_addr channel and the inbound read_data channel are separate, we use two iterator variables, i_req and i_resp, to track the progress of each channel. When the number of responses i_resp matches the target count, the loop terminates.

In each loop iteration, we send a new read request if:

- The number of read requests i_req is less than the total request count.
- The read_addr channel of the async_mmap mem is not full.

We increment i_req if we successfully issue a read request.

In each loop iteration, we also check whether the read_data channel has data. If so, we take the data from the read_data channel and store it into an array. We increment i_resp when we receive a new read response.

Note that issuing read addresses and receiving read responses must all be non-blocking so that they can function in parallel.
template <typename mmap_t, typename addr_t, typename data_t>
void async_mmap_read_to_array(
    tapa::async_mmap<mmap_t>& mem,
    data_t* array,
    addr_t base_addr,
    unsigned int count,
    unsigned int stride) {
  for (int i_req = 0, i_resp = 0; i_resp < count;) {
#pragma HLS pipeline II=1
    if (i_req < count &&
        mem.read_addr.try_write(base_addr + i_req * stride)) {
      ++i_req;
    }
    if (!mem.read_data.empty()) {
      array[i_resp] = mem.read_data.read(nullptr);
      ++i_resp;
    }
  }
}
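For reference, the following is a hypothetical invocation of this helper; the task name read_stage and the buffer size are illustrative assumptions rather than part of the TAPA API.

// Hypothetical caller: read 1024 consecutive elements starting at address 0
// into an on-chip buffer, then process them.
void read_stage(tapa::async_mmap<data_t>& mem) {
  data_t buf[1024];
  async_mmap_read_to_array(mem, buf, 0, 1024, 1);  // base address 0, 1024 elements, unit stride
  // ... process buf
}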
Example 3: Sequential Write into async_mmap from a FIFO
Compared to Example 2, this example is slightly more complicated because we are reading from a stream. Therefore, we additionally need to check whether the stream/FIFO is empty before executing an operation.

Note that in this example we don't actually need the data from the write_resp channel. Still, we need to drain the write_resp channel; otherwise its FIFO will become full and block further write operations.
template <typename mmap_t, typename stream_t, typename addr_t, typename count_t, typename stride_t>
void async_mmap_write_from_fifo(
    tapa::async_mmap<mmap_t>& mem,
    tapa::istream<stream_t>& fifo,
    addr_t base_addr,
    count_t count,
    stride_t stride) {
#pragma HLS inline
  for (int i_req = 0, i_resp = 0; i_resp < count;) {
#pragma HLS pipeline II=1
    // issue write requests
    if (i_req < count &&
        !fifo.empty() &&
        !mem.write_addr.full() &&
        !mem.write_data.full()) {
      mem.write_addr.try_write(base_addr + i_req * stride);
      mem.write_data.try_write(fifo.read(nullptr));
      ++i_req;
    }
    // receive acks of write success
    if (!mem.write_resp.empty()) {
      i_resp += unsigned(mem.write_resp.read(nullptr)) + 1;
    }
  }
}
Example 4: Simultaneous Read and Write to async_mmap
This example reads from the external memory, increments the data by 1, and then writes it back to the same device in a fully pipelined fashion. This is another pattern that can hardly be expressed when abstracting the memory as an array. A naive implementation like mem[i] = foo(mem[i]) in Vitis HLS will result in a low-performance implementation with only one outstanding transaction (similar to the situation in Example 1).
void Copy(tapa::async_mmap<Elem>& mem, uint64_t n, uint64_t flags) {
  Elem elem;

  for (int64_t i_rd_req = 0, i_rd_resp = 0, i_wr_req = 0, i_wr_resp = 0;
       i_rd_resp < n || i_wr_resp < n;) {
#pragma HLS pipeline II=1
    bool can_read = !mem.read_data.empty();
    bool can_write = !mem.write_addr.full() && !mem.write_data.full();
    int64_t read_addr = i_rd_req;
    int64_t write_addr = i_wr_req;
    if (i_rd_req < n && mem.read_addr.try_write(read_addr)) {
      ++i_rd_req;
    }
    if (can_read && can_write) {
      mem.read_data.try_read(elem);
      mem.write_addr.write(write_addr);
      mem.write_data.write(elem + 1);
      ++i_rd_resp;
      ++i_wr_req;
    }
    if (!mem.write_resp.empty()) {
      i_wr_resp += mem.write_resp.read(nullptr) + 1;
    }
  }
}