Most digital systems use general-purpose data storage known as Random Access Memory, or RAM. FPGAs typically provide some form of Block RAM resources. In Xilinx FPGAs these are organized as configurable memory segments that can be arranged to make memories with different bit widths and storage capacities.
From the vantage point of RTL design, a RAM is an array. Each element of the array is called a word, and the words all have the same number of bits, called the bit width W or word size. Common word sizes are 8, 16, 32, or 64 bits.
The length of the array is called the depth D. Usually the depth is a power of 2, e.g. 128 words, or 256, 512, 1024, and so on.
A particular position in the array is called the address a, and the number of address bits is N >= log2(D). Usually N = log2(D), but sometimes the memory depth is less than the total addressable range.
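The relationship between depth and address width can be computed at elaboration time instead of by hand. The following is a minimal sketch, assuming a tool that supports the Verilog-2005 $clog2 system function; the module name addr_width_demo and the parameter DEPTH are illustrative and not part of the lab code.

```verilog
// Hypothetical demo module: derives the address width N from the depth D.
module addr_width_demo
  #(parameter DEPTH = 200);   // D: number of words, deliberately not a power of 2

   // N = ceil(log2(D)): the smallest address width that can index every word
   localparam ADDR_WIDTH = $clog2(DEPTH);

   initial
     $display("DEPTH = %0d needs ADDR_WIDTH = %0d", DEPTH, ADDR_WIDTH);
endmodule
```

With DEPTH = 200 the address width comes out to 8 bits, so addresses 200 through 255 are addressable but fall outside the memory. This is the case where the depth is less than the total addressable range.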
At minimum, a RAM needs these I/O ports:
- a (N bits)
- d_in (W bits)
- d_out (W bits)
- rd – read request
- wr – write request

Most RAMs will also require clk and rst_l ports.
A RAM array supports both READ and WRITE operations.

READ:
1. a is first set to the desired address
2. rd goes from 0 to 1
3. the stored word appears on d_out

WRITE:
1. a is first set to the desired address
2. d_in is set to the desired data
3. wr goes from 0 to 1
4. the RAM copies d_in to the corresponding address location

An ambiguity occurs when there are simultaneous READ and WRITE events. In this situation, there are two protocols:
- READ-FIRST: the RAM places the old contents on d_out before overwriting that memory location with d_in.
- WRITE-FIRST: the RAM stores d_in and simultaneously writes d_in onto d_out.

Many RAM circuits (including the Xilinx Block RAM) can select between these protocols. The distinction can be critical for memory-intensive algorithms, especially in high-speed real-time applications like signal processing and graphics imaging.
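To make the second protocol concrete, here is a minimal sketch of a RAM in which a simultaneous READ and WRITE forwards the new data to the output. The module name write_first_RAM is hypothetical, and the port names follow the list above. Whether a given synthesis tool maps this exact coding style onto a block RAM is not guaranteed, so treat it as a behavioral illustration only.

```verilog
// Hypothetical behavioral sketch of the WRITE-FIRST protocol.
module write_first_RAM
  #(parameter W = 8,            // bit width of each word
    parameter N = 8)            // number of address bits
   (input              clk,
    input              rd,
    input              wr,
    input      [N-1:0] a,
    input      [W-1:0] d_in,
    output reg [W-1:0] d_out);

   reg [W-1:0] ram [2**N-1:0];

   always @(posedge clk) begin
      if (wr)
        ram[a] <= d_in;
      if (rd)
        // On a simultaneous write, forward the incoming data to the output;
        // otherwise return the stored word.
        d_out <= wr ? d_in : ram[a];
   end
endmodule
```

Replacing the conditional expression with a plain ram[a] read gives the other protocol, because the nonblocking write does not take effect until after the clock edge.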
RAMs can use a register to buffer the output data. Buffering is important if the RAM lies in the system’s critical path, meaning the whole system clock rate is limited by the RAM speed. Registering the output shortens the signal propagation delay between clock edges, which allows a faster clock.
The drawback to buffering is that the RAM’s output data is delayed by a full clock cycle. The RAM (and the whole system) can run at a higher clock rate, but every result arrives one cycle later. To visualize this, suppose we perform 100 consecutive READ operations. The data from the first READ will be delayed by a clock cycle, but after that initial delay the RAM will deliver one word per clock cycle. The initial delay is called the latency, and the average output rate is called the throughput.
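The buffering idea can be sketched as a single register stage placed after the RAM's read data. This is only an illustration of the concept: the module name output_buffer_demo and the signal name ram_out are assumptions, and the lab design may organize the buffer differently.

```verilog
// Hypothetical sketch: one register stage on the RAM read path.
module output_buffer_demo
  #(parameter DATA_WIDTH = 8)
   (input                       clk,
    input      [DATA_WIDTH-1:0] ram_out,   // unbuffered read data from the RAM
    output reg [DATA_WIDTH-1:0] d_out);    // buffered copy, one clock cycle later

   // Each clock edge captures the current read data, so d_out always lags
   // ram_out by exactly one cycle (the added latency).
   always @(posedge clk)
     d_out <= ram_out;
endmodule
```

A new word can still enter the register on every clock edge, so the throughput remains one word per cycle even though each individual word arrives a cycle late.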
You will design and implement a READ-FIRST Buffered RAM and demonstrate it using the Basys3 board.
The Vivado synthesis tool is able to infer a RAM if the module is written in the format shown here:
    module single_port_RAM
      #(parameter DATA_WIDTH=8,
        parameter ADDR_WIDTH=8
       )
       (input clk,
        input rd,
        input wr,
        input [ADDR_WIDTH-1:0] addr,
        input [DATA_WIDTH-1:0] d_in,
        output reg [DATA_WIDTH-1:0] d_out
       );

       // The Memory array:
       reg [DATA_WIDTH-1:0] ram[2**ADDR_WIDTH-1:0];

       always @(posedge clk) begin
          // READ-FIRST: when rd and wr are both 1, d_out receives the old word,
          // because the nonblocking write takes effect after this clock edge.
          if (rd)
             d_out <= ram[addr];
          if (wr)
             ram[addr] <= d_in;
       end
    endmodule
In this code segment, the bit width W is declared as parameter DATA_WIDTH, and the address bit width N is declared as parameter ADDR_WIDTH. The memory depth D is 2^N; in Verilog the exponential operation is a double asterisk, as in 2**ADDR_WIDTH.
The RAM itself is represented as a Verilog array:

    reg [DATA_WIDTH-1:0] ram[2**ADDR_WIDTH-1:0];

This line declares an array of bit vectors. Each bit vector has width DATA_WIDTH. There are a total of 2**ADDR_WIDTH vectors in the array. The individual vectors are addressed using C-style array indexing:

    d_out <= ram[addr];

The above line returns DATA_WIDTH bits from the memory at the location pointed to by addr.
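Because both dimensions are parameters, the same module can describe memories of different shapes. As a hedged illustration, the wrapper below instantiates single_port_RAM (the module shown earlier) as a 1024-word by 16-bit memory; the wrapper name ram_1k_x16 and the instance name u_ram are made up for this example.

```verilog
// Hypothetical wrapper: overrides the parameters of single_port_RAM.
module ram_1k_x16
   (input         clk,
    input         rd,
    input         wr,
    input  [9:0]  addr,
    input  [15:0] d_in,
    output [15:0] d_out);

   // ADDR_WIDTH = 10 gives 2**10 = 1024 words; DATA_WIDTH = 16 bits per word
   single_port_RAM #(.DATA_WIDTH(16), .ADDR_WIDTH(10)) u_ram
     (.clk  (clk),
      .rd   (rd),
      .wr   (wr),
      .addr (addr),
      .d_in (d_in),
      .d_out(d_out));
endmodule
```

Named parameter overrides keep the instantiation readable and do not depend on the order in which the parameters were declared.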
Make a Verilog module for the RAM design. Create a testbench to verify the RAM. For every address 0 to 255, do these steps, each in a separate clock cycle:
1. Assign a value to d_in
2. WRITE (wr <= 1, rd <= 0)
3. READ (wr <= 0, rd <= 1)
4. Print d_in and d_out (using $write)
5. Assign a new value to d_in
6. Simultaneous WRITE and READ (wr <= 1, rd <= 1)
7. Print d_in and d_out
8. READ (wr <= 0, rd <= 1)
9. Print d_in and d_out again

You will need to make a state machine for these tests, since the steps unfold across several clock cycles. Note that wr and rd should be assigned 0 except when performing a WRITE or READ.
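One possible shape for such a state machine is sketched below. It covers only the first write-then-read check and leaves the simultaneous WRITE/READ steps to be added as further states. It assumes the single_port_RAM module shown earlier; the testbench name ram_tb_sketch, the state encoding, the 10 ns clock period, and the use of $random for test data are all choices made for this example.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench skeleton: write a word, read it back, print both.
module ram_tb_sketch;
   reg        clk = 0;
   reg        rd = 0, wr = 0;
   reg  [7:0] addr = 0;
   reg  [7:0] d_in = 0;
   wire [7:0] d_out;
   reg  [2:0] state = 0;

   single_port_RAM #(.DATA_WIDTH(8), .ADDR_WIDTH(8)) dut
     (.clk(clk), .rd(rd), .wr(wr), .addr(addr), .d_in(d_in), .d_out(d_out));

   always #5 clk = ~clk;   // 10 ns clock period (arbitrary)

   always @(posedge clk) begin
      // Default: no request. A later assignment in the same cycle overrides this.
      wr <= 0;
      rd <= 0;
      case (state)
        0: begin d_in <= $random; state <= 1; end   // choose test data
        1: begin wr <= 1;         state <= 2; end   // WRITE request
        2: begin rd <= 1;         state <= 3; end   // READ request
        3: state <= 4;                              // RAM samples rd on this edge
        4: begin                                    // d_out is now valid
           $write("Address %h: d_in = %h d_out = %h\n", addr, d_in, d_out);
           if (addr == 8'hff) $finish;
           addr  <= addr + 1;
           state <= 0;
        end
      endcase
   end
endmodule
```

The extra wait in state 3 is needed because the testbench drives rd with a nonblocking assignment, so the RAM does not see the request until the following clock edge, and d_out is updated only after that edge.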
Here is a partial output from my testbench:
Address e3
1: d_in = 90 d_out = 90 <-- write then read
2: d_in = 51 d_out = 90 <-- write/read simultaneous
3: d_in = 51 d_out = 51 <-- read
Address e4
1: d_in = 69 d_out = 69 <-- write then read
2: d_in = 77 d_out = 69 <-- write/read simultaneous
3: d_in = 77 d_out = 77 <-- read
Address e5
1: d_in = 4a d_out = 4a <-- write then read
2: d_in = d8 d_out = 4a <-- write/read simultaneous
3: d_in = d8 d_out = d8 <-- read
This output demonstrates that the output data matches the input data, and that the RAM carries out read-before-write when read/write occur simultaneously.
Create XDC and build.tcl files with these mappings:
- addr – the upper eight switches
- d_in – the lower eight switches
- d_out – the lower eight LEDs
- rd – btnU
- wr – btnD

After synthesis is complete, open the utilization report and note the use of Block RAM (BRAM) resources.
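The mappings listed above can also be expressed in RTL with a thin top-level wrapper, as an alternative to renaming pins directly in the XDC file. This is only a sketch: the module name basys3_ram_top and the port names clk, sw, led, btnU, and btnD are assumptions that must match whatever names your XDC file uses, and the instantiated module should be replaced by your own buffered RAM.

```verilog
// Hypothetical top-level wrapper for the board mappings listed above.
module basys3_ram_top
   (input         clk,
    input  [15:0] sw,      // slide switches
    input         btnU,    // read request
    input         btnD,    // write request
    output [7:0]  led);    // lower eight LEDs

   single_port_RAM #(.DATA_WIDTH(8), .ADDR_WIDTH(8)) u_ram
     (.clk  (clk),
      .rd   (btnU),
      .wr   (btnD),
      .addr (sw[15:8]),    // upper eight switches select the address
      .d_in (sw[7:0]),     // lower eight switches supply the data
      .d_out(led));
endmodule
```

Since rd and wr act as level-sensitive enables, holding a button simply repeats the same operation on every clock cycle, which is harmless for this demonstration.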
Implement the design and demonstrate it on the Basys3 board. Record a short video demonstrating your design. In the video, you should:
Upload the video to Canvas.
Turn in your work using git:
git add case* src/*.v *.v *.rpt *.txt *.tcl *.bit *.xdc
git commit . -m "Complete"
git push origin master