/Memory Bandwidth Napkin Math

## Memory Bandwidth Napkin Math

A few weeks ago I had lunch with a co-worker. He was complaining about some process being slow. He estimated how many bytes of data were generated, how many processing passes were performed, and ultimately how many bytes of RAM needed to be accessed. He suggested that modern GPUs with upwards of 500 gigabytes per second of memory bandwidth could eat his problem for breakfast.

I thought his perspective was interesting. That’s not how I think about problems.

I know about the processor-memory performance gap. I know how to write cache friendly code. I know approximate latency numbers. But I don’t know enough to ballpark memory throughput on a napkin.

Here’s my thought experiment. Imagine you have a contiguous array of one billion 32-bit integers in memory. That’s 4 gigabytes. How long will it take to iterate that array and sum the values? How many bytes of contiguous data can a CPU read from RAM per second? How many bytes of random access? How well can it be parallelized?

Now you may think these aren’t useful questions. Real programs are too complicated for such a naive benchmark to be meaningful. You’re not wrong! The real answer is “it depends”.

That said, I think the question is worth a blog post’s worth of exploration. I’m not trying to find the answer. But I do think we can identify some upper and lower bounds, some interesting points in the middle, and hopefully learn a few things along the way.

If you read programming blogs at some point you’ve probably come across “numbers every programmer should know”. It looks something like this.

```L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms
```

Source: Jonas Bonér

This is a great list. I see it on HackerNews at least once a year. Every programmer should know these numbers.

But these numbers are focused on a different question. Latency and throughput are not the same.

That list is from 2012. This post comes from 2020, and times have changed. Here are figures for an Intel i7 via StackOverflow.

```L1 CACHE hit, ~4 cycles(2.1 - 1.2 ns)
L2 CACHE hit, ~10 cycles(5.3 - 3.0 ns)
L3 CACHE hit, line unshared               ~40 cycles(21.4 - 12.0 ns)
L3 CACHE hit, shared line in another core ~65 cycles(34.8 - 19.5 ns)
L3 CACHE hit, modified in another core    ~75 cycles(40.2 - 22.5 ns)
local  RAM                                                   ~60 ns```

Interesting! What’s changed?

• L1 is slower; `0.5 ns -> 1.5 ns`
• L2 is faster; `7 ns -> 4.2 ns`
• L1 vs L2 relative speed is way closer; `2.5x vs 14x` 🤯
• L3 cache is now standard; `12 ns to 40 ns`
• Ram is faster; ` 100ns -> 60ns`

I don’t want to draw too many conclusions from this. It’s not clear how the original figures were calculated. We’re not comparing apples to apples.

Here’s some bandwidth and cache sizes for my CPU courtesy of wikichip.

```Memory Bandwidth: 39.74 gigabytes per second
L1 cache: 192 kilobytes (32 KB per core)
L2 cache: 1.5 megabytes (256 KB per core)
L3 cache: 12 megabytes  (shared; 2 MB per core)
```

Here’s what I’d like to know:

• What is the upper limit of RAM performance?
• What is the lower limit?
• What are the limits for L1/L2/L3 cache?

Let’s run some tests. I put together a naive C++ benchmark. It looks very roughly like this.

```// Generate random elements
std::vector<int> nums;
for (size_t i = 0; i < 1024*1024*1024; ++i) // one billion ints
nums.push_back(rng() % 1024); // small nums to prevent overflow

// Run test with 1 to 12 threads
auto slice_len = nums.size() / thread_count;

// partition data
auto begin = nums.begin() + thread * slice_len;
? nums.end() : begin + slice_len;

futures.push_back(std::async([begin, end] {

// sum ints sequentially
int64_t sum = 0;
for (auto ptr = begin; ptr < end; ++ptr)
sum += *ptr;
return sum;
}));
}

// combine results
int64_t sum = 0;
for (auto& future : futures)
sum += future.get();
}
```

I’m leaving out a few details. But you get the idea. Create a large, contiguous array of elements. Divide the array into non-overlapping chunks. Process each chunk on a different thread. Accumulate the results.

I also want to measure random access. This is tricky. I tried a few methods. Ultimately I chose to pre-compute and shuffle indices. Each index exists exactly one. The inner-loop then iterates indices and computes `sum += nums[index]`.

```std::vector<int> nums = /* ... */;
std::vector<uint32_t> indices = /* shuffled */;

// random access
int64_t sum = 0;
for (auto ptr = indices.begin(); ptr < indices.end(); ++ptr) {
auto idx = *ptr;
sum += nums[idx];
}
return sum;
```

I do not consider the memory of the indices array for my bandwidth calculations. I only count bytes that contribute to `sum`. I’m not benchmarking my hardware. I’m estimating my ability to work with data sets of different sizes and different access patterns.

I performed tests with three data types:

• `int` – basic 32-bit integer
• `matri4x4` – contains `int[16]`; fits on 64-byte cache line
• `matrix4x4_simd` – uses `__m256i` intrinsics

My first test operates on a large block of memory. A `1 GB` block of `T` elements is allocated and filled with small, random values. A simple loop iterates the array N times so it accesses `N GB` worth of memory to produce an `int64_t` sum. Multi-threading partitions the array and access the exact same number of elements.

Tada! This plot takes the average time to sum elements and converts it from `runtime_in_nanoseconds` to `gigabytes_per_second`.

This is pretty neat. `int32` can sequentially read `11 GB/s` on a single thread. It scales linearly until flattening out at `38 GB/s`. `matrix4x4` and `matrix4x4_simd` are faster, but hit a similar ceiling.

There is a clear and obvious upper limit on how much data we can read from RAM per second. On my system it’s roughly `40 GB/s`. This lines up with modern specifications listed above.

Looking at the bottom 3 lines, random access is slow. Very very slow. Single-threaded `int32` performance is a paltry `0.46 GB/s`. This is `24x` slower than its `11.03 GB/s` sequential sum! `matrix4x4` is better since it operates on full cache-lines. But it’s still four to seven times slower than sequential and peaks at just `8 GB/s`.

My system has per-thread L1/L2/L3 cache sizes of `32 KB`, `256 KB`, and `2 MB`. What happens if we take a `32 KB` block of elements and iterate across it 125,000 times? That’s `4 GB` worth of memory access, but we’ll ~always hit the cache.

Awesome! Single-threaded performance is similar to the large block read, `~12 GB/s`. Except this time multi-threading smashes through the `40 GB/s` ceiling. This makes perfect sense. The data stays in cache so the RAM bottleneck doesn’t come into play. Data sizes that don’t fit into L3 cache hit the same `~38 GB/s` ceiling.

`matrix4x4` follows a similar pattern, but is even faster; `31 GB/s` single-thread, `171 GB/s` multi-threaded.

Now let’s look at `matrix4x4_simd`. Pay close attention to the y-axis.

`matrix4x4_simd` is omgwtfbbq fast. It’s upwards of 10x faster than `int32`. When operating on a `16 KB` block it actually breaks `1000 GB/s`. 🤯

Obviously this is a superficial synthetic test. Most applications don’t perform the same operation on the same data a million times in a row. This is not indicative of real world performance.

But the lesson is real. Operating on data inside the cache is fast. With a very high ceiling when using simd; `>100 GB/s` single-thread, `>1000 GB/s` multi-threaded. Getting data into the cache is slow and has a hard cap, `~40 GB/s` total.

Let’s do the same thing, but with random access. This is my favorite part of this post.

Reading random values RAM is slow, `0.46 GB/s`. Reading random values from L1 cache is very very fast, `13 GB/s`. This is faster than the `11 GB/s` performance we got sequentially reading `int32` from RAM.

`matrix4x4` follows the same pattern, but is ~2x faster than `int`.

`matrix4x4_simd` random access is crazy fast.

Random access from RAM is slow. Catastrophically slow. Less than `1 GB/s` slow for both `int32`. Random access from the cache is remarkably quick. It’s comparable to sequential RAM performance.

Let this sink in. Random access into the cache has comparable performance to sequential access from RAM. The drop off from sub-L1 `16 KB` to L2-sized `256 KB` is 2x or less.

I think this has profound implications.

Pointer chasing is bad. Really, really bad. Just how bad is it? I made an extra test that wraps `matrix4x4` in `std::unique_ptr`. Each access has to go through a pointer. Here’s the terrible, horrible, no good, very bad result.

```     1 Thread       |   matrix4x4   | unique_ptr |  diff  |
--------------------|---------------|------------|--------|
Large Block - Seq   |   14.8 GB/s   |  0.8 GB/s  |  19x   |
16 KB - Seq         |   31.6 GB/s   |  2.2 GB/s  |  14x   |
256 KB - Seq        |   22.2 GB/s   |  1.9 GB/s  |  12x   |
Large Block - Rand  |    2.2 GB/s   |  0.1 GB/s  |  22x   |
16 KB - Rand        |   23.2 GB/s   |  1.7 GB/s  |  14x   |
256 KB - Rand       |   15.2 GB/s   |  0.8 GB/s  |  19x   |

6 Threads      |   matrix4x4   | unique_ptr |  diff  |
--------------------|---------------|------------|--------|
Large Block - Seq   |   34.4 GB/s   |  2.5 GB/s  |  14x   |
16 KB - Seq         |  154.8 GB/s   |  8.0 GB/s  |  19x   |
256 KB - Seq        |  111.6 GB/s   |  5.7 GB/s  |  20x   |
Large Block - Rand  |    7.1 GB/s   |  0.4 GB/s  |  18x   |
16 KB - Rand        |   95.0 GB/s   |  7.8 GB/s  |  12x   |
256 KB - Rand       |   58.3 GB/s   |  1.6 GB/s  |  36x   |
```

Sequentially summing values behind a pointer runs at less than `1 GB/s`. Random access, which misses the cache twice, runs at just `0.1 GB/s`.

Pointer chasing is 10 to 20 times slower. Friends don’t let friends used linked lists. Please, think of the children cache.

It’s standard for game developers to set CPU and memory allocation budgets for subsystems. But I’m not sure I’ve ever seen a memory access budget.

Modern games run at increasingly high frame rates. `60 fps` is table stakes. VR runs at `90 Hz`. I have a `144 Hz` gaming monitor. It’s awesome and 60fps feels like crap. I’m never going back. E-Sports/Twitch players use `240 Hz` monitors. Asus unveiled a `360 Hz` monster at CES this year.

My CPU has an upper limit of `~40 GB/s`. That sounds like a lot! At `240 Hz` that’s just `167 MB` per frame. A more realistic application might consume `5 GB/s` at 144 Hz which is just `69 MB` per frame.

Here’s a table with a few figures.

```        |   1   |   10   |   30   |   60   |   90   |  144   |  240   |  360   |
--------|-------|--------|--------|--------|--------|--------|--------|--------|
40 GB/s | 40 GB |  4 GB  | 1.3 GB | 667 MB | 444 MB | 278 MB | 167 MB | 111 MB |
10 GB/s | 10 GB |  1 GB  | 333 MB | 166 MB | 111 MB |  69 MB |  42 MB |  28 MB |
1 GB/s |  1 GB | 100 MB |  33 MB |  17 MB |  11 MB |   7 MB |   4 MB |   3 MB |
```

I believe thinking about problems this way is useful. It can help you realize that some ideas aren’t feasible. Hitting 240 Hz isn’t easy. It won’t happen by accident.

The “Numbers Every Programmer Should Know” list is out of date. It needs to be refreshed for 2020.

Here are a few numbers for my home PC. They’re a blend of AIDA64, Sandra, and my benchmarks. These figures don’t tell the whole story and are just a starting point.

```L1 Latency:    1 ns
L2 Latency:    2.5 ns
L3 Latency:    10 ns
RAM Latency:   50 ns

L1 Bandwidth:  210 GB/s
L2 Bandwidth:  80 GB/s
L3 Bandwidth:  60 GB/s

(whole system)
RAM Bandwidth: 45 GB/s
```

I think it would be cool if there was a small, simple, open source benchmark tool for this. A single C file that could run on desktop, server, mobile, consoles, etc. I’m not the right person to write that tool.

Measuring this stuff is hard. Really hard. My code probably has bugs. There are many compounding factors not discussed. If you want to critize my benchmark methodology you probably aren’t wrong.

Ultimately I think that’s ok. This post isn’t about the precise performance of my home desktop. It’s about framing problems from a certain perspective. And learning how to perfrom some basic napkin math.

A co-worker shared a way of thinking about programming problems I hadn’t considered. That sent me on a journey to explore modern memory performance.

For the purpose of napkin math here’s some ballpark figures for a modern desktop PC.

• RAM Performance
• Upper Limit: `45 GB/s`
• Napkin Estimate: `5 GB/s`
• Lower Limit: `1 GB/s`
• Cache Performance — L1/L2/L3 (per core)
• Upper Limit (w/ simd): `210 GB/s` / `80 GB/s` / `60 GB/s`
• Napkin Estimate: `25 GB/s` / `15 GB/s` / `9 GB/s`
• Lower Limit: `13 GB/s` / `8 GB/s` / `3.5 GB/s`

Napkin estimates come from observed performance with `matrix4x4`. RealCode™ is never that simple. But for napkin math estimates I think this is a reasonable starting point. You will need to adjust based on the access patterns and characteristics of your hardware and your code.

That said, the most important thing I learned is a new way to think about problems. Thinking about bytes-per-second or bytes-per-frame is another lens to look through. It’s a useful tool to keep in my back pocket.