Performance benchmarks with dual MI60 cards - MLC-LLM delivers impressive speeds for AMD hardware

Quick summary: AMD gfx906 cards like the Radeon VII, Vega II, MI50 and MI60 work great with flash attention, llama.cpp and MLC-LLM. Performance numbers are included below.

Hey folks!

Last month I found two AMD MI60 cards on eBay for $300 each. The 32GB of VRAM was really appealing compared to an RTX 3060 12GB at a similar price - getting almost triple the memory made it worth trying.

I swapped out my dual 3060 12GB setup for these MI60 cards. Many forum posts warned that AMD GPU support is tricky and compiling dependencies can be a nightmare even for experienced users. But ROCm has improved a lot recently.

My first attempt was with vLLM. It compiled, albeit with many warnings, but models wouldn't load due to missing paged attention support in ROCm.

Next I tried another batched inference engine, aphrodite-engine. Compilation failed outright because of the same paged attention issues in ROCm.

ExLlamaV2 compiled fine and loaded Llama 3.1 70B successfully. Speed was disappointing though - only 4.61 tokens/s. The main issue was the missing flash attention library for AMD.
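In case it's useful, loading a model split across the two cards with exllamav2 looks roughly like this - a from-memory sketch of the project's own example scripts, so the model directory is a placeholder and details may differ between versions:

```python
# Rough sketch based on exllamav2's example scripts (not my exact code);
# model_dir is a placeholder for a local EXL2 quant of Llama 3.1 70B.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-3.1-70B-Instruct-exl2"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache so autosplit can spread layers over both GPUs
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

print(generator.generate_simple("Hello, my name is", settings, 100))
```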

I managed to compile flash attention from the ROCm repository for the MI60 (had to modify setup.py line 126 to add “gfx906” to allowed_archs - took 3 hours). It works, but I couldn't figure out how to make exllamav2 use it.
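The change was roughly the following - an illustrative reconstruction, since the exact list contents and line number depend on which revision of the ROCm flash-attention fork you check out:

```python
# setup.py (ROCm flash-attention fork), around the line mentioned above.
# Illustrative reconstruction, not the exact upstream contents:
allowed_archs = ["native", "gfx90a", "gfx940", "gfx941", "gfx942",
                 "gfx906"]  # <- added so the MI60 (gfx906) passes the arch check
```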

Flash attention benchmark results:

python performance_tests/test_flash_attn.py

### causal=False, head_dim=64, batch_sz=32, seq_len=512 ###
Flash2 forward: 49.30 TFLOPs/s, backward: 30.33 TFLOPs/s, combined: 34.08 TFLOPs/s
Pytorch forward: 5.30 TFLOPs/s, backward: 7.77 TFLOPs/s, combined: 6.86 TFLOPs/s
Triton forward: 0.00 TFLOPs/s, backward: 0.00 TFLOPs/s, combined: 0.00 TFLOPs/s

### causal=False, head_dim=64, batch_sz=16, seq_len=1024 ###
Flash2 forward: 64.35 TFLOPs/s, backward: 36.21 TFLOPs/s, combined: 41.38 TFLOPs/s
Pytorch forward: 5.60 TFLOPs/s, backward: 8.48 TFLOPs/s, combined: 7.39 TFLOPs/s
Triton forward: 0.00 TFLOPs/s, backward: 0.00 TFLOPs/s, combined: 0.00 TFLOPs/s

### causal=False, head_dim=128, batch_sz=16, seq_len=1024 ###
Flash2 forward: 70.61 TFLOPs/s, backward: 17.20 TFLOPs/s, combined: 21.95 TFLOPs/s
Pytorch forward: 5.07 TFLOPs/s, backward: 6.51 TFLOPs/s, combined: 6.02 TFLOPs/s

Triton showed zeros because its installation wasn't complete.
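If you just want to confirm the build produces correct results before benchmarking, a minimal check along these lines works (shapes match the second benchmark case above; the head count isn't printed by the script, so 16 is arbitrary; "cuda" is what PyTorch's ROCm build exposes for HIP devices):

```python
# Minimal correctness check for the compiled flash-attention wheel.
# Not the repo's test script - just a quick comparison against PyTorch SDPA.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 16, 1024, 16, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       dtype=torch.float16, device="cuda")
           for _ in range(3))

out_fa = flash_attn_func(q, k, v, causal=False)

# PyTorch's SDPA expects (batch, nheads, seqlen, headdim), so transpose around it.
out_ref = torch.nn.functional.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

print("max abs diff:", (out_fa - out_ref).abs().max().item())  # expect ~1e-3 in fp16
```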

Then I tried llama.cpp, which compiled without issues.

Llama.cpp GGUF performance (first 100 tokens):

| Model | Quantization | Speed (tok/s) |
|---|---|---|
| Qwen2.5-7B-Instruct | Q8_0 | 57.41 |
| Meta-Llama-3.1-8B-Instruct | Q4_K_M | 58.36 |
| Qwen2.5-14B-Instruct | Q8_0 | 27.14 |
| gemma-2-27b-it | Q8_0 | 16.72 |
| Meta-Llama-3.1-70B-Instruct | Q5_K_M | 9.30 |
| Qwen2.5-72B-Instruct | Q5_K_M | 8.90 |
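If you want to reproduce a similar "first 100 tokens" measurement from Python, a sketch like this works with the llama-cpp-python bindings - shown purely as an illustration (the model path and prompt are placeholders, and the bindings need to be built with ROCm support just like llama.cpp itself):

```python
# Illustrative "first 100 tokens" timing via the llama-cpp-python bindings.
# Model path is a placeholder; n_gpu_layers=-1 offloads every layer to the GPUs.
import time
from llama_cpp import Llama

llm = Llama(model_path="/models/Qwen2.5-7B-Instruct-Q8_0.gguf",
            n_gpu_layers=-1, verbose=False)

start = time.time()
out = llm("Explain what a GPU does.", max_tokens=100)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f} s -> {n_tokens / elapsed:.2f} tok/s")
```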

Decent results for the price but I wanted better performance.

Finally I tested MLC-LLM. Installation was just one pip command (assuming ROCm 6.2 is already installed) - a super easy setup, even though AMD has dropped official MI60 support. Not only did it work perfectly, it gave the best speeds! MLC uses its own quantization format, which might be why it's less popular.
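Driving it from Python is also straightforward - this is roughly the quickstart pattern from the MLC docs, reproduced from memory, and the HF:// model ID is one of MLC's prebuilt q4f16_1 conversions, given here as an example:

```python
# Sketch of MLC-LLM's OpenAI-style Python API (from the project's quickstart,
# reproduced from memory - check the MLC docs for the current form).
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"  # example prebuilt q4f16_1 weights
engine = MLCEngine(model)

for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Why are gfx906 cards a good deal?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()
```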

MLC-LLM results:

| Model | Quantization | Speed (tok/s) |
|---|---|---|
| Llama-3-8B-Instruct | q4f16_1 | 81.5 |
| Qwen2.5-7B-Instruct | q4f16_1 | 81.5 |
| Qwen2.5-14B-Instruct | q4f16_1 | 49.9 |
| Qwen2.5-32B-Instruct | q4f16_1 | 23.8 |
| Qwen2.5-72B-Instruct | q4f16_1 | 8.90 |

Really happy with this setup overall - the MI60 makes a solid RTX 3060 alternative. I wish there were more $300 options with 24GB+ VRAM, but this works great for now.

Hope this helps others considering AMD GPUs for inference work.

Note: My MI60s run at 225W instead of 300W due to PSU limitations. Might see 10-20% better performance with full power.

Nice find on those MI60s! I’ve been fighting with AMD setup too - MLC-LLM might be the answer. How’s your memory looking? Are the 70B models eating up all 32GB or do you have room left? And does that custom quantization hurt quality compared to regular GGUF?

Nice work with the MI60! I hit the exact same wall with vLLM and aphrodite-engine on my MI50 - that paged attention issue is everywhere with ROCm. And yeah, had to add gfx906 for flash attention too.

That performance gap really stands out though. 81.5 tok/s vs 58.36 tok/s on similar 8B models? MLC’s clearly got better AMD optimization than llama.cpp. But I’m wondering - does MLC’s q4f16_1 quantization mess with quality compared to Q4_K_M? The speed boost is huge, but how’s it handle longer chats or tricky reasoning?

wow, those mlc-llm numbers look awesome! i’ve been eyeing some mi50s on ebay too, but wasn’t sure about the hassle. are you facing any stability issues during long inference? does the 225w limit affect performance much? how do the temps stack up against your old 3060s?