Performance benchmarks with dual MI60 cards - MLC-LLM delivers impressive speeds for AMD hardware

Quick summary: AMD gfx906 cards like the Radeon VII, Vega II, MI50 and MI60 work great with flash attention, llama.cpp and MLC-LLM. Performance numbers are included below.

Hey folks!

Last month I found two AMD MI60 cards on eBay for $300 each. The 32GB of VRAM was really appealing compared to an RTX 3060 12GB at a similar price - getting almost triple the memory made it worth trying.

I swapped out my dual 3060 12GB setup for these MI60 cards. Many forum posts warned that AMD GPU support is tricky and compiling dependencies can be a nightmare even for experienced users. But ROCm has improved a lot recently.

My first attempt was with vLLM. It compiled, albeit with many warnings, but models wouldn't load due to missing paged attention support in ROCm.

Next I tried another batched inference engine, aphrodite-engine. Compilation failed outright because of the same paged attention issues in ROCm.

ExLlamaV2 compiled fine and loaded Llama 3.1 70B successfully. Speed was disappointing though - only 4.61 tokens/s. The main issue was the missing flash attention library for AMD.
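In case it's useful, loading a model split across the two cards with exllamav2 looks roughly like this - a from-memory sketch of the project's own example scripts, so the model directory is a placeholder and details may differ between versions:

```python
# Rough sketch based on exllamav2's example scripts (not my exact code);
# model_dir is a placeholder for a local EXL2 quant of Llama 3.1 70B.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-3.1-70B-Instruct-exl2"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache so autosplit can spread layers over both GPUs
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

print(generator.generate_simple("Hello, my name is", settings, 100))
```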

I managed to compile flash attention from the ROCm repository for the MI60 (had to modify setup.py line 126 to add “gfx906” to allowed_archs - took 3 hours). It works, but I couldn't figure out how to make exllamav2 use it.
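The change was roughly the following - an illustrative reconstruction, since the exact list contents and line number depend on which revision of the ROCm flash-attention fork you check out:

```python
# setup.py (ROCm flash-attention fork), around the line mentioned above.
# Illustrative reconstruction, not the exact upstream contents:
allowed_archs = ["native", "gfx90a", "gfx940", "gfx941", "gfx942",
                 "gfx906"]  # <- added so the MI60 (gfx906) passes the arch check
```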

Flash attention benchmark results:

python performance_tests/test_flash_attn.py

### causal=False, head_dim=64, batch_sz=32, seq_len=512 ###
Flash2 forward: 49.30 TFLOPs/s, backward: 30.33 TFLOPs/s, combined: 34.08 TFLOPs/s
Pytorch forward: 5.30 TFLOPs/s, backward: 7.77 TFLOPs/s, combined: 6.86 TFLOPs/s
Triton forward: 0.00 TFLOPs/s, backward: 0.00 TFLOPs/s, combined: 0.00 TFLOPs/s

### causal=False, head_dim=64, batch_sz=16, seq_len=1024 ###
Flash2 forward: 64.35 TFLOPs/s, backward: 36.21 TFLOPs/s, combined: 41.38 TFLOPs/s
Pytorch forward: 5.60 TFLOPs/s, backward: 8.48 TFLOPs/s, combined: 7.39 TFLOPs/s
Triton forward: 0.00 TFLOPs/s, backward: 0.00 TFLOPs/s, combined: 0.00 TFLOPs/s

### causal=False, head_dim=128, batch_sz=16, seq_len=1024 ###
Flash2 forward: 70.61 TFLOPs/s, backward: 17.20 TFLOPs/s, combined: 21.95 TFLOPs/s
Pytorch forward: 5.07 TFLOPs/s, backward: 6.51 TFLOPs/s, combined: 6.02 TFLOPs/s

Triton showed zeros because its installation wasn't complete.
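If you just want to confirm the build produces correct results before benchmarking, a minimal check along these lines works (shapes match the second benchmark case above; the head count isn't printed by the script, so 16 is arbitrary; "cuda" is what PyTorch's ROCm build exposes for HIP devices):

```python
# Minimal correctness check for the compiled flash-attention wheel.
# Not the repo's test script - just a quick comparison against PyTorch SDPA.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 16, 1024, 16, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       dtype=torch.float16, device="cuda")
           for _ in range(3))

out_fa = flash_attn_func(q, k, v, causal=False)

# PyTorch's SDPA expects (batch, nheads, seqlen, headdim), so transpose around it.
out_ref = torch.nn.functional.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)

print("max abs diff:", (out_fa - out_ref).abs().max().item())  # expect ~1e-3 in fp16
```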

Then I tried llama.cpp, which compiled without issues.

Llama.cpp GGUF performance (first 100 tokens):

| Model | Quantization | Speed (tok/s) |
|---|---|---|
| Qwen2.5-7B-Instruct | Q8_0 | 57.41 |
| Meta-Llama-3.1-8B-Instruct | Q4_K_M | 58.36 |
| Qwen2.5-14B-Instruct | Q8_0 | 27.14 |
| gemma-2-27b-it | Q8_0 | 16.72 |
| Meta-Llama-3.1-70B-Instruct | Q5_K_M | 9.30 |
| Qwen2.5-72B-Instruct | Q5_K_M | 8.90 |
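If you want to reproduce a similar "first 100 tokens" measurement from Python, a sketch like this works with the llama-cpp-python bindings - shown purely as an illustration (the model path and prompt are placeholders, and the bindings need to be built with ROCm support just like llama.cpp itself):

```python
# Illustrative "first 100 tokens" timing via the llama-cpp-python bindings.
# Model path is a placeholder; n_gpu_layers=-1 offloads every layer to the GPUs.
import time
from llama_cpp import Llama

llm = Llama(model_path="/models/Qwen2.5-7B-Instruct-Q8_0.gguf",
            n_gpu_layers=-1, verbose=False)

start = time.time()
out = llm("Explain what a GPU does.", max_tokens=100)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f} s -> {n_tokens / elapsed:.2f} tok/s")
```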

Decent results for the price but I wanted better performance.

Finally I tested MLC-LLM. Installation was just one pip command (assuming ROCm 6.2 is already installed) - a super easy setup, even though AMD has dropped official MI60 support. Not only did it work perfectly, it gave the best speeds! MLC uses its own quantization format, which might be why it's less popular.
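Driving it from Python is also straightforward - this is roughly the quickstart pattern from the MLC docs, reproduced from memory, and the HF:// model ID is one of MLC's prebuilt q4f16_1 conversions, given here as an example:

```python
# Sketch of MLC-LLM's OpenAI-style Python API (from the project's quickstart,
# reproduced from memory - check the MLC docs for the current form).
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"  # example prebuilt q4f16_1 weights
engine = MLCEngine(model)

for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Why are gfx906 cards a good deal?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()
```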

MLC-LLM results:

| Model | Quantization | Speed (tok/s) |
|---|---|---|
| Llama-3-8B-Instruct | q4f16_1 | 81.5 |
| Qwen2.5-7B-Instruct | q4f16_1 | 81.5 |
| Qwen2.5-14B-Instruct | q4f16_1 | 49.9 |
| Qwen2.5-32B-Instruct | q4f16_1 | 23.8 |
| Qwen2.5-72B-Instruct | q4f16_1 | 8.90 |

Really happy with this setup overall - the MI60 makes a solid RTX 3060 alternative. I wish there were more $300 options with 24GB+ VRAM, but this works great for now.

Hope this helps others considering AMD GPUs for inference work.

Note: My MI60s run at 225W instead of 300W due to PSU limitations. Might see 10-20% better performance with full power.

Nice find on those MI60s! I’ve been fighting with AMD setup too - MLC-LLM might be the answer. How’s your memory looking? Are the 70B models eating up all 32GB or do you have room left? And does that custom quantization hurt quality compared to regular GGUF?

Nice work with the MI60! I hit the exact same wall with vLLM and aphrodite-engine on my MI50 - that paged attention issue is everywhere with ROCm. And yeah, had to add gfx906 for flash attention too.

That performance gap really stands out though. 81.5 tok/s vs 58.36 tok/s on similar 8B models? MLC’s clearly got better AMD optimization than llama.cpp. But I’m wondering - does MLC’s q4f16_1 quantization mess with quality compared to Q4_K_M? The speed boost is huge, but how’s it handle longer chats or tricky reasoning?

wow, those mlc-llm numbers look awesome! i’ve been eyeing some mi50s on ebay too, but wasn’t sure about the hassle. are you facing any stability issues during long inference? does the 225w limit affect performance much? how do the temps stack up against your old 3060s?