Quick summary: AMD cards like the Radeon VII, VEGA II, MI50, and MI60 (gfx906/gfx907) work well with flash attention, llama.cpp, and MLC-LLM. Performance numbers below.
Hey folks!
Last month I found two AMD MI60 cards on eBay for $300 each. The 32GB VRAM was really appealing compared to RTX 3060 12GB at similar prices. Getting almost triple the memory made it worth trying.
I swapped out my dual 3060 12GB setup for these MI60 cards. Many forum posts warned that AMD GPU support is tricky and compiling dependencies can be a nightmare even for experienced users. But ROCm has improved a lot recently.
My first attempt was vLLM. It compiled (with plenty of warnings), but models wouldn't deploy because ROCm is missing paged attention support.
Next I tried another batch-processing engine, aphrodite-engine. Compilation failed outright, again because of the paged attention gaps in ROCm.
Exllamav2 compiled fine and loaded Llama 3.1 70B successfully, but speed was disappointing: only 4.61 tokens/s. The main culprit is the missing flash attention library for AMD.
I managed to compile flash attention from the ROCm repository for the MI60 (I had to modify setup.py line 126 to add "gfx906" to allowed_archs; the build took 3 hours). It works, but I couldn't figure out how to make exllamav2 use it.
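For anyone repeating this, the change itself is tiny. Here's a sketch of what the edit looks like; `allowed_archs` is the variable name from the file, but the list contents below are placeholders, and the exact line number will drift between releases, so check the actual setup.py:

```python
# Hypothetical sketch of flash-attention's arch gate in setup.py.
# "allowed_archs" is the real variable name per the post; the list
# contents below are placeholders, not the actual file contents.
allowed_archs = ["native", "gfx90a", "gfx942"]

# The one-line change: let the build target MI50/MI60/Radeon VII silicon.
allowed_archs.append("gfx906")

print("gfx906" in allowed_archs)  # True
```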
Flash attention benchmark results:
```
python performance_tests/test_flash_attn.py
### causal=False, head_dim=64, batch_sz=32, seq_len=512 ###
Flash2 forward: 49.30 TFLOPs/s, backward: 30.33 TFLOPs/s, combined: 34.08 TFLOPs/s
Pytorch forward: 5.30 TFLOPs/s, backward: 7.77 TFLOPs/s, combined: 6.86 TFLOPs/s
Triton forward: 0.00 TFLOPs/s, backward: 0.00 TFLOPs/s, combined: 0.00 TFLOPs/s
### causal=False, head_dim=64, batch_sz=16, seq_len=1024 ###
Flash2 forward: 64.35 TFLOPs/s, backward: 36.21 TFLOPs/s, combined: 41.38 TFLOPs/s
Pytorch forward: 5.60 TFLOPs/s, backward: 8.48 TFLOPs/s, combined: 7.39 TFLOPs/s
Triton forward: 0.00 TFLOPs/s, backward: 0.00 TFLOPs/s, combined: 0.00 TFLOPs/s
### causal=False, head_dim=128, batch_sz=16, seq_len=1024 ###
Flash2 forward: 70.61 TFLOPs/s, backward: 17.20 TFLOPs/s, combined: 21.95 TFLOPs/s
Pytorch forward: 5.07 TFLOPs/s, backward: 6.51 TFLOPs/s, combined: 6.02 TFLOPs/s
```
(Triton shows zeros because my Triton installation wasn't complete.)
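For context, those TFLOPs/s figures come from dividing a nominal FLOP count by wall time. The flash-attention benchmarks count a non-causal forward pass as `4 * batch * seqlen^2 * nheads * head_dim` FLOPs (the QK^T matmul plus the attention-times-V matmul). A minimal sketch of that bookkeeping, with `nheads=16` as an assumed value since the script output above doesn't show it:

```python
# FLOP count for one attention forward pass, using the convention from
# the flash-attention benchmark scripts: 4 * b * s^2 * h * d for
# non-causal, halved for causal. nheads=16 below is an assumption.
def attn_fwd_flops(batch, seqlen, nheads, head_dim, causal=False):
    flops = 4 * batch * seqlen**2 * nheads * head_dim
    return flops // 2 if causal else flops

tflop = attn_fwd_flops(32, 512, 16, 64) / 1e12
print(f"{tflop:.3f} TFLOP per forward pass")  # 0.034 TFLOP
```

Divide that by the measured forward time and you get the TFLOPs/s numbers in the table.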
Then I tried llama.cpp, which compiled without issues.
Llama.cpp GGUF performance (first 100 tokens):
| Model | Quantization | Speed (tok/s) |
|---|---|---|
| Qwen2.5-7B-Instruct | Q8_0 | 57.41 |
| Meta-Llama-3.1-8B-Instruct | Q4_K_M | 58.36 |
| Qwen2.5-14B-Instruct | Q8_0 | 27.14 |
| gemma-2-27b-it | Q8_0 | 16.72 |
| Meta-Llama-3.1-70B-Instruct | Q5_K_M | 9.30 |
| Qwen2.5-72B-Instruct | Q5_K_M | 8.90 |
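These decode speeds line up roughly with what memory bandwidth predicts: token generation is bandwidth-bound, since each token has to stream the full weights from VRAM. A rough sanity check, assuming the MI60's spec-sheet 1 TB/s HBM2 bandwidth and an approximate 50 GB file size for the 70B Q5_K_M GGUF (both numbers are my assumptions):

```python
MI60_BANDWIDTH_GBPS = 1024  # GB/s, spec-sheet peak (assumption)

def ceiling_tok_s(model_size_gb, bandwidth_gbps=MI60_BANDWIDTH_GBPS):
    """Upper bound on decode tokens/s if every token reads all weights once."""
    return bandwidth_gbps / model_size_gb

# ~50 GB for Llama-3.1-70B Q5_K_M (assumption); measured above: 9.30 tok/s.
print(ceiling_tok_s(50))  # 20.48 tok/s theoretical ceiling
```

Landing at roughly half the theoretical ceiling is typical for real workloads, so nothing looks broken here.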
Decent results for the price but I wanted better performance.
Finally I tested MLC-LLM. Installation was a single pip command (assuming ROCm 6.2 is already installed), which made for a super easy setup even though AMD has dropped official MI60 support. Not only did it work, it gave the best speeds of everything I tried. MLC uses its own quantization format, which might be why it's less popular.
MLC-LLM results:
| Model | Quantization | Speed (tok/s) |
|---|---|---|
| Llama-3-8B-Instruct | q4f16_1 | 81.5 |
| Qwen2.5-7B-Instruct | q4f16_1 | 81.5 |
| Qwen2.5-14B-Instruct | q4f16_1 | 49.9 |
| Qwen2.5-32B-Instruct | q4f16_1 | 23.8 |
| Qwen2.5-72B-Instruct | q4f16_1 | 8.90 |
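Comparing the models that appear in both tables (not apples-to-apples on output quality, since q4f16_1 is a different quantization than Q8_0 or Q5_K_M, but it shows the throughput gap):

```python
# (MLC-LLM tok/s, llama.cpp tok/s) for the overlapping models above.
pairs = {
    "Qwen2.5-7B":  (81.5, 57.41),
    "Qwen2.5-14B": (49.9, 27.14),
    "Qwen2.5-72B": (8.90, 8.90),
}
for name, (mlc, gguf) in pairs.items():
    print(f"{name}: {mlc / gguf:.2f}x")  # 1.42x, 1.84x, 1.00x
```

Interestingly, the advantage disappears at 72B, where both engines are presumably up against the same memory-bandwidth wall.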
Really happy with this setup overall. MI60 makes a solid RTX 3060 alternative. Wish there were more $300 options with 24GB+ VRAM but this works great for now.
Hope this helps others considering AMD GPUs for inference work.
Note: My MI60s run at 225W instead of 300W due to PSU limitations. Might see 10-20% better performance with full power.