Let's start with Q4_K_M. An example of a model available in this format is mistralai/mistral-7b-instruct-v0.3.
Let's go through each part of the name:
Q4
Means 4-bit quantization.
Original model weights are usually stored in 16-bit or 32-bit precision.
Storing them in 4 bits compresses them heavily → far less memory (roughly a quarter of the 16-bit size, before overhead).
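To put rough numbers on this, here is a minimal Python sketch that estimates the size of the weights alone from the parameter count and the bits per weight. The 7-billion parameter count is a round-number assumption for a 7B model, and the estimate ignores both the extra scale metadata that real quantized files store and the memory needed at runtime for the context (KV cache).

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a round 7 billion parameters, roughly the size of a 7B model.
params = 7e9
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

This prints about 28.0 GB at 32-bit, 14.0 GB at 16-bit and 3.5 GB at 4-bit, which is why a 4-bit 7B model is practical on an ordinary laptop.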
K
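Means it uses K-quantization (grouped quantization): the weights are split into small blocks, and each block gets its own scale factors.
Compared with older formats such as Q4_0, it is an improved method that:
- Preserves accuracy better
- Keeps performance stable
- Reduces memory use efficiently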
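Here is a simplified Python sketch of the grouping idea: each block of 32 weights gets its own scale and offset, so a few large values in one block don't wash out the precision of the small values elsewhere. The block size and the synthetic data are assumptions for illustration. llama.cpp's actual Q4_K format is more elaborate (for example, it packs blocks into 256-weight super-blocks and quantizes the per-block scales too), so treat this only as an illustration of why grouping helps.

```python
import random


def quantize_block(block, bits=4):
    """Asymmetric quantization of one block: each x is stored as a code q with x ≈ scale * q + offset."""
    levels = (1 << bits) - 1  # 15 steps for 4-bit, i.e. integer codes 0..15
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in block]
    return scale, lo, codes


def dequantize_block(scale, offset, codes):
    return [scale * q + offset for q in codes]


def roundtrip(weights, block_size, bits=4):
    """Quantize then dequantize, giving every block of block_size weights its own scale/offset."""
    restored = []
    for start in range(0, len(weights), block_size):
        scale, offset, codes = quantize_block(weights[start:start + block_size], bits)
        restored.extend(dequantize_block(scale, offset, codes))
    return restored


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic weights: mostly small values, plus one block of much larger values.
random.seed(0)
weights = [random.gauss(0, 0.01) for _ in range(224)] + [random.gauss(0, 1.0) for _ in range(32)]

grouped_err = mean_abs_error(weights, roundtrip(weights, block_size=32))
single_scale_err = mean_abs_error(weights, roundtrip(weights, block_size=len(weights)))
print(f"grouped (32 per block): {grouped_err:.5f}")
print(f"one scale for all:      {single_scale_err:.5f}")
```

With this data the grouped version should show a clearly lower error: only the last block pays for the large values, while a single shared scale makes the step size too coarse for every small weight.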
M
Means the Medium variant of K-quantization.
The common variants are:
- Q4_K_S → Small (slightly smaller, slightly lower quality)
- Q4_K_M → Medium (better balance)
- Q4_K_L → Large (better quality, more RAM; not one of llama.cpp's standard types, but some uploaders publish it)
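One way to see why S and M differ in size: as I understand llama.cpp's quantizer, Q4_K_M keeps part of the attention and feed-forward weights in the higher-precision Q6_K type and stores the rest as Q4_K, while Q4_K_S uses Q4_K for nearly everything. The Python sketch below computes bits per weight from the block layouts (as I understand them from llama.cpp's k-quants) and then averages a mix. The 15% Q6_K share is a hypothetical number; the real split depends on the model.

```python
def bits_per_weight(weight_bits: int, overhead_bits: int, block_size: int = 256) -> float:
    """Average storage cost per weight, including the per-block scale metadata."""
    return (block_size * weight_bits + overhead_bits) / block_size


# Q4_K super-block: 256 four-bit weights + 8 six-bit scales + 8 six-bit mins + two fp16 factors
Q4_K_BPW = bits_per_weight(4, 8 * 6 + 8 * 6 + 2 * 16)  # 4.5
# Q6_K super-block: 256 six-bit weights + 16 eight-bit scales + one fp16 factor
Q6_K_BPW = bits_per_weight(6, 16 * 8 + 16)  # 6.5625


def mixed_bpw(fraction_high: float) -> float:
    """Effective bits/weight when fraction_high of the weights use Q6_K and the rest use Q4_K."""
    return (1 - fraction_high) * Q4_K_BPW + fraction_high * Q6_K_BPW


# Hypothetical split: 0% (all Q4_K) versus 15% of the weights kept in Q6_K.
for frac in (0.0, 0.15):
    print(f"{frac:.0%} of weights in Q6_K -> {mixed_bpw(frac):.3f} bits/weight")
```

Keeping even a modest share of weights at higher precision raises the average above 4.5 bits per weight, which is where the extra size (and quality) of the M variant comes from.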
How can we choose a quant for running a model locally?
| Quant | RAM Usage | Quality | Recommended? |
|---|---|---|---|
| Q2 | Very Low | Low | ❌ No |
| Q4_K_S | Low | Good | OK |
| Q4_K_M | Moderate | Very Good | ✅ YES |
| Q8_0 | High | Excellent | Only if 32GB+ RAM |
If you have:
- 16GB RAM → Q4_K_M
- 32GB RAM → Q4_K_M or Q8_0
- Less than 16GB RAM → try Q4_K_S
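The rule of thumb above can be encoded as a tiny Python helper. The thresholds come straight from the list; treating anything between 16GB and 32GB like 16GB is my own assumption, since the list doesn't cover that range.

```python
def recommend_quant(ram_gb: float) -> list:
    """Suggested quants for a given amount of system RAM, following the rule of thumb above."""
    if ram_gb < 16:
        return ["Q4_K_S"]
    if ram_gb < 32:
        # Assumption: 16GB up to (but not including) 32GB is treated like the 16GB case.
        return ["Q4_K_M"]
    return ["Q4_K_M", "Q8_0"]


for ram in (8, 16, 24, 32):
    print(f"{ram}GB RAM -> {', '.join(recommend_quant(ram))}")
```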
Now I'll try loading the model in LM Studio on my laptop.