Open Source · 2026-02-22
Hacker News
Show HN: Llama 3.1 70B Runs on Single RTX 3090 via NVMe
In an impressive feat of optimization, a developer has demonstrated running the massive Llama 3.1 70-billion-parameter language model on a single consumer-grade RTX 3090 GPU. This achievement, which would typically require multiple high-end GPUs or cloud instances, was made possible by a technique that streams model weights directly from fast NVMe solid-state storage, bypassing system RAM entirely.
The method works around the GPU's limited VRAM (24 GB on the RTX 3090) by loading each part of the model into GPU memory only when it is needed for computation, while the bulk of the weights reside on the much larger NVMe drive. This "direct storage" approach minimizes latency and keeps the GPU fed with data.
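The core idea can be illustrated with a small sketch. The example below is not the developer's actual implementation; it simulates layer-by-layer weight streaming in pure Python, using a memory-mapped file to stand in for the NVMe-resident model and a byte buffer to stand in for VRAM. The file layout, layer size, and function names are all hypothetical.

```python
import mmap
import os
import tempfile

LAYER_BYTES = 1024   # hypothetical size of one layer's weights
NUM_LAYERS = 4       # hypothetical layer count

# Write a dummy "weights file" standing in for the model stored on NVMe.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    for i in range(NUM_LAYERS):
        f.write(bytes([i]) * LAYER_BYTES)

def stream_layers(path, num_layers, layer_bytes):
    """Yield one layer's weights at a time via a memory map, so only the
    currently active layer occupies a VRAM-sized buffer at any moment."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i in range(num_layers):
            # Slice out just this layer; the OS pages it in from storage.
            yield mm[i * layer_bytes:(i + 1) * layer_bytes]

peak_resident = 0
for layer in stream_layers(path, NUM_LAYERS, LAYER_BYTES):
    # Stand-in for running the forward pass through this layer on the GPU.
    peak_resident = max(peak_resident, len(layer))

print(peak_resident)  # one layer resident at a time, not the whole model
```

A real implementation would additionally overlap I/O with compute (prefetching layer N+1 while layer N runs) to hide storage latency, which is what keeps the GPU fed.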
This breakthrough has significant implications for democratizing access to large language models. It enables researchers, developers, and enthusiasts to experiment with state-of-the-art open-source models on affordable, locally owned hardware. The technique lowers the barrier to entry.
