I spent the last month auditing the C++ and CUDA internals of vLLM, the most widely deployed open-source LLM inference engine.
I found 221 locations where PyTorch's 64-bit tensor metadata is silently truncated to 32-bit int variables before being used in GPU buffer allocations, kernel launch parameters, and loop bounds. The findings were responsibly reported to vLLM.
For code paths that process GGUF model files, the tensor dimensions come directly from the model file header. An attacker who crafts a malicious GGUF file and uploads it to a public model hub can trigger a deterministic GPU buffer overflow when a victim loads that model into vLLM.
This is not a new class of bug. It is the same integer truncation pattern that has produced 10 CVEs in llama.cpp and Ollama over the past two years. The difference is that nobody has looked for it systematically across the rest of the inference stack.
The ML security conversation has been dominated by prompt injection and pickle deserialization. Meanwhile, the C++ and CUDA code that actually parses model files and allocates GPU memory has received almost no attention from the security research community.
Model files are not trusted input. They are downloaded from the internet, hosted on public hubs, and parsed by native code with zero validation. This is the same threat model we accepted for images, fonts, and PDFs decades ago. The only thing missing is the formal recognition that it applies to model files too.
I submitted the findings to the vLLM security team through coordinated disclosure. They were closed as not meeting their security bar. I respect that decision. I also think it reflects a classification gap, not a technical disagreement. When there is no CWE entry for "memory corruption via crafted model file," it is harder for any maintainer to justify treating these reports as security issues.
I have submitted a proposal to MITRE for a new CWE entry to address this gap.
Full report (12 pages, proof of concept, CVE comparison tables, reproduction steps): https://lnkd.in/gdDjQAu8
This is part of a broader research effort into memory corruption across the ML inference stack. More findings are in coordinated disclosure and will be published as those processes complete.
If you maintain an inference engine or work on ML infrastructure security, I would like to hear from you.
#security #llm #vllm #machinelearning #vulnerability #cybersecurity #aiml #gpusecurity #infosec