
VRAM is not freed when stopping models #7958

@jroeber

Description


LocalAI version: localai/localai:v3.9.0-gpu-nvidia-cuda-13

Environment, CPU architecture, OS, and Version:

Linux 6.17.0-8-generic #8-Ubuntu SMP PREEMPT_DYNAMIC Fri Nov 14 21:44:46 UTC 2025 x86_64 GNU/Linux

AMD CPU
Nvidia RTX 3060 GPU
Ubuntu 25.10
Kubernetes (RKE2 v1.35.0+rke2r1), using Nvidia k8s-device-plugin with timeSlicing GPU sharing method

Describe the bug

When LocalAI stops a model (either manually or via LRU eviction), the backend child process is not killed, so its VRAM remains allocated.

To Reproduce

  1. Start a model using any backend (e.g. llama-cpp or stablediffusion)
  2. Click the stop model button
  3. Observe that the VRAM is still shown as allocated in the LocalAI GUI, and that nvidia-smi on the host shows a child process still running and holding the VRAM

Expected behavior

When LocalAI stops a model, the child process should stop, and the VRAM should be freed.
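
For illustration, here is a minimal Go sketch (not LocalAI's actual code; the binary path, arguments, and timeout are made up) of the kind of cleanup I would expect: the backend is started in its own process group, and stopping the model signals the whole group so that any grandchild holding a CUDA context exits and the driver frees the VRAM.

```go
// Minimal sketch, not LocalAI's actual implementation: the backend binary,
// its arguments, and the timeout below are hypothetical.
package main

import (
	"os/exec"
	"syscall"
	"time"
)

// startBackend launches the backend in its own process group so that a later
// stop can signal the whole group, including grandchildren (such as the
// ld.so-wrapped llama.cpp worker) that hold the CUDA context.
func startBackend(path string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(path, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	return cmd, cmd.Start()
}

// stopBackend signals the process group and escalates to SIGKILL if the
// backend ignores SIGTERM and keeps its VRAM allocated.
func stopBackend(cmd *exec.Cmd) {
	pgid := cmd.Process.Pid                  // equals the group ID because of Setpgid
	_ = syscall.Kill(-pgid, syscall.SIGTERM) // a negative PID targets the whole group

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case <-done:
		// Backend exited; its CUDA context is destroyed and the driver frees the VRAM.
	case <-time.After(10 * time.Second):
		_ = syscall.Kill(-pgid, syscall.SIGKILL)
		<-done
	}
}

func main() {
	cmd, err := startBackend("/backends/example-llama-cpp/run", "--model", "example.gguf")
	if err != nil {
		panic(err)
	}
	stopBackend(cmd)
}
```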

Logs

After stopping a model, the GPU memory usage shown in the GUI stays the same, but no models are listed below it:

[screenshot of the LocalAI GUI]

nvidia-smi shows a child process of LocalAI still running:

Sat Jan 10 17:37:10 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   56C    P8             15W /  170W |    6061MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          445091      C   ...ds/cuda13-llama-cpp/lib/ld.so       6052MiB |
+-----------------------------------------------------------------------------------------+

In this case, PID 445091 is a child of the LocalAI parent process.
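
One way to confirm the parent/child relationship is to read the PPid field from /proc for the leaked PID; a tiny helper along these lines (Go here just for illustration) prints the parent, which in my case is the LocalAI process:

```go
// Illustrative helper: print the parent PID of the leaked backend process.
// 445091 is the PID reported by nvidia-smi above; substitute your own.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	const leakedPID = "445091"
	data, err := os.ReadFile("/proc/" + leakedPID + "/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, line := range strings.Split(string(data), "\n") {
		// The status file contains a line like "PPid:\t<parent pid>".
		if strings.HasPrefix(line, "PPid:") {
			fmt.Println("parent PID:", strings.TrimSpace(strings.TrimPrefix(line, "PPid:")))
		}
	}
}
```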

Additional context

I can "fix" this by restarting the pod, but for some reason LocalAI is not killing its child processes in my setup when a model is unloaded. This ties up VRAM and keeps me from starting other models.
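
As a stopgap that avoids restarting the whole pod, terminating the leaked process directly also releases the VRAM, since the driver frees it once the owning CUDA process exits. A rough sketch (PID again taken from the nvidia-smi output above):

```go
// Rough stopgap sketch: signal the leaked backend directly instead of
// restarting the pod. The PID is the one from nvidia-smi; adjust as needed.
package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	const leakedPID = 445091

	// Ask the process to exit; VRAM is released once its CUDA context is gone.
	_ = syscall.Kill(leakedPID, syscall.SIGTERM)
	time.Sleep(5 * time.Second)

	// Signal 0 only checks whether the process still exists; escalate if so.
	if err := syscall.Kill(leakedPID, 0); err == nil {
		_ = syscall.Kill(leakedPID, syscall.SIGKILL)
	}
	fmt.Println("cleanup signals sent to", leakedPID)
}
```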
