
VRAM is not freed when stopping models #7958

@jroeber

Description


LocalAI version: localai/localai:v3.9.0-gpu-nvidia-cuda-13

Environment, CPU architecture, OS, and Version:

Linux 6.17.0-8-generic #8-Ubuntu SMP PREEMPT_DYNAMIC Fri Nov 14 21:44:46 UTC 2025 x86_64 GNU/Linux

AMD CPU
Nvidia RTX 3060 GPU
Ubuntu 25.10
Kubernetes (RKE2 v1.35.0+rke2r1), using Nvidia k8s-device-plugin with timeSlicing GPU sharing method

Describe the bug

When LocalAI stops a model (either manually or via LRU eviction), the backend child process is not killed, so its VRAM remains allocated.

To Reproduce

  1. Start a model using any backend (e.g. llama-cpp or stablediffusion)
  2. Click the stop model button
  3. Observe that the VRAM is still shown as allocated in the LocalAI GUI, and that nvidia-smi on the host shows a child process still running and holding the VRAM

Expected behavior

When LocalAI stops a model, the child process should stop, and the VRAM should be freed.
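
For illustration, here is a minimal Go sketch (not LocalAI's actual code; the binary path, arguments, and timeout are made up) of the kind of cleanup I would expect: the backend is started in its own process group, and stopping the model signals the whole group so that any grandchild holding a CUDA context exits and the driver frees the VRAM.

```go
// Minimal sketch, not LocalAI's actual implementation: the backend binary,
// its arguments, and the timeout below are hypothetical.
package main

import (
	"os/exec"
	"syscall"
	"time"
)

// startBackend launches the backend in its own process group so that a later
// stop can signal the whole group, including grandchildren (such as the
// ld.so-wrapped llama.cpp worker) that hold the CUDA context.
func startBackend(path string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(path, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	return cmd, cmd.Start()
}

// stopBackend signals the process group and escalates to SIGKILL if the
// backend ignores SIGTERM and keeps its VRAM allocated.
func stopBackend(cmd *exec.Cmd) {
	pgid := cmd.Process.Pid                  // equals the group ID because of Setpgid
	_ = syscall.Kill(-pgid, syscall.SIGTERM) // a negative PID targets the whole group

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case <-done:
		// Backend exited; its CUDA context is destroyed and the driver frees the VRAM.
	case <-time.After(10 * time.Second):
		_ = syscall.Kill(-pgid, syscall.SIGKILL)
		<-done
	}
}

func main() {
	cmd, err := startBackend("/backends/example-llama-cpp/run", "--model", "example.gguf")
	if err != nil {
		panic(err)
	}
	stopBackend(cmd)
}
```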

Logs

After stopping a model, the GPU memory usage shown in the GUI stays the same, but no models are listed below it:

[screenshot of the LocalAI GUI]

nvidia-smi shows a child process of LocalAI still running:

Sat Jan 10 17:37:10 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   56C    P8             15W /  170W |    6061MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          445091      C   ...ds/cuda13-llama-cpp/lib/ld.so       6052MiB |
+-----------------------------------------------------------------------------------------+

In this case, PID 445091 is a child of the LocalAI parent process.
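
One way to confirm the parent/child relationship is to read the PPid field from /proc for the leaked PID; a tiny helper along these lines (Go here just for illustration) prints the parent, which in my case is the LocalAI process:

```go
// Illustrative helper: print the parent PID of the leaked backend process.
// 445091 is the PID reported by nvidia-smi above; substitute your own.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	const leakedPID = "445091"
	data, err := os.ReadFile("/proc/" + leakedPID + "/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, line := range strings.Split(string(data), "\n") {
		// The status file contains a line like "PPid:\t<parent pid>".
		if strings.HasPrefix(line, "PPid:") {
			fmt.Println("parent PID:", strings.TrimSpace(strings.TrimPrefix(line, "PPid:")))
		}
	}
}
```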

Additional context

I can "fix" this by restarting the pod, but for some reason LocalAI is not killing its child processes in my setup when a model is unloaded. This ties up VRAM and keeps me from starting other models.
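
As a stopgap that avoids restarting the whole pod, terminating the leaked process directly also releases the VRAM, since the driver frees it once the owning CUDA process exits. A rough sketch (PID again taken from the nvidia-smi output above):

```go
// Rough stopgap sketch: signal the leaked backend directly instead of
// restarting the pod. The PID is the one from nvidia-smi; adjust as needed.
package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	const leakedPID = 445091

	// Ask the process to exit; VRAM is released once its CUDA context is gone.
	_ = syscall.Kill(leakedPID, syscall.SIGTERM)
	time.Sleep(5 * time.Second)

	// Signal 0 only checks whether the process still exists; escalate if so.
	if err := syscall.Kill(leakedPID, 0); err == nil {
		_ = syscall.Kill(leakedPID, syscall.SIGKILL)
	}
	fmt.Println("cleanup signals sent to", leakedPID)
}
```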
