How to Deploy Lightweight Language Models on Embedded Linux with LiteLLM

This article was contributed by Vedrana Vidulin, Head of Responsible AI Unit at Intellias (LinkedIn).

As AI becomes central to smart devices, embedded systems, and edge computing, the ability to run language models locally — without relying on the cloud — is essential. Whether it’s for reducing latency, improving data privacy, or enabling offline functionality, local AI inference opens up new opportunities across industries. LiteLLM offers a practical solution for bringing large language models to resource-constrained devices, bridging the gap between powerful AI tools and the limitations of embedded hardware.

Deploying LiteLLM, an open source LLM gateway, on embedded Linux unlocks the ability to run lightweight AI models in resource-constrained environments. Acting as a flexible proxy server, LiteLLM provides a unified API interface that accepts OpenAI-style requests, allowing you to interact with local or remote models using a consistent, developer-friendly format. This guide walks you through everything from installation to performance tuning, helping you build a reliable, lightweight AI system on an embedded Linux distribution.

Setup checklist

Before you start, here’s what’s required:

  • A device running a Linux-based operating system (this guide assumes Debian) with sufficient computational resources to handle LLM operations.
  • Python 3.7 or higher installed on the device.
  • Access to the internet for downloading necessary packages and models.

Step-by-Step Installation

Step 1: Install LiteLLM

First, we make sure the device is up to date and ready for installation. Then we install LiteLLM in a clean and safe environment.

Update the package lists to ensure access to the latest software versions:

sudo apt-get update

Check if pip (Python Package Installer) is installed:

pip --version

If not, install it using:

sudo apt-get install python3-pip

It is recommended to use a virtual environment. Check if venv is installed:

dpkg -s python3-venv | grep "Status: install ok installed"

If venv is installed, the output will be "Status: install ok installed". If it is not, install it with:

sudo apt install python3-venv -y

Create and activate virtual environment:

python3 -m venv litellm_env
source litellm_env/bin/activate

Use pip to install LiteLLM along with its proxy server component:

pip install 'litellm[proxy]'

Use LiteLLM from within this environment. To deactivate the virtual environment later, type deactivate.
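
To confirm the installation before moving on, you can check that the package is visible from inside the activated environment. The short sketch below uses only Python's standard importlib.metadata module, so it assumes nothing beyond LiteLLM having been installed with pip:

# check_install.py - quick sanity check that LiteLLM is installed in this venv
from importlib.metadata import version, PackageNotFoundError

try:
    print("litellm version:", version("litellm"))
except PackageNotFoundError:
    print("litellm is not installed in the active environment")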

Step 2: Configure LiteLLM

With LiteLLM installed, the next step is to define how it should operate. This is done through a configuration file, which specifies the language models to be used and the endpoints through which they’ll be served.

Navigate to a suitable directory and create a configuration file named config.yaml:

mkdir ~/litellm_config
cd ~/litellm_config
nano config.yaml

In config.yaml specify the models you intend to use. For example, to configure LiteLLM to interface with a model served by Ollama:

model_list:
  - model_name: codegemma
    litellm_params:
      model: ollama/codegemma:2b
      api_base: http://localhost:11434

This configuration maps the model name codegemma to the codegemma:2b model served by Ollama at http://localhost:11434.
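
If you want to catch indentation mistakes before starting the proxy, a small validation script can help. The sketch below is optional; it assumes PyYAML is available in the environment (LiteLLM's proxy dependencies normally pull it in; otherwise install it with pip install pyyaml) and only checks that the file parses and contains a model_list section:

# validate_config.py - sanity-check the LiteLLM config before launching the proxy
import os
import yaml  # provided by PyYAML

CONFIG_PATH = os.path.expanduser("~/litellm_config/config.yaml")

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)

models = config.get("model_list") or []
if not models:
    raise SystemExit("No model_list found - check the YAML indentation.")

for entry in models:
    params = entry.get("litellm_params", {})
    print("model_name:", entry.get("model_name"), "->", params.get("model"))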

Step 3: Serve models with Ollama

To run your AI model locally, you’ll use a tool called Ollama. It’s designed specifically for hosting large language models (LLMs) directly on your device — without relying on cloud services.

To get started, install Ollama using the following command:

curl -fsSL https://ollama.com/install.sh | sh

This command downloads and runs the official installation script, which automatically starts the Ollama server.

Once installed, you’re ready to load the AI model you want to use. In this example, we’ll pull a compact model called codegemma:2b.

ollama pull codegemma:2b

After the model is downloaded, the Ollama server will begin listening for requests — ready to generate responses from your local setup.
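
Before wiring in LiteLLM, it is worth verifying that Ollama is actually listening and that the model has been pulled. The sketch below queries Ollama's local REST API (the /api/tags endpoint, which lists downloaded models) using only the Python standard library; adjust the host or port if your Ollama instance is configured differently:

# check_ollama.py - confirm the Ollama server is up and codegemma:2b is available
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/tags"  # Ollama's default address

with urllib.request.urlopen(OLLAMA_URL, timeout=5) as resp:
    data = json.load(resp)

names = [m["name"] for m in data.get("models", [])]
print("Models available to Ollama:", names)
if not any(n.startswith("codegemma") for n in names):
    print("codegemma:2b not found - run: ollama pull codegemma:2b")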

Step 4: Launch the LiteLLM proxy server

With both the model and configuration ready, it’s time to start the LiteLLM proxy server — the component that makes your local AI model accessible to applications.

To launch the server, use the command below:

litellm --config ~/litellm_config/config.yaml

The proxy server will initialize and expose endpoints defined in your configuration, allowing applications to interact with the specified models through a consistent API.
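
A quick way to confirm the proxy is reachable, before writing any application code, is to ask it which models it exposes. Because LiteLLM speaks the OpenAI API, the standard openai Python client (installed as a LiteLLM dependency in most setups) can do this. The sketch below assumes the proxy listens on port 4000, LiteLLM's default and the port used in the next step:

# list_models.py - confirm the LiteLLM proxy is reachable and see what it serves
import openai

# The proxy does not require a real API key unless you configure one.
client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")

for model in client.models.list():
    print(model.id)  # should include "codegemma" from config.yaml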

Step 5: Test the deployment

Let’s confirm that everything works as expected. Write a simple Python script that sends a test request to the LiteLLM server and save it as test_script.py:

import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")

response = client.chat.completions.create(
    model="codegemma",
    messages=[{"role": "user", "content": "Write me a Python function to calculate the nth Fibonacci number."}]
)

print(response)

Finally, run the script using this command:

python3 ./test_script.py

If the setup is correct, you’ll receive a response from the local model — confirming that LiteLLM is up and running.
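
The script above prints the entire response object. In practice you usually want only the generated text, which the OpenAI-style response format stores in choices[0].message.content. This variant of the test script, shown purely as an illustration, prints just the model's reply:

# test_script_text.py - same request as test_script.py, printing only the reply text
import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")

response = client.chat.completions.create(
    model="codegemma",
    messages=[{"role": "user", "content": "Write me a Python function to calculate the nth Fibonacci number."}]
)

print(response.choices[0].message.content)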

Optimize LiteLLM performance on embedded devices

To ensure fast, reliable performance on embedded systems, it’s important to choose the right language model and adjust LiteLLM’s settings to match your device’s limitations.

Choosing the Right Language Model

Not every AI model is built for devices with limited resources; some are simply too heavy. That’s why it’s crucial to choose compact, optimized models designed specifically for such environments:

  • DistilBERT – a distilled version of BERT, retaining over 95% of BERT’s performance with 66 million parameters. It’s suitable for tasks like text classification, sentiment analysis, and named entity recognition.
  • TinyBERT – with approximately 14.5 million parameters, TinyBERT is designed for mobile and edge devices, excelling in tasks such as question answering and sentiment classification.
  • MobileBERT – optimized for on-device computations, MobileBERT has 25 million parameters and achieves nearly 99% of BERT’s accuracy. It’s ideal for mobile applications requiring real-time processing.
  • TinyLlama – a compact model with approximately 1.1 billion parameters, TinyLlama balances capability and efficiency, making it suitable for real-time natural language processing in resource-constrained environments.
  • MiniLM – a compact transformer model with approximately 33 million parameters, MiniLM is effective for tasks like semantic similarity and question answering, particularly in scenarios requiring rapid processing on limited hardware.

Selecting a model that fits your setup isn’t just about saving space — it’s about ensuring smooth performance, fast responses, and efficient use of your device’s limited resources.

Configure settings for better performance

A few small adjustments can go a long way when you’re working with limited hardware. By fine-tuning key LiteLLM settings, you can boost performance and keep things running smoothly.

Restrict the number of tokens

Shorter responses mean faster results. Limiting the maximum number of tokens in a response reduces memory and computational load. In LiteLLM, this is done by setting the max_tokens parameter when making API calls. For example:

import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")

response = client.chat.completions.create(
    model="codegemma",
    messages=[{"role": "user", "content": "Write me a Python function to calculate the nth Fibonacci number."}],
    max_tokens=500  # Limits the response to 500 tokens
)

print(response)

Adjusting max_tokens helps keep replies concise and reduces the load on your device.

Managing simultaneous requests

If too many requests hit the server at once, even the best-optimized model can get bogged down. That’s why LiteLLM includes an option to limit how many queries it processes at the same time. For instance, you can restrict LiteLLM to handle up to 5 concurrent requests by setting max_parallel_requests as follows:

litellm --config ~/litellm_config/config.yaml --num_requests 5

This setting helps distribute the load evenly and ensures your device stays stable — even during periods of high demand.
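
To see how the proxy behaves under load, you can fire a handful of requests at once from the client side. The sketch below is a simple load probe rather than a benchmark; it uses Python's standard concurrent.futures module together with the same OpenAI-style client as before and assumes the proxy is still running on port 4000:

# load_probe.py - send a few concurrent requests to observe how the proxy copes
from concurrent.futures import ThreadPoolExecutor
import time

import openai

client = openai.OpenAI(api_key="anything", base_url="http://localhost:4000")

def ask(i):
    """Send one short request and return its index and latency in seconds."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="codegemma",
        messages=[{"role": "user", "content": f"Say hello, request {i}."}],
        max_tokens=20,
    )
    return i, time.perf_counter() - start

with ThreadPoolExecutor(max_workers=5) as pool:
    for i, elapsed in pool.map(ask, range(5)):
        print(f"request {i} finished in {elapsed:.1f}s")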

A Few More Smart Moves

Before going live with your setup, here are two additional best practices worth considering:

  • Secure your setup – implement appropriate security measures, such as firewalls and authentication mechanisms, to protect the server from unauthorized access (a client-side example of key-based access follows this list).
  • Monitor performance – use LiteLLM’s logging capabilities to track usage, performance, and potential issues.
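
On the authentication point, once the proxy is configured to require an API key (the LiteLLM proxy documentation describes the server-side settings for this), clients can no longer use a placeholder key. The sketch below illustrates only the client side; the key shown is a placeholder that must match whatever key your proxy actually accepts:

# client_with_key.py - call the proxy with an API key once authentication is enabled
import openai

client = openai.OpenAI(
    api_key="sk-your-proxy-key",      # placeholder; use the key your proxy expects
    base_url="http://localhost:4000",
)

response = client.chat.completions.create(
    model="codegemma",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(response.choices[0].message.content)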

LiteLLM makes it possible to run language models locally, even on low-resource devices. By acting as a lightweight proxy with a unified API, it simplifies integration while reducing overhead. With the right setup and lightweight models, you can deploy responsive, efficient AI solutions on embedded systems — whether for a prototype or a production-ready solution.

Summary 

Running LLMs on embedded devices doesn’t necessarily require heavy infrastructure or proprietary services. LiteLLM offers a streamlined, open-source solution for deploying language models with ease, flexibility, and performance — even on devices with limited resources. With the right model and configuration, you can power real-time AI features at the edge, supporting everything from smart assistants to secure local processing.

Join Our Community

We’re continuously exploring the future of tech, innovation, and digital transformation at Intellias — and we invite you to be part of the journey.

  • Visit our Intellias Blog and dive deeper into industry insights, trends, and expert perspectives.
  • This article was written by Vedrana Vidulin, Head of Responsible AI Unit at Intellias. Connect with Vedrana through her LinkedIn page.