Serve multiple LLMs locally on your Mac with optimized memory and caching


By @githubprojects


Run Multiple LLMs on Your Mac Without Melting It

Ever wanted to run a few different large language models locally to compare outputs, test prompts, or just have a private playground, but your Mac's memory said "absolutely not"? You're not alone. Running even one decent-sized model can be a memory hog, let alone switching between several.

That's the exact problem omlx tackles. It's a tool that lets you serve multiple LLMs locally on your Mac with a sharp focus on optimized memory usage and intelligent caching. Think of it as a lightweight, local model router that tries to be smart about your system's resources.

What It Does

In simple terms, omlx is a local server that manages multiple LLM backends (like llama.cpp, ollama, or others). Its primary job is to handle incoming requests and route them to the appropriate loaded model. The key is that it's designed to load and unload models dynamically based on demand, and it caches recently used models to avoid the cost of reloading them from disk every single time.

This means you can have access to several models through a single endpoint, but your RAM isn't trying to hold all of them at once. It swaps them in and out as needed.
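The swap-in/swap-out idea is essentially least-recently-used (LRU) caching applied to model weights. Here's a minimal Python sketch of that concept (illustrative only, not omlx's actual implementation; the model names and `max_loaded` cap are made up):

```python
from collections import OrderedDict

class ModelCache:
    """Sketch of LRU-style model management: keep at most `max_loaded`
    models resident, evicting the least recently used to make room."""

    def __init__(self, max_loaded=2):
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()  # model name -> loaded weights handle

    def get(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return self.loaded[name]
        if len(self.loaded) >= self.max_loaded:
            evicted, _ = self.loaded.popitem(last=False)  # drop idlest model
            print(f"unloaded {evicted}")
        self.loaded[name] = f"<{name} weights>"  # stand-in for a disk load
        return self.loaded[name]

cache = ModelCache(max_loaded=2)
cache.get("llama-3-8b")
cache.get("mistral-7b")
cache.get("llama-3-8b")  # cache hit: no reload needed
cache.get("qwen-2-7b")   # memory full: evicts mistral-7b, the idle one
```

The payoff is exactly the behavior described above: requesting a model that's already resident is nearly free, while a cold request pays one load and pushes out whichever model has been idle longest.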

Why It's Cool

The clever part is in the resource management. Instead of the brute-force approach of loading everything and hoping your machine can handle it, omlx operates more like a just-in-time inventory system for LLMs.

  • Optimized Memory: It only keeps the most recently used models in memory. If you ask for a model that isn't loaded, it will swap out an idle one (if memory is full) to make room. This lets you "have" more models available than you could possibly fit in RAM simultaneously.
  • Intelligent Caching: The caching strategy means that if you're bouncing between two models for a task, you're not waiting for a full reload each switch. The second (or third) most recent model is likely still sitting in memory, ready to go.
  • Unified Interface: You interact with all your models through a consistent API endpoint (often OpenAI-compatible), so your client code stays simple. You just change the model name in your request to switch between them.
  • Mac-First Design: It's built with the Apple Silicon Mac's memory architecture in mind, aiming to get the most out of the hardware you have.

The main use case is clear: local development and testing. If you're building an app that uses LLMs and want to test performance across different models, or if you're a researcher comparing outputs, this saves you from a manual, tedious model loading/unloading dance.

How to Try It

Getting started is straightforward. Head over to the GitHub repository to clone it and get the setup instructions.

git clone https://github.com/jundot/omlx
cd omlx

The repo's README will have the most up-to-date prerequisites and running commands, but typically you'll need Python and uv or pip to install dependencies. You'll likely configure which model backends and specific models you want to use in a config file, then start the server with a simple command like ./omlx or python main.py.
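To make the setup step concrete, a config along these lines is typical for this kind of tool. This is a hypothetical YAML sketch, not omlx's actual schema; every key and path here is illustrative, so check the README for the real format:

```yaml
# hypothetical config -- see the omlx README for the real schema
models:
  - name: llama-3-8b
    backend: llama.cpp
    path: ~/models/llama-3-8b.gguf
  - name: mistral-7b
    backend: ollama
max_loaded_models: 2   # how many models may be resident at once
port: 11434
```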

From there, you can point any OpenAI-compatible client (like the OpenAI Python library, curl, or apps like Open WebUI) to http://localhost:11434 (or whichever port it uses) and start sending requests, specifying the model in the request body.
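Assuming the server exposes the usual OpenAI-style `/v1/chat/completions` route (the endpoint path, port, and model names below are illustrative assumptions), a minimal stdlib-only Python client could look like this:

```python
import json
from urllib import request

BASE_URL = "http://localhost:11434/v1"  # adjust to the port omlx actually uses

def build_payload(model, prompt):
    """OpenAI-style chat payload; switching models is just changing the name."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model, prompt):
    """POST one chat request to the local OpenAI-compatible endpoint."""
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Same client code, different models -- only the name in the request changes:
# chat("llama-3-8b", "Summarize this paragraph...")
# chat("mistral-7b", "Summarize this paragraph...")
```

This is the "unified interface" benefit in practice: your client never knows or cares which models are currently resident, because the server handles the swapping behind that one endpoint.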

Final Thoughts

omlx feels like a practical tool for a specific, growing need. As the landscape of open-source models expands, the ability to efficiently run more than one locally becomes huge for developers. It's not about running a production-scale model hub; it's about making your local development and experimentation workflow smoother and more memory-conscious.

If you've been frustrated by "out of memory" errors while trying to test a second model, this approach of smart swapping and caching is a sensible solution. It's the kind of tool that sits quietly in the background and just makes things work better, which is always a win.


Follow us for more cool projects: @githubprojects

Last updated: March 11, 2026