Running large language models locally is no longer reserved for beefy GPU rigs. With the right setup, a Jetson Nano can serve as a surprisingly capable inference endpoint for smaller models — completely air-gapped from the cloud.
## Why edge inference matters
For security professionals, running models locally means:
- No data exfiltration risk — queries never leave your network
- Predictable latency — no API rate limits or outages
- Full control — choose your model, quantization, and context length
## Hardware requirements
| Component | Specification |
|---|---|
| Board | NVIDIA Jetson Nano (4GB) |
| Storage | 64GB+ microSD (A2 rated) |
| Power | 5V 4A barrel jack (not USB) |
| Cooling | Active fan recommended |
## Installation
First, flash JetPack 4.6 and update the system:
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl git
Install Ollama using the official script:
curl -fsSL https://ollama.com/install.sh | sh
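To confirm the install worked, check the CLI and the background service. A quick sanity check, assuming the install script registered Ollama as a systemd service (which it does on standard Linux installs, including JetPack's Ubuntu base):

```bash
# Verify the CLI is on the PATH and report its version
ollama --version

# Confirm the Ollama service is running (prints "active" if so)
systemctl is-active ollama
```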
Pull a quantized model small enough to fit in the Nano's 4GB of shared CPU/GPU memory:
ollama pull phi3:mini
## Testing the setup
Run a quick inference test:
ollama run phi3:mini "Explain buffer overflow in 3 sentences"
The first run takes longer while the weights load from the microSD card; after that you should see tokens streaming within a few seconds. The Phi-3 mini model runs comfortably on the Nano with 4-bit quantization.
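If you want to verify what actually got pulled, the Ollama CLI can list installed models and, on recent releases, print per-model details such as parameter count and quantization level:

```bash
# List installed models and their on-disk size
ollama list

# On recent Ollama releases, this prints model metadata,
# including the quantization level and context length
ollama show phi3:mini
```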
## Exposing as a local API
Ollama exposes a REST API on port 11434. By default the service only listens on localhost, so to reach it from other machines on your network you first need to bind it to your LAN interface (see the note after the example). From there, a remote query looks like this:
import requests

# Query the Nano's Ollama API from another machine on the network
response = requests.post(
    "http://jetson-nano.local:11434/api/generate",
    json={
        "model": "phi3:mini",
        "prompt": "What is a reverse shell?",
        "stream": False,  # return the full completion as a single JSON object
    },
    timeout=120,  # generation on the Nano can take a while
)
response.raise_for_status()
print(response.json()["response"])
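As noted above, the stock install binds to 127.0.0.1 only. The documented way to change that is the OLLAMA_HOST environment variable on the service; a minimal sketch using a systemd override, assuming the standard ollama unit created by the install script:

```bash
# Open an override file for the Ollama unit
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
# Save, then restart the service so the new binding takes effect
sudo systemctl restart ollama
```

Keep in mind that binding to 0.0.0.0 exposes the API to anything on your LAN with no authentication, so keep the Nano on a trusted segment or put a reverse proxy in front of it.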
## Performance notes
With the 4GB Jetson Nano, expect roughly 5-8 tokens/second on quantized 3B parameter models. Not fast enough for real-time chat, but perfectly adequate for batch analysis, automated report generation, or offline threat intelligence enrichment.
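You can measure this yourself rather than taking my numbers on faith: the non-streaming API response includes eval_count (tokens generated) and eval_duration (nanoseconds), which is enough to compute tokens per second. A rough sketch, assuming jq is installed:

```bash
# Ask for a short completion and compute tokens/second
# from the timing fields Ollama reports in the response
curl -s http://localhost:11434/api/generate \
  -d '{"model": "phi3:mini", "prompt": "Summarize what a SOC analyst does.", "stream": false}' \
  | jq '{tokens: .eval_count, seconds: (.eval_duration / 1e9), tokens_per_second: (.eval_count / (.eval_duration / 1e9))}'
```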
## Next steps
In a follow-up post, I’ll cover integrating this with n8n for automated security workflow enrichment — having an LLM summarize alerts before they hit the SOC dashboard.
