With the rise of local AI models, tools like Ollama make it easy to run powerful models on your machine. Combine that with Spring Boot, and you can build your own AI-powered backend API—without relying on external services.
In this guide, you’ll learn how to:
- Run Ollama locally
- Integrate it with Spring Boot
- Build a REST API that talks to LLMs
- Handle real-world use cases
📌 Architecture Overview
Client (Postman / UI)
↓
Spring Boot REST API
↓
Ollama Local API (localhost:11434)
↓
LLM Model (llama3 / mistral / etc.)
👉 Your Spring Boot app acts as a middleware layer between users and the LLM.
⚙️ Step 1: Run Ollama
Install Ollama, then start a model (it is downloaded automatically on first run):
ollama run llama3
👉 The Ollama server now listens at:
http://localhost:11434
🌐 Step 2: Ollama API Endpoint
Send a POST request to:
http://localhost:11434/api/generate
Sample request ("stream": false makes Ollama return a single JSON object instead of streaming chunks):
{
  "model": "llama3",
  "prompt": "Explain Java threads",
  "stream": false
}
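For reference, the non-streaming response looks roughly like this (abridged; Ollama also returns additional metadata fields):
{
  "model": "llama3",
  "created_at": "2024-01-01T00:00:00Z",
  "response": "Java threads are...",
  "done": true
}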
🏗️ Step 3: Create Spring Boot Project
Use dependencies:
- Spring Web
- Lombok (optional)
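For a Maven build, these map to the following pom.xml entries (Lombok marked optional, matching the list above):
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <optional>true</optional>
</dependency>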
📦 Step 4: Create Request/Response DTOs
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;

// Records keep the DTOs concise and give Jackson constructors for free.
// stream = false asks Ollama for a single JSON response instead of a stream.
record OllamaRequest(String model, String prompt, boolean stream) {}

// Ollama also returns fields like created_at and done; ignore what we don't map.
@JsonIgnoreProperties(ignoreUnknown = true)
record OllamaResponse(String response) {}
🔌 Step 5: Service Layer (Calling Ollama)
Using Spring’s modern RestClient (available since Spring Boot 3.2):
import org.springframework.http.MediaType;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestClient;

@Service
public class OllamaService {

    private final RestClient restClient = RestClient.create("http://localhost:11434");

    public String generate(String prompt) {
        // "stream": false → one JSON object instead of newline-delimited chunks.
        // Note: formatted() does not escape quotes in the prompt; see the typed
        // variant below for safer serialization.
        String requestBody = """
                {
                  "model": "llama3",
                  "prompt": "%s",
                  "stream": false
                }
                """.formatted(prompt);

        return restClient.post()
                .uri("/api/generate")
                .contentType(MediaType.APPLICATION_JSON)
                .body(requestBody)
                .retrieve()
                .body(String.class);
    }
}
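Since hand-built JSON breaks on prompts containing quotes or newlines, here is a sketch of the same call using the Step 4 DTOs and letting Jackson handle serialization (add it inside OllamaService; the method name is illustrative):
public String generateTyped(String prompt) {
    OllamaRequest request = new OllamaRequest("llama3", prompt, false);
    OllamaResponse response = restClient.post()
            .uri("/api/generate")
            .contentType(MediaType.APPLICATION_JSON)
            .body(request)       // Jackson serializes and escapes the prompt
            .retrieve()
            .body(OllamaResponse.class);
    return response.response();
}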
🎯 Step 6: REST Controller
import org.springframework.web.bind.annotation.*;
@RestController
@RequestMapping("/api/ai")
public class AIController {
private final OllamaService service;
public AIController(OllamaService service) {
this.service = service;
}
@GetMapping("/ask")
public String ask(@RequestParam String prompt) {
return service.generate(prompt);
}
}
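GET with a query parameter is handy for quick tests, but prompts with special characters fit better in a request body. A minimal POST variant reusing the Step 4 DTO (the /chat path is illustrative):
@PostMapping("/chat")
public String chat(@RequestBody OllamaRequest request) {
    return service.generate(request.prompt());
}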
▶️ Step 7: Test Your API
👉 Call it from a browser or Postman (both URL-encode the spaces in the prompt for you):
http://localhost:8080/api/ai/ask?prompt=Explain multithreading in Java
🔄 Step 8: Dynamic Model Selection
Make your API flexible:
public String generate(String model, String prompt) {
    String body = """
            {
              "model": "%s",
              "prompt": "%s",
              "stream": false
            }
            """.formatted(model, prompt);

    return restClient.post()
            .uri("/api/generate")
            .contentType(MediaType.APPLICATION_JSON)
            .body(body)
            .retrieve()
            .body(String.class);
}
👉 Now you can use:
- llama3 → general purpose
- mistral → fast responses
- codellama → coding tasks
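To expose this, the controller endpoint can take the model as an optional parameter (the defaultValue is an assumption; adjust it to whichever model you pulled):
@GetMapping("/ask")
public String ask(@RequestParam(defaultValue = "llama3") String model,
                  @RequestParam String prompt) {
    return service.generate(model, prompt);
}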
⚡ Step 9: Async Processing (Important)
Use CompletableFuture for non-blocking calls. Without an explicit executor, supplyAsync runs on the shared ForkJoinPool, so pass a dedicated pool for long-running LLM calls (the pool size here is illustrative):
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

private final Executor llmExecutor = Executors.newFixedThreadPool(4);

public CompletableFuture<String> generateAsync(String prompt) {
    return CompletableFuture.supplyAsync(() -> generate(prompt), llmExecutor);
}
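Spring MVC can return the future directly; the servlet thread is released while Ollama generates (the /ask-async path is illustrative):
@GetMapping("/ask-async")
public CompletableFuture<String> askAsync(@RequestParam String prompt) {
    // Spring completes the HTTP response when the future completes
    return service.generateAsync(prompt);
}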
🧠 Real-World Use Cases
🤖 AI Chatbot
- Use llama3 for conversation
💻 Coding Assistant
- Use codellama
📄 Content Generator
- Blogs, emails, documentation
🧩 Agentic AI
- Decision-making workflows
🔐 Best Practices
✔ Run Ollama on the same server for low latency
✔ Use a dedicated thread pool (ExecutorService)
✔ Add timeout handling (see the sketch after this list)
✔ Cache frequent responses
✔ Validate user input
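For the timeout point, a minimal sketch that configures the RestClient's request factory (the values are illustrative; LLM generation can take a while, so the read timeout is generous):
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestClient;

SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
factory.setConnectTimeout(5_000);   // ms: fail fast if the server is unreachable
factory.setReadTimeout(120_000);    // ms: allow slow generations to finish

RestClient restClient = RestClient.builder()
        .baseUrl("http://localhost:11434")
        .requestFactory(factory)
        .build();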
⚠️ Common Issues
❌ Empty Response
👉 Ollama streams newline-delimited JSON by default → set "stream": false (as above) or parse the chunks
❌ Slow Performance
👉 Use smaller models like:
- mistral
- phi3
❌ Memory Issues
👉 Large models require high RAM
🚀 Advanced Enhancements
- 🔄 Streaming response handling
- 🧠 Context memory (chat history) → request sketch below
- 🔍 RAG (Retrieval-Augmented Generation)
- 🔐 Authentication layer
- 📊 Logging + monitoring
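For context memory, Ollama also exposes a /api/chat endpoint that accepts the running conversation as a list of messages, e.g.:
{
  "model": "llama3",
  "stream": false,
  "messages": [
    { "role": "user", "content": "Explain Java threads" },
    { "role": "assistant", "content": "Java threads are..." },
    { "role": "user", "content": "Show a code example" }
  ]
}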
🎯 Conclusion
By integrating Ollama with Spring Boot, you can:
- Build private AI systems
- Avoid external API costs
- Create scalable AI-powered applications
👉 This is perfect for:
- Enterprise apps
- Internal tools
- AI experiments