UnWeb: Liberating Content from HTML for the Age of AI Agents
How a simple HTML-to-Markdown converter became essential infrastructure for agentic AI workflows
The Problem We Don’t Talk About Enough
If you’re building AI agents or working with LLMs, you’ve probably hit this wall: your AI needs to consume content from the web, but HTML is a nightmare for language models.
Think about it. Your AI agent needs to read documentation from Confluence. Extract knowledge from SharePoint pages. Process content from wikis. Pull information from blog posts. But here’s what it actually gets:
<div class="container">
<nav class="sidebar">...</nav>
<div class="content-wrapper">
<aside class="ads">...</aside>
<main>
<h1 class="title-xl">Actual Content</h1>
<p class="text-body">This is what you need</p>
</main>
</div>
<footer>...</footer>
</div>
Your AI doesn’t need the navigation. It doesn’t care about the sidebar. The ads are noise. The CSS classes are irrelevant. All it needs is:
# Actual Content
This is what you need
That’s the problem UnWeb solves.
What is UnWeb?
UnWeb is an open-source HTML-to-Markdown converter designed specifically for modern AI workflows. It’s not just another parser—it’s built to understand the intent of HTML content and extract what actually matters.
Two core capabilities:
- Smart Content Extraction: Automatically identifies the main content from full webpages
- Clean Markdown Output: Produces CommonMark that LLMs can actually work with
And it’s dead simple to deploy: a two-container architecture (Vue 3 frontend + ASP.NET Core backend) that runs anywhere—Docker, Kubernetes, or your laptop.
Why Markdown Matters for AI Agents
If you’re building agentic AI systems, markdown isn’t just a nice-to-have—it’s essential infrastructure. Here’s why:
1. Token Efficiency
HTML is verbose. A simple heading like <h1 class="article-title primary">Introduction</h1> becomes # Introduction in markdown. That's 51 characters down to 14. For LLMs where every token counts (and costs money), this matters.
Real example from our testing:
- Original HTML: 47,823 characters
- Extracted Markdown: 8,942 characters
- Token savings: ~81%
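The gap is easy to sanity-check yourself. A minimal sketch using the heading example above; the 4-characters-per-token rule of thumb is only an approximation, and exact counts depend on your model's tokenizer:

```python
# Compare the size of an HTML heading with its Markdown equivalent.
html = '<h1 class="article-title primary">Introduction</h1>'
markdown = "# Introduction"

def approx_tokens(text: str) -> int:
    """Approximate token count with the common ~4 characters/token heuristic."""
    return max(1, len(text) // 4)

savings = 1 - len(markdown) / len(html)
print(f"HTML: {len(html)} chars (~{approx_tokens(html)} tokens)")
print(f"Markdown: {len(markdown)} chars (~{approx_tokens(markdown)} tokens)")
print(f"Character savings: {savings:.0%}")
```

On this tiny example the character savings is roughly 73%; on full pages with navigation, ads, and CSS classes stripped, the reduction is typically much larger.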
2. Context Window Management
Modern LLMs have large context windows (100K-200K tokens), but filling them with HTML cruft wastes that precious space. Clean markdown means:
- More actual content per query
- Better retrieval in RAG systems
- Cleaner embeddings for vector databases
3. Semantic Clarity
AI agents understand structure better when it’s explicit. Markdown provides:
- Clear heading hierarchy (#, ##, ###)
- Obvious lists and code blocks
- Unambiguous links and emphasis
No guessing whether <div class="heading"> is actually a heading.
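Because headings are explicit in Markdown, extracting a document outline is trivial. A quick sketch (covers ATX-style # headings only):

```python
import re

# Clean Markdown as an agent would receive it from UnWeb.
markdown = """# Actual Content
Some prose.
## Details
More prose.
"""

# Each heading is unambiguous: its level is just the number of '#' marks.
outline = [(len(m.group(1)), m.group(2))
           for m in re.finditer(r"^(#{1,6}) (.+)$", markdown, re.MULTILINE)]
print(outline)  # [(1, 'Actual Content'), (2, 'Details')]
```

Doing the same over raw HTML means parsing the DOM and deciding, per site, which elements are really headings.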
4. Reasoning and Planning
When AI agents reason about content, they do it in natural language. Markdown is closer to natural language than HTML. This leads to:
- Better summarization
- More accurate question answering
- Cleaner chain-of-thought reasoning
Real-World Agentic AI Use Cases
Let’s get practical. Here are scenarios where UnWeb becomes critical infrastructure:
Use Case 1: Knowledge Base Migration for AI Assistants
Scenario: You’re building an internal AI assistant that needs to answer questions from your company’s Confluence wiki.
The Challenge: Confluence exports HTML. Your RAG system needs clean text. You have 10,000+ pages.
The Solution with UnWeb:
# Deploy UnWeb
docker-compose -f docker-compose.yml up -d
# Use the API in your migration script
curl -X POST http://localhost:8080/api/convert/upload \
-F "file=@confluence-page.html" \
| jq -r '.markdown' > clean-content.md
# Feed to your vector database
# Your AI now has clean, semantic content
Result: Your AI assistant retrieves more relevant context, gives better answers, and uses fewer tokens doing it.
Use Case 2: Web Research Agents
Scenario: Your AI agent needs to research a topic by reading multiple web articles.
The Challenge: Web scraping gives you HTML soup. The agent wastes tokens on navigation menus and cookie banners.
The Solution with UnWeb:
# Your AI agent workflow
import requests

def research_topic(url):
    # Fetch the page (fetch_page is your own HTTP helper)
    html = fetch_page(url)

    # Convert with UnWeb
    response = requests.post(
        'http://unweb:8080/api/convert/paste',
        json={'html': html}
    )
    clean_content = response.json()['markdown']

    # Feed to LLM for analysis
    analysis = llm.analyze(clean_content)
    return analysis
Result: Your agent reads 5x more articles in the same token budget, leading to more comprehensive research.
Use Case 3: Documentation Processing for Code Agents
Scenario: Your coding assistant needs to understand framework documentation to help developers.
The Challenge: Documentation sites are full of navigation, search bars, version selectors, and ads.
The Solution with UnWeb:
UnWeb’s smart content extraction automatically:
- Ignores the sidebar
- Skips the navigation
- Removes the footer
- Keeps the actual documentation
Your code agent gets pure technical content, ready for reasoning.
Use Case 4: Multi-Agent Systems with Specialized Extractors
Scenario: You’re building a multi-agent system where one agent specializes in content extraction.
The Architecture:
User Request
↓
Coordinator Agent
↓
├─→ Web Scraper Agent (fetches HTML)
├─→ Content Extractor Agent (uses UnWeb) ←── 🎯
├─→ Analysis Agent (processes markdown)
└─→ Synthesis Agent (generates response)
Why This Works: Each agent has a clear job. The Content Extractor Agent is a lightweight wrapper around UnWeb’s API, making the whole system modular and maintainable.
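The Content Extractor Agent in that diagram can be a very thin wrapper around UnWeb's paste endpoint. A minimal sketch, reusing the endpoint and response shape shown elsewhere in this post, with error handling kept deliberately small:

```python
import requests

class ContentExtractorAgent:
    """Thin wrapper around UnWeb's /api/convert/paste endpoint."""

    def __init__(self, unweb_url: str = "http://unweb:8080"):
        self.unweb_url = unweb_url

    def extract(self, html: str) -> str:
        """Send raw HTML to UnWeb and return clean Markdown."""
        response = requests.post(
            f"{self.unweb_url}/api/convert/paste",
            json={"html": html},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["markdown"]
```

The coordinator never touches HTML; it hands raw pages to this agent and receives Markdown the analysis agent can reason over.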
The Architecture: Built for AI Infrastructure
UnWeb isn’t just a web app—it’s designed to be infrastructure in your AI stack:
Deployment Options:
- Docker Compose: docker-compose up and you're running
- Kubernetes: Production-ready with ingress, health checks, and autoscaling
- API-First: RESTful endpoints for programmatic access
Integration Patterns:
- Synchronous Processing: Upload HTML, get markdown instantly
- Batch Processing: Process entire CMS exports
- Microservice: Deploy as a service in your AI platform
Example Integration in LangChain:
from langchain.schema import Document
import requests

class UnWebLoader:
    def __init__(self, unweb_url="http://unweb:8080"):
        self.unweb_url = unweb_url

    def load(self, html_path):
        with open(html_path, 'r') as f:
            html = f.read()
        response = requests.post(
            f"{self.unweb_url}/api/convert/paste",
            json={'html': html}
        )
        markdown = response.json()['markdown']
        # Return as LangChain Document
        return [Document(page_content=markdown)]
# Use in your RAG pipeline
loader = UnWebLoader()
docs = loader.load('knowledge-base.html')
vectorstore.add_documents(docs)
Smart Content Extraction: How It Works
UnWeb doesn’t just strip HTML tags. It understands content structure:
Priority 1: Semantic HTML
- Looks for <main>, <article>, [role='main']
- If found, uses that as the primary content
Priority 2: Content Scoring
- Analyzes all <div> elements
- Scores based on:
- Text length
- Paragraph density
- Link-to-text ratio
- Excludes <nav>, <footer>, <aside> automatically
Priority 3: Fallback
- Uses the entire <body> if no clear main content
Result: You get the article, not the website chrome.
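The scoring step can be sketched as a small function. This is an illustration in Python (UnWeb's backend is C#), and the weights are assumptions for demonstration, not UnWeb's actual values:

```python
def score_block(text: str, link_text: str) -> float:
    """Score a candidate content block: more text and more paragraphs score
    higher; a high link-to-text ratio (typical of nav bars) pulls it down.
    Weights here are illustrative only."""
    if not text:
        return 0.0
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    link_ratio = len(link_text) / len(text)
    return len(text) * 0.01 + len(paragraphs) * 5 - link_ratio * 50

# A nav-like block (all link text) vs. an article-like block (prose):
nav_score = score_block("Home About Contact Blog", "Home About Contact Blog")
article_score = score_block(
    "A long article paragraph of real prose.\n\nAnother paragraph.", ""
)
print(article_score > nav_score)  # the article wins
```

The highest-scoring block becomes the extraction target; navigation and footer chrome lose on the link-ratio penalty alone.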
Getting Started in 2 Minutes
Want to try it right now?
Option 1: Docker Compose (Fastest)
curl -O https://raw.githubusercontent.com/waelouf/unweb/main/docker/docker-compose.yml
docker-compose -f docker-compose.yml up
Visit http://localhost:8081 and paste HTML or upload a file.
Option 2: API Usage
# Convert HTML from stdin
curl -X POST http://localhost:8080/api/convert/paste \
-H "Content-Type: application/json" \
-d '{"html":"<h1>Hello AI</h1><p>Clean content for agents</p>"}' \
| jq -r '.markdown'
# Output:
# # Hello AI
#
# Clean content for agents
Option 3: Kubernetes
kubectl apply -f https://raw.githubusercontent.com/waelouf/unweb/main/kubernetes/all-in-one.yaml
Why We Built This
As developers working with AI agents, we kept running into the same problem: getting clean content into our systems. We tried:
- Browser automation (slow, resource-heavy)
- Generic HTML parsers (too much noise)
- Manual preprocessing (doesn’t scale)
- Commercial APIs (expensive, vendor lock-in)
We needed something that was:
- Fast: Process thousands of pages
- Accurate: Smart content extraction
- Self-hosted: Control your infrastructure
- Free: Open source, deploy anywhere
So we built UnWeb.
The Technical Stack
For the curious:
Backend:
- ASP.NET Core .NET 10 (Minimal APIs)
Frontend:
- Vue 3 with Vite
- File upload + paste support
Deployment:
- nginx for static frontend
- Separate containers for frontend/backend
- Path-based routing via ingress
CommonMark, Not GitHub Flavored
UnWeb outputs strict CommonMark, not GitHub Flavored Markdown. This was intentional:
Why CommonMark?
- Universal standard
- Predictable parsing
- Better for LLMs (less ambiguity)
- Smaller output (no GFM extensions)
If you need GFM, the move is straightforward: GFM is a superset of CommonMark, so UnWeb's output already parses as valid GFM.
What’s Next for UnWeb
We’re keeping it focused, but here’s what’s on the roadmap:
Phase 2:
- URL fetching (pass a URL, get markdown back)
- Batch processing UI
- Custom extraction rules
Phase 3:
- Conversion history
- API authentication
- Webhook support for CI/CD
Maybe:
- Browser extension
- CLI tool
- Language bindings (Python, JavaScript)
Try It for Your AI Workflow
If you’re building AI agents that need to consume web content, give UnWeb a shot:
- Clone/Deploy: Get it running in 2 minutes
- Test with your content: Throw real HTML at it
- Measure the difference: Compare token usage before/after
- Integrate: Add it to your AI pipeline
Links:
- Deployment Repo: github.com/waelouf/unweb
- Docker Images: hub.docker.com/u/waelouf
- Documentation: See the README
Final Thoughts
We’re in the age of agentic AI. Your AI assistants, research agents, coding copilots—they all need clean, structured content to work effectively.
HTML was built for browsers. Markdown is built for humans (and now, AI agents).
UnWeb is the bridge.
It’s not glamorous infrastructure. It won’t make headlines. But it’s the kind of tool that makes your AI systems work better, cost less, and scale further.
And that’s exactly what good infrastructure should do.