UnWeb: Liberating Content from HTML for the Age of AI Agents
How a simple HTML-to-Markdown converter became essential infrastructure for agentic AI workflows
The Problem We Don’t Talk About Enough
If you’re building AI agents or working with LLMs, you’ve probably hit this wall: your AI needs to consume content from the web, but HTML is a nightmare for language models.
Think about it. Your AI agent needs to read documentation from Confluence. Extract knowledge from SharePoint pages. Process content from wikis. Pull information from blog posts. But here’s what it actually gets:
<div class="container">
<nav class="sidebar">...</nav>
<div class="content-wrapper">
<aside class="ads">...</aside>
<main>
<h1 class="title-xl">Actual Content</h1>
<p class="text-body">This is what you need</p>
</main>
</div>
<footer>...</footer>
</div>
Your AI doesn’t need the navigation. It doesn’t care about the sidebar. The ads are noise. The CSS classes are irrelevant. All it needs is:
# Actual Content
This is what you need
That’s the problem UnWeb solves.
What is UnWeb?
UnWeb is an open-source HTML-to-Markdown converter designed specifically for modern AI workflows. It’s not just another parser—it’s built to understand the intent of HTML content and extract what actually matters.
Two core capabilities:
- Smart Content Extraction: Automatically identifies the main content from full webpages
- Clean Markdown Output: Produces CommonMark that LLMs can actually work with
And it’s dead simple to deploy: a two-container architecture (Vue 3 frontend + ASP.NET Core backend) that runs anywhere—Docker, Kubernetes, or your laptop.
Why Markdown Matters for AI Agents
If you’re building agentic AI systems, markdown isn’t just a nice-to-have—it’s essential infrastructure. Here’s why:
1. Token Efficiency
HTML is verbose. A simple heading like <h1 class="article-title primary">Introduction</h1> becomes # Introduction in markdown. That's 51 characters down to 14. For LLMs where every token counts (and costs money), this matters.
Real example from our testing:
- Original HTML: 47,823 characters
- Extracted Markdown: 8,942 characters
- Token savings: ~81%
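The gap is easy to sanity-check yourself. A minimal sketch using the heading example above; the 4-characters-per-token rule of thumb is only an approximation, and exact counts depend on your model's tokenizer:

```python
# Compare the size of an HTML heading with its Markdown equivalent.
html = '<h1 class="article-title primary">Introduction</h1>'
markdown = "# Introduction"

def approx_tokens(text: str) -> int:
    """Approximate token count with the common ~4 characters/token heuristic."""
    return max(1, len(text) // 4)

savings = 1 - len(markdown) / len(html)
print(f"HTML: {len(html)} chars (~{approx_tokens(html)} tokens)")
print(f"Markdown: {len(markdown)} chars (~{approx_tokens(markdown)} tokens)")
print(f"Character savings: {savings:.0%}")
```

On this tiny example the character savings is roughly 73%; on full pages with navigation, ads, and CSS classes stripped, the reduction is typically much larger.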
2. Context Window Management
Modern LLMs have large context windows (100K-200K tokens), but filling them with HTML cruft wastes that precious space. Clean markdown means:
- More actual content per query
- Better retrieval in RAG systems
- Cleaner embeddings for vector databases
3. Semantic Clarity
AI agents understand structure better when it’s explicit. Markdown provides:
- Clear heading hierarchy (#, ##, ###)
- Obvious lists and code blocks
- Unambiguous links and emphasis
No guessing whether <div class="heading"> is actually a heading.
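Because headings are explicit in Markdown, extracting a document outline is trivial. A quick sketch (covers ATX-style # headings only):

```python
import re

# Clean Markdown as an agent would receive it from UnWeb.
markdown = """# Actual Content
Some prose.
## Details
More prose.
"""

# Each heading is unambiguous: its level is just the number of '#' marks.
outline = [(len(m.group(1)), m.group(2))
           for m in re.finditer(r"^(#{1,6}) (.+)$", markdown, re.MULTILINE)]
print(outline)  # [(1, 'Actual Content'), (2, 'Details')]
```

Doing the same over raw HTML means parsing the DOM and deciding, per site, which elements are really headings.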
4. Reasoning and Planning
When AI agents reason about content, they do it in natural language. Markdown is closer to natural language than HTML. This leads to:
- Better summarization
- More accurate question answering
- Cleaner chain-of-thought reasoning
Real-World Agentic AI Use Cases
Let’s get practical. Here are scenarios where UnWeb becomes critical infrastructure:
Use Case 1: Knowledge Base Migration for AI Assistants
Scenario: You’re building an internal AI assistant that needs to answer questions from your company’s Confluence wiki.
The Challenge: Confluence exports HTML. Your RAG system needs clean text. You have 10,000+ pages.
The Solution with UnWeb:
# Deploy UnWeb
docker-compose -f docker-compose.yml up -d
# Use the API in your migration script
curl -X POST http://localhost:8080/api/convert/upload \
-F "file=@confluence-page.html" \
| jq -r '.markdown' > clean-content.md
# Feed to your vector database
# Your AI now has clean, semantic content
Result: Your AI assistant retrieves more relevant context, gives better answers, and uses fewer tokens doing it.
Use Case 2: Web Research Agents
Scenario: Your AI agent needs to research a topic by reading multiple web articles.
The Challenge: Web scraping gives you HTML soup. The agent wastes tokens on navigation menus and cookie banners.
The Solution with UnWeb:
# Your AI agent workflow
import requests

def research_topic(url):
    # Fetch the page (fetch_page is your own HTTP helper)
    html = fetch_page(url)

    # Convert with UnWeb
    response = requests.post(
        'http://unweb:8080/api/convert/paste',
        json={'html': html}
    )
    clean_content = response.json()['markdown']

    # Feed to LLM for analysis
    analysis = llm.analyze(clean_content)
    return analysis
Result: Your agent reads 5x more articles in the same token budget, leading to more comprehensive research.
Use Case 3: Documentation Processing for Code Agents
Scenario: Your coding assistant needs to understand framework documentation to help developers.
The Challenge: Documentation sites are full of navigation, search bars, version selectors, and ads.
The Solution with UnWeb:
UnWeb’s smart content extraction automatically:
- Ignores the sidebar
- Skips the navigation
- Removes the footer
- Keeps the actual documentation
Your code agent gets pure technical content, ready for reasoning.
Use Case 4: Multi-Agent Systems with Specialized Extractors
Scenario: You’re building a multi-agent system where one agent specializes in content extraction.
The Architecture:
User Request
↓
Coordinator Agent
↓
├─→ Web Scraper Agent (fetches HTML)
├─→ Content Extractor Agent (uses UnWeb) ←── 🎯
├─→ Analysis Agent (processes markdown)
└─→ Synthesis Agent (generates response)
Why This Works: Each agent has a clear job. The Content Extractor Agent is a lightweight wrapper around UnWeb’s API, making the whole system modular and maintainable.
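The Content Extractor Agent in that diagram can be a very thin wrapper around UnWeb's paste endpoint. A minimal sketch, reusing the endpoint and response shape shown elsewhere in this post, with error handling kept deliberately small:

```python
import requests

class ContentExtractorAgent:
    """Thin wrapper around UnWeb's /api/convert/paste endpoint."""

    def __init__(self, unweb_url: str = "http://unweb:8080"):
        self.unweb_url = unweb_url

    def extract(self, html: str) -> str:
        """Send raw HTML to UnWeb and return clean Markdown."""
        response = requests.post(
            f"{self.unweb_url}/api/convert/paste",
            json={"html": html},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["markdown"]
```

The coordinator never touches HTML; it hands raw pages to this agent and receives Markdown the analysis agent can reason over.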
The Architecture: Built for AI Infrastructure
UnWeb isn’t just a web app—it’s designed to be infrastructure in your AI stack:
Deployment Options:
- Docker Compose: docker-compose up and you're running
- Kubernetes: Production-ready with ingress, health checks, and autoscaling
- API-First: RESTful endpoints for programmatic access
Integration Patterns:
- Synchronous Processing: Upload HTML, get markdown instantly
- Batch Processing: Process entire CMS exports
- Microservice: Deploy as a service in your AI platform
Example Integration in LangChain:
from langchain.schema import Document
import requests

class UnWebLoader:
    def __init__(self, unweb_url="http://unweb:8080"):
        self.unweb_url = unweb_url

    def load(self, html_path):
        with open(html_path, 'r') as f:
            html = f.read()
        response = requests.post(
            f"{self.unweb_url}/api/convert/paste",
            json={'html': html}
        )
        markdown = response.json()['markdown']
        # Return as LangChain Document
        return [Document(page_content=markdown)]
# Use in your RAG pipeline
loader = UnWebLoader()
docs = loader.load('knowledge-base.html')
vectorstore.add_documents(docs)
Smart Content Extraction: How It Works
UnWeb doesn’t just strip HTML tags. It understands content structure:
Priority 1: Semantic HTML
- Looks for <main>, <article>, [role='main']
- If found, uses that as the primary content
Priority 2: Content Scoring
- Analyzes all <div> elements
- Scores based on:
- Text length
- Paragraph density
- Link-to-text ratio
- Excludes <nav>, <footer>, <aside> automatically
Priority 3: Fallback
- Uses the entire <body> if no clear main content
Result: You get the article, not the website chrome.
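The scoring step can be sketched as a small function. This is an illustration in Python (UnWeb's backend is C#), and the weights are assumptions for demonstration, not UnWeb's actual values:

```python
def score_block(text: str, link_text: str) -> float:
    """Score a candidate content block: more text and more paragraphs score
    higher; a high link-to-text ratio (typical of nav bars) pulls it down.
    Weights here are illustrative only."""
    if not text:
        return 0.0
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    link_ratio = len(link_text) / len(text)
    return len(text) * 0.01 + len(paragraphs) * 5 - link_ratio * 50

# A nav-like block (all link text) vs. an article-like block (prose):
nav_score = score_block("Home About Contact Blog", "Home About Contact Blog")
article_score = score_block(
    "A long article paragraph of real prose.\n\nAnother paragraph.", ""
)
print(article_score > nav_score)  # the article wins
```

The highest-scoring block becomes the extraction target; navigation and footer chrome lose on the link-ratio penalty alone.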
Getting Started in 2 Minutes
Want to try it right now?
Option 1: Docker Compose (Fastest)
curl -O https://raw.githubusercontent.com/waelouf/unweb/main/docker/docker-compose.yml
docker-compose -f docker-compose.yml up
Visit http://localhost:8081 and paste HTML or upload a file.
Option 2: API Usage
# Convert HTML from stdin
curl -X POST http://localhost:8080/api/convert/paste \
-H "Content-Type: application/json" \
-d '{"html":"<h1>Hello AI</h1><p>Clean content for agents</p>"}' \
| jq -r '.markdown'
# Output:
# # Hello AI
#
# Clean content for agents
Option 3: Kubernetes
kubectl apply -f https://raw.githubusercontent.com/waelouf/unweb/main/kubernetes/all-in-one.yaml
Why We Built This
As developers working with AI agents, we kept running into the same problem: getting clean content into our systems. We tried:
- Browser automation (slow, resource-heavy)
- Generic HTML parsers (too much noise)
- Manual preprocessing (doesn’t scale)
- Commercial APIs (expensive, vendor lock-in)
We needed something that was:
- Fast: Process thousands of pages
- Accurate: Smart content extraction
- Self-hosted: Control your infrastructure
- Free: Open source, deploy anywhere
So we built UnWeb.
The Technical Stack
For the curious:
Backend:
- ASP.NET Core .NET 10 (Minimal APIs)
Frontend:
- Vue 3 with Vite
- File upload + paste support
Deployment:
- nginx for static frontend
- Separate containers for frontend/backend
- Path-based routing via ingress
CommonMark, Not GitHub Flavored
UnWeb outputs strict CommonMark, not GitHub Flavored Markdown. This was intentional:
Why CommonMark?
- Universal standard
- Predictable parsing
- Better for LLMs (less ambiguity)
- Smaller output (no GFM extensions)
If you need GFM, the move is straightforward: GFM is a superset of CommonMark, so UnWeb's output already parses as valid GFM.
What’s Next for UnWeb
We’re keeping it focused, but here’s what’s on the roadmap:
Phase 2:
- URL fetching (pass a URL, get markdown back)
- Batch processing UI
- Custom extraction rules
Phase 3:
- Conversion history
- API authentication
- Webhook support for CI/CD
Maybe:
- Browser extension
- CLI tool
- Language bindings (Python, JavaScript)
Try It for Your AI Workflow
If you’re building AI agents that need to consume web content, give UnWeb a shot:
- Clone/Deploy: Get it running in 2 minutes
- Test with your content: Throw real HTML at it
- Measure the difference: Compare token usage before/after
- Integrate: Add it to your AI pipeline
Links:
- Deployment Repo: github.com/waelouf/unweb
- Docker Images: hub.docker.com/u/waelouf
- Documentation: See the README
Final Thoughts
We’re in the age of agentic AI. Your AI assistants, research agents, coding copilots—they all need clean, structured content to work effectively.
HTML was built for browsers. Markdown is built for humans (and now, AI agents).
UnWeb is the bridge.
It’s not glamorous infrastructure. It won’t make headlines. But it’s the kind of tool that makes your AI systems work better, cost less, and scale further.
And that’s exactly what good infrastructure should do.