
UnWeb: Liberating Content from HTML for the Age of AI Agents

How a simple HTML-to-Markdown converter became essential infrastructure for agentic AI workflows



The Problem We Don’t Talk About Enough

If you’re building AI agents or working with LLMs, you’ve probably hit this wall: your AI needs to consume content from the web, but HTML is a nightmare for language models.

Think about it. Your AI agent needs to read documentation from Confluence. Extract knowledge from SharePoint pages. Process content from wikis. Pull information from blog posts. But here’s what it actually gets:

<div class="container">
  <nav class="sidebar">...</nav>
  <div class="content-wrapper">
    <aside class="ads">...</aside>
    <main>
      <h1 class="title-xl">Actual Content</h1>
      <p class="text-body">This is what you need</p>
    </main>
  </div>
  <footer>...</footer>
</div>

Your AI doesn’t need the navigation. It doesn’t care about the sidebar. The ads are noise. The CSS classes are irrelevant. All it needs is:

# Actual Content

This is what you need

That’s the problem UnWeb solves.

What is UnWeb?

UnWeb is an open-source HTML-to-Markdown converter designed specifically for modern AI workflows. It’s not just another parser—it’s built to understand the intent of HTML content and extract what actually matters.

Two core capabilities:

  1. Smart Content Extraction: Automatically identifies the main content from full webpages
  2. Clean Markdown Output: Produces CommonMark that LLMs can actually work with

And it’s dead simple to deploy: a two-container architecture (Vue 3 frontend + ASP.NET Core backend) that runs anywhere—Docker, Kubernetes, or your laptop.

Why Markdown Matters for AI Agents

If you’re building agentic AI systems, markdown isn’t just a nice-to-have—it’s essential infrastructure. Here’s why:

1. Token Efficiency

HTML is verbose. A simple heading like <h1 class="article-title primary">Introduction</h1> becomes # Introduction in markdown. That's 51 characters down to 14. For LLMs where every token counts (and costs money), this matters.

Real example from our testing:

  • Original HTML: 47,823 characters
  • Extracted Markdown: 8,942 characters
  • Character reduction: ~81% (token counts shrink roughly in proportion)
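These numbers are easy to sanity-check. A quick check of the heading example and the reduction figure from this section:

```python
# The heading example from this section: the same content in HTML and Markdown.
html = '<h1 class="article-title primary">Introduction</h1>'
markdown = '# Introduction'

print(len(html), len(markdown))  # 51 14

# The figures quoted above: 47,823 characters of HTML down to 8,942 of Markdown.
reduction = 1 - 8942 / 47823
print(f"{reduction:.0%}")  # 81%
```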

2. Context Window Management

Modern LLMs have large context windows (100K-200K tokens), but filling them with HTML cruft wastes that precious space. Clean markdown means:

  • More actual content per query
  • Better retrieval in RAG systems
  • Cleaner embeddings for vector databases

3. Semantic Clarity

AI agents understand structure better when it’s explicit. Markdown provides:

  • Clear heading hierarchy (#, ##, ###)
  • Obvious lists and code blocks
  • Unambiguous links and emphasis

No guessing whether <div class="heading"> is actually a heading.

4. Reasoning and Planning

When AI agents reason about content, they do it in natural language. Markdown is closer to natural language than HTML. This leads to:

  • Better summarization
  • More accurate question answering
  • Cleaner chain-of-thought reasoning

Real-World Agentic AI Use Cases

Let’s get practical. Here are scenarios where UnWeb becomes critical infrastructure:

Use Case 1: Knowledge Base Migration for AI Assistants

Scenario: You’re building an internal AI assistant that needs to answer questions from your company’s Confluence wiki.

The Challenge: Confluence exports HTML. Your RAG system needs clean text. You have 10,000+ pages.

The Solution with UnWeb:

# Deploy UnWeb
docker-compose -f docker-compose.yml up -d

# Use the API in your migration script
curl -X POST http://localhost:8080/api/convert/upload \
  -F "file=@confluence-page.html" \
  | jq -r '.markdown' > clean-content.md

# Feed to your vector database
# Your AI now has clean, semantic content

Result: Your AI assistant retrieves more relevant context, gives better answers, and uses fewer tokens doing it.

Use Case 2: Web Research Agents

Scenario: Your AI agent needs to research a topic by reading multiple web articles.

The Challenge: Web scraping gives you HTML soup. The agent wastes tokens on navigation menus and cookie banners.

The Solution with UnWeb:

# Your AI agent workflow (fetch_page and llm are placeholders for
# your own HTTP client and LLM wrapper)
import requests

def research_topic(url):
    # Fetch the page
    html = fetch_page(url)

    # Convert with UnWeb
    response = requests.post(
        'http://unweb:8080/api/convert/paste',
        json={'html': html}
    )
    response.raise_for_status()

    clean_content = response.json()['markdown']

    # Feed to LLM for analysis
    analysis = llm.analyze(clean_content)

    return analysis

Result: Your agent reads 5x more articles in the same token budget, leading to more comprehensive research.

Use Case 3: Documentation Processing for Code Agents

Scenario: Your coding assistant needs to understand framework documentation to help developers.

The Challenge: Documentation sites are full of navigation, search bars, version selectors, and ads.

The Solution with UnWeb:

UnWeb’s smart content extraction automatically:

  • Ignores the sidebar
  • Skips the navigation
  • Removes the footer
  • Keeps the actual documentation

Your code agent gets pure technical content, ready for reasoning.

Use Case 4: Multi-Agent Systems with Specialized Extractors

Scenario: You’re building a multi-agent system where one agent specializes in content extraction.

The Architecture:

User Request
    ↓
Coordinator Agent
    ↓
├─→ Web Scraper Agent (fetches HTML)
├─→ Content Extractor Agent (uses UnWeb) ←── 🎯
├─→ Analysis Agent (processes markdown)
└─→ Synthesis Agent (generates response)

Why This Works: Each agent has a clear job. The Content Extractor Agent is a lightweight wrapper around UnWeb’s API, making the whole system modular and maintainable.
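The division of labor above can be sketched as a pipeline of swappable stages. The stage names below mirror the diagram; they are illustrative, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentPipeline:
    """Each stage is one agent from the diagram; any stage can be swapped out."""
    scrape: Callable[[str], str]      # URL -> raw HTML
    extract: Callable[[str], str]     # raw HTML -> markdown (the UnWeb wrapper)
    analyze: Callable[[str], str]     # markdown -> analysis
    synthesize: Callable[[str], str]  # analysis -> final response

    def run(self, url: str) -> str:
        # The coordinator: thread content through each specialist in turn.
        return self.synthesize(self.analyze(self.extract(self.scrape(url))))
```

Because the extractor is just one callable, swapping a hand-rolled HTML stripper for a call to UnWeb's API is a one-line change, and each stage can be tested in isolation.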

The Architecture: Built for AI Infrastructure

UnWeb isn’t just a web app—it’s designed to be infrastructure in your AI stack:

Deployment Options:

  • Docker Compose: docker-compose up and you’re running
  • Kubernetes: Production-ready with ingress, health checks, and autoscaling
  • API-First: RESTful endpoints for programmatic access

Integration Patterns:

  1. Synchronous Processing: Upload HTML, get markdown instantly
  2. Batch Processing: Process entire CMS exports
  3. Microservice: Deploy as a service in your AI platform
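The batch pattern (2) can be sketched with just the standard library, assuming the /api/convert/paste endpoint shown elsewhere in this post:

```python
import json
from pathlib import Path
from urllib import request

def build_request(html: str, unweb_url: str = "http://unweb:8080") -> request.Request:
    """Build a POST request for UnWeb's paste endpoint."""
    body = json.dumps({"html": html}).encode("utf-8")
    return request.Request(
        f"{unweb_url}/api/convert/paste",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def convert_directory(src: Path, dst: Path) -> None:
    """Convert every .html file in src into a .md file in dst."""
    dst.mkdir(parents=True, exist_ok=True)
    for page in src.glob("*.html"):
        req = build_request(page.read_text(encoding="utf-8"))
        with request.urlopen(req) as resp:
            markdown = json.load(resp)["markdown"]
        (dst / page.with_suffix(".md").name).write_text(markdown, encoding="utf-8")
```

For a 10,000-page CMS export, wrapping the loop in a thread pool is the obvious next step; the endpoint is stateless, so requests parallelize cleanly.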

Example Integration in LangChain:

import requests
from langchain.schema import Document

class UnWebLoader:
    def __init__(self, unweb_url="http://unweb:8080"):
        self.unweb_url = unweb_url

    def load(self, html_path):
        with open(html_path, 'r') as f:
            html = f.read()

        response = requests.post(
            f"{self.unweb_url}/api/convert/paste",
            json={'html': html}
        )
        response.raise_for_status()

        markdown = response.json()['markdown']

        # Return as a LangChain Document
        return [Document(page_content=markdown)]

# Use in your RAG pipeline
loader = UnWebLoader()
docs = loader.load('knowledge-base.html')
vectorstore.add_documents(docs)

Smart Content Extraction: How It Works

UnWeb doesn’t just strip HTML tags. It understands content structure:

Priority 1: Semantic HTML

  • Looks for <main>, <article>, [role='main']
  • If found, uses that as the primary content

Priority 2: Content Scoring

  • Analyzes all <div> elements
  • Scores based on:
    • Text length
    • Paragraph density
    • Link-to-text ratio
  • Excludes <nav>, <footer>, <aside> automatically

Priority 3: Fallback

  • Uses entire <body> if no clear main content

Result: You get the article, not the website chrome.
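This is not UnWeb's actual scorer, but the Priority 2 heuristic it describes can be sketched with the standard library's HTML parser: long, paragraph-dense text scores high, while link-heavy blocks like navigation score near zero:

```python
from html.parser import HTMLParser

SKIP = {"nav", "footer", "aside", "script", "style"}

class ContentScorer(HTMLParser):
    """Tally text length, paragraph count, and link text, skipping chrome tags."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # nesting depth inside <nav>/<footer>/<aside>/...
        self.in_link = 0      # nesting depth inside <a>
        self.text_len = 0
        self.link_len = 0
        self.paragraphs = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag == "a":
                self.in_link += 1
            elif tag == "p":
                self.paragraphs += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        n = len(data.strip())
        self.text_len += n
        if self.in_link:
            self.link_len += n

def score(html: str) -> float:
    """Reward long, paragraph-dense text; penalize a high link-to-text ratio."""
    s = ContentScorer()
    s.feed(html)
    if s.text_len == 0:
        return 0.0
    link_ratio = s.link_len / s.text_len
    return s.text_len * (1 + s.paragraphs) * (1 - link_ratio)
```

A block whose text is entirely links (a menu) scores zero, while an article body with several paragraphs dominates, which is exactly the behavior the priorities above describe.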

Getting Started in 2 Minutes

Want to try it right now?

Option 1: Docker Compose (Fastest)

curl -O https://raw.githubusercontent.com/waelouf/unweb/main/docker/docker-compose.yml
docker-compose -f docker-compose.yml up

Visit http://localhost:8081 and paste HTML or upload a file.

Option 2: API Usage

# Convert an inline HTML snippet
curl -X POST http://localhost:8080/api/convert/paste \
  -H "Content-Type: application/json" \
  -d '{"html":"<h1>Hello AI</h1><p>Clean content for agents</p>"}' \
  | jq -r '.markdown'

# Output:
# # Hello AI
#
# Clean content for agents

Option 3: Kubernetes

kubectl apply -f https://raw.githubusercontent.com/waelouf/unweb/main/kubernetes/all-in-one.yaml

Why We Built This

As developers working with AI agents, we kept running into the same problem: getting clean content into our systems. We tried:

  • Browser automation (slow, resource-heavy)
  • Generic HTML parsers (too much noise)
  • Manual preprocessing (doesn’t scale)
  • Commercial APIs (expensive, vendor lock-in)

We needed something that was:

  • Fast: Process thousands of pages
  • Accurate: Smart content extraction
  • Self-hosted: Control your infrastructure
  • Free: Open source, deploy anywhere

So we built UnWeb.

The Technical Stack

For the curious:

Backend:

  • ASP.NET Core on .NET 10 (Minimal APIs)

Frontend:

  • Vue 3 with Vite
  • File upload + paste support

Deployment:

  • nginx for static frontend
  • Separate containers for frontend/backend
  • Path-based routing via ingress

CommonMark, Not GitHub Flavored

UnWeb outputs strict CommonMark, not GitHub Flavored Markdown. This was intentional:

Why CommonMark?

  • Universal standard
  • Predictable parsing
  • Better for LLMs (less ambiguity)
  • Smaller output (no GFM extensions)

If you need GFM, the conversion is straightforward.

What’s Next for UnWeb

We’re keeping it focused, but here’s what’s on the roadmap:

Phase 2:

  • URL fetching (pass a URL, get markdown back)
  • Batch processing UI
  • Custom extraction rules

Phase 3:

  • Conversion history
  • API authentication
  • Webhook support for CI/CD

Maybe:

  • Browser extension
  • CLI tool
  • Language bindings (Python, JavaScript)

Try It for Your AI Workflow

If you’re building AI agents that need to consume web content, give UnWeb a shot:

  1. Clone/Deploy: Get it running in 2 minutes
  2. Test with your content: Throw real HTML at it
  3. Measure the difference: Compare token usage before/after
  4. Integrate: Add it to your AI pipeline
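Step 3 doesn't need a real tokenizer to show the trend. A rough rule of thumb (~4 characters per token for English text, an approximation only) is enough for a first comparison:

```python
def rough_tokens(text: str) -> int:
    # Rule-of-thumb estimate: ~4 characters per token for English text.
    # Use your model's actual tokenizer for billing-grade numbers.
    return max(1, len(text) // 4)

html = '<div class="wrap"><h1 class="title">Intro</h1><p class="body">Hello agents</p></div>'
markdown = '# Intro\n\nHello agents'

saved = 1 - rough_tokens(markdown) / rough_tokens(html)
print(f"estimated token savings: {saved:.0%}")
```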


Final Thoughts

We’re in the age of agentic AI. Your AI assistants, research agents, coding copilots—they all need clean, structured content to work effectively.

HTML was built for browsers. Markdown is built for humans (and now, AI agents).

UnWeb is the bridge.

It’s not glamorous infrastructure. It won’t make headlines. But it’s the kind of tool that makes your AI systems work better, cost less, and scale further.

And that’s exactly what good infrastructure should do.