Glossary

Markdown/HTML Splitting

Markdown/HTML splitting is a document segmentation strategy that uses the native structural elements of markup languages as natural boundaries for creating semantically coherent chunks in retrieval-augmented generation systems.

DOCUMENT CHUNKING STRATEGY

What is Markdown/HTML Splitting?

A document segmentation technique that uses the inherent structural elements of markup languages to create semantically coherent text chunks for retrieval-augmented generation (RAG).

Markdown/HTML splitting is a document chunking strategy that uses the native structural elements—such as headings (#, <h1>), lists, paragraphs (<p>), and code blocks—of markup languages as natural boundaries for segmentation. This method preserves the logical and semantic organization intended by the document's author, creating chunks that are more coherent than those produced by arbitrary character-count splits. It is a form of semantic chunking and layout-aware chunking specifically optimized for web-based and documentation content.

The process involves parsing the document into an Abstract Syntax Tree (AST) for Markdown or a Document Object Model (DOM) for HTML, then identifying these structural nodes. Splitting at these boundaries, such as immediately before a heading that opens a new section, keeps each chunk contextually intact, which improves retrieval precision by reducing the chance that a retrieved chunk mixes unrelated topics. This strategy is fundamental in retrieval-augmented generation architectures for grounding language models in well-structured enterprise data from sources like internal wikis, API documentation, and knowledge bases.

DOCUMENT CHUNKING STRATEGIES

Key Features of Markdown/HTML Splitting

Markdown and HTML splitting are document segmentation strategies that use the native structural elements (e.g., headings, lists, divs) of markup languages as natural boundaries for creating semantically coherent chunks.

Structure-Aware Segmentation

Markdown/HTML splitting uses the document's inherent Document Object Model (DOM) or markdown syntax tree to define chunk boundaries. Instead of arbitrary character counts, it splits at logical structural elements:

Headings (<h1>, #)
Paragraphs (<p>, blank lines)
Lists (<ul>, -)
Code blocks (<pre>, ```)
Horizontal rules (<hr>, ---)
Table rows (<tr>)

This preserves the author's intended grouping of ideas, leading to chunks that are more self-contained and meaningful for retrieval.
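As a minimal sketch of splitting at one of these boundaries, the snippet below starts a new chunk at every Markdown heading while ignoring `#` characters inside fenced code blocks. It uses only the standard library; the function name `split_markdown_by_heading` and the sample document are hypothetical, and setext headings and horizontal rules are not handled.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")  # ATX heading: 1-6 '#' characters, then text
FENCE = "`" * 3                        # marker that opens or closes a code fence


def split_markdown_by_heading(text):
    """Split Markdown into chunks, starting a new chunk at every ATX heading."""
    chunks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # a '#' inside a fence is not a heading
        if not in_fence and HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


# Hypothetical sample document, built line by line.
doc = "\n".join([
    "# Guide", "Intro text.", "",
    "## Install", "Run the installer.", "",
    FENCE, "# not a heading", "pip install example", FENCE, "",
    "## Usage", "Call the tool.",
])

chunks = split_markdown_by_heading(doc)
for chunk in chunks:
    print(repr(chunk.splitlines()[0]), len(chunk))
```

Each chunk begins with its own heading, so the heading text travels with the content it introduces.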

Preservation of Semantic Coherence

The primary advantage is maintaining semantic coherence within each chunk. A chunk created by splitting at a heading (## Results) will contain all content related to that subsection until the next heading of equal or greater importance. This contrasts with fixed-length splitting, which can sever a sentence mid-thought or separate a data point from its explanatory context. Coherent chunks improve retrieval precision because the embedded vector representation more accurately reflects a complete concept or topic.

Hierarchical Chunking Support

Markup languages are inherently hierarchical. Splitters can leverage this to create parent-child chunk relationships automatically. For example:

Parent Chunk: A section defined by an <h2> heading.
Child Chunks: Individual paragraphs or lists within that section.

This enables multi-granularity retrieval. A broad query can retrieve the parent chunk for an overview, while a specific query can retrieve a precise child chunk. This structure is foundational for advanced retrieval patterns that index a small child chunk but return it together with its parent's context. It is closely related to sentence window retrieval, which expands a retrieved sentence with its neighboring sentences.
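The parent-child relationship can be sketched as follows, assuming Markdown input where each `## ` heading opens a parent section and blank lines separate child paragraphs. The function name `build_parent_child_chunks` and the dictionary layout are illustrative choices, not a framework API.

```python
import re


def build_parent_child_chunks(markdown_text):
    """Return one parent per '## ' section, each with paragraph-level children."""
    # Split just before every level-2 heading (zero-width match, Python 3.7+).
    sections = re.split(r"(?m)^(?=## )", markdown_text)
    parents = []
    for section in sections:
        if not section.startswith("## "):
            continue  # text before the first level-2 heading is skipped here
        heading, _, body = section.partition("\n")
        children = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
        parents.append({
            "heading": heading[3:].strip(),
            "text": section.strip(),   # parent chunk: the whole section
            "children": children,      # child chunks: its paragraphs and lists
        })
    return parents


# Hypothetical sample document.
doc = "# Title\nIntro.\n\n## Setup\nInstall it.\n\nConfigure it.\n\n## Usage\n- run\n- stop\n"
parents = build_parent_child_chunks(doc)
for parent in parents:
    print(parent["heading"], "->", parent["children"])
```

At query time, a retriever could embed the children and look up the enclosing parent by position in this list when broader context is needed.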

Metadata and Attribute Extraction

During splitting, valuable metadata can be extracted from the markup and attached to each chunk, enriching it for filtering and ranking. This includes:

Heading level and text (for hierarchical context)
HTML element type (e.g., table, code)
CSS class names or IDs (e.g., class="important")
Markdown formatting (e.g., bold text indicating key terms)
Link destinations (href attributes)

This metadata allows for hybrid retrieval strategies, combining semantic vector search with precise metadata filters (e.g., WHERE element_type = 'code') to quickly narrow results.
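A rough illustration of metadata extraction, using the standard library's `html.parser` rather than any of the frameworks discussed later. The class name `MetadataChunker`, the chosen set of block tags, and the metadata keys are assumptions made for this sketch; nested block elements are deliberately ignored to keep it short.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"h1", "h2", "h3", "p", "pre", "li"}  # assumed chunk-level elements


class MetadataChunker(HTMLParser):
    """Collect block-level elements as chunks with simple metadata."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._open = None  # chunk currently being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in BLOCK_TAGS and self._open is None:
            self._open = {
                "element_type": "code" if tag == "pre" else tag,
                "css_class": attrs.get("class"),
                "links": [],
                "text": "",
                "_tag": tag,  # internal marker, removed when the chunk closes
            }
        elif tag == "a" and self._open is not None and attrs.get("href"):
            self._open["links"].append(attrs["href"])

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["_tag"]:
            chunk = self._open
            del chunk["_tag"]
            chunk["text"] = chunk["text"].strip()
            self.chunks.append(chunk)
            self._open = None


# Hypothetical sample page.
page = (
    '<h2 id="setup">Setup</h2>'
    '<p class="important">See the <a href="/docs/install">install guide</a>.</p>'
    "<pre>pip install example</pre>"
)
parser = MetadataChunker()
parser.feed(page)
parser.close()

# Equivalent of the metadata filter: element_type = 'code'
code_chunks = [c for c in parser.chunks if c["element_type"] == "code"]
print(code_chunks)
```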

Mitigation of Context Fragmentation

A key challenge in chunking is context fragmentation, where related information is split across separate chunks. Markdown/HTML splitting reduces this by using large structural blocks as primary units. To handle cases where these blocks exceed desired token limits, it is often combined with recursive splitting. The strategy is applied hierarchically: first split by major headings, then by paragraphs, then by sentences if needed. This ensures the largest semantically intact unit is always preserved first, minimizing information loss at boundaries.
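The heading-then-paragraph-then-sentence fallback might look like the sketch below. It is a simplified, standard-library illustration: `split_with_fallback` and `max_chars` are hypothetical names, size is measured in characters rather than tokens, and small pieces are not merged back together as a production splitter would likely do.

```python
import re


def split_with_fallback(text, max_chars=200):
    """Split by headings, then paragraphs, then sentences, only as needed."""
    levels = [
        r"(?m)^(?=#{1,6} )",  # just before each Markdown heading
        r"\n\s*\n",           # blank line between paragraphs
        r"(?<=[.!?])\s+",     # whitespace after sentence-ending punctuation
    ]

    def recurse(piece, depth):
        piece = piece.strip()
        if not piece:
            return []
        if len(piece) <= max_chars or depth == len(levels):
            return [piece]  # small enough, or nothing finer left to split on
        out = []
        for part in re.split(levels[depth], piece):
            out.extend(recurse(part, depth + 1))
        return out

    return recurse(text, 0)


# Hypothetical sample: one short section and one section with a long paragraph.
doc = (
    "# A\nShort.\n\n"
    "# B\nPara one is short.\n\n"
    + "This sentence is about thirty chars. " * 3
)
chunks = split_with_fallback(doc, max_chars=60)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Only the oversized paragraph is broken into sentences; the short section and the short paragraph stay intact.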

Implementation in Common Frameworks

This strategy is implemented in major RAG development frameworks:

LangChain: The MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter split based on headers and preserve header metadata.
LlamaIndex: The HTMLNodeParser and MarkdownNodeParser convert elements/headers into Node objects with inherited metadata.
Unstructured.io: The library's partitioning functions natively understand HTML/XML tags and Markdown syntax to output structured elements.

These tools handle the complexity of nested tags and inconsistent markup, allowing engineers to focus on configuring chunk granularity (e.g., split by h2 only) rather than writing custom parsers.

DOCUMENT CHUNKING STRATEGY

How Markdown/HTML Splitting Works

Markdown/HTML splitting is a document segmentation strategy that uses the native structural elements of markup languages, such as headings (#, ##), lists (-, *), code blocks (```), and HTML tags (<p>, <div>, <h1>), as natural boundaries for creating semantically coherent chunks. This method preserves the logical organization intended by the document's author, ensuring that related content, like a header and its subsequent paragraphs, remains together. The process typically involves parsing the document into an Abstract Syntax Tree (AST) or DOM, identifying these elements, and splitting the text at their boundaries, which tends to serve retrieval tasks better than arbitrary character-based splitting.

This strategy is a form of layout-aware or semantic chunking that improves retrieval quality in Retrieval-Augmented Generation (RAG) systems by ensuring retrieved chunks are self-contained logical units. It directly addresses the context window constraint of language models by providing meaningful, discrete units of information. Compared to fixed-length chunking, it reduces the risk of severing critical contextual relationships at chunk boundaries, though it may be combined with techniques like chunk overlap for added robustness. Tools like the LangChain Text Splitter and LlamaIndex Node Parser often implement this method to preprocess documents for chunk indexing in a vector database.

DOCUMENT SEGMENTATION

Comparison with Other Chunking Strategies

This table compares Markdown/HTML Splitting against other common document chunking strategies across key technical features relevant to Retrieval-Augmented Generation (RAG) pipelines.

Feature / Metric	Markdown/HTML Splitting	Fixed-Length Chunking	Semantic Chunking	Recursive Character Splitting
Boundary Definition	Native structural elements (headings, lists, divs)	Arbitrary character/token count	Semantic coherence (topics, paragraphs)	Hierarchy of separators (e.g., \n\n, \n, .)
Preserves Document Structure
Chunk Size Consistency	Variable	Fixed	Variable	Variable within target range
Requires Pre-Trained Model
Handles Semi-Structured Data
Implementation Complexity	Medium (requires parser)	Low	High (requires embedding model)	Low-Medium
Optimal For	Web content, documentation, code repos	Plain text, transcripts, logs	Long-form articles, reports, books	General-purpose text with mixed formatting
Common Overlap Strategy	Contextual (parent/child nodes)	Fixed sliding window	Minimal (relies on semantic boundaries)	Configurable sliding window

MARKDOWN/HTML SPLITTING

Framework and Tool Implementation

Markdown and HTML splitting leverage the inherent structural elements of markup languages—like headings, lists, and semantic tags—to create semantically coherent document chunks. This approach is foundational for retrieval-augmented generation (RAG) systems, as it preserves logical document flow and improves retrieval accuracy.

LangChain's Markdown/HTML Header Splitters

The LangChain framework provides specialized splitters that use document structure as primary chunk boundaries. The MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter parse source files to split content under specific heading levels (e.g., #, ##).

Preserves Hierarchy: Chunks are created per header, maintaining parent-child relationships.
Metadata Injection: The header text is automatically added to chunk metadata, enabling filtered retrieval.
Configurable Granularity: Engineers can define which header levels to split on, balancing chunk size and semantic unity.

LlamaIndex Node Parsers for Structured Documents

LlamaIndex implements markdown/HTML splitting through its NodeParser abstraction. The MarkdownNodeParser and HTMLNodeParser convert documents into Node objects, which are the atomic units for indexing.

Semantic Chunking by Default: These parsers naturally split on elements like headings (<h1>) and paragraphs (<p>).
Rich Metadata: Each Node retains structural metadata (e.g., tag type, heading level).
Hierarchical Indexing: Nodes can be organized into parent-child graphs, enabling multi-level retrieval strategies.

Unstructured.io's Partitioning Engine

Unstructured.io is a production-grade library for ingesting and pre-processing documents, including complex HTML and Markdown. Its partition_html and partition_md functions are widely used entry points for these two formats.

Layout-Aware Splitting: Goes beyond simple tags to understand visual cues and document layout.
Element Categorization: Returns structured data with element types (e.g., Title, NarrativeText, Table).
Pre-Built Cleaners: Includes utilities for removing boilerplate (headers, footers, ads) before chunking, dramatically improving signal-to-noise ratio.

Beautiful Soup for Custom HTML Parsing

For maximum control, engineers often use Beautiful Soup, a Python library for parsing HTML and XML. It allows for the creation of highly customized splitting logic.

Fine-Grained Selectors: Use CSS selectors or the find_all() method to target specific elements (e.g., article, .post-content).
Custom Boundary Logic: Programmatically define chunking rules based on any combination of tags, classes, or IDs.
Preprocessing Integration: Easily integrates into data pipelines to clean, extract, and structure raw HTML before it reaches the chunker.

Implementation Best Practices

Effective markdown/HTML splitting requires more than just applying a library. Key engineering considerations include:

Combine with Recursive Splitting: Use header-based splitting first, then apply a recursive character splitter to long sections to respect the model's context window.
Manage Chunk Overlap: Implement chunk overlap (e.g., 10% of chunk size) when splitting within a structural element to prevent information loss at seams.
Validate Output: Post-splitting, analyze chunk length distributions and semantic coherence to tune parameters like chunk_size and chunk_overlap.
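To make the overlap and validation points concrete, here is a small standard-library sketch. The names `split_with_overlap` and `length_report` are hypothetical, sizes are in characters rather than tokens, and the 10% overlap simply mirrors the example figure above.

```python
from statistics import mean, median


def split_with_overlap(text, chunk_size=100, chunk_overlap=10):
    """Slide a fixed-size window over text, repeating chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and below chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last chunk is never only repeated overlap.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def length_report(chunks):
    """Summarize chunk lengths so chunk_size and chunk_overlap can be tuned."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


# Hypothetical long section that exceeded the size limit after header splitting.
section = "".join(chr(97 + i % 26) for i in range(250))
chunks = split_with_overlap(section, chunk_size=100, chunk_overlap=10)
report = length_report(chunks)
print(report)
```

A long tail of very short or very long chunks in the report is usually a hint that the header levels or size limits need adjusting.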

Evaluation and Impact on Retrieval

The quality of splitting directly impacts RAG system performance. Key evaluation metrics and considerations:

Retrieval Recall: Structural splitting can improve recall by keeping related concepts (e.g., a list and its introductory paragraph) together in one chunk.
Answer Precision: Well-formed chunks reduce the chance of the LLM receiving incomplete context, mitigating hallucinations.
Benchmarking: Use retrieval evaluation metrics like Hit Rate or Mean Reciprocal Rank (MRR) to A/B test different splitting strategies (e.g., fixed-length vs. markdown-based) on your corpus.
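Both metrics are simple to compute once each test query has a known relevant chunk. The sketch below assumes exactly one relevant chunk ID per query; the function names, chunk IDs, and rankings are made-up toy data for illustration, not measured results.

```python
def hit_rate(results, relevant, k=5):
    """Fraction of queries whose relevant chunk appears in the top k results."""
    hits = sum(1 for q, ranked in results.items() if relevant[q] in ranked[:k])
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the relevant chunk (0 when it is not retrieved)."""
    total = 0.0
    for q, ranked in results.items():
        if relevant[q] in ranked:
            total += 1.0 / (ranked.index(relevant[q]) + 1)
    return total / len(results)


# Toy data: ranked chunk IDs returned for three queries by one splitting strategy.
relevant = {"q1": "c1", "q2": "c7", "q3": "c4"}
ranked_results = {
    "q1": ["c1", "c2", "c3"],  # relevant chunk at rank 1
    "q2": ["c5", "c7", "c9"],  # relevant chunk at rank 2
    "q3": ["c8", "c6", "c2"],  # relevant chunk missing
}

hr = hit_rate(ranked_results, relevant, k=3)
mrr = mean_reciprocal_rank(ranked_results, relevant)
print(f"hit rate@3 = {hr:.3f}, MRR = {mrr:.3f}")
```

Running the same queries against indexes built with different splitters and comparing these two numbers gives a basic A/B comparison.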

DOCUMENT CHUNKING

Frequently Asked Questions

Essential questions about splitting documents using their native markup structure for optimal retrieval-augmented generation.

Markdown/HTML splitting is a document segmentation strategy that uses the native structural elements of markup languages, such as headings (#, <h1>), list items (-, <li>), code blocks (```, <pre>), and divs (<div>), as natural boundaries for creating semantically coherent chunks. It works by parsing HTML into its Document Object Model (DOM), or Markdown into its abstract syntax tree, to identify these elements, then splitting the text at their boundaries. This method preserves the logical flow and self-contained nature of sections, producing chunks that are more meaningful for retrieval than arbitrary character-based splits. For example, a chunk might be defined as everything between an <h2> tag and the next <h2> tag, ensuring a complete subsection is kept intact.
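The "everything between one <h2> and the next" rule can be sketched with the standard library's `html.parser`. The class name `H2SectionSplitter` and the sample page are hypothetical, and content before the first <h2> is simply dropped in this version.

```python
from html.parser import HTMLParser


class H2SectionSplitter(HTMLParser):
    """Group page text into sections that each start at an <h2>."""

    def __init__(self):
        super().__init__()
        self.sections = []  # list of {"heading": str, "text": str}
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.sections.append({"heading": "", "text": ""})
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if not self.sections:
            return  # ignore content before the first <h2>
        key = "heading" if self._in_h2 else "text"
        self.sections[-1][key] += data + " "


def split_by_h2(html_text):
    parser = H2SectionSplitter()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace left over from tag boundaries.
    return [
        {key: " ".join(value.split()) for key, value in section.items()}
        for section in parser.sections
    ]


# Hypothetical sample page.
page = (
    "<h1>Doc</h1>"
    "<h2>Install</h2><p>Run it.</p><ul><li>Step one</li></ul>"
    "<h2>Usage</h2><p>Call it.</p>"
)
sections = split_by_h2(page)
print(sections)
```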

DOCUMENT CHUNKING STRATEGIES

Related Terms

Markdown/HTML splitting is one of several strategies for segmenting documents. These related techniques define how raw text is transformed into retrievable units for RAG systems.

Semantic Chunking

A strategy that splits text based on its natural semantic boundaries, such as the conclusion of a topic, a shift in narrative, or the end of a coherent argument. Unlike fixed-length or delimiter-based methods, it aims to keep self-contained ideas within a single chunk, which can improve retrieval relevance.

Primary Mechanism: Often uses a language model or a heuristic (like a significant drop in cosine similarity between sentences) to identify breakpoints.
Advantage: Produces chunks with higher intrinsic coherence, which can lead to more precise semantic search results.
Trade-off: More computationally expensive than rule-based methods and can be sensitive to the chosen model or threshold.
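The similarity-drop heuristic can be illustrated without an embedding model. In the sketch below, bag-of-words counts stand in for real sentence embeddings, which is a deliberate simplification; the function names and the threshold of 0.2 are arbitrary assumptions for the example.

```python
import math
import re
from collections import Counter


def bow(sentence):
    """Bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bow(prev), bow(cur)) < threshold:
            chunks.append([])  # similarity dropped: treat as a topic boundary
        chunks[-1].append(cur)
    return [" ".join(chunk) for chunk in chunks]


# Hypothetical sample sentences covering two topics.
sentences = [
    "The parser reads the markdown file.",
    "The parser builds a tree from the markdown file.",
    "Pricing starts at ten dollars per seat.",
    "Pricing for each seat is billed monthly.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

With a real embedding model the structure stays the same; only `bow` would be replaced, and the threshold would need re-tuning.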

Recursive Character Text Splitting

A hierarchical, rule-based strategy that recursively splits text using a prioritized list of separators (e.g., "\n\n", "\n", ". ", " ") until the resulting chunks are within a desired size range.

Process: It first attempts to split on the primary separator (e.g., double newlines for paragraphs). If chunks are still too large, it moves to the next separator (e.g., single newlines), and so on.
Use Case: A robust, general-purpose method that works well on plain text where structural markup is not available. LangChain, for example, recommends its RecursiveCharacterTextSplitter for generic text.
Key Parameter: The chunk_size and chunk_overlap settings control the final output granularity.
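The recursive process can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical name, size is counted in characters, the separator list is an assumed default, and chunk_overlap is omitted for brevity.

```python
def recursive_split(text, chunk_size=40, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator; recurse with finer ones for oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for part in text.split(sep):
        candidate = part if not buffer else buffer + sep + part
        if len(candidate) <= chunk_size:
            buffer = candidate  # keep packing parts into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
        if len(part) <= chunk_size:
            buffer = part
        else:
            chunks.extend(recursive_split(part, chunk_size, finer))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks


# Hypothetical sample text with one paragraph longer than chunk_size.
text = (
    "Alpha beta gamma delta.\n\n"
    "Epsilon zeta eta theta iota kappa lambda mu nu xi omicron.\n\n"
    "Short."
)
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Paragraphs that already fit are emitted whole; only the long middle paragraph is broken further, at word boundaries.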

Layout-Aware Chunking

A strategy for semi-structured documents (PDFs, scanned forms, HTML tables) that uses visual and spatial cues—not just textual ones—to define chunk boundaries. It is closely related to markdown/HTML splitting but applied to rendered document layouts.

Input Sources: PDFs where text position, font size, and column boundaries convey structure.
Techniques: Leverages libraries like PDFPlumber, Docling, or Unstructured.io to extract not just text, but also its bounding boxes and reading order.
Output: Chunks that respect logical visual units, such as a figure with its caption, a table, or a sidebar, preserving information that pure text splitting would scramble.

Hierarchical Chunking

A strategy that creates a multi-level tree structure of chunks (e.g., document > section > subsection > paragraph) to enable retrieval at different levels of granularity. Markdown/HTML, with its inherent heading hierarchy, is a natural input for this method.

Parent-Child Relationships: A large 'parent' chunk (e.g., a whole section) contains smaller 'child' chunks (e.g., individual paragraphs within it).
Retrieval Flexibility: A broad query can retrieve a parent chunk for overview context, while a specific query can retrieve a precise child chunk for a detailed answer.
Implementation: Often stored in vector databases that support hierarchical or multi-vector indexing, allowing the retrieval system to choose the appropriate level.

Abstract Syntax Tree (AST) Chunking

A specialized strategy for source code that parses code into its syntactic tree structure and uses language-defined nodes (functions, classes, methods) as logical, self-contained chunks. It is the semantic equivalent of markdown/HTML splitting for programming languages.

Process: A parser (like tree-sitter) builds an AST where each node represents a syntactic element. Chunks are created by extracting the text of subtrees corresponding to logical units.
Benefit: Preserves syntactic integrity; a chunk is a complete function or class definition, not an arbitrary slice of text that might break syntax.
Application: Essential for code retrieval-augmented generation (RAG) systems, where retrieving a half-function is useless.
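For Python source specifically, the same idea can be sketched with the standard library's `ast` module instead of tree-sitter. This swap is made only to keep the example dependency-free; `chunk_python_source` is a hypothetical helper, it looks at top-level definitions only, and `ast.get_source_segment` requires Python 3.8 or later.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # Exact source text of the node (decorators are not included).
                "text": ast.get_source_segment(source, node),
            })
    return chunks


# Hypothetical sample module.
source = (
    "import math\n"
    "\n"
    "def area(r):\n"
    "    return math.pi * r ** 2\n"
    "\n"
    "class Circle:\n"
    "    def __init__(self, r):\n"
    "        self.r = r\n"
)
chunks = chunk_python_source(source)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

Each chunk parses on its own, although it may still depend on imports or names defined elsewhere in the file.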

Sentence Window Retrieval

A retrieval strategy (not strictly a chunking strategy) that uses fine-grained chunks but expands context at query time. It often relies on high-quality sentence splitting as a prerequisite, similar to using <p> tags in HTML.

Two-Stage Process:
1. Indexing: Individual sentences are embedded and indexed as separate, fine-grained chunks.
2. Retrieval & Expansion: When a sentence is retrieved, its immediate surrounding sentences (a 'window') are also fetched to provide the language model with necessary context.
Advantage: Combines the precision of retrieving a single relevant sentence with the contextual completeness of a larger passage, optimizing for both accuracy and information sufficiency.
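The two stages can be sketched as below. Word overlap stands in for embedding similarity here, purely to keep the example self-contained; the function names, the regex-based sentence splitter, and the window size of 1 are assumptions for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by space."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_with_window(sentences, query, window=1):
    """Pick the best-matching sentence, then return it with its neighbors."""
    query_words = words(query)
    # Stage 1: score each indexed sentence (word overlap as a toy similarity).
    best = max(
        range(len(sentences)),
        key=lambda i: len(query_words & words(sentences[i])),
    )
    # Stage 2: expand to the surrounding window before passing to the model.
    lo = max(0, best - window)
    hi = min(len(sentences), best + window + 1)
    return sentences[best], " ".join(sentences[lo:hi])


# Hypothetical sample passage and query.
text = (
    "The splitter parses headings first. Each heading starts a new chunk. "
    "Long chunks are split again by paragraph. Metadata is attached afterwards."
)
sentences = split_sentences(text)
hit, context = retrieve_with_window(sentences, "how are long chunks split", window=1)
print("hit:    ", hit)
print("context:", context)
```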

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.
