Inferensys

Glossary

Markdown/HTML Splitting

Markdown/HTML splitting is a document segmentation strategy that uses the native structural elements of markup languages as natural boundaries for creating semantically coherent chunks in retrieval-augmented generation systems.
DOCUMENT CHUNKING STRATEGY

What is Markdown/HTML Splitting?

A document segmentation technique that uses the inherent structural elements of markup languages to create semantically coherent text chunks for retrieval-augmented generation (RAG).

Markdown/HTML splitting is a document chunking strategy that uses the native structural elements—such as headings (#, <h1>), lists, paragraphs (<p>), and code blocks—of markup languages as natural boundaries for segmentation. This method preserves the logical and semantic organization intended by the document's author, creating chunks that are more coherent than those produced by arbitrary character-count splits. It is a form of semantic chunking and layout-aware chunking specifically optimized for web-based and documentation content.

The process involves parsing the document into an Abstract Syntax Tree (AST) for Markdown or a Document Object Model (DOM) for HTML to identify these structural nodes. Splitting at these boundaries, such as immediately before a heading that opens a new section, keeps each heading together with its content and helps maintain contextual integrity, which improves retrieval precision by reducing the chance that a retrieved chunk contains unrelated topics. This strategy is fundamental in retrieval-augmented generation architectures for grounding language models in well-structured, factual enterprise data from sources like internal wikis, API documentation, and knowledge bases.

DOCUMENT CHUNKING STRATEGIES

Key Features of Markdown/HTML Splitting

Markdown and HTML splitting are document segmentation strategies that use the native structural elements (e.g., headings, lists, divs) of markup languages as natural boundaries for creating semantically coherent chunks.

01

Structure-Aware Segmentation

Markdown/HTML splitting uses the document's inherent Document Object Model (DOM) or markdown syntax tree to define chunk boundaries. Instead of arbitrary character counts, it splits at logical structural elements:

  • Headings (<h1>, #)
  • Paragraphs (<p>, blank lines)
  • Lists (<ul>, -)
  • Code blocks (<pre>, ```)
  • Horizontal rules (<hr>, ---)
  • Table rows (<tr>)

This preserves the author's intended grouping of ideas, leading to chunks that are more self-contained and meaningful for retrieval.
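
As a rough illustration of structure-aware segmentation, the sketch below walks a Markdown string line by line and groups it into typed blocks (heading, paragraph, list, code, rule). It is a simplified stand-in for a real Markdown parser: the `Block` class and `tokenize_markdown` function are hypothetical names, and constructs such as setext headings, nested lists, and tables are deliberately ignored.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # the Markdown code-fence marker
HEADING = re.compile(r"^#{1,6}\s")
LIST_ITEM = re.compile(r"^\s*([-*+]|\d+\.)\s")
RULE = re.compile(r"^(-{3,}|\*{3,}|_{3,})\s*$")


@dataclass
class Block:
    kind: str  # "heading", "paragraph", "list", "code", or "rule"
    text: str


def tokenize_markdown(source: str) -> list[Block]:
    """Group lines into structural blocks that can serve as chunk boundaries."""
    blocks: list[Block] = []
    buffer: list[str] = []
    kind = "paragraph"
    in_fence = False

    def flush() -> None:
        # Emit the block collected so far and reset to the default kind.
        nonlocal buffer, kind
        if buffer:
            blocks.append(Block(kind, "\n".join(buffer)))
        buffer, kind = [], "paragraph"

    for line in source.splitlines():
        if line.startswith(FENCE):
            if in_fence:  # closing fence ends the code block
                buffer.append(line)
                flush()
                in_fence = False
            else:  # opening fence starts a new code block
                flush()
                buffer, kind, in_fence = [line], "code", True
        elif in_fence:
            buffer.append(line)  # keep code verbatim, blank lines included
        elif not line.strip():
            flush()  # a blank line closes the current paragraph or list
        elif HEADING.match(line):
            flush()
            blocks.append(Block("heading", line))
        elif RULE.match(line):
            flush()
            blocks.append(Block("rule", line))
        elif LIST_ITEM.match(line):
            if kind != "list":
                flush()
                kind = "list"
            buffer.append(line)
        else:
            buffer.append(line)
    flush()
    return blocks
```

Each returned block is a candidate chunk on its own, or a unit that can be merged with its neighbors up to a target size.
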
02

Preservation of Semantic Coherence

The primary advantage is maintaining semantic coherence within each chunk. A chunk created by splitting at a heading (## Results) will contain all content related to that subsection until the next heading of equal or greater importance. This contrasts with fixed-length splitting, which can sever a sentence mid-thought or separate a data point from its explanatory context. Coherent chunks improve retrieval precision because the embedded vector representation more accurately reflects a complete concept or topic.
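
The heading-scoped rule described above can be sketched as follows. This is a minimal illustration that assumes ATX-style headings (`#` through `######`); `section_under` is a hypothetical helper name, and lines inside fenced code blocks are skipped so that a `#` comment is not mistaken for a heading.

```python
import re

FENCE = "`" * 3  # the Markdown code-fence marker
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def section_under(markdown: str, title: str) -> str:
    """Return the heading `title` plus everything up to the next heading
    of the same or a higher level (fewer or equal '#' characters)."""
    lines = markdown.splitlines()
    start = None
    level = 0
    in_fence = False
    for i, line in enumerate(lines):
        if line.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        if start is None:
            if match.group(2).strip() == title:
                start, level = i, len(match.group(1))
        elif len(match.group(1)) <= level:
            # An equal or higher-level heading closes the section.
            return "\n".join(lines[start:i]).strip()
    if start is None:
        raise KeyError(title)
    return "\n".join(lines[start:]).strip()
```

For a document containing `## Results` followed by a `### Details` subsection and then `## Discussion`, calling `section_under(doc, "Results")` keeps the `### Details` content inside the Results chunk and stops at `## Discussion`.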

03

Hierarchical Chunking Support

Markup languages are inherently hierarchical. Splitters can leverage this to create parent-child chunk relationships automatically. For example:

  • Parent Chunk: A section defined by an <h2> heading.
  • Child Chunks: Individual paragraphs or lists within that section.

This enables multi-granularity retrieval. A broad query can retrieve the parent chunk for an overview, while a specific query can retrieve a precise child chunk. This structure underpins retrieval patterns such as parent-document (small-to-big) retrieval, where a matched child chunk is expanded with its parent's context. Sentence window retrieval is related but expands a matched sentence with its neighboring sentences rather than with a parent section.
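
A minimal sketch of the parent-child relationship, assuming Markdown input where each `## ` heading opens a parent section and blank lines separate child blocks. The `Chunk` class, the `sec0.1`-style identifiers, and both function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a parent (section) chunk


def build_hierarchy(markdown: str) -> list[Chunk]:
    """Create one parent chunk per '## ' section and one child per block in it."""
    sections: list[list[str]] = []
    for line in markdown.splitlines():
        if line.startswith("## "):
            sections.append([line])
        elif sections:  # content before the first '## ' is ignored here
            sections[-1].append(line)

    chunks: list[Chunk] = []
    for index, lines in enumerate(sections):
        parent_id = f"sec{index}"
        chunks.append(Chunk(parent_id, "\n".join(lines).strip()))
        body = "\n".join(lines[1:])
        blocks = [part.strip() for part in body.split("\n\n") if part.strip()]
        for child_index, block in enumerate(blocks):
            chunks.append(Chunk(f"{parent_id}.{child_index}", block, parent_id))
    return chunks


def expand_to_parent(hit: Chunk, chunks: list[Chunk]) -> Chunk:
    """Swap a retrieved child chunk for its parent to give the model more context."""
    if hit.parent_id is None:
        return hit
    return next(c for c in chunks if c.chunk_id == hit.parent_id)
```

In a retrieval pipeline, the child chunks would typically be the ones embedded and searched, with `expand_to_parent` applied to the top hits before they are passed to the language model.
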
04

Metadata and Attribute Extraction

During splitting, valuable metadata can be extracted from the markup and attached to each chunk, enriching it for filtering and ranking. This includes:

  • Heading level and text (for hierarchical context)
  • HTML element type (e.g., table, code)
  • CSS class names or IDs (e.g., class="important")
  • Markdown formatting (e.g., bold text indicating key terms)
  • Link destinations (href attributes)

This metadata allows for hybrid retrieval strategies, combining semantic vector search with precise metadata filters (e.g., WHERE element_type = 'code') to quickly narrow results.
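
The sketch below shows one way such metadata could be collected while splitting HTML, using only Python's standard `html.parser` module. The `MetadataSplitter` class and its metadata keys (`element_type`, `css_class`, `heading`, `heading_level`, `links`) are illustrative assumptions, not a library API. Nested blocks of the same tag (for example a list inside a list) are not handled.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCKS = {"p", "pre", "table", "ul", "ol"}
ITEM_ENDS = {"li", "tr", "td", "th"}


class MetadataSplitter(HTMLParser):
    """Emit one chunk per block element, tagged with metadata from the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[dict] = []
        self._heading = {"heading": "", "heading_level": 0}
        self._open = None  # tag of the block currently being collected
        self._text: list[str] = []
        self._meta: dict = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._open is None and tag in (HEADINGS | BLOCKS):
            self._open, self._text = tag, []
            self._meta = {
                "element_type": "code" if tag == "pre" else tag,
                "css_class": attr.get("class"),
                "links": [],
                **self._heading,  # most recent heading gives hierarchical context
            }
        elif self._open is not None and tag == "a" and attr.get("href"):
            self._meta["links"].append(attr["href"])

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is None:
            return
        if tag != self._open:
            if tag in ITEM_ENDS:
                self._text.append(" ")  # keep list items and cells separated
            return
        raw = "".join(self._text)
        # Preserve whitespace inside code; normalize it everywhere else.
        text = raw.strip("\n") if tag == "pre" else " ".join(raw.split())
        if tag in HEADINGS:
            self._heading = {"heading": text, "heading_level": int(tag[1])}
        elif text:
            self.chunks.append({"text": text, **self._meta})
        self._open = None
```

After `splitter.feed(html)`, a metadata filter is a plain comprehension such as `[c for c in splitter.chunks if c["element_type"] == "code"]`. In a vector database the same condition would usually be expressed as a filter on the stored metadata fields.
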
05

Mitigation of Context Fragmentation

A key challenge in chunking is context fragmentation, where related information is split across separate chunks. Markdown/HTML splitting reduces this by using large structural blocks as primary units. To handle cases where these blocks exceed desired token limits, it is often combined with recursive splitting. The strategy is applied hierarchically: first split by major headings, then by paragraphs, then by sentences if needed. This ensures the largest semantically intact unit is always preserved first, minimizing information loss at boundaries.
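
The hierarchical fallback described above can be sketched as a short recursive function. This is an illustration under stated assumptions: size is measured in characters rather than tokens, the separator patterns are simple regular expressions, and small neighboring pieces are not merged back together as production splitters usually do. `split_recursive` and `SEPARATORS` are hypothetical names.

```python
import re

# Ordered from coarsest to finest boundary.
SEPARATORS = [
    r"\n(?=#{1,6} )",    # before a Markdown heading
    r"\n\s*\n",          # blank line between paragraphs
    r"(?<=[.!?])\s+",    # whitespace after sentence-ending punctuation
]


def split_recursive(text: str, max_chars: int, level: int = 0) -> list[str]:
    """Keep the largest intact unit that fits; split finer only when needed."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars or level >= len(SEPARATORS):
        # Either the unit fits, or no finer separator remains
        # (an oversized sentence is returned as-is in this sketch).
        return [text]
    chunks: list[str] = []
    for part in re.split(SEPARATORS[level], text):
        chunks.extend(split_recursive(part, max_chars, level + 1))
    return chunks
```

A section that already fits under `max_chars` is never broken up, so paragraph and sentence splitting only apply to the sections that are too long.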

06

Implementation in Common Frameworks

This strategy is implemented in major RAG development frameworks:

  • LangChain: The MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter split based on headers and preserve header metadata.
  • LlamaIndex: The HTMLNodeParser and MarkdownNodeParser convert elements/headers into Node objects with inherited metadata.
  • Unstructured.io: The library's partitioning functions natively understand HTML/XML tags and Markdown syntax to output structured elements.

These tools handle the complexity of nested tags and inconsistent markup, allowing engineers to focus on configuring chunk granularity (e.g., split by h2 only) rather than writing custom parsers.


How Markdown/HTML Splitting Works

Markdown/HTML splitting uses the native structural elements of markup languages, such as headings (#, ##), lists (-, *), code blocks (```), and HTML tags (<p>, <div>, <h1>), as natural boundaries for creating semantically coherent chunks. This preserves the logical organization intended by the document's author, so that related content, like a header and its subsequent paragraphs, stays together. The process typically involves parsing the document into an Abstract Syntax Tree (AST) for Markdown or a Document Object Model (DOM) for HTML, identifying these elements, and splitting the text at their boundaries, which is generally more effective for retrieval than arbitrary character-based splitting.

This strategy is a form of layout-aware or semantic chunking that improves retrieval quality in Retrieval-Augmented Generation (RAG) systems by ensuring retrieved chunks are self-contained logical units. It helps work within a language model's limited context window by supplying discrete, meaningful units of information rather than whole documents. Compared to fixed-length chunking, it reduces the risk of severing critical contextual relationships at chunk boundaries, though it may be combined with techniques like chunk overlap for added robustness. LangChain's text splitters and LlamaIndex's node parsers implement this method to preprocess documents for chunk indexing in a vector database.

DOCUMENT SEGMENTATION

Comparison with Other Chunking Strategies

This table compares Markdown/HTML Splitting against other common document chunking strategies across key technical features relevant to Retrieval-Augmented Generation (RAG) pipelines.

| Feature / Metric | Markdown/HTML Splitting | Fixed-Length Chunking | Semantic Chunking | Recursive Character Splitting |
|---|---|---|---|---|
| Boundary Definition | Native structural elements (headings, lists, divs) | Arbitrary character/token count | Semantic coherence (topics, paragraphs) | Hierarchy of separators (e.g., \n\n, \n, .) |
| Preserves Document Structure | | | | |
| Chunk Size Consistency | Variable | Fixed | Variable | Variable within target range |
| Requires Pre-Trained Model | | | | |
| Handles Semi-Structured Data | | | | |
| Implementation Complexity | Medium (requires parser) | Low | High (requires embedding model) | Low-Medium |
| Optimal For | Web content, documentation, code repos | Plain text, transcripts, logs | Long-form articles, reports, books | General-purpose text with mixed formatting |
| Common Overlap Strategy | Contextual (parent/child nodes) | Fixed sliding window | Minimal (relies on semantic boundaries) | Configurable sliding window |

MARKDOWN/HTML SPLITTING

Framework and Tool Implementation

Markdown and HTML splitting leverage the inherent structural elements of markup languages—like headings, lists, and semantic tags—to create semantically coherent document chunks. This approach is foundational for retrieval-augmented generation (RAG) systems, as it preserves logical document flow and improves retrieval accuracy.

05

Implementation Best Practices

Effective markdown/HTML splitting requires more than just applying a library. Key engineering considerations include:

  • Combine with Recursive Splitting: Use header-based splitting first, then apply a recursive character splitter to long sections to respect the model's context window.
  • Manage Chunk Overlap: Implement chunk overlap (e.g., 10% of chunk size) when splitting within a structural element to prevent information loss at seams.
  • Validate Output: Post-splitting, analyze chunk length distributions and semantic coherence to tune parameters like chunk_size and chunk_overlap.
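
A minimal sketch of overlap and output validation for a single long section, assuming whitespace-separated words as a stand-in for tokens. The names `split_with_overlap`, `length_summary`, and `overlap_ratio` are hypothetical; `chunk_size` follows the parameter name used above.

```python
from statistics import mean


def split_with_overlap(text: str, chunk_size: int, overlap_ratio: float = 0.1) -> list[str]:
    """Slide a window of `chunk_size` words, repeating a fraction at each seam."""
    words = text.split()
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the section
    return chunks


def length_summary(chunks: list[str]) -> dict:
    """Summarize chunk lengths (in words) to help tune size and overlap."""
    lengths = [len(chunk.split()) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }
```

With `chunk_size=10` and the default 10% ratio, each chunk repeats the last word of the previous one. A summary showing many very short chunks usually suggests that small structural blocks should be merged before indexing.
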
06

Evaluation and Impact on Retrieval

The quality of splitting directly impacts RAG system performance. Key evaluation metrics and considerations:

  • Retrieval Recall: Structural splitting can improve recall by keeping related concepts (e.g., a list and its introductory paragraph) together in one chunk.
  • Answer Precision: Well-formed chunks reduce the chance of the LLM receiving incomplete context, which helps mitigate hallucinations.
  • Benchmarking: Use retrieval evaluation metrics like Hit Rate or Mean Reciprocal Rank (MRR) to A/B test different splitting strategies (e.g., fixed-length vs. markdown-based) on your corpus.
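
Hit Rate and MRR can be computed with a few lines of code. The sketch below assumes each test query has exactly one relevant chunk and that retrieval returns a ranked list of chunk identifiers; the function name and data layout are hypothetical. Because chunk identifiers differ between splitting strategies, relevance labels would need to be mapped per strategy, for example by checking which chunk contains a known answer passage.

```python
def hit_rate_and_mrr(
    results: dict[str, list[str]],
    relevant: dict[str, str],
    k: int = 5,
) -> tuple[float, float]:
    """Return (hit rate at k, mean reciprocal rank at k) over all queries.

    results:  query -> ranked list of retrieved chunk ids
    relevant: query -> id of the single chunk that answers the query
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for query, ranked in results.items():
        top_k = ranked[:k]
        target = relevant[query]
        if target in top_k:
            hits += 1
            reciprocal_rank_sum += 1.0 / (top_k.index(target) + 1)
    total = len(results)
    return hits / total, reciprocal_rank_sum / total
```

Running the same query set against an index built with fixed-length chunks and one built with markdown-based chunks gives two (Hit Rate, MRR) pairs that can be compared directly.
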
DOCUMENT CHUNKING

Frequently Asked Questions

Essential questions about splitting documents using their native markup structure for optimal retrieval-augmented generation.

Markdown/HTML splitting is a document segmentation strategy that uses the native structural elements of markup languages, such as headings (#, <h1>), lists (-, <li>), code blocks (```, <pre>), and divs (<div>), as natural boundaries for creating semantically coherent chunks. It works by parsing the document's Document Object Model (DOM) for HTML or its abstract syntax tree (AST) for Markdown to identify these elements, then splitting the text at these boundaries. This method preserves the logical flow and self-contained nature of sections, producing chunks that are more meaningful for retrieval than arbitrary character-based splits. For example, a chunk might be defined as everything between an <h2> tag and the next <h2> tag, ensuring a complete subsection is kept intact.
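
The <h2>-to-<h2> example can be sketched with Python's standard `html.parser` module. This is a minimal illustration rather than a full DOM-based splitter: `H2Splitter` and `split_on_h2` are hypothetical names, only visible text is kept, and content that appears before the first <h2> is dropped.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "li", "pre", "div", "tr", "h1", "h3", "h4", "h5", "h6"}


class H2Splitter(HTMLParser):
    """Collect one section per <h2>, running until the next <h2>."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[dict] = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.sections.append({"heading": "", "text": ""})
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
        elif self.sections and tag in BLOCK_TAGS:
            # Separate adjacent block elements so their text does not run together.
            self.sections[-1]["text"] += " "

    def handle_data(self, data):
        if not self.sections:
            return  # content before the first <h2> is ignored in this sketch
        key = "heading" if self._in_h2 else "text"
        self.sections[-1][key] += data


def split_on_h2(html: str) -> list[dict]:
    parser = H2Splitter()
    parser.feed(html)
    parser.close()
    # Normalize whitespace in both the heading and the body text.
    return [{k: " ".join(v.split()) for k, v in s.items()} for s in parser.sections]
```

Lower-level headings such as <h3> stay inside the enclosing section, which matches the idea of keeping a complete subsection intact.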

Prasad Kumkar

About the author

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.