Inkmark: a very fast, feature-packed, AI-first Markdown gem for Ruby
- Ruby
- Rust
- AI
Meet Inkmark, a very fast, featureful, AI-first Markdown Ruby gem. It’s a Rust-backed gem—built on pulldown-cmark—ready to replace your existing Markdown gem, providing lots of features, as well as first-class primitives for AI: chunking documents for retrieval, truncating to context windows, mapping output back to source bytes for citation.
It is currently the fastest CommonMark- and GFM-conformant Markdown engine for Ruby across the board. If you’re already running Redcarpet, kramdown, or Commonmarker, you can swap it in: bundle add inkmark—check the README first.
But before we get to the gem, a step back: the reason I built this is that Markdown means something different now than it did when most of the existing Markdown libraries were designed.
Markdown is the lingua franca of the AI internet
Since John Gruber and Aaron Swartz’s inception of the Markdown project in 2004, it has become incredibly popular: READMEs, posts, comment boxes, and developer-built static sites—beating all existing simple text-to-HTML formats and even becoming more popular than authoring content in WYSIWYG HTML boxes.
Then, large language models showed up and quietly made Markdown the default wire format of the AI internet:
- LLM prompts are Markdown: System prompts, user messages, assistant responses.
- LLM outputs are Markdown by default. Headings, lists, code fences. ChatGPT, Claude, Gemini—their answers render as Markdown.
- RAG corpora are stored as Markdown. Everything gets normalized to Markdown before chunking and embedding. It’s the universal intermediate format because it preserves enough structure (headings, code, lists) without dragging in HTML’s complexity.
- Agent meta files (CLAUDE.md, AGENTS.md, and others) are Markdown.
Markdown won the AI ecosystem the same way JSON (the “previous lingua franca”, in a way) won APIs: good enough to be useful, simple and open enough for everyone to tolerate it.
“User interviews”
When I did “user interviews” of a sort before designing the library (asking my Ruby developer friends and past co-workers about their Markdown parsing experience and requirements), it became quite obvious that they fall into two large groups.
The first group did not use Markdown with LLMs other than for rendering simple streaming “answers” from AI. There were no specific requirements other than to maybe provide support for extending Markdown syntax (I think we have more than enough Markdown syntax spin-offs that require feature flags in parsers; instead, Inkmark has the extension API). The second group, though… The second group was full of developers who worked with RAG pipelines. They required a good API. They needed chunking, and I’ve even seen several DIY chunking snippets based on other gem APIs. It is clear that the “market” for AI-first features, so to speak, is there.
New priorities for Markdown libraries
That changes what a Markdown library has to be good at:
First of all, pure speed for bulk processing. In the past, a Markdown library rendered one document per page request, or rendered a page or a comment once and stored the resulting HTML in the database. Now it renders ten thousand documents at index time, and again every time the corpus refreshes. Previously, one could argue that the speed of the Markdown processor did not matter much because processing only happened once, on creation or update. Now the processor runs over the whole corpus on every index build and refresh, so its throughput matters directly.
Second, it would be nice—and lately, almost a must-have—to see LLM/RAG-friendly primitives: chunking that respects document structure, truncation that doesn’t destroy code blocks or even words. We need plain-text export for embedding models (and for search indices, too). None of this is the kind of thing an old-school Markdown-to-HTML renderer was designed for.
Ruby AI ecosystem
Existing Ruby Markdown gems were built for rendering to HTML and supporting a lot of syntax gotchas. Some also have a nice API that allows traversing and changing the document. They are not optimized for the LLM reality, and there is still room to make a Markdown processor even faster.
That’s the gap Inkmark fills.
It slots into the rest of Ruby’s AI stack—RubyLLM for multi-provider LLM calls, langchainrb for LangChain primitives, neighbor for pgvector and sqlite-vec on ActiveRecord, informers for local transformer inference—so you can build a real RAG pipeline without leaving Ruby.
Inkmark is faster than competing Ruby Markdown gems, with a trivial enough API so it is easy to replace them in minutes. With Inkmark, it is possible to avoid writing complex transformation classes in Ruby and slicing documents as HTML. In any case, the event API should be enough to traverse and even change Markdown on the fly.
Inkmark is very fast
No matter how many features we need, and even if we have a RAG pipeline, 9 out of 10 users will still use a Markdown gem like this:
html = MarkdownEngine.parse(source)
So if we’re not fast enough, or, ideally, the fastest, what’s the point?
Inkmark is built on pulldown-cmark, a CommonMark and GFM parser written in Rust that uses SIMD. The result is the fastest CommonMark-conformant Markdown engine available for Ruby today.
Let’s see the numbers against other Ruby Markdown gems from the benchmark suite. I’ve included the benchmarks the gems themselves use, plus some Markdown files of a specific size (about 4 KB, about 1 KB, and so on). With Inkmark as the reference, how much faster or slower are the others?
| Benchmark | Redcarpet | Markly | Commonmarker | rdiscount | kramdown |
|---|---|---|---|---|---|
| CommonMark spec (201 KB) | 1.29× slower | 2.59× slower | 3.40× slower | 5.53× slower | 45× slower |
| Redcarpet README (14 KB) | 1.28× slower | 3.18× slower | 3.55× slower | 5.20× slower | 83× slower |
| Redcarpet’s own benchmark (8 KB) | 1.16× slower | 2.96× slower | 3.54× slower | 4.46× slower | 75× slower |
| dotenv README (4 KB) | 1.10× slower | 2.85× slower | 3.55× slower | 4.63× slower | 103× slower |
| Faraday README (1 KB) | 1.02× faster (ballpark) | 3.09× slower | 4.68× slower | 4.73× slower | 77× slower |
| README section (0.5 KB) | 1.05× slower (ballpark) | 6.35× slower | 8.32× slower | 5.24× slower | 98× slower |
Some notes. First, Redcarpet and Inkmark are neck and neck on inputs of about a kilobyte or less: parser setup overhead dominates at that size, the winner varies from run to run, and I would say we’re in the same ballpark. Above ~1 KB Inkmark pulls ahead, and the gap widens with document size. What is also important is that Inkmark beats Redcarpet at its own benchmark, taken from the Redcarpet repository.
Second, Redcarpet itself is not CommonMark conformant; it’s the speed leader of the previous generation, but you pay in subtle compatibility differences. Inkmark gives you top-tier speed and CommonMark + GFM conformance.
Third, the gap to kramdown and Commonmarker can be massive on real inputs. Commonmarker is a modern, nicely built Ruby gem, also powered by Rust. It is based on the Comrak Rust crate, which, in a way, sacrifices performance to be able to generate an AST from a Markdown file; the library Inkmark uses is specifically built for speed and does not parse the document into an AST.
If you’re running Redcarpet because it’s the fast option, it should be safe to switch now. Equal-or-better throughput with stricter spec compliance and many more features. For other gems, unless you use their specific features, like AST parsing with Commonmarker, you will be getting better speeds (3×–9×), more features, and comfort working with AI pipelines.
Inkmark has features galore
Let’s explore the feature set, from the basics to advanced magick.
The basics
This is the one-method, one-line Markdown parser/renderer that most people will use, possibly without ever touching the other features:
Inkmark.to_html("**hello**")
# or
Inkmark.new("# Hello").to_html
# or: mutable options accessor for tweaks at runtime
md = Inkmark.new(src)
md.options.tables = false
md.to_html
You can also set process-level defaults in something like a Rails initializer:
Inkmark.default_options.preset = :recommended
Inkmark.default_options.statistics = true
Presets
Now, unlike other libraries, to avoid juggling a dozen different flags and options (although you can still fine-tune everything), Inkmark ships several presets:
# Modern-web profile: smart punctuation, autolinks, lazy images, syntax
# highlighting, nofollow on external links, scheme allowlist on
# link destinations, emoji shortcodes, heading IDs, hard wraps,
# frontmatter
Inkmark.to_html(md, options: { preset: :recommended })
# `:recommended` plus raw HTML pass-through.
# Use ONLY for content you fully trust.
Inkmark.to_html(md, options: { preset: :trusted })
# See other presets in README
And then override—tighten link allowlists to your own hostnames, turn off features you don’t use, dial in what you want.
Inkmark.to_html(md, options: {
  preset: :recommended,
  links: { allowed_hosts: ["*.example.com"] }
})
Features at a glance
The single-method API hides a long list of built-in features—server-side syntax highlighting with CSS class output, GFM tables and tasklists and strikethrough and footnotes, smart punctuation, emoji shortcodes, autolinks, lazy-loading images, heading IDs with Unicode-transliterated slugs, frontmatter parsing, definition lists, math blocks, wikilinks, hard wraps, superscript and subscript, table of contents, statistics. Each has a single option flag in the README’s grid. Most flags do what their name says.
Safety and policy
Rendering Markdown from untrusted input is a sanitization problem dressed up as a formatting problem. Inkmark is safe-by-default; the policy controls let you tighten further.
Raw HTML
Raw HTML pass-through is off by default. Anything that looks like HTML in the source gets escaped:
Inkmark.to_html("<script>alert(1)</script>")
# => "<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>\n"
Enable pass-through only for trusted content, such as content authored by you or your team.
Inkmark.to_html("hi <em>there</em>", options: { raw_html: true })
# => "<p>hi <em>there</em></p>\n"
When raw_html: true is on alongside GFM, the tagfilter still strips a small set of dangerous tags, the GFM spec’s own minimal hardening. You can disable the tagfilter, but at that point, sanitization is up to you.
Allowlists for hosts and schemes
You can lock link destinations and image sources to a list of hostnames using glob syntax. This can be useful if, for instance, you only want to allow images from your CDN hosts—something users uploaded using your own image processing tools.
Inkmark.to_html(md, options: {
  links: { allowed_hosts: ["example.com", "*.example.com", "github.com"] },
  images: { allowed_hosts: ["cdn.example.com", "imgur.com"] }
})
Default is no filtering. Pass [] for deny-all-external, useful when rendering content that should never link out.
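To make the glob semantics concrete, here is a minimal sketch of how a host allowlist check could behave, written in plain Ruby. It is not Inkmark’s implementation: `host_allowed?` is a hypothetical helper, the glob match is delegated to `File.fnmatch?`, and it assumes that a `nil` list means "no filtering" and that relative URLs (no host) are never treated as external.

```ruby
require "uri"

# Hypothetical helper, not part of Inkmark: checks a URL's host against
# glob patterns such as "*.example.com".
def host_allowed?(url, patterns)
  return true if patterns.nil? # assumption: no allowlist means no filtering

  host = URI.parse(url).host
  return true if host.nil? # relative URL: nothing external to block

  patterns.any? { |pattern| File.fnmatch?(pattern, host.downcase) }
rescue URI::InvalidURIError
  false # unparseable URLs are rejected in this sketch
end

allowed = ["example.com", "*.example.com"]
host_allowed?("https://cdn.example.com/a.png", allowed) # => true
host_allowed?("https://evil.test/a.png", allowed)       # => false
host_allowed?("/docs/intro", [])                        # => true
```

Note that in this sketch a URL like `javascript:alert(1)` has no host at all, so a host check alone would let it through; that is the job of the scheme allowlist.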
You also can—and probably should—drop unsafe URL schemes: javascript:, data:, anything you don’t expect:
Inkmark.to_html(md, options: {
  links: { allowed_schemes: %w[http https mailto] },
  # block `data:` URIs in img src
  images: { allowed_schemes: %w[http https] }
})
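As a rough illustration of what a scheme allowlist decides, the sketch below performs the same kind of check in plain Ruby. `scheme_allowed?` is a hypothetical helper rather than Inkmark’s code, and treating scheme-less (relative) URLs as allowed is an assumption of this sketch.

```ruby
require "uri"

# Hypothetical helper, not part of Inkmark: true when the URL's scheme is
# on the allowlist. Relative URLs have no scheme and are assumed to be fine.
def scheme_allowed?(url, schemes)
  scheme = URI.parse(url).scheme
  scheme.nil? || schemes.include?(scheme.downcase)
rescue URI::InvalidURIError
  false # unparseable URLs are rejected in this sketch
end

link_schemes = %w[http https mailto]
scheme_allowed?("https://example.com", link_schemes)          # => true
scheme_allowed?("mailto:hi@example.com", link_schemes)        # => true
scheme_allowed?("javascript:alert(1)", link_schemes)          # => false
scheme_allowed?("data:text/html;base64,AAAA", %w[http https]) # => false
```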
Frontmatter
With frontmatter: true, Inkmark parses a leading YAML block and exposes it via Inkmark#frontmatter. Useful when Markdown files are the source of truth for structured metadata: blog posts, docs sites, or content collections.
md = Inkmark.new(post, options: { frontmatter: true })
md.to_html # => "<h1>Body</h1>\n"
md.frontmatter # => { "title" => "Hello", "tags" => ["ruby", "markdown"] }
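The `post` variable above is not shown, so here is a sketch of an input that would produce that result, together with a hand-rolled splitter that illustrates what "a leading YAML block" means. `split_frontmatter` is a hypothetical helper built on the standard `yaml` library; it is not how Inkmark parses frontmatter, and it assumes the block is delimited by `---` lines.

```ruby
require "yaml"

# Hypothetical helper, not Inkmark's parser: splits a leading YAML block
# (delimited by "---" lines) from the Markdown body.
def split_frontmatter(source)
  match = /\A---\s*\n(.*?)\n---\s*\n/m.match(source)
  return [{}, source] unless match

  [YAML.safe_load(match[1]) || {}, match.post_match]
end

post = <<~MD
  ---
  title: Hello
  tags:
    - ruby
    - markdown
  ---
  # Body
MD

meta, body = split_frontmatter(post)
meta # => { "title" => "Hello", "tags" => ["ruby", "markdown"] }
body # => "# Body\n"
```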
Statistics and extraction
Say you need document statistics: word and character count, or how many blocks of a specific type there are. Inkmark collects document metadata. Two independent options control what’s exposed: statistics: true counts elements and even does language detection. extract: { kind: true, ... } populates structured arrays of records:
Statistics and structured extraction
md = Inkmark.new(source, options: { statistics: true })
md.to_html
md.statistics
# => {
#   heading_count: 2,
#   character_count: 142,
#   word_count: 28,
#   code_block_count: 1,
#   image_count: 1,
#   link_count: 2,
#   footnote_definition_count: 1,
#   likely_language: "eng",
#   language_confidence: 0.93
# }
extract opts into structured arrays of records; you ask for the kinds you need. You can go over every heading to make a custom table of contents (more on that later), collect all links, collect all images to make a gallery—all without writing a line of code to parse the original Markdown text.
md = Inkmark.new(source, options: {
  extract: {
    headings: true,
    links: true,
    images: true,
    code_blocks: true,
    footnote_definitions: true
  }
})
md.to_html
md.extracts[:headings]
# => [
#   { level: 1, text: "Hello World", id: "hello-world", byte_range: 0...14 },
#   { level: 2, text: "Code Example", id: "code-example", byte_range: 68...83 }
# ]
md.extracts[:code_blocks]
# => [{ lang: "ruby", source: "puts \"hello\"\n", byte_range: 78...101 }]
Every record carries a byte_range pointing into the original source string; when you need the raw Markdown, source.byteslice(r.begin, r.size) recovers it. This is useful for RAG: you can map a block back to its exact source bytes (for grounding, highlighting, or verification) without substring searches that fail on paraphrase or near-duplicate sections.
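Why bytes rather than characters matters becomes visible as soon as the source contains multi-byte text. The sketch below uses a hand-written record in the shape shown above (its range was computed by hand for this string, not produced by Inkmark), and `raw_markdown` is a hypothetical helper.

```ruby
# Hypothetical helper: recovers the raw Markdown for a record that carries
# a :byte_range into the original source.
def raw_markdown(source, record)
  range = record[:byte_range]
  source.byteslice(range.begin, range.size)
end

source = "# Привет\n\nText\n"
# "# " is 2 bytes and each Cyrillic letter is 2 bytes in UTF-8: 2 + 6 * 2 = 14.
heading = { level: 1, text: "Привет", byte_range: 0...14 }

raw_markdown(source, heading)  # => "# Привет"
source[heading[:byte_range]]   # => "# Привет\n\nText" (character slice overshoots)
```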
Table of contents
Set toc: true to collect a table of contents alongside the render. The resulting toc object exposes #to_markdown, #to_html, and #to_s for usage in whatever format you need.
md = Inkmark.new(source, options: { toc: true })
md.to_html
md.toc.to_html # rendered <nav><ol>...</ol></nav>
md.toc.to_markdown # ordered list with anchor links
toc: { depth: 3 } limits which heading levels appear (h1–h3 in that example). Enabling toc implicitly enables headings: { ids: true } so anchors work, and triggers heading extraction, so md.extracts[:headings] is populated as well!
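If the built-in toc object does not fit your layout, the heading records are enough to build your own. A minimal sketch, assuming records shaped like the md.extracts[:headings] output shown earlier; `toc_markdown` and its `max_depth:` keyword are hypothetical and not part of Inkmark:

```ruby
# Hypothetical helper: renders heading records as a nested Markdown list
# with anchor links, indenting two spaces per heading level.
def toc_markdown(headings, max_depth: 3)
  headings
    .select { |h| h[:level] <= max_depth }
    .map do |h|
      indent = "  " * (h[:level] - 1)
      format("%s- [%s](#%s)", indent, h[:text], h[:id])
    end
    .join("\n")
end

headings = [
  { level: 1, text: "Hello World", id: "hello-world" },
  { level: 2, text: "Code Example", id: "code-example" }
]

puts toc_markdown(headings)
# - [Hello World](#hello-world)
#   - [Code Example](#code-example)
```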
Quality-of-life features
A few of the remaining flags hide features you’d otherwise reach for a second gem to get—or hand-roll over the post-rendered HTML.
For typography, smart_punctuation: true rewrites ASCII punctuation during the parse: -- → en dash, --- → em dash, ... → ellipsis, straight "..." → matched curly quotes. One flag, no second-pass typography pipeline, all done in native code.
images: { lazy: true } adds lazy-loading images support: loading="lazy" decoding="async" on every <img> Inkmark emits. Browsers defer off-screen images natively—no JS, no IntersectionObserver scaffold. For long-form Markdown this is free page-load latency reduction.
Autolinking is also available: links: { autolink: true } promotes bare URLs and email addresses to anchors with correct boundary detection—it won’t swallow the trailing period of a sentence. links: { nofollow: true } adds rel="nofollow noopener" to anchors pointing off-domain, the standard hygiene step for user-generated content.
headings: { ids: true } auto-generates heading anchors: id slugs from heading text, transliterating non-ASCII characters so non-English headings survive the trip: ## Привет, мир produces a slug like privet-mir rather than blank or mangled output. Duplicates within a document get a counter suffix. Pair with headings: { attributes: true } to opt into explicit # Heading {#custom-id .klass} syntax when you want to pin a slug yourself.
emoji_shortcodes: true expands gemoji-style :shortcode: to emoji characters inline. Codes inside code blocks stay verbatim; unknown codes pass through untouched.
Syntax highlighting with syntax_highlight: true runs fenced code blocks through syntect—TextMate-grammar accuracy across dozens of languages, CSS-class output (not inline styles, so you own the theme). No client-side JavaScript, no flash of unstyled code, no Pygments shell-out.
Inkmark is AI-first
With all that in mind, statistics and extraction are only half of Inkmark’s AI/RAG story; the other half is the chunking, truncation, and serialization primitives. None of these exist as first-class methods in any other Markdown gem.
Heading-based chunking
When working with RAG, we need to split a document into chunks before embedding. What’s a chunk, though? Either a structural unit (heading section) or a hard size budget (sliding window). And Inkmark gives you both.
Inkmark.chunks_by_heading walks the document and emits one entry per heading section, with the heading hierarchy as breadcrumbs:
sections = Inkmark.chunks_by_heading(readme)
sections.first
# => {
#   heading: "Installation",
#   level: 2,
#   id: "installation",
#   breadcrumb: ["Inkmark", "Getting started"],
#   content: "Run `bundle install`...\n"
# }
The breadcrumb is the chunk’s structural address: ancestor headings, root to immediate parent. When you embed this chunk for retrieval, prepend the breadcrumb to the content—"Inkmark > Getting started > Installation\n\n..."—and the embedding model gets the document’s structural context for free. Without breadcrumbs, you’re either embedding flat text (lossy: every “Installation” section in your corpus collides) or hand-rolling a heading walker on top of an AST.
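For a sense of what "hand-rolling a heading walker" involves, here is a minimal sketch that derives breadcrumbs from a flat, document-ordered list of headings using a stack of open ancestors. `with_breadcrumbs` and the sample data are hypothetical; this is not Inkmark’s implementation.

```ruby
# Hypothetical helper: adds a :breadcrumb (ancestor headings, root to
# immediate parent) to each heading in a flat, document-ordered list.
def with_breadcrumbs(headings)
  stack = [] # currently open ancestor headings, outermost first

  headings.map do |h|
    # Close every open heading at the same or a deeper level.
    stack.pop while stack.any? && stack.last[:level] >= h[:level]
    entry = h.merge(breadcrumb: stack.map { |a| a[:heading] })
    stack.push(h)
    entry
  end
end

headings = [
  { heading: "Inkmark", level: 1 },
  { heading: "Getting started", level: 2 },
  { heading: "Installation", level: 3 },
  { heading: "Usage", level: 2 }
]

with_breadcrumbs(headings).map { |h| h[:breadcrumb] }
# => [[], ["Inkmark"], ["Inkmark", "Getting started"], ["Inkmark"]]
```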
Note that when statistics: true is enabled, every section also carries :character_count and :word_count, which is useful for size budgeting at retrieval time without re-counting.
The full RAG pattern is short:
Inkmark.chunks_by_heading(readme).each do |s|
  next if s[:heading].nil? # skip the preamble
  context = (s[:breadcrumb] + [s[:heading]]).join(" > ")
  embed_and_store("#{context}\n\n#{s[:content]}", metadata: { id: s[:id] })
end
chunks_by_heading always returns the full array, including a preamble entry (heading: nil, level: 0) for content before the first heading. Because the result is an array, you can narrow it down with ordinary Enumerable methods:
sections = Inkmark.chunks_by_heading(readme)
sections.find { |s| s[:heading] == "Installation" }
sections.select { |s| s[:heading]&.match?(/install|usage/i) }
sections.reject { |s| s[:heading].nil? } # skip preamble
Sliding-window chunking
For documents without clean heading structure, like transcripts or OCR output, heading-based chunking has nothing to grip on. Inkmark.chunks_by_size is the answer: fixed-size windows with overlap, walking the filter-applied Markdown sequentially.
# Char budget with overlap
Inkmark.chunks_by_size(doc, chars: 500, overlap: 50)
# Word budget, cuts at word boundaries
Inkmark.chunks_by_size(doc, words: 120, overlap: 15, at: :word)
# Dual budget: cut at whichever is reached first
Inkmark.chunks_by_size(doc, chars: 1000, words: 200)
Each window comes back as a hash with an :index and :content, plus :character_count / :word_count when statistics: true is set.
Two boundary modes. at: :block (the default) cuts only between top-level Markdown blocks—your output stays valid Markdown, and a single block that exceeds the budget gets emitted as its own oversized window rather than silently sliced. at: :word serializes the full filtered Markdown and cuts at the last Unicode word boundary that fits—tighter packing, but it can split open constructs.
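To see how a size budget and an overlap interact, here is a deliberately naive sliding window over characters. It is a sketch of the general idea only: `sliding_windows` is a hypothetical helper that ignores Markdown block and word boundaries, which is exactly the part Inkmark handles for you.

```ruby
# Hypothetical helper: fixed-size character windows where each window
# repeats the last `overlap` characters of the previous one.
def sliding_windows(text, chars:, overlap: 0)
  raise ArgumentError, "overlap must be smaller than chars" unless overlap < chars

  step = chars - overlap
  windows = []
  start = 0
  while start < text.length
    windows << { index: windows.size, content: text[start, chars] }
    break if start + chars >= text.length # this window reached the end

    start += step
  end
  windows
end

sliding_windows("abcdefghij", chars: 4, overlap: 1).map { |w| w[:content] }
# => ["abcd", "defg", "ghij"]
```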
You can also compose the two for a “heading-based, but size-capped” hybrid: split by heading, fall through to sliding-window for any oversized section.
Inkmark.chunks_by_heading(doc).flat_map do |c|
  if c[:content].size > 2000
    Inkmark.chunks_by_size(c[:content], chars: 500, overlap: 50)
  else
    [c]
  end
end
Inkmark doesn’t ship a tokenizer; budgets are in characters or Unicode words. However, you can already get the likely document language, character and word count from statistics to do a ballpark estimation.
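A ballpark estimation could look like the sketch below. The ratio of roughly four characters per token is a common rule of thumb for English text, not something Inkmark provides, and both helpers are hypothetical; use your model’s own tokenizer when you need exact numbers.

```ruby
# Hypothetical helpers: convert between character counts and a rough
# token estimate using an assumed characters-per-token ratio.
def estimate_tokens(character_count, chars_per_token: 4.0)
  (character_count / chars_per_token).ceil
end

def chars_for_tokens(token_budget, chars_per_token: 4.0)
  (token_budget * chars_per_token).floor
end

estimate_tokens(142)  # => 36
chars_for_tokens(512) # => 2048
```

The second helper is handy for picking a `chars:` budget from a token limit; other languages and code-heavy documents will need a different ratio.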
Block- and word-aware truncation
There is also truncation: taking one document and capping it at a budget without breaking it. Useful when you’re stuffing a doc into an LLM context window, when a chunk turns out larger than expected, when you’re previewing.
Hand-rolled truncation tends to cut in the wrong place: mid-word, mid-sentence, inside a code block, or in the middle of a table. Inkmark.truncate_markdown does it correctly:
# Cuts at the last complete block that fits—output stays valid Markdown
Inkmark.truncate_markdown(doc, chars: 4000, at: :block)
# Word-boundary cut—tighter, but may split open constructs
Inkmark.truncate_markdown(doc, chars: 4000, at: :word)
# Dual budget, custom marker
Inkmark.truncate_markdown(doc, chars: 4000, words: 500, at: :word, marker: "[…]")
The marker (default "…") counts toward the budget, so chars: 4000 always returns ≤ 4000 codepoints. Pass marker: nil to suppress the marker entirely.
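The "marker counts toward the budget" contract is easy to get wrong by hand, so here is a small plain-Ruby sketch of it for the word-boundary case. `truncate_at_word` is a hypothetical helper that works on plain text only and assumes a positive `chars:` budget; it knows nothing about Markdown constructs and is not Inkmark’s algorithm.

```ruby
# Hypothetical helper: cuts plain text at the last whitespace that fits,
# reserving room for the marker so the result never exceeds `chars`.
def truncate_at_word(text, chars:, marker: "…")
  return text if text.length <= chars

  marker = marker.to_s
  room = chars - marker.length # the marker counts toward the budget
  return marker[0, chars] if room <= 0

  head = text[0, room]
  unless text[room].match?(/\s/) # the cut landed in the middle of a word
    last_space = head.rindex(/\s/)
    head = head[0, last_space] if last_space
  end
  head.rstrip + marker
end

truncate_at_word("hello wonderful world", chars: 12)              # => "hello…"
truncate_at_word("hello world foo", chars: 12)                    # => "hello world…"
truncate_at_word("hello wonderful world", chars: 12, marker: nil) # => "hello"
```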
chunks_by_heading accepts a truncate: keyword that applies the same contract per section—useful when you want every section to stand alone as a self-contained, budget-capped unit:
Inkmark.chunks_by_heading(doc, truncate: { chars: 500, at: :block })
# => each entry's :content is ≤ 500 chars; metadata stays intact;
# statistics counts are recomputed against the truncated content
Plain-text extraction
Strip all Markdown syntax and return inline content as plain text. Designed for embedding models, token counting, LLM prompts, anything downstream that treats Markdown formatting as noise:
Inkmark.to_plain_text("**bold** and [a link](https://example.com)")
# => "bold and a link (https://example.com)\n"
The same filter pipeline runs before serialization, so to_plain_text, to_markdown, and to_html see the same emoji expansion, autolink rewrites, and allowlist policy applied consistently:
md = Inkmark.new(source, options: { emoji_shortcodes: true, links: { autolink: true } })
md.to_plain_text # plain text with :rocket: → 🚀, bare URLs as text
md.to_html # HTML with the same transforms applied
The output grammar is documented in the README and predictable: bold/italic/strike unwrap to inner text, links serialize as text (url) (collapsed when text equals url), images as alt (src), code fences as their raw bodies, blockquotes as email-style > prefixes, lists as - / 1. bullets with two-space-per-level indent.
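The link and image rules are simple enough to restate as code. The helpers below are hypothetical and only mirror the grammar as described in the paragraph above; they are not Inkmark’s serializer, and the README remains the reference for the exact output.

```ruby
# Hypothetical helpers mirroring the described plain-text grammar.
def plain_link(text, url)
  # Collapse to a single URL when the link text is the URL itself.
  text == url ? url : "#{text} (#{url})"
end

def plain_image(alt, src)
  "#{alt} (#{src})"
end

plain_link("a link", "https://example.com")              # => "a link (https://example.com)"
plain_link("https://example.com", "https://example.com") # => "https://example.com"
plain_image("Logo", "logo.png")                          # => "Logo (logo.png)"
```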
Markdown-to-Markdown pipeline
#to_markdown runs the same filter pipeline as #to_html and serializes the result back to Markdown. Use it as a preprocessing step in pipelines that consume Markdown rather than HTML—LLM prompts, secondary renderers, content storage, anything that wants clean, normalized Markdown:
# Class-method form for one-shots
Inkmark.to_markdown("**bold** :rocket:", options: { emoji_shortcodes: true })
# => "**bold** 🚀"
# Instance form: same options object drives both outputs
md = Inkmark.new(source, options: {
  emoji_shortcodes: true,
  links: { allowed_hosts: ["trusted.com", "*.trusted.com"] }
})
md.to_markdown # filtered Markdown for the next stage
md.to_html # rendered HTML for display
Inkmark has an extension API
The RAG primitives above—and most of Inkmark’s own filter pipeline—are built on the same underlying extension API your own code uses. The model is event handlers. You register a block for any element kind, and Inkmark fires it during the parse, post-order—children before parents.
Rewriting output
Set dest=, level=, id=, html=, or markdown= on the event to rewrite what ends up in the output. Setting html= skips the default rendering for that element, including any post-render filters; setting markdown= re-renders a Markdown string with the same options as the main document.
This makes it easy to rewrite images to be served from a CDN, shift heading levels, emit custom markup for code blocks, and filter elements out:
md = Inkmark.new(source)
# Image CDN rewriting
md.on(:image) do |img|
  img.dest = "https://cdn.example.com/#{File.basename(img.dest)}"
end
# Heading shifts (for fitting docs into a layout that owns <h1>)
md.on(:heading) { |h| h.level = [h.level + 1, 6].min }
# Custom code block rendering by language tag.
# Setting html= skips the default rendering, so if c.text is not already
# escaped, escape it yourself (e.g. CGI.escapeHTML) for untrusted sources.
md.on(:code_block) do |c|
  case c.lang
  when "mermaid" then c.html = %(<div class="mermaid">#{c.text}</div>\n)
  when "math" then c.html = %(<div class="math">\\[#{c.text}\\]</div>\n)
  end
end
# Suppress elements entirely
# all images
md.on(:image) { |img| img.delete }
# by content
md.on(:heading) { |h| h.delete if h.text.start_with?("INTERNAL:") }
# Replace a paragraph with a re-parsed Markdown snippet
md.on(:paragraph) do |p|
  next unless p.text.start_with?("@note ")
  p.markdown = "> **Note:** #{p.text.sub(/\A@note /, "")}"
end
html = md.to_html
Six things happening in one parse, no AST walking, no regex.
Walking for analysis
#walk fires the handlers without rendering. Use it for pure analysis: validation, collection, instrumentation.
# Verify every image has alt text
md = Inkmark.new(source)
missing_alt = []
md.on(:image) { |img| missing_alt << img.dest if img.text.empty? }
md.walk
raise "Missing alt: #{missing_alt.join(', ')}" if missing_alt.any?
# Collect every fenced code block language used in the document
languages = Set.new
Inkmark.new(source).on(:code_block) { |c| languages << c.lang if c.lang }.walk
Note: for built-in heading/link/image/word-count collection, statistics: true will likely solve most of your collection tasks; the handler API is for the cases where you need something custom.
Tree context
Container elements expose their child events lazily:
md.on(:table) do |t|
  rows = t.children_of(:table_row)
  rows.each_with_index do |row, i|
    cells = row.children_of(:table_cell).map(&:text)
    puts "Row #{i}: #{cells.join(' | ')}"
  end
end
Use parent_kind, ancestor_kinds, and depth for context-sensitive decisions:
# Skip decorative images that are already inside a link
md.on(:image) { |img| img.delete if img.ancestor_kinds.include?(:link) }
# Process only top-level paragraphs
md.on(:paragraph) do |p|
  next unless p.parent_kind.nil?
  # ... only top-level paragraphs reach here
end
depth is 0 at the top level and increments once per nesting level—a paragraph inside a blockquote has depth: 1.
Source byte ranges
Every event carries a byte_range pointing into the original source, the same one extraction records use. Useful when handlers need to correlate elements back to source positions—for syntax-highlighting injection, source maps, diffing:
source = File.read("post.md")
md = Inkmark.new(source)
md.on(:heading) do |h|
  raw = source.byteslice(h.byte_range)
  puts "#{h.byte_range}: #{raw.inspect}"
end
md.walk
Time to try it
bundle add inkmark
Or in your Gemfile:
gem "inkmark"
No Rust toolchain is needed for installation. Inkmark supports Ruby 3.3+.
Definitely read the README on GitHub: it has the full options grid, the full event-handler surface, and the benchmark methodology in case you want to reproduce the numbers above.
Don’t hesitate to ping me on X with feedback.