A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.
Updated Jun 13, 2026 - Python
Adaptive Chunking: automatically select the best chunking method per document for RAG. Accepted at LREC 2026.
🍱 Semantically create chunks from large documents for passing to LLM workflows
A Python CLI to test, benchmark, and find the best RAG chunking strategy for your Markdown documents.
Kerning-aware text splitting
One library to split them all: Sentence, Code, Docs. Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
A sentence splitting (sentence boundary disambiguation) library for Go. It is rule-based and works out-of-the-box.
JChunk is a lightweight and flexible library that provides multiple strategies for text chunking within Java applications.
An exploration of text splitting and chunking in JavaScript
Split and analyze text files
A powerful Markdown chunking tool that understands document structure. Unlike naive token splitters, it protects atomic elements (code, math, tables), merges by semantic affinity, and scores chunk quality, ready for RAG and fine-tuning workflows.
A practical guide to 6 document chunking strategies for RAG and LLM applications — Document, Fixed-Size, Recursive, Sentence, Semantic, and Agentic chunking with working code and plain-English explanations.
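As a rough illustration of the simplest strategy named above, fixed-size chunking, here is a minimal sketch. It is not taken from any of the listed repositories; the function name `fixed_size_chunks` and the `size` and `overlap` parameters are hypothetical, and the sketch counts characters rather than tokens.

```python
def fixed_size_chunks(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of at most `size` characters.

    Consecutive windows share `overlap` characters so that content cut
    at a boundary still appears whole in one of the two neighbours.
    (Hypothetical helper for illustration only.)
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk made only of overlap is produced.
        if start + size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in fixed_size_chunks("abcdefghij", size=4, overlap=1):
        print(chunk)
```

Sentence, recursive, and semantic strategies generally replace the fixed character window with boundaries derived from the text itself, which is why they tend to keep related content together at the cost of uneven chunk sizes.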
A web app that allows users to upload PDFs and interact with them through a Q&A interface. The application extracts text from PDFs, generates embeddings, stores them in a FAISS database, and retrieves relevant information to provide context-aware answers using a large language model.
Markdown chunker for RAG. Structure-aware splitting preserves full semantic context; tables split at row boundaries.
A smart C# text splitting library that intelligently chunks text while preserving semantic boundaries. Uses a hierarchical approach with configurable overlap and detailed metadata.
An intelligent chatbot that allows users to upload text-based Ayurveda PDFs and ask questions about their content using RAG (Retrieval-Augmented Generation), which combines semantic search with LLM-based responses.
Benchmark chunking strategies for your RAG corpus. Get a recommended config. CLI, Python library, and MCP server.
General token-aware chunker for text + code (AST-aware, never splits a function) + markdown — the reusable chunking primitive for RAG. Stdlib only.
Specialized markdown text splitter - part of LEDAA project's data ingestion pipeline for RAG.
LangChain is a framework that makes it easy to build applications on top of available large language models.