Adds splitters for different programming and markup languages (#1469)
* Adds a HTML text splitter

* Formatting

* Adds docs page

* Adds CodeTextSplitter with support for popular languages

* Improve text splitting by preserving separators

* Fix formatting

* Merge and fix tests

* Update docs

* Factor individual language splitters into fromLanguage method, add docs
jacoblee93 authored May 31, 2023
1 parent 64cd89e commit 4d55d1e
Showing 11 changed files with 937 additions and 177 deletions.
38 changes: 38 additions & 0 deletions docs/docs/modules/indexes/text_splitters/examples/code.mdx
@@ -0,0 +1,38 @@
---
hide_table_of_contents: true
---

# Code and Markup Text Splitters

LangChain provides text splitters for a variety of markup and programming languages that split your text along language-specific syntax.
This produces more semantically self-contained chunks, which are more useful to a vector store or other retriever.
Popular languages like JavaScript, Python, and Rust are supported, as well as LaTeX, HTML, and Markdown.

## Usage

Initialize a standard `RecursiveCharacterTextSplitter` with the `fromLanguage` factory method. Below are some examples for various languages.
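
To picture what splitting on language-specific syntax means, here is a minimal standalone sketch that does not use LangChain. It splits just before language keywords so that each chunk starts at a definition boundary and the keyword stays attached to its chunk. The separator list and the helper names (`pythonLikeSeparators`, `escapeRegExp`, `splitBeforeSeparators`) are assumptions made for illustration, not the library's actual internals.

```typescript
// Hypothetical separator list for a Python-like language (illustrative only).
const pythonLikeSeparators: string[] = ["\nclass ", "\ndef "];

// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Split *before* each separator using a zero-width lookahead, so the
// separator itself is preserved at the start of the chunk it introduces.
function splitBeforeSeparators(text: string, separators: string[]): string[] {
  if (separators.length === 0) {
    return [text];
  }
  const alternation = separators.map(escapeRegExp).join("|");
  const pattern = new RegExp(`(?=${alternation})`);
  return text.split(pattern).filter((piece) => piece.length > 0);
}

const sampleCode =
  'import os\n\nclass Greeter:\n    pass\n\ndef hello():\n    print("hi")';

const pieces = splitBeforeSeparators(sampleCode, pythonLikeSeparators);
console.log(pieces);
```

Because nothing is discarded, joining the pieces back together reproduces the input exactly, which is the property that separator-preserving splitting is meant to give.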

## JavaScript

import CodeBlock from "@theme/CodeBlock";
import JSExample from "@examples/indexes/javascript_text_splitter.ts";

<CodeBlock language="typescript">{JSExample}</CodeBlock>

## Python

import PythonExample from "@examples/indexes/python_text_splitter.ts";

<CodeBlock language="typescript">{PythonExample}</CodeBlock>

## HTML

import HTMLExample from "@examples/indexes/html_text_splitter.ts";

<CodeBlock language="typescript">{HTMLExample}</CodeBlock>

## LaTeX

import LatexExample from "@examples/indexes/latex_text_splitter.ts";

<CodeBlock language="typescript">{LatexExample}</CodeBlock>
32 changes: 0 additions & 32 deletions docs/docs/modules/indexes/text_splitters/examples/latex.mdx

This file was deleted.

61 changes: 0 additions & 61 deletions docs/docs/modules/indexes/text_splitters/examples/markdown.mdx

This file was deleted.

@@ -2,7 +2,7 @@
hide_table_of_contents: true
---

-# `RecursiveCharacterTextSplitter`
+# RecursiveCharacterTextSplitter

The recommended TextSplitter is the `RecursiveCharacterTextSplitter`. This will split documents recursively by different characters - starting with `"\n\n"`, then `"\n"`, then `" "`. This is nice because it will try to keep all the semantically relevant content in the same place for as long as possible.
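
The recursive strategy described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's implementation: `splitRecursively` is a hypothetical helper, it ignores chunk overlap, and it drops a separator when a split happens at that separator.

```typescript
// Simplified sketch of recursive character splitting (hypothetical helper).
// Tries the coarsest separator first and only falls back to finer ones
// for pieces that are still larger than chunkSize.
function splitRecursively(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", "\n", " "]
): string[] {
  // Base case: the text already fits in one chunk.
  if (text.length <= chunkSize) {
    return text.length > 0 ? [text] : [];
  }
  // No separators left: fall back to fixed-width cuts.
  if (separators.length === 0) {
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      cuts.push(text.slice(i, i + chunkSize));
    }
    return cuts;
  }
  const [separator, ...rest] = separators;
  const result: string[] = [];
  let current = "";
  for (const part of text.split(separator)) {
    // Greedily merge neighbouring parts while they still fit.
    const candidate = current === "" ? part : current + separator + part;
    if (candidate.length <= chunkSize) {
      current = candidate;
      continue;
    }
    if (current !== "") {
      result.push(current);
    }
    if (part.length <= chunkSize) {
      current = part;
    } else {
      // This part is too large on its own: recurse with finer separators.
      result.push(...splitRecursively(part, chunkSize, rest));
      current = "";
    }
  }
  if (current !== "") {
    result.push(current);
  }
  return result;
}

const chunks = splitRecursively("aaa bbb\n\nccc ddd eee", 7);
console.log(chunks); // → [ 'aaa bbb', 'ccc ddd', 'eee' ]
```

The first paragraph fits in one chunk and is kept whole; only the second paragraph, which is too long, is broken further at spaces.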

74 changes: 74 additions & 0 deletions examples/src/indexes/html_text_splitter.ts
@@ -0,0 +1,74 @@
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const text = `<!DOCTYPE html>
<html>
<head>
<title>🦜️🔗 LangChain</title>
<style>
body {
font-family: Arial, sans-serif;
}
h1 {
color: darkblue;
}
</style>
</head>
<body>
<div>
<h1>🦜️🔗 LangChain</h1>
<p>⚡ Building applications with LLMs through composability ⚡</p>
</div>
<div>
As an open source project in a rapidly developing field, we are extremely open to contributions.
</div>
</body>
</html>`;

const splitter = RecursiveCharacterTextSplitter.fromLanguage("html", {
  chunkSize: 175,
  chunkOverlap: 20,
});
const output = await splitter.createDocuments([text]);

console.log(output);

/*
[
Document {
pageContent: '<!DOCTYPE html>\n<html>',
metadata: { loc: [Object] }
},
Document {
pageContent: '<head>\n <title>🦜️🔗 LangChain</title>',
metadata: { loc: [Object] }
},
Document {
pageContent: '<style>\n' +
' body {\n' +
' font-family: Arial, sans-serif;\n' +
' }\n' +
' h1 {\n' +
' color: darkblue;\n' +
' }\n' +
' </style>\n' +
' </head>',
metadata: { loc: [Object] }
},
Document {
pageContent: '<body>\n' +
' <div>\n' +
' <h1>🦜️🔗 LangChain</h1>\n' +
' <p>⚡ Building applications with LLMs through composability ⚡</p>\n' +
' </div>',
metadata: { loc: [Object] }
},
Document {
pageContent: '<div>\n' +
' As an open source project in a rapidly developing field, we are extremely open to contributions.\n' +
' </div>\n' +
' </body>\n' +
'</html>',
metadata: { loc: [Object] }
}
]
*/
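
The `[Object]` placeholders in the logged output come from Node's `console.log`, which stops expanding nested objects past its default inspection depth. A minimal sketch of one way to print the full structure, using hand-written stand-in data rather than real splitter output; the `{ lines: { from, to } }` shape of `loc` is an assumption, since the output above does not show it.

```typescript
// Stand-in for splitter output; the shape of `loc` is assumed for illustration.
const sampleDocs = [
  {
    pageContent: "<!DOCTYPE html>\n<html>",
    metadata: { loc: { lines: { from: 1, to: 2 } } },
  },
];

// JSON.stringify serializes every nesting level, so nothing is elided.
const fullDump = JSON.stringify(sampleDocs, null, 2);
console.log(fullDump);
```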
54 changes: 54 additions & 0 deletions examples/src/indexes/javascript_text_splitter.ts
@@ -0,0 +1,54 @@
import {
  SupportedTextSplitterLanguages,
  RecursiveCharacterTextSplitter,
} from "langchain/text_splitter";

console.log(SupportedTextSplitterLanguages); // Array of supported languages

/*
[
'cpp', 'go',
'java', 'js',
'php', 'proto',
'python', 'rst',
'ruby', 'rust',
'scala', 'swift',
'markdown', 'latex',
'html'
]
*/

const jsCode = `function helloWorld() {
console.log("Hello, World!");
}
// Call the function
helloWorld();`;

const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 32,
  chunkOverlap: 0,
});
const jsOutput = await splitter.createDocuments([jsCode]);

console.log(jsOutput);

/*
[
Document {
pageContent: 'function helloWorld() {',
metadata: { loc: [Object] }
},
Document {
pageContent: 'console.log("Hello, World!");',
metadata: { loc: [Object] }
},
Document {
pageContent: '}\n// Call the function',
metadata: { loc: [Object] }
},
Document {
pageContent: 'helloWorld();',
metadata: { loc: [Object] }
}
]
*/
57 changes: 29 additions & 28 deletions examples/src/indexes/latex_text_splitter.ts
@@ -1,4 +1,4 @@
-import { LatexTextSplitter } from "langchain/text_splitter";
+import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const text = `\\begin{document}
\\title{🦜️🔗 LangChain}
@@ -15,7 +15,7 @@ As an open source project in a rapidly developing field, we are extremely open t
\\end{document}`;

-const splitter = new LatexTextSplitter({
+const splitter = RecursiveCharacterTextSplitter.fromLanguage("latex", {
  chunkSize: 100,
  chunkOverlap: 0,
});
@@ -24,30 +24,31 @@ const output = await splitter.createDocuments([text]);
console.log(output);

/*
-[
-Document {
-pageContent: '\\begin{document}\n' +
-'\\title{🦜️🔗 LangChain}\n' +
-'⚡ Building applications with LLMs through composability ⚡',
-metadata: { loc: [Object] }
-},
-Document {
-pageContent: 'Quick Install}',
-metadata: { loc: [Object] }
-},
-Document {
-pageContent: "Hopefully this code block isn't split\n" +
-'yarn add langchain\n' +
-'\\end{verbatim}\n' +
-'\n' +
-'As an open source project in a rapidly',
-metadata: { loc: [Object] }
-},
-Document {
-pageContent: 'developing field, we are extremely open to contributions.\n' +
-'\n' +
-'\\end{document}',
-metadata: { loc: [Object] }
-}
-]
+[
+Document {
+pageContent: '\\begin{document}\n' +
+'\\title{🦜️🔗 LangChain}\n' +
+'⚡ Building applications with LLMs through composability ⚡',
+metadata: { loc: [Object] }
+},
+Document {
+pageContent: '\\section{Quick Install}',
+metadata: { loc: [Object] }
+},
+Document {
+pageContent: '\\begin{verbatim}\n' +
+"Hopefully this code block isn't split\n" +
+'yarn add langchain\n' +
+'\\end{verbatim}',
+metadata: { loc: [Object] }
+},
+Document {
+pageContent: 'As an open source project in a rapidly developing field, we are extremely open to contributions.',
+metadata: { loc: [Object] }
+},
+Document {
+pageContent: '\\end{document}',
+metadata: { loc: [Object] }
+}
+]
*/
36 changes: 36 additions & 0 deletions examples/src/indexes/python_text_splitter.ts
@@ -0,0 +1,36 @@
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const pythonCode = `def hello_world():
print("Hello, World!")
# Call the function
hello_world()`;

const splitter = RecursiveCharacterTextSplitter.fromLanguage("python", {
  chunkSize: 32,
  chunkOverlap: 0,
});

const pythonOutput = await splitter.createDocuments([pythonCode]);

console.log(pythonOutput);

/*
[
Document {
pageContent: 'def hello_world():',
metadata: { loc: [Object] }
},
Document {
pageContent: 'print("Hello, World!")',
metadata: { loc: [Object] }
},
Document {
pageContent: '# Call the function',
metadata: { loc: [Object] }
},
Document {
pageContent: 'hello_world()',
metadata: { loc: [Object] }
}
]
*/