Adds splitters for different programming and markup languages (#1469)
* Adds a HTML text splitter
* Formatting
* Adds docs page
* Adds CodeTextSplitter with support for popular languages
* Improve text splitting by preserving separators
* Fix formatting
* Merge and fix tests
* Update docs
* Factor individual language splitters into fromLanguage method, add docs
1 parent 64cd89e, commit 4d55d1e
Showing 11 changed files with 937 additions and 177 deletions.
38 changes: 38 additions & 0 deletions
docs/docs/modules/indexes/text_splitters/examples/code.mdx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
--- | ||
hide_table_of_contents: true | ||
--- | ||
|
||
# Code and Markup Text Splitters | ||
|
||
LangChain provides text splitters for a range of markup and programming languages that split text along language-specific syntax. | ||
This produces more semantically self-contained chunks, which are more useful to a vector store or other retriever. | ||
Popular languages like JavaScript, Python, and Rust are supported, as well as LaTeX, HTML, and Markdown. | ||
|
||
## Usage | ||
|
||
Create a standard `RecursiveCharacterTextSplitter` with the `fromLanguage` factory method, passing the language name and the usual splitter options such as `chunkSize` and `chunkOverlap`. Below are some examples for various languages. | ||
|
||
## JavaScript | ||
|
||
import CodeBlock from "@theme/CodeBlock"; | ||
import JSExample from "@examples/indexes/javascript_text_splitter.ts"; | ||
|
||
<CodeBlock language="typescript">{JSExample}</CodeBlock> | ||
|
||
## Python | ||
|
||
import PythonExample from "@examples/indexes/python_text_splitter.ts"; | ||
|
||
<CodeBlock language="typescript">{PythonExample}</CodeBlock> | ||
|
||
## HTML | ||
|
||
import HTMLExample from "@examples/indexes/html_text_splitter.ts"; | ||
|
||
<CodeBlock language="typescript">{HTMLExample}</CodeBlock> | ||
|
||
## LaTeX | ||
|
||
import LatexExample from "@examples/indexes/latex_text_splitter.ts"; | ||
|
||
<CodeBlock language="typescript">{LatexExample}</CodeBlock> |
32 changes: 0 additions & 32 deletions
docs/docs/modules/indexes/text_splitters/examples/latex.mdx
This file was deleted.
61 changes: 0 additions & 61 deletions
docs/docs/modules/indexes/text_splitters/examples/markdown.mdx
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter"; | ||
|
||
const text = `<!DOCTYPE html> | ||
<html> | ||
<head> | ||
<title>🦜️🔗 LangChain</title> | ||
<style> | ||
body { | ||
font-family: Arial, sans-serif; | ||
} | ||
h1 { | ||
color: darkblue; | ||
} | ||
</style> | ||
</head> | ||
<body> | ||
<div> | ||
<h1>🦜️🔗 LangChain</h1> | ||
<p>⚡ Building applications with LLMs through composability ⚡</p> | ||
</div> | ||
<div> | ||
As an open source project in a rapidly developing field, we are extremely open to contributions. | ||
</div> | ||
</body> | ||
</html>`; | ||
|
||
const splitter = RecursiveCharacterTextSplitter.fromLanguage("html", { | ||
chunkSize: 175, | ||
chunkOverlap: 20, | ||
}); | ||
const output = await splitter.createDocuments([text]); | ||
|
||
console.log(output); | ||
|
||
/* | ||
[ | ||
Document { | ||
pageContent: '<!DOCTYPE html>\n<html>', | ||
metadata: { loc: [Object] } | ||
}, | ||
Document { | ||
pageContent: '<head>\n <title>🦜️🔗 LangChain</title>', | ||
metadata: { loc: [Object] } | ||
}, | ||
Document { | ||
pageContent: '<style>\n' + | ||
' body {\n' + | ||
' font-family: Arial, sans-serif;\n' + | ||
' }\n' + | ||
' h1 {\n' + | ||
' color: darkblue;\n' + | ||
' }\n' + | ||
' </style>\n' + | ||
' </head>', | ||
metadata: { loc: [Object] } | ||
}, | ||
Document { | ||
pageContent: '<body>\n' + | ||
' <div>\n' + | ||
' <h1>🦜️🔗 LangChain</h1>\n' + | ||
' <p>⚡ Building applications with LLMs through composability ⚡</p>\n' + | ||
' </div>', | ||
metadata: { loc: [Object] } | ||
}, | ||
Document { | ||
pageContent: '<div>\n' + | ||
' As an open source project in a rapidly developing field, we are extremely open to contributions.\n' + | ||
' </div>\n' + | ||
' </body>\n' + | ||
'</html>', | ||
metadata: { loc: [Object] } | ||
} | ||
] | ||
*/ |
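The logged output above shows `loc: [Object]` because Node's `console.log` abbreviates objects nested more than two levels deep. A minimal sketch of one way to print the full structure, using `JSON.stringify`. The `sampleOutput` value and the `lines.from` / `lines.to` shape of `loc` are assumptions made for illustration; the log above does not show what `loc` contains.

```typescript
// Hypothetical stand-in for one element of the splitter output. The real
// contents of `loc` are not visible in the log above, so this shape is assumed.
interface LoggedDocument {
  pageContent: string;
  metadata: { loc: { lines: { from: number; to: number } } };
}

const sampleOutput: LoggedDocument[] = [
  {
    pageContent: "<!DOCTYPE html>\n<html>",
    metadata: { loc: { lines: { from: 1, to: 2 } } },
  },
];

// JSON.stringify serializes every nesting level, so `loc` is printed in full
// instead of being abbreviated to [Object].
const fullDump: string = JSON.stringify(sampleOutput, null, 2);
console.log(fullDump);
```

With real splitter output, the same call would be `JSON.stringify(output, null, 2)`.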
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
import { | ||
SupportedTextSplitterLanguages, | ||
RecursiveCharacterTextSplitter, | ||
} from "langchain/text_splitter"; | ||
|
||
console.log(SupportedTextSplitterLanguages); // Array of supported languages | ||
|
||
/* | ||
[ | ||
'cpp', 'go', | ||
'java', 'js', | ||
'php', 'proto', | ||
'python', 'rst', | ||
'ruby', 'rust', | ||
'scala', 'swift', | ||
'markdown', 'latex', | ||
'html' | ||
] | ||
*/ | ||
|
||
const jsCode = `function helloWorld() { | ||
console.log("Hello, World!"); | ||
} | ||
// Call the function | ||
helloWorld();`; | ||
|
||
const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", { | ||
chunkSize: 32, | ||
chunkOverlap: 0, | ||
}); | ||
const jsOutput = await splitter.createDocuments([jsCode]); | ||
|
||
console.log(jsOutput); | ||
|
||
/* | ||
[ | ||
Document { | ||
pageContent: 'function helloWorld() {', | ||
metadata: { loc: [Object] } | ||
}, | ||
Document { | ||
pageContent: 'console.log("Hello, World!");', | ||
metadata: { loc: [Object] } | ||
}, | ||
Document { | ||
pageContent: '}\n// Call the function', | ||
metadata: { loc: [Object] } | ||
}, | ||
Document { | ||
pageContent: 'helloWorld();', | ||
metadata: { loc: [Object] } | ||
} | ||
] | ||
*/ |
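The language list logged above can also back a simple guard, so that a language name coming from user input or configuration is checked before it reaches `fromLanguage`. A minimal sketch, assuming a local copy of that list: `KNOWN_LANGUAGES`, `isKnownLanguage`, and `pickLanguage` are hypothetical names, not part of the library. In real code the exported `SupportedTextSplitterLanguages` array would replace the local constant.

```typescript
// Local copy of the language list printed above (hypothetical constant; the
// library exports the real list as `SupportedTextSplitterLanguages`).
const KNOWN_LANGUAGES = [
  "cpp", "go", "java", "js", "php", "proto", "python", "rst",
  "ruby", "rust", "scala", "swift", "markdown", "latex", "html",
] as const;

type KnownLanguage = (typeof KNOWN_LANGUAGES)[number];

// Type guard: narrows an arbitrary string to one of the known language names.
function isKnownLanguage(value: string): value is KnownLanguage {
  return (KNOWN_LANGUAGES as readonly string[]).indexOf(value) !== -1;
}

// Returns the candidate if it is a known language, otherwise the fallback.
function pickLanguage(candidate: string, fallback: KnownLanguage): KnownLanguage {
  return isKnownLanguage(candidate) ? candidate : fallback;
}

console.log(pickLanguage("python", "markdown"));
console.log(pickLanguage("cobol", "markdown"));
```

The value returned by `pickLanguage` could then be passed as the first argument of `RecursiveCharacterTextSplitter.fromLanguage`.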
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter"; | ||
|
||
const pythonCode = `def hello_world(): | ||
print("Hello, World!") | ||
# Call the function | ||
hello_world()`; | ||
|
||
const splitter = RecursiveCharacterTextSplitter.fromLanguage("python", { | ||
chunkSize: 32, | ||
chunkOverlap: 0, | ||
}); | ||
|
||
const pythonOutput = await splitter.createDocuments([pythonCode]); | ||
|
||
console.log(pythonOutput); | ||
|
||
/* | ||
[ | ||
Document { | ||
pageContent: 'def hello_world():', | ||
metadata: { loc: [Object] } | ||
}, | ||
Document { | ||
pageContent: 'print("Hello, World!")', | ||
metadata: { loc: [Object] } | ||
}, | ||
Document { | ||
pageContent: '# Call the function', | ||
metadata: { loc: [Object] } | ||
}, | ||
Document { | ||
pageContent: 'hello_world()', | ||
metadata: { loc: [Object] } | ||
} | ||
] | ||
*/ |
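The commit message mentions preserving separators while splitting. The general idea can be sketched without the library: try separators from coarsest to finest, keep each separator attached to the piece that follows it, and merge adjacent pieces while they fit in `chunkSize`. This is a simplified illustration, not the library's implementation. The separator list and the function names are hypothetical, there is no chunk overlap, and chunks are not trimmed, so the result will differ from the logged output above.

```typescript
// Separators tried from coarsest to finest. Hypothetical list for
// Python-like source; not the list used by the library.
const PYTHON_LIKE_SEPARATORS: string[] = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""];

// Split `text` on `separator`, keeping the separator at the start of the
// piece that follows it, so no characters are lost.
function splitKeepingSeparator(text: string, separator: string): string[] {
  if (separator === "") return text.split("");
  const pieces: string[] = [];
  let start = 0;
  // Search from index 1 so a leading separator does not yield an empty piece.
  let index = text.indexOf(separator, 1);
  while (index !== -1) {
    pieces.push(text.slice(start, index));
    start = index;
    index = text.indexOf(separator, index + separator.length);
  }
  pieces.push(text.slice(start));
  return pieces;
}

// Recursively split `text` into chunks of at most `chunkSize` characters.
function recursiveSplit(text: string, chunkSize: number, separators: string[]): string[] {
  if (chunkSize < 1) throw new Error("chunkSize must be at least 1");
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  if (separators.length === 0) {
    // No separators left: fall back to fixed-width cuts.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) cuts.push(text.slice(i, i + chunkSize));
    return cuts;
  }
  const finer = separators.slice(1);
  const chunks: string[] = [];
  let current = "";
  for (const piece of splitKeepingSeparator(text, separators[0])) {
    if (piece.length > chunkSize) {
      // Too large even on its own: flush, then retry with finer separators.
      if (current.length > 0) chunks.push(current);
      current = "";
      for (const sub of recursiveSplit(piece, chunkSize, finer)) chunks.push(sub);
    } else if (current.length + piece.length > chunkSize) {
      chunks.push(current);
      current = piece;
    } else {
      current += piece;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

const sampleSource =
  'def hello_world():\n    print("Hello, World!")\n\n# Call the function\nhello_world()';
const pythonChunks = recursiveSplit(sampleSource, 32, PYTHON_LIKE_SEPARATORS);
console.log(pythonChunks);
```

Because separators stay attached to their pieces, concatenating the chunks reproduces the input exactly, which makes the behavior easy to check.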
Successfully deployed to the following URLs:
langchainjs-docs – ./
langchainjs-docs-ruddy.vercel.app
langchainjs-docs-git-main-langchain.vercel.app
langchainjs-docs-langchain.vercel.app
js.langchain.com