HTML to Markdown API

This API takes a URL as a query parameter and returns a Markdown version of the parsed HTML content of the website.

Introduction

This project provides a simple API to convert web pages to Markdown. It's useful for tasks like archiving web content, creating documentation, or building tools that need a plain text representation of a website. This is also very useful for scraping main websites content as inputs for LLM summarization tasks.

It leverages Playwright for browser automation and Turndown for HTML to Markdown conversion.

API Endpoint

GET method at /api/scrape
GET method at /:url will be redirected to /api/scrape

Request Parameters

Parameter	Type	Default	Description	Required
`url`	string	-	The URL of the website to convert	Yes
`json`	boolean (1\|0)	false	If true, the response will be in JSON format	No
`formatTables`	boolean (1\|0)	true	If true, parse tables in markdown syntax. Otherwise, return as strings	No
`stripTables`	boolean (1\|0)	false	If true, tables will be removed from the response	No
`stripImages`	boolean (1\|0)	false	If true, images will be removed from the response	No
`stripLinks`	boolean (1\|0)	false	If true, links will be removed from the response	No
`waitForTimeoutSeconds`	int	3	If specified, the API will wait for the specified number of seconds before parsing the website content. This is useful for waiting for lazy loaded content to load.	No

Response

By default, the API will return a plain text response containing the Markdown version of the website's content. The Content-Type header will be set to text/plain.

If json is set to 1, the API will return a JSON response containing the following fields:

title: The title of the website.
urlSource: The URL of the website.
publishedTime: The date and time the website was published.
markdownContent: The Markdown version of the website's content.

Error Handling

The API will return appropriate HTTP status codes for errors. Possible errors include:

500 Internal Server Error: If there is an error during the conversion process (e.g., website not found, Playwright error, Turndown error). More detailed error messages may be logged on the server.

Dependencies

Playwright: For browser automation.
Turndown: For HTML to Markdown conversion.
Express: For creating the API server.
... other dependencies listed in package.json

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
cloudbuild.yaml		cloudbuild.yaml
package-lock.json		package-lock.json
package.json		package.json
server.js		server.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HTML to Markdown API

Introduction

API Endpoint

Request Parameters

Response

Error Handling

Dependencies

Contributing

About

Releases

Packages

Languages

License

Gan-Tu/UrlReader

Folders and files

Latest commit

History

Repository files navigation

HTML to Markdown API

Introduction

API Endpoint

Request Parameters

Response

Error Handling

Dependencies

Contributing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages