Backend rewrite in Go, & parser rewrite v0.8.0-beta
- Rewrite of HTTP client backend to make use of goroutines, instead of gevent (#17)
- Fixed all concurrency slowdowns (#17)
- Rewrite of backend HTML parser using selectolax (#13)
- Added ability to search by HTML tags, similar to bs4
- Added ability to use dot notation for pulling attributes from elements
- Added `find_all` parser shortcut to the Response and BrowserSession objects
- Fixed missing required dependency errors (#15)
- Removed lxml, pyquery, w3lib, and bs4 libraries
- Deprecated xpath searching from HTML parser
- Failed `map`, `imap`, & `imap_enum` responses will yield `FailedResponse` instead of `None`
- Added Chrome 117 and Firefox 117 TLS profiles
- Fixed README discrepancies (#17)
daijro committed Sep 30, 2023
1 parent af05a72 commit 886ec25
Showing 18 changed files with 779 additions and 363 deletions.
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
exclude bridge
53 changes: 17 additions & 36 deletions README.md
@@ -31,7 +31,7 @@

- Seamless transition between HTTP and headless browsing 💻
- Integrated fast HTML parser 🚀
- High performance concurrency with gevent (_without monkey-patching!_) 🚀
- High performance network concurrency with goroutines & gevent 🚀
- Replication of browser TLS fingerprints 🚀
- JavaScript rendering 🚀
- Supports HTTP/2 🚀
@@ -51,6 +51,7 @@

- High performance ✨
- Minimal dependence on the python standard libraries
- HTTP backend written in Go
- Written with type safety
- 100% threadsafe ❤️

@@ -203,7 +204,7 @@ Creating a new Chrome Session object:

```py
>>> session = hrequests.Session() # version randomized by default
>>> session = hrequests.Session('chrome', version=112)
>>> session = hrequests.Session('chrome', version=117)
```

<details>
@@ -387,7 +388,6 @@ Parameters:
verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
timeout (float, optional): Timeout in seconds. Defaults to 30.
proxies (dict, optional): Dictionary of proxies. Defaults to None.
nohup (bool, optional): Run the request in the background. Defaults to False.
<Additionally includes all parameters from `hrequests.Session` if a session was not specified>

Returns:
@@ -512,18 +512,21 @@ To handle timeouts or any other exception during the connection of the request,
['Response failed: Connection error', 'Response failed: Connection error', <Response [200]>]
```

The value returned by the exception handler will be used in place of the response in the result list:
The value returned by the exception handler will be used in place of the response in the result list.

If an exception handler isn't specified, the default yield type is `hrequests.FailedResponse`.
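
The fallback behaviour described here can be pictured with a small stand-in. `fetch_all`, `FailedResult`, `fake_fetch`, and the handler signature below are hypothetical names invented for this sketch; the real `hrequests.map` and `hrequests.FailedResponse` may differ in signature and fields.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class FailedResult:
    """Hypothetical placeholder playing the role of a failed response."""
    exception: Exception


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    exception_handler: Optional[Callable[[str, Exception], object]] = None,
) -> List[object]:
    """Fetch every URL, substituting a fallback value for each failure."""
    results = []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            if exception_handler is not None:
                # The handler's return value takes the place of the response.
                results.append(exception_handler(url, exc))
            else:
                # No handler: record a failure object rather than None.
                results.append(FailedResult(exc))
    return results


def fake_fetch(url: str) -> str:
    """Stand-in for a network call; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ConnectionError("Connection error")
    return f"<Response [200] {url}>"


no_handler = fetch_all(["https://bad.example", "https://ok.example"], fake_fetch)
with_handler = fetch_all(
    ["https://bad.example"],
    fake_fetch,
    exception_handler=lambda url, exc: f"Response failed: {exc}",
)
```

Returning a dedicated failure object instead of `None` lets callers tell "the request failed" apart from "no value" and keeps the exception available for inspection.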

<hr width=50>

## HTML Parsing

HTML scraping uses PyQuery, which is ~7x faster than bs4. This functionality is based of [requests-html](https://github.com/psf/requests-html).
HTML scraping is based off [selectolax](https://github.com/rushter/selectolax), which is **over 25x faster** than bs4. This functionality is inspired by [requests-html](https://github.com/psf/requests-html).

| Library | Time (1e5 trials) |
| -------------- | ----------------- |
| BeautifulSoup4 | 52.6 |
| PyQuery | 7.5 |
| selectolax | **1.9** |
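
The speedup figures quoted for the parsers can be checked against the table. This is a small sketch that only divides the listed timings; the unit is not stated in the source, so it is assumed here to be the same for all three rows.

```python
# Timings copied from the benchmark table (1e5 trials; unit assumed consistent).
timings = {"BeautifulSoup4": 52.6, "PyQuery": 7.5, "selectolax": 1.9}


def speedup(baseline: str, candidate: str) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return timings[baseline] / timings[candidate]


bs4_vs_pyquery = speedup("BeautifulSoup4", "PyQuery")        # about 7.0
bs4_vs_selectolax = speedup("BeautifulSoup4", "selectolax")  # about 27.7
pyquery_vs_selectolax = speedup("PyQuery", "selectolax")     # about 3.9
```

The ratios agree with both the old "~7x faster than bs4" claim for PyQuery and the new "over 25x faster" claim for selectolax.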

The HTML parser can be accessed through the `html` attribute of the response object:

@@ -595,36 +598,6 @@ If ``first`` is ``True``, only returns the first

</details>

XPath is also supported:

```py
>>> resp.html.xpath('/html/body/div[1]/a')
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]
```

<details>
<summary>Parameters</summary>

```
Given an XPath selector, returns a list of Element objects or a single one.

Parameters:
selector (str): XPath Selector to use.
clean (bool, optional): Whether or not to sanitize the found HTML of <script> and <style> tags. Defaults to
first (bool, optional): Whether or not to return just the first result. Defaults to False.
_encoding (str, optional): The encoding format. Defaults to None.

Returns:
_XPath: A list of Element objects or a single one.

If a sub-selector is specified (e.g. //a/@href), a simple list of results is returned.
See W3School's XPath Examples for more details.

If first is True, only returns the first Element found.
```

</details>

### Introspecting elements

Grab an Element's text contents:
@@ -644,6 +617,8 @@ Getting an Element's attributes:
```py
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
>>> about.id
'about'
```
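
One way dot notation such as `about.id` could be layered over an attribute dictionary is Python's `__getattr__` hook. The `AttrElement` class below is a hypothetical stand-in written for illustration, not the hrequests `Element` implementation.

```python
class AttrElement:
    """Hypothetical stand-in for a parsed element (not the hrequests class)."""

    def __init__(self, tag: str, attrs: dict):
        self.tag = tag
        self.attrs = attrs

    def __getattr__(self, name: str):
        # Python calls this only when normal attribute lookup fails, so
        # `tag` and `attrs` are not routed through here once they are set.
        if name == "attrs":
            # Guard against infinite recursion if `attrs` is not yet set.
            raise AttributeError(name)
        try:
            return self.attrs[name]
        except KeyError:
            raise AttributeError(name) from None


about = AttrElement("li", {"id": "about", "class": ("tier-1", "element-1")})
element_id = about.id                    # looked up in the attrs dictionary
element_classes = about.attrs["class"]   # `class` is a keyword, so use attrs
```

In this sketch, attribute names that are Python keywords (such as `class`) or that are not valid identifiers (such as `aria-haspopup`) cannot be reached with dot syntax and still need the `attrs` dictionary.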

Get an Element's raw HTML:
@@ -662,6 +637,13 @@ Select Elements within Elements:
<Element 'a' href='/about/' title='' class=''>
```

Searching by HTML attributes:

```py
>>> about.find('li', role='treeitem')
<Element 'li' role='treeitem' class=('tier-2', 'element-1')>
```

Search for links within an element:

```py
@@ -1124,7 +1106,6 @@ Returns:

</details>


### Adding Firefox/Chrome extensions

Firefox/Chrome extensions can be easily imported into a browser session. Some potentially useful extensions include:
1 change: 1 addition & 0 deletions bridge/VERSION
@@ -0,0 +1 @@
1.0
3 changes: 3 additions & 0 deletions bridge/build.bat
@@ -0,0 +1,3 @@
set /p ver=<VERSION

xgo --out=hrequests-cgo-%ver% -buildmode=c-shared --dest=./dist .
21 changes: 21 additions & 0 deletions bridge/go.mod
@@ -0,0 +1,21 @@
module hrequests_bridge

go 1.21.1

require (
github.com/bogdanfinn/fhttp v0.5.24
github.com/bogdanfinn/tls-client v1.6.1
github.com/goccy/go-json v0.10.2
github.com/google/uuid v1.3.1
)

require (
github.com/andybalholm/brotli v1.0.4 // indirect
github.com/bogdanfinn/utls v1.5.16 // indirect
github.com/klauspost/compress v1.15.12 // indirect
github.com/tam7t/hpkp v0.0.0-20160821193359-2b70b4024ed5 // indirect
golang.org/x/crypto v0.1.0 // indirect
golang.org/x/net v0.5.0 // indirect
golang.org/x/sys v0.4.0 // indirect
golang.org/x/text v0.6.0 // indirect
)
24 changes: 24 additions & 0 deletions bridge/go.sum
@@ -0,0 +1,24 @@
github.com/andybalholm/brotli v1.0.4 h1:V7DdXeJtZscaqfNuAdSRuRFzuiKlHSC/Zh3zl9qY3JY=
github.com/andybalholm/brotli v1.0.4/go.mod h1:fO7iG3H7G2nSZ7m0zPUDn85XEX2GTukHGRSepvi9Eig=
github.com/bogdanfinn/fhttp v0.5.24 h1:OlyBKjvJp6a3TotN3wuj4mQHHRbfK7QUMrzCPOZGhRc=
github.com/bogdanfinn/fhttp v0.5.24/go.mod h1:brqi5woc5eSCVHdKYBV8aZLbO7HGqpwyDLeXW+fT18I=
github.com/bogdanfinn/tls-client v1.6.1 h1:GTIqQssFoIvLaDf4btoYRzDhUzudLqYD4axvfUCXl3I=
github.com/bogdanfinn/tls-client v1.6.1/go.mod h1:FtwQ3DndVZ0xAOO704v4iNAgbHOcEc5kPk9tjICTNQ0=
github.com/bogdanfinn/utls v1.5.16 h1:NhhWkegEcYETBMj9nvgO4lwvc6NcLH+znrXzO3gnw4M=
github.com/bogdanfinn/utls v1.5.16/go.mod h1:mHeRCi69cUiEyVBkKONB1cAbLjRcZnlJbGzttmiuK4o=
github.com/goccy/go-json v0.10.2 h1:CrxCmQqYDkv1z7lO7Wbh2HN93uovUHgrECaO5ZrCXAU=
github.com/goccy/go-json v0.10.2/go.mod h1:6MelG93GURQebXPDq3khkgXZkazVtN9CRI+MGFi0w8I=
github.com/google/uuid v1.3.1 h1:KjJaJ9iWZ3jOFZIf1Lqf4laDRCasjl0BCmnEGxkdLb4=
github.com/google/uuid v1.3.1/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/klauspost/compress v1.15.12 h1:YClS/PImqYbn+UILDnqxQCZ3RehC9N318SU3kElDUEM=
github.com/klauspost/compress v1.15.12/go.mod h1:QPwzmACJjUTFsnSHH934V6woptycfrDDJnH7hvFVbGM=
github.com/tam7t/hpkp v0.0.0-20160821193359-2b70b4024ed5 h1:YqAladjX7xpA6BM04leXMWAEjS0mTZ5kUU9KRBriQJc=
github.com/tam7t/hpkp v0.0.0-20160821193359-2b70b4024ed5/go.mod h1:2JjD2zLQYH5HO74y5+aE3remJQvl6q4Sn6aWA2wD1Ng=
golang.org/x/crypto v0.1.0 h1:MDRAIl0xIo9Io2xV565hzXHw3zVseKrJKodhohM5CjU=
golang.org/x/crypto v0.1.0/go.mod h1:RecgLatLF4+eUMCP1PoPZQb+cVrJcOPbHkTkbkB9sbw=
golang.org/x/net v0.5.0 h1:GyT4nK/YDHSqa1c4753ouYCDajOYKTja9Xb/OHtgvSw=
golang.org/x/net v0.5.0/go.mod h1:DivGGAXEgPSlEBzxGzZI+ZLohi+xUj054jfeKui00ws=
golang.org/x/sys v0.4.0 h1:Zr2JFtRQNX3BCZ8YtxRE9hNJYC8J6I1MVbMg6owUp18=
golang.org/x/sys v0.4.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/text v0.6.0 h1:3XmdazWV+ubf7QgHSTWeykHOci5oeekaGJBLkrkaw4k=
golang.org/x/text v0.6.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8=