Python package to quickly extract links in HTML

Documentation: https://fast-link-extractor.readthedocs.io

Source Code: https://github.com/lgloege/fast-link-extractor

A Python 3.7+ package to extract links from a webpage. Asynchronous functions allow the code to run quickly when extracting from many sub-directories. One use case is extracting download links to pass to wget or fsspec.

Installation

Install from PyPI

pip install fast-link-extractor

Install from GitHub

pip install git+https://github.com/lgloege/fast-link-extractor.git

Example

Import the package and call link_extractor(). This returns a list of extracted links:

import fast_link_extractor as fle

# url to extract links from
base_url = "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/"

# search all sub-directories and keep only links ending with .nc
# this may take ~10 seconds, there are a lot of sub-directories
# (the dot is escaped so it matches a literal "." rather than any character)
links = fle.link_extractor(base_url,
                           search_subs=True,
                           regex=r'\.nc$')
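
The regex argument decides which links are kept, so the exact pattern matters. The sketch below is not the package's own code: it uses Python's standard re module on a made-up list of links, and it assumes the pattern is applied with search-style matching. The helper filter_links and the example.com URLs are hypothetical, chosen only to show how an unescaped dot differs from an escaped one.

```python
import re

# Hypothetical sample links, for illustration only
sample_links = [
    "https://example.com/data/file_a.nc",
    "https://example.com/data/file_b.nc.md5",
    "https://example.com/data/sync",
]


def filter_links(links, pattern):
    """Keep the links in which the pattern matches somewhere in the string."""
    compiled = re.compile(pattern)
    return [link for link in links if compiled.search(link)]


# An unescaped dot matches any character, so ".nc$" also accepts "sync"
loose = filter_links(sample_links, ".nc$")

# An escaped dot only accepts a literal ".nc" at the end of the link
strict = filter_links(sample_links, r"\.nc$")

print(loose)
print(strict)
```

With these sample links, the loose pattern keeps both file_a.nc and the unrelated "sync" entry, while the strict pattern keeps only file_a.nc.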

If running inside Jupyter or IPython, set ipython=True:

import fast_link_extractor as fle

# url to extract links from
base_url = "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/"

# search all sub-directories and keep only links ending with .nc
# this may take ~10 seconds, there are a lot of sub-directories
# (the dot is escaped so it matches a literal "." rather than any character)
links = fle.link_extractor(base_url,
                           search_subs=True,
                           ipython=True,
                           regex=r'\.nc$')
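
For the wget use case mentioned at the top, one possible approach is to write the extracted links to a text file, one URL per line, and hand that file to wget. This is a minimal sketch, not part of the package: write_url_list and the example.com links are hypothetical stand-ins for the list returned by link_extractor().

```python
from pathlib import Path

# Hypothetical stand-in for the list returned by fle.link_extractor()
sample_links = [
    "https://example.com/data/file_a.nc",
    "https://example.com/data/file_b.nc",
]


def write_url_list(links, path="urls.txt"):
    """Write one URL per line to a text file and return its path."""
    out = Path(path)
    out.write_text("".join(f"{link}\n" for link in links))
    return out


url_file = write_url_list(sample_links, "urls.txt")
print(f"wrote {len(sample_links)} links to {url_file}")
```

The resulting file can then be passed to wget with its input-file option, for example `wget -i urls.txt`.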

License

This project is licensed under the terms of the MIT license.
