A multi-threaded GitHub scraper that collects Python code with docstrings from public repositories to build a well-documented dataset for training the JaraConverse LLM.
nlp scraper script python3 dataset dataset-generation nlp-machine-learning data-scraping github-scraper python-code docstring-generator llm causal-language-modeling dataset-scripts python-dataset llm-training docst
Updated Jul 21, 2024 - Python
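
The description above mentions collecting Python code together with its docstrings using multiple threads. The repository's actual implementation is not shown here, so the following is only a minimal sketch of that idea under stated assumptions: the function names `extract_docstring_pairs` and `build_dataset` and the record fields are hypothetical, and the input is a list of already-downloaded source strings rather than live GitHub requests.

```python
import ast
from concurrent.futures import ThreadPoolExecutor


def extract_docstring_pairs(source: str) -> list[dict]:
    """Return one record per documented function or class in `source`.

    Hypothetical helper: the record layout is an assumption, not the
    repository's actual dataset schema.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # skip files that do not parse as Python 3

    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc:  # keep only definitions that carry a docstring
                records.append({
                    "name": node.name,
                    "docstring": doc,
                    "code": ast.get_source_segment(source, node),
                })
    return records


def build_dataset(sources: list[str], max_workers: int = 4) -> list[dict]:
    """Process many source files with a thread pool and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_file = pool.map(extract_docstring_pairs, sources)
    return [record for records in per_file for record in records]
```

In a real scraper the threads would mainly pay off while waiting on network downloads; the parsing step shown here is CPU-bound, so the pool is used only to illustrate the structure.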