You have approximately 75 models and 105 datasets hosted on the HuggingFace Hub, listed in your netid_model and netid_data files. In addition, you have a sample of GitHub repositories mentioned in scientific papers.
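A minimal sketch for turning those ID lists into raw README URLs, assuming each file contains one Hub ID per line (the file names and format are assumptions here, and the GitHub repositories would need their own URL scheme):

```python
# Sketch: build raw README URLs from the ID lists.
# Assumes netid_model and netid_data each contain one Hub ID per line.

def load_ids(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def readme_url(hub_id, kind):
    # Datasets live under /datasets/; models sit directly under the root namespace.
    prefix = "datasets/" if kind == "data" else ""
    return f"https://huggingface.co/{prefix}{hub_id}/raw/main/README.md"

model_ids = load_ids("netid_model")
data_ids = load_ids("netid_data")
```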
For each, please scrape and store the content of the README file. For each retrieved README file, extract all URLs and DOIs in it. For bonus points, extract BibTeX entries as well. Store the results in a (compressed) JSON file containing one dictionary per line with the following keys (see the extraction sketch after this list):
- 'id': e.g., 'LoneStriker/SauerkrautLM-Mixtral-8x7B-3.0bpw-h6-exl2'
- 'type': '(data|model|source)'
- 'url': e.g. 'https://huggingface.co/datasets/pankajmathur/alpaca_orca/raw/main/README.md' or 'https://huggingface.co/iamacaru/climate/raw/main/README.md'
- 'content': "the content of the README file" - no newlines
- 'links': [ an array of extracted URLs ]
- 'dois': [ an array of extracted DOIs ]
- 'bibs': [ an array of extracted bib entries ] - bonus points
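One way to fill 'links', 'dois', and 'bibs' is with regular expressions. The patterns below are a rough sketch (assumptions, not exhaustive; DOI and BibTeX matching in particular will miss edge cases):

```python
import re

# Crude patterns -- good enough for a first pass, not a full parser.
URL_RE = re.compile(r"https?://[^\s)\]\"'>]+")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"'<>]+")
BIB_RE = re.compile(r"@\w+\s*\{.*?\n\}", re.DOTALL)  # one entry per @type{...}

def extract(content):
    """Pull URLs, DOIs, and (bonus) BibTeX entries out of a README string."""
    return {
        "links": URL_RE.findall(content),
        "dois": DOI_RE.findall(content),
        "bibs": BIB_RE.findall(content),
    }
```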
Each line is separately JSON-encoded. Output should be in a single file, output/netid.json.gz, and your source code should be in netid.py (or netid.ipynb if you prefer to use Colab notebooks).
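Fetching and writing could look like the sketch below; `requests` is one possible HTTP client (not mandated by the assignment), and `records` stands for the per-README dictionaries built with the keys listed above:

```python
import gzip
import json
import os

import requests  # one possible HTTP client; any fetcher works


def fetch_readme(url):
    """Return the README text with newlines stripped, or None on failure."""
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return None
    return resp.text.replace("\n", " ")


def write_records(records, path="output/netid.json.gz"):
    """Write one JSON-encoded dictionary per line into a gzip-compressed file."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for rec in records:
            out.write(json.dumps(rec) + "\n")
```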
See example in example.py.
Happy scraping!