
Data pipeline overview


The collection target queue is set up and the collection is started.

		session = FacebookCollection(date_stamp)
		session.create_target_queue()
		session.collect()

https://github.com/ianknowles/dgg-data/blob/d0a5620b9281e5eba1cb5b64aaf69c7a43552885/data/dgg-data-python/collect.py#L26-L28
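The collection class itself lives in collect.py, linked above. As a rough sketch of the shape of this stage, assuming the queue holds one entry per audience estimate to request (the target fields and the query_api helper here are illustrative, not the repository's actual code):

		# Hypothetical sketch of the collection stage; field names and the
		# query_api helper are illustrative only.
		class FacebookCollection:
			def __init__(self, date_stamp):
				self.date_stamp = date_stamp
				self.queue = []

			def create_target_queue(self):
				# One queue entry per audience estimate to request.
				self.queue = [{'country': c, 'gender': g}
				              for c in ('GB', 'US') for g in ('male', 'female')]

			def collect(self):
				# Work through the queue, requesting an audience estimate
				# for each target and saving the response for analysis.
				for target in self.queue:
					self.query_api(target)

			def query_api(self, target):
				# Placeholder for the Facebook Marketing API request.
				...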

The calls to predict here begin the analysis pipeline stages: one run for monthly active user (mau) estimates and one for daily active user (dau) estimates. The analysis runs locally and the results are uploaded to the S3 bucket.

		mau_key = r_analysis_wrapper.predict(date_stamp, 'mau')
		r_analysis_wrapper.predict(date_stamp, 'dau')

https://github.com/ianknowles/dgg-data/blob/d0a5620b9281e5eba1cb5b64aaf69c7a43552885/data/dgg-data-python/collect.py#L30-L31
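The wrapper suggests the analysis itself is implemented in R and driven from Python. A minimal sketch of such a wrapper, assuming predict shells out to an R script and uploads the output with boto3 (the script name, output file name, bucket name, and S3 key layout are all assumptions):

		# Hypothetical sketch of an R analysis wrapper; script, file, and
		# bucket names are assumptions, not the repository's actual values.
		import subprocess

		import boto3


		def predict(date_stamp, model_type):
			# Run the R analysis for the given model type ('mau' or 'dau');
			# the script is assumed to write its result to the output file.
			output = f'{date_stamp}_{model_type}.csv'
			subprocess.run(['Rscript', 'predict.R', date_stamp, model_type], check=True)
			# Upload the result and return its key for the index stage.
			key = f'models/{output}'
			boto3.client('s3').upload_file(output, 'example-bucket', key)
			return key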

Once the analysis has completed successfully, the index file is updated. The website frontend reads this file to provide the list of available analyses.

		# model index
		s3_bucket = S3Bucket()
		index = ModelIndexFile(s3_bucket)
		index.add_latest(date_stamp, mau_key)

https://github.com/ianknowles/dgg-data/blob/d0a5620b9281e5eba1cb5b64aaf69c7a43552885/data/dgg-data-python/collect.py#L33-L36
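One plausible shape for the add_latest update, assuming the index is a JSON object stored in the bucket (the index key and JSON layout here are illustrative, not the repository's actual format):

		# Hypothetical sketch of the index update; the index key and JSON
		# layout are assumptions, not the repository's actual format.
		import json

		import boto3

		INDEX_KEY = 'models/index.json'  # assumed location of the index file


		def add_latest(bucket, date_stamp, model_key):
			s3 = boto3.client('s3')
			# Fetch the current index, record the newest analysis, write it back.
			body = s3.get_object(Bucket=bucket, Key=INDEX_KEY)['Body'].read()
			index = json.loads(body)
			index['latest'] = {'date': date_stamp, 'model': model_key}
			s3.put_object(Bucket=bucket, Key=INDEX_KEY,
			              Body=json.dumps(index).encode('utf-8'))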
