Fix scraper: problems with some fields. #62

Merged Feb 6, 2024 (2 commits)

Conversation

@ntkog (Collaborator) commented Feb 5, 2024

We found a number of inconsistencies in the HTML markup of certain properties in the meta tags of articles from 2022.
[Screenshots: Screenshot_20240205_140208, Screenshot_20240205_140153, Screenshot_20240205_140102]
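
For illustration only, here is a minimal sketch of one way a scraper can tolerate this kind of inconsistent meta markup. It is not the code from this PR: `MetaCollector`, `extract_meta`, the list of key attributes, and the candidate names passed to it are all hypothetical, and it assumes the inconsistency is that a field's key may appear under different attributes or names from one article to another.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags, tolerating inconsistent attribute names."""

    # Assumption: the key of a field may live in any of these attributes.
    KEY_ATTRS = ("name", "property", "itemprop")

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta charset>), so normalize.
        attr_map = {k.lower(): (v or "") for k, v in attrs}
        content = attr_map.get("content", "").strip()
        if not content:
            return  # Skip tags that carry no usable value.
        for key_attr in self.KEY_ATTRS:
            key = attr_map.get(key_attr, "").strip().lower()
            if key:
                # Keep the first non-empty value seen for each key.
                self.meta.setdefault(key, content)


def extract_meta(html, candidates, default=None):
    """Return the first non-empty value among the candidate meta names."""
    parser = MetaCollector()
    parser.feed(html)
    for name in candidates:
        value = parser.meta.get(name.lower())
        if value:
            return value
    return default
```

A caller would pass the candidate names in priority order, for example `extract_meta(page_html, ["dc.title", "og:title"], default="")`, so that a missing or empty tag falls back to the next candidate instead of raising an error.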

Tested with a batch of articles from 2020 to 2023. The run completed without errors.

[2024-02-05 23:04:10,538] [11993] [INFO] [_split_documents] Splitting in chunks 64966 documents
[2024-02-05 23:05:50,384] [11993] [INFO] [_split_documents] Removing file /tmp/tmpkiq2qnee
[2024-02-05 23:05:50,384] [11993] [INFO] [_split_documents] Splitted 64966 documents in 275232 chunks
[2024-02-05 23:05:50,385] [11993] [INFO] [_load_database] Loading 275232 embeddings to database

@bukosabino self-requested a review February 6, 2024 07:33

@bukosabino (Owner) left a comment

Looks good to me!

@ntkog merged commit 6bbe0a4 into bukosabino:main Feb 6, 2024
2 checks passed