Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for 'language' field in scraped recipes #144

Merged
merged 26 commits into from
Apr 29, 2020
Merged

Add support for 'language' field in scraped recipes #144

merged 26 commits into from
Apr 29, 2020

Conversation

jayaddison
Copy link
Collaborator

Resolves #143

@jayaddison
Copy link
Collaborator Author

Ideas:

  • It'd be nice to consistently handle multi-valued language fields
  • Using BCP-47 to validate the language codes could be useful to reduce the need for clients to do their own validation

@jayaddison
Copy link
Collaborator Author

Regarding multi-valued languages:

  • It doesn't really seem to make a huge amount of sense for a single recipe to be written in multiple languages (except perhaps in some kind of rare and probably entertaining international kitchen environments)
  • schema.org inLanguage and (superceded) language fields are single-valued
  • The html lang attribute is single-valued
  • The meta http-equiv:content-language value may be multi-valued, but its' use is discouraged by the W3C

It seems to make sense to only retrieve a single result; we can handle this by comma-splitting the meta tag's value.

There are a few Python-based libraries for BCP 47 and/or country code validation. I'm still weighing up whether it makes sense to use one of those, since it would add another dependency. Reliable-quality data is generally good.

@jayaddison
Copy link
Collaborator Author

jayaddison commented Apr 27, 2020

marrow/bcp47#1 provides an excellent list of Python libraries available for BCP47 validation.

Based on that list, language-tags looks like a good candidate because:

  • It is pure Python
  • It is self-contained (adds no other runtime dependencies)
  • It has test coverage
  • It is MIT-licensed, which means that it should not ripple any license effects to recipe-scrapers

Edit: link fixup

return decorated(self, *args, **kwargs)

try:
value = function(*args, **kwargs)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@hhursev
Copy link
Owner

hhursev commented Apr 28, 2020

Very nice PR! Absolutely no comments codewise. I invited you to be a collaborator on the repository and you are most welcomed to merge it on your own 😉 Thank you for the detailed information gathered!

As a side:

  • I'd not implement the meta http-equiv header check (as you've said - it's deprecated)
  • I'd not do the BCP-47 validation (I doubt any site will have its language code messed up if one used)

However, these days I'm not using my package actively and I trust your judgement on the upper two points more!

@jayaddison
Copy link
Collaborator Author

Thank you @hhursev, that's very kind, and I've accepted the invitation.

I'll consider the meta and BCP-47 questions for a little longer before making decisions on them, and will merge the pull request once those are resolved.

@jayaddison
Copy link
Collaborator Author

* I'd not implement the `meta http-equiv` header check (as you've said - it's deprecated)

This is a good concern, and I've decided to make this disabled-by-default (so that my use case is handled, but considering it a rare one).

* I'd not do the BCP-47 validation (I doubt any site will have its language code messed up if one used)

I'm planning to keep this for now; from experience, although data cleanup is always required, it can be really pleasant to know that a library is already enforcing a schema for the data it produces.

I'm going to spend a little more time considering the scrape_me API and perhaps making some adjustments there, and will likely merge this later today.

@jayaddison
Copy link
Collaborator Author

I'm going to spend a little more time considering the scrape_me API and perhaps making some adjustments there, and will likely merge this later today.

In the end it seemed safest to leave the existing scrape_me method as-is. An experimental harvest method is introduced that exposes the new meta_http_equiv option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for a 'language' field in retrieved recipes
2 participants