Feat: Translation cmd for scribe-data #536
Conversation
Thank you for the pull request! The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :) If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you! Maintainer checklist
I used `orjson`. Not sure it's worth it, actually.
Do you want to do a time benchmark to check what we're talking about for speed? I don't see a problem with adding a well maintained library to the requirements if we get enough of a performance boost from it. It's less time for us to test it as well, but it would be nice to know the difference 😊
Yes, I can do a time benchmark with and without orjson. Also, recent changes might fail the tests.
All good! Take your time on this and we'll figure it out 😊 Looking forward to the benchmark!
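A micro-benchmark along the lines being discussed could look like the sketch below. This is not the PR's actual harness; the payload shape and the `bench` helper are made up for illustration, and `orjson` is only timed if it happens to be installed.

```python
import json
import time

# orjson is a third-party library; fall back to timing stdlib json alone
# if it isn't available in the current environment.
try:
    import orjson
except ImportError:
    orjson = None

# A hypothetical lexeme-like payload, just to have something non-trivial to parse.
payload = [{"lemma": f"word{i}", "forms": list(range(5))} for i in range(10_000)]
raw = json.dumps(payload)


def bench(loads, data, runs=5):
    """Return the best wall-clock time for parsing `data` over `runs` tries."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        loads(data)
        best = min(best, time.perf_counter() - start)

    return best


t_json = bench(json.loads, raw)
print(f"json:   {t_json:.4f}s")
if orjson is not None:
    print(f"orjson: {bench(orjson.loads, raw):.4f}s")
```

Taking the best of several runs reduces noise from the OS scheduler, which matters when the parse times are in the millisecond range.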
Sounds good to me, @axif0 :) The difference will only grow over time 😊 Can you make the switch over to `orjson`?
@andrewtavis Really sorry for the extra commits. Tried to optimize the PR as much as possible. Will add and update the test cases after the merge. :)
src/scribe_data/cli/total.py
Outdated
    The local Wikidata dump that can be used to process data.
wikidata_dump : Union[str, bool]
    The local Wikidata dump path that can be used to process data.
    If True, indicates the flag was used without a path.
Thanks for the care you're putting into the doc strings, @axif0! :)
I have some very minor local changes, @axif0 :) Will send those along later and then merge it in 😊
Thanks for checking, and looking forward to the feedback and changes.
Added functions to parse the translations of a word from the MediaWiki API.
Hope it doesn't conflict with your changes. Also, I'm not so sure about the cmd output. ~ @andrewtavis
Seeing in the description that it does close it. Really, thanks so much for the amazing work, @axif0! 🚀👏👏
Yes, thank you!
Lots of changes in the commit above, @axif0, but really don't stress on a lot of it. One thing that I realized is that we technically weren't following the numpy doc formatting as all of our parameters and return statements were indented. I saw in some of your code that you didn't indent, and then investigated further and saw that that's actually correct 😊 I'll go through my commit and leave some comments for you for learning purposes! :)
CONTRIBUTING.md
Outdated
@@ -300,13 +300,18 @@ def example_function(argument: argument_type) -> return_type:
Parameters
----------
    argument: argument_type
        Description of your argument.
argument: argument_type
This is what I'm talking about for the parameter indentation, so just disregard all of this :) Glad we're using the correct formatting 😊
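A minimal sketch of the non-indented numpydoc layout being discussed (the function and its parameter are invented for illustration, not from the PR):

```python
def example_function(argument: int) -> str:
    """
    Short summary of what the function does.

    Parameters
    ----------
    argument : int
        The parameter name sits flush with the section header; only its
        description is indented.

    Returns
    -------
    str
        Description of the return value.
    """
    return str(argument)
```

Tools like Sphinx's napoleon extension parse this layout directly, which is one reason to keep the section headers and indentation exact.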
@@ -27,8 +27,8 @@
from pathlib import Path

import pkg_resources
import questionary
I decided to always import `questionary`, as `select` and `confirm` are methods that are super general sounding and really could be from anything, so I figured making them explicit would be good for readability.
Examples
--------
> is_valid_language(Path("path/to/query.sparql"), "Q123456")
True
The section for numpy docs is technically `Examples` (reference). Just marking out the changes! Nothing you did wrong :)
@@ -159,7 +159,7 @@ def prompt_user_download_all():
else:
    print("Updating all languages and data types...")
    rprint(
        "[bold red]Note that the download all functionality must use Wikidata dumps to observe responsible Wikidata Query Service usage practices.[/bold red]"
        "[bold red]Note that the download all functionality must use Wikidata lexeme dumps to observe responsible Wikidata Query Service usage practices.[/bold red]"
Decided to be explicit here with the dumps as there are Wikidata dumps that are all of Wikidata, but we're using the lexeme subset :) Again just a note on the change! 😊
            "qid": sub_data.get("qid", ""),
        }
    )
current_languages.extend(
We can use `extend` here to avoid the `for`/`append` loop and clean up the code a bit :)
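The `extend` suggestion, shown on a toy example (the variable names are illustrative, not the PR's actual data):

```python
sub_languages = [
    {"language": "bokmål", "qid": "Q25167"},
    {"language": "nynorsk", "qid": "Q25164"},
]

# Before: an explicit for/append loop.
current_languages = []
for lang in sub_languages:
    current_languages.append(lang)

# After: extend() consumes the whole iterable in one call.
current_languages_extended = []
current_languages_extended.extend(sub_languages)

assert current_languages == current_languages_extended
```

Beyond brevity, `extend` states the intent (append everything) in one call instead of spreading it over a loop body.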
@@ -661,17 +660,17 @@ def check_lexeme_dump_prompt_download(output_dir: str):
        ],
    ).ask()

    if user_input.startswith("Delete"):
    if user_input == "Delete existing dumps":
I figured being a bit more explicit made sense here :)
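A small illustration of why the exact comparison is safer than a prefix check (the second menu option here is hypothetical, invented to show the failure mode):

```python
# Two hypothetical menu options that share a prefix: startswith() cannot
# tell them apart, while an exact equality check matches only one.
options = ["Delete existing dumps", "Delete cached queries"]

user_input = options[0]

assert user_input.startswith("Delete")        # true for either option
assert user_input == "Delete existing dumps"  # true for exactly this option
```

If another "Delete …" option is ever added to the prompt, the `==` version keeps working unchanged, whereas the `startswith` version would silently match both.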
@@ -727,6 +730,7 @@ def check_index_exists(index_path: Path, overwrite_all: bool = False) -> bool:
        default="Skip process",
    ).ask()

    # If user selects "Skip process", return True meaning "don't proceed"
    # If user selects "Skip process", return True meaning "don't proceed".
Let's remember to put periods at the end of comments that are their own line :)
iso_code = data.get("iso")
if iso_code:

if iso_code := data.get("iso"):
There were multiple cases where we can use this "Walrus Operator"/Assignment Expression to simplify the code :)
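The assignment expression in a self-contained form (the sample dict is made up, mirroring the `data.get("iso")` pattern from the diff):

```python
data = {"name": "English", "iso": "en"}

# Before: assign first, then test the result.
iso_code = data.get("iso")
if iso_code:
    print(iso_code)

# After: the walrus operator binds and tests in a single step, so the
# variable exists only where it is actually used.
if iso_code := data.get("iso"):
    print(iso_code)
```

Both branches behave identically when the key is missing too: `data.get` returns `None`, which is falsy, so neither body runs.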
@@ -203,29 +207,29 @@ def _process_lexeme_forms(self, lexeme: dict) -> None:

for rep_lang, rep_data in representations.items():
    if rep_lang == lang_code:
        form_value = rep_data.get("value")
        if form_value:
        if form_value := rep_data.get("value"):
Another assignment expression :) You can install the sourcery VS Code extension to get in line warnings to fix these 🪄
@@ -286,7 +288,7 @@ def process_lines(self, line: str) -> None:
    )
    self.forms_counts[lang_code][category_name] += len(forms_data)

    break  # Only process first valid lemma
    break  # only process first valid lemma
Inline comments shouldn't be capitalized :)
filtered = {
Similarly simplifying the `for`/`if` loop to just be an assignment. I took this also from a sourcery suggestion :)
Amazing work as always, @axif0 🤩🚀 Thanks so much for keeping at making new great features during a bit of a slow review period for me :) Really appreciate your drive to succeed 😊
Let's try to get a few more of these Scribe-Data issues down and then we can start doing some testing of all the functionality together in a call!
Please check the comments above, btw, but these are more for learning opportunities :)
Contributor checklist
Confirm that the changes pass the tests with the pytest command as directed in the testing section of the contributing guide

Description
Total
If the user doesn't give -wdp, then it will use query.

Additional: scribe-data total -a -wdp [wiki dump directory], and if we want to get a specific language then we can add --language English
Translation
Additional: scribe-data get -dt translations -wdp [wiki dump directory] -od [output dir for exported JSON]

We can add -l [language], -dt translations, -wdp [wiki dump directory] and -od [output dir for exported JSON].
I tried multi-threading as we discussed, but it took too much time, so I instead increased batch_size to 50000, with which it takes approximately <250 seconds. This speeds up file parsing by reading and processing lines in batches (e.g., 50,000 lines at a time). This way, fewer I/O operations occur, and the parser's internal state updates more efficiently with each chunk before moving on, rather than for every single line.
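The batching idea described above can be sketched as follows. This is a generic helper, not the PR's actual parser; the function name, the `dump.jsonl` path, and the `process` callback are all placeholders.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(lines: Iterable[str], batch_size: int = 50_000) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` lines from an iterable of lines."""
    iterator = iter(lines)
    # islice pulls the next chunk lazily; the walrus stops the loop once
    # the source is exhausted and the chunk comes back empty.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Usage sketch: update parser state once per chunk instead of per line.
# with open("dump.jsonl", encoding="utf-8") as f:
#     for batch in iter_batches(f, batch_size=50_000):
#         process(batch)
```

Because file objects are themselves line iterators, this drops in without reading the whole dump into memory; only one batch is resident at a time.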
Apologies.
Related issue
total (t)
CLI functionality to be ran using a Wikidata lexemes dump #520