
Feat: Translation cmd for scribe-data #536

Merged · 16 commits · Jan 4, 2025

Conversation

@axif0 (Collaborator) commented Dec 28, 2024

Contributor checklist


Description

Total

If the user doesn't give -wdp, the command will use queries instead.

scribe-data total -a -wdp

Additional: scribe-data total -a -wdp [wiki dump directory]; if we want a specific language, we can add --language English.
[screenshot]
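
For illustration, the -wdp fallback described above could be dispatched roughly as below. This is only a sketch, and the helper functions are hypothetical placeholders rather than Scribe-Data's real internals.

from typing import Union

def parse_dump_totals(dump_path, all_languages):
    # hypothetical placeholder: would stream the local lexeme dump and count entries
    return {}

def query_totals_via_sparql(all_languages):
    # hypothetical placeholder: would query the Wikidata Query Service instead
    return {}

def get_totals(all_languages: bool, wikidata_dump: Union[str, bool, None] = None):
    """Mirror the -wdp behaviour: use a local dump if the flag was given, else query."""
    if wikidata_dump:
        # -wdp was passed, with or without a directory: process the local dump
        return parse_dump_totals(wikidata_dump, all_languages)
    # no -wdp: fall back to querying Wikidata
    return query_totals_via_sparql(all_languages)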

Translation

scribe-data get -dt translations

Additional:
scribe-data get -dt translations -wdp [wiki dump directory] -od [output dir for exported JSON]
We can add -l [language], -dt translations, -wdp [wiki dump directory], and -od [output dir for exported JSON], e.g.:

scribe-data get -l bengali -dt translations -wdp dump_path -od exported_json

[screenshot]

I tried multi-threading as we discussed, but it still took a long time, so I instead increased batch_size to 50,000, which brings the run down to roughly under 250 seconds. Batching speeds up file parsing by reading and processing lines in chunks (e.g., 50,000 lines at a time): fewer I/O operations occur, and the parser's internal state updates with each chunk before moving on, rather than for every single line.
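
As a rough sketch of the batching idea (not the exact implementation; the parser object and file path are stand-ins), reading the dump in slices of 50,000 lines looks like this:

from itertools import islice

def parse_in_batches(file_path: str, parser, batch_size: int = 50_000) -> None:
    # pull lines from the file in chunks of batch_size, then hand each chunk to the parser
    with open(file_path, encoding="utf-8") as file:
        while True:
            batch = list(islice(file, batch_size))
            if not batch:
                break
            for line in batch:
                parser.process_lines(line)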

Apologies.

Related issue


github-actions bot commented Dec 28, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@axif0 changed the title from "Feat" to "Feat: Translation cmd for scribe-data" on Dec 28, 2024
@axif0 (Collaborator, Author) commented Dec 28, 2024

I used orjson as it is faster than json (ref: GeeksforGeeks).

Not sure it's actually worth it, though.

@andrewtavis (Member)

Looks like orjson needs to be added into the requirements? Ah just seeing that you removed it :)

Do you want to do a time benchmark to check what we're talking about for speed? I don't see a problem with adding a well-maintained library to the requirements if we get enough of a performance boost from it. It's less time for us to test it as well, but it would be nice to know the difference 😊

@axif0 (Collaborator, Author) commented Dec 28, 2024

Yes, I can do a time benchmark with and without orjson.

Also, for the get all command, I need feedback on forms, so the all_bool function might change. I have added the questionary library for message input in get.py.

The recent changes might therefore fail the tests in test_get.py.

@andrewtavis (Member)

All good! Take your time on this and we'll figure it out 😊 Looking forward to the benchmark!

@axif0 (Collaborator, Author) commented Dec 28, 2024

For total:

With orjson: [screenshot]

With json: [screenshot]

For translations:

With orjson: [screenshot]

With json: [screenshot]

I think we should go for orjson.. 😅
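
For anyone who wants to reproduce the comparison locally, a minimal timing harness along these lines works; the sample payload below is made up rather than real Scribe-Data output.

import json
import timeit

import orjson

# a made-up lexeme-like payload purely for timing
payload = json.dumps(
    [
        {"id": f"L{i}", "lemmas": {"en": {"value": f"word{i}", "language": "en"}}}
        for i in range(1_000)
    ]
)

json_time = timeit.timeit(lambda: json.loads(payload), number=1_000)
orjson_time = timeit.timeit(lambda: orjson.loads(payload), number=1_000)

print(f"json:   {json_time:.3f}s")
print(f"orjson: {orjson_time:.3f}s")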

@andrewtavis (Member)

Sounds good to me, @axif0 :) The difference will only grow over time 😊

Can you make the switch over to orjson, add it into the requirements and also fix the tests? From there let me know and I'll start a final review 😊

@axif0 (Collaborator, Author) commented Dec 29, 2024

Added new commands.

Apart from total, this command gets all translations and all data types:

scribe-data get --all -wdp scribe -od data

Centralized interactive command:

scribe-data i

[screenshot]

@axif0 (Collaborator, Author) commented Dec 29, 2024

@andrewtavis Really sorry for the extra commits; I tried to optimize the PR as much as possible.

Will add and update the test cases after the merge. :)

The local Wikidata dump that can be used to process data.
wikidata_dump : Union[str, bool]
The local Wikidata dump path that can be used to process data.
If True, indicates the flag was used without a path.
Member:

Thanks for the care you're putting into the doc strings, @axif0! :)
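
For context on the Union[str, bool] typing above, one common way to wire up a flag that may be passed with or without a path is argparse's nargs="?" with const=True. This is only a sketch; the option names here are assumptions rather than Scribe-Data's exact definitions.

import argparse

parser = argparse.ArgumentParser(prog="scribe-data")
parser.add_argument(
    "-wdp",
    "--wikidata-dump",
    nargs="?",     # the flag may appear with or without a value
    const=True,    # bare -wdp -> True (use the default dump location)
    default=None,  # flag omitted -> None (fall back to queries)
)

print(parser.parse_args(["-wdp"]).wikidata_dump)                   # True
print(parser.parse_args(["-wdp", "scribe_dumps/"]).wikidata_dump)  # scribe_dumps/
print(parser.parse_args([]).wikidata_dump)                         # None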

@andrewtavis (Member)

I have some very minor local changes @axif0 :) Will send those along later and then merge it in 😊

@axif0 (Collaborator, Author) commented Jan 3, 2025

Thanks for checking, and looking forward to the feedback and changes.

@axif0 (Collaborator, Author) commented Jan 3, 2025

Added functions to parse the translations of a word from the MediaWiki API.

scribe-data get -t book

[screenshot]

Hope it doesn't cause conflicts for you. I'm also not so sure about the command output. ~ @andrewtavis
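
As a rough illustration of the idea (not the actual Scribe-Data helper), pulling a word's Wiktionary wikitext through the MediaWiki API and scanning its translation templates could look like this:

import re

import requests

def fetch_translation_templates(word: str) -> list[str]:
    # request the raw wikitext of the English Wiktionary entry for the word
    response = requests.get(
        "https://en.wiktionary.org/w/api.php",
        params={"action": "parse", "page": word, "prop": "wikitext", "format": "json"},
        timeout=30,
    )
    wikitext = response.json()["parse"]["wikitext"]["*"]

    # {{t|...}} and {{t+|...}} are the templates Wiktionary uses to mark translations
    return re.findall(r"\{\{t\+?\|[^}]+\}\}", wikitext)

print(fetch_translation_templates("book")[:5])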

@andrewtavis (Member)

Nice, so this closes #526 now too, @axif0? :) No stress! I'll figure out the conflicts if there are some and bring it in :)

@andrewtavis (Member)

Seeing in the description that it does close it. Thanks so much for the amazing work, @axif0! 🚀👏👏

@axif0 (Collaborator, Author) commented Jan 3, 2025

Yes, thank you!

@andrewtavis (Member)

Lots of changes in the commit above, @axif0, but really don't stress about most of it. One thing I realized is that we technically weren't following the numpy doc formatting, as all of our parameters and return statements were indented. I saw in some of your code that you didn't indent, investigated further, and saw that that's actually correct 😊

I'll go through my commit and leave some comments for you for learning purposes! :)

CONTRIBUTING.md Outdated
@@ -300,13 +300,18 @@ def example_function(argument: argument_type) -> return_type:
Parameters
----------
argument: argument_type
Description of your argument.
argument: argument_type
Member:

This is what I'm talking about for the parameter indentation, so just disregard all of this :) Glad we're using the correct formatting 😊

@@ -27,8 +27,8 @@
from pathlib import Path

import pkg_resources
import questionary
Member:

I decided to always import questionary as a module, since select and confirm are very general-sounding method names that could come from anything, so making the calls explicit is good for readability.
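
A quick illustration of the readability point; the prompt text here is made up.

import questionary

# questionary.select(...) reads unambiguously at the call site,
# unlike a bare select(...) pulled in via from questionary import select
data_type = questionary.select(
    "Which data type would you like to download?",
    choices=["nouns", "verbs", "translations"],
).ask()

proceed = questionary.confirm("Use the local Wikidata lexeme dump?").ask()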

-------
> is_valid_language(Path("path/to/query.sparql"), "Q123456")
True
Examples
Member:

The section for numpy docs is technically Examples (reference). Just marking out the changes! Nothing you did wrong :)

@@ -159,7 +159,7 @@ def prompt_user_download_all():
else:
print("Updating all languages and data types...")
rprint(
"[bold red]Note that the download all functionality must use Wikidata dumps to observe responsible Wikidata Query Service usage practices.[/bold red]"
"[bold red]Note that the download all functionality must use Wikidata lexeme dumps to observe responsible Wikidata Query Service usage practices.[/bold red]"
Member:

Decided to be explicit here with the dumps as there are Wikidata dumps that are all of Wikidata, but we're using the lexeme subset :) Again just a note on the change! 😊

"qid": sub_data.get("qid", ""),
}
)
current_languages.extend(
Member:

We can use extend here to avoid the for/append loop and clean up the code a bit :)
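
In other words, with illustrative data only:

sub_languages = [{"name": "Bengali", "qid": "Q9610"}, {"name": "English", "qid": "Q1860"}]

# before: building the list one element at a time
current_languages = []
for sub_data in sub_languages:
    current_languages.append(sub_data.get("name", ""))

# after: a single extend call over a generator expression
current_languages = []
current_languages.extend(sub_data.get("name", "") for sub_data in sub_languages)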

@@ -661,17 +660,17 @@ def check_lexeme_dump_prompt_download(output_dir: str):
],
).ask()

if user_input.startswith("Delete"):
if user_input == "Delete existing dumps":
Member:

I figured being a bit more explicit made sense here :)

@@ -727,6 +730,7 @@ def check_index_exists(index_path: Path, overwrite_all: bool = False) -> bool:
default="Skip process",
).ask()

# If user selects "Skip process", return True meaning "don't proceed"
# If user selects "Skip process", return True meaning "don't proceed".
Member:

Let's remember to put periods at the end of comments that are their own line :)

iso_code = data.get("iso")
if iso_code:

if iso_code := data.get("iso"):
Member:

There were multiple cases where we can use this "Walrus Operator"/Assignment Expression to simplify the code :)
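
For anyone reading along, the before/after shape of the change is:

data = {"iso": "bn"}

# before: assign, then test
iso_code = data.get("iso")
if iso_code:
    print(iso_code)

# after: the assignment expression (Python 3.8+) binds and tests in one step
if iso_code := data.get("iso"):
    print(iso_code)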

@@ -203,29 +207,29 @@ def _process_lexeme_forms(self, lexeme: dict) -> None:

for rep_lang, rep_data in representations.items():
if rep_lang == lang_code:
form_value = rep_data.get("value")
if form_value:
if form_value := rep_data.get("value"):
Member:

Another assignment expression :) You can install the Sourcery VS Code extension to get in-line warnings to fix these 🪄

@@ -286,7 +288,7 @@ def process_lines(self, line: str) -> None:
)
self.forms_counts[lang_code][category_name] += len(forms_data)

break # Only process first valid lemma
break # only process first valid lemma
Member:

Inline comments shouldn't be capitalized :)


filtered = {
Member:

Similarly simplifying the for/if loop to just be an assignment. I took this also from a sourcery suggestion :)
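
As a generic example of that kind of simplification (the data here is made up):

translations = {"de": "Buch", "fr": "", "bn": "বই"}

# before: a for/if loop filling the dict entry by entry
filtered = {}
for lang, value in translations.items():
    if value:
        filtered[lang] = value

# after: one dict comprehension assigned directly to filtered
filtered = {lang: value for lang, value in translations.items() if value}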

@andrewtavis (Member) left a comment

Amazing work as always, @axif0 🤩🚀 Thanks so much for continuing to build great new features during a bit of a slow review period for me :) Really appreciate your drive to succeed 😊

Let's try to get a few more of these Scribe-Data issues down and then we can start doing some testing of all the functionality together in a call!

@andrewtavis (Member)

Please check the comments above, btw, but these are more for learning opportunities :)
