
Allow big seeds #447

Merged
merged 6 commits into from
Oct 1, 2024
Conversation

jausanca
Contributor

resolves #446

Description

This pull request allows uploading seeds whose serialized length exceeds 68,000 characters. It does so by splitting the CSV records into chunks, which are appended to a session variable across separate statement executions.
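A minimal sketch of the splitting idea described above (the helper name and the exact size limit are assumptions for illustration; the actual adapter code differs):

```python
# Hypothetical sketch of the chunking approach: split the serialized CSV on
# record boundaries so each chunk stays under an assumed per-statement limit.
MAX_STATEMENT_SIZE = 60_000  # assumed; below the ~68,000-character ceiling

def split_csv(serialized_csv: str, chunk_size: int = MAX_STATEMENT_SIZE) -> list:
    """Split on line boundaries so no chunk cuts a CSV record in half."""
    chunks = []
    current = []
    current_len = 0
    for record in serialized_csv.splitlines(keepends=True):
        # flush the current chunk before this record would push it over the limit
        if current and current_len + len(record) > chunk_size:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(record)
        current_len += len(record)
    if current:
        chunks.append("".join(current))
    return chunks
```

Each chunk would then be sent as its own statement that appends to a session-level variable, which is what the PR does.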

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-glue next" section.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@moomindani moomindani self-assigned this Sep 25, 2024
@moomindani moomindani added the enable-functional-tests label Sep 25, 2024
for i, csv_chunk in enumerate(csv_chunks):
    is_first = i == 0
    is_last = i == len(csv_chunks) - 1
    code = "custom_glue_code_for_dbt_adapter\n"
Collaborator

Do we need to have this per chunk?

Contributor Author

Yes. The way this works is that it slices the data across multiple statement executions, appending each chunk to an array, with one execute per chunk.
Since the cursor implementation distinguishes Python code from SQL code by checking whether the code contains "custom_glue_code_for_dbt_adapter" (and otherwise wraps it in the SqlWrapper), the marker is needed on every chunk so that each one is executed as Python.
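The dispatch rule described here can be illustrated with a small sketch (not the actual dbt-glue cursor; the marker string comes from the PR, while the function name is hypothetical):

```python
MARKER = "custom_glue_code_for_dbt_adapter"

def prepare_statement(code: str) -> str:
    """Statements carrying the marker run as raw Python; anything else is
    treated as SQL and wrapped before execution."""
    if MARKER in code:
        return code  # executed as-is, as Python
    # hedged stand-in for the adapter's SQL wrapping step
    return 'SqlWrapper2.execute("""%s""")' % code
```

This is why each chunk in the loop above starts with the marker line: without it, the chunk would be wrapped and sent as SQL.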

Collaborator

Makes sense

"""
if not is_last:
code += f'''
SqlWrapper2.execute("""select 1""")
Collaborator

What is this for?

Contributor Author

That's so the cursor's execute can retrieve a response. Otherwise it breaks when retrieving the result in

            if self.connection.use_arrow:
                result_bucket = self.response.get("result_bucket")
                result_key = self.response.get("result_key")
                if result_bucket and result_key:
                    pdf = get_pandas_dataframe_from_result_file(result_bucket, result_key)
                    self.result = pdf.to_dict('records')[0]

or on

                        chunks = output.get("Data", {}).get("TextPlain", None).strip().split('\n')
                        logger.debug(f"chunks: {chunks}")
                        self.response = json.loads(chunks[0])

I noticed this being handled in the same way, with SqlWrapper2.execute("""select 1"""), in other parts of impl.py.
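Tying the two snippets above together: the cursor json-decodes the first line of the statement's plain-text output. A sketch (with an assumed helper name) of why every chunk must end with a statement that produces a JSON response:

```python
import json

def parse_response(text_plain: str) -> dict:
    """Mirror of the snippet above: the first line of the statement's
    TextPlain output must be a JSON document, or json.loads raises."""
    chunks = text_plain.strip().split("\n")
    return json.loads(chunks[0])
```

A chunk whose final statement prints nothing would leave that output empty and json.loads would fail, which is why each chunk ends with SqlWrapper2.execute("""select 1""").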

Collaborator

ok

@@ -86,3 +88,15 @@ def test_get_table_type(self):
connection = adapter.acquire_connection("dummy")
connection.handle # trigger lazy-load
self.assertEqual(adapter.get_table_type(target_relation), "iceberg_table")

def test_create_csv_table_slices_big_datasets(self):
Collaborator

Can we add another test that uses custom_glue_code_for_dbt_adapter with a big seed, to verify that we do not introduce a breaking change from the previous version?

Contributor Author

Sorry, I don't get what you mean. What would be the test scenario?

Collaborator

Ah, it was my bad; this test case already covers the required scenario.
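For readers, a self-contained sketch of the kind of assertion such a chunking test can make (mocked cursor, assumed 60,000-character limit, hypothetical class and test names; the PR's actual test differs):

```python
import unittest
from unittest import mock

class TestBigSeedChunking(unittest.TestCase):
    """Hypothetical test shape: verify a big seed results in multiple
    statement executions, each tagged with the Python marker."""

    def test_big_seed_triggers_multiple_executes(self):
        cursor = mock.Mock()
        data = "x" * 200_000   # seed larger than one statement allows
        chunk_size = 60_000    # assumed per-statement limit
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        for chunk in chunks:
            cursor.execute("custom_glue_code_for_dbt_adapter\n" + chunk)
        # 200,000 characters at 60,000 per chunk yields 4 executes
        self.assertEqual(cursor.execute.call_count, 4)
        for call in cursor.execute.call_args_list:
            self.assertIn("custom_glue_code_for_dbt_adapter", call.args[0])
```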


@moomindani moomindani added and removed the enable-functional-tests label Oct 1, 2024
@moomindani moomindani merged commit abf8c52 into aws-samples:main Oct 1, 2024
19 checks passed
@moomindani
Collaborator

Thank you for your contribution!

Labels: beginning-contributor, enable-functional-tests
Successfully merging this pull request may close these issues.

dbt seed failing when loading big seed files