Bash Command ENTRYPOINT Expects train argument #65
This is SageMaker's contract: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html. Unfortunately, there's not a way to change that. What you can do is create a script that responds to |
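A minimal sketch of such a script in Python, assuming it is set as the image's ENTRYPOINT so that `train` arrives as its first argument. `REAL_COMMAND` is a hypothetical placeholder for whatever you actually want to run; it is not part of SageMaker or the toolkit.

```python
import subprocess
import sys

# Assumption: placeholder for the command you actually want to run.
REAL_COMMAND = ["echo", "running the real training command"]


def build_command(argv):
    """Return the command to run for the given container arguments."""
    if not argv or argv[0] != "train":
        raise SystemExit(f"unsupported arguments: {argv!r} (expected 'train')")
    # Anything after `train` is forwarded unchanged.
    return REAL_COMMAND + list(argv[1:])


def main(argv=None):
    """Dispatch on the first argument and run the real command."""
    argv = sys.argv[1:] if argv is None else argv
    # Return the exit code so that a failing command fails the job.
    return subprocess.call(build_command(argv))
```

Calling `sys.exit(main())` under the usual `if __name__ == "__main__":` guard would turn this into a usable entry point. Rejecting anything other than `train` is a design choice here, so that an unexpected invocation fails loudly instead of silently training.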
Thanks for the help. I added an alias within my cli tool to respond to the
However the hyperparameters I have specified when creating the |
tl;dr: The hyperparameters should be available. The SageMaker service makes them available in a hyperparameters.json file and, if you've used the sagemaker-training-toolkit, they are read in and made available as environment variables to your script/entry point.

A deeper story: Following the call path of
On the SageMaker service side, the breadcrumbs lead us through the docs:
In the sagemaker-training-toolkit code:
|
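For an entry point that does not use the toolkit, a minimal sketch of reading that file directly might look like the following. It assumes the documented location `/opt/ml/input/config/hyperparameters.json` and that the service hands every value over as a string; `load_hyperparameters` is a hypothetical helper name, not a toolkit function.

```python
import json
from pathlib import Path

# Documented location of the file inside a SageMaker training container.
HYPERPARAMETERS_PATH = Path("/opt/ml/input/config/hyperparameters.json")


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Read the hyperparameters file and decode each value where possible."""
    raw = json.loads(Path(path).read_text())
    decoded = {}
    for key, value in raw.items():
        try:
            # Values arrive as strings, so "10" is decoded to the integer 10.
            decoded[key] = json.loads(value)
        except (TypeError, ValueError):
            # Not valid JSON (for example: some text); keep the raw value.
            decoded[key] = value
    return decoded
```

Falling back to the raw value when decoding fails is an assumption chosen so that plain strings do not need to be JSON-quoted by the caller.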
@metrizable thanks for the explanation. I have to remark that the documentation isn't the clearest, and there really isn't an example demonstrating this functionality with a bash entry point. I've set up another minimal example to test the functionality, and it isn't behaving as expected. I have set the entrypoint to the
I set up my
hyperparams = {'test': 10, 'a': 50, 'b': 'some text'}
estimator = Estimator(
image_name=image,
role=iam_role,
output_path=f"s3://{aws_params['SCW_S3_BUCKET']}/sagemaker/output/",
train_instance_count=instance_count,
input_mode='File',
train_instance_type='local',
tags=TAGS,
subnets=aws_params['VPC_SUBNETS'],
security_group_ids=aws_params['VPC_SGS'],
output_kms_key=aws_params['SCW_KMS_KEY'],
hyperparameters=hyperparams
)
estimator.fit()

Then I build my Docker container and run the estimator, and the output is the following:

2020-07-03 11:42:09,057 - sagemaker.local.image - INFO - docker command: docker-compose -f /private/var/folders/hb/qlcnb3ps2gz4v75__n9jws_40000gp/T/tmp4hke8hn_/docker-compose.yaml up --build --abort-on-container-exit
Creating tmp4hke8hn__algo-1-9jeh2_1 ... done
Attaching to tmp4hke8hn__algo-1-9jeh2_1
algo-1-9jeh2_1 | train
tmp4hke8hn__algo-1-9jeh2_1 exited with code 0
Aborting on container exit...
2020-07-03 11:42:10,771 - sagemaker - WARNING - 'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.
===== Job Complete =====

I was expecting the hyperparameters to be printed to the terminal using the echo command but it just prints the
However, if I modify my Dockerfile and set the
where
#!/usr/bin/env bash
# Print every argument passed to the script, one per line.
echo "Inside test script"
for i; do
    echo "$i"
done

I build my container again and run the
2020-07-03 11:11:42,361 - sagemaker.local.image - INFO - docker command: docker-compose -f /private/var/folders/hb/qlcnb3ps2gz4v75__n9jws_40000gp/T/tmpcfh9sq30/docker-compose.yaml up --build --abort-on-container-exit
Creating tmpcfh9sq30_algo-1-8evcu_1 ... done
Attaching to tmpcfh9sq30_algo-1-8evcu_1
algo-1-8evcu_1 | 2020-07-03 10:11:43,914 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
algo-1-8evcu_1 | 2020-07-03 10:11:43,926 sagemaker-training-toolkit INFO Failed to parse hyperparameter b value some text to Json.
algo-1-8evcu_1 | Returning the value itself
algo-1-8evcu_1 | 2020-07-03 10:11:43,951 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
algo-1-8evcu_1 | 2020-07-03 10:11:43,975 sagemaker-training-toolkit INFO Failed to parse hyperparameter b value some text to Json.
algo-1-8evcu_1 | Returning the value itself
algo-1-8evcu_1 | 2020-07-03 10:11:43,998 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
algo-1-8evcu_1 | 2020-07-03 10:11:44,025 sagemaker-training-toolkit INFO Failed to parse hyperparameter b value some text to Json.
algo-1-8evcu_1 | Returning the value itself
algo-1-8evcu_1 | 2020-07-03 10:11:44,048 sagemaker-training-toolkit INFO Invoking user script
algo-1-8evcu_1 |
algo-1-8evcu_1 | Training Env:
algo-1-8evcu_1 |
algo-1-8evcu_1 | {
algo-1-8evcu_1 | "additional_framework_parameters": {},
algo-1-8evcu_1 | "channel_input_dirs": {},
algo-1-8evcu_1 | "current_host": "algo-1-8evcu",
algo-1-8evcu_1 | "framework_module": null,
algo-1-8evcu_1 | "hosts": [
algo-1-8evcu_1 | "algo-1-8evcu"
algo-1-8evcu_1 | ],
algo-1-8evcu_1 | "hyperparameters": {
algo-1-8evcu_1 | "test": 10,
algo-1-8evcu_1 | "a": 50,
algo-1-8evcu_1 | "b": "some text"
algo-1-8evcu_1 | },
algo-1-8evcu_1 | "input_config_dir": "/opt/ml/input/config",
algo-1-8evcu_1 | "input_data_config": {},
algo-1-8evcu_1 | "input_dir": "/opt/ml/input",
algo-1-8evcu_1 | "is_master": true,
algo-1-8evcu_1 | "job_name": job_name,
algo-1-8evcu_1 | "log_level": 20,
algo-1-8evcu_1 | "master_hostname": "algo-1-8evcu",
algo-1-8evcu_1 | "model_dir": "/opt/ml/model",
algo-1-8evcu_1 | "module_dir": "/opt/ml/code",
algo-1-8evcu_1 | "module_name": "test.sh",
algo-1-8evcu_1 | "network_interface_name": "eth0",
algo-1-8evcu_1 | "num_cpus": 2,
algo-1-8evcu_1 | "num_gpus": 0,
algo-1-8evcu_1 | "output_data_dir": "/opt/ml/output/data",
algo-1-8evcu_1 | "output_dir": "/opt/ml/output",
algo-1-8evcu_1 | "output_intermediate_dir": "/opt/ml/output/intermediate",
algo-1-8evcu_1 | "resource_config": {
algo-1-8evcu_1 | "current_host": "algo-1-8evcu",
algo-1-8evcu_1 | "hosts": [
algo-1-8evcu_1 | "algo-1-8evcu"
algo-1-8evcu_1 | ]
algo-1-8evcu_1 | },
algo-1-8evcu_1 | "user_entry_point": "test.sh"
algo-1-8evcu_1 | }
algo-1-8evcu_1 |
algo-1-8evcu_1 | Environment variables:
algo-1-8evcu_1 |
algo-1-8evcu_1 | SM_HOSTS=["algo-1-8evcu"]
algo-1-8evcu_1 | SM_NETWORK_INTERFACE_NAME=eth0
algo-1-8evcu_1 | SM_HPS={"a":50,"b":"some text","test":10}
algo-1-8evcu_1 | SM_USER_ENTRY_POINT=test.sh
algo-1-8evcu_1 | SM_FRAMEWORK_PARAMS={}
algo-1-8evcu_1 | SM_RESOURCE_CONFIG={"current_host":"algo-1-8evcu","hosts":["algo-1-8evcu"]}
algo-1-8evcu_1 | SM_INPUT_DATA_CONFIG={}
algo-1-8evcu_1 | SM_OUTPUT_DATA_DIR=/opt/ml/output/data
algo-1-8evcu_1 | SM_CHANNELS=[]
algo-1-8evcu_1 | SM_CURRENT_HOST=algo-1-8evcu
algo-1-8evcu_1 | SM_MODULE_NAME=test.sh
algo-1-8evcu_1 | SM_LOG_LEVEL=20
algo-1-8evcu_1 | SM_FRAMEWORK_MODULE=
algo-1-8evcu_1 | SM_INPUT_DIR=/opt/ml/input
algo-1-8evcu_1 | SM_INPUT_CONFIG_DIR=/opt/ml/input/config
algo-1-8evcu_1 | SM_OUTPUT_DIR=/opt/ml/output
algo-1-8evcu_1 | SM_NUM_CPUS=2
algo-1-8evcu_1 | SM_NUM_GPUS=0
algo-1-8evcu_1 | SM_MODEL_DIR=/opt/ml/model
algo-1-8evcu_1 | SM_MODULE_DIR=/opt/ml/code
algo-1-8evcu_1 | SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1-8evcu","framework_module":null,"hosts":["algo-1-8evcu"],"hyperparameters":{"a":50,"b":"some text","test":10},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"a204311-kedro-sagemaker-example-2020-07-03-11-11-42-11S","log_level":20,"master_hostname":"algo-1-8evcu","model_dir":"/opt/ml/model","module_dir":"/opt/ml/code","module_name":"test.sh","network_interface_name":"eth0","num_cpus":2,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1-8evcu","hosts":["algo-1-8evcu"]},"user_entry_point":"test.sh"}
algo-1-8evcu_1 | SM_USER_ARGS=["-a","50","-b","some text","--test","10"]
algo-1-8evcu_1 | SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
algo-1-8evcu_1 | SM_HP_TEST=10
algo-1-8evcu_1 | SM_HP_A=50
algo-1-8evcu_1 | SM_HP_B=some text
algo-1-8evcu_1 | PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages
algo-1-8evcu_1 |
algo-1-8evcu_1 | Invoking script with the following command:
algo-1-8evcu_1 |
algo-1-8evcu_1 | /bin/sh -c ./test.sh -a 50 -b 'some text' --test 10
algo-1-8evcu_1 |
algo-1-8evcu_1 |
algo-1-8evcu_1 | Inside test script
algo-1-8evcu_1 | -a
algo-1-8evcu_1 | 50
algo-1-8evcu_1 | -b
algo-1-8evcu_1 | some text
algo-1-8evcu_1 | --test
algo-1-8evcu_1 | 10
algo-1-8evcu_1 | 2020-07-03 10:11:44,067 sagemaker-training-toolkit INFO Reporting training SUCCESS
tmpcfh9sq30_algo-1-8evcu_1 exited with code 0
Aborting on container exit...
2020-07-03 11:11:44,301 - sagemaker - WARNING - 'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.
===== Job Complete =====

It seems like when using |
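The "Invoking script with the following command" line in the second log suggests how the hyperparameters end up as command-line arguments. The sketch below only reproduces the behaviour visible in that log (sorted names, one dash for single-character names, two dashes otherwise); `to_cmd_args` is a hypothetical helper and should not be read as the toolkit's actual implementation.

```python
import shlex


def to_cmd_args(hyperparameters):
    """Flatten a hyperparameter mapping into a flat argument list."""
    args = []
    for key in sorted(hyperparameters):
        # Single-character names get one dash, longer names get two,
        # matching the `-a 50 -b 'some text' --test 10` line in the log.
        prefix = "-" if len(key) == 1 else "--"
        args.extend([f"{prefix}{key}", str(hyperparameters[key])])
    return args


hyperparams = {"test": 10, "a": 50, "b": "some text"}
args = to_cmd_args(hyperparams)
# Quote each argument for the shell, as the logged command does.
command = "./test.sh " + " ".join(shlex.quote(arg) for arg in args)
print(command)  # ./test.sh -a 50 -b 'some text' --test 10
```

This is why test.sh sees six positional arguments rather than a single JSON blob.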
You get
For the hyperparameters those are by default available in
As per |
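When the toolkit is in play, the log earlier in this thread shows the hyperparameters exposed as `SM_HPS` (one JSON object) and as individual `SM_HP_*` variables. A minimal sketch of reading them, where `hyperparameters_from_env` is a hypothetical helper and the lower-casing of names in the fallback branch is an assumption (the original case is not recoverable from the variable name):

```python
import json
import os


def hyperparameters_from_env(environ=os.environ):
    """Collect hyperparameters from SM_HPS, falling back to SM_HP_* variables."""
    if "SM_HPS" in environ:
        # SM_HPS holds every hyperparameter as one JSON object.
        return json.loads(environ["SM_HPS"])
    prefix = "SM_HP_"
    # Per-parameter variables carry upper-cased names and string values.
    return {
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }
```

Preferring `SM_HPS` keeps the original types (10 stays an integer), whereas the per-variable fallback can only return strings.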
I realize that this issue was last updated over a year ago, but on the off chance that somebody else also stumbles here, I wanted to fill in a gap as to why the container works like it does even if no CMD or ENTRYPOINT is defined. As several people have pointed out, the container is invoked like
When the container setup installs
sagemaker-training-toolkit/setup.py Line 92 in 447e8f3
This essentially creates a shim executable at
So, when the container is defined without an ENTRYPOINT and is invoked with a single argument |
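To illustrate the shim idea, here is a rough sketch assuming the referenced setup.py line is a setuptools `console_scripts` entry point of the form `name=package.module:function`. A generated shim essentially imports the target and calls it. The spec used below is purely hypothetical and points at a standard-library function; the toolkit's real target is whatever its setup.py declares.

```python
import importlib


def resolve_entry_point(spec):
    """Split 'name=package.module:function' and import the target callable."""
    name, target = (part.strip() for part in spec.split("=", 1))
    module_name, attr = target.split(":", 1)
    func = getattr(importlib.import_module(module_name), attr)
    return name, func


# Hypothetical spec for illustration only.
name, func = resolve_entry_point("train=os.path:basename")
print(name, func("/usr/local/bin/train"))
```

With such an entry point installed, an executable called `train` lands on the PATH, which would explain why a container without an ENTRYPOINT can still be started with the single argument `train`.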
@tvoipio Thanks for your comment, and I'm sure lots of poor people will eventually stumble here given the comically dreadful state of SageMaker documentation. Can I ask for clarification on how I currently understand the situation, as well as a question about moving forward:

There are two options for training a container
The Framework estimator is made to run with the

Given the point about Framework estimators: if you had successfully trained a Framework container and wanted to create a transformer from it to do batch inference, how would you run the .transform() method? Would this then not pass |
There is some excellent research by various commenters here which provided great insights into the inner workings of the SageMaker training package. One only wishes it was not this convoluted. Here are some of my findings:

SageMaker does allow you to essentially run a plain vanilla arbitrary script file as a training job without needing the SageMaker training package. See the note here.
args and it should work with your native script file as expected.

But wait a minute, how do I do this through the SageMaker SDK
So the end effect is there is NO way to express

There are other issues. The SageMaker training package does not work in SageMaker Studio because the Python kernel in Studio does not have
To fix the above, what I recommend is to remove

Moral of the story:

And a final caveat (whew!). If you decide to go down the path of implementing your own custom entry point as @uwaisiqbal has done,

I would love for some AWS expert to confirm (or push back) on my findings |
Describe the bug
I would like to create a SageMaker Training Job using a custom Docker container which executes a bash command I have created. I am using the kedro framework to organise and structure my code into pipelines and nodes. I would like to execute my training code with the bash command
For some reason, SageMaker passes train as a default execution parameter.
To reproduce
The following is my Dockerfile:
I am creating and running a sagemaker job with the following code:
When I execute estimator.fit(), I get the following error:
Why does Sagemaker pass a train argument by default to the bash command?
Expected behavior
I would expect the SageMaker job to execute the following bash command within the job: