This document serves as an introduction to generating proficient Amazon Redshift queries. This is a generalized document meaning you will need to replace “schema_name” and “table_name” with the appropriate schema and table names used for your project.
-
-
-
-
-
-
-
-
-
-
\ No newline at end of file
diff --git a/RedshiftGuide.qmd b/RedshiftGuide.qmd
index b7f0e1f..fcd3406 100644
--- a/RedshiftGuide.qmd
+++ b/RedshiftGuide.qmd
@@ -4,4 +4,4 @@ title: "Redshift querying guide"
## ***Introduction*** {#sec-RedShiftGuide}
-This document serves as an introduction to generating proficient Amazon Redshift queries. This is a generalized document meaning you will need to replace “schema_name” and “table_name” with the appropriate schema and table names used for your project.
+This document serves as an introduction to generating proficient Amazon Redshift queries. This is a generalized document, meaning you will need to replace "schema_name" and "table_name" with the appropriate schema and table names used for your project.
diff --git a/_quarto.yml b/_quarto.yml
index 5776c5a..162e86b 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -24,6 +24,7 @@ book:
- support.qmd
- references.qmd
- appendix.qmd
+ - dashboards.qmd
search:
location: sidebar
type: textbox
@@ -52,6 +53,7 @@ format:
toc: true
df-print: paged
number-depth: 1
+
diff --git a/access.qmd b/access.qmd
index fa4d5a8..6a8e1f1 100644
--- a/access.qmd
+++ b/access.qmd
@@ -18,7 +18,7 @@ title: "Obtaining ADRF Access and Account Set Up"
## **Obtaining ADRF Access**
-- Agency-affiliated researcher. If you are an agency-affiliated researcher using an agency-sponsored account, you will be granted ADRF access once you complete your onboarding tasks and required data access agreements. If you are an self-paying agency-affiliated researcher, your ADRF access is conditional on receipt of payment. If your institution of Office of Sponsored Programs will be submitting payment on your behalf, please be aware of potential access delays. Whenever possible, the Coleridge Initiative advises paying with a personal credit card or institutional payment card and using the generated invoice to request reimbursement.
+- Agency-affiliated researcher. If you are an agency-affiliated researcher using an agency-sponsored account, you will be granted ADRF access once you complete your onboarding tasks and required data access agreements. If you are a self-paying agency-affiliated researcher, your ADRF access is conditional on receipt of payment. If your institution's Office of Sponsored Programs will be submitting payment on your behalf, please be aware of potential access delays. Whenever possible, the Coleridge Initiative advises paying with a personal credit card or institutional payment card and using the generated invoice to request reimbursement.
- Individual part of a training program. If you are part of a training program, you will be granted ADRF access once you complete your onboarding tasks and required data access agreements.
diff --git a/appendix.qmd b/appendix.qmd
index 1fc3508..5d9a643 100644
--- a/appendix.qmd
+++ b/appendix.qmd
@@ -50,7 +50,7 @@ GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE schema_name.table_name to "IAM:fir
If you have any questions, please reach out to us at [support\@coleridgeinitiative.org](mailto:support@coleridgeinitiative.org "mailto:support@coleridgeinitiative.org")
-When connecting to the database through SAS, R, Stata, or Python you need to use one of the following DSNs:
+When connecting to the database using an ODBC connection, you need to use one of the following DSNs:
- **Redshift01_projects_DSN**
@@ -77,7 +77,8 @@ quit;
To ensure R can efficiently manage large amounts of data, please add the following lines of code to your R script before any packages are loaded:
``` r
-options(java.parameters = c("-XX:+UseConcMarkSweepGC", "-Xmx8192m")) gc()
+options(java.parameters = c("-XX:+UseConcMarkSweepGC", "-Xmx8192m"))
+gc()
```
**Best practices for writing tables to Redshift**
@@ -120,7 +121,7 @@ identifier.quote="`")
conn <- dbConnect(driver, url, dbusr, dbpswd)
```
-*For the above code to work, please create a file name **.Renviron** in your user folder (user folder is something like i.e. u:\\ John.doe.p00002) And **.Renviron file** should contain the following:*
+*For the above code to work, please create a file named **.Renviron** in your user folder (your user folder is something like u:\\John.doe.p00002). The **.Renviron** file should contain the following:*
``` r
DBUSER='adrf\John.doe.p00002'
@@ -155,7 +156,7 @@ identifier.quote="`")
conn <- dbConnect(driver, url, dbusr, dbpswd)
```
-*For the above code to work, please create a file name **.Renviron** in your user folder (user folder is something like i.e. u:\\ John.doe.p00002) And **.Renviron file** should contain the following:*
+*For the above code to work, please create a file named **.Renviron** in your user folder (your user folder is something like u:\\John.doe.p00002). The **.Renviron** file should contain the following:*
``` r
DBUSER='adrf\John.doe.p00002'
@@ -171,8 +172,7 @@ DBPASSWD='xxxxxxxxxxxx'
``` python
import pyodbc
import pandas as pd
-cnxn = pyodbc.connect('DSN=Redshift01_projects_DSN;
- UID = adrf\\user.name.project; PWD = password')
+cnxn = pyodbc.connect('DSN=Redshift01_projects_DSN; UID = adrf\\user.name.project; PWD = password')
df = pd.read_sql("SELECT * FROM projects.schema_name.table_name", cnxn)
```
@@ -366,7 +366,7 @@ Query optimizers can change the order of the following list, but this general li
*Join tables using the ON keyword.* Although it's possible to "join" two tables using a WHERE clause, use an explicit JOIN. The JOIN + ON syntax distinguishes joins from WHERE clauses intended to filter the results.
-`SET search_path = schema_name;`\-- this statement sets the default schema/database to projects.schema_name
+`SET search_path = schema_name;`-- this statement sets the default schema/database to projects.schema_name
`SELECT A.col_A , B.col_B, B.col_C`
@@ -378,7 +378,7 @@ Query optimizers can change the order of the following list, but this general li
**Avoid**
-`SET search_path = schema_name;`\-- this statement sets the default schema/database to projects.schema_name
+`SET search_path = schema_name;`-- this statement sets the default schema/database to projects.schema_name
`SELECT col_A , col_B, col_C`
@@ -388,7 +388,7 @@ Query optimizers can change the order of the following list, but this general li
**Prefer**
-`SET search_path = schema_name;`\-- this statement sets the default schema/database to projects.schema_name
+`SET search_path = schema_name;`-- this statement sets the default schema/database to projects.schema_name
`SELECT A.col_A , B.col_B, B.col_C`
diff --git a/dashboards.qmd b/dashboards.qmd
new file mode 100644
index 0000000..4dca7ee
--- /dev/null
+++ b/dashboards.qmd
@@ -0,0 +1,63 @@
+---
+title: "Accessing ADRF Dashboards"
+---
+
+If you are a first-time ADRF user, please follow the instructions in the [Onboarding Modules and Security Training](onboarding.qmd) to activate your ADRF account and complete your onboarding tasks.
+
+## 1. Setting your dashboard access password
+
+Once you have completed the management portal onboarding tasks, you will next need to set your **dashboard access password**. This is separate from the first password you use to access the ADRF through Okta, and will instead be used to provide specific access to the dashboard. **You should only need to do this the first time you access the dashboard, but you can always follow these instructions if you need to update or reset your dashboard access password in the future**.
+
+In the Management Portal, again navigate to the “Admin Tasks” page by clicking the link on the sidebar navigation menu:
+
+![](images/db_admin_tasks.png){fig-alt="Dashboard Admin Tasks"}
+
+Click on the “Reset Password” button:
+
+![](images/db_reset_password.png){fig-alt="Reset Password"}
+
+This will load the password reset window:
+
+![](images/db_password_window.png){fig-alt="Password Reset Screen"}
+
+Select the account associated with the dashboard by clicking on the checkbox on the right:
+
+![](images/db_password_reset_project.png){fig-alt="Password Reset Screen Project Selection"}
+
+> Important: Take note of the username associated with your dashboard (John.Doe.P00000 in this example). You will need to enter this username again in the next step. This is also the user name referenced in your onboarding email.
+
+Enter the desired password. The chosen password must adhere to the ADRF password policy:
+
+![](images/db_enter_password.png){fig-alt="Password Policy"}
+
+Click the "Reset Password" button to proceed with the update. You will receive confirmation at the bottom of the window once the password has been successfully updated:
+
+![](images/db_successful_password.png){fig-alt="Successful Password"}
+
+Please email [support\@coleridgeinitiative.org](mailto:support@coleridgeinitiative.org "mailto:support@coleridgeinitiative.org") if you have any issues resetting this password.
+
+## 2. Accessing the Dashboard
+
+Once you have successfully reset your dashboard access password, you are ready to access the dashboard. To do so, navigate back to the main Okta portal (adrf.okta.com) and click on the tile associated with your dashboard. **This tile will be unavailable until you complete the three ADRF onboarding tasks discussed in Step 1**:
+
+![](images/db_dashboard_tile.png){fig-alt="Dashboard Tile"}
+
+Clicking on this will bring up another window where you will be prompted to “Choose Your Application to Get Started.” Click on your Dashboard icon:
+
+![](images/db_application.png){fig-alt="Dashboard Application"}
+
+Next, you will need to wait for your session to be prepared. Then, your session will load the secure browser window, which will then bring up the Posit Connect portal. The Posit Connect portal is used to host the Dashboard. **This step may take several seconds while the browser loads and prepares the dashboard data**.
+
+Before accessing the dashboard, you will then be presented with one final log in, to the secure Connect environment:
+
+![](images/db_secure_connect.png){fig-alt="Secure Connect"}
+
+Here, please enter:
+
+1. The username you saw in the Password Reset step above (e.g., John.Doe.P00000)
+
+2. Your dashboard access password that you set in Step 1.
+
+![](images/db_LDAP.png){fig-alt="LDAP Credentials"}
+
+Once you enter the appropriate information and click “Log In,” your dashboard should begin to load. This may again take a minute or two. If you run into any issues, please let us know!
diff --git a/docs/.DS_Store b/docs/.DS_Store
deleted file mode 100644
index b1b1f67..0000000
Binary files a/docs/.DS_Store and /dev/null differ
diff --git a/docs/.nojekyll b/docs/.nojekyll
deleted file mode 100644
index e69de29..0000000
diff --git a/docs/access.html b/docs/access.html
deleted file mode 100644
index 735d5a2..0000000
--- a/docs/access.html
+++ /dev/null
@@ -1,508 +0,0 @@
-
-ADRF Onboarding Handbook - 1 Obtaining ADRF Access and Account Set Up
-
Agency-affiliated researcher. If you are an agency-affiliated researcher, your agency will set up an ADRF account for you.
-
Individual part of a training program. If you are part of a training program, Coleridge Initiative will create an account for you once you have been accepted into the program.
-
-
-
-
Account Registration and Onboarding Tasks
-
-
You will receive an email invitation to activate your account. The email will come from http://okta.com, so please make sure that it doesn’t get caught in your email spam filter. Follow the steps outlined in the email to set up your password and your multi-factor authentication preferences. Click on the link below to watch a video walking through the steps.
-
After activating your account, you will be logged in to the ADRF Applications page. Proceed to the Management Portal by clicking on the icon.
-
In the Management Portal, you will notice a “Onboarding Tasks” section within “Admin Tasks” with a number of items you will need to complete before you can gain access to the project space. Refer to the next section for details about the onboarding process.
-
-
-
-
Obtaining ADRF Access
-
-
Agency-affiliated researcher. If you are an agency-affiliated researcher using an agency-sponsored account, you will be granted ADRF access once you complete your onboarding tasks and required data access agreements. If you are an self-paying agency-affiliated researcher, your ADRF access is conditional on receipt of payment. If your institution of Office of Sponsored Programs will be submitting payment on your behalf, please be aware of potential access delays. Whenever possible, the Coleridge Initiative advises paying with a personal credit card or institutional payment card and using the generated invoice to request reimbursement.
-
Individual part of a training program. If you are part of a training program, you will be granted ADRF access once you complete your onboarding tasks and required data access agreements.
The ADRF stores data using Redshift. The simplest way to locate and get a quick overview of your data in a database is to use DBeaver. Please see the section on Data Organization, Amazon RedShift Querying Guide to locate and connect to your data.
-
-
G: Drive
-
Unstructured data is located on the G: drive inside the file system.
-
-
-
-
-
-
External Data and Code
-
Please note that importing of external data and code is restricted to only Coleridge staff. Given the secure and protected environment provided by the ADRF, all code, data, and packages that are coming from outside of the ADRF must be carefully vetted to prevent leaks, disclosure, or unauthorized access. This means that there is no direct method for uploading data or code from your system to the ADRF. Please contact support@coleridgeinitiative.org for any questions or assistance on importing your own code, data, or packages.
This document serves as an introduction to generating proficient Amazon Redshift queries. This is a generalized document meaning you will need to replace “schema_name” and “table_name” with the appropriate schema and table names used for your project.
-
-
-
Data Access
-
The data is housed in Redshift. You need to replace the “user.name.project” with your project-based username. The project-based username is your user folder name in the U:/ drive:
-
-
-
-
-
Note: Your username will be different than in these examples.
-
-
The password needed to access Redshift is the second password entered when logging into the ADRF as shown in the screen below:
-
-
-
-
All data is stored under schemas in the projects database and are accessible by the following programs:
-
-
DBeaver
-
To establish a connection to Redshift in DBeaver, first double click on the server you wish to connect to. In the example below I’m connecting to Redshift11_projects. Then a window will appear asking for your Username and Password. This will be your user folder name and include adrf\ before the username. Then click OK. You will now have access to your data stored on the Redshift11_projects server.
-
-
-
-
Creating Tables in PR/TR Schema
-
When users create tables in their PR (Research Project) or TR (Training Project) schema, the table is initially permissioned to the user only. This is analogous to creating a document or file in your U drive: Only you have access to the newly created table.
-
If you want to allow all individuals in your project workspace to access the table in the PR/TR schema, you will need to grant permission to the table to the rest of the users who have access to the PR or TR schema.
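The GRANT statement that the note below refers to appears to have been lost from this page. A minimal sketch, assuming the read-write group naming convention described in that note (schema_name, table_name, and db_xxxxxx_rw are placeholders to replace):

GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE schema_name.table_name TO GROUP db_xxxxxx_rw;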
Note: In the above code example, replace schema_name with the pr_ or tr_ schema assigned to your workspace and replace table_name with the name of the table on which you want to grant access. Also, in the group name db_xxxxxx_rw, replace xxxxxx with your project code. This is the last 6 characters in your project based user name. This will start with either a T or a P.
-
-
If you want to allow only a single user on your project to access the table, you will need to grant permission to that user. You can do this by running the following code:
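The code example that the note below refers to is missing here; the same statement appears elsewhere in this handbook, and a sketch of it follows ("IAM:first_name.last_name.project_code" is the placeholder user name to replace):

GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE schema_name.table_name TO "IAM:first_name.last_name.project_code";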
Note: In the above code example, replace schema_name with the pr_ or tr_ schema assigned to your workspace and replace table_name with the name of the table on which you want to grant access. Also, in "IAM:first_name.last_name.project_code", update first_name.last_name.project_code with the user name of the person to whom you want to grant access.
When connecting to the database through SAS, R, Stata, or Python you need to use one of the following DSNs:
-
-
Redshift01_projects_DSN
-
Redshift11_projects_DSN
-
-
In the code examples below, the default DSN is Redshift01_projects_DSN.
-
-
-
SAS Connection
-
proc sql;
-connect to odbc as mycon
-(datasrc=Redshift01_projects_DSN user=adrf\user.name.project password=password);
-select * from connection to mycon
-(select * from projects.schema.table);
-disconnect from mycon;
-quit;
-
-
-
R Connection
-
Best practices for loading large amounts of data in R
-
To ensure R can efficiently manage large amounts of data, please add the following lines of code to your R script before any packages are loaded:
When writing an R data frame to Redshift use the following code as an example:
-
# Note: replace schema_name.table_name with the destination table name and df_name with the data frame you wish to write to Redshift
-
-DBI::dbWriteTable(conn = conn, #name of the connection
-name ="schema_name.table_name", #name of table to save df to
-value = df_name, #name of df to write to Redshift
-overwrite =TRUE) #if you want to overwrite a current table, otherwise FALSE
-
-qry <-"GRANT SELECT ON TABLE schema.table_name TO group <group_name>;"
-dbSendUpdate(conn,qry)
-
The below table is for connecting to RedShift11 Database
For the above code to work, please create a file name .Renviron in your user folder (user folder is something like i.e. u:\ John.doe.p00002) And .Renviron file should contain the following:
PLEASE replace user id and password with your project workspace-specific user id and password.
-
This will ensure you don’t have your id and password in R code and then you can easily share your R code with others without sharing your ID and password.
-
The below table is for connecting to RedShift01 Database
For the above code to work, please create a file name .Renviron in your user folder (user folder is something like i.e. u:\ John.doe.p00002) And .Renviron file should contain the following:
PLEASE replace user id and password with your project workspace-specific user id and password.
-
This will ensure you don’t have your id and password in R code and then you can easily share your R code with others without sharing your ID and password.
odbc load, exec("select * from PATH_TO_TABLE") clear dsn("Redshift11_projects_DSN") user("adrf\user.name.project") password("password")
-
-
-
-
Redshift Query Guidelines for Researchers
-
Developing your query. Here’s an example workflow to follow when developing a query.
-
-
Study the column and table metadata, which is accessible via the table definition. Each table definition can be displayed by clicking on the [+] next to the table name.
-
To get a feel for a table’s values, SELECT * from the tables you’re working with and LIMIT your results (keep the LIMIT applied as you refine your columns), e.g., select * from [table name] LIMIT 1000
-
Narrow down the columns to the minimal set required to answer your question.
-
Apply any filters to those columns.
-
If you need to aggregate data, aggregate a small number of rows
-
Once you have a query returning the results you need, look for sections of the query to save as a Common Table Expression (CTE) to encapsulate that logic.
-
-
-
DO and DON’T DO BEST PRACTICES:
-
-
Tip 1: Use SELECT <columns> instead of SELECT *
-
Specify the columns in the SELECT clause instead of using SELECT *. The unnecessary columns place extra load on the database, which slows down not just the single Amazon Redshift query, but the whole system.
-
Inefficient
-
SELECT * FROM projects.schema_name.table_name
-
This query fetches all the data stored in the table you choose which might not be required for a particular scenario.
-
Efficient
-
SELECT col_A, col_B, col_C FROM projects.schema_name.table_name
-
-
-
Tip 2: Always fetch limited data and target accurate results
-
The less data retrieved, the faster the query will run. Rather than applying too many filters on the client side, filter the data as much as possible at the server. This limits the data being sent on the wire and you’ll be able to see the results much faster. In Amazon Redshift, use the LIMIT (###) qualifier at the end of the query to limit records.
-
SELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE [apply some filter] LIMIT 1000
-
-
-
Tip 3: Use wildcard characters wisely
-
Wildcard characters can be either used as a prefix or a suffix. Using leading wildcard (%) in combination with an ending wildcard will search all records for a match anywhere within the selected field.
-
Inefficient
-
Select col_A, col_B, col_C from projects.schema_name.table_name where col_A like '%BRO%'
-
This query will pull the expected results of Brown Sugar, Brownie, Brown Rice and so on. However, it will also pull unexpected results, such as Country Brown, Lamb with Broth, Cream of Broccoli.
-
Efficient
-
Select col_A, col_B, col_C from projects.schema_name.table_name where col_B like 'BRO%'.
-
This query will pull only the expected results of Brownie, Brown Rice, Brown Sugar and so on.
-
-
-
Tip 4: Does My record exist?
-
Normally, developers use EXISTS() or COUNT() queries for matching a record entry. However, EXISTS() is more efficient as it will exit as soon as it finds a matching record, whereas COUNT() will scan the entire table even if the record is found in the first row.
-
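For contrast, a hedged sketch of the COUNT()-based pattern this tip warns against, written against the same hypothetical placeholder table used in the efficient version below:

Inefficient

select col_A from projects.schema_name.table_name A where (select count(*) from projects.schema_name.table_name B where A.col_A = B.col_A) > 0 order by col_A;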
Efficient
-
select col_A from projects.schema_name.table_name A where exists (select 1 from projects.schema_name.table_name B where A.col_A = B.col_A ) order by col_A;
-
-
-
Tip 5: Avoid correlated subqueries
-
A correlated subquery depends on the parent or outer query. Since it executes row by row, it decreases the overall speed of the process.
-
Inefficient
-
SELECT col_A, col_B, (SELECT col_C FROM projects.schema_name.table_name_a WHERE col_C = c.rma LIMIT 1) AS new_name FROM projects.schema_name.table_name_b
-
Here, the problem is — the inner query is run for each row returned by the outer query. Going over the “table_name_b” table again and again for every row processed by the outer query creates process overhead. Instead, for Amazon Redshift query optimization, use JOIN to solve such problems.
-
Efficient
-
SELECT col_A, col_B, col_C FROM projects.schema_name.table_name c LEFT JOIN projects.schema_name.table_name co ON c.col_A = co.col_B
-
-
-
Tip 6: Avoid using Amazon Redshift function in the where condition
-
Often developers use functions or methods with their Amazon Redshift queries.
-
Inefficient
-
SELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE RIGHT(birth_date,4) = '1965' and LEFT(birth_date,2) = '07'
-
Note that even if birth_date has an index, the above query changes the WHERE clause in such a way that this index cannot be used anymore.
-
Efficient
-
SELECT col_A, col_B, col_C FROM projects.schema_name.table_name WHERE birth_date between '711965' and '7311965'
-
-
-
Tip 7: Use WHERE instead of HAVING
-
The HAVING clause filters rows after all the rows are selected; it acts as a filter on the grouped results. Do not use the HAVING clause for any other purpose. It is useful when performing GROUP BYs and aggregations.
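As a minimal sketch of the contrast, using the same hypothetical placeholder table and columns as the other tips, filter rows with WHERE before grouping rather than with HAVING afterward:

Inefficient

SELECT col_A, sum(col_B) as total FROM projects.schema_name.table_name GROUP BY col_A HAVING col_A = 'DOG';

Efficient

SELECT col_A, sum(col_B) as total FROM projects.schema_name.table_name WHERE col_A = 'DOG' GROUP BY col_A;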
-
-
-
Tip 8: Use temp tables when merging large data sets
-
Creating local temp tables will limit the number of records in large table joins and merges. Instead of performing large table joins, one can break out the analysis by performing the analysis in two steps: 1) create a temp table with limiting criteria to create a smaller / filtered result set. 2) join the temp table to the second large table to limit the number of records being fetched and to speed up the query. This is especially useful when there are no indexes on the join columns.
-
Inefficient
-
SELECT col_A, col_B, sum(col_C) total FROM projects.schema_name.table_name pd INNER JOIN projects.schema_name.table_name st ON pd.col_A=st.col_B WHERE pd.col_C like 'DOG%' GROUP BY pd.col_A, pd.col_B, pd.col_C
-
Note that even if joining column col_A has an index, the col_B column does not. In addition, because the size of some tables can be large, one should limit the size of the join table by first building a smaller filtered #temp table then performing the table joins.
-
Efficient
-
SET search_path = schema_name; -- this statement sets the default schema/database to projects.schema_name
-
Step 1:
-
CREATE TEMP TABLE temp_table (
-
col_A varchar(14),
-
col_B varchar(178),
-
col_C varchar(4) );
-
Step 2:
-
INSERT INTO temp_table SELECT col_A, col_B, col_C
-
FROM projects.schema_name.table_name WHERE col_B like 'CAT%';
-
Step 3:
-
SELECT pd.col_A, pd.col_B, pd.col_C, sum(col_C) as total FROM temp_table pd INNER JOIN projects.schema_name.table_name st ON pd.col_A=st.col_B GROUP BY pd.col_A, pd.col_B, pd.col_C;
-
DROP TABLE temp_table;
-
Note: always drop the temp table after the analysis is complete to release data from physical memory.
-
-
-
-
Other Pointers for best database performance
-
SELECT columns, not stars. Specify the columns you’d like to include in the results (though it’s fine to use * when first exploring tables — just remember to LIMIT your results).
-
Avoid using SELECT DISTINCT. The SELECT DISTINCT command in Amazon Redshift is used to fetch unique results and remove duplicate rows in the relation. To achieve this, it groups related rows together and then removes the duplicates, and this grouping is a costly operation. To fetch distinct rows and remove duplicate rows, use more attributes in the SELECT operation.
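One reading of this advice, as a hedged sketch with the hypothetical placeholder columns used elsewhere in this guide: rather than de-duplicating a single column with DISTINCT, select enough attributes that the returned rows are already unique.

Avoid

SELECT DISTINCT col_A FROM projects.schema_name.table_name

Prefer

SELECT col_A, col_B, col_C FROM projects.schema_name.table_name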
-
Inner joins vs. WHERE clause. Use an inner join for merging two or more tables rather than using the WHERE clause. Merging tables through the WHERE clause creates a CROSS join/CARTESIAN product, and the CARTESIAN product of two tables takes a lot of time.
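A short sketch of the difference, using hypothetical placeholder tables (table_name_a and table_name_b) in the style of the other examples:

Avoid

SELECT A.col_A, B.col_B FROM projects.schema_name.table_name_a A, projects.schema_name.table_name_b B WHERE A.col_A = B.col_B

Prefer

SELECT A.col_A, B.col_B FROM projects.schema_name.table_name_a A INNER JOIN projects.schema_name.table_name_b B ON A.col_A = B.col_B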
-
IN versus EXISTS. The IN operator is costlier than EXISTS in terms of scans, especially when the result of the subquery is a large dataset. Try to use EXISTS rather than IN when fetching results with a subquery.
-
Avoid
-
SELECT col_A , col_B, col_C
-
FROM projects.schema_name.table_name
-
WHERE col_A IN
-
(SELECT col_B FROM projects.schema_name.table_name WHERE col_B = 'DOG')
-
Prefer
-
SELECT col_A , col_B, col_C
-
FROM projects.schema_name.table_name
-
WHERE EXISTS
-
(SELECT col_A FROM projects.schema_name.table_name b WHERE
-
a.col_A = b.col_B and b.col_B = 'DOG')
-
Query optimizers can change the order of the following list, but this general lifecycle of an Amazon Redshift query is good to keep in mind when writing Amazon Redshift queries.
-
-
FROM (and JOIN) get(s) the tables referenced in the query.
-
WHERE filters data.
-
GROUP BY aggregates data.
-
HAVING filters out aggregated data that doesn’t meet the criteria.
-
SELECT grabs the columns (then deduplicates rows if DISTINCT is invoked).
-
UNION merges the selected data into a result set.
-
ORDER BY sorts the results.
-
-
-
-
Amazon Redshift best practices for FROM
-
Join tables using the ON keyword. Although it’s possible to “join” two tables using a WHERE clause, use an explicit JOIN. The JOIN + ON syntax distinguishes joins from WHERE clauses intended to filter the results.
-
SET search_path = schema_name;-- this statement sets the default schema/database to projects.schema_name
-
SELECT A.col_A , B.col_B, B.col_C
-
FROM projects.schema_name.table_name as A
-
JOIN projects.schema_name.table_name B ON A.col_A = B.col_B
-
Alias multiple tables. When querying multiple tables, use aliases, and employ those aliases in your select statement, so the database (and your reader) doesn’t need to parse which column belongs to which table.
-
Avoid
-
SET search_path = schema_name;-- this statement sets the default schema/database to projects.schema_name
-
SELECT col_A , col_B, col_C
-
FROM dbo.table_name as A
-
LEFT JOIN dbo.table_name as B ON A.col_A = B.col_B
-
Prefer
-
SET search_path = schema_name;-- this statement sets the default schema/database to projects.schema_name
-
SELECT A.col_A , B.col_B, B.col_C
-
FROM dbo.table_name as A
-
LEFT JOIN dbo.table_name as B
-
ON A.col_A = B.col_B
-
-
-
Amazon Redshift best practices for WHERE
-
Filter with WHERE before HAVING. Use a WHERE clause to filter superfluous rows, so you don’t have to compute those values in the first place. Only after removing irrelevant rows, and after aggregating those rows and grouping them, include a HAVING clause to filter out aggregates.
-
Avoid functions on columns in WHERE clauses. Using a function on a column in a WHERE clause can really slow down your query, as the function prevents the database from using an index to speed up the query. Instead of using the index to skip to the relevant rows, the function on the column forces the database to run the function on each row of the table. The concatenation operator || is also a function, so don’t try to concat strings to filter multiple columns. Prefer multiple conditions instead:
-
Avoid
-
SELECT col_A, col_B, col_C FROM projects.schema_name.table_name
-
WHERE concat(col_A, col_B) = 'REGULARCOFFEE'
-
Prefer
-
SELECT col_A, col_B, col_C FROM projects.schema_name.table_name
-
WHERE col_A ='REGULAR' and col_B = 'COFFEE'
-
-
-
Amazon Redshift best practices for GROUP BY
-
Order multiple groupings by descending cardinality. Where possible, GROUP BY columns in order of descending cardinality. That is, group by columns with more unique values first (like IDs or phone numbers) before grouping by columns with fewer distinct values (like state or gender).
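A minimal sketch, assuming a hypothetical table in which col_id has many distinct values and col_state has only a few:

Avoid

SELECT col_state, col_id, count(*) as n FROM projects.schema_name.table_name GROUP BY col_state, col_id

Prefer

SELECT col_id, col_state, count(*) as n FROM projects.schema_name.table_name GROUP BY col_id, col_state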
-
-
-
Amazon Redshift best practices for HAVING
-
Only use HAVING for filtering aggregates. Before HAVING, filter out values using a WHERE clause before aggregating and grouping those values.
-
SELECT col_A, sum(col_B) as total_amt
-
FROM projects.schema_name.table_name
-
WHERE col_C = 1617 and col_A='key'
-
GROUP BY col_A
-
HAVING sum(col_D)> 0
-
-
-
Amazon Redshift best practices for UNION
-
Prefer UNION ALL to UNION. If duplicates are not an issue, UNION ALL won’t discard them, and since UNION ALL isn’t tasked with removing duplicates, the query will be more efficient.
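A brief sketch with the hypothetical placeholder tables used elsewhere in this guide; both queries stack the two result sets, but the second skips the duplicate-removal step:

Avoid

SELECT col_A FROM projects.schema_name.table_name_a UNION SELECT col_A FROM projects.schema_name.table_name_b

Prefer

SELECT col_A FROM projects.schema_name.table_name_a UNION ALL SELECT col_A FROM projects.schema_name.table_name_b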
-
-
-
Amazon Redshift best practices for ORDER BY
-
Avoid sorting where possible, especially in subqueries. If you must sort, make sure your subqueries are not needlessly sorting data.
-
Avoid
-
SELECT col_A, col_B, col_C
-
FROM projects.schema_name.table_name
-
WHERE col_B IN
-
(SELECT col_A FROM projects.schema_name.table_name
-
WHERE col_C = 534905 ORDER BY col_B);
-
Prefer
-
SELECT col_A, col_B, col_C
-
FROM projects.schema_name.table_name
-
WHERE col_A IN
-
(SELECT col_B FROM projects.schema_name.table_name
-
WHERE col_C = 534905);
-
-
-
Troubleshooting Queries
-
There are several metrics for calculating the cost of the query in terms of storage, time, CPU utilization. However, these metrics require DBA permissions to execute. Follow up with ADRF support to get additional assistance.
-
Using the SVL_QUERY_SUMMARY view: To analyze query summary information by stream, do the following:
-
Step 1: select query, elapsed, substring from svl_qlog order by query desc limit 5;
-
Step 2: select * from svl_query_summary where query = MyQueryID order by stm, seg, step;
-
Execution Plan: Lastly, an execution plan is a detailed step-by-step processing plan used by the optimizer to fetch the rows. It can be enabled in the database using the following procedure:
-
-
Click on SQL Editor in the menu bar.
-
Click on Explain Execution Plan.
-
-
It helps to analyze the major phases in the execution of a query. We can also find out which part of the execution is taking more time and optimize that sub-part. The execution plan shows which tables were accessed, what index scans were performed for fetching the data. If joins are present it shows how these tables were merged. Further, we can see a more detailed analysis view of each sub-operation performed during query execution.
4 Do’s and Don’ts For Discussing Data Inside the ADRF
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
It is important to protect the confidential data that is inside the ADRF when communicating with your team-mates. The general rule is that you should never take any exact number out of the ADRF. This means you should never write down or share any number by text, screenshot, or image, even with a team-mate. The rules have become more complicated now that everything is online: even though your team-mates are “safe people,” and Zoom conversations are password protected and encrypted, we’d rather err on the side of caution when sharing information over Zoom.
-
This cheat sheet summarizes some of the rules that apply to discussing data before it has been exported from the ADRF and passed the ADRF team’s disclosure review. If you are unsure about a specific situation, please ask the Coleridge team at support@coleridgeinitiative.org.
-
-
Exact Numbers
-
Do not describe a statistic in exact numbers. If you would like to communicate these values while not in person, you can have a private discussion via the projects drive inside the ADRF.
-
Example: If an average within a specific group was 5,000, you would need to convey this average on the projects drive.
-
-
-
Comparing Values
-
When comparing values, you are permitted to say that one value is more than, less than, or about the same as another. However, you cannot refer to the exact difference between the two numbers.
-
In practice, you can use pluses and minuses to convey differences between values for data that has not been exported from the ADRF.
-
Example: “The mean for Group A was roughly the same as the mean for Group B, but these values were both greater than that of Group C.”
-
-
-
Percentages/Proportions
-
Percentages and proportions also cannot be directly mentioned. Instead, you can refer to the percentage/proportion within 25%.
-
Example: If a proportion was 30%, you could say “The proportion is about 25%” or “The proportion is between 25% and 50%.”
To provide ADRF users with the ability to draw from sensitive data, results that are exported from the ADRF must meet rigorous standards meant to protect privacy and confidentiality. To ensure that those standards are met, the ADRF Export Review team reviews each request to ensure that it follows formal guidelines that are set by the respective agency providing the data in partnership with the Coleridge Initiative. Prior to moving data into the ADRF from the agency, the Export Review team suggests default guidelines to implement, based on standard statistical approaches in the U.S. government 1,2 as well as international standards 3,4, and 5. The Data Steward from the agency supplying the data works with the team to amend these default rules in line with the agency’s requirements. If you are unsure about the review guidelines for the data you are using in the ADRF or if you have any questions relating to exports, please reach out to support@coleridgeinitiative.org before submitting an export request.
-
To learn more about limiting disclosure more generally, please refer to the textbook or view the videos.
-
-
General Best Practices for a Successful Export
-
-
Currently, the review process is highly manual: Reviewers will read your code and view your output files, which may be time-consuming.
-
Each additional release adds disclosure risk and therefore limits subsequent releases; we ask that users limit the number of files they request to export to just the outputs necessary to produce a particular report or paper. If you are requesting an export of more than 10 files, there may be an additional charge.
-
The reviewers may ask you to make changes to your code or output to meet the requirements of guidelines that have been given by the providers of the data in the ADRF. Thus, we strongly encourage you to produce all output files—tables with rounded numbers, graphs with titles, and so forth—through code, rather than manually.
-
We ask that you only request review of final versions of output files, rather than in-progress versions. Any file containing intermediate output will be rejected.
-
Every code file should have a header describing the contents of the file, including a summary of the data manipulation that takes place in the file (e.g., regression, table or figure creation, etc.).
-
Documenting code by using comments throughout is helpful for disclosure reviews. The better the documentation, the faster the turnaround of export requests. If data files are aggregated, please provide documentation on the level of aggregation and for where in the code the aggregation takes place.
-
To help reviewers, who may not have seen your code before, we ask that users create meaningful variable names. For instance, if you are calculating outflows, it is better to name the variable “outflows” than to name it “var1.”
-
-
-
-
Timelines for Export Process
-
-
Coleridge reviewers have five business days to complete an export from the day you submit an export request. However, timelines may differ depending on your agency, so please refer to your specific agency’s guidelines.
-
The review process can be delayed if the reviewer needs additional information or if the reviewer needs you to make changes to your code or output to meet the ADRF nondisclosure requirements.
-
-
-
-
Export Review Process
-
The ADRF Export Review process typically involves two main stages:
-
-
Primary Review:
-
-
This is an initial, cursory review of your documentation and exports to ensure they do not include micro-data. A primary review can take up to 5 business days, so please plan accordingly when submitting your materials.
-
In cases where the reviewer has questions or requires additional information, the primary review may extend beyond 5 business days.
-
-
Secondary Review:
-
-
This is a comprehensive review conducted by an approved Data Steward who has content knowledge for the data permissioned to your workspace.
-
If your submission pertains to multiple data assets, it will require approval by each Data Steward before the material can be exported from the ADRF.
-
-
-
How to Check Your Export Review Status:
-
If you’ve submitted an export request, you can easily check the status of your submission by following these steps:
-
-
Log into the ADRF.
-
Open the ADRF Export module.
-
-
-
Status Descriptions:
-
To help you better understand the different stages of the Export Review process, here are the status descriptions you may encounter:
-
-
Awaiting Reviewer:
-
-
Your export is currently under primary review. If any issues arise during the primary review, your reviewer will notify you. Upon completion of the primary review, the secondary reviewer(s) will be notified.
-
-
Awaiting Secondary Review:
-
-
Your export is currently under secondary review. If your submission pertains to multiple data assets, it will require a review by each Data Steward before being approved.
-
-
-
-
-
Preparing Data for Export
-
Each agency has specific disclosure review guidelines, especially with respect to the minimum allowable cell sizes for tables. Refer to these guidelines when preparing export requests. If you are unsure of what guidelines are in place for the dataset with which you are working in the ADRF, please reach out to support@coleridgeinitiative.org.
-
-
Tables
-
-
Cell Sizes
-
-
For individual-level data, please report the number of observations from each cell. For individual-level data, the default rule is to suppress cells with fewer than 10 observations, unless otherwise directed by the guidelines of the agency that provided the data.
-
If your table includes row or column totals or is dependent on a preceding or subsequent table, reviewers will need to take into account complementary disclosure risks—that is, whether the tables’ totals, or the separate tables when read together, might disclose information about individuals in the data in a way that a single, simpler table would not. Reviewers will work with you by offering guidance on implementing any necessary complementary suppression techniques.
-
-
Weighted Data
-
-
If weighted results are to be exported, you must report both weighted and unweighted counts.
-
-
Ratios
-
-
If ratios are reported, please report the number of valid cases for both the numerator and the denominator (e.g., number of men in state X and number of women in state X, in addition to the ratio of women in state X).
-
-
Percentiles
-
-
Do not report exact percentiles. Instead, for example, you may calculate a “fuzzy median,” by averaging the true 45th and 55th percentiles.
-
-
Percentages
-
-
For any reported percentages or proportions, the underlying counts of individuals contributing to the numerators and denominators must be provided for each statistic in the desired export.
-
-
Maxima and Minima
-
-
Suppress maximum and minimum values in general.
-
You may replace an exact maximum or minimum with a top-coded value.
-
-
-
-
-
Graphs
-
-
Graphs are representations of tables. Thus, for each graph (which may have, e.g., a jpg, pdf, png, or tif extension), provide the source data of the underlying table of the graph following the guidelines for tables above.
-
Because graphs and other figures take the most time to review, the number of generated graphs should be as low as possible. Please consider the possibility that you could export the underlying table instead, and generate the graph in another package.
-
If a graph is produced from aggregated data or from tables that have been disclosure-proofed following the guidelines above (e.g., bar charts of magnitudes), provide the underlying tables.
-
If a graph is produced directly from unit-record data but aggregated in the visualization (e.g., frequency histograms), provide the underlying tables.
-
If a graph is produced directly from unit-record data and displays unit-record values (e.g., scatterplots, plots of residuals), the graph can be released only after you ensure that individuals cannot be re-identified and that values can only be estimated with a high level of uncertainty. Further processing to meet this requirement can include, but is not restricted to, cutting off the tails of a distribution, removing outliers, jittering the actual values, and removing or modifying axis values.
-
If a graph is produced from the results of modeling or derivation and uses the unit-record data (e.g., regression curves), the graph can be released only if the values cannot be used to find original data values.
-
-
Graphs of this type are generally automatically cleared.
-
For precision/recall graphs, you will need to report the sample size used to generate your model(s).
-
-
-
-
-
Model Output
-
-
Output from regression or machine-learning models generally does not pose a risk of disclosing personally identifiable information, as long as the models are not based on small samples. Provide the counts for each variable that produces the model output. If categorical variables are used then provide the counts for each category.
-
-
-
-
-
Submitting an Export Request
-
To request an export be reviewed, please watch the following video or follow the instructions below:
Verify yourself with Okta (download Okta Verify on your smartphone or other device).
-
Choose your project as seen in the photo below. For the purpose of this document, you are seeing the Coleridge Initiative Associate Access project.
-
-
-
-
-
-
Select Desktop and login with the same credentials you had done previously.
-
Upon entering the ADRF, a chrome page will appear as shown in the photo below. On this page, click Export Request in the bottom left corner. Or, from the ADRF desktop, open Google Chrome and navigate to export.adrf.net. (Note: export.adrf.net is an address that will only work within the ADRF desktop).
-
-
-
-
-
-
Click My Requests, or the top (person-shaped) icon, at the left side of the window as shown in the screenshot below.
-
-
-
-
-
-
Click New Item as shown below
-
-
-
-
-
-
You will be asked to select the project to which your export relates. If you do not see the correct project listed in the dropdown list, please reach out to our support team at support@coleridgeinitiative.org.
-
After selecting a project, click Continue.
-
-
-
-
-
-
Read through the entire page that loads. This page, titled “Create Export Request,” will ask for you to comment on all supporting code files to explain the commands used to generate the files in the export request. The Export Review team will reject all requests containing intermediate output, and there should be no more than 10 separate files for export unless approval is given in advance. The Export Review team will typically release export requests within five business days. However, if the team has any clarifying questions, this could result in a longer review process. You need to document your output files in the text box provided. See the example below:
-
-
-
-
-
-
When you have read through and followed the page instructions, and are ready to proceed:
-
-
Move the slider at the bottom of the page to indicate that you have followed the page’s guidelines.
-
At the bottom of the page, upload each of the files that you have prepared.
-
Click Submit Request… to create the export request.
-
-
-
-
-
You can click My Requests at the left side of the window to view your current and previous export requests.
-
-
To learn more about exporting results, please watch these videos.
You should be prompted to set up multifactor authentication when you create your account; the options are SMS, voice call, email, and the Okta Verify application.
-
-
-
-
-
-
-
-
-
-Can I set up more than one form of Multifactor Authentication?
-
-
-
-
-
-
This is recommended. If you lose access to one form of MFA, you would still be able to gain access to your account using an alternative. To do so, please log on to https://adrf.okta.com and select your name on the top right and click settings. Here you can modify or set up your SMS, voice call, email or Okta multifactor authentication.
-
-
-
-
-
-
-
-
-
-How can I reset my Okta password?
-
-
-
-
-
-
You can use the “Need help signing in?” option on the sign on page (https://adrf.okta.com) which will send a link to your email to reset your password. You may have to verify your identity by answering security questions which you set up when creating your account.
-
-
-
-
-
-
-
-
-
-How can I reset my ADRF password?
-
-
-
-
-
-
You can reset your ADRF project password by following these steps:
-
-
Click on the ADRF Management Portal Okta Tile:
-
-
-
-
Then click on Admin Tasks on the left hand side of the screen:
-
-
-
-
Then click on RESET PASSWORD:
-
-
-
-
You’ll see a screen where you can choose the project(s) for which you want to update the password.
-
-
-
-
-
-
-
-
-
-
-What if I do not remember my security questions or if I get locked out?
-
-
-
-
-
-
You would have to reach out to support at support@coleridgeinitiative.org to have your account unlocked and you would have to reset your security questions so that you can recover your account in the future.
-
-
-
-
-
-
-
-
-
-I can log into the ADRF but my desktop and DS application just show blank pages.
-
-
-
-
-
-
Please ensure the connection to ADRF is not being blocked by your organization’s VPN and/or firewall (try using a device not connected to your organization’s network) and reach out to support@coleridgeinitiative.org.
-
-
-
-
-
-
-
-
-
-I saved a file in the C: drive or in the Desktop. When I logged back in, the file is no longer there. Can you restore it?
-
-
-
-
-
-
The ADRF is a temporary workspace environment: files left on the desktop will be removed when you log out of your session, and we cannot restore these files. See section 5.2.1. Best practice is to store files in your user folder on the U: drive.
-
-
-
-
-
-
-
-
-
-How do I open an ipynb notebook?
-
-
-
-
-
-
On the desktop you should find an icon for JupyterLab. When you click that, a command prompt and a browser window open; leave the command prompt running. You should be able to open the file by selecting File -> Open From Path and providing the path to the folder containing the ipynb notebook.
-
-
-
-
-
-
-
-
-
-How can I ingest publicly available data into the ADRF?
-
-
-
-
-
-
Please open a support request by sending an email to support@coleridgeinitiative.org. Include the dataset you wish to have available inside the ADRF and documentation that confirms that the dataset is public.
-
-
-
-
-
-
-
-
-
-Where can I access publicly available data from within the ADRF?
-
-
-
-
-
-
Publicly available data is stored in the schema ds_public_1.
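As a minimal sketch, tables in that schema can be queried directly (table_name here is a hypothetical placeholder):

SELECT * FROM ds_public_1.table_name LIMIT 100;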
-
-
-
-
-
-
-
-
-
-Where is my project or training related data stored?
-
-
-
-
-
-
All project and training related databases are prefixed with ‘pr_’ (for project) or ‘tr_’ (for training). You may use this space when creating intermediate datasets or as a “working space”. All project members have read and write access to this area (specific to your project).
-
-
-
-
-
-
-
-
-
-My data is not in a relational format. Where can I find these files?
-
-
-
-
-
-
Read-only non-relational data are stored in the G:\ drive on Windows Explorer. Project specific non-relational data and files are stored in project specific folders that are prefixed with ‘pr_’ or ‘tr_’. The location of these folders are in the P:\ drive on Windows Explorer.
-
-
-
-
-
-
-
-
-
-What is the difference between the P:, U: and G: drives?
-
-
-
-
-
-
Each drive location has a different purpose and access rule:
- P: Project specific files shared by ALL project members
- U: User personal space. Only the user has read/write access to this area.
- G: Non-relational datasets. Read-only access to authorized users only.
-
-
-
-
-
-
-
-
-
-I need to process a large amount of relational data. What is the destination location?
-
-
-
-
-
-
The best practice is to process the data where it is currently located. If the data is in a relational database, perform as much of your processing as possible in Redshift to make the most efficient use of resources (e.g., filtering, sorting, aggregating).
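For example, a hedged sketch of pushing filtering and aggregation into Redshift before pulling anything into R, Python, or Stata; the schema, table, and column names are the same hypothetical placeholders used throughout this guide:

SELECT col_A, count(*) as n, sum(col_C) as total
FROM projects.schema_name.table_name
WHERE col_B = 'DOG'
GROUP BY col_A;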
The Management Portal web-based application is positioned primarily as the management and monitoring console for project and data stewards. It provides detailed insight on project configurations, user activity, user onboarding status, and overall cost of a project on the ADRF. We focus on four primary pillars of information a Project/Data Steward most often focuses on:
-
-
People – Who are the members of projects, how often do they use the ADRF, what exports have they requested and their status, estimated cost per person/project for current month and for the project since inception, and detailed usage metrics.
-
Projects – Details of project start/end dates, abstract description, number of members onboarded and pending, and resources the project has access to (i.e. datasets, etc).
-
Datasets – Description of the dataset, location on the ADRF (database or file system), size, name of the data steward(s), and the link to Enterprise Data Catalog (Informatica) describing the dataset and metadata.
-
Agreements – What agreements are related to these projects, indication of each member’s signing status, members pending signature, and term (dates) covered by the agreement(s).
-
-
-As mentioned, the Management Portal application will track your ADRF usage. The portal will also consolidate your ADRF Terms of Use, Security Training Quiz, and Security Training Video into one place. In order to complete ADRF onboarding, all three of the mentioned tasks are to be completed by the user (researcher). To access the Management Portal, log in using your credentials at https://adrf.okta.com and click on the ADRF Management Portal icon. See picture below:
-
-
-
-
Once inside the Management Portal, you have access to your personal workspace sessions statistics along with admin tasks such as the three onboarding tasks and password management. See the example below:
-
-
-
-
Accessing the Onboarding Tasks
-
-
Log in to the Management Portal
-
Click on “Admin Tasks” in the left navigation menu.
-
-
-
-
Click on “Complete Onboarding”.
-
-
-
-
This will load the Onboarding Tasks window.
-
-
-
-
Click on each individual task to complete it.
-
-
-
Signing the ADRF Terms of Use Agreement
-
The Terms of Use need to be completed before you are given access to the data and project space inside the ADRF. To complete ADRF Terms of Use, complete the following steps:
-
-
Click on the “Terms of Use” tile.
-
-
-
-
Click on “Sign with DocuSign”
-
-
-
-
You will then be redirected to the DocuSign signing page. Click “Continue” on the upper right corner.
-
-
-
-
Click “Start” to begin.
-
-
-
-
If you have already configured a signature, click on the yellow “Sign” button to apply it. Otherwise, follow the prompts to configure your electronic signature.
-
-
-
-
Once the signature is applied, click “Finish”.
-
-
-
-
You will then be redirected back to the management portal. And the “Terms of Use” task will be marked as completed.
-
-
-
-
-
Watching the Security Training Video
-
The Security Training Video needs to be completed as well. To complete the training, complete the following steps:
-
-
Click on the “Security Training Video” tile to load the player and then click play.
-
-
-
-
Once you have watched the video in its entirety, click on the “Mark as Complete” button to complete the task.
-
-
-
-
Note: the “MARK AS COMPLETE” button will not be enabled until at least 5 minutes have passed since the start of the video.
-
-
-
Click on the back arrow in the upper right corner to return to the main tasks panel.
-
-
-
-
The training video section will now be marked as completed.
-
-
-
-
-
Complete the Security Training Quiz
-
The Security Training Quiz needs to be completed after the Security Training Video. To complete the training, complete the following steps:
-
-
Click on the “Security Quiz” tile to load the quiz.
-
-
-
-
Answer the questions and click on the “SUBMIT RESPONSE” button. You must answer at least four of the questions correctly to complete this task.
-
-
-
-
You will be automatically redirected to the main task panel once the questionnaire has been successfully completed. And the “Security Quiz” will be marked as completed.
The ADRF has an internal package repository, so users can install packages for R and Python themselves.
-
The repositories that are currently mirrored in the ADRF are CRAN for R packages and PyPi.org for Python. There is currently no access to packages hosted on Github or other mirrors.
-
-
-
-
-
-
-Note
-
-
-
-
If you are working in a shared workspace for a project, each user in the project must install the packages, there is no shared package installation for projects.
-
-
-
-
-
R packages
-
To install R packages, simply type:
-
install.packages("packagename")
-
-
and the package will be installed from the repository. You will not have to re-install the package again, and to use the package load it with the library() function. For example:
-
library(tidyverse)
-
All packages will be installed in your user folder.
-
To install a specific package version you can specify:
-
install.packages("remotes")
-
remotes::install_version("tidyverse", "1.3.2")
-
-
-
-
-
-
-Note
-
-
-
-
We recommend starting R using Rstudio for best results, instead of double clicking on a R or Rmarkdown script.
-
-
-
-
-
Python packages
-
Similar to R packages, Python packages may be installed using the Package Installer for Python (pip).
-
-
-
-
-
-
-Note
-
-
-
-
We recommend installing python packages from the command line. If you start Jupyter lab, and choose the Terminal tab:
-
-
Then install your package using pip, for example, to install the pandas package:
-
-
Then you may use the package within your Jupyter notebook as usual.
Research Data Centre of the German Federal Employment Agency at the Institute for Employment Research. “Remote Data Access and On-Site Use at the FDZ of the BA at the IAB.” (2020, December 8). http://doku.iab.de/fdz/access/Vorgaben_DAFE_EN.PDF
The ADRF messenger is an internal collaboration tool and will be made available once testing is complete.
-
-
-
Shared Folders
-
Shared folders within a project are a great way to share information with other members on a team project. Remember that when working with teams you may not share the ADRF screen (even project folders) with other members on video platforms or otherwise, whether or not your team members are working on the same project.
-
-
-
Sharing Restrictions
-
Again, remember that when working with teams you may not share the ADRF screen with other members on video platforms or otherwise, whether or not your team members are working on the same project.
-
The information contained in the ADRF is restricted to reside only in the ADRF for all purposes unless it passes Export Review. This means that it cannot be shared, or put at risk of being shared, with any unauthorized parties. Do not write down any numbers, figures, or tables corresponding to data in the ADRF. Copying and pasting is restricted, and manually circumventing this restriction is also prohibited by your data agreements.
The U: drive is your user drive; it’s where you will store any files you are working on. Only the user will have access to the U: drive. For example, if user A wants to share information with user B who is on the same project, user A will need to save files to a P: drive folder and not folders in their U: drive since user B will not be able to access user A’s U: drive.
-
-
-
Project Drive
-
The P: drive also allows permanent storage. This drive is accessible by anyone on the same project, but not across projects. This is the only drive outside of the user drive where saved files will not be erased after logging out of the ADRF.
-
-
-
SQL
-
Each project will have a project-specific database created. All members of the project will have read and write permissions for data and may also create their own objects (tables, etc.). The project databases are prefixed with pr-.
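As a rough illustration (the schema and table names below are placeholders, not the actual names used in the ADRF), a project member connected to their pr- database could create a results table like this:

CREATE TABLE schema_name.my_results (
    person_id VARCHAR(64),
    outcome   INTEGER
);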
-
-
-
-
Ineligible Locations
-
The G: drive (data), the L: drive (Libs), and the desktop are not eligible for long-term file storage. You will not have permission to write to either the G: drive or the L: drive. The desktop functions only as temporary storage: as soon as you are logged out of the ADRF, your desktop is cleared. Additionally, since Wi-Fi connectivity can be imperfect, storing files on the desktop for any amount of time is not recommended.
-
-
-
Storage Size Restrictions
-
Storage size varies by project but is capped at a predetermined amount. Additional storage costs may vary depending on resource requirements; see https://aws.amazon.com/appstream2/pricing/ for details.
-
-
-
Best Practices
-
To save storage space, try not to save raw data tables. In particular, do not save copies of, or large subsets of, data that are already available through standard sources. Instead, access the data through the methods described in the prior sections, as appropriate for your programming language or program.
-
Organize folders in a way that makes sense for your particular project. For example, you might have folders for a particular analysis or sub-projects. Dates on file names can be helpful for version control.
-
Keep tabs on how much storage you are using compared to the allocated amount of storage.
The video linked below runs through the necessary steps for logging into and out of the ADRF. If the video does not play, click here.
-
-
-
-
Virtual Desktop Environment
-
-
What is a VDE?
-
Purpose, Contents, Capabilities
A virtual desktop environment (VDE) allows you to interact with a remote system as if it were your own personal computer. The majority of your standard desktop functions are available, but the programs, data, and permissions are all controlled by the remote administrator (Coleridge Initiative). Thus, you will be working in a familiar environment while accessing protected data, programs, and systems that would otherwise be difficult to distribute. The ADRF uses a standard Windows environment (Windows Server) and provides a variety of software packages to conduct your analysis. For more on Windows capabilities, see the section on Windows Settings.
-
-
-
-
-
-
Temporary Nature of the Environment
-
While the environment is similar to that on your home computer (for Windows users), there are a handful of key differences. The first is that the environment is temporary in nature. This means that if you are not using it for a prolonged period of time (the default is four hours, but this can vary by project), running programs will be stopped and information stored in temporary locations will be deleted. You will receive an on-screen message before any sessions are terminated. For more on safe, non-temporary storage locations in the ADRF, see the section on Storing Analytic Results.
-
Given the temporary nature of the ADRF, it is crucial to make sure that your work is saved, and saved in an appropriate location. Once this is complete and you are finished working, make sure that you log out of the ADRF instead of closing the window. To do this, click the rightmost icon on the top taskbar to open the dropdown menu and select End Session. You will be prompted to double-check that your work is saved and to confirm that you want to end your session.
-
-
-
-
Modifying the Environment
-
-
Establishing Personal Folders
-
Establishing your own personal folders is one of the simplest, yet most important, steps to take when setting up your environment. As we note in the section on Storing Analytic Results, the two possible places to store your analytic results or files are either the U: drive or the P: drive.
-
You will find your personal folder in the U: drive. The folder name will include your Firstname and Lastname, and may additionally include your project workspace number. This is a personal workspace that only you can access in the ADRF.
-
-
-
The U: Drive and the P: Drive
-
The U: drive is your user drive; it’s where you will store any files you are working on. Only the user will have access to the U: drive. For example, if user A wants to share information with user B who is on the same project, user A will need to save files to a P: drive folder and not folders in their U: drive since user B will not be able to access user A’s U: drive.
-
The P: drive is the project drive, which will be used to house project-specific folders. Thus, you and other collaborators on the same project will be able to save files to project drive folders.
-
Both the U: drive and the P: drive have defined resource limitations of 150GB. When a workspace exceeds these limits, users will not be able to create new files or save data. The ADRF will not alert users as they approach the 150GB limit. Users can check their current usage by right-clicking on the user folder and selecting Properties.
-
-
-
Other Modifications
-
The top taskbar contains shortcuts to the command prompt, multiple desktop windows, a temporary folder, settings, full-screen view, and a toggle for multiple monitors.
-
-
-
-
-
-
Windows Settings
-
Your desktop will look familiar if you are a Windows user. You will have icons for quick access to programs or browsers on your desktop. The Windows icon on the bottom left side of the screen will open a menu of programs, folders, and other tools, much as you would see on your own desktop. You will have access to PowerShell and several customization settings (e.g., removing the bottom taskbar).
-
-
-
-
-
Software in the ADRF
-
-
JupyterLab
-
JupyterLab provides flexible building blocks for interactive, exploratory computing. While JupyterLab has many features found in traditional integrated development environments (IDEs), it remains focused on interactive, exploratory computing. For more on JupyterLab, see the interface documentation.
-
The JupyterLab interface on the ADRF consists of a main work area containing tabs of documents and activities, a collapsible left sidebar, and a menu bar. The left sidebar contains a file browser, the list of running terminals and kernels, the table of contents, and the extension manager.
-
-
When using Jupyter Notebooks, make sure that all your work is saved to your U: drive and the correct directory within the U: drive. You can find the active directory by reading the path displayed in the file browser. By default, JupyterLab opens with your U: drive as the base directory. Below, the folder icon in the white box is my user folder (not displayed, but titled Firstname.Lastname; you will have already set up your folder) and subfolder WDQI.
-
-
-
-
-
-
Notebooks
-
Jupyter Notebooks are documents that combine live runnable code with narrative text (Markdown), equations (LaTeX), images, interactive visualizations, and other rich output. You can create a notebook by clicking the blue + button in the file browser and then selecting a kernel (R, Python3, Stata) in the Launcher tab. For more information on getting started with Jupyter Notebooks, see JupyterLab Notebook documentation.
-
-
-
Accessing Stored Data from a Notebook
-
A common question is how to access stored data while writing to and using a Jupyter Notebook. Data in the ADRF are stored in a database using Microsoft SQL Server. For more information on how to access stored data in the ADRF based on choice of program (Python, R, Stata), see the section on Accessing Your Data.
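As a minimal sketch, assuming the pyodbc and pandas packages are installed and an ODBC DSN has been configured for your project (the DSN name and the query below are placeholders; use the names given for your project), a notebook cell could look like this:

import pyodbc
import pandas as pd

# Connect using the ODBC DSN configured for your project (placeholder name)
conn = pyodbc.connect("DSN=your_project_dsn", autocommit=True)

# Read data into a DataFrame; replace schema_name/table_name with your project's names
# and add a row limit appropriate to your database's SQL dialect if the table is large
df = pd.read_sql("SELECT * FROM schema_name.table_name", conn)
print(df.head())

conn.close()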
-
-
-
Python 3
-
Python is a general-purpose programming language. You can access Python in a multitude of ways:
-
-
1. Through the start menu (Windows icon). Type in Python, and a desktop app called Python 3.7 (64-bit) will appear; opening it gives you a window where you can begin programming.
-
-
-
-
2. Through the command prompt in the top taskbar. Once the command prompt window is open, type in python.
-
-
-
-
-
-
3. Through JupyterLab. This is the recommended way to access Python since it has packages installed and available, and an execution environment for testing and running code (as well as a place to write and save code). Open JupyterLab and make sure your directory is set appropriately in the file browser. Once there, in your new Launcher window, click the Python 3 icon.
-
-
-
-
-
4. Through PyCharm.
-
-
-
-
-
-
R
-
R is a general-purpose programming language. You may access R in one of three ways:
-
-
1. Through RStudio. This is an integrated development environment (IDE) for R. You can run R code, display variables, debug R code, do inline visualizations, and more. Open RStudio through the desktop shortcut, or type RStudio in the start menu.
-
2. Through JupyterLab. Open JupyterLab and make sure your directory is set appropriately in the file browser. Once there, in your new Launcher window, click the R icon.
-
-
-
-
-
-
3. Through the R GUI (graphical user interface). Type R in the search bar and click to open the RGui.
-
-
-
-
Stata
-
Stata is a general-purpose statistical software package. Stata can be accessed through the desktop shortcut StataMP 16, by searching for it using the search or menu bar, or through JupyterLab.
-
-
-
DBeaver
-
DBeaver is a universal tool for querying, editing, and managing data stored in Redshift databases. The ADRF stores data using Amazon Redshift. DBeaver can be accessed through the desktop shortcut DBeaver or by looking it up using the search bar.
-
Once open, you will need to connect to a Redshift server. Please follow the directions in the Redshift Querying Guide Appendix 12.1 section of this guide to connect to the appropriate server.
-
-
-
Database Navigator
-
On the left side of DBeaver, a pane labeled Database Navigator allows you to easily explore what is in the server to which you are connected. Clicking the arrow expands each item to show everything within it (databases, schemas, and so on). When exploring these data and writing SQL queries, it is frequently useful to have the navigator expanded so you can more easily see what columns are in each table and their data types; the data type is shown in the screenshot to the right in parentheses next to each column name (e.g., clientid(char64) is a text column of length 64; for our purposes you can ignore the char… and varchar… and simply treat it as text).
-
-
-
-
-
-
SQL Editor
-
The SQL Editor is where you can write your own queries to analyze the data. A new script can be opened by clicking on the blue almost-square icon (it looks a bit like an unrolled scroll) on DBeaver’s toolbar:
-
The location of this script button is circled in red in the upper left of the screenshot below.
-
-
Note: If you use the SQL button to open a new window, it will prompt you to select a data source and enter your username and password.
-
-
-
-
-
-
-
-
Once you have a SQL Editor window open, you can write a query and run it. One option to run a query is to press Ctrl+Enter on the keyboard; another is to click the orange triangle.
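For example, a minimal query to preview a few rows from a table (replace schema_name and table_name with the names used for your project) would be:

SELECT *
FROM schema_name.table_name
LIMIT 10;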
-
-
-
Open a saved .sql File
-
You do not have to create a new script every time! You can open a .sql file either by simply dragging and dropping it from the file explorer, or by going to File → Open File and navigating to a .sql file, as shown in the screenshot below:
-
-
-
-
Once you have done so, the top of your SQL Editor window should name the server connection inside the angle brackets to the left of the filename (<Redshift11_projects>).
-
-
-
-
-
-
LibreOffice
-
LibreOffice is an office productivity suite. LibreOffice comes equipped with six different programs: a word processor program (Writer), a spreadsheet program (Calc), a presentation program (Impress), a graphics editor program (Draw), a math equation program (Math), and a database management program akin to Microsoft Access (Base). LibreOffice may be accessed through its desktop shortcut or by looking it up using the search bar. Once you’ve opened LibreOffice, you can open any of those six programs using the left sidebar. For more information on LibreOffice, visit the LibreOffice website.
-
-
-
-
Once you click on the icon, you’ll see a page with a left sidebar that has a variety of document types under Create. Select the one suited to your needs and double-click to open it.
-
-
-
-
-
-
More…
-
The ADRF provides a number of additional programs, such as a simple text editor (Notepad++), PyCharm (an IDE for Python users), and several web browsers. Please note that web browsers are limited to approved websites only.
-
-
-
-
Available Software
-
The ADRF provides numerous software applications to users. Every user in the ADRF has access to:
-
-
RStudio
-
R
-
Python, through JupyterLab or PyCharm
-
JupyterLab, with R and Python kernels available
-
DBeaver
-
LibreOffice
-
Notepad++
-
MiKTeX
-
Java
-
-
If there is software that you would like to use for your project that is not installed in the ADRF, please email support@coleridgeinitiative.org.
-
-
-
-
\ No newline at end of file
diff --git a/images/db_LDAP.png b/images/db_LDAP.png
new file mode 100644
index 0000000..bc80a25
Binary files /dev/null and b/images/db_LDAP.png differ
diff --git a/images/db_admin_tasks.png b/images/db_admin_tasks.png
new file mode 100644
index 0000000..9e2549d
Binary files /dev/null and b/images/db_admin_tasks.png differ
diff --git a/images/db_application.png b/images/db_application.png
new file mode 100644
index 0000000..05dd023
Binary files /dev/null and b/images/db_application.png differ
diff --git a/images/db_dashboard_tile.png b/images/db_dashboard_tile.png
new file mode 100644
index 0000000..9694afe
Binary files /dev/null and b/images/db_dashboard_tile.png differ
diff --git a/images/db_enter_password.png b/images/db_enter_password.png
new file mode 100644
index 0000000..8a5655f
Binary files /dev/null and b/images/db_enter_password.png differ
diff --git a/images/db_password_reset_project.png b/images/db_password_reset_project.png
new file mode 100644
index 0000000..6348130
Binary files /dev/null and b/images/db_password_reset_project.png differ
diff --git a/images/db_password_window.png b/images/db_password_window.png
new file mode 100644
index 0000000..ef0a084
Binary files /dev/null and b/images/db_password_window.png differ
diff --git a/images/db_reset_password.png b/images/db_reset_password.png
new file mode 100644
index 0000000..cf240ae
Binary files /dev/null and b/images/db_reset_password.png differ
diff --git a/images/db_secure_connect.png b/images/db_secure_connect.png
new file mode 100644
index 0000000..eb760e8
Binary files /dev/null and b/images/db_secure_connect.png differ
diff --git a/images/db_successful_password.png b/images/db_successful_password.png
new file mode 100644
index 0000000..c69c00c
Binary files /dev/null and b/images/db_successful_password.png differ
diff --git a/onboarding.qmd b/onboarding.qmd
index 4d0a736..25d5e4b 100644
--- a/onboarding.qmd
+++ b/onboarding.qmd
@@ -59,7 +59,7 @@ The Terms of Use need to be completed before you are given access to the data an
![](images/start.png){fig-alt="DocuSign Start"}
-5. If you have already configured a signature, click on the yellow "Sign" button to apply it. Otherwise, follow the prompts to configure your electronic signature.
+5. If you have already configured a signature, click on the yellow "Sign" button to apply it. Otherwise, follow the prompts to configure your electronic signature.
![](images/sign.png){fig-alt="DocuSign Sign"}
@@ -85,27 +85,26 @@ The Security Training Video needs to be completed as well. To complete the train
> Note: the "MARK AS COMPLETE" button will not be enabled until at least 5 minutes have passed since the start of the video.
-3. Click on the back arrow in the upper right corner to return to the main tasks panel.
+3. Click on the back arrow in the upper right corner to return to the main tasks panel.
![](images/back_arrow.png){fig-alt="Security Training Video Back Arrow"}
-4. The training video section will now be marked as completed.
+4. The training video section will now be marked as completed.
![](images/st_complete.png){fig-alt="Security Training Video Completed"}
-
### **Complete the Security Training Quiz**
The Security Training Quiz needs to be completed after the Security Training Video. To complete the training, complete the following steps:
-1. Click on the “Security Quiz” tile to load the quiz.
+1. Click on the “Security Quiz” tile to load the quiz.
![](images/st_quiz_tile.png){fig-alt="Security Training Quiz Tile"}
-2. Answer the questions and click on the "SUBMIT RESPONSE" button. You must answer at least four of the questions correctly to complete this task.
+2. Answer the questions and click on the "SUBMIT RESPONSE" button. You must answer at least four of the questions correctly to complete this task.
![](images/submit_response.png){fig-alt="Security Training Quiz Submit Response"}
-3. You will be automatically redirected to the main task panel once the questionnaire has been successfully completed. And the "Security Quiz" will be marked as completed.
+3. You will be automatically redirected to the main task panel once the questionnaire has been successfully completed. And the "Security Quiz" will be marked as completed.
![](images/security_quiz_complete.png){fig-alt="Security Training Quiz Complete"}