diff --git a/automation_python.ipynb b/automation_python.ipynb new file mode 100644 index 0000000..e476c83 --- /dev/null +++ b/automation_python.ipynb @@ -0,0 +1,811 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Style Guides\n", + "\n", + "- Programming languages have style guides that ensure your code is readable, usable, and debuggable by others\n", + "- There are many tools that parse your code to help with debugging, generating documentation, etc. that rely on standards\n", + "- In Python, there is a quasi-official style guide, PEP8, that is followed by most Python developers\n", + " - PEP8: https://www.python.org/dev/peps/pep-0008/\n", + "- Other languages, such as R, have their own style guides\n", + " - Tidyverse: https://style.tidyverse.org/\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# A subset of PEP8\n", + "- We are not going to go through all of PEP8 (or the rest of the many PEPs)\n", + "- However, we can start with a subset of conventions that are common across different languages" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Naming Conventions\n", + "\n", + "There are many common naming conventions. These conventions are not Python-specific and you will see them in code across languages. Here are some examples:\n", + "\n", + "1. **snake_case**: All words are lowercase with underscores between them\n", + "2. **CamelCase**: Words start with capital letters and are not separated\n", + "3. **mixedCase**: Like CamelCase but the first word is lowercase\n", + "4. **UPPERCASE_WITH_UNDERSCORES**: All letters are uppercase, separated by underscores \n", + "\n", + "Python's style guide outlines rules for using naming conventions:\n", + "\n", + "1. Variables: **snake_case**\n", + " - variable_name, dna_sequence\n", + "2. Functions: **snake_case**\n", + " - combine_replicates()\n", + "3. 
Errors: **CamelCase**\n", + " - ValueError, SyntaxError\n", + "\n", + "There are more guidelines, but these are common ones that are encountered early on." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 1: Using naming conventions\n", + "\n", + "Edit the code block below to conform to PEP8 naming conventions. Post your answers in the collaborative document." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def Velocity(TOTALDISTANCE, time):\n", + " \"This calculates the distance over time\"\n", + " Velocity_Result = TOTALDISTANCE / time\n", + " return(Velocity_Result)\n", + "Velocity(10, 2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dangers in variable naming\n", + "\n", + "Aside from making names look professional, there are some rules about naming to help prevent errors and bugs. Specifically, you should never give something the same name as a function built into Python. \n", + "\n", + "Let's use the function `sum()` to see what happens if we use `sum` as a variable name." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sum_of_two_numbers = sum([5, 4])\n", + "print(\"Sum data type\", type(sum))\n", + "sum = 10 + 5\n", + "print(\"Sum data type\", type(sum))\n", + "print(sum_of_two_numbers)\n", + "print(sum)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's try calling the `sum()` function again." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sum([5, 4])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Comments\n", + "\n", + "One documentation tool worth reiterating is the comment. Comments are denoted with `#`; everything written after it on the same line is not run. 
This means we can use comments to help ourselves when we look back at our code, and to help others understand what we are trying to do. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def my_function(x):\n", + " \"Docstring for my_function\"\n", + " # This is a comment.\n", + " # print(x + x)\n", + " # The print function above will not run due to the '#'\n", + " print(x)\n", + "\n", + "my_function(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Comment your code often, especially when you are doing something specialized, such as using a formula or applying a custom function. Some good advice from PEP8:\n", + "\n", + ">Comments that contradict the code are worse than no comments. Always make a priority of keeping the comments up-to-date when the code changes!\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Docstrings\n", + "\n", + "We previously covered docstrings within functions as a string just after the `def` statement. However, what if you want to write **more than just a single line of documentation**? Fortunately, there is a way to do so by using **triple quotes**.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## PEP guidelines on docstrings\n", + "\n", + "Python's docstring conventions (PEP 257) suggest the following format:\n", + "\n", + "\"\"\"One line description\n", + "\n", + "More details about your function, in triple-quotes\n", + "\n", + "\"\"\"\n", + "\n", + "**or**\n", + "\n", + "\"\"\"Only a single line description in triple-quotes\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ... are supplemented by community formats\n", + "\n", + "Docstrings are pretty flexible, even when following PEP standards. 
There are a few common docstring formats that you can choose from. Why start with these?\n", + "\n", + "1. They give a good overview of what information people expect in your docstrings\n", + "2. The formats here can be parsed by common tools\n", + "\n", + "While the formats for documentation may differ slightly depending on the language choice, the information expected from them is fundamental. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Google format\n", + "\"\"\"Takes a string and returns a list of letters\n", + "\n", + "Args:\n", + " string (str): A string to parse for letters\n", + " upper (bool): The letters are returned uppercase \n", + " (default is False)\n", + "\n", + "Returns:\n", + " list: A list of each letter in the string\n", + "\"\"\"\n", + "\n", + "# NumPy format\n", + "\"\"\"Takes a string and returns a list of letters\n", + "\n", + "Parameters\n", + "----------\n", + "string : str\n", + " A string to parse for letters\n", + "upper : bool, optional\n", + " The letters are returned uppercase (default is False)\n", + "\n", + "Returns\n", + "-------\n", + "list\n", + " A list of each letter in the string\n", + "\"\"\"\n", + "\n", + "# reStructuredText format\n", + "\"\"\"Takes a string and returns a list of letters\n", + "\n", + ":param string: A string to parse for letters\n", + ":type string: str\n", + ":param upper: The letters are returned uppercase (default is False)\n", + ":type upper: bool\n", + ":returns: A list of each letter in the string\n", + ":rtype: list\n", + "\"\"\"\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 2.1: Importance of naming and documentation\n", + "\n", + "Given the following function with poor naming and no documentation, determine:\n", + "1. What are the 2 inputs\n", + "2. 
What does it return" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def FUNCTION(number, words):\n", + " Smallest = 0\n", + " for dictionary in number:\n", + " LETTER = (dictionary / words) * 100\n", + " if LETTER > Smallest:\n", + " Smallest = LETTER\n", + " return Smallest" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 2.2: Refactoring a Function\n", + "\n", + "**Refactoring** is a term that means rewriting code without changing the task it performs. \n", + "\n", + "Refactor the following function with poor naming and no documentation. You will want to:\n", + "\n", + "1. Rename variables to appropriate names\n", + "2. Write a docstring explaining what the function does using one of the example formats (Google, NumPy, reStructuredText)\n", + "\n", + "Test your function by running it after you refactor to see if it still produces the same output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def FUNCTION(number, words):\n", + " Smallest = 0\n", + " for dictionary in number:\n", + " LETTER = (dictionary / words) * 100\n", + " if LETTER > Smallest:\n", + " Smallest = LETTER\n", + " return Smallest\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# D.R.Y. Don't Repeat Yourself\n", + "\n", + "DRY is a programming principle: avoid writing redundant code. A good sign that your code is redundant is that you copy and paste parts of it and then edit the new copies. 
\n", + "\n", + "Suppose we had three lists of proteins, wanted to check each one for matches against a list of proteins of interest, and wrote the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "protein_data1 = [\"CREG1\", \"ELK1\", \"SF1\", \"GATA1\", \"GATA3\", \"CREB1\"]\n", + "protein_data2 = [\"ATF1\", \"GATA1\", \"STAT3\", \"P53\", \"CREG1\"]\n", + "protein_data3 = [\"RELA\", \"MYC\", \"SF1\", \"CREG1\", \"GATA3\", \"ELK1\"]\n", + "proteins_of_interest = [\"ELK1\", \"MITF\", \"KAL1\", \"CREG1\"]\n", + "\n", + "# Are there any matches in the first list?\n", + "match_list1 = []\n", + "for protein in protein_data1:\n", + " if protein in proteins_of_interest:\n", + " match_list1.append(protein)\n", + "\n", + "# Are there any matches in the second?\n", + "match_list2 = []\n", + "for protein in protein_data2:\n", + " if protein in proteins_of_interest:\n", + " match_list2.append(protein)\n", + "\n", + "# Are there any matches in the third?\n", + "match_list3 = []\n", + "for protein in protein_data3:\n", + " if protein in proteins_of_interest:\n", + " match_list3.append(protein)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We want to do an analysis on multiple datasets. But there are problems with this approach:\n", + "\n", + "1. You have to copy/paste for each list you want to compare\n", + "2. If you make a change to the analysis, you need to edit every copy\n", + " - Very hard to remain consistent\n", + "3. If you want to compare to another list, you need to either:\n", + " - Edit every copy\n", + " - Change the variable proteins_of_interest\n", + " - May affect your analysis somewhere else\n", + "\n", + "To prevent repeating code, we can write a function to do our repeated task. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 3: Refactor an Analysis Into a Function\n", + "\n", + "Write a function that does the repeated task and run it on the three lists. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Your function here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A Helpful Way To Format Strings\n", + "\n", + "In the next sections, we are going to work on custom error messages and defensive programming. Having a good way to format strings for these messages will make our job easier. Previously we concatenated strings using the `+` operator. \n", + "\n", + "Suppose we have an integer and we want to add it to a string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we directly add a string and an integer, we get a `TypeError`. One way to neatly get around this is to use what is called an **f-string**. This is a Python-specific feature that takes care of formatting the string for us. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example with f-strings\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Errors and Defensive Programming in Python\n", + "\n", + "One fundamental concept in programming is troubleshooting errors. When something goes wrong in Python, it can report a number of different kinds of errors. 
For example, look at the error message when the following code is run:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(hello)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that in the command `print(hello)`, when `hello` is not defined, Python raises something called a `NameError`. Python has many different types of errors built in, including:\n", + "\n", + "- NameError\n", + "- ZeroDivisionError\n", + "- TypeError\n", + "- IndexError\n", + "- ... and more!\n", + "\n", + "These errors are also referred to as **exceptions**. More information on the other exceptions built into Python is here: https://docs.python.org/3/library/exceptions.html \n", + "\n", + "We can also raise an error on purpose using the `raise` keyword with a custom message. This message is part of the **Traceback**, which is a report Python gives on what happened. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Example\n", + "raise NameError(\"My custom error message\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that for errors we do not raise ourselves, both the error class and the message are predetermined. This is sometimes useful, but in some situations it might not be. 
Suppose you wrote a reverse complement function like the one below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def reverse_complement(dna_sequence):\n", + " \"\"\"Reverses the complement of a dna sequence\"\"\"\n", + " complements = {\"T\":\"A\", \"A\":\"T\", \"C\":\"G\", \"G\":\"C\"}\n", + " reverse = dna_sequence[::-1]\n", + " result = \"\"\n", + " for letter in reverse:\n", + " result = result + complements[letter]\n", + " return(result)\n", + "\n", + "help(reverse_complement)\n", + "print(reverse_complement(\"CAAT\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Suppose someone imports our function from a script we wrote (we will do this later in the lesson). Maybe they run our function with a sequence that contains lowercase letters (perhaps a masked genomic region)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "reverse_complement(\"CACGtgcatggTGAAA\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For a user, this is a really confusing error. The function is supposed to return the reverse complement, but it did not. They check the error and all it says is:\n", + "\n", + "\"KeyError: 'g'\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Try and Except Keywords\n", + "\n", + "One way we can tackle this is through `try` and `except`. Python tries to run the code that is indented under `try`. 
If we expect a particular error, we can write our own response to it under `except`.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#Example\n", + "\n", + "try:\n", + " print(hello)\n", + "except NameError:\n", + " print(\"A custom error message\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the error here is handled under the `except` keyword, so the program actually keeps going. We can see this by adding `print()` statements before and after." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example with print() before and after\n", + "print(\"Before\")\n", + "try:\n", + " print(hello)\n", + "except NameError:\n", + " print(\"A custom error message\")\n", + "print(\"After\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we want the program to stop, there are a couple of ways to do so. A good way is to use `raise` within the code under `except`. This will print the custom message, then re-raise the original error, just as if we had not used `try` and `except`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example with raise added within the except code\n", + "print(\"Before\")\n", + "try:\n", + " print(hello)\n", + "except NameError:\n", + " print(\"A custom error message\")\n", + " raise\n", + "print(\"After\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Why do this?\n", + "\n", + "1. If our program is getting input that is causing an error, we want the program to **fail fast**. \n", + "2. We want to **avoid** returning **incorrect** or **unexpected** results. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Printing Error Messages\n", + "\n", + "Before we use our new keywords, this would be a good time to go over output for errors. When we print something in Jupyter or in the command line, by default it goes to a destination called `stdout` or **standard output**. There is another destination, separate from `stdout`, for errors and warnings called `stderr` or **standard error**. It is dangerous to print error messages to `stdout` because some workflows use it for data and the error messages can get mixed in, for example when piping output in the command line. It is also just good practice to send errors to `stderr`. \n", + "\n", + "We can set a destination for our `print()` call using the **file** argument. By default it is **sys.stdout**. We will need to import the **sys** module to send to `stderr`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Help for print\n", + "help(print)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import sys, compare stdout vs stderr\n", + "import sys\n", + "\n", + "print(\"Hello this is going to stdout\", file=sys.stdout)\n", + "print(\"Hello this is going to stderr\", file=sys.stderr)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Jupyter, `stderr` has a red background and `stdout` has no background color. \n", + "\n", + "**Note**: The **traceback** messages shown when an error occurs are going to `stderr`. In Jupyter, they are formatted differently than other output to `stderr`. \n", + "\n", + "Back to the observation that `reverse_complement` can give cryptic error messages. Let's add a `try` statement and place the `for` loop in it. 
We can then use `except` with our expected `KeyError` to print a different message and `raise` to both **fail fast** and show the full **traceback**. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Edit the function below!\n", + "def reverse_complement(dna_sequence):\n", + " \"\"\"Reverses the complement of a dna sequence\"\"\"\n", + " complements = {\"T\":\"A\", \"A\":\"T\", \"C\":\"G\", \"G\":\"C\"}\n", + " reverse = dna_sequence[::-1]\n", + " result = \"\"\n", + " # Add try - except - raise statements\n", + " for letter in reverse:\n", + " result = result + complements[letter]\n", + " return(result)\n", + "\n", + "help(reverse_complement)\n", + "print(reverse_complement(\"CAAg\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Alternatively, we can check the input against the dictionary and raise an error ourselves. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def reverse_complement(dna_sequence):\n", + " \"\"\"Reverses the complement of a dna sequence\"\"\"\n", + " complements = {\"T\":\"A\", \"A\":\"T\", \"C\":\"G\", \"G\":\"C\"}\n", + " reverse = dna_sequence[::-1]\n", + " result = \"\"\n", + " for letter in reverse:\n", + " # Check that letter is valid, if not raise an Error.\n", + " result = result + complements[letter]\n", + " return(result)\n", + "\n", + "print(reverse_complement(\"CAAg\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another problem is when programs produce incorrect results instead of producing an error. 
Suppose we have a function that prints all kmers of a given k from a sequence:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def kmers_from_sequence(dna_sequence, k):\n", + " \"\"\"Prints all kmers from a sequence\n", + " \"\"\"\n", + " # Formula for number of kmers\n", + " positions = len(dna_sequence) - k + 1\n", + " for i in range(positions):\n", + " kmer = dna_sequence[i:i + k]\n", + " print(kmer)\n", + " \n", + "help(kmers_from_sequence) \n", + "kmers_from_sequence(\"CACGTGACTAG\", 3)\n", + "print(\"After the function\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "kmers_from_sequence(\"CACGTGACTAG\", -3)\n", + "print(\"After the function\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can sanitize the inputs to solve this. The value of `k` should be greater than 0 and no greater than the length of the sequence." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 4: Sanitize Input\n", + "\n", + "Refactor the following function to check that the value of `k` is:\n", + "- A positive number\n", + "- Not longer than the length of `dna_sequence`\n", + "\n", + "If there is a problem, `raise` a `ValueError` with an appropriate message. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example:\n", + "def kmers_from_sequence(dna_sequence, k):\n", + " \"\"\"Prints all kmers from a sequence\n", + " \"\"\"\n", + " # Write code to check input here!\n", + " \n", + " positions = len(dna_sequence) - k + 1\n", + " for i in range(positions):\n", + " kmer = dna_sequence[i:i + k]\n", + " print(kmer)\n", + "\n", + "kmers_from_sequence(\"CAATCGACGTA\", 12) # Should raise a ValueError" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Making Scripts You Can Import\n", + "\n", + "So far, we have used modules to help us work on our analyses such as:\n", + "- Standard Library\n", + " - sys\n", + "- Third Party\n", + " - pandas\n", + " - numpy\n", + " - matplotlib\n", + " - seaborn\n", + "\n", + "These are imported using the `import` keyword and we can use functions from them. We also write functions for use in our own code. Having these available to import into other scripts has two benefits:\n", + "1. Letting us reuse code over multiple analyses (DRY)\n", + "2. Letting others use our code in their own scripts without copy/pasting (DRY)\n", + "\n", + "While it may seem like extra work to write both a module and an analysis script, a single Python file can act as both: it can be imported as a module and run from the command line to perform a task. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Start new .ipynb and .py for demo." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}