diff --git a/config.yaml b/config.yaml index 3c36955c..1c7fa785 100644 --- a/config.yaml +++ b/config.yaml @@ -64,16 +64,19 @@ episodes: - 02-fair-research-software.md - 03-tools.md - 04-version-control.md -- 05-code-readability.md -- 06-code-testing.md -- 07-documenting-code.md -- 08-open-collaboration.md -- 09-code-ethics.md -- 10-wrap-up.md +- 05-code-environment.md +- 06-code-readability.md +- 07-code-structure.md +- 08-code-correctness.md +- 09-code-documentation.md +- 10-open-collaboration.md +- 11-wrap-up.md # Information for Learners -learners: +learners: +- ci-for-testing.md - licensing.md +- ethical-environmental-considerations.md # Information for Instructors instructors: diff --git a/episodes/03-tools.md b/episodes/03-tools.md index e9d95ca8..b737324d 100644 --- a/episodes/03-tools.md +++ b/episodes/03-tools.md @@ -1,5 +1,5 @@ --- -title: "Tools and practices for research software development" +title: "Tools and practices for FAIR research software development" teaching: 60 exercises: 30 --- @@ -27,23 +27,7 @@ Here we will give an overview of the tools, how they help you achieve the aims o they work together. In later episodes we will describe some of these tools in more detail. -The table below summarises some tools and practices that can help with each of the FAIR software principles. - -| Tools and practices | Findable | Accessible | Interoperable | Reusable | -|--------------------------------------------------------------------------------------------------|----------|------------|---------------|----------| -| Integrated development environments (e.g. VS Code) - development environments (run, test, debug) | | | | x | -| Command line terminal (e.g. Bash)- reproducible workflows/pipelines | | | x | x | -| Version control tools | x | | | | -| Testing | | x | | x | -| Coding conventions and documentation | | x | x | x | -| License | | x | | x | -| Citation | x | | | x | -| Software repositories (e.g. 
GitHub) | x | x | | | - - -## Writing your code - -### Development environment +## Development environment One of the first choices we make when writing code is what tool to use to do the writing. You can use a simple text editor such as Notepad, a terminal based editor with syntax highlighting such as Vim or Emacs, @@ -78,7 +62,7 @@ It is a single tool in which we can: Use VS Code to open the Python script and the data file from our project. -### Command line tool/shell +## Command line tool/shell In VS Code and similar IDEs you can often run the code by clicking a button or pressing some keyboard shortcut. If you gave your code to a colleague or collaborator they might use the same IDE or something different, @@ -103,7 +87,7 @@ without editing your code. With a command line interface, your code can be built into automated workflows so that the whole process from data gathering to analysis to saving publication-quality results can be written in one Bash script and saved and reused. -### Version control +## Version control Version control means knowing what changes were made to your code and when. Many people who have worked on large documents such as essays start doing this by saving files called `essay_draft`, `essay_version1.doc`, @@ -121,7 +105,10 @@ check if it is a change in the code due to using a newer version, or a change in We will be using the Git version control system, which can be used through the command line terminal, in a browser or in a desktop application. -### Testing +## Code structure and style guidelines +TODO + +## Code correctness Testing ensures that your code is correct and does what it is set out to do. When you write code you often feel very confident that it is perfect, but when writing bigger codes or code that is meant to do complex operations it is @@ -132,7 +119,7 @@ be assured that it does work correctly on their machine. We will show different ways to test your code for different purposes. 
You need to think about what it is that is important to you and any future users or collaborators to decide what
kind of testing is most useful for you.

-### Documentation
+## Documentation

Documentation comes in many forms - from the names of variables and functions in your code, additional comments
that explain some lines, up to a whole website full of documentation with function definitions, usage examples,
@@ -140,7 +127,7 @@ tutorials and guides.
You may not need as much documentation as a large commercial software product, but making your code Reusable relies
on other people being able to understand what your code does and how to use it.

-### Licences and citation
+## Licences and citation

A licence states what people can legally do with your code, and what restrictions you have placed on it.
Whenever you want to use someone else's code you should check what licence it has and make sure your use is legal.
@@ -156,7 +143,7 @@ so remember to check with a local team where you are.
Your local research IT support or library teams would be a good place to start.

-### Code repositories and registries
+## Code repositories and registries

Having somewhere to share your code is fundamental to making it Findable.
Your institution might have a code repository, your research field may have a practice of sharing code via a specific
@@ -167,32 +154,26 @@ depend on or reuse.

We will discuss later how to share your code on GitHub and make it easy for others to find and use.
-## Summary +## Summary of tools & practices -### Findable - -- Describe your software - README -- Software repository/registry - GitHub, registries -- Unique persistent identifier - GitHub commits/tags/releases, Zenodo - -### Accessible - -- Software repository/registry -- License -- Language and dependencies - -### Interoperable - -- Explain functionality - readme, inline comments and documentation -- Standard formats -- Communication protocols - CLI/API +The table below summarises some tools and practices that can help with each of the FAIR software principles. -### Reusable +| Tools and practices | Findable | Accessible | Interoperable | Reusable | +|---------------------------------------------------------------------------------------------------|----------|------------|---------------|----------| +| Virtual development environments, programming language and dependencies - run, test, debug, share | | x | | x | +| Integrated development environments/IDEs (e.g. VS Code, PyCharm) - run, test, debug | | | | x | +| Command line terminal (e.g. Bash, GitBash) - reproducible workflows/pipelines | | | x | x | +| Version control tools | x | | | | +| Testing - code correctness and reproducibility | | x | | x | +| Coding conventions and documentation | | x | x | x | +| Explaining functionality/installation/running - README, inline comments and documentation | | x | x | x | +| Standard formats - e.g. for data exchange (CSV, YAML) | | x | x | x | +| Communication protocols - Command Line Interface (CLI) or Application Programming Interface (API) | | x | x | x | +| License | | x | | x | +| Citation | x | | | x | +| Software repositories (e.g. GitHub, PyPi) or registries (e.g. BioTools) | x | x | | | +| Unique persistent identifier (e.g. 
DOIs, commits/tags/releases) - Zenodo, FigShare GitHub | x | x | | | -- Document functionality/installation/running -- Follow best practices where appropriate -- License -- Citation ## Checking your setup diff --git a/episodes/04-version-control.md b/episodes/04-version-control.md index 3fc23754..bcfc39ab 100644 --- a/episodes/04-version-control.md +++ b/episodes/04-version-control.md @@ -26,7 +26,7 @@ how this tool assists us in producing reproducible and sustainable scientific pr We will create a new software project from our existing code, make some changes to it and track them with version control, and then push those changes to a remote server for safe-keeping. -### What is a version control system? +## What is a version control system? Version control is the practice of tracking and managing changes to files. Version control systems are software tools that assist in the management of these @@ -34,7 +34,7 @@ file changes over time. They keep track of every modification to the files in a special database that allows users to "travel through time" and compare earlier versions of the files with the current state. -### Motivation for using a version control system +## Why use a version control system? The main motivation as scientists to use version control in our projects is for reproducibility purposes. As hinted to above, by tracking and storing every change @@ -51,7 +51,7 @@ Later on in this workshop, we will also see how using a version control system allows many people to collaborate on the same project without a lot of manual effort to combine different items of work. -### Git version control system +## Git version control system Git is one of the version control systems around and the one we will be using in this course. 
It is primarily used for source code management in software development but it can be used to track changes in files @@ -92,7 +92,6 @@ Git is a distributed version control system allowing for multiple people to be w Initially, we will use Git to start tracking changes to files on our local machines; later on we will start sharing our work on GitHub allowing other people to see and contribute to our work. - ### Create a new repository Create a new directory in the `Desktop` folder for our work, and then change the current working directory @@ -309,8 +308,6 @@ $ git commit -m "Replace spaces in Python filename with underscores" rename my code v2.py => my_code_v2.py (100%) ``` - - ### Advanced solution We initially renamed the Python file using the `mv` command, and we than had to `git add` *both* `my_code_v2.py` @@ -533,7 +530,7 @@ methods: Using `git reset` command produces a "cleaner" history, but does not tell the full story and your work. -### Pushing to a Git server +## Interacting with a remote Git server Git is also a distributed version control system, allowing us to synchronise work between any two or more copies of the same repository - the ones that are not located on your machine. @@ -548,13 +545,15 @@ machines. ![Git - distributed version control system, image from W3Docs (freely available)](episodes/fig/git-distributed.png){alt='2 Git repositories belonging to 2 different developers linked to a central repository and one another showing two way flow of information in each link'} +[GitHub][github] is an online software development platform that can act as a central remote server. +It uses Git underneath and provides facilities for storing, tracking, and collaborating on software projects. +Other Git hosting services are available, such as [GitLab](https://gitlab.com) and [Bitbucket](https://bitbucket.org). 
+ Distributing our projects in this way also opens us up to collaboration, since colleagues would be able to access our projects, make their own copies on their machines, and conduct their own work. -We will now go through how to push a local project to [GitHub](https://github.com), -though other Git hosting services are available, such as [GitLab](https://gitlab.com) -and [Bitbucket](https://bitbucket.org). +We will now go through how to push a local project on [GitHub](https://github.com) and share it publicly. 1. In your browser, navigate to and sign into your account 2. In the top right hand corner of the screen, there is a menu labelled "+" with diff --git a/episodes/05-code-environment.md b/episodes/05-code-environment.md new file mode 100644 index 00000000..ab27995f --- /dev/null +++ b/episodes/05-code-environment.md @@ -0,0 +1,382 @@ +--- +title: Reproducible development environment +teaching: 30 +exercises: 0 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- What are virtual environments in software development and why use them? +- How can we manage Python virtual coding environments and external (third-party) libraries on our machines? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +After completing this episode, participants should be able to: + +- Set up a Python virtual coding environment for a software project using `venv` and `pip`. + +:::::::::::::::::::::::::::::::::::::::::::::::: + +So far we have created a local Git repository to track changes in our software project and pushed it to GitHub +to enable others to see and contribute to it. + +We now want to start developing the code further. 
If we have a look at our script, we may notice a few `import` lines like the following:

```python
import json
```

```python
import csv
```

```python
import datetime as dt
```

```python
import matplotlib.pyplot as plt
```

This means that our code relies on four **libraries** -
`json`, `csv`, `datetime` and `matplotlib` -
some of which are external (third-party) packages, also called dependencies.

Python applications often use external libraries that do not come as part of the standard Python distribution.
This means that you will have to use a **package manager** tool to install them on your system.
Applications will also sometimes need a
specific version of an external library
(e.g. because they were written to work with a feature, class
or function that may have been updated in more recent versions),
or a specific version of the Python interpreter.
This means that each Python application you work with may require a different setup
and a different set of dependencies, so it is useful to be able to keep these configurations
separate to avoid confusion between projects.
The solution to this problem is to create a self-contained
**virtual environment** per project,
which contains a particular version of the Python installation
plus a number of additional external libraries.

## Virtual development environments

So what exactly are virtual environments, and why use them?

A Python virtual environment helps us create an **isolated working copy** of a software project
that uses a specific version of the Python interpreter
together with specific versions of a number of external libraries
installed into that virtual environment.
Python virtual environments are implemented as
directories with a particular structure within software projects,
containing links to specified dependencies and
allowing isolation from other software projects on your machine that may require
different versions of Python or external libraries.
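This isolation is visible from inside Python itself: in an activated virtual environment, `sys.prefix` points at the environment's directory, while `sys.base_prefix` still points at the base Python installation from which it was created. A small sketch, using only the standard library, that reports which interpreter is running:

```python
import sys

def in_virtual_environment():
    """Report whether the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment's directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"In a virtual environment: {in_virtual_environment()}")
```

Running this with a virtual environment activated should report `True`; running it with the base system Python reports `False`.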
It is recommended to create a separate virtual environment for each project.
That way, changes to the environment of the project you are currently working on
cannot affect other projects - different projects on your machine can use different Python versions
and different versions of the same third-party dependency, independently of one another.

Another big motivation for using virtual environments is that they make sharing your code with others much easier -
as we will see shortly, you can record your virtual environment in a special file and share it with your collaborators,
who can then recreate the same development environment on their machines.

Most of the time you do not have to worry about the specific versions of the external libraries
your project depends on.
Virtual environments let you use the latest available version of a package without specifying it explicitly,
and equally let you pin a specific older version for your project, should you need to.

## Managing virtual environments

There are several command line tools for managing Python virtual environments - we will use `venv`,
included by default in the standard Python distribution since Python 3.3.

Part of managing your (virtual) working environment involves
installing, updating and removing external packages on your system.
The Python package manager tool `pip` is most commonly used for this -
it obtains packages from the central repository called the
[Python Package Index (PyPI)](https://pypi.org/).

So, we will use `venv` and `pip` in combination to help us create and share our virtual development environments.
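As an aside, `venv` can also be driven from Python code rather than the command line, via the standard library's `venv.EnvBuilder` class - handy if you ever need to script environment setup. A minimal sketch (the environment name and temporary location here are purely illustrative):

```python
import tempfile
import venv
from pathlib import Path

def create_environment(env_dir):
    """Create a virtual environment at env_dir and return the path to its pyvenv.cfg."""
    # with_pip=False keeps this example fast; with_pip=True would also
    # bootstrap pip into the new environment, as `python -m venv` does.
    builder = venv.EnvBuilder(with_pip=False)
    builder.create(env_dir)
    return Path(env_dir) / "pyvenv.cfg"

# Demonstrate on a throwaway directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp:
    config_file = create_environment(Path(tmp) / "venv_demo")
    print(config_file.exists())  # pyvenv.cfg is the marker file every venv contains
```

Pass `with_pip=True` if you want `pip` bootstrapped into the new environment, matching the command-line default.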
## Creating virtual environments

Creating a virtual environment with `venv` is done by executing the following command:

```bash
$ python -m venv /path/to/new/virtual/environment
```

where `/path/to/new/virtual/environment` is the path to the directory where you want the virtual environment
to be created - conventionally within your software project, so that they are co-located.
This will create the target directory for the virtual environment.

For our project, let's create a virtual environment called "venv_spacewalks" from our project's root directory.

Firstly, ensure you are located within the project's root directory:

```bash
$ cd /path/to/spacewalks
```

```bash
$ python -m venv venv_spacewalks
```

If you list the contents of the newly created directory "venv_spacewalks", on a Mac or Linux system
(slightly different on Windows, as explained below) you should see something like:

```bash
$ ls -l venv_spacewalks
```

```output
total 8
drwxr-xr-x  12 alex  staff  384  5 Oct 11:47 bin
drwxr-xr-x   2 alex  staff   64  5 Oct 11:47 include
drwxr-xr-x   3 alex  staff   96  5 Oct 11:47 lib
-rw-r--r--   1 alex  staff   90  5 Oct 11:47 pyvenv.cfg
```

So, running the `python -m venv venv_spacewalks` command created the target directory called "venv_spacewalks"
containing:

- a `pyvenv.cfg` configuration file
  with a home key pointing to the Python installation from which the command was run,
- a `bin` subdirectory (called `Scripts` on Windows)
  containing a symlink to (or copy of) the Python interpreter binary used to create the environment,
- a `lib/pythonX.Y/site-packages` subdirectory (called `Lib\site-packages` on Windows)
  to contain the environment's own independent set of installed Python packages, isolated from other projects, and
- various other configuration and supporting files and subdirectories.

Once you've created a virtual environment, you will need to activate it.
On Mac or Linux, activation is done as:

```bash
$ source venv_spacewalks/bin/activate
(venv_spacewalks) $
```

On Windows, recall that the virtual environment contains a `Scripts` directory instead of `bin`,
so activation is done as:

```bash
$ source venv_spacewalks/Scripts/activate
(venv_spacewalks) $
```

Activating the virtual environment will change your command line's prompt
to show what virtual environment you are currently using
(indicated by its name in round brackets at the start of the prompt),
and modify the environment so that running Python will get you
the particular version of Python configured in your virtual environment.

You can verify that you are using your virtual environment's version of Python
by checking its path with the command `which`:

```bash
(venv_spacewalks) $ which python
```

When you're done working on your project, you can exit the environment with:

```bash
(venv_spacewalks) $ deactivate
```

If you have just run `deactivate`, make sure you reactivate the environment ready for the next part:

```bash
$ source venv_spacewalks/bin/activate
(venv_spacewalks) $
```

Note that, since our software project is being tracked by Git,
the newly created virtual environment will show up in version control -
we will see how to handle it using Git in one of the subsequent episodes.

## Installing external packages

We noticed earlier that our script imports four libraries -
`json`, `csv`, `datetime` and `matplotlib`.
The `json`, `csv` and `datetime` modules are all part of the standard Python library,
so there is no need to install them separately - although you still need to import them in any
script that uses them.
However, we do need to install `matplotlib`, as it does not come as standard with the Python distribution.
(Note that the `datetime` package we install below actually fetches a separate third-party distribution
named `DateTime` from PyPI - our script does not strictly need it, but installing it lets us practise
managing more than one dependency with `pip`.)
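Before reaching for `pip`, you can check from within Python which modules are already importable - a quick sketch using the standard library's `importlib.util.find_spec`, applied to the four modules our script imports:

```python
from importlib.util import find_spec

# find_spec returns None for top-level modules that are not available
for module in ("json", "csv", "datetime", "matplotlib"):
    status = "found" if find_spec(module) is not None else "missing - install it with pip"
    print(f"{module}: {status}")
```

In a freshly created environment, `matplotlib` will typically show as missing until we install it below, while the standard library modules are always found.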
To install the latest version of a package with `pip`,
you use pip's `install` command and specify the package's name, e.g.:

```bash
(venv_spacewalks) $ python -m pip install datetime
(venv_spacewalks) $ python -m pip install matplotlib
```

or, to install multiple packages at once:

```bash
(venv_spacewalks) $ python -m pip install datetime matplotlib
```

The above commands install the packages `datetime` and `matplotlib` into our currently active `venv_spacewalks`
environment and will not affect any other Python projects we may have on our machines.

If you run the `python -m pip install` command on a package that is already installed,
`pip` will notice this and do nothing.

To install a specific version of a Python package,
give the package name followed by `==` and the version number,
e.g. `python -m pip install matplotlib==3.5.3`.

To specify a minimum version of a Python package,
you can do `python -m pip install 'matplotlib>=3.5.1'`
(note the quotes, which stop the shell from interpreting `>` as output redirection).

To upgrade a package to the latest version, use the `--upgrade` flag,
e.g. `python -m pip install --upgrade matplotlib`.

To display information about a particular installed package, do:

```bash
(venv_spacewalks) $ python -m pip show matplotlib
```
```output
Name: matplotlib
Version: 3.9.0
Summary: Python plotting package
Home-page: 
Author: John D. Hunter, Michael Droettboom
Author-email: Unknown
License: License agreement for matplotlib versions 1.3.0 and later
=========================================================
...
Location: /opt/homebrew/lib/python3.11/site-packages
Requires: contourpy, cycler, fonttools, kiwisolver, numpy, packaging, pillow, pyparsing, python-dateutil
Required-by: 
```

To list all packages installed with `pip` (in your current virtual environment):

```bash
(venv_spacewalks) $ python -m pip list
```
```output
Package         Version
--------------- -----------
contourpy       1.2.1
cycler          0.12.1
DateTime        5.5
fonttools       4.53.1
kiwisolver      1.4.5
matplotlib      3.9.2
numpy           2.0.1
packaging       24.1
pillow          10.4.0
pip             23.3.1
pyparsing       3.1.2
python-dateutil 2.9.0.post0
pytz            2024.1
setuptools      69.0.2
six             1.16.0
zope.interface  7.0.1
```

To uninstall a package installed in the virtual environment, do: `python -m pip uninstall <package-name>`.
You can also supply a list of packages to uninstall at the same time.

## Sharing virtual environments

You are collaborating on a project with a team so, naturally,
you will want to share your environment with your collaborators
so they can easily 'clone' your software project with all of its dependencies,
and everyone can replicate equivalent virtual environments on their machines.
`pip` has a handy way of exporting, saving and sharing virtual environments.

To export your active environment,
use the `python -m pip freeze` command to produce a list of packages installed in the virtual environment.
A common convention is to put this list in a `requirements.txt` file in your project's root directory:

```bash
(venv_spacewalks) $ python -m pip freeze > requirements.txt
(venv_spacewalks) $ cat requirements.txt
```
```output
contourpy==1.2.1
cycler==0.12.1
DateTime==5.5
fonttools==4.53.1
kiwisolver==1.4.5
matplotlib==3.9.2
numpy==2.0.1
packaging==24.1
pillow==10.4.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
pytz==2024.1
six==1.16.0
zope.interface==7.0.1
```

The first of the above commands will create a `requirements.txt` file in your current directory.
Yours may look a little different,
depending on the versions of the packages you have installed,
as well as any differences in the packages that they themselves use.

The `requirements.txt` file can then be committed to a version control system
(we will see how to do this using Git in a moment)
and shipped as part of your software and shared with collaborators and/or users.

Note that you only need to share the small `requirements.txt` file with your collaborators - not the entire
`venv_spacewalks` directory with the packages contained in your virtual environment.
We need to tell Git to ignore that directory, so it is not tracked and shared - we do this by creating a file
`.gitignore` in the root directory of our project and adding a line `venv_spacewalks` to it.

Let's now put `requirements.txt` under version control and share it along with our code.

```bash
(venv_spacewalks) $ git add requirements.txt
(venv_spacewalks) $ git commit -m "Initial commit of requirements.txt."
(venv_spacewalks) $ git push origin main
```

Your collaborators or users of your software can now download your software's source code and replicate the same
virtual software environment for running your code on their machines, using `requirements.txt` to install all
the required packages.

To recreate a virtual environment from `requirements.txt`, a collaborator would first create and activate a fresh
virtual environment (as shown above) and then, from the project root, run:

```bash
(venv_spacewalks) $ python -m pip install -r requirements.txt
```

As your project grows, you may need to update your environment for a variety of reasons, e.g.:

- one of your project's dependencies has just released a new version (dependency version number update),
- you need an additional package for data analysis (adding a new dependency), or
- you have found a better package and no longer need the older package
(adding a new and removing an old dependency).
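As an aside, the pinned versions that `pip freeze` reports can also be read from within Python itself using the standard library's `importlib.metadata` - a rough, freeze-like sketch:

```python
from importlib import metadata

def freeze():
    """Return sorted 'name==version' strings for installed distributions, pip-freeze style."""
    return sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    )

for line in freeze():
    print(line)
```

This lists whatever is visible to the running interpreter, so run it from within the activated environment to see the same packages `pip freeze` would report there.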
+ +What you need to do in this case (apart from installing the new and removing the packages that are no longer needed +from your virtual environment) is update the contents of the `requirements.txt` file accordingly +by re-issuing `pip freeze` command and propagate the updated `requirements.txt` file to your collaborators +via your code sharing platform. + +## Further reading + +We recommend the following resources for some additional reading on the topic of this episode: + +- [Official Python Documentation: Virtual Environments and Packages](https://docs.python.org/3/tutorial/venv.html) + +Also check the [full reference set](learners/reference.md#litref) for the course. + + +:::::: keypoints +- Virtual environments keep Python versions and dependencies required by different projects separate. +- A Python virtual environment is itself a directory structure. +- You can use `venv` to create and manage Python virtual environments, and `pip` to install and manage Python +external (third-party) libraries. +- By convention, you can save and export your Python virtual environment in a `requirements.txt` in your project's root +directory, which can then be shared with collaborators/users and used to replicate your virtual environment elsewhere. +:::::: \ No newline at end of file diff --git a/episodes/05-code-readability.md b/episodes/05-code-readability.md deleted file mode 100644 index d48784df..00000000 --- a/episodes/05-code-readability.md +++ /dev/null @@ -1,797 +0,0 @@ ---- -title: Code readability -teaching: 60 -exercises: 30 ---- - -::: questions -- Why does code readability matter? -- How can I organise my code to be more readable? -- What types of documentation can I include to improve the readability of my code? 
-::: - -::: objectives -After completing this episode, participants should be able to: - -- Organise code into reusable functions that achieve a singular purpose -- Choose function and variable names that help explain the purpose of the function or variable -- Write informative inline comments and docstrings to provide more detail about what the code is doing -::: - -In this episode, we will introduce the concept of readable code and consider how it can help create reusable -scientific software and empower collaboration between researchers. - -## Motivation for code readability - -When someone writes code, they do so based on requirements that are likely to change in the future. -Requirements change because software interacts with the real world, which is dynamic. -When these requirements change, the developer (who is not necessarily the same person who wrote the original code) -must implement the new requirements. -They do this by reading the original code to understand the different abstractions, and identify what needs to change. -Readable code facilitates the reading and understanding of the abstraction phases and, as a result, facilitates the -evolution of the codebase. -Readable code saves future developers' time and effort. - -In order to develop readable code, we should ask ourselves: "If I re-read this piece of code in fifteen days or one -year, will I be able to understand what I have done and why?" -Or even better: "If a new person who just joined the project reads my software, will they be able to understand -what I have written here?" - -We will now learn about a few software best practices we can follow to help create readable code. - -## Code layout - -Our script currently places import statements throughout the code. -Python convention is to place all import statements at the top of the script so that dependant libraries -are clearly visible and not buried inside the code (even though there are standard ways of describing dependencies - -e.g. 
using `requirements.txt` file). -This will help readability (accessibility) and reusability of our code. - -Our code after the modification should look like the following. - -```python -import json -import csv -import datetime as dt -import matplotlib.pyplot as plt - -# https://data.nasa.gov/resource/eva.json (with modifications) -data_f = open('./eva-data.json', 'r') -data_t = open('./eva-data.csv','w') -g_file = './cumulative_eva_graph.png' -fieldnames = ("EVA #", "Country", "Crew ", "Vehicle", "Date", "Duration", "Purpose") - -data=[] - -for i in range(374): - line=data_f.readline() - print(line) - data.append(json.loads(line[1:-1])) -#data.pop(0) -## Comment out this bit if you don't want the spreadsheet - -w=csv.writer(data_t) - -time = [] -date =[] - -j=0 -for i in data: - print(data[j]) - # and this bit - w.writerow(data[j].values()) - if 'duration' in data[j].keys(): - tt=data[j]['duration'] - if tt == '': - pass - else: - t=dt.datetime.strptime(tt,'%H:%M') - ttt = dt.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second).total_seconds()/(60*60) - print(t,ttt) - time.append(ttt) - if 'date' in data[j].keys(): - date.append(dt.datetime.strptime(data[j]['date'][0:10], '%Y-%m-%d')) - #date.append(data[j]['date'][0:10]) - - else: - time.pop(0) - j+=1 - -t=[0] -for i in time: - t.append(t[-1]+i) - -date,time = zip(*sorted(zip(date, time))) - -plt.plot(date,t[1:], 'ko-') -plt.xlabel('Year') -plt.ylabel('Total time spent in space to date (hours)') -plt.tight_layout() -plt.savefig(g_file) -plt.show() -``` - -Let's make sure we commit our changes. - -```bash -git add eva_data_analysis.py -git commit -m "Move import statements to the top of the script" -``` - -```output -[main a97a9e1] Move import statements to the top of the script - 1 file changed, 4 insertions(+), 4 deletions(-) -``` - -## Standard libraries - -Our script currently reads the data line-by-line from the JSON data file and uses custom code to manipulate -the data. 
-Variables of interest are stored in lists but there are more suitable data structures (e.g. data frames) -to store data in our case. -By choosing custom code over standard and well-tested libraries, we are making our code less readable and understandable -and more error-prone. - -The main functionality of our code can be rewritten as follows using `Pandas` library to load and manipulate the data -in data frames. - -```python -import pandas as pd -import matplotlib.pyplot as plt - - -data_f = './eva-data.json' -data_t = './eva-data.csv' -g_file = './cumulative_eva_graph.png' - -data = pd.read_json(data_f, convert_dates=['date']) -data['eva'] = data['eva'].astype(float) -data.dropna(axis=0, inplace=True) -data.sort_values('date', inplace=True) - -data.to_csv(data_t, index=False) - -data['duration_hours'] = data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60) -data['cumulative_time'] = data['duration_hours'].cumsum() -plt.plot(data['date'], data['cumulative_time'], 'ko-') -plt.xlabel('Year') -plt.ylabel('Total time spent in space to date (hours)') -plt.tight_layout() -plt.savefig(g_file) -plt.show() - -``` - -We should replace the existing code in our Python script `eva_data_analysis.py` with the above code and commit the -changes. Remember to use an informative commit message. - -```bash -git status -git add eva_data_analysis.py -git commit -m "Refactor code to use standard libraries" -``` -```output -[main 0ba9b04] "Refactor code to use standard libraries"" - 1 file changed, 11 insertions(+), 46 deletions(-) -``` - -## Command-line interface to code - -Let's add a command-line interface to our script to allow us pass the data file to be read and the output file to be -written to as parameters to our script and avoid hard-coding them. -This improves interoperability and reusability of our code as it can now be run from the -command line terminal and integrated into other scripts or workflows/pipelines (e.g. 
another script can produce our
input data and can be "chained" with our code in a data analysis pipeline).

We will use `sys.argv` from Python's built-in `sys` library to read the command line arguments passed to our script and
make them available in our code as a list.
The first element of the list is the name of the script itself, and the following
elements are the arguments passed to the script.

Our modified code will now look as follows.

```python
import pandas as pd
import matplotlib.pyplot as plt
import sys


if __name__ == '__main__':

    if len(sys.argv) < 3:
        data_f = './eva-data.json'
        data_t = './eva-data.csv'
        print(f'Using default input and output filenames')
    else:
        data_f = sys.argv[1]
        data_t = sys.argv[2]
        print('Using custom input and output filenames')

    g_file = './cumulative_eva_graph.png'

    print(f'Reading JSON file {data_f}')
    data = pd.read_json(data_f, convert_dates=['date'])
    data['eva'] = data['eva'].astype(float)
    data.dropna(axis=0, inplace=True)
    data.sort_values('date', inplace=True)

    print(f'Saving to CSV file {data_t}')
    data.to_csv(data_t, index=False)

    print(f'Plotting cumulative spacewalk duration and saving to {g_file}')
    data['duration_hours'] = data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
    data['cumulative_time'] = data['duration_hours'].cumsum()
    plt.plot(data.date, data.cumulative_time, 'ko-')
    plt.xlabel('Year')
    plt.ylabel('Total time spent in space to date (hours)')
    plt.tight_layout()
    plt.savefig(g_file)
    plt.show()
    print("--END--")
```

We can now run our script from the command line, passing the JSON input data file and CSV output data file as arguments:

```bash
python eva_data_analysis.py eva_data.json eva_data.csv
```

Remember to commit our changes.

```bash
git status
git add eva_data_analysis.py
git commit -m "Add commandline functionality to script"
```
```output
[main b5883f6] Add commandline functionality to script
 1 file changed, 30 insertions(+), 16 deletions(-)
```

## Meaningful variable names

Variables are the most common thing you will assign when coding, and it's really important that it is clear what each variable means in order to understand what the code is doing.
If you return to your code after a long time doing something else, or share your code with a colleague, it should be easy enough to understand what variables are involved in your code from their names.
Therefore we need to give them clear names, but we also want to keep them concise so the code stays readable.
There are no "hard and fast rules" here, and it's often a case of using your best judgment.

Some useful tips for naming variables are:

- Short words are better than single character names
  - For example, if we were creating a variable to store the speed to read a file, `s` (for 'speed') is not descriptive enough but `MBReadPerSecondAverageAfterLastFlushToLog` is too long to read and prone to misspellings. `ReadSpeed` (or `read_speed`) would suffice.
  - If you're finding it difficult to come up with a variable name that is *both* short and descriptive, go with the short version and use an inline comment to describe it further (more on those in the next section!)
  - This guidance doesn't necessarily apply if your variable is a well-known constant in your domain, for example, *c* represents the speed of light in physics
- Try to be descriptive where possible, and avoid names like `foo`, `bar`, `var`, `thing`, and so on

There are also some gotchas to consider when naming variables:

- There may be some restrictions on which characters you can use in your variable names. For instance, in Python, only alphanumeric characters and underscores are permitted.
- Particularly in Python, you cannot *begin* your variable names with numerical characters as this will raise a syntax error.
  - Numerical characters can be included in a variable name, just not as the first character. For example, `read_speed1` is a valid variable name, but `1read_speed` isn't. (This behaviour may be different for other programming languages.)
- In some programming languages, such as Python, variable names are case sensitive. So `speed_of_light` and `Speed_Of_Light` will **not** be equivalent.
- Programming languages often have global pre-built functions, such as `input`, which you may accidentally overwrite if you assign a variable with the same name.
  - Again in Python, you would actually reassign the `input` name and no longer be able to access the original `input` function if you used this as a variable name. So in this case opting for something like `input_data` would be preferable. (This behaviour may be explicitly disallowed in other programming languages.)

:::::: challenge
### Give a descriptive name to a variable

Below we have a variable called `var` being set to the value of 9.81.
`var` is not a very descriptive name here as it doesn't tell us what 9.81 means, yet it is a very common constant in physics!
Go online and find out which constant 9.81 relates to and suggest a new name for this variable.

Hint: the units are *metres per second squared*!

``` python
var = 9.81
```

::: solution
### Solution

Yes, $$9.81 m/s^2 $$ is the [acceleration due to Earth's gravity](https://en.wikipedia.org/wiki/Gravity_of_Earth).
It is often referred to as "little g" to distinguish it from "big G" which is the [Gravitational Constant](https://en.wikipedia.org/wiki/Gravitational_constant).
A more descriptive name for this variable therefore might be:

``` python
g_earth = 9.81
```
:::
::::::


:::::: challenge

Let's apply this to `eva_data_analysis.py`.

a. 
Edit the code as follows to use descriptive variable names: - - - Change data_f to input_file - - Change data_t to output_file - - Change g_file to graph_file - - Change data to eva_df - -b. Commit your changes to your repository. Remember to use an informative commit message. - - -::: solution - -Updated code: -```python -if __name__ == '__main__': - - if len(sys.argv) < 3: - input_file = './eva-data.json' - output_file = './eva-data.csv' - print(f'Using default input and output filenames') - else: - input_file = sys.argv[1] - output_file = sys.argv[2] - print('Using custom input and output filenames') - - graph_file = './cumulative_eva_graph.png' - - print(f'Reading JSON file {input_file}') - eva_df = pd.read_json(input_file, convert_dates=['date']) - eva_df['eva'] = eva_df['eva'].astype(float) - eva_df.dropna(axis=0, inplace=True) - eva_df.sort_values('date', inplace=True) - - print(f'Saving to CSV file {output_file}') - eva_df.to_csv(output_file, index=False) - - print(f'Plotting cumulative spacewalk duration and saving to {graph_file}') - eva_df['duration_hours'] = eva_df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60) - eva_df['cumulative_time'] = eva_df['duration_hours'].cumsum() - plt.plot(eva_df.date, eva_df.cumulative_time, 'ko-') - plt.xlabel('Year') - plt.ylabel('Total time spent in space to date (hours)') - plt.tight_layout() - plt.savefig(graph_file) - plt.show() - print("--END--") -``` - -Commit changes: -```bash -git status -git add eva_data_analysis.py -git commit -m "Use descriptive variable names" -``` - -::: -:::::: - - -## Inline comments - -Commenting is a very useful practice to help convey the context of the code. -It can be helpful as a reminder for your future self or your collaborators as to why code is written in a certain way, how it is achieving a specific task, or the real-world implications of your code. - -There are many ways to add comments to code, the most common of which is inline comments. 

``` python
# In Python, inline comments begin with the `#` symbol and a single space.
```

Again, there are few hard and fast rules for using comments; just apply your best judgment.
But here are a few things to keep in mind when commenting your code:

- **Avoid using comments to explain *what* your code does.** If your code is too complex for other programmers to understand, consider rewriting it for clarity rather than adding comments to explain it.
- Focus on the ***why*** and the ***how***.
- Make sure you're not reiterating something that your code already conveys on its own. Comments shouldn't echo your code.
- Keep them short and concise. Large blocks of text quickly become unreadable and difficult to maintain.
- Comments that contradict the code are worse than no comments. Always make it a priority to keep comments up-to-date when code changes.

### Examples of helpful vs. unhelpful comments

#### Unhelpful:

``` python
statetax = 1.0625 # Assigns the float 1.0625 to the variable 'statetax'
citytax = 1.01 # Assigns the float 1.01 to the variable 'citytax'
specialtax = 1.01 # Assigns the float 1.01 to the variable 'specialtax'
```

The comments in this code simply tell us what the code does, which is easy enough to figure out without the inline comments.

#### Helpful:

``` python
statetax = 1.0625 # State sales tax rate is 6.25% through Jan. 1
citytax = 1.01 # City sales tax rate is 1% through Jan. 1
specialtax = 1.01 # Special sales tax rate is 1% through Jan. 1
```

In this case, it might not be immediately obvious what each variable represents, so the comments offer helpful, real-world context.
The date in the comment also indicates when the code might need to be updated.

::: challenge
### Add some comments to a code block

a. Examine `eva_data_analysis.py`.
Add as many inline comments as you think are required to help yourself and others understand what that code is doing.
b. 
Commit your changes to your repository. Remember to use an informative commit message. - -Hint: Inline comments in Python are denoted by a `#` symbol. - -::: solution -### Solution - -Some good inline comments may look like the example below. - -``` python -import pandas as pd -import matplotlib.pyplot as plt -import sys - - -if __name__ == '__main__': - - if len(sys.argv) < 3: - input_file = './eva-data.json' - output_file = './eva-data.csv' - print(f'Using default input and output filenames') - else: - input_file = sys.argv[1] - output_file = sys.argv[2] - print('Using custom input and output filenames') - - graph_file = './cumulative_eva_graph.png' - - print(f'Reading JSON file {input_file}') - # Read the data from a JSON file into a Pandas dataframe - eva_df = pd.read_json(input_file, convert_dates=['date']) - # Clean the data by removing any incomplete rows and sort by date - eva_df['eva'] = eva_df['eva'].astype(float) - eva_df.dropna(axis=0, inplace=True) - eva_df.sort_values('date', inplace=True) - - print(f'Saving to CSV file {output_file}') - # Save dataframe to CSV file for later analysis - eva_df.to_csv(output_file, index=False) - - print(f'Plotting cumulative spacewalk duration and saving to {graph_file}') - # Plot cumulative time spent in space over years - eva_df['duration_hours'] = eva_df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60) - eva_df['cumulative_time'] = eva_df['duration_hours'].cumsum() - plt.plot(eva_df.date, eva_df.cumulative_time, 'ko-') - plt.xlabel('Year') - plt.ylabel('Total time spent in space to date (hours)') - plt.tight_layout() - plt.savefig(graph_file) - plt.show() - print("--END--") -``` - -Commit changes: -```bash -git status -git add eva_data_analysis.py -git commit -m "Add inline comments to the code" -``` - -::: -::: - -## Functions - -Functions are a fundamental concept in writing software and are one of the core ways you can organise your code to improve its readability. 
A function is an isolated section of code that performs a single, *specific* task that can be simple or complex.
It can then be called multiple times with different inputs throughout a codebase, but its definition only needs to appear once.

Breaking up code into functions in this manner benefits readability since the smaller sections are easier to read and understand.
Since functions can be reused, codebases naturally begin to follow the [Don't Repeat Yourself principle][dry-principle] which prevents software from becoming overly long and confusing.
The software also becomes easier to maintain because, if the code encapsulated in a function needs to change, it only needs updating in one place instead of many.
As we will learn in a future episode, testing code also becomes simpler when code is written in functions.
Each function can be individually checked to ensure it is doing what is intended, which improves confidence in the software as a whole.

::: challenge
### Create a function

Below is a function that reads in a JSON file into a dataframe structure using the [`pandas` library][pandas-org] - but the code is out of order!

Reorder the lines of code within the function so that the JSON file is read in using the `read_json` method, any incomplete rows are *dropped*, the values are *sorted* by date, and then the cleaned dataframe is *returned*.
There is also a `print` statement that will display which file is being read in on the command line for verification.

``` python
import pandas as pd

def read_json_to_dataframe(input_file):
    eva_df.sort_values('date', inplace=True)
    eva_df.dropna(axis=0, inplace=True)
    print(f'Reading JSON file {input_file}')
    return eva_df
    eva_df = pd.read_json(input_file, convert_dates=['date'])
```

::: solution
### Solution

Here is the correct order of the code for the function.
- -``` python -import pandas as pd - -def read_json_to_dataframe(input_file): - print(f'Reading JSON file {input_file}') - eva_df = pd.read_json(input_file, convert_dates=['date']) - eva_df.dropna(axis=0, inplace=True) - eva_df.sort_values('date', inplace=True) - return eva_df -``` - -We have chosen to create a function for reading in data files since this is a very common task within research software. -While this isn't that many lines of code, thanks to using pandas inbuilt methods, it can be useful to package this together into a function if you need to read in a lot of similarly structured files and process them in the same way. -::: -::: - -## Docstrings - -Docstrings are a specific type of documentation that are provided within functions, and [classes][python-classes] too. -A function docstring should explain what the isolated code is doing, what parameters the function needs (these are inputs) and what form they should take, what the function outputs (you may see words like 'returns' or 'yields' here), and errors (if any) that might be raised. - -Providing these docstrings helps improve code readability since it makes the function code more transparent and aids understanding. -Particularly, docstrings that provide information on the input and output of functions makes it easier to reuse them in other parts of the code, without having to read the full function to understand what needs to be provided and what will be returned. - -Docstrings are another case where there are no hard and fast rules for writing them. -Acceptable docstring formats can range from single- to multi-line. -You can use your best judgment on how much documentation a particular function needs. - -### Example of a single-line docstring - -``` python -def add(x, y): - """Add two numbers together""" - return x + y -``` - -### Example of a multi-line docstring: - -``` python -def add(x, y=1.0): - """ - Adds two numbers together. - - Args: - x: A number to be included in the addition. 
        y (float, optional): A float number to be included in the addition. Defaults to 1.0.

    Returns:
        float: The sum of x and y.
    """
    return x + y
```

Some projects may have their own guidelines on how to write docstrings, such as [numpy][numpy-docstring].
If you are contributing code to a wider project or community, try to follow the guidelines and standards they provide for code style.

As your code grows and becomes more complex, the docstrings can form the content of a reference guide, allowing developers to quickly look up how to use the APIs, functions, and classes defined in your codebase.
Hence, it is common to find tools, such as [MkDocs][mkdocs-org], that will automatically extract docstrings from your code and generate a website where people can learn about your code without downloading, installing and reading the code files.

::: challenge
### Writing docstrings

Write a docstring for the `read_json_to_dataframe` function from the previous exercise.
Things you may want to consider when writing your docstring are:

- Describing what the function does
- What kind of inputs does the function take? Are they required or optional? Do they have default values?
- What output will the function produce?

Hint: Python docstrings are defined by enclosing the text with `"""` above and below. This text is also indented to the same level as the code defined beneath it, which is four spaces.

::: solution
### Solution

A good enough docstring for this function would look like this:

``` python
def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any incomplete rows and sort by date.
    """
    print(f'Reading JSON file {input_file}')
    eva_df = pd.read_json(input_file,
                          convert_dates=['date'])
    eva_df.dropna(axis=0, inplace=True)
    eva_df.sort_values('date', inplace=True)
    return eva_df
```

Using [Google's docstring convention][google-doc-string], the docstring may look more like this:

``` python
def read_json_to_dataframe(input_file):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any incomplete rows and sort by date.

    Args:
        input_file (str): The path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
    """
    print(f'Reading JSON file {input_file}')
    eva_df = pd.read_json(input_file,
                          convert_dates=['date'])
    eva_df.dropna(axis=0, inplace=True)
    eva_df.sort_values('date', inplace=True)
    return eva_df
```
:::
:::

## Improving our code

Finally, let's apply these good practices to `eva_data_analysis.py`
and organise our code into functions with descriptive names and docstrings.

``` python
import pandas as pd
import matplotlib.pyplot as plt
import sys


def read_json_to_dataframe(input_file_):
    """
    Read the data from a JSON file into a Pandas dataframe.
    Clean the data by removing any incomplete rows and sort by date

    Args:
        input_file_ (str): The path to the JSON file.

    Returns:
        eva_df (pd.DataFrame): The loaded dataframe.
- """ - print(f'Reading JSON file {input_file_}') - eva_df = pd.read_json(input_file_, convert_dates=['date']) - eva_df['eva'] = eva_df['eva'].astype(float) - eva_df.dropna(axis=0, inplace=True) - eva_df.sort_values('date', inplace=True) - return eva_df - - -def write_dataframe_to_csv(df_, output_file_): - """ - Write the dataframe to a CSV file. - - Args: - df_ (pd.DataFrame): The input dataframe. - output_file_ (str): The path to the output CSV file. - - Returns: - None - """ - print(f'Saving to CSV file {output_file_}') - df_.to_csv(output_file_, index=False) - -def text_to_duration(duration): - """ - Convert a text format duration "HH:MM" to duration in hours - - Args: - duration (str): The text format duration - - Returns: - duration_hours (float): The duration in hours - """ - hours, minutes = duration.split(":") - duration_hours = int(hours) + int(minutes)/60 - return duration_hours - - -def add_duration_hours_variable(df_): - """ - Add duration in hours (duration_hours) variable to the dataset - - Args: - df_ (pd.DataFrame): The input dataframe. - - Returns: - df_copy (pd.DataFrame): A copy of df_ with the new duration_hours variable added - """ - df_copy = df_.copy() - df_copy["duration_hours"] = df_copy["duration"].apply( - text_to_duration - ) - return df_copy - - -def plot_cumulative_time_in_space(df_, graph_file_): - """ - Plot the cumulative time spent in space over years - - Convert the duration column from strings to number of hours - Calculate cumulative sum of durations - Generate a plot of cumulative time spent in space over years and - save it to the specified location - - Args: - df_ (pd.DataFrame): The input dataframe. - graph_file_ (str): The path to the output graph file. 
- - Returns: - None - """ - print(f'Plotting cumulative spacewalk duration and saving to {graph_file_}') - df_ = add_duration_hours_variable(df_) - df_['cumulative_time'] = df_['duration_hours'].cumsum() - plt.plot(df_.date, df_.cumulative_time, 'ko-') - plt.xlabel('Year') - plt.ylabel('Total time spent in space to date (hours)') - plt.tight_layout() - plt.savefig(graph_file_) - plt.show() - - -if __name__ == '__main__': - - if len(sys.argv) < 3: - input_file = './eva-data.json' - output_file = './eva-data.csv' - print(f'Using default input and output filenames') - else: - input_file = sys.argv[1] - output_file = sys.argv[2] - print('Using custom input and output filenames') - - graph_file = './cumulative_eva_graph.png' - - eva_data = read_json_to_dataframe(input_file) - - write_dataframe_to_csv(eva_data, output_file) - - plot_cumulative_time_in_space(eva_data, graph_file) - - print("--END--") - -``` - -Finally, let's commit these changes to our local repository and then push to our remote repository on GitHub to publish -these changes. -Remember to use an informative commit message. - -```bash -git status -git add eva_data_analysis.py -git commit -m "Organise code into functions" -git push origin main -``` - - -## Summary - -During this episode, we have discussed the importance of code readability and explored some software engineering -practices that help facilitate this. - -Code readability is important because it makes it simpler and quicker for a person (future you or a collaborator) -to understand what purpose the code is serving, and therefore begin contributing to it more easily, saving time and -effort. 
- -Some best practices we have covered towards code readability include: - -- Variable naming practices for descriptive yet concise code -- Inline comments to provide real-world context -- Functions to isolate specific code sections for re-use -- Docstrings for documenting functions to facilitate their re-use - -## Further reading - -We recommend the following resources for some additional reading on the topic of this episode: - -- ['Code Readability Matters' from the Guardian's engineering blog][guardian-code-readability] -- [PEP 8 Style Guide for Python][pep8-comments] -- [Coursera: Inline commenting in Python][coursera-inline-comments] -- [Introducing Functions from Introduction to Python][python-functions-into] -- [W3Schools.com Python Functions][python-functions-w3schools] - -Also check the [full reference set](learners/reference.md#litref) for the course. - -::: keypoints -- Readable code is easier to understand, maintain, debug and extend! -- Creating functions from the smallest, reusable units of code will help compartmentalise which parts of the code are doing what actions -- Choosing descriptive variable and function names will communicate their purpose more effectively -- Using inline comments and docstrings to describe parts of the code will help transmit understanding, and verify that the code is correct -::: diff --git a/episodes/06-code-readability.md b/episodes/06-code-readability.md new file mode 100644 index 00000000..82466dde --- /dev/null +++ b/episodes/06-code-readability.md @@ -0,0 +1,868 @@ +--- +title: Code readability +teaching: 60 +exercises: 30 +--- + +::: questions + +- Why does code readability matter? +- How can I organise my code to be more readable? +- What types of documentation can I include to improve the readability of my code? 

:::

::: objectives

After completing this episode, participants should be able to:

- Organise code into reusable functions that achieve a singular purpose
- Choose function and variable names that help explain the purpose of the function or variable
- Write informative inline comments and docstrings to provide more detail about what the code is doing

:::

In this episode, we will introduce the concept of readable code and consider how it can help create reusable
scientific software and empower collaboration between researchers.

When someone writes code, they do so based on requirements that are likely to change in the future.
Requirements change because software interacts with the real world, which is dynamic.
When these requirements change, the developer (who is not necessarily the same person who wrote the original code)
must implement the new requirements.
They do this by reading the original code to understand the different abstractions, and identifying what needs to change.
Readable code facilitates reading and understanding those abstractions and, as a result, facilitates the
evolution of the codebase.
Readable code saves future developers' time and effort.

In order to develop readable code, we should ask ourselves: "If I re-read this piece of code in fifteen days or one
year, will I be able to understand what I have done and why?"
Or even better: "If a new person who just joined the project reads my software, will they be able to understand
what I have written here?"

We will now learn about a few software best practices we can follow to help create more readable code.
Before we move on with further code modifications, make sure your virtual development environment is active.
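To make the idea of readability concrete before we start, compare the two hypothetical snippets below. The function names and numbers here are invented for illustration and are not part of the Spacewalks code; both functions compute exactly the same thing, but only one tells the reader what that thing is.

```python
# Hard to read: single-letter names give the reader no clues about intent
def f(a, b):
    return a + b / 60


# Easier to read: descriptive names and a docstring explain the intent
def duration_in_hours(hours, minutes):
    """Convert a duration given as whole hours and minutes into decimal hours."""
    return hours + minutes / 60


# Both behave identically, but only the second is self-explanatory
print(duration_in_hours(7, 30))  # prints 7.5
```

The rest of this episode applies the same ideas - better names, comments, functions and docstrings - to our spacewalks script.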
+ +::: instructor + +::: callout + +### Activate your virtual environment +If it is not already active, make sure to remind learners to activate their virtual environments from the root of +the software project directory in command line terminal (e.g. Bash or GitBash): + +```bash +$ source venv_spacewalks/bin/activate # Mac or Linux +$ source venv_spacewalks/Scripts/activate # Windows +(venv_spacewalks) $ +``` +::: + + +At this point, the state of the software repository should be as in https://github.com/carpentries-incubator/astronaut-data-analysis-not-so-fair/tree/06-code-readability +and the `eva_data_analysis.py` code should look like as follows: + +``` python + +# https://data.nasa.gov/resource/eva.json (with modifications) +data_f = open('./eva-data.json', 'r') +data_t = open('./eva-data.csv','w') +g_file = './cumulative_eva_graph.png' + +fieldnames = ("EVA #", "Country", "Crew ", "Vehicle", "Date", "Duration", "Purpose") + +data=[] +import json + +for i in range(374): + line=data_f.readline() + print(line) + data.append(json.loads(line[1:-1])) +#data.pop(0) +## Comment out this bit if you don't want the spreadsheet +import csv + +w=csv.writer(data_t) + +import datetime as dt + +time = [] +date =[] + +j=0 +for i in data: + print(data[j]) + # and this bit + w.writerow(data[j].values()) + if 'duration' in data[j].keys(): + tt=data[j]['duration'] + if tt == '': + pass + else: + t=dt.datetime.strptime(tt,'%H:%M') + ttt = dt.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second).total_seconds()/(60*60) + print(t,ttt) + time.append(ttt) + if 'date' in data[j].keys(): + date.append(dt.datetime.strptime(data[j]['date'][0:10], '%Y-%m-%d')) + #date.append(data[j]['date'][0:10]) + + else: + time.pop(0) + j+=1 + +t=[0] +for i in time: + t.append(t[-1]+i) + +date,time = zip(*sorted(zip(date, time))) + +import matplotlib.pyplot as plt + +plt.plot(date,t[1:], 'ko-') +plt.xlabel('Year') +plt.ylabel('Total time spent in space to date (hours)') +plt.tight_layout() 
+plt.savefig(g_file) +plt.show() + +``` +::: + +## Place `import` statements at the top + +Let's have a look our code again - the first thing we may notice is that our script currently places import statements +throughout the code. +Conventionally, all import statements are placed at the top of the script so that dependant libraries +are clearly visible and not buried inside the code (even though there are standard ways of describing dependencies - +e.g. using `requirements.txt` file). +This will help readability (accessibility) and reusability of our code. + +Our code after the modification should look like the following. + +```python + +# https://data.nasa.gov/resource/eva.json (with modifications) +data_f = open('./eva-data.json', 'r') +data_t = open('./eva-data.csv','w') +g_file = './cumulative_eva_graph.png' + +fieldnames = ("EVA #", "Country", "Crew ", "Vehicle", "Date", "Duration", "Purpose") + +data=[] +import json + +for i in range(374): + line=data_f.readline() + print(line) + data.append(json.loads(line[1:-1])) +#data.pop(0) +## Comment out this bit if you don't want the spreadsheet +import csv + +w=csv.writer(data_t) + +import datetime as dt + +time = [] +date =[] + +j=0 +for i in data: + print(data[j]) + # and this bit + w.writerow(data[j].values()) + if 'duration' in data[j].keys(): + tt=data[j]['duration'] + if tt == '': + pass + else: + t=dt.datetime.strptime(tt,'%H:%M') + ttt = dt.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second).total_seconds()/(60*60) + print(t,ttt) + time.append(ttt) + if 'date' in data[j].keys(): + date.append(dt.datetime.strptime(data[j]['date'][0:10], '%Y-%m-%d')) + #date.append(data[j]['date'][0:10]) + + else: + time.pop(0) + j+=1 + +t=[0] +for i in time: + t.append(t[-1]+i) + +date,time = zip(*sorted(zip(date, time))) + +import matplotlib.pyplot as plt + +plt.plot(date,t[1:], 'ko-') +plt.xlabel('Year') +plt.ylabel('Total time spent in space to date (hours)') +plt.tight_layout() +plt.savefig(g_file) +plt.show() + + 
+``` + +Let's make sure we commit our changes. + +```bash +(venv_spacewalks) $ git add eva_data_analysis.py +(venv_spacewalks) $ git commit -m "Move import statements to the top of the script" +``` + +## Use meaningful variable names + +Variables are the most common thing you will assign when coding, and it's really important that it is clear what each variable means in order to understand what the code is doing. +If you return to your code after a long time doing something else, or share your code with a colleague, it should be easy enough to understand what variables are involved in your code from their names. +Therefore we need to give them clear names, but we also want to keep them concise so the code stays readable. +There are no "hard and fast rules" here, and it's often a case of using your best judgment. + +Some useful tips for naming variables are: + +- Short words are better than single character names. For example, if we were creating a variable to store the speed +to read a file, `s` (for 'speed') is not descriptive enough but `MBReadPerSecondAverageAfterLastFlushToLog` is too long +to read and prone to misspellings. `ReadSpeed` (or `read_speed`) would suffice. +- If you are finding it difficult to come up with a variable name that is both short and descriptive, +go with the short version and use an inline comment to describe it further (more on those in the next section). +This guidance does not necessarily apply if your variable is a well-known constant in your domain - +for example, *c* represents the speed of light in physics. +- Try to be descriptive where possible and avoid meaningless or funny names like `foo`, `bar`, `var`, `thing`, etc. + +There are also some restrictions to consider when naming variables in Python: + +- Only alphanumeric characters and underscores are permitted in variable names. +- You cannot begin your variable names with a numerical character as this will raise a syntax error. 
Numerical characters can be included in a variable name, just not as the first character. For example, `read_speed1` is a valid variable name, but `1read_speed` isn't. (This behaviour may be different for other programming languages.)
- Variable names are case sensitive. So `speed_of_light` and `Speed_Of_Light` are not the same.
- Programming languages often have global pre-built functions, such as `input`, which you may accidentally overwrite
if you assign a variable with the same name, losing access to the original `input` function. In this case,
opting for something like `input_data` would be preferable. Note that this behaviour may be explicitly disallowed in other
programming languages but is not in Python.

:::::: challenge

### Give a descriptive name to a variable

Below we have a variable called `var` being set to the value of 9.81.
`var` is not a very descriptive name here as it doesn't tell us what 9.81 means, yet it is a very common constant in physics!
Go online and find out which constant 9.81 relates to and suggest a new name for this variable.

Hint: the units are *metres per second squared*!

``` python
var = 9.81
```

::: solution
### Solution

Yes, $$9.81 m/s^2 $$ is the [acceleration due to Earth's gravity](https://en.wikipedia.org/wiki/Gravity_of_Earth).
It is often referred to as "little g" to distinguish it from "big G" which is the [Gravitational Constant](https://en.wikipedia.org/wiki/Gravitational_constant).
A more descriptive name for this variable therefore might be:

``` python
g_earth = 9.81
```
:::
::::::


:::::: challenge

Let's apply this to `eva_data_analysis.py`.

a. Edit the code as follows to use descriptive variable names:

  - Change `data_f` to `input_file`
  - Change `data_t` to `output_file`
  - Change `g_file` to `graph_file`

b. What other variable names in our code would benefit from renaming?
c. Commit your changes to your repository.
Remember to use an informative commit message. + + +::: solution + +Updated code: +```python + +import json +import csv +import datetime as dt +import matplotlib.pyplot as plt + +# Data source: https://data.nasa.gov/resource/eva.json (with modifications) +input_file = open('./eva-data.json', 'r') +output_file = open('./eva-data.csv', 'w') +graph_file = './cumulative_eva_graph.png' + +fieldnames = ("EVA #", "Country", "Crew ", "Vehicle", "Date", "Duration", "Purpose") + +data=[] + +for i in range(374): + line=input_file.readline() + print(line) + data.append(json.loads(line[1:-1])) +#data.pop(0) +## Comment out this bit if you don't want the spreadsheet + +w=csv.writer(output_file) + +time = [] +date =[] + +j=0 +for i in data: + print(data[j]) + # and this bit + w.writerow(data[j].values()) + if 'duration' in data[j].keys(): + tt=data[j]['duration'] + if tt == '': + pass + else: + t=dt.datetime.strptime(tt,'%H:%M') + ttt = dt.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second).total_seconds()/(60*60) + print(t,ttt) + time.append(ttt) + if 'date' in data[j].keys(): + date.append(dt.datetime.strptime(data[j]['date'][0:10], '%Y-%m-%d')) + #date.append(data[j]['date'][0:10]) + + else: + time.pop(0) + j+=1 + +t=[0] +for i in time: + t.append(t[-1]+i) + +date,time = zip(*sorted(zip(date, time))) + +plt.plot(date,t[1:], 'ko-') +plt.xlabel('Year') +plt.ylabel('Total time spent in space to date (hours)') +plt.tight_layout() +plt.savefig(graph_file) +plt.show() + +``` +We should also rename variables `w`, `t`, `ttt` to be more descriptive. + +Commit changes: +```bash +(venv_spacewalks) $ git add eva_data_analysis.py +(venv_spacewalks) $ git commit -m "Use descriptive variable names" +``` + +::: +:::::: + +## Use standard libraries + +Our script currently reads the data line-by-line from the JSON data file and uses custom code to manipulate +the data. +Variables of interest are stored in lists but there are more suitable data structures (e.g. 
data frames) +to store data in our case. +By choosing custom code over standard and well-tested libraries, we are making our code less readable and understandable +and more error-prone. + +The main functionality of our code can be rewritten as follows using the `Pandas` library to load and manipulate the +data in data frames. + +First, we need to install this dependency into our virtual environment (which should be active at this point). + +```bash +(venv_spacewalks) $ python -m pip install pandas +``` +The code should now look like: + +```python +import matplotlib.pyplot as plt +import pandas as pd + +# Data source: https://data.nasa.gov/resource/eva.json (with modifications) +input_file = open('./eva-data.json', 'r') +output_file = open('./eva-data.csv', 'w') +graph_file = './cumulative_eva_graph.png' + +eva_df = pd.read_json(input_file, convert_dates=['date']) +eva_df['eva'] = eva_df['eva'].astype(float) +eva_df.dropna(axis=0, inplace=True) +eva_df.sort_values('date', inplace=True) + +eva_df.to_csv(output_file, index=False) + +eva_df['duration_hours'] = eva_df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60) +eva_df['cumulative_time'] = eva_df['duration_hours'].cumsum() +plt.plot(eva_df['date'], eva_df['cumulative_time'], 'ko-') +plt.xlabel('Year') +plt.ylabel('Total time spent in space to date (hours)') +plt.tight_layout() +plt.savefig(graph_file) +plt.show() + +``` + +We should replace the existing code in our Python script `eva_data_analysis.py` with the above code and commit the +changes. Remember to use an informative commit message. + +```bash +(venv_spacewalks) $ git add eva_data_analysis.py +(venv_spacewalks) $ git commit -m "Refactor code to use standard libraries" +``` + +Make sure to capture the changes to your virtual development environment too. + +```bash +(venv_spacewalks) $ python -m pip freeze > requirements.txt +(venv_spacewalks) $ git add requirements.txt +(venv_spacewalks) $ git commit -m "Added Pandas library." 
+(venv_spacewalks) $ git push origin main
+```
+
+## Use comments to explain functionality
+
+Commenting is a very useful practice to help convey the context of the code.
+It can be helpful as a reminder for your future self or your collaborators as to why code is written in a certain way,
+how it is achieving a specific task, or the real-world implications of your code.
+
+There are several ways to add comments to code:
+
+- An **inline comment** is a comment on the same line as a code statement.
+Typically, it comes after the code statement and finishes when the line ends, and
+is useful for a brief explanation of that line.
+Inline comments in Python should be separated by at least two spaces from the statement; they start with a `#` followed
+by a single space, and have no end delimiter.
+- A **multi-line** or **block comment** can span multiple lines and has a start and end sequence.
+To comment out a block of code in Python, you can either add a `#` at the beginning of each line of the block or
+surround the entire block with three single (`'''`) or double quotes (`"""`).
+(Strictly speaking, triple-quoted text is a string literal rather than a comment, but an unassigned string is
+ignored by Python and is commonly used in this way.)
+
+``` python
+x = 5  # In Python, inline comments begin with the `#` symbol and a single space.
+
+'''
+This is a multiline
+comment
+in Python.
+'''
+```
+
+Here are a few things to keep in mind when commenting your code:
+
+- Focus on the **why** and the **how** of your code - avoid using comments to explain **what** your code does.
+If your code is too complex for other programmers to understand, consider rewriting it for clarity rather than adding
+comments to explain it.
+- Make sure you are not reiterating something that your code already conveys on its own. Comments should not echo your
+code.
+- Keep comments short and concise. Large blocks of text quickly become unreadable and difficult to maintain.
+- Comments that contradict the code are worse than no comments. Always prioritise keeping comments up to date
+when code changes.
+
+### Examples of unhelpful comments
+
+``` python
+statetax = 1.0625  # Assigns the float 1.0625 to the variable 'statetax'
+citytax = 1.01  # Assigns the float 1.01 to the variable 'citytax'
+specialtax = 1.01  # Assigns the float 1.01 to the variable 'specialtax'
+```
+
+The comments in this code simply tell us what the code does, which is easy enough to figure out without the inline comments.
+
+### Examples of helpful comments
+
+``` python
+statetax = 1.0625  # State sales tax rate is 6.25% through Jan. 1
+citytax = 1.01  # City sales tax rate is 1% through Jan. 1
+specialtax = 1.01  # Special sales tax rate is 1% through Jan. 1
+```
+
+In this case, it might not be immediately obvious what each variable represents, so the comments offer helpful,
+real-world context.
+The date in the comment also indicates when the code might need to be updated.
+
+::: challenge
+
+### Add comments to our code
+
+a. Examine `eva_data_analysis.py`.
+Add as many comments as you think are required to help yourself and others understand what that code is doing.
+b. Commit your changes to your repository. Remember to use an informative commit message.
+
+::: solution
+
+### Solution
+
+Some good comments may look like the example below.
+ +``` python + +import matplotlib.pyplot as plt +import pandas as pd + + +# https://data.nasa.gov/resource/eva.json (with modifications) +input_file = open('./eva-data.json', 'r') +output_file = open('./eva-data.csv', 'w') +graph_file = './cumulative_eva_graph.png' + +print("--START--") +print(f'Reading JSON file {input_file}') +# Read the data from a JSON file into a Pandas dataframe +eva_df = pd.read_json(input_file, convert_dates=['date']) +eva_df['eva'] = eva_df['eva'].astype(float) +# Clean the data by removing any incomplete rows and sort by date +eva_df.dropna(axis=0, inplace=True) +eva_df.sort_values('date', inplace=True) + +print(f'Saving to CSV file {output_file}') +# Save dataframe to CSV file for later analysis +eva_df.to_csv(output_file, index=False) + +print(f'Plotting cumulative spacewalk duration and saving to {graph_file}') +# Plot cumulative time spent in space over years +eva_df['duration_hours'] = eva_df['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60) +eva_df['cumulative_time'] = eva_df['duration_hours'].cumsum() +plt.plot(eva_df['date'], eva_df['cumulative_time'], 'ko-') +plt.xlabel('Year') +plt.ylabel('Total time spent in space to date (hours)') +plt.tight_layout() +plt.savefig(graph_file) +plt.show() +print("--END--") + +``` + +Commit changes: +```bash +(venv_spacewalks) $ git add eva_data_analysis.py +(venv_spacewalks) $ git commit -m "Add inline comments to the code" +``` + +::: +::: + +## Separate units of functionality + +Functions are a fundamental concept in writing software and are one of the core ways you can organise your code to +improve its readability. +A function is an isolated section of code that performs a single, *specific* task that can be simple or complex. +It can then be called multiple times with different inputs throughout a codebase, but its definition only needs to +appear once. 
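+
+As a toy illustration of this idea (not part of the spacewalks script), the same logic can be defined once and reused
+many times with different inputs:
+
+``` python
+# Toy example: one function definition, called multiple times with different inputs
+def format_duration(hours):
+    """Return a human-readable string for a duration given in hours."""
+    whole_hours = int(hours)
+    minutes = round((hours - whole_hours) * 60)
+    return f"{whole_hours}h {minutes}m"
+
+print(format_duration(7.5))   # 7h 30m
+print(format_duration(0.25))  # 0h 15m
+```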
+ +Breaking up code into functions in this manner benefits readability since the smaller sections are easier to read +and understand. +Since functions can be reused, codebases naturally begin to follow the [Don't Repeat Yourself principle][dry-principle] +which prevents software from becoming overly long and confusing. +The software also becomes easier to maintain because, if the code encapsulated in a function needs to change, +it only needs updating in one place instead of many. +As we will learn in a future episode, testing code also becomes simpler when code is written in functions. +Each function can be individually checked to ensure it is doing what is intended, which improves confidence in +the software as a whole. + +::: callout +Decomposing code into functions helps with reusability of blocks of code and eliminating repetition, +but, equally importantly, it helps with code readability and testing. +::: + +Looking at our code, you may notice it contains different pieces of functionality: + +1. reading the data from a JSON file +2. processing/cleaning the data and preparing it for analysis +3. data analysis and visualising the results +4. converting and saving the data in the CSV format + +Let's refactor our code so that reading the data in JSON format into a dataframe (step 1.) and converting it and saving +to the CSV format (step 4.) are extracted into separate functions. +Let's name those functions `read_json_to_dataframe` and `write_dataframe_to_csv` respectively. +The main part of the script should then be simplified to invoke these new functions, while the functions themselves +contain the complexity of each of these two steps. + +Our code may look something like the following. 
+
+``` python
+
+import matplotlib.pyplot as plt
+import pandas as pd
+
+def read_json_to_dataframe(input_file):
+    print(f'Reading JSON file {input_file}')
+    # Read the data from a JSON file into a Pandas dataframe
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    # Clean the data by removing any incomplete rows and sort by date
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+def write_dataframe_to_csv(df, output_file):
+    print(f'Saving to CSV file {output_file}')
+    # Save dataframe to CSV file for later analysis
+    df.to_csv(output_file, index=False)
+
+
+# Main code
+
+print("--START--")
+
+input_file = open('./eva-data.json', 'r')
+output_file = open('./eva-data.csv', 'w')
+graph_file = './cumulative_eva_graph.png'
+
+# Read the data from JSON file
+eva_data = read_json_to_dataframe(input_file)
+
+# Convert and export data to CSV file
+write_dataframe_to_csv(eva_data, output_file)
+
+print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
+# Plot cumulative time spent in space over years
+eva_data['duration_hours'] = eva_data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60)
+eva_data['cumulative_time'] = eva_data['duration_hours'].cumsum()
+plt.plot(eva_data['date'], eva_data['cumulative_time'], 'ko-')
+plt.xlabel('Year')
+plt.ylabel('Total time spent in space to date (hours)')
+plt.tight_layout()
+plt.savefig(graph_file)
+plt.show()
+
+print("--END--")
+
+```
+
+We have chosen to create functions for reading in and writing out data files since this is a very common task within
+research software.
+These functions do not contain many lines of code, because the built-in `pandas` methods do all the complex
+data reading, converting and writing operations for us.
+Even so, it can be useful to package these steps together into reusable functions if you need to read in or write out
+a lot of similarly structured files and process them in the same way.
+
+## Use docstrings to document functions
+
+Docstrings are a specific type of documentation that are provided within functions and [Python classes][python-classes].
+A function docstring should explain what that particular code is doing, what parameters the function needs (inputs)
+and what form they should take, what the function outputs (you may see words like 'returns' or 'yields' here),
+and errors (if any) that might be raised.
+
+Providing these docstrings helps improve code readability since it makes the function code more transparent and aids
+understanding.
+Particularly, docstrings that provide information on the input and output of functions make it easier to reuse them
+in other parts of the code, without having to read the full function to understand what needs to be provided and
+what will be returned.
+
+Python docstrings are defined by enclosing the text with three double quotes (`"""`).
+This text is also indented to the same level as the code defined beneath it, which is four spaces by convention.
+
+### Example of a single-line docstring
+
+``` python
+def add(x, y):
+    """Add two numbers together"""
+    return x + y
+```
+
+### Example of a multi-line docstring
+
+``` python
+def divide(x, y):
+    """
+    Divide number x by number y.
+
+    Args:
+        x: A number to be divided.
+        y: A number to divide by.
+
+    Returns:
+        float: The division of x by y.
+
+    Raises:
+        ZeroDivisionError: Cannot divide by zero.
+    """
+    return x / y
+```
+
+Some projects may have their own guidelines on how to write docstrings, such as [numpy][numpy-docstring].
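+For illustration, the same `divide` function documented in a NumPy-like style might look as follows
+(this is a sketch of that convention, not copied from the NumPy guidelines):
+
+``` python
+def divide(x, y):
+    """
+    Divide number x by number y.
+
+    Parameters
+    ----------
+    x : float
+        A number to be divided.
+    y : float
+        A number to divide by.
+
+    Returns
+    -------
+    float
+        The division of x by y.
+
+    Raises
+    ------
+    ZeroDivisionError
+        If y is zero.
+    """
+    return x / y
+```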
+If you are contributing code to a wider project or community, try to follow the guidelines and standards they provide
+for code style.
+
+As your code grows and becomes more complex, the docstrings can form the content of a reference guide allowing
+developers to quickly look up how to use the APIs, functions, and classes defined in your codebase.
+Hence, it is common to find tools that will automatically extract docstrings from your code and generate a
+website where people can learn about your code without downloading/installing and reading the code files -
+such as [MkDocs][mkdocs-org].
+
+Let's write a docstring for the function `read_json_to_dataframe` we introduced in the previous exercise using the
+[Google Style Python Docstrings Convention][google-doc-string].
+Remember, questions we want to answer when writing the docstring include:
+
+- What does the function do?
+- What kind of inputs does the function take? Are they required or optional? Do they have default values?
+- What output will the function produce?
+- What exceptions/errors, if any, can it produce?
+
+Our `read_json_to_dataframe` function fully described by a docstring may look like:
+
+```python
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date.
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    # Read the data from a JSON file into a Pandas dataframe
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    # Clean the data by removing any incomplete rows and sort by date
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+```
+
+:::::: challenge
+
+### Writing docstrings
+
+Write a docstring for the function `write_dataframe_to_csv` we introduced earlier.
+
+::: solution
+
+### Solution
+
+Our `write_dataframe_to_csv` function fully described by a docstring may look like:
+```python
+def write_dataframe_to_csv(df, output_file):
+    """
+    Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+
+    Returns:
+        None
+    """
+    print(f'Saving to CSV file {output_file}')
+    # Save dataframe to CSV file for later analysis
+    df.to_csv(output_file, index=False)
+```
+:::
+
+::::::
+
+Finally, our code may look something like the following:
+
+``` python
+
+import matplotlib.pyplot as plt
+import pandas as pd
+
+
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date.
+
+    Args:
+        input_file (str): The path to the JSON file.
+ + Returns: + eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure + """ + print(f'Reading JSON file {input_file}') + # Read the data from a JSON file into a Pandas dataframe + eva_df = pd.read_json(input_file, convert_dates=['date']) + eva_df['eva'] = eva_df['eva'].astype(float) + # Clean the data by removing any incomplete rows and sort by date + eva_df.dropna(axis=0, inplace=True) + eva_df.sort_values('date', inplace=True) + return eva_df + + +def write_dataframe_to_csv(df, output_file): + """ + Write the dataframe to a CSV file. + + Args: + df (pd.DataFrame): The input dataframe. + output_file (str): The path to the output CSV file. + + Returns: + None + """ + print(f'Saving to CSV file {output_file}') + # Save dataframe to CSV file for later analysis + df.to_csv(output_file, index=False) + + +# Main code + +print("--START--") + +input_file = open('./eva-data.json', 'r') +output_file = open('./eva-data.csv', 'w') +graph_file = './cumulative_eva_graph.png' + +# Read the data from JSON file +eva_data = read_json_to_dataframe(input_file) + +# Convert and export data to CSV file +write_dataframe_to_csv(eva_data, output_file) + +print(f'Plotting cumulative spacewalk duration and saving to {graph_file}') +# Plot cumulative time spent in space over years +eva_data['duration_hours'] = eva_data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60) +eva_data['cumulative_time'] = eva_data['duration_hours'].cumsum() +plt.plot(eva_data['date'], eva_data['cumulative_time'], 'ko-') +plt.xlabel('Year') +plt.ylabel('Total time spent in space to date (hours)') +plt.tight_layout() +plt.savefig(graph_file) +plt.show() + +print("--END--") + +``` + +Do not forget to commit any uncommitted changes you may have and then push your work to GitHub. 
+
+```bash
+(venv_spacewalks) $ git add eva_data_analysis.py
+(venv_spacewalks) $ git commit -m "Your commit message"
+(venv_spacewalks) $ git push origin main
+```
+
+
+## Further reading
+
+We recommend the following resources for some additional reading on the topic of this episode:
+
+- [7 tell-tale signs of unreadable code](https://www.index.dev/blog/7-tell-tale-signs-of-unreadable-code-how-to-identify-and-fix-the-problem)
+- ['Code Readability Matters' from the Guardian's engineering blog][guardian-code-readability]
+- [PEP 8 Style Guide for Python][pep8-comments]
+- [Coursera: Inline commenting in Python][coursera-inline-comments]
+- [Introducing Functions from Introduction to Python][python-functions-intro]
+- [W3Schools.com Python Functions][python-functions-w3schools]
+
+Also check the [full reference set](learners/reference.md#litref) for the course.
+
+::: keypoints
+- Readable code is easier to understand, maintain, debug and extend (reuse) - saving time and effort.
+- Choosing descriptive variable and function names will communicate their purpose more effectively.
+- Using inline comments and docstrings to describe parts of the code will help transmit understanding and context.
+- Use libraries or packages for common functionality to avoid duplication.
+- Creating functions from the smallest, reusable units of code will make the code more readable and help
+compartmentalise which parts of the code are doing what actions, and isolate specific code sections for re-use.
+:::
diff --git a/episodes/07-code-structure.md b/episodes/07-code-structure.md
new file mode 100644
index 00000000..d089a9fa
--- /dev/null
+++ b/episodes/07-code-structure.md
@@ -0,0 +1,690 @@
+---
+title: Code structure
+teaching: 60
+exercises: 30
+---
+
+:::::::::::::::::::::::::::::::::::::: questions
+
+- How can we best structure code into reusable components with a single responsibility?
+- What is a common code structure (pattern) for creating software that can read input from the command line?
+- What are conventional places to store data, code, results, tests and auxiliary information and metadata
+within our software or research project?
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::: objectives
+After completing this episode, participants should be able to:
+
+- Structure code that is modular and split into small, reusable functions.
+- Use the common code pattern for creating software that can read input from the command line.
+- Follow best practices in structuring code and organising software/research project directories for improved
+readability, accessibility and reproducibility.
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+In the previous episode we saw some tools and practices that can help us improve the readability of our code -
+including breaking our code into small, reusable functions that perform one specific task.
+We are going to explore a bit more how using common code structures can improve readability, accessibility and
+reusability of our code, and will extend these practices to our (research or code) project as a whole.
+
+Before we move on with further code modifications, make sure your virtual development environment is active.
+
+:::::: instructor

+::: callout
+
+### Activate your virtual environment
+If it is not already active, make sure to remind learners to activate their virtual environments from the root of
+the software project directory in command line terminal (e.g.
Bash or Git Bash):
+
+```bash
+$ source venv_spacewalks/bin/activate # Mac or Linux
+$ source venv_spacewalks/Scripts/activate # Windows
+(venv_spacewalks) $
+```
+:::
+
+
+At this point, the state of the software repository should be as in
+https://github.com/carpentries-incubator/astronaut-data-analysis-not-so-fair/tree/07-code-structure
+and the `eva_data_analysis.py` code should look as follows:
+
+``` python
+import matplotlib.pyplot as plt
+import pandas as pd
+
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date.
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    # Read the data from a JSON file into a Pandas dataframe
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    # Clean the data by removing any incomplete rows and sort by date
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+def write_dataframe_to_csv(df, output_file):
+    """
+    Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+ + Returns: + None + """ + print(f'Saving to CSV file {output_file}') + # Save dataframe to CSV file for later analysis + df.to_csv(output_file, index=False) + + +# Main code + +print("--START--") + +input_file = open('./eva-data.json', 'r') +output_file = open('./eva-data.csv', 'w') +graph_file = './cumulative_eva_graph.png' + +# Read the data from JSON file +eva_data = read_json_to_dataframe(input_file) + +# Convert and export data to CSV file +write_dataframe_to_csv(eva_data, output_file) + +print(f'Plotting cumulative spacewalk duration and saving to {graph_file}') +# Plot cumulative time spent in space over years +eva_data['duration_hours'] = eva_data['duration'].str.split(":").apply(lambda x: int(x[0]) + int(x[1])/60) +eva_data['cumulative_time'] = eva_data['duration_hours'].cumsum() +plt.plot(eva_data['date'], eva_data['cumulative_time'], 'ko-') +plt.xlabel('Year') +plt.ylabel('Total time spent in space to date (hours)') +plt.tight_layout() +plt.savefig(graph_file) +plt.show() + +print("--END--") + + +``` +:::::: + +## Functions for Modular and Reusable Code + +As we have already seen in the previous episode - functions play a key role in creating modular and reusable code. +We are going to carry on improving our code following these principles: + +- Each function should have a single, clear responsibility. This makes functions easier to understand, test, and reuse. +- Functions should accept parameters to allow flexibility and reusability in different contexts; avoid hard-coding +values inside functions/code (e.g. data files to read from/write to); instead, pass them as arguments. +- Split complex tasks into smaller, simpler functions that can be composed; each function should handle a distinct +part of a larger task. +- Write functions that can be easily combined or reused with other functions to build more complex functionality. 
+
+Bearing in mind the above principles, we can further simplify our code by extracting the code that processes and
+analyses our data and plots a graph into a separate function `plot_cumulative_time_in_space`.
+We can then further break down the code that converts the data column containing spacewalk durations as text into
+numbers (which we can perform arithmetic operations over) and adds that numerical data as a new column in our dataset.
+
+The main part of our code then becomes much simpler and more readable, only
+containing the invocation of the following three functions:
+
+```python
+eva_data = read_json_to_dataframe(input_file)
+write_dataframe_to_csv(eva_data, output_file)
+plot_cumulative_time_in_space(eva_data, graph_file)
+```
+Remember to add docstrings and comments to the new functions to explain their functionalities.
+
+Our new code may look like the following.
+
+```python
+
+
+import matplotlib.pyplot as plt
+import pandas as pd
+
+
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date.
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+def write_dataframe_to_csv(df, output_file):
+    """
+    Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+
+    Returns:
+        None
+    """
+    print(f'Saving to CSV file {output_file}')
+    df.to_csv(output_file, index=False)
+
+def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        duration_hours (float): The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours) + int(minutes)/60
+    return duration_hours
+
+
+def add_duration_hours_variable(df):
+    """
+    Add duration in hours (duration_hours) variable to the dataset
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+
+    Returns:
+        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
+    """
+    df_copy = df.copy()
+    df_copy["duration_hours"] = df_copy["duration"].apply(
+        text_to_duration
+    )
+    return df_copy
+
+
+def plot_cumulative_time_in_space(df, graph_file):
+    """
+    Plot the cumulative time spent in space over years
+
+    Convert the duration column from strings to number of hours
+    Calculate cumulative sum of durations
+    Generate a plot of cumulative time spent in space over years and
+    save it to the specified location
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        graph_file (str): The path to the output graph file.
+ + Returns: + None + """ + print(f'Plotting cumulative spacewalk duration and saving to {graph_file}') + df = add_duration_hours_variable(df) + df['cumulative_time'] = df['duration_hours'].cumsum() + plt.plot(df.date, df.cumulative_time, 'ko-') + plt.xlabel('Year') + plt.ylabel('Total time spent in space to date (hours)') + plt.tight_layout() + plt.savefig(graph_file) + plt.show() + + +# Main code + +print("--START--") + +input_file = open('./eva-data.json', 'r') +output_file = open('./eva-data.csv', 'w') +graph_file = './cumulative_eva_graph.png' + +eva_data = read_json_to_dataframe(input_file) + +write_dataframe_to_csv(eva_data, output_file) + +plot_cumulative_time_in_space(eva_data, graph_file) + +print("--END--") + + +``` + +## Command-line interface to code + +A common way to structure code is to have a command-line interface to allow the passing of input data file to be +read and the output file to be written to as parameters to our script and avoid hard-coding them. +This improves interoperability and reusability of our code as it can now be run from the +command line terminal and integrated into other scripts or workflows/pipelines. +For example, another script can produce +our input data and can be "chained" with our code in a more complex data analysis pipeline. +Or we can invoke our script in a loop to quickly analyse a number of input data files from a directory. + +There is a common code structure (pattern) for this: + +```python +# import modules + +def main(args): + # perform some actions + +if __name__ == "__main__": + # perform some actions before main() + main(args) +``` + +In this pattern the actions performed by the script are contained within the `main` function +(which does not need to be called `main`, but using this convention helps others in understanding your code). 
+The `main` function is then called within the `if __name__ == "__main__":` statement,
+after some other actions have been performed (usually the parsing of command-line arguments,
+which will be explained below).
+`__name__` is a special variable which is set by the Python interpreter before the execution of any code in the
+source file.
+What value is given by the interpreter to `__name__` is determined by the manner in which the script is loaded.
+
+If we run the source file directly using the Python interpreter, e.g.:
+
+```bash
+$ python3 eva_data_analysis.py
+```
+
+then the interpreter will assign the hard-coded string `"__main__"` to the `__name__` variable:
+
+```python
+__name__ = "__main__"
+...
+# rest of your code
+```
+
+However, if your source file is imported by another Python script, e.g.:
+
+```python
+import eva_data_analysis
+```
+
+then the Python interpreter will assign the name "eva_data_analysis" from the import statement to the
+`__name__` variable:
+
+```python
+__name__ = "eva_data_analysis"
+...
+# rest of your code
+```
+
+Because of this behaviour of the Python interpreter, we can put any code that should only be executed when running
+the script directly within the `if __name__ == "__main__":` structure,
+allowing the rest of the code within the script to be safely imported by another script if we so wish.
+
+While it may not seem very useful to have your script importable by another script,
+there are a number of situations in which you would want to do this:
+
+- for testing of your code, you can have your testing framework import the main script,
+  and run special test functions which then call the `main` function directly;
+- where you want to not only be able to run your script from the command-line,
+  but also provide a programmer-friendly application programming interface (API) for advanced users.
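+
+A quick way to see this behaviour for yourself, independent of our script, is to inspect `__name__` on an
+imported module:
+
+```python
+import json
+
+# An imported module's __name__ is its module name...
+print(json.__name__)  # json
+# ...while the file being run directly gets the special name "__main__"
+print(__name__)
+```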
+
+We will use the `sys` library to read the command line arguments passed to our script and make them available in our
+code as a list - remember to import this library first.
+
+Our modified code will now look as follows.
+
+```python
+import json
+import csv
+import datetime as dt
+import matplotlib.pyplot as plt
+import pandas as pd
+import sys
+
+def main(input_file, output_file, graph_file):
+    print("--START--")
+
+    eva_data = read_json_to_dataframe(input_file)
+
+    write_dataframe_to_csv(eva_data, output_file)
+
+    plot_cumulative_time_in_space(eva_data, graph_file)
+
+    print("--END--")
+
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+def write_dataframe_to_csv(df, output_file):
+    """
+    Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+
+    Returns:
+        None
+    """
+    print(f'Saving to CSV file {output_file}')
+    df.to_csv(output_file, index=False)
+
+def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        duration_hours (float): The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours) + int(minutes)/60
+    return duration_hours
+
+
+def add_duration_hours_variable(df):
+    """
+    Add duration in hours (duration_hours) variable to the dataset
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+
+    Returns:
+        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
+    """
+    df_copy = df.copy()
+    df_copy["duration_hours"] = df_copy["duration"].apply(
+        text_to_duration
+    )
+    return df_copy
+
+
+def plot_cumulative_time_in_space(df, graph_file):
+    """
+    Plot the cumulative time spent in space over years
+
+    Convert the duration column from strings to number of hours
+    Calculate cumulative sum of durations
+    Generate a plot of cumulative time spent in space over years and
+    save it to the specified location
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        graph_file (str): The path to the output graph file.
+
+    Returns:
+        None
+    """
+    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
+    df = add_duration_hours_variable(df)
+    df['cumulative_time'] = df['duration_hours'].cumsum()
+    plt.plot(df.date, df.cumulative_time, 'ko-')
+    plt.xlabel('Year')
+    plt.ylabel('Total time spent in space to date (hours)')
+    plt.tight_layout()
+    plt.savefig(graph_file)
+    plt.show()
+
+
+if __name__ == "__main__":
+
+    if len(sys.argv) < 3:
+        input_file = './eva-data.json'
+        output_file = './eva-data.csv'
+        print('Using default input and output filenames')
+    else:
+        input_file = sys.argv[1]
+        output_file = sys.argv[2]
+        print('Using custom input and output filenames')
+
+    graph_file = './cumulative_eva_graph.png'
+    main(input_file, output_file, graph_file)
+
+```
+
+We can now run our script from the command line, passing the JSON input and CSV output data files as arguments:
+
+```bash
+(venv_spacewalks) $ python eva_data_analysis.py eva_data.json eva_data.csv
+```
+
+Remember to commit our changes.
+
+```bash
+(venv_spacewalks) $ git status
+(venv_spacewalks) $ git add eva_data_analysis.py
+(venv_spacewalks) $ git commit -m "Add command line functionality to script"
+```
+
+## Directory structure for software projects
+
+One of the steps to make your work more easily readable and reproducible is to organise your software projects
+following certain conventions for a consistent and informative directory structure.
+This way, people will immediately know where to find things within your project.
+Here are some general guidelines that apply to all research projects (including software projects):
+
+- Put all files related to a project into a single directory
+- Do not mix project files - different projects should have separate directories and repositories (it is OK to copy
+files into multiple places if both projects require them)
+- Avoid spaces in directory and file names – they can cause errors when read by computers
+- If you have sensitive data, you can put it in a private repository on GitHub
+- Use `.gitignore` to specify what files should not be tracked - e.g. passwords, local configuration, etc.
+- Add a README file to your repository to describe the project and give instructions on running the code or reproducing
+the results (we will cover this later in this course).
+
+```output
+project_name/
+├── README.md             # overview of the project
+├── LICENSE               # license for your code
+├── requirements.txt      # software requirements and dependencies
+├── data/                 # data files used in the project
+│   ├── README.md         # describes where data came from
+│   └── sub-folder/       # may contain subdirectories, e.g. for intermediate files from the analysis
+├── manuscript/           # manuscript describing the results
+├── results/              # results of the analysis (data, tables, figures)
+├── src/                  # source code for your project
+│   └── ...
+├── doc/                  # documentation for your project
+│   └── index.rst
+├── main_script.py        # main script/code entry point
+└── ...
+```
+
+- Source code is typically placed in the `src/` or `source/` directory (and its subdirectories containing hierarchical
+libraries of your code). The main script or the main entry to your code may remain in the project root.
+- Data is typically placed under `data/`
+  - Raw data or input files can also be placed under `raw_data/` - original data should not be modified and should be kept raw
+  - Processed or cleaned data or intermediate results from data analysis can be placed under `processed_data/`
+- Documentation is typically placed or compiled into `doc/` or `docs/`
+- Results are typically placed under `results/`
+- Tests are typically placed under `tests/`
+
+:::::: challenge
+
+Refactor your software project so that input data is stored in the `data/` directory and results (the graph and CSV
+data files) are saved in the `results/` directory.
+
+::: solution
+
+```python
+import matplotlib.pyplot as plt
+import pandas as pd
+import sys
+
+# https://data.nasa.gov/resource/eva.json (with modifications)
+
+def main(input_file, output_file, graph_file):
+    print("--START--")
+
+    eva_data = read_json_to_dataframe(input_file)
+
+    write_dataframe_to_csv(eva_data, output_file)
+
+    plot_cumulative_time_in_space(eva_data, graph_file)
+
+    print("--END--")
+
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
+
+
+def write_dataframe_to_csv(df, output_file):
+    """
+    Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+
+    Returns:
+        None
+    """
+    print(f'Saving to CSV file {output_file}')
+    df.to_csv(output_file, index=False)
+
+def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        duration_hours (float): The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours) + int(minutes)/60
+    return duration_hours
+
+
+def add_duration_hours_variable(df):
+    """
+    Add duration in hours (duration_hours) variable to the dataset
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+
+    Returns:
+        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
+    """
+    df_copy = df.copy()
+    df_copy["duration_hours"] = df_copy["duration"].apply(
+        text_to_duration
+    )
+    return df_copy
+
+
+def plot_cumulative_time_in_space(df, graph_file):
+    """
+    Plot the cumulative time spent in space over years
+
+    Convert the duration column from strings to number of hours
+    Calculate cumulative sum of durations
+    Generate a plot of cumulative time spent in space over years and
+    save it to the specified location
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        graph_file (str): The path to the output graph file.
+
+    Returns:
+        None
+    """
+    print(f'Plotting cumulative spacewalk duration and saving to {graph_file}')
+    df = add_duration_hours_variable(df)
+    df['cumulative_time'] = df['duration_hours'].cumsum()
+    plt.plot(df.date, df.cumulative_time, 'ko-')
+    plt.xlabel('Year')
+    plt.ylabel('Total time spent in space to date (hours)')
+    plt.tight_layout()
+    plt.savefig(graph_file)
+    plt.show()
+
+
+if __name__ == "__main__":
+
+    if len(sys.argv) < 3:
+        input_file = 'data/eva-data.json'
+        output_file = 'results/eva-data.csv'
+        print('Using default input and output filenames')
+    else:
+        input_file = sys.argv[1]
+        output_file = sys.argv[2]
+        print('Using custom input and output filenames')
+
+    graph_file = 'results/cumulative_eva_graph.png'
+    main(input_file, output_file, graph_file)
+
+```
+
+```bash
+(venv_spacewalks) $ git status
+(venv_spacewalks) $ git add eva_data_analysis.py data results
+(venv_spacewalks) $ git commit -m "Update project's directory structure"
+```
+:::
+
+::::::
+
+## Further reading
+
+We recommend the following resources for some additional reading on the topic of this episode:
+
+- [Organizing your projects](https://coderefinery.github.io/reproducible-research/organizing-projects/) chapter from the [CodeRefinery's Reproducible Research tutorial](https://coderefinery.github.io/reproducible-research/intro/)
+- [MIT Broad Research Communication Lab's "File Structure" guide](https://mitcommlab.mit.edu/broad/commkit/file-structure/)
+
+Also check the [full reference set](learners/reference.md#litref) for the course.
+
+::: keypoints
+
+- Good practices for code and project structure are essential for creating readable, accessible and reproducible projects.
+ +::: diff --git a/episodes/06-code-testing.md b/episodes/08-code-correctness.md similarity index 86% rename from episodes/06-code-testing.md rename to episodes/08-code-correctness.md index 55bd9f46..5a6a1db8 100644 --- a/episodes/06-code-testing.md +++ b/episodes/08-code-correctness.md @@ -1,17 +1,20 @@ --- -title: "Code testing" +title: "Code correctness" teaching: 60 exercises: 30 --- ::: questions + - How can we verify that our code is correct? - How can we automate our software tests? - What makes a "good" test? - Which parts of our code should we prioritize for testing? + ::: ::: objectives + After completing this episode, participants should be able to: - Explain why code testing is important and how this supports FAIR @@ -27,67 +30,200 @@ After completing this episode, participants should be able to: appropriately. - Evaluate code coverage to identify how much of the codebase is being tested and identify areas that need further tests. -::: -## Motivation for Code Testing +::: +Now that we have improved the structure and readability of our code - it is much easier to +test its functionality and improve it further. The goal of software testing is to check that the actual results -produced by a piece of code meet our expectations i.e. are correct. +produced by a piece of code meet our expectations, i.e. are correct. -Adopting software testing as part of our research workflow helps us to -conduct better research and produce FAIR software. +Before we move on with further code modifications, make sure your virtual development environment is active. + +:::::: instructor + +::: callout + +### Activate your virtual environment +If it is not already active, make sure to remind learners to activate their virtual environments from the root of +the software project directory in command line terminal (e.g. 
Bash or GitBash):
+
+```bash
+$ source venv_spacewalks/bin/activate # Mac or Linux
+$ source venv_spacewalks/Scripts/activate # Windows
+(venv_spacewalks) $
+```
+:::
+
+At this point, the state of the software repository should be as in
+https://github.com/carpentries-incubator/astronaut-data-analysis-not-so-fair/tree/08-code-correctness
+and the `eva_data_analysis.py` code should look as follows:
+
+``` python
+import matplotlib.pyplot as plt
+import pandas as pd
+import sys
+
+# https://data.nasa.gov/resource/eva.json (with modifications)
+
+def main(input_file, output_file, graph_file):
+    print("--START--")
+
+    eva_data = read_json_to_dataframe(input_file)
+
+    write_dataframe_to_csv(eva_data, output_file)
+
+    plot_cumulative_time_in_space(eva_data, graph_file)
+
+    print("--END--")
+
+def read_json_to_dataframe(input_file):
+    """
+    Read the data from a JSON file into a Pandas dataframe.
+    Clean the data by removing any incomplete rows and sort by date
+
+    Args:
+        input_file (str): The path to the JSON file.
+
+    Returns:
+        eva_df (pd.DataFrame): The cleaned and sorted data as a dataframe structure
+    """
+    print(f'Reading JSON file {input_file}')
+    eva_df = pd.read_json(input_file, convert_dates=['date'])
+    eva_df['eva'] = eva_df['eva'].astype(float)
+    eva_df.dropna(axis=0, inplace=True)
+    eva_df.sort_values('date', inplace=True)
+    return eva_df
 
-### Better Research
-Software testing can help us be better, more productive researchers.
 
+def write_dataframe_to_csv(df, output_file):
+    """
+    Write the dataframe to a CSV file.
+
+    Args:
+        df (pd.DataFrame): The input dataframe.
+        output_file (str): The path to the output CSV file.
+
+    Returns:
+        None
+    """
+    print(f'Saving to CSV file {output_file}')
+    df.to_csv(output_file, index=False)
+
+def text_to_duration(duration):
+    """
+    Convert a text format duration "HH:MM" to duration in hours
+
+    Args:
+        duration (str): The text format duration
+
+    Returns:
+        duration_hours (float): The duration in hours
+    """
+    hours, minutes = duration.split(":")
+    duration_hours = int(hours) + int(minutes)/60
+    return duration_hours
 
-Testing helps us to identify and fix problems with our code early and
-quickly and allows us to demonstrate to ourselves and others that our
-code does what we claim. More importantly, we can share our tests
-alongside our code, allowing others to check this for themselves.
 
-### FAIR Software
+def add_duration_hours_variable(df):
+    """
+    Add duration in hours (duration_hours) variable to the dataset
 
-Software testing also underpins the FAIR process by giving us the
-confidence to engage in open research practices.
+    Args:
+        df (pd.DataFrame): The input dataframe.
 
-If we are not sure that our code works as intended and produces accurate
-results, we are unlikely to feel confident about sharing our code with
-others. Software testing brings piece of mind by providing a
-step-by-step approach that we can apply to verify that our code is
-correct.
+    Returns:
+        df_copy (pd.DataFrame): A copy of df with the new duration_hours variable added
+    """
+    df_copy = df.copy()
+    df_copy["duration_hours"] = df_copy["duration"].apply(
+        text_to_duration
+    )
+    return df_copy
 
-Software testing also supports the FAIR process by improving the
-**accessibility** and **reusability** of our code.
 
-**Accessible**:
+def plot_cumulative_time_in_space(df, graph_file):
+    """
+    Plot the cumulative time spent in space over years
 
-- Well written software tests capture the expected behaviour of our
-  code and can be used alongside documentation to help developers
-  quickly make sense of our code.
+ Convert the duration column from strings to number of hours + Calculate cumulative sum of durations + Generate a plot of cumulative time spent in space over years and + save it to the specified location -**Reusable**: + Args: + df (pd.DataFrame): The input dataframe. + graph_file (str): The path to the output graph file. -- A well tested codebase allows developers to experiment with new - features safe in the knowledge that tests will reveal if their - changes have broken any existing functionality.\ -- The act of writing tests encourages to structure our code as - individual functions and often results in a more modular, readable, - maintainable codebase that is easier to extend or repurpose. + Returns: + None + """ + print(f'Plotting cumulative spacewalk duration and saving to {graph_file}') + df = add_duration_hours_variable(df) + df['cumulative_time'] = df['duration_hours'].cumsum() + plt.plot(df.date, df.cumulative_time, 'ko-') + plt.xlabel('Year') + plt.ylabel('Total time spent in space to date (hours)') + plt.tight_layout() + plt.savefig(graph_file) + plt.show() -### Types of Software Tests -There are many different types of software testing including: +if __name__ == "__main__": -- **Unit Tests**: Unit tests focus on testing individual functions in + if len(sys.argv) < 3: + input_file = 'data/eva-data.json' + output_file = 'results/eva-data.csv' + print(f'Using default input and output filenames') + else: + input_file = sys.argv[1] + output_file = sys.argv[2] + print('Using custom input and output filenames') + + graph_file = 'results/cumulative_eva_graph.png' + main(input_file, output_file, graph_file) + + +``` + +:::::: + +## Why use software testing? 
+
+Adopting software testing as part of our research workflow helps us to
+conduct **better research** and produce **FAIR software**:
+
+- Software testing can help us be more productive as it helps us to identify and fix problems with our code early and
+  quickly and allows us to demonstrate to ourselves and others that our
+  code does what we claim. More importantly, we can share our tests
+  alongside our code, allowing others to verify our software for themselves.
+- The act of writing tests encourages us to structure our code as individual functions and often results in a more
+  **readable**, modular and maintainable codebase that is easier to extend or repurpose.
+- Software testing improves the **accessibility** and **reusability** of our code - well-written software tests
+  capture the expected behaviour of our code and can be used alongside documentation to help other developers
+  quickly make sense of our code. In addition, a well-tested codebase allows developers to experiment with new
+  features safe in the knowledge that tests will reveal if their changes have broken any existing functionality.
+- Software testing underpins the FAIR process by giving us the
+  confidence to engage in open research practices - if we are not sure that our code works as intended and produces accurate
+  results, we are unlikely to feel confident about sharing our code with
+  others. Software testing brings peace of mind by providing a
+  step-by-step approach that we can apply to verify that our code is
+  correct.
+
+
+## Types of software tests
+
+There are many different types of software testing.
+
+- **Unit tests** focus on testing individual functions in
  isolation. They ensure that each small part of the software performs
  as intended. By verifying the correctness of these individual units,
  we can catch errors early in the development process.
-- **Integration Tests**: Integration tests, check how different parts +- **Integration tests** check how different parts of the code e.g. functions work together. -- **Regression Tests**: Regression tests are used to ensure that new +- **Regression tests** are used to ensure that new changes or updates to the codebase do not adversely affect the existing functionality. They involve checking whether a program or part of a program still generates the same results after changes @@ -101,7 +237,7 @@ concepts and techniques we cover will provide a solid foundation applicable to other types of testing. ::: challenge -## Types of Software Tests +### Types of software tests Fill in the blanks in the sentences below: @@ -126,7 +262,7 @@ Fill in the blanks in the sentences below: ::: ::: -### Informal Testing +## Informal testing **How should we test our code?** Let’s start by considering the following scenario. A collaborator on our project has sent us the @@ -244,7 +380,7 @@ However, there are limitations to this approach: tested and which have not ::: -## Formal Testing +## Formal testing We can overcome some of these limitations by formalising our testing process. A formal approach to testing our function(s) is to write @@ -285,8 +421,8 @@ Let's create a new python file `test_code.py` in the root of our project folder to store our tests. ``` bash -cd Spacewalks -touch test_code.py +(venv_spacewalks) $ cd spacewalks +(venv_spacewalks) $ touch test_code.py ``` First, we import text_to_duration into our test script. Then, we then @@ -417,7 +553,7 @@ and re-run the test script. As our code base and tests grow, this will become cumbersome. This is not ideal and can be overcome by automating our tests using a testing framework. -## Using a Testing Framework +## Using a testing framework Our approach so far has had two major limitations: @@ -433,7 +569,16 @@ We will use the python testing framework pytest with its code coverage plugin pytest-cov. 
To install these libraries, open a terminal and type: ``` bash -python -m pip install pytest pytest-cov +(venv_spacewalks) $ python -m pip install pytest pytest-cov +``` + +Make sure to also capture the changes to our virtual development environment. + +```bash +(venv_spacewalks) $ python -m pip freeze > requirements.txt +(venv_spacewalks) $ git add requirements.txt +(venv_spacewalks) $ git commit -m "Added pytest and pytest-cov libraries." +(venv_spacewalks) $ git push origin main ``` Let’s make sure that our tests are ready to work with pytest. @@ -465,8 +610,8 @@ move it to a dedicated test folder and rename our test_code.py file to test_eva_analysis.py. ``` bash -mkdir tests -mv test_code.py tests/test_eva_analysis.py +(venv_spacewalks) $ mkdir tests +(venv_spacewalks) $ mv test_code.py tests/test_eva_analysis.py ``` Before we re-run our tests using pytest, let's update our second test. @@ -544,7 +689,7 @@ def text_to_duration(duration): Finally, let's run our tests: ``` bash -python -m pytest +(venv_spacewalks) $ python -m pytest ``` ``` output @@ -990,7 +1135,7 @@ def text_to_duration(duration): ``` ``` bash -python -m pytest --cov +(venv_spacewalks) $ python -m pytest --cov ``` ``` output @@ -1018,7 +1163,7 @@ To get an in-depth report about which parts of our code are tested and which are not, we can add the option `--cov-report=html`. ``` bash -python -m pytest --cov --cov-report=html +(venv_spacewalks) $ python -m pytest --cov --cov-report=html ``` This option generates a folder `htmlcov` which contains a html code @@ -1053,7 +1198,7 @@ b. Which functions in our code base are currently untested? ::: solution ``` bash -python -m pytest --cov --cov-report=html +(venv_spacewalks) $ python -m pytest --cov --cov-report=html ``` a. The proportion of the code base NOT covered by our tests is @@ -1463,151 +1608,22 @@ Finally lets commit our test suite to our codebase and push the changes to GitHub. 
``` bash -git add eva_data_analysis.py -git commit -m "Add additional analysis functions" -git add tests/ -git commit -m "Add test suite" -git push origin main -``` - -## Continuous Integration (Optional) - -::: callout -### Continuous Integration - -So far, we have run our tests locally using. - -``` bash -python -m pytest +(venv_spacewalks) $ git add eva_data_analysis.py +(venv_spacewalks) $ git commit -m "Add additional analysis functions" +(venv_spacewalks) $ git add tests/ +(venv_spacewalks) $ git commit -m "Add test suite" +(venv_spacewalks) $ git push origin main ``` -A limitation of this approach is that we must remember to run our tests -each time we make any changes. - -Continuous integration services provide the infrastructure to -automatically run a\ -code's test suite every time changes are pushed to a remote repository. -This means that each time we (or our colleagues) push to the remote, the -test suite will be run to verify that our tests still pass. +## Continuous Integration for automated testing -If we are using GitHub, we can use the continuous integration service -GitHub Actions to automatically run our tests. +Continuous Integration (CI) services provide the infrastructure to +automatically run the code's test suite every time changes are pushed to a remote repository. +There is an [extra episode on configuring CI for automated tests on GitHub](../learners/ci-for-testing.md) +for some additional reading. -To setup this up: - -- Navigate to the spacewalks folder: - -``` bash -cd ~/Desktop/Spacewalks -``` - -- To setup continuous integration on GitHub actions, the dependencies - of our code must be recorded in a `requirements.txt` file in the - root of our repository. -- You can find out more about creating requirements.txt files from - CodeRefinery's tutorial on "Recording Dependencies". 
-- For now, add the following list of code dependencies to - requirements.txt in the root of the spacewalks repository: - -``` bash -touch requirements.txt -``` - -Content of `requirements.txt`: - -``` output -numpy -pandas -matplotlib -pytest -pytest-cov -``` - -- Commit the changes to your repository: - -``` bash -git add requirements.txt -git commit -m "Add requirements.txt file" -``` - -Now let's define out continuous integration workflow: - -- Create a hidden folder .github/workflows - -``` bash -mkdir -p .github/workflows -touch .github/workflows/main.yml -``` - -- Define the continuous integration workflow to run on GitHub actions. - -``` yaml -name: CI - -# We can specify which Github events will trigger a CI build -on: push - -# now define a single job 'build' (but could define more) -jobs: - - build: - - # we can also specify the OS to run tests on - runs-on: ubuntu-latest - - # a job is a sequence of steps - steps: - - # Next we need to checkout out repository, and set up Python - # A 'name' is just an optional label shown in the log - helpful to clarify progress - and can be anything - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Set up Python 3.12 - uses: actions/setup-python@v4 - with: - python-version: "3.12" - - - name: Install Python dependencies - run: | - python3 -m pip install --upgrade pip - python3 -m pip install -r requirements.txt - - - name: Test with PyTest - run: | - python3 -m pytest --cov -``` - -This workflow definition file instructs GitHub Actions to run our unit -tests using python version 3.12 each time code is pushed to our -repository, - -- Let's push these changes to our repository and see if the tests are - run on GitHub. 
- -``` bash -git add .github/workflows/main.yml -git commit -m "Add GitHub actions workflow" -git push origin main -``` - -- To find out if the workflow has run, navigate to the following page - in your browser: - -``` -https://github.com/YOUR-REPOSITORY/actions -``` - -- On the left of this page a sidebar titled "Actions" lists all the - workflows that are active in our repository. You should "CI" here - (the `name` of the workflow we just added to our repository ). -- The body of the page lists the outcome of all historic workflow - runs. If the workflow was triggered successfully when we pushed to - the repository, you should see one workflow run listed here. -::: - -### Summary +## Summary During this episode, we have covered how to use software tests to verify the correctness of our code. We have seen how to write a unit test, how @@ -1621,11 +1637,8 @@ engage in open research practices. Tests also document the intended behaviour of our code for other developers and mean that we can experiment with changes to our code knowing that our tests will let us know if we break any existing functionality. In other words, software -testing suppors FAIR software by making our code more Accessible and -Reusable. - -To find out more about this topic, please see the "Further reading" -section below. +testing supports the FAIR software principles by making our code more **accessible** and +**reusable**. ## Further reading @@ -1652,6 +1665,7 @@ Also check the [full reference set](learners/reference.md#litref) for the course. ::: keypoints + 1. Code testing supports the FAIR principles by improving the accessibility and re-usability of research code. 2. Unit testing is crucial as it ensures each functions works @@ -1662,4 +1676,5 @@ the course. ensure your code performs correctly under a variety of conditions. 5. Test coverage can help you to identify parts of your code that require additional testing. +6. 
::: diff --git a/episodes/07-documenting-code.md b/episodes/09-code-documentation.md similarity index 90% rename from episodes/07-documenting-code.md rename to episodes/09-code-documentation.md index 4f85e776..8a6178ec 100644 --- a/episodes/07-documenting-code.md +++ b/episodes/09-code-documentation.md @@ -1,5 +1,5 @@ --- -title: Documenting code +title: Code documentation teaching: 60 exercises: 30 --- @@ -25,39 +25,49 @@ After completing this episode, participants should be able to: :::::::::::::::::::::::::::::::::::::::::::::::: -## Motivation for documenting software - We have seen how writing inline comments and docstrings within our code can help with improving its readability. The purpose of software documentation is to communicate other important information about our software (its purpose, dependencies, how to install and run it, etc.) to the people who need it – both users and developers. -### Better research +## Why document our software? Software documentation is often perceived as a thankless and time-consuming task with few tangible benefits and is often neglected in research projects. -However, like software testing, documenting our software can help us and others become more productive researchers. 
-Here are some advantages of documenting our code: +However, like software testing, documenting our software can help us and others +conduct **better research** and produce **FAIR software**: - Good documentation captures important methodological details ready for when we come to publish our research - Good documentation can help us return to a project seamlessly after time away -- Documentation can facilitate collaborations by helping to onboard new project members +- Documentation can facilitate collaborations by helping us onboard new project members quickly and more easily - Good documentation can save us time by answering frequently asked questions (FAQs) about our code for us +- Software documentation supports the FAIR research software principles by improving the re-usability of our code. + - Good documentation can make our software more understandable and reusable by others, and can bring us some citations + and credit + - How-to guides and tutorials ensure that users can install our software independently and make use of its basic features + - Reference guides and background information can help developers understand our code sufficiently to + modify/extend/repurpose it. + +Before we move on with further code modifications, make sure your virtual development environment is active. -### FAIR software +::: callout -Software documentation supports FAIR software by improving the re-usability of our code. +### Activate your virtual environment +If it is not already active, make sure to activate your virtual environment from the root of your project directory +in your command line terminal (e.g. 
Bash or GitBash):
+
+```bash
+$ source venv_spacewalks/bin/activate # Mac or Linux
+$ source venv_spacewalks/Scripts/activate # Windows
+(venv_spacewalks) $
+```
+
+:::
 
 ## Software-level documentation
 
-In previous episodes we encountered several different forms of in-code documentation
-including in-line comments and docstrings.
+In previous episodes we encountered several different forms of in-code documentation,
+including in-line comments and docstrings.
 
 These are an excellent way to improve the readability of our code, but by themselves
 are insufficient to
 ensure that our code is easy to use, understand and modify -
@@ -102,67 +112,48 @@ A README file acts as a “landing page” for your code repository on GitHub an
 
 ::::::::::::::::::::::::::::::::::::: challenge
 
-### READMEs and The FAIR Principles
+### README and the FAIR principles
 
-Think about the question below. Your instructors may ask you to share your answer in a shared notes document and/or discuss them with other participants.
+Think about the question below. Your instructors may ask you to share your answer in a shared notes document and/or
+discuss them with other participants.
 
-Here are some of the major sections you might find in a typical README. Which are **essential** to support the FAIR principles? Which are optional?
+Here are some of the major sections you might find in a typical README.
+Which are **essential** to support the FAIR principles? Which are optional?
+ Purpose of the code + Audience (who the code is intended for) -+ Installation Instructions -+ Contribution Guide -+ How to Get Help ++ Installation instructions ++ Contribution guide ++ How to get help + License -+ Software Citation -+ Usage Example ++ Software citation ++ Usage example + Dependencies and their versions + FAQs + Code of Conduct :::::::::::::::::::::::: solution -To support the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), certain sections in a README file are more important than others. Below is a breakdown of the sections that are ESSENTIAL / OPTIONAL in a README to align with these principles. - -### Essential - -1. **Purpose of the code** - - **Explanation:** Clearly explains what the code does. This is essential for findability and reusability. - -2. **Installation Instructions** - - **Explanation:** Provides step-by-step instructions on how to install the software, ensuring accessibility. - -3. **Usage Example** - - **Explanation:** Provides examples of how to use the code, helping users understand its functionality and enhancing reusability. - -4. **License** - - **Explanation:** Specifies the terms under which the code can be used, which is crucial for legal clarity and reusability. - -5. **Dependencies and their versions** - - **Explanation:** Lists the external libraries and tools required to run the code, including their versions. This is essential for reproducibility and interoperability. - -6. **Software Citation** - - **Explanation:** Provides citation information for academic use, ensuring proper attribution and reusability. - -### Optional +To support the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), +certain sections in a README file are more important than others. +Below is a breakdown of the sections that are *essential* or *optional* in a README to align with these principles. -7. 
**Audience (who the code is intended for)**
-    - **Explanation:** Helps users identify if the code is relevant to them, improving findability and usability.
-
+#### Essential

-8. **How to Get Help**
-    - **Explanation:** Informs users where they can get help, ensuring better accessibility.
+- **Purpose of the code** - clearly explains what the code does; essential for findability and reusability.
+- **Installation instructions** - provides step-by-step instructions on how to install the software, ensuring accessibility.
+- **Usage example** - provides examples of how to use the code, helping users understand its functionality and enhancing reusability.
+- **License** - specifies the terms under which the code can be used, which is crucial for legal clarity and reusability.
+- **Dependencies and their versions** - lists the external libraries and tools required to run the code, including their versions; essential for reproducibility and interoperability.
+- **Software citation** - provides citation information for academic use, ensuring proper attribution and reusability.
+
+#### Optional

-9. **Contribution Guide**
-    - **Explanation:** Encourages and guides contributions from the community, enhancing the code's development and reusability.
-
-10. **FAQs**
-    - **Explanation:** Provides answers to common questions, aiding in troubleshooting and improving accessibility.
-
-
-11. **Code of Conduct**
-    - **Explanation:** Sets expectations for behaviour in the community, fostering a welcoming environment and enhancing accessibility.
+- **Audience (who the code is intended for)** - helps users identify if the code is relevant to them, improving findability and usability.
+- **How to get help** - informs users where they can get help, ensuring better accessibility.
+- **Contribution guide** - encourages and guides contributions from the community, enhancing the code's development and reusability.
+- **FAQs** - provide answers to common questions, aiding in troubleshooting and improving accessibility.
+- **Code of Conduct** - sets expectations for behaviour in the community, fostering a welcoming environment and enhancing accessibility.

:::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::
@@ -877,6 +868,21 @@ How does the content and language of our example tutorial differ from our exampl
:::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::::::::::::
+::: callout
+
+### Commit and push your changes
+
+Do not forget to commit any uncommitted changes you may have and then push your work to GitHub.
+
+```bash
+git add <changed-files>
+git commit -m "Your commit message"
+git push origin main
+```
+
+:::
+
+
 ## Further reading

 We recommend the following resources for some additional reading on the topic of this episode:
diff --git a/episodes/08-open-collaboration.md b/episodes/10-open-collaboration.md
similarity index 94%
rename from episodes/08-open-collaboration.md
rename to episodes/10-open-collaboration.md
index b318bb99..ee2050ef 100644
--- a/episodes/08-open-collaboration.md
+++ b/episodes/10-open-collaboration.md
@@ -1,5 +1,5 @@
 ---
-title: Open project collaboration & management
+title: Open collaboration on code
 teaching: 60
 exercises: 30
 ---
@@ -22,11 +22,27 @@ After completing this episode, participants should be able to:

::::::::::::::::::::::::::::::::::::::::::::::::

-# Sharing your code to encourage collaboration
-
 In addition to adding a license and other metadata to our code (covered in a previous episode) there are
 several other important steps to consider before sharing the code publicly.

+Before we move on with further code modifications, make sure your virtual development environment is active.
+
+::: callout
+
+### Activate your virtual environment
+If it is not already active, make sure to activate your virtual environment from the root of your project directory
+in your command line terminal (e.g. 
Bash or Git Bash):
+
+```bash
+$ source venv_spacewalks/bin/activate # Mac or Linux
+$ source venv_spacewalks/Scripts/activate # Windows
+(venv_spacewalks) $
+```
+
+:::
+
+## Sharing code to encourage collaboration
+
 ### Making the code public

 By default repositories created on Github are private and only their creator can see them.
@@ -177,12 +193,12 @@ itself and not a paper, although a short abstract or description of the software

:::

-# Working with collaborators
+## Working with collaborators

 The strength of online collaboration tools such as Github doesn't just lie in the ability to share code. They also allow us to track problems with that code, for multiple developers to work on it independently and bring their changes together and to review those changes before they are accepted.

-## Tracking issues with code
+### Tracking issues with code

 A key feature of Github (as opposed to Git itself) is the issue tracker. This provides us with a place to keep track of any problems or bugs in the code and to discuss them with other developers. Sometimes advanced users will also use issue trackers of public projects to report problems they are having (and sometimes this is misused by users
@@ -213,9 +229,7 @@ Create a new issue in your repository's issue tracker by doing the following:
 - Click on the "New issue" button
 - Enter a title and description for the issue
 - Click the "Submit Issue" button to create the issue.
-
-
-:::::::::::::::::::::::::::::::::::::::::::::::
+:::

 ### Discussing an issue

@@ -227,7 +241,7 @@ We can also reference other issues by writing a # symbol and the number of the o
 Once an issue is solved then it can be closed. This can be done either by pressing the "Close" button in the Github web interface or by making a commit which includes the word "fixes", "fixed", "close", "closed" or "closes" followed by a # symbol and the issue number.
-## Working in parallel with Git branches
+### Working in parallel with Git branches

 Branching is a feature of Git that allows two or more parallel streams of work. Commits can be made to one branch
 without interfering with
another. Branches are commonly used as a way for one developer to work on a new feature or a bug fix while other
developers work on other features. When those new features or bug fixes are complete, the branch will be merged back
with the main (sometimes called master) branch.

-### Creating a new branch
+#### Creating a new branch

 New git branches are created with the `git branch` command. This should be followed by the name of the branch to
create. It is common practice when the bug we are fixing has a corresponding issue to name the branch after the issue
number and name.
@@ -273,7 +287,7 @@ To create a branch and change to it in a single command we can use `git switch`
 git switch -c 02-another-bug
 ```

-### Committing to a branch
+#### Committing to a branch

 Once we have switched to a branch any further commits that are made will go to that branch. When we run a `git commit`
command we'll see the name of the branch we're committing to in the output of `git commit`.
Let's edit our code and fix the lack of default values bug that we entered into the issue tracker earlier on.
@@ -330,7 +344,7 @@ command.

 git pull origin 01-extra-bracket-bug
 ```

-## Merging branches
+### Merging branches

 When we have completed working on a branch (for example fixing a bug) then we can merge our branch back into the main
one (or any other branch). This is done with the `git merge` command.
@@ -346,7 +360,7 @@ Now we're back on the main branch we can go ahead and merge the changes from the

 git merge 01-extra-bracket-bug
 ```

-## Pull requests
+### Pull requests

 On larger projects we might need to have a code review process before changes are merged, especially before they are merged onto the main branch that might be what is being released as the public version of the software. Github has a process for this that it calls a "Pull Request", other Git services such as GitLab have different names for this, GitLab calls them "Merge Requests".
@@ -375,7 +389,7 @@ projects that you contribute to might have their own rules about what kind of me

 Go ahead and click on "Merge pull request", then "Confirm merge". The changes will now be merged together. Github gives us the option to delete the branch we were working on, since its history is preserved in the main branch there isn't any reason to keep it.

-### Using forks instead of branches
+#### Using forks instead of branches

 A fork is similar to a branch, but instead of it being part of the same repository it is an entirely new copy of the repository. Forks are commonly used by Github users who wish to work on a project that they're not a member of. Typically forking will copy the repository to our own namespace (e.g. github.com/username/reponame instead of github.com/projectname/reponame)
@@ -399,23 +413,31 @@ They will get an email and an alert within Github to accept your invitation to w
 - Commit these changes to your fork
 - Create a pull request back to the original repository
 - Your partner will now receive your pull request and can review
+:::

-:::::::::::::::::::::::::::::::::::::::::::::::
+::: callout

-## Acknowledgements
+### Commit and push your changes

-The content of this episode was inspired / heavily borrowed from the following resources:
+Do not forget to commit any uncommitted changes you may have and then push your work to GitHub.
+
+```bash
+git add <changed-files>
+git commit -m "Your commit message"
+git push origin main
+```
+
+:::

-- Software carpentry git lesson licensing and citation sections - https://swcarpentry.github.io/git-novice/11-licensing.html and https://swcarpentry.github.io/git-novice/12-citation.html
-- Carpentries Github Skill up - https://carpentries-incubator.github.io/github-skill-up-instructors/ and https://carpentries.github.io/github-skill-up-maintainers/
-- RSG Soton Git lesson - https://southampton-rsg.github.io/swc-git-novice/06-collab/index.html

 ## Further reading

 We recommend the following resources for some additional reading on the topic of this episode:

 - Licensing and citation episodes from the [Software Carpentry's Git Novice lesson][swc-git-lesson]
+- Carpentries Github Skill ups for [instructors][git-skillup-instructors] and [maintainers][git-skillup-maintainers]
+- [RSG Southampton Git lesson][git-soton] - [collaboration section][git-soton-collaboration]
 - [Open source definition][osd-definition], by the [Open Source Initiative][osd]
 - [What is free software?][free-software]
diff --git a/episodes/10-wrap-up.md b/episodes/11-wrap-up.md
similarity index 100%
rename from episodes/10-wrap-up.md
rename to episodes/11-wrap-up.md
diff --git a/learners/ci-for-testing.md b/learners/ci-for-testing.md
new file mode 100644
index 00000000..7a1ad514
--- /dev/null
+++ b/learners/ci-for-testing.md
@@ -0,0 +1,141 @@
+---
+title: Continuous Integration for automated testing
+teaching: 20
+exercises: 20
+---
+
+:::::::::::::::::::::::::::::::::::::: questions
+
+- How can I automate the testing of my repository's code in a way that scales well?
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::: objectives
+
+After completing this episode, participants should be able to:
+
+- Describe the benefits of using Continuous Integration for further automation of testing
+- Enable GitHub Actions Continuous Integration for public open source repositories
+- Use continuous integration to automatically run unit tests and code coverage when changes are committed to a version control repository
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+So far in this course, we have run our tests locally using:
+
+``` bash
+$ python -m pytest
+```
+
+A limitation of this approach is that we must remember to run our tests
+each time we make any changes. Continuous Integration (CI) services provide the infrastructure to
+automatically run the code's test suite every time changes are pushed to a remote repository.
+This means that each time we (or our colleagues) push to the remote, the
+test suite will be run to verify that our tests still pass.
+
+If we are using GitHub, we can use the Continuous Integration service called
+GitHub Actions to automatically run our tests.
+To set this up, the following steps are needed:
+
+- Navigate to your software project directory, e.g. `$ cd ~/Desktop/spacewalks`
+- Make sure your software's dependencies are recorded in a `requirements.txt` file in the
+root of your repository, as we have been doing so far.
+Make sure to capture the latest changes in your dependencies with:
+
+  ```bash
+  $ source venv_spacewalks/bin/activate # to activate your virtual environment
+  (venv_spacewalks) $ python -m pip freeze > requirements.txt
+  ```
+Your `requirements.txt` should look like this:
+
+  ``` bash
+  numpy
+  pandas
+  matplotlib
+  pytest
+  pytest-cov
+  ```
+- Push any changes to `requirements.txt` to GitHub:
+
+  ```bash
+  (venv_spacewalks) $ git add requirements.txt
+  (venv_spacewalks) $ git commit -m "Updated requirements.txt."
+  (venv_spacewalks) $ git push origin main
+  ```
+- Define the CI integration workflow `main.yml` in a newly created folder `.github/workflows` off your project root:
+
+  ``` bash
+  (venv_spacewalks) $ mkdir -p .github/workflows
+  (venv_spacewalks) $ touch .github/workflows/main.yml
+  ```
+- Populate the CI workflow file with commands to run on GitHub Actions.
+
+  ``` yaml
+  name: CI
+
+  # We can specify which Github events will trigger a CI build
+  on: push
+
+  # now define a single job 'build' (but could define more)
+  jobs:
+
+    build:
+
+      # we can also specify the OS to run tests on
+      runs-on: ubuntu-latest
+
+      # a job is a sequence of steps
+      steps:
+
+      # Next we need to check out our repository, and set up Python
+      # A 'name' is just an optional label shown in the log - helpful to clarify progress - and can be anything
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Python 3.12
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.12"
+
+      - name: Install Python dependencies
+        run: |
+          python3 -m pip install --upgrade pip
+          python3 -m pip install -r requirements.txt
+
+      - name: Test with PyTest
+        run: |
+          python3 -m pytest --cov
+  ```
+This workflow definition file instructs GitHub Actions to run our unit
+tests using Python version 3.12 each time code is pushed to our
+repository.
+
+- Push these changes to our repository to initiate running of tests on GitHub.
+
+  ``` bash
+  (venv_spacewalks) $ git add .github/workflows/main.yml
+  (venv_spacewalks) $ git commit -m "Add GitHub actions workflow"
+  (venv_spacewalks) $ git push origin main
+  ```
+
+- To check if the workflow has run, navigate to the following page in your browser:
+
+  ```
+  https://github.com/YOUR-REPOSITORY/actions
+  ```
+On the left of this page a sidebar titled "Actions" lists all the
+workflows that are active in our repository. You should see "CI" here
+(which is the `name` of the workflow we just added to our repository).
+The body of the page lists the outcome of all historic workflow +runs. If the workflow was triggered successfully when we pushed to +the repository, you should see one workflow run listed here. + + +::: keypoints + +- Continuous Integration can run tests automatically to verify changes as code develops in our repository. +- CI builds are typically triggered by commits pushed to a repository. +- We need to write a configuration file to inform a CI service what to do for a build. +- We can run - and get reports from - different CI infrastructure builds simultaneously. + +::: \ No newline at end of file diff --git a/episodes/09-code-ethics.md b/learners/ethical-environmental-considerations.md similarity index 99% rename from episodes/09-code-ethics.md rename to learners/ethical-environmental-considerations.md index 27e69420..c41271c6 100644 --- a/episodes/09-code-ethics.md +++ b/learners/ethical-environmental-considerations.md @@ -1,5 +1,5 @@ --- -title: Ethical considerations for research software +title: Ethical & environmental considerations around research software teaching: 45 exercises: 15 --- diff --git a/learners/reference.md b/learners/reference.md index c7ff2b66..b141cbd7 100644 --- a/learners/reference.md +++ b/learners/reference.md @@ -82,3 +82,7 @@ and Australian Research Data Commons - [CODECHECK][codecheck], an approach for independent execution of computations underlying research articles +- Carpentries Github Skill ups for [instructors][git-skillup-instructors] and [maintainers][git-skillup-maintainers] + +- [RSG Southampton Git lesson][git-soton] - [collaboration section][git-soton-collaboration] + diff --git a/links.md b/links.md index 6ff9c4b6..341a7fda 100644 --- a/links.md +++ b/links.md @@ -38,6 +38,12 @@ any links that you are not going to use. 
[beginner-guide-reproducible-research]: https://esajournals.onlinelibrary.wiley.com/doi/10.1002/bes2.1801 [swc-git-lesson]: https://swcarpentry.github.io/git-novice [swc-git-lesson-track]: https://swcarpentry.github.io/git-novice/04-changes.html +[swc-git-lesson-licencing]: https://swcarpentry.github.io/git-novice/11-licensing.html +[swc-git-lesson-citation]: https://swcarpentry.github.io/git-novice/12-citation.html +[git-skillup-instructors]: https://carpentries-incubator.github.io/github-skill-up-instructors/ +[git-skillup-maintainers]: https://carpentries.github.io/github-skill-up-maintainers/ +[git-soton]: https://southampton-rsg.github.io/swc-git-novice/index.html +[git-soton-collaboration]: https://southampton-rsg.github.io/swc-git-novice/06-collab/index.html [git-diff-docs]: https://git-scm.com/docs/git-diff [ttw-guide-version-control]: https://the-turing-way.netlify.app/reproducible-research/vcs [how-git-works]: https://www.pluralsight.com/courses/how-git-works @@ -108,3 +114,4 @@ any links that you are not going to use. [coursera-inline-comments]: https://www.coursera.org/tutorials/python-comment#inline-commenting-in-python [python-functions-intro]: https://introtopython.org/introducing_functions.html [python-functions-w3schools]: https://www.w3schools.com/python/python_functions.asp +[github]: https://github.com