From 2a8685bbdcc395fbab8c56279929ae709d455540 Mon Sep 17 00:00:00 2001
From: Beth Jones
Date: Mon, 21 Oct 2024 10:30:53 +0100
Subject: [PATCH 01/33] added oxford commas for consistency

---
 book/glossary.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/book/glossary.md b/book/glossary.md
index 1dc46b50..b95de9a8 100644
--- a/book/glossary.md
+++ b/book/glossary.md
@@ -13,7 +13,7 @@ Treating a problem as an idea or concept, rather than a detailed individual exam
Abstraction is used to manage the complexity of software, by describing our logic in a more generic way or by hiding complexity.
When we use similar logic in multiple parts of our code, we can abstract this logic into a generic function or method to be reused.
-When part of our process is very complex but is self-contained we can hide this complexity it by putting it inside a function and referring to the function.
+When part of our process is very complex, but is self-contained, we can hide this complexity by putting it inside a function and referring to the function.

### Application Programming Interface (API)

@@ -198,7 +198,7 @@ Using open-source software and publishing our code with open-source licenses ens
Open-source programming languages are free to use.
We recommend using Python and R, which support the good practices outlined in this guidance.

-Proprietary software, including SPSS, SAS and Stata are closed source.
+Proprietary software, including SPSS, SAS, and Stata are closed source.
These are expensive to use and do not support many of the good analysis practices outlined in this guidance.
Not using open source analysis tools means that our users need to purchase software to reproduce our results.

@@ -236,12 +236,12 @@ A orchestrated chain of programs that will execute the next program(s) in the ch

### Procedural Running

-Where a script is executed as statements run one after the other.
+Where a script is executed as statements that run one after the other.


### Program

-A main script that can contain statements, functions and methods. The main script may import functions and methods from other scripts. This is the script that would be run by the user/orchestration tool.
+A main script that can contain statements, functions, and methods. The main script may import functions and methods from other scripts. This is the script that would be run by the user/orchestration tool.

### Python and R

@@ -258,7 +258,7 @@ Ability to gain an understanding of code within a reasonable amount of time.

### Reproducible Analytical Pipelines (RAP)

Reproducible Analytical Pipelines (RAP) are analyses that are carried out following good software engineering practices that are described by this guidance.
-They focus on the use of open-source analytical tools and a variety of techniques to deliver reproducible, auditable and assured data analyses.
+They focus on the use of open-source analytical tools and a variety of techniques to deliver reproducible, auditable, and assured data analyses.
RAP is more generally a culture of wanting to improve the quality of our analysis, by improving the quality assurance of our analysis code.
From d83f87cd50354070f41698e696f55c88b5ab7c67 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Mon, 21 Oct 2024 15:00:08 +0100 Subject: [PATCH 02/33] punctuation and typo changes --- book/managers_guide.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/book/managers_guide.md b/book/managers_guide.md index ea71721a..16957fd9 100644 --- a/book/managers_guide.md +++ b/book/managers_guide.md @@ -34,7 +34,7 @@ When managing analytical work, you should not need an in-depth understanding of However, you should be confident that the approach the team has taken is appropriate given the user need, and that proportionate quality assurance is being applied to the development and running of the analysis. -You should work with your team to decide on which quality assurance practices are necessary given each piece of analysis. +You should work with your team to decide which quality assurance practices are necessary given each piece of analysis. You might find our [](checklists.md) useful templates for defining the target level of assurance. When possible, you should define the target assurance level before starting the analysis. @@ -50,7 +50,7 @@ can be used as a reference to apply these practices to analysis. You should identify where each analyst is along the pathway - they should look to develop the next skill in the pathway and apply this, rather than attempting to adopt them all at once. -Note that it is important to maintain technical skills in the analysis team for sustainability, to ensure that the analysis can be understood, updated and maintained. +Note that it is important to maintain technical skills in the analysis team for sustainability, to ensure that the analysis can be understood, updated, and maintained. ``` Despite the initial cost of developing technical skills, @@ -80,7 +80,7 @@ Understanding user needs ensures that the analysis is valuable. * There should be a plan to consult users at the beginning and throughout the development process, to ensure that their needs are being met. * The methodology and data should be suitable for the question being asked. -* The analysis must be developed by more than one individual, to allow pair programming, peer review and mentoring. This increases the sustainability of analysis. +* The analysis must be developed by more than one individual, to allow pair programming, peer review, and mentoring. This increases the sustainability of analysis. * The analysis should be carried out using open-source analysis tools, wherever possible. Your team should be able to explain why they have chosen the analysis tools and why they are confident that they are fit for purpose. @@ -118,7 +118,7 @@ to assess how the team are progressing towards the target quality assurance prac * Logic should be written as functions, so that it can be reused consistently and tested. * Related functions should be grouped together in the same file, so that it is easy to find them. -* Logic with different responsibilities (e.g. reading data versus transforming data) should be clearly separated. +* Logic with different responsibilities (e.g., reading data versus transforming data) should be clearly separated. * When code can be reused for other analyses, it should be stored and shared as a package. @@ -134,7 +134,7 @@ to assess how the team are progressing towards the target quality assurance prac [Code documentation](code_documentation.md) is essential for business continuity and sustainability. 
-* Every function should be documented in the code, so that it is clear what the function is supposed to do. +* Every function should be documented in the code, so it is clear what the function is supposed to do. * Function documentation should include what goes in and what comes out of each function. * Where code will be run or re-used by others, documentation should include usage examples and test data. @@ -147,7 +147,7 @@ to assess how the team are progressing towards the target quality assurance prac * Software and package versions should be documented with the code. Typically package versions are recorded using `setup.py` and `requirements.txt` files (Python) or a `DESCRIPTION` file (R). * Code should not be dependent on a specific computer to run. Running it on a colleague's system can help to check this. -* When the same analysis is run multiple times or on different systems it should give reproducible outcomes. +* When the same analysis is run multiple times or on different systems, it should give reproducible outcomes. * Container systems, like Docker, help to create reproducible environments to run code. @@ -207,4 +207,4 @@ Large differences in the outcome of the results may indicate an issue with the a Code quality improves over time, as your team learn more about good practices. * The team should be aiming to meet the agreed assurance level, but should also consider which practices could be applied next to improve the code beyond this. -* You should review training needs in your team and allow time for continuos personal development of these practices. +* You should review training needs in your team and allow time for continuous personal development of these practices. From e2015a24cc4f1698993e8c2beb98db170d3f44e1 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Wed, 23 Oct 2024 10:57:25 +0100 Subject: [PATCH 03/33] minor punctuation edits on Principles --- book/principles.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/book/principles.md b/book/principles.md index 13d648c6..b64f8a5f 100644 --- a/book/principles.md +++ b/book/principles.md @@ -1,6 +1,6 @@ # Principles -When we do analysis it must be fit for purpose. +When we do analysis, it must be fit for purpose. If it isn't, we risk misinforming decisions. Bad analysis can result in harm or misallocation of public funds. As such, we must take the right steps to ensure high quality analysis. @@ -46,7 +46,7 @@ Each of these pieces of guidance advocate reproducibility as a core tenet of qua Reproducibility is the only thing that you can guarantee in your analysis. It is the first pillar of good analysis. -If you can't prove that you can run the same analysis, with the same data, and obtain the same results then you are not adding a valuable analysis. +If you can't prove that you can run the same analysis, with the same data, and obtain the same results, then you are not adding a valuable analysis. The additional assurances of peer review, rigorous testing, and validity are secondary to being able to reproduce any analysis that you carry out in a proportionate amount of time. @@ -98,11 +98,11 @@ We don't want to overburden analysts with QA procedures when they are not requir In government we advocate **proportionality** - the right quality assurance procedures for the right analysis. Analysis can be proportionately assured through peer review and defensive design. -We advocate following your department's guidance on what proportionate quality assurance looks like. 
+We suggest following your department's guidance on what proportionate quality assurance looks like. Most departments derive their guidance from the [Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government). Assurance is best demonstrated through peer review. Peer reviewers must be able to understand your analytical choices and be able to reproduce your conclusions. -In particularly high risk analysis, dual running should be considered. +Dual running should be considered, particularly for high risk analysis. Guarantees of quality assurance should be published alongside any report, or be taken into consideration by decision makers. @@ -116,7 +116,7 @@ These workflows should follow the principles of reproducible analysis. We call these [Reproducible Analytical Pipelines (RAP)](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/). Reproducible analysis is still not widely practised across government. -Many analysts use proprietary (paid-for) analytical tools like SAS or SPSS in combination with programs like Excel, Word or Acrobat to create statistical products. +Many analysts use proprietary (paid-for) analytical tools like SAS or SPSS in combination with programs like Excel, Word, or Acrobat to create statistical products. The processes for creating statistics in this way are usually manual or semi-manual. Colleagues then typically repeat parts of the process manually to quality assure the outputs. From fd20a7d7d35b7e22123792fa0a6b4a9501fda6b2 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Wed, 23 Oct 2024 11:43:25 +0100 Subject: [PATCH 04/33] punctuation and grammatical edits --- book/modular_code.md | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/book/modular_code.md b/book/modular_code.md index 4af2afb4..201535b8 100644 --- a/book/modular_code.md +++ b/book/modular_code.md @@ -9,9 +9,9 @@ These have been tailored to a more analytical workflow. To get the most benefit from this section, you should have an understanding of core programming concepts such as: * storing information in variables -* using control flow, such as if-statements and for-loops -* writing code as functions or classes -* using functions or classes in your code +* using control flow, such as if-statements and for-loops. +* writing code as functions or classes. +* using functions or classes in your code. You can find links to relevant training in the [](learning.md) section of the book. ``` @@ -19,18 +19,18 @@ You can find links to relevant training in the [](learning.md) section of the bo ## Motivation -Code that is repetitive, disorganised or overly complex can be difficult to understand, even for experienced programmers. +Code that is repetitive, disorganised, or overly complex can be difficult to understand, even for experienced programmers. This makes assuring, testing or changing the code more burdersome. -It also makes it harder to spot and fix mistakes. +You may also find it harder to spot and fix mistakes. Code that isn't modular can cause a range of issues: -- walls of repetitive code that is hard to absorb -- long, complex scripts that are hard to follow -- over-complicated code where a simpler solution could be used -- a code base that makes it difficult to find what you're looking for +- walls of repetitive code that is hard to absorb. +- long, complex scripts that are hard to follow. +- over-complicated code where a simpler solution could be used. 
+- a code base that makes it difficult to find what you're looking for. -This chapter highlights ways to write modular code that is easier to read, review and maintain. +This chapter highlights ways to write modular code that is easier to read, review, and maintain. These practices will also help you implement the other good coding practices you will come across in this book, such as version control, review, testing and documentation. Because of this, modular code is fundamental to making analysis more reproducible, auditable and assured. @@ -39,13 +39,13 @@ Because of this, modular code is fundamental to making analysis more reproducibl ## Modular code Breaking your code down into smaller, more manageable chunks is a sensible way to improve readability. -Regardless of the language, there are often techniques to containerise your code into self-contained parts such as modules, classes or functions. +Regardless of the language, there are often techniques to containerise your code into self-contained parts such as modules, classes, or functions. (functions)= ### Write re-usable code as functions -In the early stages of analysis we often copy and paste code to 'make it work'. As this work matures, it is worth taking repetitive code and turning it into functions. +In the early stages of analysis, we often copy and paste code to 'make it work'. As this work matures, it is worth taking repetitive code and turning it into functions. Functions allow us to make a piece of logic reusable in a consistent and readable way, and also makes it easier for us to [test our logic](testing_code.md). When starting to write functions, you should consider what is the right level of complexity for a single function. @@ -272,7 +272,7 @@ When multiple classes have a similar application programming interface (API, i.e A good real-world example of this can be seen in the `scikit-learn` package, where the different linear model types are represented by different classes. Each linear model class supports a common set of methods, e.g. `fit()` and `predict()`. As such, any model can then be used in a pipeline and swapped out with minimal effort. -Therefore, when thinking about how to break you code up into classes consider the use of standardised methods across similar objects to make them interchangeable. +Therefore, when thinking about how to break you code up into classes, consider the use of standardised methods across similar objects to make them interchangeable. (class-responsibilities)= @@ -545,10 +545,10 @@ That said, great strengths of notebooks include their flexibility in displaying and their ability to present final research code alongside a narrative. Therefore the top 2 reasons to use notebooks in the project lifecycle is to: -- explore and 'play' with the data while developing your methods. -- turn notebooks into HTML reports to present results to end users. +- Explore and 'play' with the data while developing your methods. +- Turn notebooks into HTML reports to present results to end users. -In short, notebooks are not suitable for modularising analysis pipelines, however, they are a great way to do research analytics and to present results. +In short, notebooks are not suitable for modularising analysis pipelines. However, they are a great way to do research analytics and to present results. Therefore, as the exploratory part of your analysis draws to a close, or there is a need to produce similar analysis more regularly, it is wise to refactor notebooks. 
Reusable functions and classes can be moved to modules and the main analysis pipeline might instead be reproducibly run from a script. Here are a few suggestions to consider when refactoring code from notebooks: From a7730df7112786d01c97c5c77d562b0fb5ecc8d0 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Wed, 23 Oct 2024 14:19:34 +0100 Subject: [PATCH 05/33] punctuation consistency in Readable Code --- book/readable_code.md | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/book/readable_code.md b/book/readable_code.md index 8f71a416..9cbd14c6 100644 --- a/book/readable_code.md +++ b/book/readable_code.md @@ -9,8 +9,8 @@ These have been tailored to a more analytical workflow. To get the most benefit from this section, you should have an understanding modular code, which was covered in the [previous chapter](modular_code.md). You should also have an understand of core programming concepts such as: -* storing information in variables -* using control flow, such as if-statements and for-loops +* storing information in variables. +* using control flow, such as if-statements and for-loops. You can find links to relevant training in the [](learning.md) section of the book. ``` @@ -24,9 +24,9 @@ Code is read more often than it is written. -- Guido van Rossum (creator of Python) ``` -When writing code, we should expect that at some point someone else will need to understand, use and adapt it. +When writing code, we should expect that at some point someone else will need to understand, use, and adapt it. This might be yourself in six months time. -As such, it is important to empathise with these potential users and write code that is tidy, understandable and does not add unnecessary complexity. +As such, it is important to empathise with these potential users and write code that is tidy, understandable, and does not add unnecessary complexity. Doing this will make for a 'self-documenting' codebase that does not need as much additional documentation. This chapter highlights good coding practices that will improve the readability and maintainability of your code. @@ -62,8 +62,8 @@ This includes variables, functions, classes and any other objects that can be as Someone reading your code will benefit greatly if you use names that are: -- informative and not misleading -- concise but not cryptic +- informative and not misleading. +- concise but not cryptic. (naming-variables)= @@ -307,7 +307,7 @@ if (is_clean(data)) { Class names are usually started with a capital letter, and in `CamelCase`, as this differentiates them from `variableNames` and `variable_names`. Class names follow the same advice as for [](naming-functions) - namely, is it obvious from the class name what it does? -If its too complex to name concisely, it is an indication of too many [responsibilities](class-responsibilities) +If it is too complex to name concisely, it is an indication of too many [responsibilities](class-responsibilities) and you should refactor your code into more, smaller classes. Method names in a class closely follow the requirements for [](naming-functions), as methods are just functions that are tied to a class. @@ -325,20 +325,20 @@ As discussed in the [](modular_code.md) chapter, writing custom classes is more Programming languages can differ in a myriad of ways. One way R and Python differ, for example, is their use of indentation. -Indentation is part of the well defined syntax of Python while it is not for R. 
+Indentation is part of the well defined syntax of Python but is not for R. This does not mean that you shouldn't use indentation in R to make your code more readable. -If in doubt it is often wise to consult how to use formatting to write more readable code by finding the style guidelines for your language. +If in doubt, you should consider consulting how to use formatting to write more readable code by finding the style guidelines for your language. Generally, code style guides provide a standard or convention for formatting and laying out your code. The purpose of these style guides is to increase consistency across the programming community for a given language. They might include how to appropriately: -- comment or document your code -- name your functions, variables or classes -- separate elements of your code with whitespace -- use indentation to make sure your code is readable -- provide other useful guidance regarding formatting +- comment or document your code. +- name your functions, variables or classes. +- separate elements of your code with whitespace. +- use indentation to make sure your code is readable. +- provide other useful guidance regarding formatting. The existence of such style guides does not necessarily mean that each individual or team will apply these conventions to the letter. Organisations and developer teams often have needs that might not be addressed in a general style guidance document. @@ -800,7 +800,7 @@ which should follow the single responsibility concepts outlined earlier. For example, within the section of your code concerned with modelling data, you might have a set of functions to download data from an external data store. These functions should only be responsible for receiving the required data safely and providing it to the `Model` object. -If you had the need to download data from different sources online (i.e. Database, CSV or other), you might create several download functions. +If you had the need to download data from different sources online (i.e., Database, CSV or other), you might create several download functions. To pick the right function for each model you might create a ['LoaderFactory']() who's only responsibility is to provide the `Model` with the right loading function for the right data source. From 27f0d8e1fd81e4580a9d61b58e011610716a3865 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Wed, 23 Oct 2024 14:57:36 +0100 Subject: [PATCH 06/33] consistency edits for Structure chapter --- book/project_structure.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/project_structure.md b/book/project_structure.md index 57fb39db..da39da7b 100644 --- a/book/project_structure.md +++ b/book/project_structure.md @@ -138,7 +138,7 @@ There must be an immutable store for raw data in your project structure. ### Check that outputs are disposable You should be able to dispose of your outputs, deleting them, without worrying. -If you are worried about deleting your outputs (i.e. results) then it is unlikely you have confidence in being able to reproduce your results. +If you are worried about deleting your outputs (i.e., results) then it is unlikely you have confidence in being able to reproduce your results. It is good practice to delete and regenerate your outputs frequently when developing analysis. 
From 276f42dde27f8199e84c1b5a0162e0a6e98ca5ba Mon Sep 17 00:00:00 2001
From: Beth Jones
Date: Thu, 24 Oct 2024 09:32:53 +0100
Subject: [PATCH 07/33] minor edits on code documentation chapter

---
 book/code_documentation.md | 22 ++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/book/code_documentation.md b/book/code_documentation.md
index 7d5f022e..122b366c 100644
--- a/book/code_documentation.md
+++ b/book/code_documentation.md
@@ -279,9 +279,9 @@ You should use it to write function documentation as part of R packages.
If you are documenting functions that are not part of a package, you can use regular code comments.
However, documenting functions using roxygen syntax can be helpful if you are planning on packaging the code in future and gives a clear structure to your documentation.

-You might find that writing function, class or package descriptions prior to writing their code helps you to focus on the task at hand.
+You might find that writing function, class, or package descriptions prior to writing their code helps you to focus on the task at hand.
The documentation should be a specification of what the code is expected to do.
-As documentation tends to be user-focussed, this approach helps you to keep the user's needs in mind when developing code and
+As documentation tends to be user-focused, this approach helps you to keep the user's needs in mind when developing code and
provides a quick reference when more information on its capabilities are required.

Lastly, perhaps one of the key things to remember when writing docstrings is to **keep them up to date**.

@@ -323,11 +323,11 @@ In those cases, the structure is a lot looser and will depend on what the script
The docstrings should be brief and avoid repeating details found in function documentation or other code comments later in the script.
You may want to include:

-- a title
-- a brief description
-- any important usage notes not covered elsewhere
-- any copyright information if the script reproduces open source code from elsewhere
-- academic citations, if applicable
+- A title.
+- A brief description.
+- Any important usage notes not covered elsewhere.
+- Any copyright information if the script reproduces open source code from elsewhere.
+- Academic citations, if applicable.

````{tabs}

```{code-tab} py
"""Duck census main analysis

Produces the duck census bulletin outputs for the annual publication.

-Check the configuration file before running and run from the command line. Detailed setup and desk instructions can be found in README.md.
+Check the configuration file before running and run from the command line.
+Detailed setup and desk instructions can be found in README.md.
"""
```

```{code-tab} r R
# Duck census main analysis
#
# Produces the duck census bulletin outputs for the annual publication.
#
-# Check the configuration file before running and run from the command line. Detailed setup and desk instructions can be found in README.md.
+# Check the configuration file before running and run from the command line.
+# Detailed setup and desk instructions can be found in README.md.
```

````

@@ -432,7 +434,7 @@ provide a good demonstration of how you would apply it in practice.
Once built, the HTML files containing your documentation can be opened in any browser.
Usually this means looking for an `index.html` file in the output directory and opening it with your browser.
-This is sufficient for local usage, however, in order to improve the end-user experience and remove the need to browse the files looking for `index.html`, +This is sufficient for local usage. However, in order to improve the end-user experience and remove the need to browse the files looking for `index.html`, it is wise to host this documentation somewhere where it will be publicly available. Your version control platform might support hosting web pages already. From 680e77f4360a89a320fc04209b0e61b563901471 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 24 Oct 2024 10:34:41 +0100 Subject: [PATCH 08/33] punctuation edits as far as Vignettes in project_docs --- book/project_documentation.md | 36 +++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/book/project_documentation.md b/book/project_documentation.md index 30fb94f4..0a721ecf 100644 --- a/book/project_documentation.md +++ b/book/project_documentation.md @@ -13,19 +13,19 @@ This file can be any text type, including `.txt`, `.md`, and `.rst`, and can be We suggest the following for a good README: -- Short statement of intent -- Longer description describing the problem that your project solves and how it solves it -- Basic installation instructions or link to installation guide -- Example usage -- Screenshot if your project has a graphical user interface -- Links to related projects +- Short statement of intent. +- Longer description describing the problem that your project solves and how it solves it. +- Basic installation instructions or link to installation guide. +- Example usage. +- Screenshot if your project has a graphical user interface. +- Links to related projects. ## Contributing guidance When collaborating, it is also useful to outline the standards used within your project. This might include particular packages that should used for certain tasks and guidance on the [code style](code-style) used in the project. -If you plan to have contributors from outside your organisation it is useful to include a code of conduct too. +If you plan to have contributors from outside your organisation, it is useful to include a code of conduct too. Please [see GitHub](https://docs.github.com/en/github/building-a-strong-community/adding-a-code-of-conduct-to-your-project) for advice on creating a code of conduct. For an example, see the CONTRIBUTING file from our [gptables package](https://github.com/best-practice-and-impact/gptables/blob/master/CONTRIBUTING.md): @@ -70,7 +70,7 @@ format for documenting features using docstrings. non-minor changes. 2. The review process follows a similar process to ROpenSci. 3. Reviewers will be requested from associated communities. -4. Only once reviewers are satisfied, will the `dev` branch be released. +4. The `dev` branch will only be released once reviewers are satisfied. ``` ```` @@ -124,17 +124,17 @@ which is formatted into HTML when viewed on our repository. ## User desk instructions -If your project is very user focussed for one particular task, +If your project is very user focused for one particular task, for example developing a statistic production pipeline for other analysts to execute, it is very important that the code users understand how to appropriately run your code. 
These instructions should include: -- How to set up an environment to run your code (including how to install dependencies) -- How to run your code -- What outputs (if any) your code or system produces and how these should be interpreted -- What quality assurance has been carried out and what further quality assurance of outputs is required -- How to maintain your project (including how to update data sources) +- How to set up an environment to run your code (including how to install dependencies). +- How to run your code. +- What outputs (if any) your code or system produces and how these should be interpreted. +- What quality assurance has been carried out and what further quality assurance of outputs is required. +- How to maintain your project (including how to update data sources). ## Dependencies @@ -142,7 +142,7 @@ These instructions should include: The environment that your code runs in includes the machine, the operating system (Windows, Mac, Linux...), the programming language, and any external packages. It is important to record this information to ensure reproducibility. -The simplest way to document which packages your code is dependent on, is to record them in a text file. +The simplest way to document which packages your code is dependent on is to record them in a text file. This is typically called `requirements.txt`. Python packages record their dependencies within their `setup.py` file, via `setup(install_requires=...)`. @@ -182,7 +182,7 @@ A BibTeX entry for LaTeX users is ``` This might include multiple citations, if your project includes multiple datasets, pieces of code or outputs with their own -[DOI's](https://en.wikipedia.org/wiki/Digital_object_identifier). +[DOIs](https://en.wikipedia.org/wiki/Digital_object_identifier). See this [GitHub guide for more information on making your public code citable](https://guides.github.com/activities/citable-code/). @@ -202,9 +202,9 @@ This can help users to understand how different code elements interact, and how Another good example is this vignette describing [how to design vignettes](http://r-pkgs.had.co.nz/vignettes.html) in Rmarkdown. You can produce this type of documentation in any format, though Rmarkdown is particularly effectively at combining sections of code, -code outputs and descriptive text. +code outputs, and descriptive text. -You might also consider providing examples in an interactive notebook, that users can run for themselves. +You might also consider providing examples in an interactive notebook that users can run for themselves. ## Versioning From 8a56c84850c8b35bd5e8fd5bd8bc46721ed9d5a5 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 24 Oct 2024 12:02:37 +0100 Subject: [PATCH 09/33] edits for continuity in punctuation --- book/version_control.md | 58 +++++++++++++++++++---------------------- 1 file changed, 27 insertions(+), 31 deletions(-) diff --git a/book/version_control.md b/book/version_control.md index ab9ee09a..228289eb 100644 --- a/book/version_control.md +++ b/book/version_control.md @@ -53,11 +53,11 @@ Ideally, you should include all code and documentation that is required to under This may include: -* Project documentation including readme and contribution guidance -* Packaged functions and code documentation -* Unit tests -* Dummy data and example configuration files -* User documentation such as desk notes and installation guidance +* Project documentation including readme and contribution guidance. +* Packaged functions and code documentation. +* Unit tests. 
+* Dummy data and example configuration files. +* User documentation such as desk notes and installation guidance. ## Exclude sensitive information from version control @@ -67,18 +67,18 @@ In a public repository, you may need to omit confidential or sensitive aspects o You should **never** include the following in your code repository: -* passwords, credentials or keys -* real configuration files -* code that contains sensitive information - * for example, code that describes a method for fraud detection - * or code that contains references to personally identifiable data - * or code that might compromise security protocols -* data, except for small example datasets +* Passwords, credentials or keys. +* Real configuration files. +* Code that contains sensitive information. + * for example, code that describes a method for fraud detection. + * or code that contains references to personally identifiable data. + * or code that might compromise security protocols. +* Data, except for small example datasets. ``` See [](excluding-from-git) for details on how to mitigate the risk of including sensitive information in a Git repository. -It is again worth stressing the importance of not committing sensitive, unpublished or disclosive data to your Git history. +It is again worth stressing the importance of not committing sensitive, unpublished, or disclosive data to your Git history. If you would like to include an example for end-users, a minimal dummy dataset can be committed to the repository. When creating realistic dummy data, care should be taken not to disclose sensitive features of the true data such as distributions and trends. Dummy data should always be carefully peer reviewed before being added to a repository. @@ -100,7 +100,7 @@ If you are not familiar with using command line tools, or Git specifically, you Commits are collections of changes to one or more files in a repository. Every commit is attributed to the author of the changes, providing an audit trail. -Each commit has a unique hash - or identifier - associated with it, which has a long (e.g. `121b5b4f18231e4ee32c9c61f6754429f9572743`) and short version (e.g. `121b5b4`). +Each commit has a unique hash - or identifier - associated with it, which has a long (e.g., `121b5b4f18231e4ee32c9c61f6754429f9572743`) and short version (e.g., `121b5b4`). These hashes allow us to refer to specific changes, but each commit also has an associated message that is used to describe the changes. ```{note} @@ -270,9 +270,9 @@ however, analysts should opt to use the most simple and beneficial approach to b ```{note} Although we have used very simple branch names in the examples above, it's important that you use informative names for your branches in practice. -If using an [issue tracker](issues) (e.g. GitHub Issues or Jira), it can be useful to include the issue number in branch names (e.g. `#155-fix-index-aggregation`). +If using an [issue tracker](issues) (e.g. GitHub Issues or Jira), it can be useful to include the issue number in branch names (e.g., `#155-fix-index-aggregation`). This makes it easy to trace the branch back to the associated issue or task. -Otherwise, aim to use meaningful names that describe the feature or bug that the changes will be focussed on. +Otherwise, aim to use meaningful names that describe the feature or bug that the changes will be ed on. 
``` @@ -288,11 +288,7 @@ If using the command line, you can also run `git status` to output the names of When merge conflicts happen, git will mark the clashes using this syntax: ```none -<<<<<<< HEAD -Changes on the current branch -======= Changes on the branch being merged ->>>>>>> new ======= ``` @@ -304,7 +300,7 @@ To resolve the merge conflict, you will need to make the necessary changes and d Once you have resolved all conflicting text manually (there may be more than one), then you can add and commit the changes to resolve the merge conflicts. Avoid merge conflicts whenever possible. -Do this by not editing the same files across different branches. +Do this by avoiding editing the same files across different branches. If this is difficult to do, it may be that your scripts are too monolithic and should be modularised or split into multiple scripts. @@ -320,7 +316,7 @@ Therefore, storing large files in Git typically slows down your development work [Git Large Files Storage (LFS)](https://git-lfs.github.com/) is a Git extension that allows you to version large files, but without storing the files in your repository history. Large files in your repository's history are instead replaced with a small text-based pointer. -This pointer references versions of the actual files, which are stored in a separate part of your remote repository (e.g. GitHub or GitLab). +This pointer references versions of the actual files, which are stored in a separate part of your remote repository (e.g., GitHub or GitLab). When you `pull` a repository including large files, only the current version of the file is retrieved from the remote server, rather than its whole history. This reduces the size of your local repository and the time taken to `push` and `pull` changes. [Git-LFS integrates well with a normal Git workflow](https://www.youtube.com/watch?v=uLR1RNqJ1Mw) and can be used for specific files, @@ -332,7 +328,7 @@ Despite this support for large files, we recommend that remote Git repositories Versioning of your data could instead be handled independently to your code; the version of your code should not be influenced directly by changes in the data and vice versa. This separation can be achieved using a tool like [DVC](https://dvc.org/), which allows you to specify where data versions are store (locally or on the cloud). -Alternative, third party storage (e.g. cloud-based 'buckets' or databases) can provide easy data storage with varying levels of version control capability. +Alternative, third party storage (e.g., cloud-based 'buckets' or databases) can provide easy data storage with varying levels of version control capability. ### Tag new releases @@ -464,7 +460,7 @@ It is [not currently possible to prevent the notebooks from retaining cell outpu The best way to handle this situation is to clear the outputs from your notebooks before committing them to Git repositories. This can be done from the notebook itself, by going to the menu `Cell > All > Output > Clear` and then saving your notebook. -Alternatively, this can be done from the command line, by running this command with your notebook file path: +Alternatively, this can be done from the command line by running this command with your notebook file path: ```none jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace @@ -582,12 +578,12 @@ the issues system works very similarly to other tools like Trello and Jira. 
The basic elements of an issue are the: -* Title and description, provided by the person that submitted the issue -* Labels that categorise the issue (e.g. enhancement or bug) -* Comments section where the issue can be discussed -* Assigned developers that are working on resolving the issue +* Title and description, provided by the person that submitted the issue. +* Labels that categorise the issue (e.g., enhancement or bug). +* Comments section where the issue can be discussed. +* Assigned developers that are working on resolving the issue. -Within an issue's description and comments, you can reference other issues both within (e.g. `#12`) and between repos, +Within an issue's description and comments, you can reference other issues both within (e.g., `#12`) and between repos, and tag individuals to notify them of your comments (`@whatstheirface`). Similarly, issues can be linked to specific [changes that will be merged to resolve or help to solve the issue](pull-requests). This makes them useful for discussing bugs and new features within a team. @@ -616,7 +612,7 @@ The development branch here may be within the same repo, a [Fork](forking) of th The initial description of the PR should include the high level changes that have been made and might point to any relevant issues that it resolves. Much like issues, PRs can be linked to other issues and PRs, providing a coherent narrative of development work. -[Keywords can be used when linking an issue (e.g. 'fixes #42')](https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword) +[Keywords can be used when linking an issue (e.g., 'fixes #42')](https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword) to trigger the issue to close the PR is merged. Contributors can also be assigned or tagged in discussion, which can be useful for requesting help or review of a group of changes. @@ -639,7 +635,7 @@ alt: The GitHub Pull Request web interface. Changes from [an example Pull Request on the `fingertipsR` package](https://github.com/ropensci/fingertipsR/pull/91/files). ``` -In the "Files changed" section of a PR (shown above) altered sections of files are shown before and after the changes were made, on the left and right respectively. +In the "Files changed" section of a PR (shown above), altered sections of files are shown before and after the changes were made, on the left and right respectively. Where changes have deleted lines of code, these lines are highlighted in red on the left panel. And changes that add lines of code to the file are shown on the right. This highlighted summary of changes provides a useful interface for [peer review](peer_review.md). From ebb8bf466c9666045665510b02c2f482a653341c Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 24 Oct 2024 13:54:24 +0100 Subject: [PATCH 10/33] minor edits for punctuation and style consistency --- book/configuration.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/book/configuration.md b/book/configuration.md index 96aa0f7d..6665f8ca 100644 --- a/book/configuration.md +++ b/book/configuration.md @@ -10,7 +10,7 @@ This section describes how we can define analysis configuration that is easy to ## Basic configuration Configuration for your analysis code should include high level parameters (settings) that can be used to easily adjust how your analysis runs. 
-This might include paths to input and output files, database connection settings and model parameters that are likely to be adjusted between runs. +This might include paths to input and output files, database connection settings, and model parameters that are likely to be adjusted between runs. In early development of our analysis, lets imagine that we have a script that looks something like this: @@ -66,7 +66,7 @@ As such, other analysts would need to read through the script and replace these As we'll demonstrate below, collecting flexible parts of our code together makes it easier for others to update them. When splitting our data and using our model to make predictions, we've provided some parameters to the functions that we have used to perform these tasks. -Eventually, we might reuse some of these parameters elsewhere in our script (e.g. the random seed) +Eventually, we might reuse some of these parameters elsewhere in our script (e.g., the random seed) and we are likely to adjust these parameters between runs of our analysis. To make it easier to adjust these consistently throughout our script, we should store them in variables. We should also store these variables with any other parameters and options, so that it's easy to identify where they should be adjusted. From b61174bcee35f75c1bfcddbd34737fe4494fec48 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 24 Oct 2024 14:25:58 +0100 Subject: [PATCH 11/33] consistency in formatting for data chapter --- book/data.md | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/book/data.md b/book/data.md index a6e6c73c..5d1c57be 100644 --- a/book/data.md +++ b/book/data.md @@ -27,7 +27,7 @@ It is assumed that most data are now stored digitally. Digital data risk becoming inaccessible as technology develops and commonly used software changes. Long term data storage should use open or standard file formats. There are [recommended formats](https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats.aspx) for storing different data types, -though we suggest avoiding formats that depend on proprietary software like SPSS, STATA and SAS. +though we suggest avoiding formats that depend on proprietary software like SPSS, STATA, and SAS. Short term storage, for use in analysis, might use any format that is suitable for the analysis task. However, most analysis tools should support reading data directly from safe long term storage, including databases. @@ -35,7 +35,7 @@ However, most analysis tools should support reading data directly from safe long ### Spreadsheets -Spreadsheets (e.g. Microsoft Excel formats and open equivalents) are a very general data analysis tool. +Spreadsheets (e.g., Microsoft Excel formats and open equivalents) are a very general data analysis tool. The cost of their easy to use interface and flexibility is increased difficulty of quality assurance. ```{figure} https://imgs.xkcd.com/comics/norm_normal_file_format.png @@ -76,7 +76,7 @@ Popular open source DBMS include: * Redis The most common form of database is a relational database. -Data in the tables of a relational database are linked by common keys (e.g. unique identifiers). +Data in the tables of a relational database are linked by common keys (e.g., unique identifiers). This allows you to store data with minimal duplication within a table, but quickly collect related data when required. Relational DBMS are called RDBMS. @@ -140,15 +140,15 @@ A data dictionary describes the contents and format of a dataset. 
For variables in tabular datasets, you might document: -* a short description of what each variable represents -* the frame of reference of the data -* variable labels, if categorical -* valid values or ranges, if numerical -* representation of missing data -* reference to the question, if survey data -* reference to any related variables in the dataset -* if derived, detail how variables were obtained or calculated -* any rules for use or processing of the data, set by the data owner +* A short description of what each variable represents. +* The frame of reference of the data. +* Variable labels, if categorical. +* Valid values or ranges, if numerical. +* Representation of missing data. +* Reference to the question, if survey data. +* Reference to any related variables in the dataset. +* If derived, detail how variables were obtained or calculated. +* Any rules for use or processing of the data, set by the data owner. See this detailed example - the [National Workforce Data Set](https://www.datadictionary.nhs.uk/data_sets/administrative_data_sets/national_workforce_data_set.html#dataset_national_workforce_data_set), @@ -168,11 +168,11 @@ This form of documentation may not contain detailed information on how to use ea but an IAR does increase visibility of data flows. An IAR may include: -* the owner of each dataset -* a high level description of the dataset -* the reason that your organisation holds the dataset -* how the information is stored and secured -* the risk of information being lost or compromised +* The owner of each dataset. +* A high level description of the dataset. +* The reason that your organisation holds the dataset. +* How the information is stored and secured. +* The risk of information being lost or compromised. GOV.UK provides [IAR templates](https://www.gov.uk/government/publications/information-asset-register) that your department might use to structure their IAR. From 8184c8e02f697cf32441a2870b665453d4b93152 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 24 Oct 2024 15:19:53 +0100 Subject: [PATCH 12/33] removed duplicated paragrapjs --- book/peer_review.md | 20 +++----------------- 1 file changed, 3 insertions(+), 17 deletions(-) diff --git a/book/peer_review.md b/book/peer_review.md index ded78c7e..3b61c7d5 100644 --- a/book/peer_review.md +++ b/book/peer_review.md @@ -234,10 +234,10 @@ For example, focussing on documentation in one session and functionality in the The thought of someone else reviewing your code in this way encourages good practices from the outset: -* Clear code and documentation - so that others with no experience can use and test your code -* Usable dependency management - so that others can run your code in their own environment +* Clear code and documentation - so that others with no experience can use and test your code. +* Usable dependency management - so that others can run your code in their own environment. -Separate review is aided by features of most version control platforms. See [](version_control.md) for more information. +Most version control platforms have features that can aid separate review. See [](version_control.md) for more information. #### Case study - rOpenSci review @@ -247,20 +247,6 @@ a community led initiative that curates open source, statistical R packages. rOpenSci apply a rigorous peer review process to assure the quality of packages before including them in their collection. This peer review process is entirely remote and is performed in the open, via GitHub pull requests. 
-In this example, from colleagues at Public Health England, -[the `fingertipsR` package is reviewed](https://github.com/ropensci/software-review/issues/168). -The initial comment describes the package that is being submitted and includes a check against a list of minimum requirements. -The [`goodpractice` R package](http://mangothecat.github.io/goodpractice/) is used to check that good R packaging practices have been followed. -[Continuous integration](continuous-integration) is commonly used to carry out automated checks on code repositories. -The reports from these checks can save reviewers time, by providing indicators of things like code complexity and test coverage. - - -#### Case studies - -Here we discuss an example from [rOpenSci](https://ropensci.org/); a community led initiative that curates open source, statistical R packages. -rOpenSci apply a rigorous peer review process to assure the quality of packages before including them in their collection. -This peer review process is entirely remote and is performed in the open, via GitHub pull requests. - In this example, from colleagues at Public Health England, [the `fingertipsR` package is reviewed](https://github.com/ropensci/software-review/issues/168). The initial comment describes the package that is being submitted and includes a check against a list of minimum requirements. The [`goodpractice` R package](http://mangothecat.github.io/goodpractice/) is used to check that good R packaging practices have been followed. From 3f3919d5978cf0476e5432ea938f53bc987227c8 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 24 Oct 2024 15:49:17 +0100 Subject: [PATCH 13/33] consistency in style and punctuation for CI chapter --- book/continuous_integration.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/book/continuous_integration.md b/book/continuous_integration.md index bd1f238b..609003d1 100644 --- a/book/continuous_integration.md +++ b/book/continuous_integration.md @@ -40,7 +40,7 @@ these can be achieved in a number of ways such as use of Git hooks and workflows [Git hooks](https://git-scm.com/docs/githooks) are scripts that can be set to run locally at specific points in your Git workflow, such as pre-commit, pre-push, etc. -They can be used to automate code quality assurance tasks, e.g. run tests, ensure style guides are followed, or enforce commit standards. +They can be used to automate code quality assurance tasks, e.g., run tests, ensure style guides are followed, or enforce commit standards. For example, we might set up a `pre-commit` or `pre-push` hook that runs our tests before we make each commit or push to the remote repository. This might stop our commit/push if the tests fail, so that we don't push breaking changes to our remote repository. @@ -172,7 +172,7 @@ This workflow will report whether our test code ran successfully for each of the It is important to maintain the documentation relating to your project to ensure contributors and users can understand, maintain and use your product correctly. One basic way of doing this is maintaining markdown files within a GitHub repository. -However, there exist multiple tools that can transform these markdown files into HTML content. +However, multiple tools exist that can transform these markdown files into HTML content. A popular tool for building and deploying HTML documentation is [Sphinx](https://www.sphinx-doc.org/en/master/). 
Here are two examples of repositories that use sphinx to build its documentation: @@ -193,11 +193,11 @@ You can see a detailed example of CI in practice in the `jupyter-book` project. A recent version of the [`jupyter-book` CI workflow](https://github.com/executablebooks/jupyter-book/blob/6fb0cbe4abb5bc29e9081afbe24f71d864b40475/.github/workflows/tests.yml) includes: -* Checking code against style guidelines, using [pre-commit](https://pre-commit.com/) -* Running code tests over - * a range of Python versions - * multiple versions of specific dependencies (`sphinx` here) - * multiple operating systems -* Reporting test coverage -* Checking that documentation builds successfully -* Deploying a new version of the `jupyter-book` package to [the python package index (PyPI)](https://pypi.org/) +* Checking code against style guidelines, using [pre-commit](https://pre-commit.com/). +* Running code tests over: + * a range of Python versions. + * multiple versions of specific dependencies (`sphinx` here). + * multiple operating systems. +* Reporting test coverage. +* Checking that documentation builds successfully. +* Deploying a new version of the `jupyter-book` package to [the python package index (PyPI)](https://pypi.org/). From 62994593f52a7d8fcf423011971109e5a470e52e Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Tue, 29 Oct 2024 11:03:07 +0000 Subject: [PATCH 14/33] reduced passive sentences in introduction --- book/intro.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/book/intro.md b/book/intro.md index d9f3dbab..0d696bbd 100644 --- a/book/intro.md +++ b/book/intro.md @@ -10,7 +10,7 @@ We are extremely grateful for any feedback that you are able to provide on exist ## How to get the most out of the book This guidance describes software engineering good practices that are tailored to those working with data using code. -It is designed for those who would like to quality assure their code and increase the reproducibility of their analyses. +It is designed to support you to quality assure your code and increase the reproducibility of your analyses. Software that apply these practices are referred to as reproducible analytical pipelines (RAP). This guidance is relevant if you are: @@ -31,7 +31,7 @@ you should strive to apply the most appropriate quality assurance practices give The principles in this book are language agnostic. The book does not aim to form a comprehensive learning resource and you may often need to study further resources to implement these practices. -That said, examples and useful references are provided for **Python** and **R**, as open source languages that are commonly applied across government. +That said, we have provided examples and useful references for **Python** and **R**, as these open source languages that are commonly applied across government. ## About us From 3fdd5029d221ecda9a6a1a4cd6facc6e79ddba2e Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Tue, 29 Oct 2024 11:07:58 +0000 Subject: [PATCH 15/33] removed several passive sentences in glossary --- book/glossary.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/book/glossary.md b/book/glossary.md index b95de9a8..2e204ffa 100644 --- a/book/glossary.md +++ b/book/glossary.md @@ -40,7 +40,7 @@ A computer program that runs code in a particular programming language. For example, the program that reads your Python or R analysis code and runs it. A non-interactive interpreter runs code in order, which is important for reproducibility. 
-Interactive interpreters allow you to run individual lines of code, which means that code can be run out of order.
+Interactive interpreters allow you to run individual lines of code, which means that you can run code out of order.
Notebooks use interactive interpreters.
These are not suitable for running analysis pipelines, because they do not ensure that the code is run reproducibly.

@@ -200,13 +200,13 @@ We recommend using Python and R, which support the good practices outlined in th
Proprietary software, including SPSS, SAS, and Stata are closed source.
These are expensive to use and do not support many of the good analysis practices outlined in this guidance.

-Not using open source analysis tools means that our users need to purchase software to reproduce our results.
+Not using open source analysis tools means that your users need to purchase software to reproduce your results.


### Packages

Packages are file structures that contain code and documentation.
-These collections of code are designed to be installed and re-used with ease.
+These collections of code are designed for you to install and re-use them with ease.
Packages in Python and R act as extensions, to allow us to reuse code that has already been written.
"Library" is similarly used to describe a software collection, however, packages are more specifically for distribution of code.

From 4fb416d5e353c9981c09f84875ff6878afd55597 Mon Sep 17 00:00:00 2001
From: Beth Jones
Date: Tue, 29 Oct 2024 11:52:59 +0000
Subject: [PATCH 16/33] removed passive sentences in managers guide

---
 book/managers_guide.md | 74 +++++++++++++++++++++---------------------
 1 file changed, 37 insertions(+), 37 deletions(-)

diff --git a/book/managers_guide.md b/book/managers_guide.md
index 16957fd9..19deac0f 100644
--- a/book/managers_guide.md
+++ b/book/managers_guide.md
@@ -10,12 +10,12 @@ Please get in touch with feedback or case studies to support the guidance
or emailing us at [emailing us](mailto:ASAP@ons.gov.uk).
```

-This section of the guidance is targeted at those who manage data analysis/science/engineering work in government
+This section of the guidance targets those who manage data analysis/science/engineering work in government
or those acting as product owners for analytical products.
It aims to help you support your team to apply the good quality assurance practices described in the wider
[Quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
-Processes that apply these good practices are referred to as a reproducible analytical pipelines (RAP).
+We refer to processes that apply these good practices as reproducible analytical pipelines (RAP).

Before applying this guidance, you should have a basic awareness of the tools and techniques used to do quality analysis as code -
the [introduction to RAP course](https://learninghub.ons.gov.uk/course/view.php?id=1236) outlines these.
@@ -27,11 +27,11 @@ You should use this when designing and managing the development of analysis as c

## Apply quality assurance proportional to risk

-As described by [the Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government),
+As [the Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government) notes,
the quality assurance of our analysis should be proportional to the complexity and risk of the analysis.
When managing analytical work, you should not need an in-depth understanding of the analysis code to trust that it is working correctly. -However, you should be confident that the approach the team has taken is appropriate given the user need, +However, you should be confident that the team has taken an appropriate approach given the user need, and that proportionate quality assurance is being applied to the development and running of the analysis. You should work with your team to decide which quality assurance practices are necessary given each piece of analysis. @@ -45,8 +45,8 @@ While quality assurance must be applied relative to the risk and complexity of t It will take time to learn to apply the necessary good practices, so you should support their gradual development of these skills. [The RAP learning pathway](https://learninghub.ons.gov.uk/mod/page/view.php?id=8699) provides training in good practices. -Then the wider [Quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html) -can be used as a reference to apply these practices to analysis. +You can use the wider [Quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html) +as a reference to apply these practices to analysis. You should identify where each analyst is along the pathway - they should look to develop the next skill in the pathway and apply this, rather than attempting to adopt them all at once. @@ -59,11 +59,11 @@ There are number of case studies that describe how [good quality assurance practices have improved government analysis](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/rap-case-studies/). Not following good practices creates [technical debt](https://en.wikipedia.org/wiki/Technical_debt), which slows down further development of the analysis. -This can be necessary for delivering to short deadlines, but time should be set aside to address this for continued development of the analysis. +This can be necessary for delivering to short deadlines, but you should set aside time to address this for continued development of the analysis. Where quality assurance of the code doesn't meet your target level of assurance, for example where there is limited time or skill, -then it is necessary to supplement this with more in-depth assurance of outputs. +then you should supplement this with more in-depth assurance of outputs. This might include dual running the analysis with an independent system and consistency checks across the output data. The remaining parts of this section provide questions that aim to help you assess the quality assurance practices that your team are applying in their analysis. @@ -78,10 +78,10 @@ These questions aim to help you assess the design decisions at the beginning of Understanding user needs ensures that the analysis is valuable. -* There should be a plan to consult users at the beginning and throughout the development process, to ensure that their needs are being met. +* You should have a plan to consult users at the beginning and throughout the development process, to ensure that you meet their needs. * The methodology and data should be suitable for the question being asked. -* The analysis must be developed by more than one individual, to allow pair programming, peer review, and mentoring. This increases the sustainability of analysis. 
-* The analysis should be carried out using open-source analysis tools, wherever possible. +* More than one person should develop the analysis to allow pair programming, peer review, and mentoring. This increases the sustainability of analysis. +* You should use open-source analysis tools to carry out the analysis, wherever possible. Your team should be able to explain why they have chosen the analysis tools and why they are confident that they are fit for purpose. @@ -89,9 +89,9 @@ Your team should be able to explain why they have chosen the analysis tools and Versioning input data ensures that we can reproduce our analysis. -* Input data should be versioned, so that analysis outputs can be reproduced. -* Data should be stored in an open format (e.g. CSV or ODS), not formats that are restricted to proprietary software like SAS and Stata. -* Large or complex datasets should be stored in a database. +* You should version input data so that you can reproduce analysis outputs. +* You should store data in an open format (e.g. CSV or ODS), not formats that are restricted to proprietary software like SAS and Stata. +* You should store large or complex datasets in a database. * You should monitor the quality of data, following [the government data quality framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework). @@ -99,11 +99,11 @@ Versioning input data ensures that we can reproduce our analysis. [Version control](version_control.md) of changes provides an audit trail. -* The code, documentation, and peer reviews should all be version controlled. Git software is most commonly used for this. -* The code should be developed on an open source code platform, like GitHub. This transparency increases your users trust in the analysis. -* There should be a clear record of every change and who made it. +* You should version control the code, documentation, and peer reviews. Git software is most commonly used for this. +* You should develop the code on an open source code platform, like GitHub. This transparency increases your users trust in the analysis. +* You should have a clear record of every change and who made it. * Each change should be linked to a reason, for example, a new requirement or an issue in the existing code. -* Reviews of changes should be stored with the version of the analysis that was reviewed. +* You should store reviews of changes with the version of the analysis that was reviewed. ## Quality assure throughout development @@ -116,17 +116,17 @@ to assess how the team are progressing towards the target quality assurance prac [Modular code](modular_code.md) makes it easier to understand, update and reuse the code. -* Logic should be written as functions, so that it can be reused consistently and tested. -* Related functions should be grouped together in the same file, so that it is easy to find them. -* Logic with different responsibilities (e.g., reading data versus transforming data) should be clearly separated. -* When code can be reused for other analyses, it should be stored and shared as a package. +* You should write logic as functions, so that it can be reused consistently and tested. +* You should group related functions together in the same file, so that it is easy to find them. +* You should clearly separate logic with different responsibilities (e.g., reading data versus transforming data). +* When code can be reused for other analyses, you should store and share this as a package. 
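As a rough sketch of what the bullets above look like in code, the example below separates reading data from transforming it and keeps the related logic together in one module; the file, function, and column names are hypothetical.

```python
# processing.py -- illustrative module grouping related data-processing functions
import pandas as pd


def read_population_data(path: str) -> pd.DataFrame:
    """Read the raw input data; no transformation happens here."""
    return pd.read_csv(path)


def add_age_band(population: pd.DataFrame) -> pd.DataFrame:
    """Derive an age band column; a pure transformation that is easy to unit test."""
    bands = pd.cut(
        population["age"],
        bins=[0, 15, 64, 120],
        labels=["0-15", "16-64", "65+"],
    )
    return population.assign(age_band=bands)
```

Because the responsibilities are separated, the transformation functions can be tested and reused on their own, and a module like this can later be lifted into a shared package with little rework.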
### How easy is it to adjust the way that the analysis runs? [Configuration files](configuration.md) allow you to change the way the code runs without editing the code. -* Parts of the code that may change should be stored in separate configuration files. +* You should store parts of the code that may change in separate configuration files. * Things that often change in analysis code include input and output file paths, reference dates and model parameters. @@ -134,20 +134,20 @@ to assess how the team are progressing towards the target quality assurance prac [Code documentation](code_documentation.md) is essential for business continuity and sustainability. -* Every function should be documented in the code, so it is clear what the function is supposed to do. +* You should document every function in the code, so it is clear what each function is supposed to do. * Function documentation should include what goes in and what comes out of each function. -* Where code will be run or re-used by others, documentation should include usage examples and test data. +* Where others will run or re-used the code, documentation should include usage examples and test data. ### What are the dependencies? [Project documentation](project_documentation.md) ensures that others can reproduce our analysis. -* User instructions should be provided for running the analysis. -* Software and package versions should be documented with the code. -Typically package versions are recorded using `setup.py` and `requirements.txt` files (Python) or a `DESCRIPTION` file (R). +* You should provide user instructions for running the analysis. +* You should document software and package versions with the code. +Typically, you record package versions using `setup.py` and `requirements.txt` files (Python) or a `DESCRIPTION` file (R). * Code should not be dependent on a specific computer to run. Running it on a colleague's system can help to check this. -* When the same analysis is run multiple times or on different systems, it should give reproducible outcomes. +* When you run the same analysis multiple times or on different systems, it should give reproducible outcomes. * Container systems, like Docker, help to create reproducible environments to run code. @@ -155,8 +155,8 @@ Typically package versions are recorded using `setup.py` and `requirements.txt` Transparency of our analysis increases trust. -* Assumptions and caveats of the analysis should be recorded close to the code. -* These must be communicated to users when releasing results from the analysis. +* You should record assumptions and caveats of the analysis close to the code. +* You must communicate these to users when releasing results from the analysis. ### How has peer review been done? @@ -166,8 +166,8 @@ Transparency of our analysis increases trust. * Technical colleagues should conduct internal peer reviews of each change to the code. This will identify issues early on and makes the review process more manageable than reviewing only the final analysis. * Peer review should follow a standard procedure, so that reviews are consistent. Reviewers should check that each change follows the agreed good practices. -* There should be evidence that peer reviews are acted on. When issues or concerns are raised, they should be addressed before the analysis is used. -* If the product is high risk, external peer review should also be conducted. +* You should evidence that peer reviews are acted on and how. 
When issues or concerns are raised, they should be addressed before the analysis is used.
+* If the product is high risk, you should arrange and conduct external peer review.


### How have you tested the code?

[Testing](testing_code.md) assures us that the code is working correctly.

* Testing means making sure that the code produces the right outputs for realistic example input data.
-* Automated 'unit' testing should be applied to each function, to ensure that code continues to work after future changes to the code.
-* Each function and the end-to-end analysis should be tested using minimal, realistic data.
-* Testing should be more extensive on the most important or complex parts of the code. However, ideally every single function should be tested.
+* You should apply automated 'unit' testing to each function to ensure that code continues to work after future changes.
+* You should test each function and the end-to-end analysis using minimal, realistic data.
+* Your testing should be more extensive on the most important or complex parts of the code. However, ideally you would test every single function.
* Tests should account for realistic cases, which might include missing data, zeroes, infinities, negative numbers, wrong data types, and invalid inputs.
* Reviewers should sign-off that there is enough testing to assure that the code is working as expected.
-* Each time an error is found in the code, a test should be added to assure that the error does not reoccur.
+* Each time you identify an error in the code, you should add a test to assure that the error does not reoccur.

### What happens when the analysis fails to run?

From 7c9a4bb3a2e578453ec8d3146fefab6e0aaff941 Mon Sep 17 00:00:00 2001
From: Beth Jones
Date: Wed, 30 Oct 2024 09:23:39 +0000
Subject: [PATCH 17/33] removed passive sentences in Principles chapter

---
 book/principles.md | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/book/principles.md b/book/principles.md
index b64f8a5f..40743add 100644
--- a/book/principles.md
+++ b/book/principles.md
@@ -24,7 +24,7 @@ Assurance improves the average quality and includes the communication of that qu
```{admonition} Key strategies
:class: admonition-strategies

-Government guidance is available to help you when developing analysis.
+Government guidance is available to help you when you develop analysis.
We recommend:

1. The [Analysis Function reproducible analytical pipelines (RAP) strategy](https://analysisfunction.civilservice.gov.uk/policy-store/reproducible-analytical-pipelines-strategy/),
@@ -50,11 +50,10 @@ If you can't prove that you can run the same analysis, with the same data, and o
The additional assurances of peer review, rigorous testing, and validity are secondary to
being able to reproduce any analysis that you carry out in a proportionate amount of time.

-Reproducible analysis relies on a transparent production process, so that anyone can follow your steps and understand your results.
+A reproducible and transparent production process ensures that anyone can follow your steps and understand your results.
This transparency eases reuse of your methods and results.

Easy reproducibility helps your colleagues test and validate what you have done.
-When reproducibility is guaranteed,
-users and colleagues can focus on verifying that the implementation is correct and that the research is useful for its intended purpose.
+Guaranteeing reproducibility means that users and colleagues can focus on verifying that the implementation is correct and that the research is useful for its intended purpose.
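As a small, hypothetical sketch of what this looks like in practice, a pipeline entry point can pin down the usual sources of run-to-run variation and record them alongside the output; the file paths and the choice of metadata are assumptions for illustration.

```python
# run_analysis.py -- illustrative sketch of a reproducible entry point
import json
import platform
import random

SEED = 42
INPUT_PATH = "data/2024-10-01_population.csv"  # versioned, read-only input (hypothetical)


def main() -> None:
    random.seed(SEED)  # fix any randomness so repeated runs give identical results

    results = {"sampled_value": random.random()}  # stand-in for the real analysis

    # Record enough information for someone else to rerun and verify this output
    metadata = {
        "python_version": platform.python_version(),
        "seed": SEED,
        "input_data": INPUT_PATH,
    }
    with open("results.json", "w") as f:
        json.dump({"results": results, "metadata": metadata}, f, indent=2)


if __name__ == "__main__":
    main()
```

Anyone with the same input file and this script can regenerate the output exactly, and can therefore spend their review on whether the analysis itself is right.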
+Guaranteeing reproducibility means that users and colleagues can focus on verifying that the implementation is correct and that the research is useful for its intended purpose. Reproducibility relies on effective documentation. Good documentation should show how your methodology and your implementation map to each other. @@ -62,7 +61,7 @@ Good documentation should allow users and other researchers to reuse and adapt y Reproducible analysis supports the requirements of the [Code of Practice for Statistics](https://www.statisticsauthority.gov.uk/code-of-practice/) around quality assurance and transparency (auditability). -Wherever possible, we share the code we used to produce our outputs, along with enough data to allow for proper testing. +We share the code we used to produce our outputs wherever possible, along with enough data to allow for proper testing. ## Auditable @@ -80,8 +79,8 @@ They know who is responsible for each part of the analysis, including the assura They know exactly what changes have been made at any point. In a reproducible workflow, you must bring together the code and the data that you used to generate your results. -These are ideally published alongside your reports, with a record of analytical choices made and the responsible owners of those choices. -The transparency that this gives your work helps to increase trustworthiness. +Ideally, you would publish these alongside your reports, with a record of analytical choices made and the responsible owners of those choices. +This increases the trustworthiness of your work. More eyes examining your work can point out challenges or flaws that can help you to improve. You can be fully open about the decisions you made when you generated your outputs, so that other analysts can follow what you did and re-create them. By making your analysis reproducible, you make it easier for others to quality assure, assess and critique. @@ -89,22 +88,22 @@ By making your analysis reproducible, you make it easier for others to quality a ## Assured -Good quality analysis requires good Quality Assurance (QA). +Quality Assurance (QA) is vital for good quality analysis. If decisions are being made based on analysis then this analysis must be held to high standards. This is true for analysis carried out using any medium, including code. However, some of the analysis that we do in government doesn't bear on decisions at that level. We don't want to overburden analysts with QA procedures when they are not required. In government we advocate **proportionality** - the right quality assurance procedures for the right analysis. -Analysis can be proportionately assured through peer review and defensive design. +You can proportionately assure analysis through peer review and defensive design. We suggest following your department's guidance on what proportionate quality assurance looks like. Most departments derive their guidance from the [Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government). Assurance is best demonstrated through peer review. Peer reviewers must be able to understand your analytical choices and be able to reproduce your conclusions. -Dual running should be considered, particularly for high risk analysis. +Consider dual running, particularly for high risk analysis. -Guarantees of quality assurance should be published alongside any report, or be taken into consideration by decision makers. 
+Publish guarantees of quality assurance alongside any report, to be taken into consideration by decision makers. ## Reproducible analytical pipelines From 00fd4a10e44f43d474657c6438a4960a0eae5af8 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Wed, 30 Oct 2024 10:27:30 +0000 Subject: [PATCH 18/33] reduced passive sentences in Modular code chapter --- book/modular_code.md | 28 +++++++++++++--------------- 1 file changed, 13 insertions(+), 15 deletions(-) diff --git a/book/modular_code.md b/book/modular_code.md index 201535b8..3e7f72f7 100644 --- a/book/modular_code.md +++ b/book/modular_code.md @@ -1,7 +1,7 @@ # Modular code The principles outlined in this chapter represent good practices for general programming and software development. -These have been tailored to a more analytical workflow. +We have tailored these to a more analytical workflow. ```{admonition} Pre-requisites :class: admonition-learning @@ -199,7 +199,7 @@ Liskov substitution strengthens this statement to '`BankAccount` is interchangea we can replace any `Account` with `BankAccount` without changing how our code runs. This is because `BankAccount` provides all of the same methods that an `Account` does, no less. -This can be similarly applied to functions. +You can apply this similarly to functions. If you were to increase the domain and range of a function, to account for new cases, then this function should observe the same interface as the previous function. In short: @@ -271,7 +271,7 @@ class SqlHandler(FileHandler): When multiple classes have a similar application programming interface (API, i.e. the methods they supply for users), we can easily switch between them. A good real-world example of this can be seen in the `scikit-learn` package, where the different linear model types are represented by different classes. Each linear model class supports a common set of methods, e.g. `fit()` and `predict()`. -As such, any model can then be used in a pipeline and swapped out with minimal effort. +As such, you can use any model in a pipeline and swap them out with minimal effort. Therefore, when thinking about how to break you code up into classes, consider the use of standardised methods across similar objects to make them interchangeable. @@ -283,7 +283,7 @@ It is easy to start to mapping nouns in system descriptions to classes, and any For example: 'the model loads the data', which implies that `Model` is a class that should have a `load_data` method. This works well for small systems, but as the complexity of your code grows you might find that one of your classes gains the majority of the underlying logic. This often leads to one class with many methods, while other classes just store data with very few methods. -This can be described as 'Data Driven Design'. +We describe this as 'Data Driven Design'. When most of your code resides in a single class, this can indicate that this class is responsible for too much of your code's logic. This class might become overly complex and hence difficult to maintain. @@ -321,12 +321,12 @@ so we've added a method to `publisher` as well as `book`; we're trading maintain To summarise: - Classes hide implementation detail from users, enabling implementation to be changed without affecting users. -- Look to use consistent methods in a group of related classes, so that you can switch between them without affecting the code using it. -Consider Python 'duck typing' or abstract classes and methods. -- Avoid storing all logic in a single class. 
Instead, distribute logic based on responsibilities. +- You should look to use consistent methods in a group of related classes, so that you can switch between them without affecting the code using it. +You should consider Python 'duck typing' or abstract classes and methods. +- You should avoid storing all logic in a single class. Instead, distribute logic based on responsibilities. - Be aware of trading maintainability for complexity - one large class or too many classes can be hard to understand. -- Design Patterns have solutions to many common problems and are a useful toolbox. -- Prefer encapsulation over inheritance, especially with code reuse. +- You will find Design Patterns a useful toolbox as they have solutions to many common problems. +- You should opt for encapsulation over inheritance, especially with code reuse. ### Split complex code into multiple scripts @@ -344,8 +344,6 @@ Manually executing individual lines of code allows for a slew of errors when thi Ultimately for code pipelines you will need to have some way of running your code - the humble script is the primary way of orchestrating your functions and classes in a pipeline fashion. - - ```{note} Using a script does not guarantee that your code will run reproducibly, but it does ensure that code is run in the same fashion across multiple runs. ``` @@ -366,8 +364,8 @@ you might decide that you want these functions to sit outside of your main pipel This is where modules come in, to separate reusable code into logical groups. Consider a project where an analyst has created one large data analysis script. -Upon reflection, they decide to split the logic from their pipeline into groups of functions relating to 'data processing', 'data modelling' and 'result presentation'. -They then create a file to contain each of these groups of functions: `processing.py`, `modelling.py` and `reporting.py`. +Upon reflection, they decide to split the logic from their pipeline into groups of functions relating to 'data processing', 'data modelling', and 'result presentation'. +They then create a file to contain each of these groups of functions: `processing.py`, `modelling.py`, and `reporting.py`. They decide that they want to have a pipeline script called `main`, but they want to keep this script readable and simple. In R, it's best to also use an R project file. Working within a project allows you to use relative file paths and avoid the need to refer to specific script locations. @@ -452,7 +450,7 @@ This requires users to manually alter file paths in the code, which is highly di ``` ``````{admonition} A step further -Another step that can be taken to improve clarity is to further wrap these modules into their own folder like so: +Another step that you can take to improve clarity is to further wrap these modules into their own folder like so: `````{tabs} ````{tab} Python @@ -562,5 +560,5 @@ Here are a few suggestions to consider when refactoring code from notebooks: After this, you might turn existing notebooks into HTML to send them stakeholders, or save them as is so that analytical peers can re-run and review your notebooks. The steps that you've taken to simplify your notebook code will make your code much easier to understand by readers. -Bear in mind that notebook files can still be run out of order by other analysts, and that they should not be used as the main method of actually generating outputs. 
+Bear in mind that other analysts can still run notebook files out of order, so they should not be used as the main method of actually generating outputs. Output generation should instead be trusted to scripts, where human decisions do not alter the order that code is run. From a152fccc89209611cc8066b1dea3871c70cefdf2 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Wed, 30 Oct 2024 12:43:58 +0000 Subject: [PATCH 19/33] reduced passive sentences in Readable Code --- book/readable_code.md | 36 +++++++++++++++++------------------- 1 file changed, 17 insertions(+), 19 deletions(-) diff --git a/book/readable_code.md b/book/readable_code.md index 9cbd14c6..bb3ca3d4 100644 --- a/book/readable_code.md +++ b/book/readable_code.md @@ -50,7 +50,7 @@ That said, each individual programming language has idiomatic ways of writing co Additionally, each language usually has some form of accepted style guide. Make sure to consult the style guides for your language as first point of call. -This is an important point to stress, as these guides will capture the most up to date guidance for your language of choice, +It is important to stress this point, as these guides will capture the most up to date guidance for your language of choice, which may not be available in this document. @@ -105,7 +105,7 @@ readily available to the reader and are consistent throughout your code. Be sure to cite the source of the mathematical formula in these cases. In other cases, using variable names that contain a few (3 or so) informative words can drastically improve the readability of your code. -How these words are separated (be it `CamelCase` or `snake_case`) will depend on your language of choice. +Your language of choice will impact how you separate words (be it `CamelCase` or `snake_case`). ````{tabs} ```{code-tab} py @@ -198,24 +198,22 @@ There is a clear trade-off between the usability and informativeness of variable You'll need to use your best judgement to adapt variable names in order to keep them informative but reasonably concise. ```{note} -In languages like Python, where indentation is part of the syntax to denote code blocks, you will be much more aware of this trade-off. +You will be more aware of this trade-off in languages like Python, where indentation is part of the syntax to denote code blocks. -In practice, the PEP8 style guide for Python recommends line widths of 79 characters -and having overly descriptive names might impact your compliance with a style guide like that. +The PEP8 style guide for Python recommends line widths of 79 characters. Having overly descriptive names might impact your compliance with a style guide. ``` (naming-functions)= #### Name functions after the task that they perform -Naming functions should respect the best practices already covered in the [Naming variables](naming-variables), -however, there are a few other points worth raising that are exclusive to function and method names. +You should respect the best practices already covered in the [Naming variables](naming-variables) when naming functions. +However, there are a few other points worth raising that are exclusive to function and method names. Firstly, your user should be able to infer the purpose or action of a function from its name. -If you can't describe the overall task performed by the function in a few words, -then it may be that your function is overly complex or it requires further detail in its documentation. 
+Your function may be overly complex or require further detail in its documentation if you can't describe the overall task performed by the function in a few words. -Where a function performs a specific task, it can be effective to describe this task in the function name, starting with a verb: +It can be effective to describe the specigic task a function performs in its name, starting with a verb: ````{tabs} ```{code-tab} py @@ -265,7 +263,7 @@ report_data <- generate_report(model_results) ``` ```` -In cases where a function responds with a Boolean (True or False) value, it is often useful to name this function in the form of a question. +Naming a function in the form of a question is useful when in cases where a function responds with a Boolean (True or False) value. ````{tabs} ```{code-tab} py @@ -317,7 +315,7 @@ Compare this against `bp = Reader(book_data)` then `bp.fetch()`, where there is (code-style)= ``` {note} -As discussed in the [](modular_code.md) chapter, writing custom classes is more common in python than in R. As such, the examples above only apply to python. +Writing custom classes is more common in python than in R, as discussed in the [](modular_code.md) chapter. As such, the examples above only apply to python. ``` @@ -379,7 +377,7 @@ Idiomatic stands for 'using, containing, or denoting expressions that are natura In Python, idiomatic approaches to writing code are commonly referred to as 'pythonic'. This might mean simplifying complex and perhaps hard to read patterns into a simpler, but well established alternative. -In Python for example these two pieces of code are equivalent: +In Python, for example, these two pieces of code are equivalent: ````{tabs} ```{code-tab} python @@ -429,12 +427,12 @@ For example, attempting to fit too much logic into a single line of code can mak (automate-style-checks)= #### Automate style checks -It is good practice to follow a style guide from the beginning of a project. -However, it can be tedious to check that code continues to follow a particular style, or to fix code formatting when it doesn't. +Following a style guide from the beginning of a project is good practice. +However, checking that code continues to follow a particular style, or to fix code formatting when it doesn't can be tedious. Hence, automated support can be sought to speed up this work, either by providing suggestions as the code is written or by reformatting your code to comply with some style. -For further information on automating these checks see [](linters-formatters). +See [](linters-formatters) for further information on automating these checks. (software-ideas-for-analysts)= @@ -442,11 +440,11 @@ For further information on automating these checks see [](linters-formatters). It's important to remember that when we write code for analysis, we are developing software. Over many years, software engineering teams have developed good practices for creating robust software. -These practices help to make our code simple, readable and easier to maintain. +These practices help to make our code simple, readable, and easier to maintain. Analysts using code as a means to perform analysis can benefit from at least partially applying such practices in their own codebases. This chapter will try to condense key messages and guidelines from these practices, for use by analysts who write code. -That said, reading and learning more about these practices is likely to benefit the quality of your code and is highly encouraged. 
+Reading and learning more about these practices will likely to benefit the quality of your code and is highly encouraged. ### Keep it simple @@ -454,7 +452,7 @@ That said, reading and learning more about these practices is likely to benefit The ability to convey information in a simple and clear way matters. This is particularly true when explaining concepts that are already complex. -When writing code you are often trying to solve problems that are complex in nature. +You are often trying to solve problems that are complex in nature when writing code. You should avoid introducing extra complexity to these problems, wherever possible. Here are a few tips to make sure you keep your project nice and simple: From 54951301b42fea381e675ccf5142510f9664c46c Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Wed, 30 Oct 2024 13:50:13 +0000 Subject: [PATCH 20/33] edited passive sentences in modular code section --- book/modular_code.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/book/modular_code.md b/book/modular_code.md index 3e7f72f7..04dd397f 100644 --- a/book/modular_code.md +++ b/book/modular_code.md @@ -1,25 +1,25 @@ # Modular code -The principles outlined in this chapter represent good practices for general programming and software development. +This chapter outlines principles that represent good practices for general programming and software development. We have tailored these to a more analytical workflow. ```{admonition} Pre-requisites :class: admonition-learning -To get the most benefit from this section, you should have an understanding of core programming concepts such as: +You will get the most benefit from this section, if you have an understanding of core programming concepts such as: * storing information in variables * using control flow, such as if-statements and for-loops. * writing code as functions or classes. * using functions or classes in your code. -You can find links to relevant training in the [](learning.md) section of the book. +There are links in the [](learning.md) section of the book to relevant training. ``` ## Motivation -Code that is repetitive, disorganised, or overly complex can be difficult to understand, even for experienced programmers. +Even experienced programmers can find it difficult to understand code that is repetitive, disorganised, or overly complex. This makes assuring, testing or changing the code more burdersome. You may also find it harder to spot and fix mistakes. @@ -38,8 +38,8 @@ Because of this, modular code is fundamental to making analysis more reproducibl (modular)= ## Modular code -Breaking your code down into smaller, more manageable chunks is a sensible way to improve readability. -Regardless of the language, there are often techniques to containerise your code into self-contained parts such as modules, classes, or functions. +Improve readability often involves break your code down into smaller, more manageable chunks. +Regardless of the language, you can containerise your code into self-contained parts such as modules, classes, or functions. (functions)= @@ -71,7 +71,7 @@ In turn, this makes it easier to locate bugs in the code and assure its function When it is not possible or practical to follow these practices, you should ensure that any 'side-effects' are adequately documented for both users and developers. This may be the case where your code interacts with a file, database or an external service. 
-Ultimately, if you do signal where these kind of things might happen, someone trying to debug issues will know where to look. +Ultimately, signalling where these kind of things might happen, helps someone trying to debug issues know where to look. To summarise: @@ -85,7 +85,7 @@ break down your code into smaller functions and build up your functionality with Classes are fundamental parts of [object-orientated programming (OOP)](https://en.wikipedia.org/wiki/Object-oriented_programming). They create an association between data (attributes of the class) and logic (methods of the class). -As you will see in the examples below, classes can be useful when representing real objects in our code. +Classes can be useful when representing real objects in our code, as the examples below demonstrate. Although classes exist in R, [writing custom classes](https://adv-r.hadley.nz/oo.html) is less common than it is in Python (and other OOP enabling languages). Because of this, the following sub-section will focus primarily on Python classes. @@ -93,7 +93,7 @@ Because of this, the following sub-section will focus primarily on Python classe With a more complex system, OOP can help to reduce complexity by hiding low-level details from users, such as an internal state. ```{note} -The 'state' of an object is usually a set of variables that are particular to a given instance of a class. +An object's 'state' is usually a set of variables that are particular to a given instance of a class. To illustrate, imagine a bank account that is represented by an `Account` class. You can have many instances of this class (many unique bank accounts), each defined by the following internal state: @@ -102,7 +102,7 @@ You can have many instances of this class (many unique bank accounts), each defi - balance ``` -Since the end user does not need to know all of the state associated with an object, when writing classes consider marking such state as 'private'. +When writing classes consider marking such state as 'private', since the end user does not need to know all of the state associated with an object. This prevents users from accessing attributes directly, instead accessing them through class methods (functions defined with the class). ````{admonition} Method vs Function @@ -197,15 +197,15 @@ They should extend their usefulness, but retain their original functionality. If our `BankAccount` class inherits from `Account` we should consider that 'a `BankAccount` is an `Account`'. Liskov substitution strengthens this statement to '`BankAccount` is interchangeable with an `Account`'; we can replace any `Account` with `BankAccount` without changing how our code runs. -This is because `BankAccount` provides all of the same methods that an `Account` does, no less. +This is because `BankAccount` provides all of the same methods that an `Account` does. You can apply this similarly to functions. If you were to increase the domain and range of a function, to account for new cases, then this function should observe the same interface as the previous function. In short: -- Objects should be replaceable with instances of their subclasses, without altering the correctness of that program. -- Functions should be replaceable with similar functions that share the same interface. +- You should be able to replace objects with instances of their subclasses, without altering the correctness of that program. +- You should be able to replace functions with similar functions that share the same interface. 
``` However, we should be wary that inheritance locks our class in to the object that it inherits from. From fddbd6d32f6474ce4a5b84818b6bfd174f256bde Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 31 Oct 2024 08:38:00 +0000 Subject: [PATCH 21/33] reduced passive sentences in Project Structure chapter --- book/project_structure.md | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/book/project_structure.md b/book/project_structure.md index da39da7b..89ac17ea 100644 --- a/book/project_structure.md +++ b/book/project_structure.md @@ -15,7 +15,7 @@ Others are more specific, and - as with all guidelines - should not be taken as As you begin developing your project, it's a good idea to save your working code in a script file. In R these are saved as `.R` files, and in Python as `.py`. -Scripts can be used within an Integrated Development Environment (IDE) like +You can use scripts within an Integrated Development Environment (IDE) like [Visual Studio Code](https://code.visualstudio.com/), [RStudio](https://rstudio.com/), or [PyCharm](https://www.jetbrains.com/pycharm/). Inside an IDE you can usually run through your script line-by-line, or run the whole file at once. This can be an easier workflow than running code in the Python or the R console and then rewriting the same code in a script later. @@ -26,11 +26,11 @@ For example we may write a few functions that help us to calculate `mean`, `mode Then we can use those functions in our main script, saved in `main.R`. Outside of an IDE you can also run your scripts using the Python or R interpreters from the command line. -This allows other programs to use your scripts. +Other programs can then use your scripts because of this. For example you can use the `Rcmd ` command to run your R scripts or the `python ` command to run your Python scripts. Running your analysis files from end to end ensures that your code is executed in the same order each time. -It also runs the code with a clean environment, not containing variables or other objects from previous runs that can be a common source of errors. +It also runs the code with a clean environment, without variables or other objects from previous runs that can be a common source of errors. ## Keep your project structure clean @@ -48,10 +48,10 @@ Good naming practices improve your ability to locate and identify the contents o Good naming conventions include: -* Consistency, above all else -* Short but descriptive and human readable names -* No spaces, for machine readability - underscores (`_`) or dashes (`-`) are preferred -* Use of consistent date formatting (e.g. [YYYY-MM-DD](https://en.wikipedia.org/wiki/ISO_8601)) +* Consistency, above all else. +* Short but descriptive and human readable names. +* No spaces, for machine readability - underscores (`_`) or dashes (`-`) are preferred. +* Use of consistent date formatting (e.g. [YYYY-MM-DD](https://en.wikipedia.org/wiki/ISO_8601)). * Padding the left side of numbers with zeros to maintain order - e.g. `001` instead of `1`. The number of zeros should reflect the expected number of files. When using dates or times to name files, start with the largest unit of time and work towards the smallest. @@ -130,24 +130,24 @@ This will allow you to record changes to your code independent of other dependen ### Preserve raw data You should not alter raw data - treat it as read-only. -Even data cleaning should take place on a copy of the raw data, so that you can document which cleaning decisions have been made. 
+You should conduct data cleaning on a copy of the raw data, so that you can document which cleaning decisions have been made. -There must be an immutable store for raw data in your project structure. +Your project structure should include an immutable store for raw data. ### Check that outputs are disposable You should be able to dispose of your outputs, deleting them, without worrying. -If you are worried about deleting your outputs (i.e., results) then it is unlikely you have confidence in being able to reproduce your results. +If you are worried about deleting your outputs (i.e., results) then it is likely you lack confidence in reproducing your results. -It is good practice to delete and regenerate your outputs frequently when developing analysis. +Frequently deleting and regenerating your outputs when developing analysis is good practice. ## Structure code as modules and packages Code that is more complex, high risk or reusable between projects can benefit from being structured into a package. Modules are single files that contain one or more reusable units of code. -Multiple related modules are typically collected together within a package. +A package typically contains multiple related modules. It's likely that you've already used a package written by somebody else as part of your analysis. For example, installing additional functionality for Python using `pip install ` on the command line or @@ -166,14 +166,14 @@ See [](project_documentation.md) for a summary of common package and project doc ## Use project templates -Although project structure is flexible, you might recognise that many analysts choose to use similar structures for multiple projects. +Many analysts choose to use similar structures for multiple projects even though project structure can be flexible. Consistency in structure makes it easier to navigate unfamiliar projects. It means that members of the team can quickly orient themselves when joining an existing project or starting a new one. [Cookiecutter](https://github.com/cookiecutter/cookiecutter) is a command-line tool that creates projects from templates (cookiecutters). Using an existing cookiecutter, or creating one to meet your needs, can be a useful way to increase consistency in structure between your projects. -It can save time by creating common folder structures, laying out essential documentation or even starting off your code with a basic boilerplate. -Laying out a structure to include documentation and code testing encourages these good practices. +Increase consistency and good practice by creating common folder structures, laying out essential documentation or even starting off your code with a basic boilerplate. +This can also save you time. Laying out a structure to include documentation and code testing also encourages good practices. Useful cookiecutters include: From 09fa804e746c6760fef277c886fa684b7b60187d Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 31 Oct 2024 14:35:11 +0000 Subject: [PATCH 22/33] reduced passive sentences in Documenting Code --- book/code_documentation.md | 26 +++++++++++--------------- 1 file changed, 11 insertions(+), 15 deletions(-) diff --git a/book/code_documentation.md b/book/code_documentation.md index 122b366c..12250a84 100644 --- a/book/code_documentation.md +++ b/book/code_documentation.md @@ -15,15 +15,13 @@ Comments are lines of text in source code files that typically aren't executed a They are small notes or annotations written by those working on the code. 
Often, they provide context or explain the reasoning behind implementation decisions. -Comments are essential to help those working on the code in the future to understand any non-obvious details -around how and why the code has been written in a particular way. +Comments are essential for explaining non-obvious details around around how and why the code has been written in a particular way to help those working on the code in the future. As such, when it comes to providing relevant and perhaps higher-level documentation to the end consumer on the functionality of your code, there are much more appropriate solutions such as [docstrings](docstrings). Although extremely useful, comments should be used sparingly. Excessive use of code comments often leads to redundancy and can, ironically, make your code harder to read. -It is easy for comments to not be updated as changes are made to the code. -Outdated, or irrelevant comments can confuse or mislead. +It is easy to fail to update comments as you change code. However, outdated, or irrelevant comments can confuse or mislead. ```{note} **Remember**: the only point of 'truth' is the code that is executed - if the comments are out of date compared to the actual code, it may not be immediately apparent. @@ -156,20 +154,20 @@ print("Run me!") ```` It is easy to forget which parts of code have been commented out and why they have been commented. -It introduces a human factor into the equation, which might not be accounted for if someone in the future is not aware of the commented-out code. +A human factor is then introduced into the equation, which might not be accounted for if someone in the future is not aware of the commented-out code. This is likely to produce inconsistent runs of the same piece of code. This code might quickly become out of sync with the rest of the changes in the codebase, as developers may not consider updating code that is commented out if they assume it is obsolete. You should instead use appropriate control flow (such as `if/else` statements) to determine when these sections should be run. -When changes are required between individual runs of your analysis, you should consider [defining these options in a dedicated configuration file](configuration.md). +You should consider [defining these options in a dedicated configuration file](configuration.md) when changes are required between individual runs of your analysis. In summary, you should use comments sparingly but purposefully. Make sure comments: -- explain **why** certain things are done, in order to provide context around the decisions that you have made -- do not echo what your code is already telling the reader -- are accurate and still relevant after code changes +- explain **why** certain things are done, in order to provide context around the decisions that you have made. +- do not echo what your code is already telling the reader. +- are accurate and still relevant after code changes. (docstrings)= @@ -404,13 +402,12 @@ Sphinx primarily uses the [`reStructuredText`](https://docutils.sourceforge.io/d That said, for those more familiar with `markdown` and in teams/environments where learning a new markup language is not a top priority, [`sphinx` can be extended to also support `markdown`](https://www.sphinx-doc.org/en/master/usage/markdown.html). -Sphinx supports code highlighting for multiple programming languages within a project, -however, other tools may be required to automatically collate documentation from code in languages other than Python. 
These are not addressed here. +Sphinx supports code highlighting for multiple programming languages within a project. However, you may require other tools to automatically collate documentation from code in languages other than Python. These are not addressed here. Sphinx also supports theming, with a [myriad of themes](https://www.writethedocs.org/guide/tools/sphinx-themes/) available out of the box. -With a little bit of extra time you can even develop and adapt the existing themes into a custom theme suitable for your work. +You can even develop and adapt the existing themes into a custom theme suitable for your work with a little bit of extra time. -As well as theming support, `sphinx` allows users to develop extensions that extend its functionality. +`Sphinx` allows users to develop extensions that extend its functionality, as well as theming support,. This GitHub repository provides a list of [useful ways to extend the functionality of `sphinx`](https://github.com/yoloseem/awesome-sphinxdoc) to suit your needs. To illustrate how this can be extremely useful, we will introduce the [doctest extension](https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html). @@ -434,8 +431,7 @@ provide a good demonstration of how you would apply it in practice. Once built, the HTML files containing your documentation can be opened in any browser. Usually this means looking for an `index.html` file in the output directory and opening it with your browser. -This is sufficient for local usage. However, in order to improve the end-user experience and remove the need to browse the files looking for `index.html`, -it is wise to host this documentation somewhere where it will be publicly available. +This is sufficient for local usage. However,it is wise to host this documentation somewhere where it will be publicly available to improve the end-user experience and remove the need to browse the files looking for `index.html`. Your version control platform might support hosting web pages already. GitHub provides this hosting via [GitHub Pages](https://pages.github.com/) and is able to host not only documentation, From cd49e182992744aee3f34cc5e9ac312b5f307afe Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 31 Oct 2024 17:31:45 +0000 Subject: [PATCH 23/33] reduced passive sentences in project documentation section --- book/project_documentation.md | 39 +++++++++++++++++------------------ 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/book/project_documentation.md b/book/project_documentation.md index 0a721ecf..ab5953b3 100644 --- a/book/project_documentation.md +++ b/book/project_documentation.md @@ -1,7 +1,6 @@ # Project documentation -Whether you're developing a package or collaborating on a piece of analysis, -documenting your project will makes it much easier for others to understand your goal and ways of working. +Documenting your project will makes it much easier for others to understand your goal and ways of working, whether you're developing a package or collaborating on a piece of analysis. ## README @@ -24,11 +23,11 @@ We suggest the following for a good README: ## Contributing guidance When collaborating, it is also useful to outline the standards used within your project. -This might include particular packages that should used for certain tasks and guidance on the [code style](code-style) used in the project. -If you plan to have contributors from outside your organisation, it is useful to include a code of conduct too. 
+This might include particular packages required for certain tasks and guidance on the [code style](code-style) used in the project. +Consider including a code of conduct if you plan to have contributors from outside your organisation. Please [see GitHub](https://docs.github.com/en/github/building-a-strong-community/adding-a-code-of-conduct-to-your-project) for advice on creating a code of conduct. -For an example, see the CONTRIBUTING file from our [gptables package](https://github.com/best-practice-and-impact/gptables/blob/master/CONTRIBUTING.md): +See the CONTRIBUTING file from our [gptables package](https://github.com/best-practice-and-impact/gptables/blob/master/CONTRIBUTING.md) for an example: `````{tabs} @@ -143,7 +142,7 @@ The environment that your code runs in includes the machine, the operating syste It is important to record this information to ensure reproducibility. The simplest way to document which packages your code is dependent on is to record them in a text file. -This is typically called `requirements.txt`. +We typically call this text file `requirements.txt`. Python packages record their dependencies within their `setup.py` file, via `setup(install_requires=...)`. You can get a list of your installed python packages using `pip freeze` in the command line. @@ -152,15 +151,15 @@ You can get a list of your installed python packages using `pip freeze` in the c Packages are listed under the `Imports` key. You can get a list of your installed R packages using the `installed.packages()` function. -Environment management tools, such as +You will find environment management tools, such as [`renv`](https://rstudio.github.io/renv/articles/renv.html) for R or -[`pyenv`](https://github.com/pyenv/pyenv) for python, are very useful for keeping track of software and package versions used in a project. +[`pyenv`](https://github.com/pyenv/pyenv) for python useful for keeping track of software and package versions used in a project. ## Citation -For research or analytical code that is likely to be referenced by others, it can be helpful to provide a citation template. -This can be included in your code repository as a `CITATION` file or part of your `README`. +It can be helpful to provide a citation template for research or analytical code that is likely to be referenced by others. +You can include this in your code repository as a `CITATION` file or part of your `README`. For example, the R package `ggplot2` provides the following: ```none @@ -181,8 +180,8 @@ A BibTeX entry for LaTeX users is } ``` -This might include multiple citations, if your project includes multiple datasets, pieces of code or outputs with their own -[DOIs](https://en.wikipedia.org/wiki/Digital_object_identifier). +If your project includes multiple datasets, pieces of code or outputs with their own +[DOIs](https://en.wikipedia.org/wiki/Digital_object_identifier), this might include multiple citations. See this [GitHub guide for more information on making your public code citable](https://guides.github.com/activities/citable-code/). @@ -191,14 +190,14 @@ See this [GitHub guide for more information on making your public code citable]( Vignettes are a form of supplementary documentation, containing applied examples that demonstrate the intended use of the code in your project or package. Docstrings may contain examples applying individual functional units, while vignettes may show multiple units being used together. 
-The term vignette is usually used with reference to R packages, for example this introduction to the +We use the term vignette with reference to R packages, for example this introduction to the [`dplyr` package](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) for data manipulation. However, the same long-form documentation is beneficial for projects in any programming language - for instance the [`pandas` basics guide](https://pandas.pydata.org/docs/user_guide/basics.html). We've seen that [docstrings](docstrings) can be used to describe individual functional code elements. -Vignettes provide a demonstration of the intended use for these classes and functions, in a realistic context. -This can help users to understand how different code elements interact, and how they might use your code in their own program. +Vignettes demonstrate the intended use for these classes and functions in a realistic context. +This shows users how different code elements interact and how they might use your code in their own program. Another good example is this vignette describing [how to design vignettes](http://r-pkgs.had.co.nz/vignettes.html) in Rmarkdown. You can produce this type of documentation in any format, though Rmarkdown is particularly effectively at combining sections of code, @@ -211,14 +210,14 @@ You might also consider providing examples in an interactive notebook that users Documenting the version of your code provides distinct points of reference in the code's development. Recording the version of code used for analysis is important for reproducing your work. -When used in combination with [](version_control.md), versioning allows you to recover the exact code used to run your analysis and thus reproduce the same results. +Combining versioning with [](version_control.md) allows you to recover the exact code used to run your analysis, and thus reproduce the same results. [Semantic versioning](https://semver.org/) provides useful rules for versioning releases of your code. -Following these rules also helps other users of your code to understand how changes in your code may affect their software. +Follow these rules to help other users of your code understand how changes in your code may affect their software. Each level of version number indicates the extent of changes to the application programming interface (API) of your code, -i.e. the part of the code that a user interacts with directly. +i.e., the part of the code that a user interacts with directly. Changes to the major version number indicate changes to the API that are not compatible with use of previous versions of the code. -While changes is the minor and patch numbers indicate changes that are either compatible or have no effect on the use of the code, respectively. +While changes to the minor and patch numbers indicate changes that are either compatible or have no effect on the use of the code, respectively. ```{figure} ./_static/semantic_versioning.png --- @@ -239,7 +238,7 @@ As this guidance will change over time, this version number provides users with ## Changelog -A changelog records the major changes that have occurred to a project or package, between versioned releases of the code. +A changelog records the major changes that have occurred to a project or package between versioned releases of the code. 
```{code-block}
# Changelog

From 203cfbaea77f574eac5f05657336cf66b826fd35 Mon Sep 17 00:00:00 2001
From: Beth Jones
Date: Thu, 31 Oct 2024 18:20:05 +0000
Subject: [PATCH 24/33] reduced passive sentences in Documenting Projects section

---
 book/project_documentation.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/book/project_documentation.md b/book/project_documentation.md
index ab5953b3..e858c9b2 100644
--- a/book/project_documentation.md
+++ b/book/project_documentation.md
@@ -124,7 +124,7 @@ which is formatted into HTML when viewed on our repository.
 ## User desk instructions
 
 If your project is very user focused for one particular task,
-for example developing a statistic production pipeline for other analysts to execute,
+for example, developing a statistic production pipeline for other analysts to execute,
 it is very important that the code users understand how to appropriately run your code.
 
 These instructions should include:
@@ -263,8 +263,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - bug in `multiple_each_in_list()`, where output was not returned
 ```
 
-Similarly to versioning, a changelog is useful for users to determine whether an update to your code is compatible with their work, which may depend on your code.
-It can also document which parts of your code will no longer be supported in future version and which bugs in your code have been addressed.
+Similarly to versioning, users find a changelog useful to determine whether an update to your code is compatible with their work, which may depend on your code.
+It can also document which parts of your code will no longer be supported in future versions and which bugs in your code you have addressed.
 Your changelog can be in any format and should be associated with your code documentation, so that it is easy for users and other contributors to find.
 [keep a changelog](https://keepachangelog.com/en/1.0.0/) provides a simple but effective template for recording changes to your code.
@@ -305,9 +305,9 @@ For example, an MIT LICENSE file might look like:
 
 In government, we [support and promote open source](https://www.gov.uk/service-manual/service-standard/point-12-make-new-source-code-open) whenever possible.
 [Open source](https://opensource.com/resources/what-open-source) software is software with source code that anyone can freely inspect, modify and enhance.
-As a government analyst, you should aim to make all new source code open, unless justification can be provided for withholding part of your source code.
+As a government analyst, you should aim to make all new source code open, unless you can justify withholding part of your source code.
 
-Open sourcing code benefits yourself, other government analysts and the public.
+Open sourcing code benefits you, other government analysts, and the public.
 
 Personal benefits from open sourcing include:
@@ -324,5 +324,5 @@ While the public benefit from:
 
 Please see the [Government Data Service (GDS) guidance](https://www.gov.uk/government/publications/open-source-guidance/when-code-should-be-open-or-closed)
 for help deciding when code should be open or closed.
-Security concerns for coding in the open are also addressed in further
-[GDS guidance](https://www.gov.uk/government/publications/open-source-guidance/security-considerations-when-coding-in-the-open).
Additional guidance on deciding when and how to open source, the benefits, and good practice is available from the [Analysis Function](https://analysisfunction.civilservice.gov.uk/policy-store/open-sourcing-analytical-code/). \ No newline at end of file +Further +[GDS guidance](https://www.gov.uk/government/publications/open-source-guidance/security-considerations-when-coding-in-the-open) addresses security concerns for coding in the open. Additional guidance on deciding when and how to open source, the benefits, and good practice is available from the [Analysis Function](https://analysisfunction.civilservice.gov.uk/policy-store/open-sourcing-analytical-code/). \ No newline at end of file From c8315196cc61c84fc0e45916614dfe4a6d41c791 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 31 Oct 2024 19:39:26 +0000 Subject: [PATCH 25/33] editing passive sentences in version control section --- book/version_control.md | 70 ++++++++++++++++++++--------------------- 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/book/version_control.md b/book/version_control.md index 228289eb..e9f5d563 100644 --- a/book/version_control.md +++ b/book/version_control.md @@ -37,12 +37,12 @@ Documents, from [xkcd](https://xkcd.com/1459/). This work is licensed under a [Creative Commons Attribution-NonCommercial 2.5 License](https://creativecommons.org/licenses/by-nc/2.5/). ``` -When used effectively, version control helps us to identify which changes have negatively impacted our work and remove them. +Effective use of version control helps us to identify which changes have negatively impacted our work and remove them. Furthermore, a detailed audit trail allows us to refer to specific versions of our code that have been used to produce outputs, which is important for reproducing our analysis. Git is invaluable when recording and combining changes from multiple developers, as merging allows us to resolve conflicts between overlapping changes. -Using a remote Git repository maintains a single source of truth, even when multiple individuals are working on a project. +A remote Git repository maintains a single source of truth, even when multiple individuals are working on a project. Additionally, version control platforms, like GitHub and GitLab, can make it easier to track and review ongoing work. This avoids duplication of effort and keeps review embedded in the development workflow. @@ -160,15 +160,15 @@ As such, we'll need to invest more effort to work out which change causes the bu ### Use branching appropriately Branches are independent copies of a project's history, copied from a parent branch at a specific point in its history. -A new branch is typically created to make a change to your code, which might be building a new feature or fixing a bug in your analysis. -Multiple branches can be created from a single parent branch when multiple changes are being worked on independently. +You typically create a new branch to make a change to your code, which might be building a new feature or fixing a bug in your analysis. +You can create multiple branches from a single parent branch when multiple changes are being worked on independently. This is especially useful when multiple analysts are working collaboratively and are making changes in parallel. Once changes have been implemented and sufficiently quality assured (think documentation and testing), the branch containing the changes can be merged onto another branch. 
The target branch for this merging is typically the parent branch, which the branch was branched from. -Complete changes should eventually be merged onto the `main` branch, which is the public-facing branch that is ready for use. -During merging any overlapping or 'conflicting' changes between the current and target branches must be resolved. +You should eventually merge complete changes onto the `main` branch, which is the public-facing branch that is ready for use. +You must resolve any overlapping or 'conflicting' changes between the current and target branches during merging. It is important to note that your approach to branching within a project should be proportional to number of collaborators and the complexity of the development work. @@ -188,8 +188,8 @@ Commits along a single "main" Git branch. More complex projects may warrant using branching. When using branches, the `main` branch should be considered as the most 'stable' branch in the repository - meaning that the code on this branch builds successfully and executes as expected. -When making changes to code, changes may initially be less stable or reliable. -As such, you should make these changes on a new branch so that the working code on the `main` branch is unaffected. +When you make changes to code, they may initially be less stable or reliable. +Therefore, you should make these changes on a new branch so that the working code on the `main` branch is unaffected. As the changes to the code are refined, it becomes safer to merge these changes onto a higher level branch such as `main`. For example, when the code has been reviewed and suitably tested. You should aim to only merge onto a more stable branch when you don't expect it to break the working code on the target branch. @@ -205,7 +205,7 @@ Working on changes on a single `feature` branch. Here we show a single `feature` branch being created from the `main` branch. Changes are initially quite experimental, but are refined over a few commits. -Finally, the complete, working feature is merged back onto the `main` branch. +Finally, we merge the complete, working feature back onto the `main` branch. Many small scale projects iteratively work on individual feature or development branches in this way. The [GitHub flow branching strategy](https://guides.github.com/introduction/flow/) uses this approach in combination with [Pull Requests](pull-requests), @@ -235,7 +235,7 @@ Working on multiple parallel branches. Here we create two feature branches from `main`. Work on each feature is carried out independently of the other feature and can be merged onto `main` once it is complete. If changes from separate branches affect the same files, merging these branches to `main` may lead to merge conflicts. -In these cases you should ensure that you resolve the conflicts to keep the desired overall change. +In these cases you should resolve the conflicts to keep your desired overall change. ```{note} If you are able to break up your work into independent features that are not expected to affect the same files, you should do this. @@ -260,13 +260,13 @@ In this example, we have created a `feature` branch. Early in development of the feature we want to fix a bug that has been created, but this work can be carried out independently to the remaining development of the feature. As such, we create another, deeper branch to carry out the bug fix. -Once the bug is fixed, we merge the `bug-fix` onto our `feature` branch. And finally, the finished `feature` can be merged back onto `main`. 
+Once we have fixed the bug, we merge the `bug-fix` onto our `feature` branch. And finally, the finished `feature` can be merged back onto `main`.
 
 The [Git flow branching strategy](https://nvie.com/posts/a-successful-git-branching-model/) describes an alternative to progressively merging our changes onto `main`.
-Development work is instead branched from a `develop` branch.
-Merges from `develop` onto the `main` branch are only used to release a new version of the code.
-This approach can be useful when code from the `main` branch is deployed directly into production,
-however, analysts should opt to use the most simple and beneficial approach to branching depending on their project.
+Instead, development work is branched from a `develop` branch.
+You only use merges from `develop` onto the `main` branch to release a new version of the code.
+This approach can be useful when deploying code directly into production from the `main` branch.
+However, analysts should use the simplest and most beneficial approach to branching depending on their project.
 
 ```{note}
 Although we have used very simple branch names in the examples above, it's important that you use informative names for your branches in practice.
@@ -295,7 +295,7 @@ Changes on the branch being merged
 
 The row of equals signs divides the old from the new. The contents in the first division are the changes in the current branch. Changes in the second division are from the new branch.
-You can choose to keep one, both or neither.
+You can choose to keep one, both, or neither.
 To resolve the merge conflict, you will need to make the necessary changes and delete the relevant symbols that git added to the text.
 Once you have resolved all conflicting text manually (there may be more than one), then you can add and commit the changes to resolve the merge conflicts.
@@ -306,7 +306,7 @@ If this is difficult to do, it may be that your scripts are too monolithic and s
 
 ### Versioning large files
 
-When versioning your repository, Git stores compressed copies of all previous versions of each file.
+Git stores compressed copies of all previous versions of each file when versioning your repository.
 Despite the file compression, this means that versioning very large or binary files quickly increase the size of your repository's history, especially if there are multiple versions of them.
 The size of your Git history determines how long it takes to `clone` or `pull` and `push` changes to and from your remote repository.
@@ -327,7 +327,7 @@ Other tools, including [git-annex](https://git-annex.branchable.com/) can be use
 
 Despite this support for large files, we recommend that remote Git repositories are not used to store data.
 Versioning of your data could instead be handled independently to your code; the version of your code should not be influenced directly by changes in the data and vice versa.
-This separation can be achieved using a tool like [DVC](https://dvc.org/), which allows you to specify where data versions are store (locally or on the cloud).
+You can achieve this separation using a tool like [DVC](https://dvc.org/), which allows you to specify where data versions are stored (locally or on the cloud).
 Alternative, third party storage (e.g., cloud-based 'buckets' or databases) can provide easy data storage with varying levels of version control capability.
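As a rough illustration of the idea behind versioning data outside of Git (which tools like DVC do far more robustly), the sketch below records a SHA-256 hash and the size of a data file in a small manifest kept with your outputs. This is not the DVC interface itself, and the file and manifest names are hypothetical; it only shows how an analysis run can be tied to an exact version of its input data.

```python
# Minimal sketch: record which version of a data file an analysis run used.
# This illustrates the concept only; it is not a replacement for DVC.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def hash_file(path: Path) -> str:
    """Return the SHA-256 hash of a file, reading it in chunks."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as file:
        for chunk in iter(lambda: file.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


def record_data_version(data_path: Path, manifest_path: Path) -> None:
    """Append the data file's hash, size, and timestamp to a JSON manifest."""
    entry = {
        "file": str(data_path),
        "sha256": hash_file(data_path),
        "bytes": data_path.stat().st_size,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    records = []
    if manifest_path.exists():
        records = json.loads(manifest_path.read_text())
    records.append(entry)
    manifest_path.write_text(json.dumps(records, indent=2))


# Example usage (hypothetical paths):
# record_data_version(Path("data/survey_2024.csv"), Path("data_versions.json"))
```

Even a simple manifest like this makes it possible to say exactly which data fed a given set of outputs, while the data itself stays outside the Git history.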
@@ -335,15 +335,15 @@ Alternative, third party storage (e.g., cloud-based 'buckets' or databases) can Regularly `commit`ing changes using Git helps us to create a thorough audit trail of changes to our project. However, there may be discrete points in the history of the project that we want to mark for easier future reference. -This is incredibly useful, commit hashes like `121b5b4` serve as really poor identifiers for human users. +This is incredibly useful, as commit hashes like `121b5b4` serve as really poor identifiers for human users. -To reference specific points in project's history, Git allows us to create "tags". -These tags essentially act as an alias for a particular commit hash, allowing us to refer to it by an informative label. -In analytical projects, we might use tags to mark a particular model version or an important stage of our analysis. +Git allows us to create "tags" to reference specific points in project's history. +These tags essentially act as an alias for a particular commit hash, so we can refer to it by an informative label. +We might use tags to mark a particular model version in an analytical project or an important stage of our analysis. For example, we might tag code that has been used to generate a particular output so that it can easily be accessed to reproduce that output in future. If developing a package as part of your analysis, these tags are also commonly used to indicate new package versions. -By default, tags will reference the current position in history (i.e. the latest commit or `HEAD`). +By default, tags will reference the current position in history (i.e., the latest commit or `HEAD`). An annotated tag might be created for a new model version like so: ```{code-block} @@ -351,7 +351,7 @@ git tag -a v0.2.7 -m "Model version 0.2.7" git push origin v0.2.7 ``` -You can also retrospectively tag an older commit, by providing that `commit`'s hash: +You can also retrospectively tag an older commit by providing that `commit`'s hash: ```{code-block} git tag -a v0.1.0 -m "Model version 0.1.0" 9fceb02 @@ -382,7 +382,7 @@ This might be in the form of credentials, used to access a service, or data that In these cases, we need to minimise the risk of inadvertently sharing this information with our code. This subsection suggests how you might mitigate this risk in your analysis. -In the case of passwords or credentials that are used in your code, you should ensure that these are stored in [environment variables](environment-variables) +You should ensure that passwords or credentials that are used in your code are stored in [environment variables](environment-variables) and are not written directly into code. This includes in the early stages of development, as your version control history will retain copies of these. @@ -430,7 +430,7 @@ As such, we must be careful not to include this information in commits to remote Similarly, R stores a record of all commands executed in your session in a `.Rhistory` file. If you have referenced sensitive information directly in your R commands, then this information will be written to the `.Rhistory` file. -To handle this, we can exclude these files (via `.gitignore`) or prevent R from generating them. +We can exclude these files (via `.gitignore`) or prevent R from generating them to handle this. If you do not use these files, it is safest not to generate them in the first place. 
In Rstudio, you can disable writing of these files via `Tools > Global options`: @@ -449,18 +449,18 @@ AlwaysSaveHistory: No ### Avoid saving data from Jupyter Notebooks -By default, Jupyter Notebooks save a copy of the data that is used to create cell outputs. +Jupyter Notebooks save a copy of the data that is used to create cell outputs by default. For example, the data used to generate a chart, table or print a section of a dataframe. This can be useful for sharing your code outputs without others needing to re-execute the code cells. If you work with sensitive datasets in notebooks, this means that your notebooks may store sensitive data to display cell outputs. -If these notebooks are subsequently shared in code repositories, you may be making sensitive data available to unauthorised individuals. +You may be making sensitive data available to unauthorised individuals if these notebooks are subsequently shared in code repositories, . It is [not currently possible to prevent the notebooks from retaining cell outputs](https://github.com/ipython/ipython/issues/1280). The best way to handle this situation is to clear the outputs from your notebooks before committing them to Git repositories. -This can be done from the notebook itself, by going to the menu `Cell > All > Output > Clear` and then saving your notebook. -Alternatively, this can be done from the command line by running this command with your notebook file path: +You can do this from the notebook itself, by going to the menu `Cell > All > Output > Clear` and then saving your notebook. +Alternatively, you can do this from the command line by running this command with your notebook file path: ```none jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace @@ -518,11 +518,11 @@ Restricting access can help to limit the risk of releasing sensitive information On GitHub, Organizations provide an area for collating multiple repos that are associated with a particular team or department. Within Organizations, Teams can be also created to manage view and contribution permissions for specific projects within the Organization. -External collaborators can also be added to projects, to allow direct contribution from those outside of the Organization. +External collaborators can also be added to projects to allow direct contribution from those outside of the Organization. Detailed setup and management of Organizations and Teams are described in the [relevant section of the GitHub documentation](https://docs.github.com/en/free-pro-team@latest/github/setting-up-and-managing-organizations-and-teams). -Regardless of the visibility status of a repo, only Organization members and collaborators may make direct contributions to the code in the repo. +Only Organization members and collaborators may make direct contributions to the code in the repo regardless of the visibility status of a repo,. Others can contribute by Forking and using Pull Requests. @@ -560,7 +560,7 @@ If you are able to install them in your department, you may use one of the simpl ## Using GitHub -A number of version control platforms extend the functionality of Git, to further improve collaborative working. +A number of version control platforms extend the functionality of Git to further improve collaborative working. Here we describe some of the beneficial features supported by [GitHub](https://github.com/), the world's leading software development platform. GitHub provides additional tools for better management of collaborative work. 
Many of these tools are also discussed in detail on the @@ -572,7 +572,7 @@ but we will describe how they may be applied in analytical workflows here. ### Use Git issues to plan and discuss development of your project Issues offer a method for requesting or recording tasks, including enhancements and bug fixes in a project. -They act as a collaborative todo list, which users and developers can easily contribute to. +They act as a collaborative todo list which users and developers can easily contribute to. When combined with [Project boards](https://docs.github.com/en/free-pro-team@latest/github/managing-your-work-on-github/about-project-boards), the issues system works very similarly to other tools like Trello and Jira. @@ -585,7 +585,7 @@ The basic elements of an issue are the: Within an issue's description and comments, you can reference other issues both within (e.g., `#12`) and between repos, and tag individuals to notify them of your comments (`@whatstheirface`). -Similarly, issues can be linked to specific [changes that will be merged to resolve or help to solve the issue](pull-requests). +Similarly, you can link issues to specific [changes that will be merged to resolve or help to solve the issue](pull-requests). This makes them useful for discussing bugs and new features within a team. When a GitHub repo is publicly visible, the issues are also open and can be contributed to by others, including users. From 1215d1e607d35adad247dce66e8cad5e34781bcb Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Thu, 31 Oct 2024 19:46:14 +0000 Subject: [PATCH 26/33] editing passive sentences in version control (done to line 600) --- book/version_control.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/version_control.md b/book/version_control.md index e9f5d563..92c6aa19 100644 --- a/book/version_control.md +++ b/book/version_control.md @@ -593,7 +593,7 @@ Open source projects benefit from this transparency by providing users with a pl In turn, developers in the community can then address these issues to improve the project. Analytical projects might use issues to plan and discuss the steps involved in developing the project. -Where additional help is required, collaborators might be tagged or assigned to the task. +Collaborators might b[](vscode-file://vscode-app/c%3A/Program%20Files/Microsoft%20VS%20Code/resources/app/out/vs/code/electron-sandbox/workbench/workbench.html)e tagged or assigned to the task where additional help is required. If your analysis code is widely useful, others that use your code may also suggest improvement and offer to contribute to the project via these issues. 
[Setting issue templates](https://docs.github.com/en/free-pro-team@latest/github/building-a-strong-community/configuring-issue-templates-for-your-repository) From 227b7736f55041c1a72306e4c41ebdbdb5d629fc Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Mon, 4 Nov 2024 13:42:12 +0000 Subject: [PATCH 27/33] reduced passive sentences in version control --- book/version_control.md | 41 ++++++++++++++++++++--------------------- 1 file changed, 20 insertions(+), 21 deletions(-) diff --git a/book/version_control.md b/book/version_control.md index 92c6aa19..7bff6a9d 100644 --- a/book/version_control.md +++ b/book/version_control.md @@ -599,27 +599,27 @@ If your analysis code is widely useful, others that use your code may also sugge [Setting issue templates](https://docs.github.com/en/free-pro-team@latest/github/building-a-strong-community/configuring-issue-templates-for-your-repository) for your project can be an effective way of encouraging users and collaborators to use informative descriptions when creating issues. For example, a bug issue should include simple instructions to help maintainers reproduce the problem. -While feature requests might include information on how the user expects the new feature to work and details what problem it will help them to overcome. +Feature requests might include information on how the user expects the new feature to work and details what problem it will help them to overcome. (pull-requests)= ### Use pull requests for reviewing changes Once changes have been implemented, perhaps to meet the requirements of an issue, -Pull Requests (PRs) provide a useful interface for incorporating those changes into the main project. -PRs are typically used to merge a development branch (the source branch) onto a more stable branch in the main project (the target branch). +Pull Requests (PRs) provide a useful interface for incorporating changes into the main project. +We typically use PRs to merge a development branch (the source branch) onto a more stable branch in the main project (the target branch). The development branch here may be within the same repo, a [Fork](forking) of this project, or even a completely separate project. -The initial description of the PR should include the high level changes that have been made and might point to any relevant issues that it resolves. -Much like issues, PRs can be linked to other issues and PRs, providing a coherent narrative of development work. -[Keywords can be used when linking an issue (e.g., 'fixes #42')](https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword) +You should include the high level changes that have been made in the initial description of the PR, and you might point to any relevant issues that it resolves. +You can link PRs to other issues and PRs, and should provide a coherent narrative of development work much like issues. +[You can use Keywords when linking an issue (e.g., 'fixes #42')](https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword) to trigger the issue to close the PR is merged. -Contributors can also be assigned or tagged in discussion, which can be useful for requesting help or review of a group of changes. +You can assign contributors or tag them in discussion, which can be useful for requesting help or review of a group of changes. Put checklists into your pull request comments. 
These come with check boxes that can be ticked. This is useful when breaking down tasks into smaller chunks. -When requesting a review, use checklists to help the reviewer review all the relevant parts of the pull request. +Use checklists to help the reviewer review all the relevant parts of the pull request when requesting a review. As a reviewer, use checklists to list changes needed before the pull request can be approved. [The Government Digital Service have good examples of what this can look like.](https://github.com/alphagov/github-organisation-administration/pull/1) @@ -639,18 +639,18 @@ In the "Files changed" section of a PR (shown above), altered sections of files Where changes have deleted lines of code, these lines are highlighted in red on the left panel. And changes that add lines of code to the file are shown on the right. This highlighted summary of changes provides a useful interface for [peer review](peer_review.md). -When carrying out a review using this view, comments can added to specific lines of code and these comments can include suggested changes. +When carrying out a review using this view, you can add comments to specific lines of code and these comments can include suggested changes. All comments made using this view are also shown in the main Conversation view. -When completing a review, a reviewer can indicate whether the PR should be merged or additional changes are required. +When completing a review, a reviewer can indicate whether the PR should be merged or if additional changes are required. Once a PR has been reviewed and the reviewer is happy with the changes, -the Conversation view can be used by internal developers to merge the PR changes onto the target branch. +internal developers can use the Conversation view to merge the PR changes onto the target branch. ```{note} The repository settings can be adjust to project branches against specific actions. -To enforce peer review, you should consider preventing merging onto the `main` branch without an approved Pull Request. -Combining this with a [Pull Request template](https://docs.github.com/en/github/building-a-strong-community/creating-a-pull-request-template-for-your-repository) -ensures that a standard peer review process is followed for all changes. +You should consider preventing merging onto the `main` branch without an approved Pull Request to enforce peer review. +You can combine this with a [Pull Request template](https://docs.github.com/en/github/building-a-strong-community/creating-a-pull-request-template-for-your-repository) +to ensure that a standard peer review process is followed for all changes. ``` @@ -665,7 +665,7 @@ You might fork a repository when you want to: * Contribute to a project as an external collaborator * Make changes to a project for your own use, or to maintain a copy that is independent to the original -In the first case, lets consider that an issue describes a bug in a project's code. +In the first case, lets consider an example where an issue describes a bug in a project's code. Looking at the code, you think that you know where the source of the bug is. You create a fork of the project and clone your copy of the project locally. Here you make commits that include changes to fix the bug and test that these changes work. 
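When fixing a bug in this way, it is good practice for the commit to pair the fix with a small regression test that failed before the change and passes afterwards. The sketch below is hypothetical (the function name and behaviour are invented for illustration), but it shows the shape such a commit might take:

```python
# A hypothetical bug fix and the regression test committed alongside it.


def mean_or_none(values):
    """Return the mean of values, or None for an empty sequence.

    Bug fix: previously this raised ZeroDivisionError for empty input.
    """
    if not values:
        return None
    return sum(values) / len(values)


def test_mean_or_none_returns_none_for_empty_list():
    assert mean_or_none([]) is None


def test_mean_or_none_averages_values():
    assert mean_or_none([2, 4, 6]) == 4
```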
@@ -677,10 +677,9 @@ In the second case, perhaps you want to reuse or adapt code from an existing pro If the changes that you wish to make to the code are not in line with the aim of the original project or the project is no longer actively maintained, then you might create a fork to contain these changes. -Note that forks do not automatically synchronise with the original repo. -This means that changes to the original repo, after you create a fork, need to be manually synchronised if you want to include them in your repo. -When you would like to offer to contribute your changes to the original project (see [Pull Requests](pull-requests)), -you should ensure that you synchronise your branch with any new changes first. +You should note that forks do not automatically synchronise with the original repo. +This means that after you create a fork, changes to the original repo need to be manually synchronised if you want to include them in your repo. +You should ensure that you synchronise your branch with any new changes before offering to contribute your changes to the original project (see [Pull Requests](pull-requests)). See the GitHub documentation for [instructions on forking a repo and keeping your fork up to date](https://docs.github.com/en/free-pro-team@latest/github/getting-started-with-github/fork-a-repo) @@ -699,9 +698,9 @@ Continuous Integration is discussed in more detail in [](continuous_integration. offer project management features through a [Kanban-style board](https://en.wikipedia.org/wiki/Kanban_board). These boards can be used to track assignment and progress of specific tasks. -This is aided by linking tasks to specific issues and pull requests. +Linking tasks to specific issues and pull requests aids the use of boards. [GitHub Pages](https://pages.github.com/) offers hosting of static web content, which can be useful for code documentation. -GitHub Actions can be used to generate this documentation from the code and deploy directly to the Pages site. +You can use GitHub Actions to generate this documentation from the code and deploy directly to the Pages site. Alternatively, [project Wikis](https://docs.github.com/en/free-pro-team@latest/github/building-a-strong-community/about-wikis) can be used to manually document your project using [Markdown](https://www.markdownguide.org/basic-syntax/). From 442f3399f862c017b93a0ccba861bcf4bdae4c7e Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Mon, 4 Nov 2024 14:32:52 +0000 Subject: [PATCH 28/33] reduced passive sentences in configuration section --- book/configuration.md | 60 +++++++++++++++++++++---------------------- 1 file changed, 29 insertions(+), 31 deletions(-) diff --git a/book/configuration.md b/book/configuration.md index 6665f8ca..0ac35c96 100644 --- a/book/configuration.md +++ b/book/configuration.md @@ -60,22 +60,22 @@ Here we're reading in some data and splitting it into subsets for training and t We use one subset of variables and outcomes to train our model and then use the subset to test the model. Finally, we write the model's predictions to a `.csv` file. -The file paths that are used to read and write data in our script are particular to our working environment. +The file paths we use to read and write data in our script are particular to our working environment. These files and paths may not exist on an other analyst's machine. -As such, other analysts would need to read through the script and replace these paths in order to run our code. 
+As such, to run our code, other analysts need to read through the script and replace these paths. As we'll demonstrate below, collecting flexible parts of our code together makes it easier for others to update them. When splitting our data and using our model to make predictions, we've provided some parameters to the functions that we have used to perform these tasks. Eventually, we might reuse some of these parameters elsewhere in our script (e.g., the random seed) and we are likely to adjust these parameters between runs of our analysis. -To make it easier to adjust these consistently throughout our script, we should store them in variables. +We should store them in variables to make it easier to adjust these consistently throughout our script. We should also store these variables with any other parameters and options, so that it's easy to identify where they should be adjusted. Note that in this example we've tried our model prediction twice, with different parameters. We've used comments to switch between which of these lines of code runs. This practice is common, especially when we want to make a number of changes when developing how our analysis should run. However, commenting sections of code in this way makes it difficult for others to understand our code and reproduce our results. -Another analyst would not be sure which set of parameters was used to produce a given set of predictions, so we should avoid this form of ambiguity. +We should avoid this form of ambiguity because another analyst would not be sure which set of parameters was used to produce a given set of predictions. Below, we'll look at some better alternatives to storing and switching our analysis parameters. ````{tabs} @@ -144,32 +144,30 @@ We're able to use basic objects (like lists and dictionaries) to group related p We then reference these objects in the analysis section of our script. Our configuration could be extended to include other parameters, including which variables we're selecting to train our model. -However, it is important that we keep the configuration simple and easy to maintain. -Before moving aspects of code to the configuration it's good to consider whether it improves your workflow. -If it is something that is dependent on the computer that you are using (e.g. file paths) or is likely to change between runs of your analysis, -then it's a good candidate for including in your configuration. +However, we must keep the configuration simple and easy to maintain. +Before moving aspects of code to the configuration, consider whether it improves your workflow. +You should include things that are dependent on the computer that you are using (e.g., file paths) or are likely to change between runs of your analysis, in your configuration. ## Use separate configuration files -We can use independent configuration files to take our previous example one step further. +We can take our previous example one step further using independent configuration files. We simply take our collection of variables, containing parameters and options for our analysis, and move them to a separate file. -As we'll describe in the following subsections, these files can be written in the same language as your code or other simple languages. +These files can be written in the same language as your code or other simple languages, as we'll describe in the following subsections. Storing our analysis configuration in a separate file to the analysis code is a useful separation. 
It means that we can version control our code based solely on changes to the overall logic - when we fix bugs or add new features. We can then keep a separate record of which configuration files were used with our code to generate specific results. -We can easily switch between multiple configurations, by providing our analysis code with different configuration files. +We can easily switch between multiple configurations by providing our analysis code with different configuration files. -You may not want to version control your configuration file, -for example if it includes file paths that are specific to your machine or references to sensitive data. -In this case, you should include a sample or example configuration file, so that others can use this as a template to configure the analysis for their own environment. -It is key that this template is kept up to date, so that it is compatible with your code. +You may not want to version control your configuration file if it includes file paths that are specific to your machine or references to sensitive data. +In this case, include a sample or example configuration file, so others can use this as a template to configure the analysis for their own environment. +It is key to keep this template up to date, so that it is compatible with your code. ### Use code files for configuration -To use another code script as our configuration file, we can copy our parameter variables directly from our scripts. +We can copy our parameter variables directly from our scripts to use another code script as our configuration file. Because these variables are defined in the programming language that our analysis uses, it's easy to access them in our analysis script. In Python, variables from these config files can be imported into your analysis script. In R, your script might `source()` the config file to read the variables into the R environment. @@ -180,7 +178,7 @@ In R, your script might `source()` the config file to read the variables into th Many other file formats can be used to store configuration parameters. You may have come across data-serialisation languages (including YAML, TOML, JSON and XML), which can be used independently of your programming language. -If we were to represent our example configuration from above in YAML, this would look like: +If we represent our example configuration from above in YAML, it would look like this: ```yaml input_path: "C:/a/very/specific/path/to/input_data.csv" @@ -194,8 +192,8 @@ prediction_parameters: max_v: 1000 ``` -Configuration files that are written in other languages may need to be read using relevant libraries. -The YAML example above could be read into our analysis as follows: +You can use relevant libraries to read configuration files that are written in other languages. +For example, we could read the YAML example into our analysis like this: ````{tabs} @@ -221,7 +219,7 @@ data <- read.csv(config$input_path) Configuration file formats like YAML and TOML are compact and human-readable. This makes them easy to interpret and update, even without knowledge of the underlying code used in the analysis. Reading these files in produces a single object containing all of the `key:value` pairs defined in our configuration file. -In our analysis, we can then select our configuration parameters using their keys. +We can then select our configuration parameters using their keys in our analysis. 
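To make the code-file option described above concrete, here is a minimal sketch of a configuration module and an analysis script that imports it. This is an illustration rather than the book's own example: the `input_path` and `prediction_parameters` values echo the YAML snippet above, while the file layout and the remaining names are hypothetical.

```python
# config.py - a configuration written as ordinary Python (illustrative sketch).
# `input_path` and `prediction_parameters` echo the YAML example above;
# the other names are hypothetical.
input_path = "C:/a/very/specific/path/to/input_data.csv"
output_path = "predictions.csv"  # hypothetical

random_seed = 42  # hypothetical

prediction_parameters = {
    "constant_a": 7,
    "max_v": 1000,
}
```

```python
# analysis.py - importing that configuration module (hypothetical layout).
import pandas as pd

import config  # assumes config.py sits alongside this script

data = pd.read_csv(config.input_path)
parameters = config.prediction_parameters
```

Because the configuration is just Python, no extra parsing library is needed, although formats like YAML remain easier for non-developers to edit.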
## Use configuration files as arguments @@ -231,8 +229,8 @@ Although this allows us to separate our configuration from the main codebase, we This is not ideal, as for the code to be run on another machine the configuration file must be saved on the same path. Furthermore, if we want to switch the configuration file that the analysis uses we must change this path or replace the configuration file at the specified path. -To overcome this, we can adjust our analysis script to take the configuration file path as an argument when the analysis script is run. -This can be achieved in a number of ways, but we'll discuss a minimal example here: +We can adjust our analysis script to take the configuration file path as an argument when the analysis script is run to overcome this. +We can achieve this in a number of ways, but we'll discuss a minimal example here: ````{tabs} @@ -283,7 +281,7 @@ This means that we don't need to change our code to account for changes to the c ```{note} It is possible to pass configuration options directly as arguments in this way, instead of referencing a configuration file. -However, use of configuration files should be preferred as they allow us to document which configuration +However, you should use configuration files as they allow us to document which configuration has been used to produce our analysis outputs, for reproducibility. ``` @@ -295,15 +293,15 @@ Environment variables are variables that are available in a particular environme In most analysis contexts, our environment is the user environment that we are running our code from. This might be your local machine or dedicated analysis platform. -If your code depends on credentials of some kind, these must not be written in your code. -Passwords and keys could be stored in configuration files, but there is a risk that these files may be included in [version control](version_control.md). -To avoid this risk, it is best to store this information in local environment variables. +If your code depends on credentials of some kind, do not write these in your code. +You can store passwords and keys in configuration files, but there is a risk that these files may be included in [version control](version_control.md). +To avoid this risk, store this information in local environment variables. Environment variables can also be useful for storing other environment-dependent variables. For example, the location of a database or a software dependency. -This might be preferred over a configuration file if very few other options are required by the code. +We might prefer this over a configuration file the code requires very few other options. -In Unix systems (e.g. Linux and Mac), environment variables can be set in the terminal using `export` and deleted using `unset`: +In Unix systems (e.g., Linux and Mac), you can set environment variables in the terminal using `export` and delete them using `unset`: ```none export SECRET_KEY="mysupersecretpassword" @@ -317,9 +315,9 @@ setx SECRET_KEY "mysupersecretpassword" reg delete HKCU\Environment /F /V SECRET_KEY ``` -These can alternatively be defined using a graphical interface under `Edit environment variables for your account` in your Windows settings. +You can alternatively define them using a graphical interface under `Edit environment variables for your account` in your Windows settings. -Once stored in environment variables, these variables will remain available in your environment until they are deleted. 
Once stored in environment variables, these variables will remain available in your environment until you delete them.

You can access this variable in your code like so:

````{tabs}

```{code-tab} py
import os

my_key = os.environ["SECRET_KEY"]
```

```{code-tab} r R
my_key <- Sys.getenv("SECRET_KEY")
```

````

-It is then safer for this code to be shared with others, as it is not possible to acquire your credentials without access to your environment.
+It is then safer for this code to be shared with others, as they can't acquire your credentials without access to your environment.

From 7df59bc9c951efe653a81ec15672c01b261bc5df Mon Sep 17 00:00:00 2001
From: Beth Jones
Date: Mon, 4 Nov 2024 15:43:25 +0000
Subject: [PATCH 29/33] reduced passive sentences in data management section

---
 book/data.md | 58 +++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 30 deletions(-)

diff --git a/book/data.md b/book/data.md
index 5d1c57be..2f465f17 100644
--- a/book/data.md
+++ b/book/data.md
@@ -1,11 +1,11 @@
 # Data management
 
 Data management covers a broad range of disciplines, including organising, storing and maintaining data.
-This management is typically handled by dedicated data architects and engineers, however, we appreciate that analysts are often expected to manage their own data.
+Dedicated data architects and engineers typically handle data management. However, we appreciate that analysts are often expected to manage their own data.
 This section aims to highlight good data management practices,
 so that you can either appreciate how your organisation handles its data or implement your own data management solutions.
 
-In order to reproduce a piece of analysis we need to be able to identify and access the same data that our analysis used.
+We need to be able to identify and access the same data that our analysis used if we are to reproduce a piece of analysis.
 This requires suitable storage of data, with documentation and versioning of the data where it may change over time.
 
 ```{admonition} Key strategies
@@ -22,10 +22,10 @@ The Office for Statistics Regulation provides a standard for
 
 ## Data storage
 
-It is assumed that most data are now stored digitally.
+We assume that most data are now stored digitally.
 Digital data risk becoming inaccessible as technology develops and commonly used software changes.
-Long term data storage should use open or standard file formats.
+Use open or standard file formats for long term data storage.
 There are [recommended formats](https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats.aspx) for storing different data types,
 though we suggest avoiding formats that depend on proprietary software like SPSS, STATA, and SAS.
@@ -47,13 +47,13 @@ alt: A comic strip describing someone sending "data" in the form of a screenshot
 
 .NORM Normal File Format, from [xkcd](https://xkcd.com/2116/)
 ```
 
-Spreadsheets are not suitable for storage of data (or statistics production and modelling processes).
-Issues when using spreadsheets for data storage include:
+Do not use spreadsheets for storage of data (or statistics production and modelling processes).
+Using spreadsheets to store data introduces issues such as:
 
 * Lack of audibility - changes to data are not recorded.
 * Multiple users can't work with a single spreadsheet file at once.
 * They are error prone and have no built in quality assurance.
-* Files become cumbersome when they are large.
+* Large files become cumbersome.
 * Automatic "correction" of grammar and data type, which silently corrupts your data.
 * Converting dates to a different datetime format.
* Converting numbers or text that resemble dates to dates. @@ -67,7 +67,7 @@ See the European Spreadsheet Risks Interest Group document Databases are collections of related data, which can be easily accessed and managed. Each database contains one or more tables, which hold data. Database creation and management is carried out using a database management system (DBMS). -These usually support authorisation of access to the database and support multiple users accessing the database concurrently. +DBMSs usually support authorisation of access to the database and support multiple users to concurrently access the database. Popular open source DBMS include: * SQLite @@ -75,12 +75,12 @@ Popular open source DBMS include: * PostgreSQL * Redis -The most common form of database is a relational database. -Data in the tables of a relational database are linked by common keys (e.g., unique identifiers). +Relational databases are the most common form of database. +Common keys (e.g., unique identifiers) link data in the tables of a relational database. This allows you to store data with minimal duplication within a table, but quickly collect related data when required. Relational DBMS are called RDBMS. -Most DBMS use structured query language (SQL) to communicate with databases. +Most DBMS communicate with databases using structured query language (SQL). ```{admonition} Key Learning :class: admonition-learning @@ -121,7 +121,7 @@ may be a useful introduction to working with databases from Python. ## Documenting data -Without documentation, it is difficult to understand and work with a new dataset. +It is difficult to understand and work with a new dataset without documentation. For our analysis, we should be able to quickly grasp: @@ -131,7 +131,7 @@ For our analysis, we should be able to quickly grasp: * Have these data been validated or manipulated? * How am I ethically and legally permitted to use the data? -This information should be created by data providers and analysts, in the form of documentation. +Data providers and analysts should create this information in the form of documentation. ### Data dictionary @@ -164,8 +164,7 @@ An information asset register (IAR) documents the information assets within your Your department should have an IAR in place, to document its information assets. As an analyst, you might use the register to identify contacts for data required for your analyses. -This form of documentation may not contain detailed information on how to use each data source (provided by data dictionaries), -but an IAR does increase visibility of data flows. +Although this form of documentation may not contain detailed information on how to use each data source (provided by data dictionaries), an IAR does increase visibility of data flows. An IAR may include: * The owner of each dataset. @@ -179,30 +178,29 @@ GOV.UK provides [IAR templates](https://www.gov.uk/government/publications/infor ## Version control data -A key requirement for reproducing your analysis is the ability to identify the data that you used. +To reproduce your analysis, you need to be able to identify the data that you used. Data change over time; Open data and other secondary data may be revised over time or cease to be available with no notice. -The owners of these data can't always be relied on to provide historical versions of their data. +You can't rely on owners of data to provide historical versions of their data. 
-As an analyst, it is your responsibility to ensure that the exact data that you have used can be identified. +As an analyst, it is your responsibility to ensure you identify the exact data that you have used. -Whether using a primary or secondary data source, you should version and document all changes to the data that you use. -Documentation for data versions should include the reason why the version has changed. +You should version and document all changes to the data that you use whether using a primary or secondary data source,. +You should include the reason why the version has changed in documentation for data versions. For example, if an open data source has been recollected, revisions have been made to existing data, or part of the data has been removed. You should be able to generate your analytical outputs reproducibly and, as such, treat them as disposable. -If this is not the case, you should also version outputs so that they can be easily linked to the versioned input data and analysis code. +If this is not the case, version outputs so that you can easily link them to the versioned input data and analysis code. To automate the versioning of data, you might use the Python package [DVC, which provides Git-like version control of data](https://dvc.org/). This tool can also relate the data version to the version of analysis code, further facilitating reproducibility. -Git can be used to version data, but you should be mindful of where your remote repository stores this data. +You can use Git to version data, but you should be mindful of where your remote repository stores this data. The [`daff` package summarises changes in tabular data files](https://github.com/paulfitz/daff), which can be integrated with Git to investigate changes to data. -You might alternatively version your data manually. -For example, by creating new database tables or files for each new version of the data. +You might alternatively version your data manually e.g., by creating new database tables or files for each new version of the data. It must be possible to recreated previous versions of the data, for reproducibility. -As such, it is important that data file versions are named uniquely, for example, using incrementing numbers and/or date of collection. -Additionally, file versions must not be modified after they have been used for analysis - they should be treated as read-only. +As such, it is important that you name data file versions uniquely, for example, using incrementing numbers and/or date of collection. +Additionally, do not modify file versions after they have been used for analysis - they should be treated as read-only. All modifications to the data should result in new versions. ```{todo} @@ -211,14 +209,14 @@ Diagram of good manual data versioning workflow. [#22](https://github.com/best-practice-and-impact/qa-of-code-guidance/issues/22) ``` -Finally, for this to be effective, your analysis should record the version of data used to generate a specified set of outputs. -This might be documented in analysis reports or automatically logged by your code. +Finally, for this to be effective, your analysis should record the version of data you used to generate a specified set of outputs. +You might document this in analysis reports or automatically logged by your code. ## Use these standards and guidance when publishing data You should use the [5-star open data standards](https://5stardata.info/en/) to understand and improve the current utility of your published data. 
-The [CSV on the Web (CSVW) standard](https://csvw.org/) is recommended for achieving the highest ratings of open data.
+We recommend the [CSV on the Web (CSVW) standard](https://csvw.org/) for achieving the highest ratings of open data.

When publishing statistics you should follow government guidance for
[releasing statistics in spreadsheets](https://analysisfunction.civilservice.gov.uk/policy-store/releasing-statistics-in-spreadsheets/).
@@ -227,4 +225,4 @@ When publishing or sharing tabular data, you should follow the

Analysts producing published statistics may also be interested in
[Connected Open Government Statistics (COGS)](https://analysisfunction.civilservice.gov.uk/the-gss-data-project/)
and [the review of government data linking methods](https://www.gov.uk/government/publications/joined-up-data-in-government-the-future-of-data-linking-methods).

-Guidance from the UK Data Service describes [data security considerations](https://www.ukdataservice.ac.uk/manage-data/store/security).
+The UK Data Service provides guidance on [data security considerations](https://www.ukdataservice.ac.uk/manage-data/store/security).

From 5e156ee24e14f5434ef1016d1ece6afa20100b2f Mon Sep 17 00:00:00 2001
From: Beth Jones
Date: Mon, 4 Nov 2024 18:08:39 +0000
Subject: [PATCH 30/33] reduced passive sentences in peer review chapter

---
 book/peer_review.md | 80 ++++++++++++++++++++++-----------------------
 1 file changed, 40 insertions(+), 40 deletions(-)

diff --git a/book/peer_review.md b/book/peer_review.md
index 3b61c7d5..3f67cf3b 100644
--- a/book/peer_review.md
+++ b/book/peer_review.md
@@ -15,9 +15,9 @@ and the degree of validation and verification to which it has been subjected.
```

[The Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government)
-tells us that quality assurance of our analysis should be proportional to the complexity and business risk of the analysis.
-This means that both internal and external peer review may be required to adequately assure your analysis.
-External review is recommended if your analysis uses novel or complex techniques,
+tells us that we should quality assure our analysis proportionately, depending on the complexity and business risk of the analysis.
+This means that you may require both internal and external peer review to adequately assure your analysis.
+We recommend external review if your analysis uses novel or complex techniques,
if comparison with other analyses cannot be used to challenge your results, or if the analysis is business critical.

```{epigraph}
@@ -45,20 +45,20 @@ In more depth:

* Is there duplication in the code that could be simplified by refactoring into functions and classes?
* Are functions and class methods simple, using few parameters?

-As we discussed in [](readable_code.md), good quality code is easier to read, understand and maintain.
-Peer review improves the quality of our code through the constructive challenges of the reviewer.
-As a reviewer, you might do this by suggesting alternative ways to represent the analysis or
+As we discussed in [](readable_code.md), good quality code is easier to read, understand, and maintain.
+Peer review improves the quality of our code through the reviewer's constructive challenges.
+You might do this as a reviewer by suggesting alternative ways to represent the analysis or
by asking about decisions that have been made in the approach to the analysis.
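For example, where the same transformation is repeated for several columns, a reviewer might suggest refactoring it into a single, testable function.
The sketch below is illustrative only; it assumes a pandas DataFrame `data` with hypothetical column names.

```{code-block} python
import pandas as pd

data = pd.DataFrame({"sales_gbp": [10.0, 20.0, 30.0], "costs_gbp": [5.0, 7.0, 9.0]})

# Repeated logic that a reviewer might challenge
data["sales_std"] = (data["sales_gbp"] - data["sales_gbp"].mean()) / data["sales_gbp"].std()
data["costs_std"] = (data["costs_gbp"] - data["costs_gbp"].mean()) / data["costs_gbp"].std()


# One possible refactor suggested in review - a single reusable, testable function
def standardise(column):
    """Return the column centred on zero with unit standard deviation."""
    return (column - column.mean()) / column.std()


data["sales_std"] = standardise(data["sales_gbp"])
data["costs_std"] = standardise(data["costs_gbp"])
```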
Another benefit, particularly of internal review, is knowledge transfer. Both the reviewer and reviewee are exposed to new ideas. -The reviewer must gain a low-level understanding of what the code is doing, in order to validate that it works as expected. +The reviewer must understand what the code is doing to validate that it works as expected. This may provide your team members with the understanding required to use and maintain your code in the future. ### Is the required functionality tested sufficiently? -If there are not tests for each part of the code, then we can't be sure that it works as expected. +If we do not test each part of the code, then we can't be sure that it works as expected. As a reviewer, you should ask whether the amount of testing is proportionate given the risk to the analysis if the code does not work. @@ -72,8 +72,8 @@ In more depth: Most analysis stems from some form of customer engagement. Throughout design, implementation and review of analysis we must continue to assess whether our analysis is fit for purpose: Does it meet the needs of the customer? -A project should document the scope of your analysis and any requirements, to make this assessment as easy as possible. -Additional documentation that supports the auditability of your analysis includes assumption logs, +Document the scope of your analysis and any requirements to make this assessment as easy as possible. +Support the auditability of your analysis with additional documentation, including assumption logs, technical reports describing the analysis and documentation on any verification or validation that has already been carried out. @@ -84,17 +84,17 @@ In more depth: * Have dependencies been sufficiently documented? * Is the code version, input data version and configuration recorded? -A key aspect of review is checking that the same results are acquired when running your code with the same input data. -Assurance that analysis can be reproduced increases the trust in the results. +A key aspect of review is checking that you get the same results when running your code with the same input data. +Assurance that analysis can be reproduced increases the trust in the results. Note that each of these example questions focuses on important quality assurance practices in the code, rather than minor issues like code layout and style. ## Give practical feedback -Feedback should be practical and constructive. +Provide practical and constructive feedback. For example, you should suggest an improvement or alternative that the developer may consider and learn from. -Although it may be necessary to highlight specific examples, you should avoid making feedback personal. +You should avoid making feedback personal, although it may be necessary to highlight specific examples. The CEDAR feedback model can be a useful framework for structuring review comments. This model breaks review down into five sections: @@ -113,11 +113,11 @@ This approach has been designed from a coaching or mentoring perspective, and ca When you identify issues with code functionality or quality you should suggest improvements to solve these issues. The code developer may respond to your comments to justify why they have taken a certain approach, otherwise they should act on your feedback before using the analysis. -It may be necessary to re-review changes that have resulted from your initial review, to confirm that your feedback has been actioned appropriately. 
+You may need to re-review changes that have resulted from your initial review, to confirm that your feedback has been actioned appropriately. ```{important} -It is essential that you document what you have considered in your review and how the developer has responded to your feedback. -This documentation should be kept close to the relevant version of the code, so that others can see what has been reviewed. +Document what you have considered in your review and how the developer has responded to your feedback. +You should keep this documentation close to the relevant version of the code, so that others can see what has been reviewed. The easiest way to manage this is by using the Merge Request or Pull Request feature on your version control platform to conduct the review. See [](version_control.md) for more information. ``` @@ -164,23 +164,23 @@ These might include, but not exclusively: - additional tests that should be implemented (do the tests effectively assure that it works correctly?) - code style improvements (could the code be written more clearly?) -Your suggestions should be tailored to the code that you are reviewing. +Tailor your suggestions to the code that you are reviewing. Be critical and clear, but not mean. Ask questions and set actions. ``` -Internal review should be carried out regularly within the development team. +Carry out internal review regularly within the development team. Reviewing code written by those with more and less experience than you is beneficial for both reviewer and developer. -Similar questions can be asked from both perspectives, for the reviewer to get a good understanding of the approach and decisions behind the analysis. +You can ask similar questions from both perspectives, for the reviewer to get a good understanding of the approach and decisions behind the analysis. ## Give timely feedback -It's important that feedback is provided in good time, so that the review process does not hold up development of the code +Provide feedback in good time so the review process does not hold up development of the code and that issues can be addressed before more code is written using the same practices. We strongly recommend applying pair programming for code review, as the most timely and practical method. -However, a separate review process may be necessary when multiple developers are not available to work at the same time. +However, you may find a separate review process necessary when multiple developers are not available to work at the same time. ### Review through pair programming @@ -193,16 +193,16 @@ Here, two or three developers work together to write a single piece of code. Each developer takes turns to actively author parts of the code, while others provide real-time feedback on the code being written. This practice encourages developers to consider why they are writing code in a particular way and to vocalise this ("programming out loud"). -Additionally, it gives reviewers a chance to suggest improvements and question the author's approach as the code is written. -Working in this way can be more efficient than reviewing code separately - issues are identified sooner, so they are easier to fix. +Additionally, it gives reviewers a chance to suggest improvements and question the author's approach as they write the code. +Working in this way can be more efficient than reviewing code separately - you identify issues sooner, so they are easier to fix. 
Despite the upfront cost of two individuals writing the code, the resulting code is often higher quality and contains fewer bugs. The rotational aspect of pair programming ensures that all team members gain experience from both the author and review perspective. From both angles, you'll learn new programming and communication techniques. -In addition to this, sharing knowledge of how the code works across the team prevents too much risk being put on individuals. +Additionally, sharing knowledge of how the code works across the team prevents putting too much risk on individuals. -Developers working in pairs can approve changes to code as it is written, however, -key discussions from pair programming sessions should still be documented to demonstrate which aspects of the code have been reviewed and discussed. +Developers working in pairs can approve changes to code as it is written. However, you should still document +key discussions from pair programming sessions to demonstrate which aspects of the code have been reviewed and discussed. This blog post from the Government Digital Service provides [more detailed steps to apply pair programming](https://gds.blog.gov.uk/2018/02/06/how-to-pair-program-effectively-in-6-steps/). @@ -222,22 +222,22 @@ Similarly to pair programming, reviewing small changes to code allows you to cat If a project is only reviewed when all of the code has been written, this significantly reduces the benefit of review. This creates a much larger burden on the reviewer. -Additionaly, any issues that are identified may take a lot of time to fix across the whole project. +Additionally, fixing any issues that are identified may take a lot of time to fix. A reviewer might highlight that certain quality assurance practices have not been used - for example, there has not been enough documentation or automated testing in the project. -It would take a substantial amount of effort to add documentation and testing for the whole project. +Adding documentation and testing for the whole project would take substantial effort. If this was identified earlier, the improved practices could be applied as the remaining code is developed. ``` -When you must carry out a review of larger or complete pieces of work, it may be worth reviewing different aspects of the code in separate sessions. -For example, focussing on documentation in one session and functionality in the next. +When you must carry out a review of larger or complete pieces of work, consider reviewing different aspects of the code in separate sessions. +For example, reviewing documentation in one session and functionality in the next. The thought of someone else reviewing your code in this way encourages good practices from the outset: -* Clear code and documentation - so that others with no experience can use and test your code. -* Usable dependency management - so that others can run your code in their own environment. +* Clear code and documentation - so others with no experience can use and test your code. +* Usable dependency management - so others can run your code in their own environment. -Most version control platforms have features that can aid separate review. See [](version_control.md) for more information. +Separate review is aided by version control platforms' features. See [](version_control.md) for more information. 
#### Case study - rOpenSci review @@ -254,13 +254,13 @@ The [`goodpractice` R package](http://mangothecat.github.io/goodpractice/) is us is commonly used to carry out automated checks on code repositories. The reports from these checks can save reviewers time, by providing indicators of things like code complexity and test coverage. -Two detailed external reviews are then conducted before the package is accepted - these reviews include additional checks for common aspects of code packages, -like documentation, examples and automated testing. -Perhaps the most informative part of these reviews, however, is the detailed bespoke comments. -Here the reviewers highlight problems, ask questions to clarify aspects of the package design and suggest improvements to the implementation of the code +Two external reviews conduct reviews before the package is accepted - these reviews include checking common aspects of code packages, +like documentation, examples, and automated testing. +Perhaps the most informative part of these reviews is the detailed bespoke comments. +Here the reviewers highlight problems, ask questions to clarify aspects of the package design, and suggest improvements to the implementation of the code (with examples). -Following the reviews, additional comments describe how the reviewers requested changes have been addressed. -And finally, there is a sign off to confirm that the reviewers are satisfied with the package. +Following the reviews, the authors wrote comments describing how the reviewers requested changes were addressed. +And finally, a sign off confirmed that the reviewers are satisfied with the package. Although this review looked at an entire, mature package, you can apply parts of this review process to smaller pieces of code as required. From c6985b377acb8649709d9115545cac9eccb7d016 Mon Sep 17 00:00:00 2001 From: Osborn Date: Wed, 20 Nov 2024 15:47:01 +0000 Subject: [PATCH 31/33] edited passive voice and simplified some parts in automating qa chapter --- book/continuous_integration.md | 54 +++++++++++++++++----------------- 1 file changed, 27 insertions(+), 27 deletions(-) diff --git a/book/continuous_integration.md b/book/continuous_integration.md index 609003d1..367d9e33 100644 --- a/book/continuous_integration.md +++ b/book/continuous_integration.md @@ -2,21 +2,21 @@ ## Motivation -There are various tasks which can be automated to increase the quality of code and make development easier and less tedious. -Automating the running of unit tests is especially important to ensuring trust in your pipeline or package, by ensuring that all unit tests pass before every merge. +You can automate various tasks to increase the quality of code and make development easier and less tedious. +Automating the running of unit tests is especially important to establish trust in your pipeline or package, by ensuring that all unit tests pass before every merge. (automate-tests)= ### Automate tests to reduce risk of errors -Tests should be run whenever you make changes to your project. +You should run tests whenever you make changes to your project. This ensures that changes do not break the existing, intended functionality of your code. However, it is easy to forget to run your tests at regular intervals. -"Surely this could be automated too?" +"Surely I can automate this too?" -Absolutely! Automatic testing, amongst other quality assurance measures, can be triggered when changes are made to your remote version control repository. 
-These tools can be used to ensure that all changes to a project are tested. +Absolutely! Automatic testing, amongst other quality assurance measures, can be triggered when you make changes to your remote version control repository. +You can use these tools to ensure that all changes to a project are tested. Additionally, it allows others, who are reviewing your code, to see the results of your tests. @@ -24,7 +24,7 @@ Additionally, it allows others, who are reviewing your code, to see the results ## Commit small changes often Committing small changes regularly is often referred to as Continuous Integration (CI). -This can be achieved easily through the use of [version control](version_control.md), such as Git. +You can achieve this easily through the use of [version control](version_control.md), such as Git. You should commit every time you make a working change. Fixed a typo? Commit. Fixed a bug? Commit. Added a function? Commit. Added a test? Commit. @@ -32,22 +32,22 @@ As a very rough guide, you should expect to commit a few times each hour and pus If the task is unfinished at the end of the day, you should consider if the task has been sufficiently broken down. CI should be underpinned by automating routine code quality assurance tasks. -This quality assurance includes verifying that your code successfully builds or installs and that your [code tests](testing_code.md) run successfully. -these can be achieved in a number of ways such as use of Git hooks and workflows. +This quality assurance includes verifying that your code successfully builds or installs. It also ensures your [code tests](testing_code.md) run successfully. +You can achieve this in a number of ways such as use of Git hooks and workflows. ## Use Git hooks to encourage good practice [Git hooks](https://git-scm.com/docs/githooks) are scripts that can be set to run locally at specific points in your Git workflow, such as pre-commit, pre-push, etc. -They can be used to automate code quality assurance tasks, e.g., run tests, ensure style guides are followed, or enforce commit standards. +You can use them to automate code quality assurance tasks, e.g., run tests, follow style guides, or enforce commit standards. -For example, we might set up a `pre-commit` or `pre-push` hook that runs our tests before we make each commit or push to the remote repository. -This might stop our commit/push if the tests fail, so that we don't push breaking changes to our remote repository. +For example, you could set up a `pre-commit` or `pre-push` hook that runs your tests before you make each commit or push to the remote repository. +This might stop your commit/push if the tests fail, so that you won't push breaking changes to your remote repository. ```{note} -If your code is likely to be run on a range of software versions or operating systems, you might want to test on a variety of these. -Tools exists to support local testing of combinations software versions and package dependency versions: +If your code is likely to be run on a range of software versions or operating systems, you can test on a variety of these. 
+Tools exist to support local testing of combination software versions and package dependency versions: * [tox](https://tox.readthedocs.io/en/latest/) or [nox](https://nox.thea.codes/en/stable/) for Python * [rhub](https://r-hub.github.io/rhub/) for R @@ -59,7 +59,7 @@ Tools exists to support local testing of combinations software versions and pack Style guides are important in the readability and clarity of your code and should form part of your quality assurance process. However, as discussed in [](automate-style-checks), the process of checking and fixing code for style and formatting is tedious. -Automation can speed up this work, either by providing suggestions as the code is written or by reformatting your code to comply with your chosen style. +Automation can speed up this work, either by providing suggestions as you write the code or by reformatting your code to follow your chosen style. Two main types of tool exist for these tasks: @@ -84,18 +84,18 @@ Two main types of tool exist for these tasks: - ``` -These tools can be used locally (in the command line) or as git pre-commit hooks. -As described above, using pre-commit hooks allows you to run these automatically every time there are changes, -thus reducing the burden on developers and reviewers in checking that code conforms to style guides. +You can use these tools locally (in the command line) or as git pre-commit hooks. +As described above, using pre-commit hooks allows you to run these automatically every time there are changes; +this can reduce the burden on developers and reviewers checking that code conforms to style guides. -Be sure to read the documentation for any of these tools, to understand what they are checking or changing in your code. -Some can be configured to ignore or detect specific types of formatting error. -You can run multiple of these, to catch a broader range of stylistic or programmatic errors. +Ensure you read the documentation for any of these tools to understand what they are checking or changing in your code. +You can configure some of them to ignore or detect specific types of formatting error. +You can also run multiples of these to catch a broader range of stylistic or programmatic errors. ## Workflows -GitHub Actions and GitLab Pipelines both offer the ability to define custom workflows using YAML. -Workflow refers to a defined sequence of steps and actions that need to be performed +GitHub Actions and GitLab Pipelines are both able to define custom workflows using YAML. +Workflow refers to a defined sequence of steps and actions that you need to perform to complete a specific task or process. Workflows are commonly used in software development to automate repetitive or complex tasks, such as building and deploying software, testing code, and managing code reviews. @@ -147,16 +147,16 @@ jobs: pytest ``` -The first section of this example describes when our workflow should be run. +The first section of this example describes when we should run our workflow. In this case, we're running the CI workflow whenever code is `push`ed to the `master` branch or where any Pull Request is created. In the case of Pull Requests, the results of the CI workflow will be report on the request's page. If any of the workflow stages fail, this can block the merge of these changes onto a more stable branch. Subsequent commits to the source branch will trigger the CI workflow to run again. -Below `jobs`, we're defining what tasks we would like to run when our workflow is triggered. 
+Below `jobs`, we're defining what tasks we would like to run when we trigger our workflow. We define what operating system we would like to run our workflow on - the Linux operating system `ubuntu` here. The `matrix` section under `strategy` defines parameters for the workflow. -The workflow will be repeated for each combination of parameters supplied here - in this case 4 recent Python versions. +We will repeat the workflow for each combination of parameters supplied here - in this case 4 recent Python versions. The individual stages of the workflow are defined under `steps`. `steps` typically have an informative name and run code to perform an action. @@ -170,7 +170,7 @@ This workflow will report whether our test code ran successfully for each of the #### Configure GitHub actions to build and deploy documentation -It is important to maintain the documentation relating to your project to ensure contributors and users can understand, maintain and use your product correctly. +It is important to maintain the documentation relating to your project to ensure contributors and users can understand, maintain, and use your product correctly. One basic way of doing this is maintaining markdown files within a GitHub repository. However, multiple tools exist that can transform these markdown files into HTML content. A popular tool for building and deploying HTML documentation is [Sphinx](https://www.sphinx-doc.org/en/master/). From 71ec24361b20f43bce6cb677c25479c2d73cc8cb Mon Sep 17 00:00:00 2001 From: Osborn Date: Wed, 15 Jan 2025 15:16:26 +0000 Subject: [PATCH 32/33] Reduced passive voice in testing chapter. Some grammar edits and simplifcation, links tested --- .Rhistory | 3 + book/testing_code.md | 199 +++++++++++++++++++++---------------------- 2 files changed, 102 insertions(+), 100 deletions(-) create mode 100644 .Rhistory diff --git a/.Rhistory b/.Rhistory new file mode 100644 index 00000000..9b7e4e96 --- /dev/null +++ b/.Rhistory @@ -0,0 +1,3 @@ +set wd ("D:/repos/qa-of-code-guidance") +set wd (D:/repos/qa-of-code-guidance) +setwd ("D:/repos/qa-of-code-guidance") diff --git a/book/testing_code.md b/book/testing_code.md index efe72cb6..cdca7c65 100644 --- a/book/testing_code.md +++ b/book/testing_code.md @@ -2,11 +2,11 @@ Code documentation helps others to understand what you expect your code to do and how to use it. Code tests verify that your analytical code is working as expected. -Without carrying out tests we cannot confirm that our code works correctly, so we cannot be confident that our analysis is fit for purpose. +You cannot confirm your code works correctly if you don’t carry out tests, so you cannot be confident that your analysis is fit for purpose without them. Good tests tell a story - given this data, having run this code, we expect this output. Testing brings strong benefits. It helps you assure your code quality and makes developing your code more efficient. -Code that has not been tested is more likely to contain errors and require more maintenance in the future. +Code that has not been tested is more likely to contain errors and need more maintenance in the future. ## What should I test? @@ -16,12 +16,12 @@ How can I demonstrate that my code does what it is supposed to do? As the developer of the code, you are best placed to decide what tests you need to put in place to answer that question confidently. -Take a risk-based approach to testing. Tests should be used proportionately for your analysis. 
This usually means writing more tests for parts of your code that are very new, more complex or carry more risk. +Take a risk-based approach to testing. Tests should be used proportionately for your analysis. This usually means writing more tests for parts of your code that are very new, more complex, or carry more risk. When you are developing your tests, here are some points to think about: -1. You don't need to test everything. It is usually reasonable to assume that third party functions and tools which are sufficiently quality assured (and you can verify this) work as intended. For example, if you use R you would not expect to write tests to verify that simple arithmetic, base R or packages published on [CRAN](https://cran.r-project.org/) operate correctly, because there is already sufficient assurance. You may be less confident about very new functionality from third parties, or experimental tools. Here, you might decide you do need to do some extra validation. -2. Think carefully about whether third party tools really do what is needed for your particular context. For example, the base R `round()` function intentionally behaves differently to the rounding function in Excel. While we can be confident that `round()` works as specified, does it produce what you need? +1. You don't need to test everything. It is realistic to assume that third party functions and tools which are adequately quality assured (and you can verify this) work as intended. For example, if you use R you would not expect to write tests to verify that simple arithmetic, base R, or packages published on [CRAN](https://cran.r-project.org/) operate correctly, because there is already sufficient assurance. You may be less confident about very new functionality from third parties, or experimental tools. Here, you might decide you do need to do some extra validation. +2. Think carefully about whether third party tools really do what you need for your particular context. For example, the base R `round()` function intentionally behaves differently to the rounding function in Excel. While we can be confident that `round()` works as specified, does it produce what you need? 3. Testing is a great way to verify that your approach is the right one. By thinking about what to test, you challenge your own assumptions and the way you have done things. This can reveal issues or scenarios that you had not considered. It means the code you write should be more resilient. 4. Be guided by the risks you need to mitigate. For example, if inputs are invalid or unusual, do you want the code to stop with an error message or do something else? Use tests to check that the code does the right thing at the right time. @@ -29,49 +29,49 @@ When you are developing your tests, here are some points to think about: Tests come in many shapes and sizes, but usually follow the pattern: -1. Arrange - set up any objects needed for your test, e.g. example input data and expected output data. +1. Arrange - set up any objects needed for your test, for example sample input data and expected output data. 2. Act - run the code that you are testing (one or more functions or methods). -3. Assert - verify that the code performed the expected action, e.g. that the output matched the expected output. +3. Assert - verify that the code performed the expected action, for example that the output matched the expected output. 
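As a minimal `pytest` sketch of this pattern, using a hypothetical `add_two` function (in practice you would import the function from your own package):

```{code-block} python
def add_two(number):
    """Hypothetical function under test."""
    return number + 2


def test_add_two_returns_expected_value():
    # Arrange - set up the input and the output we expect
    input_value = 3
    expected_output = 5

    # Act - run the code that we are testing
    actual_output = add_two(input_value)

    # Assert - verify that the code did what we expected
    assert actual_output == expected_output
```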
```{admonition} Key Learning :class: admonition-learning -You should follow the [Introduction to Unit Testing course](https://learninghub.ons.gov.uk/course/view.php?id=1171) for applied examples in Python and R. +Follow the [Introduction to Unit Testing course](https://learninghub.ons.gov.uk/course/view.php?id=1171) for applied examples in Python and R. This course also covers writing and documenting functions, and error handling. Other useful learning resources include: -* [`pytest` getting started](https://docs.pytest.org/en/3.0.1/getting-started.html) +* [`pytest` getting started](https://docs.pytest.org/en/stable/getting-started.html) * Real Python [Getting Started With Testing in Python](https://realpython.com/python-testing/) * Hadley Wickham's [testthat: getting started with testing](https://vita.had.co.nz/papers/testthat.pdf) and [testing design in R](https://r-pkgs.org/testing-design.html) ``` -In this section we assume that you are using a testing framework to run your tests (e.g. `pytest` for python or `testthat` for R) and have your code in a package. -Code that is not in a package can be more difficult to test and follow the testing good practices described here. +In this section we assume that you are using a testing framework to run your tests (for example `pytest` for python or `testthat` for R) and have your code in a package. +It is more difficult to test code that is not in a package and therefore follow the testing good practices described here. ## Write reproducible tests -As analysts we routinely check that our analysis is carried out correctly. -This might be done informally by running all or part of our analysis with example data or subsets of real data. +As an analyst, you routinely check that your analysis is carried out correctly. +You might do this informally by running all or part of your analysis with example data or subsets of real data. -These tests give us confidence that our analysis is correct. -However, it's important that we are able to reproduce the same checks against our code reproducibly. -Code changes over time, so we need to be able to repeat these checks against the updated code. +These tests give you confidence that your analysis is correct. +However, it's important you are able to produce the same checks against your code reproducibly. +Code changes over time, so you need to be able to repeat these checks against the updated code. Additionally, other analysts should be able to carry out the same checks and get the same results. -Representing our tests as code allows us to consistently repeat the same steps. -This lets us or another analyst carry out the same verification again to get the same results. +Representing your tests as code allows you to consistently repeat the same steps. +This lets you or another analyst carry out the same verification again to get the same results. When you have carried out a test manually, you should ensure that you add a code test to reproduce this. -Code that we write for testing should also follow the good practices described earlier on in this book, in particular [](readable_code). +Code that you write for testing should also follow the good practices described earlier on in this book, in particular [](readable_code). ## Write repeatable tests -For us to be able to trust the results of our tests we need them to be repeatable. -That is for them to give the same outcome if we run them more than once against the same version of our analysis code. 
+For you to be able to trust the results of your tests you need them to be repeatable. +This means they should give the same outcome if you run them more than once against the same version of your analysis code. For tests to run repeatably each test must be independent. There should not be a shared state between tests, for example a test should not depend on another test having already run. -Many test runners will intentionally randomise the order that tests are executed to encourage this. +You could intentionally randomise the order that tests are executed to encourage this. Where possible, tests should be deterministic. As such, the only reason for a test to fail should be that the code being tested is incorrect. @@ -79,17 +79,17 @@ Where your code relies on randomness tests should reuse the same random seed eac Where this is not possible/logical for the scenario that you are testing, you may want to run the test case multiple times and make an assertion about the distribution of the outcomes instead. -For example, if we are testing a function that simulates a coin flip we might run it 100 times and +For example, if you are testing a function that simulates a coin flip you might run it 100 times and check the proportion of heads versus tails is close to half (within a reasonable range). ## Run all tests against each change to your code -All tests should be run whenever you make changes to your analysis. +Run **all** tests whenever you make changes to your analysis. This ensures that changes do not break the existing, intended functionality of your code. Running the entire collection of tests has the added benefit of detecting unexpected side-effects of your changes. -For example, you might detect an unexpected failure in part of your code that you didn't even change. +For example, you might detect an unexpected failure in part of your code that you didn't change. -Running tests regularly allows you to fix any issues before changes are added to a stable or production version of your code (e.g. the `main` Git branch). +Running tests regularly allows you to fix any issues before changes are added to a stable or production version of your code (e.g. the `main` Git branch). If you have altered the functionality of your code, this will likely break existing tests. Failing tests here act as a good reminder that you should update your tests and documentation to reflect the new functionality. @@ -97,15 +97,14 @@ Many testing frameworks support writing tests as examples in the function docume It's not easy to remember to run your tests manually at regular intervals. And you're right to think "surely this could be automated too?". - You should use [continuous integration](continuous-integration) to automate -the running of tests. -Using these tools, tests can be triggered to run when any changes are made to your +Use [continuous integration](continuous-integration) to automate +the running of tests. This way, tests can be triggered to run when any changes are made to your remote version control repository. ## Record the outcomes of your tests -For auditability, it is important that the outcome from running tests is recorded. -You should record the test outcomes with the code, so that +For auditability, it is important you record the outcome from running tests. +Record the test outcomes with the code, so that it is clear which tests have been run successfully for a given version of the code. 
As mentioned above, automating the running of tests on a version control platform is the simplest and @@ -155,7 +154,7 @@ def test_sum_columns(): assert_frame_equal(expected_output, actual_output) ``` -However, we can still conduct the same test with much less data like so: +However, you can still conduct the same test with much less data like so: ```{code-block} python ... @@ -176,19 +175,19 @@ def test_sum_columns(): assert_frame_equal(expected_output, actual_output) ``` -Using minimal and general data in the test has made it clearer what is being tested, and also avoids any unnecessary disclosure. -In this case our function is very generic, so our test doesn't need to know the names of real columns in our data or even have similar values in the data. +Using minimal and general data in the test has made it clearer what is being tested, and also avoids any unnecessary disclosure. +In this case the function is very generic, so the test doesn't need to know the names of real columns in our data or even have similar values in the data. The test data are focussed on testing specific, realistic cases. This makes it easy to see that this function works correctly with positive, negative and zero values. -Note that the way we write our test affects how the function is implemented. +Note that the way you write your test affects how the function is implemented. Using minimal, generalised data encourages you to follow good practices when designing your function. This function doesn't know the name of the columns that it will use in advance, so they are passed as parameters. This makes the function reusable. -We might have named the original function `sum_sales_columns`, but the more general name used here makes it clear -that we could use this to sum columns in any other context. +You might have named the original function `sum_sales_columns`, but the more general name used here makes it clear +that you could use this to sum columns in any other context. -We used a single test function above, but could have created separate tests for each scenario and included tests for more than two input columns, for example. +The example above is a single test function, but could have created separate tests for each scenario and included tests for more than two input columns, for example. ## Structure test files to match code structure @@ -249,7 +248,7 @@ project/ ````` The Python example above has one file containing unit tests for each module (group of related functions and classes). -When using this structure you may want to also group multiple test functions into test classes. +When using this structure, you may want to also group multiple test functions into test classes. Having one test class per function/class that you are testing will make it clear that the group of tests relates to that function or class in your source code. ```{code-block} python @@ -275,13 +274,13 @@ class TestSum: ... ``` -Using classes for unit tests has a number of additional benefits, this allows us to reuse the same logic either by class inheritance, or through fixtures. +Using classes for unit tests has many additional benefits, allowing reuse of the same logic either by class inheritance, or through fixtures. Similar to fixtures, -we are able to use the same pieces of logic through class inheritance in python. -However, it should be noted that it is easier to mix up and link unit tests when using class inheritance. +you are able to use the same pieces of logic through class inheritance in python. 
+Note that it is easier to mix up and link unit tests when using class inheritance. The following code block demonstrates an example of class inheritance which will inherit both the variable and the `test_var_positive` unit test, meaning three unit tests are run. -We are able to overwrite the variable within the subclass at any time, but will still inherit defined functions/tests from the parent class. +You can overwrite the variable within the subclass at any time, but will still inherit defined functions/tests from the parent class. ```{code-block} python class TestBase: @@ -309,7 +308,7 @@ Use the approach that makes it easiest for developers to identify the relationsh Note that some test frameworks allow you to keep the tests in the same file as the code that is being tested. This is a good way of keeping tests and code associated, -but you should ensure that good modular code practices are followed to separate unrelated code into different files. +but you should follow good modular code practices to separate unrelated code into different files. Additional arguments are made to separate tests and functions when you are packaging your code. If unit tests and code are located together in the same file, the unit tests would also be packaged and installed by additional users. @@ -317,37 +316,37 @@ Therefore when packaging code, the unit tests should be moved to an adjacent test folder as users will not need to have unit tests installed when installing the package. When separating unit tests into main package and testing scripts, it is important to import your package to ensure the correct functions are being unit tested. -For the module structure outlined previously, we would use `from src.math import my_math_function`. +For the module structure outlined previously, use `from src.math import my_math_function`. For R you need to specify the name of your package within the `testthat.R` file within your tests folder. ## Structuring tests -In order to maintain a consistency across modules you develop, you should follow [PEP8](https://www.python.org/dev/peps/pep-0008/) (python) +To maintain a consistency across modules you develop, you should follow [PEP8](https://peps.python.org/pep-0008) (python) or [Google](https://google.github.io/styleguide/Rguide.html) / [tidyverse](https://style.tidyverse.org/) (R) standards when structuring unit tests. For python this involves importing all needed functions at the beginning of the test file. -To ensure that the correct functions are imported from your module, +To ensure you import the correct functions from your module, it is also recommended to install a local editable version into your virtual environment. -This is done by running `pip install -e .` and any changes made to your +Run `pip install -e .` and any changes made to your module functions will also be updated in your python environment. Following this it is recommended to define fixtures, classes and then test functions. -An example of this can be found below. +An example of this is below. More information can be found in Real Python [Getting Started With Testing in Python](https://realpython.com/python-testing/). -Similar structure should be followed in R, with all modules loaded in the beginning of a test script. +You should follow a similar structure in R, with all modules loaded in the beginning of a test script. Test contexts and then functions should be defined in turn as shown above. For more information see [testing design in R](https://r-pkgs.org/testing-design.html). 
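As a sketch, a Python test file following this ordering might look like the example below.
It assumes that the hypothetical `my_math_function` mentioned above sums a list of numbers and raises a `ValueError` for empty input.

```{code-block} python
# tests/test_math.py
import pytest

from src.math import my_math_function


@pytest.fixture
def example_numbers():
    """Fixtures are defined first, so the tests below can share set-up."""
    return [1, 2, 3]


class TestMyMathFunction:
    """Test classes and functions follow, mirroring the order of the source module."""

    def test_sums_numbers(self, example_numbers):
        assert my_math_function(example_numbers) == 6

    def test_raises_for_empty_input(self):
        with pytest.raises(ValueError):
            my_math_function([])
```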
-Generally tests within the same file should follow some structure or order. +Generally, tests within the same file should follow some structure or order. We recommend that the order that functions are defined in the main script is also mirrored within the test scripts. -This will be easier for future developers to debug and follow. -It also ensures that no functions have been missed and do not have unit tests written. +This will be easier for future developers to debug and follow and +ensures that no functions have been missed and do not have unit tests written. ## Test that new logic is correct using unit tests -When we implement new logic in code, tests are required to assure us that the code works as expected. +When you implement new logic in code, tests are required to assure that the code works as expected. To make sure that your code works as expected, you should write tests for each individual unit in your code. A unit is the smallest modular piece of logic in the code - a function or method. @@ -360,50 +359,50 @@ Unit tests should cover realistic use cases for your function, such as: When your function documentation describes the expected inputs to your function, there is less need to test unexpected cases. If misuse is still likely or risky, then providing the user with an error is the best approach to mitigate this risk. -Logic that is reused from an existing package that is already tested does not require tests when we use that logic alone. +Reusing logic from an existing package that is already tested does not require tests when we use that logic alone. You should be aware of whether your dependencies are sufficiently tested. Newly developed packages or those with very few users are more likely to not be thoroughly tested. ## Test that different parts of the code interact correctly using integration tests -We define integration tests as those that test on a higher level than a unit. This includes testing that: +Integration tests are those that test on a higher level than a unit. This includes testing that: * multiple units work together correctly * multiple high level functions work together (e.g. many units grouped into stages of a pipeline) * the analysis works with typical inputs from other systems -Integration tests give us assurance that our analysis is fit for purpose. -Additionally, they give us safety when refactoring or rearranging large parts of code. -Refactoring is an important part of managing the complexity of our analysis as it grows. +Integration tests give you assurance that your analysis is fit for purpose. +Additionally, they give you safety when refactoring or rearranging large parts of code. +Refactoring is an important part of managing the complexity of your analysis as it grows. -We can similarly consider a high level stage of an analysis pipeline. -If we have a stage responsible for imputing missing values, we might create integration tests to check that all values are -imputed and that particular imputation methods were used for specific cases in our test data. -When changes are made to individual imputation methods we might not expect these general characteristics to change. +You can similarly consider a high level stage of an analysis pipeline. +If you have a stage responsible for imputing missing values, you can create integration tests to check that all values are +imputed and that you used particular imputation methods for specific cases in your test data. 
+When changes are made to individual imputation methods, you might not expect these general characteristics to change.
This test helps to identify cases where this has inadvertently changed.

```{note}
Integration tests are more robust when they focus on general, high-level outcomes that you don't expect to change often.
Integration tests that check very specific outcomes will need to be updated with any small change to the logic within the part that is being tested.
```

## Test that the analysis runs as expected using end-to-end tests

End-to-end testing (sometimes called system testing) checks the entire workflow from start to finish, ensuring all components work correctly in real-world scenarios. While integration testing focuses on the interaction of specific modules, end-to-end testing involves all elements of a pipeline. This is useful when refactoring code, for example, by providing assurance that overall functionality remains unchanged.

For example, a piece of analysis might have an end-to-end test to check that outputs are generated and the data are the right shape or format. There might also be a "regression" test that checks that the exact values in the output remain the same. After you make any changes to tidy up or refactor the code, these end-to-end tests can be run to assure that no functionality has accidentally changed.

You can also use end-to-end tests to quality assure a project from an end user's perspective; these should be run in an environment that replicates the production environment as closely as possible. This type of testing can catch errors that individual unit tests might miss and confirms that the output is fit for purpose and the user requirements are met. End-to-end testing is a form of 'black box' testing, meaning the tester verifies functionality without focusing on the underlying code. It is therefore important to use end-to-end testing alongside other forms of testing, such as unit tests.
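A sketch of such an end-to-end test, assuming a hypothetical `run_pipeline` function that reads a CSV file and returns a summary table, might focus on high-level properties rather than exact values:

```{code-block} python
import pandas as pd

from my_analysis import run_pipeline  # hypothetical pipeline entry point


def test_pipeline_runs_end_to_end(tmp_path):
    # A small input file that mirrors the structure of the real data
    input_path = tmp_path / "input.csv"
    pd.DataFrame({"region": ["A", "B"], "sales": [100, 250]}).to_csv(input_path, index=False)

    output = run_pipeline(input_path)

    # High-level properties of the output that should rarely change
    assert not output.empty
    assert list(output.columns) == ["region", "total_sales"]
    assert output["total_sales"].notna().all()
```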
## Good practices for integration and end-to-end testing When devising an integration or end-to-end testing it’s important to follow these good practices: -- Planning ahead: before you start testing, have a clear plan of what you want to test and how. +- Planning ahead: Have a clear plan of what you want to test and how before you start. - Testing Early: Start testing integration as soon as parts are combined rather than waiting until everything is finished. This helps catch issues sooner. - Use Real Data: Whenever possible, use real data in your tests to make sure everything behaves like it would in the real world. When not possible, make sure the test data reflect the complexities of real data. - Automate tests: Automate your integration tests. This makes it easier to run them frequently and catch problems quickly. -- Checking dependencies: make sure to test how different components rely on each other, as issues can arise there. +- Checking dependencies: Make sure to test how different components rely on each other, as issues can arise there. - Test for failures: don’t just test for success; also check how the system behaves when things go wrong. This helps ensure it handles errors gracefully. - Keep tests isolated: Try to isolate tests so that one failure doesn’t affect another, making it easier to pinpoint issues. - Document: Keep a record of tests, results, and issues found. This helps with future testing and understanding what changes might affect integration. @@ -427,10 +426,10 @@ Writing tests in this way means that tests evaluate how your code handles an out This is another benefit of enforcing isolation in unit tests - helping you understand when errors are coming from an external system, and when they're coming from your code. The unit of code being tested is referred to as the 'System Under Test' (SUT). -One way of achieving this is with mocking. This is where a response from an outside system is replaced with a mock object that your code can be tested against. +One way of achieving this is with mocking. This is where a response from an outside system is replaced with a mock object that you can test your code against. In this example, there's a function making an API request in `src/handle_api_request.py`, and two test functions in `tests/test_handle_api_request.py`. The response from `requests.get()` is mocked with a `Mock()` object, to which `text` and `status_code` attributes are assigned. -The `get_response()` function can now be evaluated for how it handles successful and unsuccessful requests; but thanks to the mocking, get requests are not made to `http://example.com`. +You can now evaluate the `get_response()` function for how it handles successful and unsuccessful requests; but thanks to the mocking, get requests are not made to `http://example.com`. ```{code-block} python # AI has been used to produce content within this artefact. @@ -513,7 +512,7 @@ def test_get_response_fail(mock_requests_get): ``` -You may also consider using fixtures to make test code more concise by generating the necessary `Mock` object attributes dynamically. [`Mock.reset_mock()`](https://docs.python.org/3/library/unittest.mock.html#unittest.mock.Mock.reset_mock) is used to remove the attributes associated with a mock object between different test cases. +Consider using fixtures to make test code more concise by generating the necessary `Mock` object attributes dynamically. 
Use [`Mock.reset_mock()`](https://docs.python.org/3/library/unittest.mock.html#unittest.mock.Mock.reset_mock) to remove the attributes associated with a mock object between different test cases. ``` # tests/test_handle_api_request.py @@ -555,12 +554,12 @@ def test_get_response_all_conditions(mock_requests_get, mock_response): ``` -[Monkeypatching](https://docs.pytest.org/en/stable/how-to/monkeypatch.html#how-to-monkeypatch-mock-modules-and-environments) in `pytest` provides an alternative way of handling mock objects and attributes, and also allows for the mocking of environment variables. +[Monkeypatching](https://docs.pytest.org/en/stable/how-to/monkeypatch.html#how-to-monkeypatch-mock-modules-and-environments) in `pytest` provides an alternative way of handling mock objects and attributes, and allows for the mocking of environment variables. ## Write tests to assure that bugs are fixed Each time you find a bug in your code, you should write a new test to assert that the code works correctly. -Once the bug is fixed, this new test should pass and give you confidence that the bug has been fixed. +Once you resolve the issue, this new test should pass and give you confidence that the bug has been fixed. When you change or refactor your code in future, the new tests will continue to assure that bugs you have already fixed will not reappear. @@ -572,8 +571,8 @@ The best practice for testing code is to use test-driven development (TDD). This is an iterative approach that involves writing tests before writing the logic to meet the tests. For a piece of analysis logic, you should know in advance what the desired outcome is. -This might be from a user need (e.g. someone needs output data in a certain shape) or an internal requirement (e.g. we need to impute all missing values). -Given that you know the expected outcome, you can write the test before even thinking about how you are going to write the solution. +This might be from a user need (for example, someone needs output data in a certain shape) or an internal requirement (For example you need to impute all missing values). +Given that you know the expected outcome, you can write the test before you think about how you are going to write the solution. ```{note} This section is framed more like training. Once dedicated training has been produced this section will likely be adapted to provide more concise guidance on @@ -586,13 +585,13 @@ TDD typically repeats three steps: 3. Refactor - Make improvements to the quality of the code without changing the functionality As with any code that is adequately covered by tests, code written using TDD can be safely refactored. -We can be more confident that our tests will capture any changes that would unintentionally alter the way our code works. +You can be more confident that your tests will capture any changes that would unintentionally alter the way our code works. -The three steps above are repeated to gradually increase the complexity of our code. -The first test that is written should focus on the minimum functionality. +Repeat the above three steps to gradually increase the complexity of your code. +The first test you write should focus on the minimum functionality. Then this minimal functionality is implemented, to do nothing more than the test requires. On the next iteration the test becomes more complex, as does the code logic. -In each iteration the refactoring steps means that the increasing complexity of the code is managed. 
+In each iteration the refactoring steps means that you manage the increasing complexity of the code. This approach provides many benefits beyond good test coverage. The iterative nature of TDD encourages you to follow a number of other good practices. @@ -605,7 +604,7 @@ are extensions of TDD with a useful focus on user needs. ## Modelling-relevant testing -To ensure that model-relevant tests are conducted within the analysis, it is important to use data that is representative of real-world scenarios and free from biases. This involves selecting diverse datasets that reflect the variety of conditions the model will encounter in practice. Additionally, it is important to regularly update test data to capture any changes in the environment or user behaviour. +To ensure you conduct model-relevant tests within the analysis, it is important to use data that is representative of real-world scenarios and free from biases. Select diverse datasets that reflect the variety of conditions the model will encounter in practice. Additionally, it is important to regularly update test data to capture any changes in the environment or user behaviour. ### Acceptance testing Acceptance testing ensures that the model meets specified requirements and performs well in real-world scenarios. It verifies that the model's outputs align with business needs and user expectations. There are three types of acceptance testing: @@ -619,17 +618,17 @@ Acceptance testing ensures that the model meets specified requirements and perfo ### Defining and Using Appropriate Metrics -Evaluating model performance using metrics is essential. You should choose metrics that align with the specific goals of the project and provide meaningful insights into the performance of the model in this context. It is important to select metrics that align with the specific goals of the project and provide meaningful insights into the model's performance. +Evaluating model performance using metrics is essential. Choose metrics that align with the specific goals of the project and provide meaningful insights into the performance of the model in this context. Use appropriate metrics to evaluate model performance. For example, precision and recall are well established measures for evaluating the performance of data linkage. The right metrics help assess the model's effectiveness in different scenarios. ### Cross-Validation Techniques -To ensure that the model generalises well to unseen data, techniques like k-fold cross-validation can be used. This method involves dividing the data into k subsets and training the model k times, each time using a different subset as the validation set and the remaining data as the training set. Cross-validation helps identify potential overfitting and ensures that the model performs consistently across different data subsets. +To ensure that the model generalises well to unseen data, you can use techniques like k-fold cross-validation. This method involves dividing the data into k subsets and training the model k times, each time using a different subset as the validation set and the remaining data as the training set. Cross-validation helps identify potential overfitting and ensures that the model performs consistently across different data subsets. ### Stress Testing -Stress testing evaluates how the model performs under extreme conditions or with noisy data. This helps identify the model's robustness and ability to handle unexpected inputs. 
Stress testing involves introducing variations or noise into the input data and observing how the model's outputs are affected. This type of testing is useful for understanding the model's limits and ensuring it can handle real-world challenges. +Stress testing evaluates how the model performs under extreme conditions or with noisy data. This helps identify the model's robustness and ability to handle unexpected inputs. Stress testing involves introducing variations or noise into the input data and observing how this affects the model's outputs. This type of testing is useful for understanding the model's limits and ensuring it can handle real-world challenges. ### Sensitivity Analysis @@ -641,7 +640,7 @@ Implementing methods to make the model's outputs interpretable is essential for ### Model Optimisation -Optimisation is used to adjust adjusting the model's parameters to achieve the best overall performance. Continuous optimisation ensures that the model remains effective and efficient over time as inputs change. There are lots of techniques available to optimise performance. Most are designed to help find the best parameters for the model to enhance its accuracy and efficiency. +Use optimisation to adjust the model's parameters to achieve the best overall performance. Continuous optimisation ensures that the model remains effective and efficient over time as inputs change. There are lots of techniques available to optimise performance; most are designed to help find the best parameters for the model to enhance its accuracy and efficiency. Examples of optimisation techniques for machine learning include grid search and parameter tuning. Grid search involves systematically searching through a predefined set of hyperparameters, while hyperparameter tuning adjusts the model's parameters to achieve the best possible performance. @@ -687,8 +686,8 @@ def test_another_function(spark_session): This example shows a fixture named `spark_session` with a testing session scope. Starting a new spark session can take a few seconds, so creating a new session for each test function would significantly increase the time it takes to run all of the tests. -With a session level scope, the function is called once for the whole testing session -and the resulting `SparkSession` object is shared between our tests. +With a session level scope, you call the function once for the whole testing session +and share the resulting `SparkSession` object between your tests. Reusing the same `SparkSession` object is safe to do if none of our tests modify the object. ```{code-block} python @@ -707,10 +706,10 @@ def test_another_function(database_connection): ... ``` -Fixtures can also be useful for undoing any effects each test run might have on the global environment. +Fixtures can also be useful for undoing any effects that each test run might have on the global environment. For example, they can remove test data which has been written to a temporary file or database. -The example above shows how a fixture might be used to reset a test database between each test. -Here a test function scope is used, so the fixture is run separately for each test function that uses it. +The example above shows how you might use a fixture to reset a test database between each test. +It uses a test function scope, so the fixture is run separately for each test function that uses it. The fixture performs a reset on the database after the database connection has been used by the test. 
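A more concrete version of this tear-down pattern is sketched below. The in-memory SQLite database, table name, and test are illustrative assumptions rather than part of the guidance; the point is that code before `yield` acts as set-up and code after `yield` acts as tear-down once the test has finished.

```{code-block} python
# tests/test_database_reset.py
# A minimal sketch of a tear-down fixture; the database, table, and test names are assumptions.
import sqlite3

import pytest


@pytest.fixture
def database_connection():
    # Set up: create a temporary in-memory database for the test to use
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE results (id INTEGER, value REAL)")

    # Hand the connection to the test function
    yield connection

    # Tear down: remove any test data and close the connection after the test
    connection.execute("DELETE FROM results")
    connection.close()


def test_write_results(database_connection):
    database_connection.execute("INSERT INTO results VALUES (1, 0.5)")
    rows = database_connection.execute("SELECT id, value FROM results").fetchall()
    assert rows == [(1, 0.5)]
```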
For usage details see the documentation for packages that offer fixtures: @@ -720,8 +719,8 @@ For usage details see the documentation for packages that offer fixtures: ### Use parameterisation to reduce repetition in test logic Similar steps are often repeated when testing multiple combinations of inputs and outputs. -Parameterisation allows us to reduce repetition in our test code, in a similar way to writing our logic in functions. -You should specify pairs of inputs and expected outputs, so that your testing tool can repeat the same test for each scenario. +Parameterisation allows reduction of repetition in test code, in a similar way to writing your logic in functions. +Specify pairs of inputs and expected outputs, so your testing tool can repeat the same test for each scenario. Using parameterisation in a test framework is equivalent to using a for-loop to apply a test function over multiple inputs and expected outputs. Using functionality from test packages may provide improved running efficiency and more detailed reporting of test failures. @@ -733,7 +732,7 @@ In R, the `patrick` package extends `testthat` to provide a ### Define Source Code -Take the below function for example, it can take 2 arguments. +Take the below function for example, which can take 2 arguments. ```{code-block} python def sum_two_nums(num1:int, num2:int) -> int: @@ -810,7 +809,7 @@ def expected_answers() -> dict: This fixture of expected answers can be served to a parameterised test and the returned dictionary can be accessed to provide the expected answer for -parameter combinations. In order to parameterise both of the required arguments, +parameter combinations. To parameterise both of the required arguments, the parameterise statements are simply stacked on top of each other: ```{code-block} python @@ -863,11 +862,11 @@ Although testing SQL is outside the scope of this guidance, many of the concepts in this guidance are also applicable to SQL. In SQL, single queries often contain several parts. These can be more readily tested by breaking up these queries and taking a more step-by-step approach, -similar to breaking up functions. Integration testing can be used to verify +similar to breaking up functions. Use [Integration testing](https://github.com/best-practice-and-impact/qa-of-code-guidance/blob/main/book/testing_code.md#test-that-different-parts-of-the-code-interact-correctly-using-integration-tests) to verify that queries and functions behave as expected when combined. -Functions that interact with a database (DB) should be tested within a development -environment, rather than with a production database. This is in order to prevent +Test functions that interact with a database (DB) within a development +environment, rather than with a production database. This prevents unintended data modification or deletion. Functions can also be unit tested from simplified dummy data. @@ -876,8 +875,8 @@ and [pgTAP](https://github.com/theory/pgtap/) for Postgres. ## In a time crunch? The risks to skipping tests -In an ideal world, testing code would never be skipped, keeping the software reliable, -and easily reproducible. However, in practice there are times when skipping tests may be necessary— +In an ideal world, you would never skip testing code, ensuring the software is reliable +and easily reproducible. However, in practice there are times when skipping tests may be necessary — perhaps due to tight deadlines, limited resources, or the need to quickly get a feature up and running. 
While this can save time in the moment, it’s important to be cautious, as skipping tests can lead to hidden problems that may become harder to fix later, particularly From 4c055fe8c9e3387f7b1c6ba689f58d50a66ed4f2 Mon Sep 17 00:00:00 2001 From: Beth Jones Date: Mon, 20 Jan 2025 16:36:30 +0000 Subject: [PATCH 33/33] converted some passive sentences to active in testing chapter --- book/testing_code.md | 42 +++++++++++++++++++++--------------------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/book/testing_code.md b/book/testing_code.md index cdca7c65..37428766 100644 --- a/book/testing_code.md +++ b/book/testing_code.md @@ -16,7 +16,7 @@ How can I demonstrate that my code does what it is supposed to do? As the developer of the code, you are best placed to decide what tests you need to put in place to answer that question confidently. -Take a risk-based approach to testing. Tests should be used proportionately for your analysis. This usually means writing more tests for parts of your code that are very new, more complex, or carry more risk. +Take a risk-based approach to testing. You should use tests proportionately based on your analysis. This usually means writing more tests for parts of your code that are very new, more complex, or carry more risk. When you are developing your tests, here are some points to think about: @@ -45,7 +45,7 @@ Other useful learning resources include: * Hadley Wickham's [testthat: getting started with testing](https://vita.had.co.nz/papers/testthat.pdf) and [testing design in R](https://r-pkgs.org/testing-design.html) ``` -In this section we assume that you are using a testing framework to run your tests (for example `pytest` for python or `testthat` for R) and have your code in a package. +In this section, we assume that you are using a testing framework to run your tests (for example `pytest` for python or `testthat` for R) and have your code in a package. It is more difficult to test code that is not in a package and therefore follow the testing good practices described here. ## Write reproducible tests @@ -58,7 +58,7 @@ However, it's important you are able to produce the same checks against your cod Code changes over time, so you need to be able to repeat these checks against the updated code. Additionally, other analysts should be able to carry out the same checks and get the same results. -Representing your tests as code allows you to consistently repeat the same steps. +You can consistently repeat the same steps when you represent your tests as code. This lets you or another analyst carry out the same verification again to get the same results. When you have carried out a test manually, you should ensure that you add a code test to reproduce this. @@ -66,10 +66,10 @@ Code that you write for testing should also follow the good practices described ## Write repeatable tests -For you to be able to trust the results of your tests you need them to be repeatable. +You need your tests to be repeatable for you to be able to trust their results. This means they should give the same outcome if you run them more than once against the same version of your analysis code. -For tests to run repeatably each test must be independent. +For tests to run repeatably, each test must be independent. There should not be a shared state between tests, for example a test should not depend on another test having already run. You could intentionally randomise the order that tests are executed to encourage this. 
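For instance, the sketch below (which uses assumed names and a deliberately bad first pair of tests, not an example from this book) contrasts a test that silently depends on another test having run with an independent version of the same check. In `pytest`, plugins such as `pytest-randomly` can shuffle execution order to help expose this kind of hidden coupling.

```{code-block} python
# A minimal sketch of test (in)dependence; the shared list and test names are illustrative assumptions.
shared_results = []


def test_appends_a_result():
    shared_results.append(1)
    assert shared_results == [1]


def test_relies_on_previous_test():
    # Avoid this: it only passes if test_appends_a_result has already run
    assert len(shared_results) == 1


def test_is_independent():
    # Prefer this: the test creates the state it needs itself
    results = [1]
    assert len(results) == 1
```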
@@ -89,7 +89,7 @@ This ensures that changes do not break the existing, intended functionality of y Running the entire collection of tests has the added benefit of detecting unexpected side-effects of your changes. For example, you might detect an unexpected failure in part of your code that you didn't change. -Running tests regularly allows you to fix any issues before changes are added to a stable or production version of your code (e.g. the `main` Git branch). +If you run tests regularly, you will be more able to fix any issues before changes are added to a stable or production version of your code (e.g. the `main` Git branch). If you have altered the functionality of your code, this will likely break existing tests. Failing tests here act as a good reminder that you should update your tests and documentation to reflect the new functionality. @@ -98,7 +98,7 @@ Many testing frameworks support writing tests as examples in the function docume It's not easy to remember to run your tests manually at regular intervals. And you're right to think "surely this could be automated too?". Use [continuous integration](continuous-integration) to automate -the running of tests. This way, tests can be triggered to run when any changes are made to your +the running of tests. This way, you can trigger tests to run when any changes are made to your remote version control repository. ## Record the outcomes of your tests @@ -117,7 +117,7 @@ Tests for analytical code will usually require data. To ensure that tests are cl Good test data are: * only just detailed enough data to carry out the test -* fake, static (hardcoded) and readable +* fake, static (hardcoded), and readable * stored closely to the test ```{warning} @@ -176,7 +176,7 @@ def test_sum_columns(): ``` Using minimal and general data in the test has made it clearer what is being tested, and also avoids any unnecessary disclosure. -In this case the function is very generic, so the test doesn't need to know the names of real columns in our data or even have similar values in the data. +In this case, the function is very generic, so the test doesn't need to know the names of real columns in our data or even have similar values in the data. The test data are focussed on testing specific, realistic cases. This makes it easy to see that this function works correctly with positive, negative and zero values. @@ -276,7 +276,7 @@ class TestSum: Using classes for unit tests has many additional benefits, allowing reuse of the same logic either by class inheritance, or through fixtures. Similar to fixtures, -you are able to use the same pieces of logic through class inheritance in python. +you can use the same pieces of logic through class inheritance in python. Note that it is easier to mix up and link unit tests when using class inheritance. The following code block demonstrates an example of class inheritance which will inherit both the variable and the `test_var_positive` unit test, meaning three unit tests are run. @@ -310,14 +310,14 @@ Note that some test frameworks allow you to keep the tests in the same file as t This is a good way of keeping tests and code associated, but you should follow good modular code practices to separate unrelated code into different files. Additional arguments are made to separate tests and functions when you are packaging your code. -If unit tests and code are located together in the same file, +If you store unit tests and code in the same file, the unit tests would also be packaged and installed by additional users. 
Therefore when packaging code, -the unit tests should be moved to an adjacent test folder as users will not need to have unit tests installed when installing the package. +you should move the unit tests to an adjacent test folder as users will not need to have unit tests installed when installing the package. When separating unit tests into main package and testing scripts, it is important to import your package to ensure the correct functions are being unit tested. For the module structure outlined previously, use `from src.math import my_math_function`. -For R you need to specify the name of your package within the `testthat.R` file within your tests folder. +For R, you need to specify the name of your package within the `testthat.R` file within your tests folder. ## Structuring tests @@ -326,7 +326,7 @@ or [Google](https://google.github.io/styleguide/Rguide.html) / [tidyverse](https For python this involves importing all needed functions at the beginning of the test file. To ensure you import the correct functions from your module, -it is also recommended to install a local editable version into your virtual environment. +we recommended you install a local editable version into your virtual environment. Run `pip install -e .` and any changes made to your module functions will also be updated in your python environment. Following this it is recommended to define fixtures, classes and then test functions. @@ -353,7 +353,7 @@ A unit is the smallest modular piece of logic in the code - a function or method Unit tests should cover realistic use cases for your function, such as: * boundary cases, like the highest and lowest expected input values -* positive, negative, zero and missing value inputs +* positive, negative, zero, and missing value inputs * examples that trigger errors that have been defined in your code When your function documentation describes the expected inputs to your function, there is less need to test unexpected cases. @@ -367,7 +367,7 @@ Newly developed packages or those with very few users are more likely to not be Integration tests are those that test on a higher level than a unit. This includes testing that: * multiple units work together correctly -* multiple high level functions work together (e.g. many units grouped into stages of a pipeline) +* multiple high level functions work together (e.g., many units grouped into stages of a pipeline) * the analysis works with typical inputs from other systems Integration tests give you assurance that your analysis is fit for purpose. @@ -571,7 +571,7 @@ The best practice for testing code is to use test-driven development (TDD). This is an iterative approach that involves writing tests before writing the logic to meet the tests. For a piece of analysis logic, you should know in advance what the desired outcome is. -This might be from a user need (for example, someone needs output data in a certain shape) or an internal requirement (For example you need to impute all missing values). +This might be from a user need (for example, someone needs output data in a certain shape) or an internal requirement (for example, you need to impute all missing values). Given that you know the expected outcome, you can write the test before you think about how you are going to write the solution. ```{note} @@ -580,9 +580,9 @@ the practice. ``` TDD typically repeats three steps: -1. Red - Write a test that we expect to fail -2. Green - Write or update our code to pass the new test -3. 
Refactor - Make improvements to the quality of the code without changing the functionality
+1. Red - Write a test that we expect to fail.
+2. Green - Write or update our code to pass the new test.
+3. Refactor - Make improvements to the quality of the code without changing the functionality.

As with any code that is adequately covered by tests, code written using TDD can be safely refactored.
You can be more confident that your tests will capture any changes that would unintentionally alter the way your code works.

@@ -646,7 +646,7 @@

## Reduce repetition in test code (fixtures and parameterised tests)

-Where possible, reduce repetition in your tests. Tests are code too, so you should still [make this code reusable](functions).
+Where possible, you should reduce repetition in your tests. Tests are code too, so you should still [make this code reusable](functions).
As with functional code, test code is much easier to maintain when it is modular and reusable.

### Use fixtures to reduce repetition in test set up
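As a minimal sketch of the idea (the data and test names below are illustrative assumptions, not taken from the guidance), a fixture builds a shared test input once, and each test that needs it simply requests it by name:

```{code-block} python
import pandas as pd
import pytest


@pytest.fixture
def input_data():
    # Shared set-up: every test that requests this fixture receives the same small, fake dataset
    return pd.DataFrame({"region": ["A", "B"], "value": [1.0, 2.0]})


def test_expected_columns(input_data):
    assert list(input_data.columns) == ["region", "value"]


def test_expected_row_count(input_data):
    assert len(input_data) == 2
```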