Initial project structure (#1)
* Initial project structure
brendancsmith authored Mar 6, 2024
1 parent 80e9598 commit b228d30
Showing 16 changed files with 310 additions and 19 deletions.
77 changes: 77 additions & 0 deletions .github/workflows/python-package.yaml
@@ -0,0 +1,77 @@
name: Python Package
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:
    # no inputs

concurrency:
  group: python-package-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

permissions: read-all

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.11", "3.12"]
    steps:
      #----------------------------------------------
      # check-out repo and set-up python
      #----------------------------------------------
      - name: Check out repo
        uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        id: setup-python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      #----------------------------------------------
      # install & configure poetry
      #----------------------------------------------
      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-in-project: true
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install poetry
          poetry install
      #----------------------------------------------
      # load cached venv if cache exists
      #----------------------------------------------
      - name: Load cached venv
        id: cached-poetry-dependencies
        uses: actions/cache@v4
        with:
          path: .venv
          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}
      #----------------------------------------------
      # install dependencies if cache does not exist
      #----------------------------------------------
      - name: Install dependencies
        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
        run: poetry install --no-interaction --no-root
      #----------------------------------------------
      # run linter
      #----------------------------------------------
      # - name: Lint with ruff
      #   run: |
      #     source .venv/bin/activate
      #     # stop the build if there are Python syntax errors or undefined names
      #     ruff check . --select=E9,F63,F7,F82 --output-format=full --no-fix --statistics
      #     # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
      #     ruff check . --select=E501,C901 --line-length=127 --exit-zero --no-fix --statistics
      #----------------------------------------------
      # run tests
      #----------------------------------------------
      - name: Test with pytest
        run: |
          source .venv/bin/activate
          pytest
29 changes: 29 additions & 0 deletions .github/workflows/trunk-check.yaml
@@ -0,0 +1,29 @@
name: Trunk Check
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:
    # no inputs

concurrency:
  group: trunk-check-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

permissions: read-all

jobs:
  trunk_check:
    name: Trunk Check Runner
    runs-on: ubuntu-latest
    permissions:
      checks: write # For trunk to post annotations
      contents: read # For repo checkout

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Trunk Check
        uses: trunk-io/trunk-action@v1
9 changes: 9 additions & 0 deletions .trunk/.gitignore
@@ -0,0 +1,9 @@
*out
*logs
*actions
*notifications
*tools
plugins
user_trunk.yaml
user.yaml
tmp
10 changes: 10 additions & 0 deletions .trunk/configs/.markdownlint.yaml
@@ -0,0 +1,10 @@
# Autoformatter friendly markdownlint config (all formatting rules disabled)
default: true
blank_lines: false
bullet: false
html: false
indentation: false
line_length: false
spaces: false
url: false
whitespace: false
7 changes: 7 additions & 0 deletions .trunk/configs/.yamllint.yaml
@@ -0,0 +1,7 @@
rules:
  quoted-strings:
    required: only-when-needed
    extra-allowed: ["{|}"]
  key-duplicates: {}
  octal-values:
    forbid-implicit-octal: true
5 changes: 5 additions & 0 deletions .trunk/configs/ruff.toml
@@ -0,0 +1,5 @@
# Generic, formatter-friendly config.
select = ["B", "D3", "E", "F"]

# Never enforce `E501` (line length violations). This should be handled by formatters.
ignore = ["E501"]
37 changes: 37 additions & 0 deletions .trunk/trunk.yaml
@@ -0,0 +1,37 @@
# This file controls the behavior of Trunk: https://docs.trunk.io/cli
# To learn more about the format of this file, see https://docs.trunk.io/reference/trunk-yaml
version: 0.1
cli:
  version: 1.20.1
# Trunk provides extensibility via plugins. (https://docs.trunk.io/plugins)
plugins:
  sources:
    - id: trunk
      ref: v1.4.4
      uri: https://github.com/trunk-io/plugins
# Many linters and tools depend on runtimes - configure them here. (https://docs.trunk.io/runtimes)
runtimes:
  enabled:
    - node@18.12.1
    - python@3.10.8
# This is the section where you manage your linters. (https://docs.trunk.io/check/configuration)
lint:
  enabled:
    - actionlint@1.6.27
    - checkov@3.2.31
    - trivy@0.49.1
    - yamllint@1.35.1
    - ruff@0.3.0
    - git-diff-check
    - markdownlint@0.39.0
    - osv-scanner@1.6.2
    - prettier@3.2.5
    - taplo@0.8.1
    - trufflehog@3.68.4
actions:
  disabled:
    - trunk-announce
    - trunk-check-pre-push
    - trunk-fmt-pre-commit
  enabled:
    - trunk-upgrade-available
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Brendan Smith mail@brendansmith.ai

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
40 changes: 21 additions & 19 deletions README.md
@@ -1,40 +1,42 @@
# Hi Brendan,
# vcf-isec

## Part 1: Welcome to your code challenge!
A simple python implementation of Variant Call Format intersection and complements.

## Background

### Background
Bioinformaticians store variants identified by next generation sequencing in a VCF file. The VCF specification was originally maintained by the 1000 Genomes Project, and the torch has since been passed to the Global Alliance for Genomics and Health Data Working group file format team.

Specifications for VCF v4.1 can be found [here](http://samtools.github.io/hts-specs/VCFv4.1.pdf).

Essentially, a variant is represented as a separate line in the VCF, where the chromosome, position, reference base(s), and alternate base(s) identified at that position are found in columns 1, 2, 4, and 5, resp. Additional information pertaining to the variant is listed in the remaining fields of the VCF.
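
As a rough illustration of the column layout described above, a variant's identity can be pulled out of a single tab-delimited data line as sketched below. The names `VariantKey` and `parse_variant_key` are hypothetical and are not part of this commit, which adds only an empty `src/vcf_isec/__init__.py`. Treating the ALT column as one opaque string (without splitting multi-allelic records) is also an assumption.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    """Identity of a variant: the four columns named in the README text."""

    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_key(line: str) -> VariantKey:
    """Extract (CHROM, POS, REF, ALT) from one tab-delimited VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(
            f"expected at least 5 tab-separated columns, got {len(fields)}"
        )
    # Columns 1, 2, 4, 5 are zero-based indices 0, 1, 3, 4; column 3 is the ID.
    return VariantKey(fields[0], int(fields[1]), fields[3], fields[4])
```

Two records with different ID, QUAL, FILTER, or INFO values would still compare equal under this key, which is usually what a comparison between pipelines wants.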

## Task

### Task
A common task for bioinformaticians is to compare variants, whether to compare VCF files generated by different analytical pipelines or to simply compare variants between related individuals.

Write a script (preferably in python) that takes as input two VCFs and performs a comparison of the variants found in each file. The script should output 3 VCFs, reflecting those variants that are shared and unique to each individual. The script should gracefully handle errors.
This script takes as input two VCFs and performs a comparison of the variants found in each file. The script outputs 3 VCFs, reflecting those variants that are shared and unique to each individual.
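
A minimal sketch of the comparison described above, assuming variants are matched on CHROM, POS, REF, and ALT only. The helper names (`variant_key`, `data_lines`, `compare`) are hypothetical; the commit itself contains no implementation yet. Shared records are taken from the first input, which is an arbitrary choice.

```python
from typing import Iterable


def variant_key(line: str) -> tuple[str, str, str, str]:
    """Key a data line by CHROM, POS, REF, ALT (columns 1, 2, 4, 5)."""
    f = line.rstrip("\n").split("\t")
    return (f[0], f[1], f[3], f[4])


def data_lines(lines: Iterable[str]) -> list[str]:
    """Drop header lines (those starting with '#') and blank lines."""
    return [ln for ln in lines if ln.strip() and not ln.startswith("#")]


def compare(
    a_lines: Iterable[str], b_lines: Iterable[str]
) -> tuple[list[str], list[str], list[str]]:
    """Return (shared, only_in_a, only_in_b) data lines, preserving input order."""
    a = data_lines(a_lines)
    b = data_lines(b_lines)
    a_keys = {variant_key(ln) for ln in a}
    b_keys = {variant_key(ln) for ln in b}
    shared = [ln for ln in a if variant_key(ln) in b_keys]
    only_a = [ln for ln in a if variant_key(ln) not in b_keys]
    only_b = [ln for ln in b if variant_key(ln) not in a_keys]
    return shared, only_a, only_b
```

Passing two open file handles works because files iterate line by line. The lengths of the three returned lists give the basic summary counts. This version holds both files' data lines in memory, so inputs at the whole-genome scale mentioned in the note below may call for a streaming approach over sorted files instead.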

Upload your final script to this GitHub repo as a pull request to the master branch.
**NOTE**: An example VCF is provided at `tests/resources/sample.vcf`. VCFs can grow up to 4 million variants in size, as in the case of whole genome sequencing.

**NOTE**: An example VCF is provided in this repo for your development purposes. VCFs can grow up to 4 million variants in size, as in the case of whole genome sequencing.
## Part 2) Problem Solving Challenge (to be discussed in person at 2nd interview)

**BONUS TASK**: The script also generates relevant summary metrics describing the two sets of variants (number of variants shared, etc.).
Carefully read the following questions. In your next interview, be prepared to discuss each question and scenario. During the discussion, be as specific as possible regarding how the mentioned technology would be used. Our goal is to find out how quickly you grasp novel technologies, with respect to both their benefits and drawbacks.

Should you accept this challenge, you have 48 hours to complete your task. Feel free to email if you have any questions.
### Scenario A:

## Part 2) Problem Solving Challenge (to be discussed in person at 2nd interview)
Carefully read the following questions. In your next interview, be prepared to discuss each question and scenario. During the discussion, be as specific as possible regarding how the mentioned technology would be used. Our goal is to find out how quickly you grasp novel technologies, with respect to both their benefits and drawbacks.

#### Scenario A:
In the current system large JSON Lines files (100,000 JSON dictionaries per file) are uploaded to S3. Each file is then read line by line and the JSON dictionaries sent to a MongoDB collection using Apache Airflow. The client would like to remove Airflow from this process. The entire contents of the JSON Lines file must be uploaded to MongoDB. Files that are in the process of being uploaded and those that have errored must be tracked.
##### Questions:
1. Take 5-10 minutes to present an architecture that would eliminate the need to use Apache Airflow. Feel free to use any combination of AWS services you wish to solve the problem. Be prepared to justify your technology and architectural decisions.
2. How would you retry the process if MongoDB went offline in the middle of an upload?

#### Scenario B

#### Questions:

1. Take 5-10 minutes to present an architecture that would eliminate the need to use Apache Airflow. Feel free to use any combination of AWS services you wish to solve the problem. Be prepared to justify your technology and architectural decisions.
2. How would you retry the process if MongoDB went offline in the middle of an upload?

### Scenario B

Linux nodes continuously watch a RabbitMQ queue to see if any jobs have been submitted. When a job appears in a queue, one node will pick up the job, thereby preventing a different node from picking up the same job. RabbitMQ’s ack late feature is used to ensure jobs are retried when a node abruptly fails. Only one node is supposed to execute a job at one time, however, under some conditions, it is possible for two nodes to pick up the same job which will cause issues. The only shared systems between the nodes are the RabbitMQ queue and a MongoDB database.
##### Questions:

#### Questions:

1. Come up with a strategy on how to keep two jobs from running at the same time. The strategy can only use the given resources.
2. If a new shared resource or technology could be added to the system, what would you add to ensure that two jobs do not run at the same time?

Binary file added dist/vcf_isec-0.1.0-py3-none-any.whl
Binary file not shown.
Binary file added dist/vcf_isec-0.1.0.tar.gz
Binary file not shown.
74 changes: 74 additions & 0 deletions poetry.lock

Some generated files are not rendered by default.

17 changes: 17 additions & 0 deletions pyproject.toml
@@ -0,0 +1,17 @@
[tool.poetry]
name = "vcf-isec"
version = "0.1.0"
description = "A simple python implementation of Variant Call Format intersection and complements"
authors = ["Brendan Smith <brendan.smith.93@gmail.com>"]
license = "MIT"
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.11"

[tool.poetry.group.dev.dependencies]
pytest = "^8.0.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Empty file added src/vcf_isec/__init__.py
Empty file.
File renamed without changes.
3 changes: 3 additions & 0 deletions tests/unit/test_no_op.py
@@ -0,0 +1,3 @@
def test_no_op():
    # No-op test to ensure GitHub Actions is working with pytest
    pass
