Contributing Guide
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given. You can contribute in many ways.
Please check the GitHub issue tracker for:
- Bugs
- Features
- Enhancements
Wrap lines longer than 99 characters (79 if possible). Lines with more than 79 characters are harder to read in IPython and on the command line. Otherwise, we try to keep to the PEP 8 standards. Private functions start with an underscore.
Functions that (see the sketch after this list):
- fetch data from resources should start with fetch/_fetch
- parse/read data from a file should start with parse/_parse/read/_read
- download data from resources to a file should use download/_download
- write data to a file should start with write/_write
- load and process Python objects should use filter/_filter/select/_select
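For illustration, here is a minimal sketch of these conventions. The function names and arguments below are hypothetical examples, not part of ProteoFAV's actual API:

import requests


def fetch_uniprot_sequence(uniprot_id):
    """Public helper: fetches a sequence from a remote resource."""
    return _fetch_uniprot_sequence(uniprot_id)


def _fetch_uniprot_sequence(uniprot_id):
    """Private helper (underscore prefix) that performs the actual request."""
    url = "https://rest.uniprot.org/uniprotkb/{}.fasta".format(uniprot_id)  # hypothetical endpoint
    return requests.get(url).text


def parse_mmcif_atoms(filename):
    """Parses atom records from an mmCIF file into a table."""
    ...


def download_sifts_xml(identifier, filename):
    """Downloads data from a resource and stores it in `filename`."""
    ...


def write_mmcif_atoms(table, filename):
    """Writes a table of atom records out to an mmCIF file."""
    ...


def filter_structures(table, chains=None):
    """Filters/selects rows from an already loaded table."""
    ...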
Python code is also documentation! So use self-explanatory function names:
- parse_mmcif_atoms() instead of parse_mmcif()
as well as self-explanatory function variables:
- uniprot_id instead of identifier
Generally, a function argument named identifier can be used if the docstring clearly defines which type of accession identifier is expected.
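For example, a sketch (hypothetical function) where a generic identifier argument is acceptable because the docstring pins down the accession type:

def fetch_summary(identifier):
    """Fetches a summary entry for a protein.

    :param identifier: UniProt accession (e.g. 'P12345'); other accession
        types are not accepted here.
    :return: dict with the summary fields
    """
    ...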
If a resource has its own name for a field or value, try to keep it, for consistency.
Use the following template:
def df_encoder(data, descriptor=None, col_names=None):
    """Encode a pandas DataFrame with a descriptor. Similar to one-hot encoding,
    however it preserves encoding between columns.

    :param data: pandas DataFrame with categorical data
    :type data: pandas.DataFrame
    :param descriptor: dict-like descriptor to be applied to the columns
    :type descriptor: dict
    :param col_names: names for the new columns
    :type col_names: list of [str, ]
    :return: table with encoded data
    :rtype: pandas.DataFrame

    :Example:

        >>> import pandas as pd
        >>> df = pd.DataFrame(map(list, ['ABC', 'AAA', 'ACD']))
        >>> print(df_encoder(df))
           0_A  1_A  2_A  0_B  1_B  2_B  0_C  1_C  2_C  0_D  1_D  2_D
        0    1    0    0    0    0    1    0    0    0    0    1    0
        1    1    0    0    0    1    0    0    0    1    0    0    0
        2    1    0    0    0    0    0    1    0    0    0    0    1

    .. note:: Use an external descriptor to make sure your descriptor is
        replicated.
    """
If the function returns a pandas.DataFrame, it is good practice to document which column: dtype pairs you expect, so we can keep track of them:
def fetch_ensembl_variants(identifier, feature=None):
    """Queries the Ensembl API for germline variants (mostly dbSNP) and somatic
    variants (mostly COSMIC) based on Ensembl Protein identifiers
    (e.g. ENSP00000326864).

    :param identifier: Ensembl accession to a protein: ENSP00000XXXXXX
    :param feature: either 'transcript_variation' or
        'somatic_transcript_variation'
    :return: table[Parent: str,
                   allele: str,
                   clinical_significance: list,
                   codons: str,
                   end: int,
                   feature_type: str,
                   id: str,
                   minor_allele_frequency: float,
                   polyphen: float,
                   residues: str,
                   seq_region_name: str,
                   sift: float,
                   start: int,
                   translation: str,
                   type: str]
    :rtype: pandas.DataFrame
    """
Column type normalisation is a central issue in ProteoFAV. There is no simple way to make column types consistent across all data files. Some pragmatic rules to deal with NaNs (Not a Number) in non-float columns are defined here, but they are open to change. NaNs in Python are always floats, so if one has to operate with integers or strings, the NaNs must be eliminated first. In ProteoFAV we use the following fill values (see the sketch after this list):
- If it is a sequence index: -9999
- If it is a sequence column: 'X'
- If it is another string column: '' (empty string)
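As a sketch only (the column names below are hypothetical examples, not ProteoFAV's actual schema), these rules translate to pandas roughly as:

import pandas as pd


def _fix_column_types(table):
    """Apply the NaN rules above so non-float columns get usable dtypes."""
    # sequence index: fill NaNs with -9999 so the column can become integer
    if 'PDB_dbResNum' in table.columns:   # hypothetical index column
        table['PDB_dbResNum'] = table['PDB_dbResNum'].fillna(-9999).astype(int)
    # sequence column: unknown residues become 'X'
    if 'label_comp_id' in table.columns:  # hypothetical sequence column
        table['label_comp_id'] = table['label_comp_id'].fillna('X')
    # other string columns: NaNs become empty strings
    for col in table.select_dtypes(include='object').columns:
        table[col] = table[col].fillna('')
    return table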
Doctests are not mandatory, but tests are. Tests are located in /tests and we use the standard unittest setup.
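A minimal sketch of such a test module (the import path, test data file and test names below are hypothetical):

import unittest

from proteofav.structures import parse_mmcif_atoms  # hypothetical import path


class TestMMCIFParser(unittest.TestCase):
    """Example of the standard unittest layout used in /tests."""

    def test_parse_mmcif_atoms_returns_table(self):
        # hypothetical test fixture shipped with the test suite
        table = parse_mmcif_atoms('tests/testdata/mmcif/example.cif')
        self.assertFalse(table.empty)


if __name__ == '__main__':
    unittest.main()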
Various git and GitHub guides exist online. Here are some that we find useful: