Contributing Guide
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given. You can contribute in many ways.
Please check the GitHub issue tracker for:
- Bugs
- Features
- Enhancements
Wrap lines longer than 99 characters (79 if possible). Lines with more than 79 characters are harder to read in IPython and on the command line. Otherwise, we try to keep to the PEP 8 standards. Private functions start with an underscore.
Functions that (see the sketch after this list):
- fetch data from resources should start with fetch/_fetch
- parse/read data from a file should start with parse/_parse/read/_read
- download data from resources to a file should use download/_download
- write data to a file should start with write/_write
- load and process Python objects should use filter/_filter/select/_select
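For illustration, here is a minimal sketch of these conventions. The function names and arguments below are hypothetical examples, not part of ProteoFAV's actual API:

import requests


def fetch_uniprot_sequence(uniprot_id):
    """Public helper: fetches a sequence from a remote resource."""
    return _fetch_uniprot_sequence(uniprot_id)


def _fetch_uniprot_sequence(uniprot_id):
    """Private helper (underscore prefix) that performs the actual request."""
    url = "https://rest.uniprot.org/uniprotkb/{}.fasta".format(uniprot_id)  # hypothetical endpoint
    return requests.get(url).text


def parse_mmcif_atoms(filename):
    """Parses atom records from an mmCIF file into a table."""
    ...


def download_sifts_xml(identifier, filename):
    """Downloads data from a resource and stores it in `filename`."""
    ...


def write_mmcif_atoms(table, filename):
    """Writes a table of atom records out to an mmCIF file."""
    ...


def filter_structures(table, chains=None):
    """Filters/selects rows from an already loaded table."""
    ...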
Python code is also documentation! So use self-explanatory function names:
- parse_mmcif_atoms() instead of parse_mmcif()
as well as self-explanatory function variables:
- uniprot_id instead of identifier
Generally, a function argument named identifier can be used if the docstring clearly defines which type of accession identifier is expected.
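For example, a sketch (hypothetical function) where a generic identifier argument is acceptable because the docstring pins down the accession type:

def fetch_summary(identifier):
    """Fetches a summary entry for a protein.

    :param identifier: UniProt accession (e.g. 'P12345'); other accession
        types are not accepted here.
    :return: dict with the summary fields
    """
    ...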
If a resource has its own name for a field or value, try to keep it, for consistency.
Use the following template:
def df_encoder(data, descriptor=None, col_names=None):
    """Encode a pandas DataFrame with a descriptor. Similar to one-hot encoding,
    however it preserves encoding between columns.

    :param data: pandas DataFrame with categorical data
    :type data: pandas.DataFrame
    :param descriptor: dict-like descriptor to be applied to the columns
    :type descriptor: dict
    :param col_names: names for the new columns
    :type col_names: list of [str, ]
    :return: table with encoded data
    :rtype: pandas.DataFrame

    :Example:

        >>> import pandas as pd
        >>> df = pd.DataFrame(map(list, ['ABC', 'AAA', 'ACD']))
        >>> print(df_encoder(df))
           0_A  1_A  2_A  0_B  1_B  2_B  0_C  1_C  2_C  0_D  1_D  2_D
        0    1    0    0    0    0    1    0    0    0    0    1    0
        1    1    0    0    0    1    0    0    0    1    0    0    0
        2    1    0    0    0    0    0    1    0    0    0    0    1

    .. note:: Use an external descriptor to make sure your descriptor is
        replicated.
    """
If the function returns a pandas.DataFrame, it is good practice to document which column: dtype pairs you expect, so we can keep track of them:
def fetch_ensembl_variants(identifier, feature=None):
    """Queries the Ensembl API for germline variants (mostly dbSNP) and somatic
    variants (mostly COSMIC) based on Ensembl Protein identifiers
    (e.g. ENSP00000326864).

    :param identifier: Ensembl accession to a protein: ENSP00000XXXXXX
    :param feature: either 'transcript_variation' or
        'somatic_transcript_variation'
    :return: table[Parent: str,
                   allele: str,
                   clinical_significance: list,
                   codons: str,
                   end: int,
                   feature_type: str,
                   id: str,
                   minor_allele_frequency: float,
                   polyphen: float,
                   residues: str,
                   seq_region_name: str,
                   sift: float,
                   start: int,
                   translation: str,
                   type: str]
    :rtype: pandas.DataFrame
    """
Column type normalisation is a central issue in ProteoFAV. There is no simple way to make column types consistent across all data files. Some pragmatic rules to deal with NaNs (Not a Number) in non-float columns are defined here, but they are open to change. NaNs in Python are always floats, so if one has to operate with integers or strings, the NaNs must be eliminated first. In ProteoFAV we use the following fill values (see the sketch after this list):
- If it is a sequence index: -9999
- If it is a sequence column: 'X'
- If it is another string column: '' (empty string)
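As a sketch only (the column names below are hypothetical examples, not ProteoFAV's actual schema), these rules translate to pandas roughly as:

import pandas as pd


def _fix_column_types(table):
    """Apply the NaN rules above so non-float columns get usable dtypes."""
    # sequence index: fill NaNs with -9999 so the column can become integer
    if 'PDB_dbResNum' in table.columns:   # hypothetical index column
        table['PDB_dbResNum'] = table['PDB_dbResNum'].fillna(-9999).astype(int)
    # sequence column: unknown residues become 'X'
    if 'label_comp_id' in table.columns:  # hypothetical sequence column
        table['label_comp_id'] = table['label_comp_id'].fillna('X')
    # other string columns: NaNs become empty strings
    for col in table.select_dtypes(include='object').columns:
        table[col] = table[col].fillna('')
    return table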
Doctests are not mandatory, but tests are. Tests are located in /tests and we use the standard unittest setup.
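A minimal sketch of such a test module (the import path, test data file and test names below are hypothetical):

import unittest

from proteofav.structures import parse_mmcif_atoms  # hypothetical import path


class TestMMCIFParser(unittest.TestCase):
    """Example of the standard unittest layout used in /tests."""

    def test_parse_mmcif_atoms_returns_table(self):
        # hypothetical test fixture shipped with the test suite
        table = parse_mmcif_atoms('tests/testdata/mmcif/example.cif')
        self.assertFalse(table.empty)


if __name__ == '__main__':
    unittest.main()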
Various git and GitHub guides exist online. Here are some that we find useful: