We'd love your help! This doc covers how to become a contributor and submit code to the project.
We use the Google Style Guides for our code, which is mainly in Python and R. The items below highlight some starting points, but you should ensure your code also complies with the associated coding style unless otherwise noted in bold below.
- The way you open and close data files.
- How you name variables, functions, classes, methods, etc.
- Your code documentation, in particular functions.
- Modules and usage.
- Code indentation.
- If you happen to be a Vim user, Google also provides a settings file.
- File Names
- Function Definition
- Function Documentation
- Indentation: Use 4 spaces, for consistency with Python style.
- Assignment
- Else
- Braces
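As a rough illustration of the Python items above (opening files, naming, function documentation, and 4-space indentation), here is a minimal sketch. The function `count_words` and its behavior are hypothetical and not part of the PERCEIVE code base. The docstring layout follows the `Args:`/`Returns:` sections described in the Google Python Style Guide.

```python
"""Word-count helper (hypothetical example, not part of PERCEIVE)."""

import collections


def count_words(path):
    """Counts word occurrences in a text file.

    Args:
        path: Path to a UTF-8 encoded text file.

    Returns:
        A collections.Counter mapping each lowercase word to its count.
    """
    # A with statement closes the file even if reading raises an error.
    with open(path, encoding="utf-8") as text_file:
        return collections.Counter(text_file.read().lower().split())
```

The style guide itself remains the reference; this sketch only shows the general shape to aim for.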
Writing Python and R Notebooks is a different programming paradigm from writing scripts, and borrows much from Literate Programming, introduced by Donald Knuth. One of the most important distinctions is as follows:
A program is given as an explanation of the program logic in a natural language, such as English, interspersed with snippets of macros and traditional source code, from which a compilable source code can be generated.
Writing good Python and R notebooks has much more to do with writing good essays and storytelling than with programming. If you are contributing code notebooks to PERCEIVE, we do not expect you to be proficient at this, but we do ask that you be eager to work with us to brush up your notebook, if necessary.
Some great examples of contributed notebooks to PERCEIVE to serve as inspiration are as follows:
- Full Disclosure Social Network Analysis, by Jeff Gerhard.
- CAPEC Foamtree Visualization -- With No Plots or CAPEC Foamtree Visualization -- With Plots, by Vignesh Rajan.
- Full Disclosure Word and Vocabulary Distributions or Full Disclosure Word and Vocabulary Distributions - With Plots, by Shashank Kava.
- And maybe in the future, yours!
Please note that most of these Notebooks were semester-long contributions. Your initial Python Notebook can be a much smaller contribution, such as one section of the listed Notebooks. After all, it is OK to be as simple as possible, but not simpler! A future contributor will still need to see some basic structure and narrative in your code notebook to make sense of what you are trying to say.
Most of PERCEIVE is currently organized in Python Notebooks and R Notebooks, as most of our code is experimental. Nonetheless, some fronts are well-defined:
- Source Crawlers: Not all data sources are readily available in a form that this project's code can use. We currently have crawlers for:
- Mailing Lists: Seclists
- Knowledge Sources: CAPEC, CWE, and CVE Details (CVE Mitre and NVD provide well-formatted XML representations of their Raw Data).
- Database: Work is currently underway to define a schema for all the source, parsed, and analyzed data of the project. Currently, the data lives in a private Mega Upload account. Note: Due to the nature of the data, Google Drive will (incorrectly) state the data has viruses. The data is only harmful if you intentionally execute the code it contains.
- CAPEC Foamtree Visualization: Interactive visualization to explore older versions of CAPEC as a tree.
- Topic Models: We currently have one implementation of LDA VEM, as proposed by Blei et al. (2003), and associated utility functions and visualizations to explore Topics in the Source Crawlers at a specific point in time.
- Topic Flow: Hosted as a separate project, we have a working fork of Topic Flow. Our fork includes the data pipeline necessary to generate the visualizations in the forked tool, and is useful for observing topics over time.
We have a strict format for submitting code (patches) to this project, including but not limited to submitting pull requests through branches and using specific branch and commit labels.
If you wish to collaborate but do not have the time to learn about Git and Github, please open an issue and we can try to work it out.
If you are unsure how to perform the steps below or are new to Git and Github, please see the Learning Resources (Section 5.) at the bottom of this document for some learning material, including free video lectures.
The step-by-step process is as follows:
- Submit an issue describing your proposed change to the repo in question.
- Please make sure you understand and agree on what files will be submitted, where in the repository they will go, and what file format to use before submitting a Pull Request.
- It is OK to upload example files or images to clarify your point during an issue discussion, but it is not OK to submit entire datasets, either in issues or as pull requests. Data should be hosted separately, and where it will be hosted in PERCEIVE should be discussed in the issue, if necessary. If your code does not request data directly from the PERCEIVE database, it should clearly indicate the final location of the data used for your analysis, so others can reproduce your results by re-running your Notebook.
- The repo owner will respond to your issue promptly.
- Fork the repo, develop and test your code changes.
- After cloning your fork, before starting to modify any file, please create a topic branch. A topic branch has the following format: `<issue-id>-<meaningful name associated to **what** you are trying to do>`.
- Example: If the issue ID #27 is about creating histograms of full disclosure word counts, then your branch should be `27-full-disclosure-word-count-histogram` (note the # symbol is not included in the branch name).
- Please include the issue ID in all commits (e.g. #27). Your commits should follow the format `<issue-id>-<meaningful commit name>`.
- Example: Continuing the example above, as you work in branch `27-full-disclosure-word-count-histogram`, one of your commits may be `#27 parse input data into data frames`, followed by `#27 plot and specify histogram ranges`. Notice that, different from topic branches, the # symbol must be included in all commit labels.
- Ensure that your code adheres to the existing Google code style in the sample to which you are contributing (see Section 1.).
- Submit a pull request using the created topic branch in step 3.1.
- GitHub will suggest the commit message as the title of your Pull Request. Please remove the issue code from the pull request title, as it becomes confusing to read next to the Pull Request's own number.
- Ensure that after clicking `New Pull Request` you select the correct branch of your fork. Ensure you are not using your fork's master branch, but instead the topic branch.
- Please note that the concept behind Github Pull Requests is synchronizing the branch from your fork with the Pull Request interface. As such, please refrain from making modifications to the associated topic branch after the Pull Request is submitted. The repo owner will review your pull request and let you know if any further modifications are necessary.
- One common request is to squash some of the commits (for example, with an interactive rebase), or re-label them.
- Please avoid deleting your fork once a Pull Request is submitted. Doing so will forcefully close the Pull Request, splitting the discussion about the same Pull Request across several new ones. This makes it much more difficult in the future for new contributors to follow up on a related contribution discussion. If you need help, please contact the repo owner in the associated issue instead of deleting your fork.
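The branch and commit naming rules in the steps above can be sketched as a few small helpers. This is only an illustration: the function names are hypothetical, the slug rule (lowercase, hyphen-separated) is an assumption inferred from the `27-full-disclosure-word-count-histogram` example, and the commit helper follows the `#27 ...` form used in the example commits.

```python
import re


def topic_branch_name(issue_id, description):
    """Builds a topic branch name from an issue number and a short description."""
    # Assumption: lowercase the description and turn every run of
    # non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    # Branch names carry the bare issue number, without the # symbol.
    return f"{issue_id}-{slug}"


def commit_label(issue_id, summary):
    """Builds a commit label that starts with '#<issue-id> '."""
    # Unlike branch names, commit labels include the # symbol.
    return f"#{issue_id} {summary.strip()}"


def is_valid_commit_label(label, issue_id):
    """Checks that a label starts with '#<issue-id> ' and has a non-empty summary."""
    prefix = f"#{issue_id} "
    return label.startswith(prefix) and bool(label[len(prefix):].strip())


print(topic_branch_name(27, "Full Disclosure word count histogram"))
print(commit_label(27, "parse input data into data frames"))
```

A check like `is_valid_commit_label` could be run by hand before pushing, but nothing in the project requires it; the repo owner's review remains the actual gate.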
The videos listed below are available for free from Udacity and may require you to create an account.
If you are completely new to Git, we recommend the introductory Git course by Udacity first. This course will teach you how to use git on your computer, but not how to contribute code to this repository.
If you are comfortable using git locally, but are new to Github, we recommend the follow-up course by the same author, also from Udacity.
Going through both courses should take less than a weekend if you follow only the videos, or no more than a week and a half if you also do the homework of both courses.
If you are used to Git and Github, but only need to brush up or review material as required, here are the relevant lecture pointers:
- If you are on Windows and having user authentication problems, you may need to remove the saved Git username and password from the Windows Credential Manager.
git log --oneline --abbrev-commit --all --graph --decorate