SeGuard Public Resources

Welcome to our project!

If you are a new member, we recommend getting started with the steps below so you can begin contributing and learning right away.

Prerequisites: To collaborate effectively with us (or with anyone doing serious engineering), you need to be comfortable with the shell and Git.

Next, get an overview of this project: ask me to send you the latest SeGuard draft paper. Feel free to ask if anything is missing or unclear :) We would rather be clear and transparent than opaque.

Depending on your interest, there are two tracks for you:

Track I: Analyzer framework

  1. Send me (Zhen) an email with your GitHub username and ask for access to the seguard-framework repo. Then follow the instructions at https://izgzhen.github.io/seguard-www/quickstart.html. Your goal is to successfully generate the graph visualization.
  2. Find an issue labeled “good-first-issue” and assign it to yourself. Some descriptions might not be clear; be sure to ask for clarification, and we are here to help!

Track II: Machine learning

  1. Start with the Python notebook classifier.ipynb, which loads DX (graph feature vectors), DY (multi-class labels), and DZ (binary-class labels). Play around with it and try a Random Forest (or any model you like) to predict the labels with 10-fold cross-validation, then note down the precision. (See the first sketch after this list.)
  2. Now, let's dig deeper to see where these features come from. The folder data/graph contains sub-folders whose names are labels. Each sub-folder contains many *.dot files; each is a Graphviz representation of an abstract graph describing some program behavior. The naive way to turn a graph into a feature vector is one-hot encoding of node and edge names, but there might be better ways. Can you first implement the naive approach (see the second sketch after this list) and then come up with something better? (e.g., encoding connectivity, using a feature-learning tool like node2vec, or using graph neural networks; search “Graph Embedding” on Google for related information, and you are welcome to discuss it with me.) Can you compare these featurization methods by using the labels as classification ground truth and computing recall and other metrics? (Related API: https://github.com/izgzhen/seguard-framework/blob/master/tools/python/seguard/graph.py)
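Here is a minimal sketch of step 1. It assumes the DX and DY arrays loaded by classifier.ipynb are a feature matrix and a label vector in the usual scikit-learn shapes, and that scikit-learn is installed; the model and parameters are just one reasonable choice.

```python
# Minimal sketch for Track II, step 1: Random Forest with 10-fold cross-validation.
# Assumes DX (n_samples x n_features) and DY (n_samples,) are already loaded by
# classifier.ipynb and that scikit-learn is available.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import precision_score, recall_score

def evaluate_rf(X, y, n_splits=10):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    # Out-of-fold predictions, so every sample is scored by a model that never saw it.
    pred = cross_val_predict(clf, X, y, cv=cv)
    return (precision_score(y, pred, average="macro"),
            recall_score(y, pred, average="macro"))

# e.g. precision, recall = evaluate_rf(DX, DY)
```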
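And here is a rough sketch of the naive one-hot featurization in step 2. It assumes the data/graph/<label>/*.dot layout described above and parses the files with networkx plus pydot; that library choice is only one option, and the seguard graph.py module linked above may give you a more direct handle on the graphs.

```python
# Naive one-hot featurization sketch for Track II, step 2.
# Assumes the layout data/graph/<label>/*.dot and that networkx + pydot are installed;
# the seguard graph.py helpers are an alternative way to load these graphs.
from pathlib import Path
import networkx as nx
import numpy as np

def load_graphs(root="data/graph"):
    """Yield (label, graph) pairs, one per .dot file under data/graph/<label>/."""
    for dot_path in sorted(Path(root).glob("*/*.dot")):
        yield dot_path.parent.name, nx.drawing.nx_pydot.read_dot(str(dot_path))

def one_hot_features(samples):
    """One-hot encode node names and (src, dst) edge pairs for each graph."""
    samples = list(samples)
    vocab = sorted({f"node:{n}" for _, g in samples for n in g.nodes} |
                   {f"edge:{u}->{v}" for _, g in samples for u, v in g.edges()})
    index = {name: i for i, name in enumerate(vocab)}
    X = np.zeros((len(samples), len(vocab)))
    for row, (_, g) in enumerate(samples):
        for n in g.nodes:
            X[row, index[f"node:{n}"]] = 1.0
        for u, v in g.edges():
            X[row, index[f"edge:{u}->{v}"]] = 1.0
    y = [label for label, _ in samples]
    return X, y, vocab

# e.g. X, y, vocab = one_hot_features(load_graphs())
#      precision, recall = evaluate_rf(X, y)   # reuse the sketch from step 1
```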

Deliverable: Please fork this repo and update classifier.ipynb with the following code and results: (1) Random Forest classification recall and precision on the vector data, using KFold with K=10; (2) featurize the graphs in the second dataset and use those features for the same classification, instead of the vectors from dataset 1.

Learning Resources (This repo)

We try to maintain a library of needed knowledge and references in this repo: https://github.com/izgzhen/seguard-resources. It may not be complete, so asking by creating an issue here is helpful!

Troubleshooting

https://izgzhen.github.io/seguard-www/troubleshooting.html lists problems you might encounter when working with the analyzer framework.

For problems regarding project code access and data access, please don’t hesitate to contact me (zgzhen cs washington edu). Sometimes I forget to set things up or to reply in time; just remind me by email!