SeGuard Public Resources
Welcome to our project!
If you are a new member, we recommend following the steps below so that you can start contributing and learning right away.
Pre-requisites: To collaborate effectively with us, or with anyone doing serious engineering, you need to learn to use a shell interface and git. For example, what does echo $(date) > date.txt mean? What is git rebase, and why is it useful?
Now read the overview of this project, and ask me to send you the newest seguard draft paper. Feel free to ask me if anything is missing or unclear to you :) It is more important to be clear and transparent than to be opaque.
Depending on your interest, there are two tracks for you:
Track I: Analyzer framework
Track II: Machine learning
Start with the notebook classifier.ipynb, which loads DX (graph feature vectors), DY (multi-class labels), and DZ (binary-class labels). Play around with it and try to use Random Forest, or whatever model you like, to predict the labels using 10-fold cross-validation. Note down the precision.
In data/graph, there are folders whose names are labels. Each sub-folder contains many *.dot files. Each dot file is an abstract graph in Graphviz format that represents some program behavior.
The naive way to transform a graph into a feature vector is to one-hot encode the names of its edges and nodes. However, there might be a better way to do it.
Can you first implement the naive approach, and then come up with a better one? (E.g., encode connectivity, use a feature learning tool like node2vec, or use graph neural networks. Search for “Graph Embedding” on Google for related information. You are welcome to discuss this with me.)
Can you compare these featurization methods by using the labels as classification ground truth and calculating metrics such as recall?
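As a starting point for the naive approach, here is a minimal sketch of one-hot encoding node and edge names. It assumes each graph has already been parsed from its *.dot file into a plain dictionary with a "nodes" list and an "edges" list of (source, edge_name, target) triples. That dictionary layout and the function names are hypothetical, chosen for illustration only, and are not the format used by the seguard API.

```python
from typing import Dict, List, Set

# Hypothetical in-memory graph layout (an assumption, not the seguard format):
# {"nodes": [name, ...], "edges": [(source, edge_name, target), ...]}
Graph = Dict[str, list]


def graph_tokens(graph: Graph) -> Set[str]:
    """Collect the node-name and edge-name tokens that occur in one graph."""
    tokens = {f"node:{name}" for name in graph["nodes"]}
    tokens |= {f"edge:{name}" for _, name, _ in graph["edges"]}
    return tokens


def build_vocabulary(graphs: List[Graph]) -> Dict[str, int]:
    """Assign a fixed column index to every token seen in any graph."""
    all_tokens = set().union(*(graph_tokens(g) for g in graphs))
    return {token: i for i, token in enumerate(sorted(all_tokens))}


def one_hot_featurize(graph: Graph, vocab: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector marking which vocabulary tokens the graph contains."""
    vector = [0] * len(vocab)
    for token in graph_tokens(graph):
        index = vocab.get(token)  # tokens outside the vocabulary are ignored
        if index is not None:
            vector[index] = 1
    return vector


# Two toy graphs with made-up names, only to show the shape of the output.
g1 = {"nodes": ["A", "B"], "edges": [("A", "calls", "B")]}
g2 = {"nodes": ["A", "C"], "edges": [("A", "reads", "C")]}
vocab = build_vocabulary([g1, g2])
features = [one_hot_featurize(g, vocab) for g in (g1, g2)]
print(sorted(vocab, key=vocab.get))
print(features)
```

This encoding only records which names are present, so two graphs with the same names but different wiring get the same vector. That limitation is what the connectivity-aware alternatives mentioned above try to address.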
(Related API: https://github.com/izgzhen/seguard-framework/blob/master/tools/python/seguard/graph.py)
Deliverable: Please fork this repo and update classifier.ipynb accordingly with the following code and results:
(1) Random forest classification recall and precision on the vector data, using the KFold method (K=10).
(2) Featurize the graphs in the second dataset and use the result for the above classification instead of the vectors shown in dataset 1.
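For deliverable (1), the evaluation loop might look roughly like the sketch below, which uses scikit-learn's RandomForestClassifier, KFold, and cross_validate. The synthetic X and y are stand-ins for DX and DY, since how the notebook loads them is not shown here. The function name, the macro averaging, and the hyperparameters are assumptions rather than project settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate


def evaluate_random_forest(X, y, n_splits=10, seed=0):
    """Run K-fold cross-validation and return per-fold precision and recall."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # Shuffle so that samples stored label by label do not end up in one fold.
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_validate(
        clf, X, y, cv=cv, scoring=["precision_macro", "recall_macro"]
    )
    return {
        "precision": scores["test_precision_macro"],
        "recall": scores["test_recall_macro"],
    }


# Synthetic stand-in data (an assumption): replace with DX and DY in the notebook.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

result = evaluate_random_forest(X, y)
print("mean precision:", result["precision"].mean())
print("mean recall:", result["recall"].mean())
```

If some labels are rare, StratifiedKFold from the same module keeps the class proportions similar across folds and may give more stable numbers than plain KFold.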
We are trying to maintain a library of the needed knowledge and references in this repo: https://github.com/izgzhen/seguard-resources. It might not be complete, so if something is missing, asking by creating an issue there would be helpful!
https://izgzhen.github.io/seguard-www/troubleshooting.html lists problems you might encounter when playing with the analyzer framework.
For problems regarding project code access and data access, please don’t hesitate to contact me (zgzhen cs washington edu). Sometimes I forget to set things up or to reply in time; if so, just remind me by email!