StateFormer: Fine-Grained Type Recovery from Binaries using Generative State Modeling
Wed 25 Aug 2021 20:20 - 20:30 - SE & AI—Machine Learning for Software Engineering 1 Chair(s): Kelly Lyons, Phuong T. Nguyen
Binary type inference is a critical reverse engineering task supporting many security applications, including vulnerability analysis, binary hardening, forensics, and decompilation. It is a difficult task because source-level type information is often stripped during compilation, leaving only binaries with untyped memory and register accesses. Existing approaches rely on hand-coded type inference rules defined by domain experts, which are brittle and require nontrivial effort to maintain and update. Even though machine learning approaches have shown promise at automatically learning the inference rules, their accuracy is still low, especially for optimized binaries.
We present StateFormer, a new neural architecture that is adept at accurate and robust type inference. StateFormer follows a two-step transfer learning paradigm. In the pretraining step, the model is trained with Generative State Modeling (GSM), a novel task that we design to teach the model to statically approximate execution effects of assembly instructions in both forward and backward directions. In the finetuning step, the pretrained model learns to use its knowledge of operational semantics to infer types.
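To make the forward and backward directions of GSM concrete, the following toy sketch builds training pairs from a tiny register machine. It is an illustration under loud assumptions, not the authors' implementation: the three-opcode instruction set, the `step`, `trace`, and `gsm_example` helpers, and the dictionary layout of each example are all hypothetical. It shows only the shape of the data (code plus one known state as input, the remaining states as prediction targets) and contains no neural model.

```python
from typing import Dict, List, Tuple, Union

State = Dict[str, int]
Instr = Tuple[str, str, Union[str, int]]  # (opcode, destination register, source register or immediate)


def step(state: State, instr: Instr) -> State:
    """Apply one toy instruction and return the new register state (hypothetical ISA)."""
    op, dst, src = instr
    val = state[src] if isinstance(src, str) else src
    new = dict(state)
    if op == "mov":
        new[dst] = val
    elif op == "add":
        new[dst] = (state[dst] + val) & 0xFFFFFFFF  # wrap to 32 bits
    elif op == "sub":
        new[dst] = (state[dst] - val) & 0xFFFFFFFF
    else:
        raise ValueError(f"unsupported opcode: {op}")
    return new


def trace(program: List[Instr], init: State) -> List[State]:
    """Concretely execute the program, recording the state before and after each instruction."""
    states = [dict(init)]
    for instr in program:
        states.append(step(states[-1], instr))
    return states


def gsm_example(program: List[Instr], init: State, direction: str) -> dict:
    """Build one toy training pair: code plus one known state in, the other states out."""
    states = trace(program, init)
    if direction == "forward":
        # Given the initial state, the targets are all later states.
        return {"code": program, "given": states[0], "targets": states[1:]}
    if direction == "backward":
        # Given the final state, the targets are all earlier states.
        return {"code": program, "given": states[-1], "targets": states[:-1]}
    raise ValueError(f"unknown direction: {direction}")


program = [("mov", "eax", 5), ("add", "eax", "ebx"), ("sub", "ebx", 1)]
init = {"eax": 0, "ebx": 2}
fwd = gsm_example(program, init, "forward")
bwd = gsm_example(program, init, "backward")
print(fwd["targets"][-1])  # final state reached from the initial one
print(bwd["targets"][0])   # initial state, to be recovered from the final one
```

In this sketch the traces come from concrete execution of the toy program; a model trained on such pairs would have to predict the target states from the code and the single given state, which is one plausible reading of "statically approximate execution effects" in the abstract.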
We evaluate StateFormer's performance on a corpus of 33 popular open-source software projects containing over 1.67 billion variables of different types. The programs are compiled with GCC and LLVM over 4 optimization levels O0-O3, and 3 obfuscation passes based on LLVM. Our model significantly outperforms state-of-the-art ML-based tools by 14.6% in recovering types for both function arguments and variables. Our ablation studies show that GSM improves type inference accuracy by 33%.
Wed 25 Aug. Displayed time zone: Athens
08:00 - 09:00 | SE & AI—Machine Learning for Software Engineering 1 | Research Papers +12h | Chair(s): Michael Pradel (University of Stuttgart), Ivica Crnkovic (Chalmers University of Technology)
08:00 10m Paper | Boosting Coverage-Based Fault Localization via Graph-Based Representation Learning | Research Papers | Yiling Lou (Purdue University), Qihao Zhu (Peking University), Jinhao Dong (Peking University), Xia Li (Kennesaw State University), Zeyu Sun (Peking University), Dan Hao (Peking University), Lu Zhang (Peking University), Lingming Zhang (University of Illinois at Urbana-Champaign)
08:10 10m Paper | SynGuar: Guaranteeing Generalization in Programming by Example | Research Papers | Bo Wang (National University of Singapore), Teodora Baluta (National University of Singapore), Aashish Kolluri (National University of Singapore), Prateek Saxena (National University of Singapore)
08:20 10m Paper | StateFormer: Fine-Grained Type Recovery from Binaries using Generative State Modeling | Research Papers | Kexin Pei (Columbia University), Jonas Guan (University of Toronto), Matthew Broughton (Columbia University), Zhongtian Chen (Columbia University), Songchen Yao (Columbia University), David Williams-King (Columbia University), Vikas Ummadisetty (Dublin High School), Junfeng Yang (Columbia University), Baishakhi Ray (Columbia University), Suman Jana (Columbia University)
08:30 30m Live Q&A | Q&A (SE & AI—Machine Learning for Software Engineering 1) | Research Papers
20:00 - 21:00 | SE & AI—Machine Learning for Software Engineering 1 | Research Papers | Chair(s): Kelly Lyons (University of Toronto), Phuong T. Nguyen (University of L’Aquila)
20:00 10m Paper | Boosting Coverage-Based Fault Localization via Graph-Based Representation Learning | Research Papers | Yiling Lou (Purdue University), Qihao Zhu (Peking University), Jinhao Dong (Peking University), Xia Li (Kennesaw State University), Zeyu Sun (Peking University), Dan Hao (Peking University), Lu Zhang (Peking University), Lingming Zhang (University of Illinois at Urbana-Champaign)
20:10 10m Paper | SynGuar: Guaranteeing Generalization in Programming by Example | Research Papers | Bo Wang (National University of Singapore), Teodora Baluta (National University of Singapore), Aashish Kolluri (National University of Singapore), Prateek Saxena (National University of Singapore)
20:20 10m Paper | StateFormer: Fine-Grained Type Recovery from Binaries using Generative State Modeling | Research Papers | Kexin Pei (Columbia University), Jonas Guan (University of Toronto), Matthew Broughton (Columbia University), Zhongtian Chen (Columbia University), Songchen Yao (Columbia University), David Williams-King (Columbia University), Vikas Ummadisetty (Dublin High School), Junfeng Yang (Columbia University), Baishakhi Ray (Columbia University), Suman Jana (Columbia University)
20:30 30m Live Q&A | Q&A (SE & AI—Machine Learning for Software Engineering 1) | Research Papers