Protocol reverse engineering has often been a manual process that is considered time-consuming, tedious and error-prone. To address this limitation, a number of solutions have recently been proposed to allow for automatic protocol reverse engineering. Unfortunately, they are either limited in extracting protocol fields due to lack of program semantics in network traces or primitive in only revealing the flat structure of protocol format. In this paper, we present a system called AutoFormat that aims at not only extracting protocol fields with high accuracy, but also revealing the inherently “non-flat”, hierarchical structures of protocol messages. AutoFormat is based on the key insight that different protocol fields in the same message are typically handled in different execution contexts (e.g., the runtime call stack). As such, by monitoring the program execution, we can collect the execution context information for every message byte (annotated with its offset in the entire message) and cluster them to derive the protocol format. We have evaluated our system with more than 30 protocol messages from seven protocols, including two text-based protocols (HTTP and SIP), three binary-based protocols (DHCP, RIP, and OSPF), one hybrid protocol (CIFS/SMB), as well as one unknown protocol used by a real-world malware. Our results show that AutoFormat can not only identify individual message fields automatically and with high accuracy (an average 93:4% match ratio compared with Wireshark), but also unveil the structure of the protocol format by revealing possible relations (e.g., sequential, parallel, and hierarchical) among the message fields.
Right now we have two versions of AutoFormat, a Valgrind based and a QEMU based. If you want to play with it, write to us.