
Source Code Authorship Attribution (YU-SCAA-2022)

Source Code Authorship Attribution (SCAA) is the task of identifying the author of a piece of source code from among the authors in a corpus. Although it poses a privacy threat to open-source programmers, it is useful in forensic applications such as ghostwriting detection, copyright dispute settlement, identifying the authors of malicious applications from their source code, and other code analysis tasks. This dataset was created by extracting code from the GCJ (Google Code Jam) and GitHub datasets, and it includes examples of attacks and adversarial examples generated using the Source Code Imitator. The dataset has a total of 1,632 code files from 204 authors, an average of 8 files per author.
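The stated totals can be checked after download with a short script. The following is a minimal sketch, not part of the dataset's tooling: it assumes the archive unpacks into one folder per author, which this page does not confirm, and the function name `files_per_author` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def files_per_author(root):
    """Count the files under each immediate subdirectory of `root`.

    Assumption (not confirmed by the dataset page): each author has
    their own folder directly under `root`.
    """
    counts = Counter()
    for author_dir in Path(root).iterdir():
        if author_dir.is_dir():
            # Count regular files at any depth below the author folder.
            counts[author_dir.name] = sum(
                1 for p in author_dir.rglob("*") if p.is_file()
            )
    return counts
```

If the layout matches that assumption, `len(counts)` should equal the 204 authors and `sum(counts.values())` the 1,632 files reported above. A different layout would need a different grouping rule.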

The full research paper outlining the details of the dataset and its underlying principles:

"AuthAttLyzer: A Robust defensive distillation-based Authorship Attribution framework", Abhishek Chopra, Nikhill Vombatkere, Arash Habibi Lashkari, The 12th International Conference on Communication and Network Security (ICCNS), 2022, China

Download Dataset: