As part of our Understanding Cybersecurity Series (UCS) program, we create and release public cybersecurity datasets to help you to analyze, test and evaluate your models, solutions and tools.
4. DNS over HTTPS ( BCCC-CIRA-CIC-DoHBrw-2020 )
The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset was created to address the class imbalance in the 'CIRA-CIC-DoHBrw-2020' dataset. The original 'CIRA-CIC-DoHBrw-2020' dataset is skewed, with about 90% malicious and only 10% benign DNS over HTTPS (DoH) network traffic. The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset instead contains equal numbers of malicious and benign DoH network traffic instances, with 249,836 instances in each category. This balance was achieved using the Synthetic Minority Over-sampling Technique (SMOTE). The dataset comprises three CSV files: one for malicious DoH traffic, one for benign DoH traffic, and a third that combines both types.
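To illustrate the idea behind SMOTE, the following is a minimal sketch using only the Python standard library. It is not the pipeline used to build the dataset: the function name `smote`, the parameter `k`, the two-feature toy rows, and the class sizes are all assumptions chosen for illustration. SMOTE creates each synthetic minority sample by picking a minority sample, choosing one of its k nearest minority neighbours, and interpolating at a random point on the segment between them.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority-class samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbours among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy stand-in (hypothetical values): 3 benign rows against 9 malicious rows.
benign = [(0.10, 1.0), (0.20, 1.1), (0.15, 0.9)]
malicious_count = 9

# Oversample the benign (minority) class until both classes are the same size.
new_rows = smote(benign, malicious_count - len(benign), k=2)
balanced_benign = benign + new_rows
print(len(balanced_benign))  # 9
```

Because every synthetic row lies between two real minority rows, the oversampled class stays inside the region already covered by the real benign traffic features.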
The full research paper outlining the details of the dataset and its underlying principles:
“Unveiling DoH Tunnel: Toward Generating a Balanced DoH Encrypted Traffic Dataset and Profiling malicious Behaviour using Inherently Interpretable Machine Learning”, Sepideh Niktabe, Arash Habibi Lashkari, Arousha Haghighian Roudsari, Peer-to-Peer Networking and Applications, Vol. 17, 2023
Download Dataset:
Will be available soon ...
3. Vulnerable Smart Contracts (BCCC-VulSCs-2023)
The BCCC-VulSCs-2023 dataset is a substantial collection for Solidity Smart Contracts (SCs) analysis, comprising 36,670 samples, each enriched with 70 feature columns. These features include the raw source code of the smart contract, a hashed version of the source code for secure referencing, and a binary label that indicates a contract as secure (0) or vulnerable (1). The dataset's extensive size and comprehensive features make it a valuable resource for machine-learning models to predict contract behavior, identify patterns, or classify contracts based on security and functionality criteria.
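As a rough illustration of the three columns named above, the sketch below builds rows holding the raw source, a hash of that source used as a reference key, and a 0/1 label. The column names (`source_code`, `code_hash`, `label`), the choice of SHA-256, and the two toy contracts are assumptions; the dataset's actual column names and hash function are not stated here, and its remaining feature columns are omitted.

```python
import hashlib
from collections import Counter


def make_record(source_code: str, vulnerable: bool) -> dict:
    """Build one row: raw source, a hash of the source, and a binary label."""
    # SHA-256 is an assumed choice; the dataset's hash function is not specified.
    digest = hashlib.sha256(source_code.encode("utf-8")).hexdigest()
    return {
        "source_code": source_code,
        "code_hash": digest,
        "label": int(vulnerable),  # 0 = secure, 1 = vulnerable
    }


# Two hypothetical contracts, used only to show the row layout.
rows = [
    make_record("contract A { function f() public {} }", vulnerable=False),
    make_record("contract B { function g() public payable {} }", vulnerable=True),
]

# Count how many rows carry each label.
label_counts = Counter(row["label"] for row in rows)
print(label_counts[0], label_counts[1])  # 1 1
```

Hashing the source gives each contract a fixed-length key, so rows can be referenced or deduplicated without comparing full source text.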
The full research paper outlining the details of the dataset and its underlying principles:
“Unveiling Vulnerable Smart Contracts: Toward Profiling Vulnerable Smart Contracts using Genetic Algorithm and Generating Benchmark Dataset”, Sepideh Hajihosseinkhani, Arash Habibi Lashkari, Ali Mizani, Blockchain: Research and Applications, Vol. 4, 2023
Download Dataset:
Will be available soon ...
2. SQL Injection Attack (BCCC-SFU-SQLInj-2023)
This dataset consists of a collection of 11,012 evasive or sophisticated malicious SQL queries. These queries are generated using a genetic algorithm applied to the Kaggle malicious SQL dataset. The goal of the genetic algorithm is to enhance the evasiveness and sophistication of the original malicious queries.
The full research paper outlining the details of the dataset and its underlying principles:
"An Evolutionary Algorithm for Adversarial SQL Injection Attack Generation", Maryam Issakhani, Mufeng Huang, Mohammad A. Tayebi, Arash Habibi Lashkari, IEEE Intelligence and Security Informatics (ISI2023), NC, USA
Download Dataset:
1. Source Code Authorship Attribution (YU-SCAA-2022)
Source Code Authorship Attribution (SCAA) is the technique of identifying the real author of source code in a corpus. Although it poses a privacy threat to open-source programmers, it has been shown to be significantly helpful in developing forensic applications such as ghostwriting detection, copyright dispute settlement, identifying the authors of malicious applications from their source code, and other code analysis applications. This dataset was created by extracting code data from the GCJ and GitHub datasets. It also includes attack and adversarial examples created using Source Code Imitator. The dataset contains a total of 1,632 code files from 204 authors.
The full research paper outlining the details of the dataset and its underlying principles:
“AuthAttLyzer: A Robust defensive distillation-based Authorship Attribution framework”, Abhishek Chopra, Nikhill Vombatkere, Arash Habibi Lashkari, The 12th International Conference on Communication and Network Security (ICCNS), 2022, China
Download Dataset: