In the rapidly evolving world of blockchain technology, Solidity smart contracts have become a cornerstone, powering a wide range of decentralized applications (dApps) across finance, governance, and the arts. However, the immutability of contracts deployed on blockchain networks like Ethereum presents a significant challenge: once a contract is deployed, its code cannot be patched in place, so any vulnerability or bug persists and can lead to substantial financial losses. This underscores the critical need for robust testing, analysis, and repair of smart contract code before deployment.
Historically, the development and evaluation of tools for smart contract analysis have been hindered by the lack of comprehensive, real-world smart contract datasets. Many existing datasets are either outdated or too small and too homogeneous to represent the complex landscape of real-world applications. This gap has limited the effectiveness of tools designed for smart contract testing, analysis, and repair, leaving them unable to prevent recent, successful attacks on deployed projects.
Enter DISL (Dataset of Solidity smart contracts), a groundbreaking solution to this pervasive challenge. Created by a team of researchers from KTH Royal Institute of Technology in Stockholm, DISL addresses the critical need for a large and diverse dataset of real-world smart contracts. With 514,506 unique Solidity files sourced from verified smart contracts deployed to the Ethereum mainnet, DISL stands as the largest and most recent dataset of its kind, significantly surpassing existing collections in both size and recency.
What makes DISL stand out?
- Comprehensiveness: By aggregating every verified smart contract from Etherscan up to January 15, 2024, DISL offers unparalleled breadth and depth. This includes a significant subset of 7,188 smart contracts written in Vyper, making it the largest dataset of Vyper contracts currently available.
- Diversity and Real-world Relevance: The dataset covers a broad spectrum of applications, from DeFi to art, ensuring that researchers and developers have access to a wide variety of smart contract code for analysis, tool development, and AI-based tasks.
- Quality and Integrity: DISL includes only the verified source code of smart contracts, ensuring that the dataset comprises solely real, in-use contracts. This is critical for the development of effective tools and methodologies for smart contract analysis and security.
- Accessibility and Usability: Published on Hugging Face, DISL is readily accessible to researchers and practitioners, presented in a user-friendly format that facilitates easy integration into existing workflows and toolchains.
What are the use cases of DISL?
- AI and Machine Learning: With its vast collection of deduplicated contract files, DISL is an ideal candidate for training and fine-tuning AI models, including large language models (LLMs), for smart contract analysis, synthesis, and repair tasks.
- Benchmarking and Evaluation: The dataset provides a unique resource for benchmarking the performance of smart contract analysis tools. Given its composition of real-world, verified contracts, DISL offers a new standard against which tool efficacy can be measured.
- Empirical Research: The availability of a diverse, real-world dataset like DISL opens up new avenues for empirical studies in smart contract security, allowing researchers to investigate the prevalence of vulnerabilities and assess the impact of various mitigation strategies.
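As a small illustration of the empirical-research use case, the sketch below counts which Solidity version constraints appear across a collection of source files. It is not taken from the DISL paper: the function names and the sample strings are hypothetical, and the regular expression is a deliberately simple assumption that does not skip pragmas appearing inside comments.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches e.g. `pragma solidity ^0.8.20;` and captures the version constraint.
# ASSUMPTION: a plain regex is good enough for a first survey; it does not
# ignore pragmas that happen to appear inside comments or string literals.
PRAGMA_RE = re.compile(r"pragma\s+solidity\s+([^;]+);")


def pragma_constraints(source: str) -> List[str]:
    """Return every Solidity version constraint declared in one source file."""
    return [match.strip() for match in PRAGMA_RE.findall(source)]


def count_pragmas(sources: Iterable[str]) -> Counter:
    """Tally version constraints over any iterable of source-code strings."""
    counts: Counter = Counter()
    for source in sources:
        counts.update(pragma_constraints(source))
    return counts


# Hand-written stand-ins for contract files (hypothetical, not DISL rows).
samples = [
    "pragma solidity ^0.8.20;\ncontract A {}",
    "pragma solidity ^0.8.20;\ncontract B {}",
    "pragma solidity >=0.6.0 <0.9.0;\ninterface I {}",
]

for constraint, n in count_pragmas(samples).most_common():
    print(f"{n:>3}  {constraint}")
```

Because `count_pragmas` accepts any iterable of strings, the same function could be fed the source-code column of the dataset instead of the toy samples.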
Where can you find DISL?
The dataset paper is available on arXiv: https://arxiv.org/abs/2403.16861
The dataset files are available on Hugging Face: https://huggingface.co/datasets/ASSERT-KTH/DISL
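A minimal sketch of how the dataset might be read with the Hugging Face `datasets` library, streaming rows rather than downloading everything up front. The repository ID comes from the URL above. The configuration name `"decomposed"` and the split name `"train"` are assumptions that should be checked against the dataset card, and `stream_disl` and `preview` are hypothetical helper names.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

DATASET_ID = "ASSERT-KTH/DISL"  # taken from the Hugging Face URL


def stream_disl(config: str = "decomposed", split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily iterate over dataset rows without a full download.

    ASSUMPTION: the config and split names are guesses; verify them on the
    dataset card before relying on this.
    """
    # Imported here so the helpers below work even without `datasets` installed
    # (install with: pip install datasets).
    from datasets import load_dataset

    return iter(load_dataset(DATASET_ID, config, split=split, streaming=True))


def preview(rows: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Materialize only the first n rows of a (possibly very large) stream."""
    return list(islice(rows, n))


def column_names(rows: List[Dict[str, Any]]) -> List[str]:
    """Sorted union of the keys seen in the given rows."""
    return sorted({key for row in rows for key in row})
```

Calling `column_names(preview(stream_disl()))` would then list the available fields, which is a reasonable first step before writing any code that depends on specific column names.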
How to cite our dataset paper?
Use the following BibTeX entry:
@misc{DISLDataset,
title = {{{DISL}}: {{Fueling Research}} with {{A Large Dataset}} of {{Solidity Smart Contracts}}},
shorttitle = {{{DISL}}},
author = {Morello, Gabriele and Eshghie, Mojtaba and Bobadilla, Sofia and Monperrus, Martin},
year = {2024},
month = mar,
number = {arXiv:2403.16861},
eprint = {2403.16861},
primaryclass = {cs},
publisher = {arXiv},
urldate = {2024-03-26},
abstract = {The DISL dataset features a collection of 514,506 unique Solidity files that have been deployed to Ethereum mainnet. It caters to the need for a large and diverse dataset of real-world smart contracts. DISL serves as a resource for developing machine learning systems and for benchmarking software engineering tools designed for smart contracts. By aggregating every verified smart contract from Etherscan up to January 15, 2024, DISL surpasses existing datasets in size and recency.},
archiveprefix = {arxiv},
keywords = {{Computer Science - Distributed, Parallel, and Cluster Computing},Computer Science - Machine Learning,Computer Science - Software Engineering},
}