Unlocking the Potential of Data Reuse for Computer Security Researchers 

In the age of big data, researchers are constantly seeking ways to accelerate discovery, unlock cross-disciplinary breakthroughs and conserve scarce resources. One key strategy is dataset reuse: building on existing data that may have taken months or even years to collect and develop.

To understand the importance of data reuse, consider an example from the golden age of hand-drawn animation. At Walt Disney Studios, animators reused sheets of transparent celluloid (“cels”) across multiple frames to reduce production time and increase consistency. This approach allowed them to create complex animations with limited resources. 

In a similar way, scientists today can use previously developed datasets to build upon existing work and explore new research questions, while also ensuring the reproducibility of results. 

Within the Computer Security community, however, the patterns and motivations behind data reuse remain poorly understood. Anna Crowder, a Ph.D. student in Computer and Information Science and Engineering (CISE) and researcher within the Florida Institute for Cybersecurity Research (FICS), recently led a study addressing this knowledge gap by investigating the state of dataset generation and reuse in Computer Security and comparing these findings with the Measurement community.

“Datasets are one of the research community’s most valuable assets, the creation of which takes months and sometimes even years of effort,” Crowder said. “We need to ensure these assets are being used to their fullest by reusing them to support additional research insights.” 

This work was recently presented at the 2025 IEEE Symposium on Security and Privacy, one of the most prestigious academic venues for computer security research. 

The study analyzed 948 dataset papers published over the past five years at top-tier academic conferences in Computer Security and Measurement. Each paper created a standalone data asset that exists as an artifact beyond what appears in the paper. The results show that dataset reuse has a long way to go before it becomes common practice.

Over half of the papers in Measurement and almost 60% of Computer Security papers did not provide statements about the availability of their data. Among the ones that did, most had either fully or partially accessible data, but documentation about the datasets is inconsistent and not always understandable, even when the research paper associated with the dataset is available.

The study also identified the drivers and barriers to data reuse. Positive drivers include reduced effort, the ability to benchmark and validate, and the potential for novel angles on existing observations. 

However, missing licensing, inconsistent or absent metadata and technical friction can limit the reuse of data. Moreover, the study compared its findings with highly cited Machine Learning literature on best practices for developing datasets and found that these practices do not directly translate to the Security community.

“This work demonstrates that as a community, we need to provide better solutions for ensuring that valuable datasets can be reused by other researchers. To do this, we’ve developed new templates for documentation and classifying datasets, designed for the needs of Computer Security researchers,” said FICS Director Kevin Butler, Ph.D., professor of Computer and Information Science and Engineering and lead principal investigator on the study. 

To overcome these barriers, researchers can take several steps. 

First, they can ensure their data is properly licensed and the terms of use are clear. Second, they can provide standardized metadata, such as descriptions of the data and its collection methods, to make it easier for others to understand and use the data. Finally, they can provide easy-to-navigate documentation, such as data dictionaries and user guides, to facilitate the reuse of their data.
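As a rough illustration of these three steps, a research group might keep a small machine-readable record next to each dataset and check it before release. The Python sketch below is only an assumption about what such a check could look like: the `DatasetRecord` class, its field names and the `missing_reuse_fields` helper are hypothetical and are not the documentation templates developed in the study.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    name: str
    license: str = ""              # terms of use, e.g. an SPDX identifier
    description: str = ""          # what the data contains
    collection_method: str = ""    # how the data was gathered
    # Maps each column or field name to a plain-language explanation.
    data_dictionary: Dict[str, str] = field(default_factory=dict)


def missing_reuse_fields(record: DatasetRecord) -> List[str]:
    """Return the names of reuse-related fields that are still empty."""
    checks = {
        "license": record.license.strip(),
        "description": record.description.strip(),
        "collection_method": record.collection_method.strip(),
        "data_dictionary": record.data_dictionary,
    }
    return [name for name, value in checks.items() if not value]


# Example: a record with a license and description but nothing else filled in.
example = DatasetRecord(
    name="example-traces",
    license="CC-BY-4.0",
    description="Synthetic example dataset.",
)
print(missing_reuse_fields(example))
```

For the example record, the check reports that the collection method and the data dictionary are still missing. A check like this only confirms that the fields are filled in. It cannot judge whether the documentation is understandable to other researchers.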

The findings of this study have significant implications for researchers, institutions and funding agencies. 

By embracing data reuse, scientists can accelerate discovery, unlock cross-disciplinary breakthroughs and conserve scarce resources. Moreover, the adoption of data reuse best practices can foster a culture of collaboration and innovation, ultimately driving progress in various fields of research. 

“As the volume and complexity of research data continues to grow, it is essential that we prioritize data reuse, embracing the opportunities and challenges that it presents. By doing so, we can unlock the full potential of data reuse, driving innovation and progress across disciplines,” said Crowder. 

This study was supported, in part, by PRISM, the National Science Foundation Center for Privacy and Security for Marginalized and Vulnerable Populations led by UF.