Repository Overview
This repository contains the preprocessed data used in the Audio Deepfake Detection task. The task was composed of two main parts: the definition of a backbone model to perform detection on a single dataset, and the search for an approach that could guarantee high detection rates in a continual learning scenario. For the former, the backbone model was trained on the training partition of the ASVspoof2019 dataset, which includes English speech samples. The model was then evaluated on the development and evaluation partitions of the same dataset, as well as on the ASVspoof2021 dataset, which contains similar samples but with lower audio quality that makes detection harder. For the continual learning task, three additional datasets were evaluated: ADD2022, FakeOrReal (FoR), and InTheWild (ITW).
This repository contains only the preprocessed data for the project. The code will be shared in the future: the work performed for the project has also been submitted as a scientific paper, and its evaluation is pending. Once it is accepted, the code will be added to this repository and this file will be updated accordingly.
The data is provided as 2D arrays of size 128 x 42, where 128 is the number of Mel-Frequency Cepstral Coefficients (MFCCs) and 42 is the number of frames per instance. Instances are extracted from the original audio around saliency points: both Silence To Voice (STV) and Voice To Silence (VTS) transitions were considered, but the former yielded better results and was therefore the approach adopted. Regarding the selection of instances, multiple experiments were performed: the original setup considered each saliency point in isolation, while later studies also took into account the values of the frames surrounding each saliency point. The length of 42 frames per instance was set experimentally so as to obtain meaningful data while ensuring that not too many audio files had to be discarded.
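As a rough illustration of the windowing logic described above (not the project's actual code, which has not yet been released): given a per-frame feature matrix of shape (128, T) and a per-frame energy vector, an STV instance is the 42-frame window starting at each silence-to-voice transition. The function name and the energy threshold below are assumptions made for the sketch.

```python
import numpy as np

N_MFCC, N_FRAMES = 128, 42  # instance size used in this repository


def stv_instances(features, energy, thresh=0.01):
    """Slice fixed-size (128 x 42) instances starting at each
    Silence To Voice (STV) saliency point.

    features : (128, T) per-frame features (e.g. MFCCs)
    energy   : (T,) per-frame energy used to flag voiced frames
    thresh   : energy threshold separating silence from voice (assumed value)
    """
    voiced = energy > thresh
    # STV points: a silent frame immediately followed by a voiced one
    stv = np.flatnonzero(~voiced[:-1] & voiced[1:]) + 1
    # Keep only windows that fit entirely inside the clip; instances
    # falling too close to the end of the audio are discarded
    return [features[:, p:p + N_FRAMES]
            for p in stv if p + N_FRAMES <= features.shape[1]]
```

For example, a 100-frame clip whose first 30 frames are silent yields a single (128, 42) instance starting at frame 30.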
Folder/File Structure
The data is divided into subfolders, each indicating the scope of its contents.
- OLD_FILES: contains the data described above, extracted with the original setup.
- Test_data: contains the data used for the evaluation in the continual learning scenario.
- Training_data: contains the updated data used in the training process of the backbone model.
- VTS_Training_data: similar to the Training_data folder, but for Voice To Silence (VTS) saliency points.