This repository contains the dataset and code for our research on concept drift in Android malware detection. LAMDA is designed to help researchers analyze the evolving nature of Android malware by capturing temporal variations and distribution shifts over time.
Note: The dataset (both parquet and npz) is already available on Zenodo. You can just download and start working with it directly if you don’t wish to add new samples. The dataset also includes a train–test split.
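AndroZoo-extractor.py: samples are labeled by vt_detection (the number of VirusTotal engines that flag the APK):
vt_detection == 0: benign
vt_detection >= 4: malware
1 <= vt_detection <= 3: uncertain (neither benign nor malware)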
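The thresholds above can be read as a simple labeling rule. The following is a minimal sketch, not the code in AndroZoo-extractor.py: the function name is hypothetical, and the mapping (0 is benign, 4 or more is malware, 1 to 3 is left unlabeled) is an assumption based on the malware/ and benign/ folders used in this repository.

```python
from typing import Optional


def label_from_vt_detection(vt_detection: int) -> Optional[str]:
    """Map a VirusTotal detection count to a label (assumed mapping)."""
    if vt_detection < 0:
        raise ValueError("vt_detection must be non-negative")
    if vt_detection == 0:
        return "benign"
    if vt_detection >= 4:
        return "malware"
    # 1 <= vt_detection <= 3: assumed to be too uncertain to label
    return None


if __name__ == "__main__":
    for count in (0, 2, 7):
        print(count, label_from_vt_detection(count))
```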
├── malware/
│ ├── 2013/hashes.txt
│ ├── 2014/hashes.txt
│ └── ...
├── benign/
│ ├── 2013/hashes.txt
│ ├── 2014/hashes.txt
│ └── ...
Use APKDownloader.sh to download samples as needed:
bash APKDownloader.sh /benign/2018/hashes.txt 2018 ben YOUR_ANDROZOO_API_KEY
bash APKDownloader.sh /malware/2018/hashes.txt 2018 mal YOUR_ANDROZOO_API_KEY
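For readers who prefer Python, the sketch below reads a hashes.txt file and builds one AndroZoo download URL per hash. It is an illustration only and makes no claim about how APKDownloader.sh works internally: the helper names are hypothetical, one SHA-256 hash per line is assumed, and the URL follows the publicly documented AndroZoo download endpoint (apikey and sha256 query parameters). No network request is made.

```python
from pathlib import Path
from urllib.parse import urlencode

# Public AndroZoo download endpoint (takes apikey and sha256 parameters).
ANDROZOO_DOWNLOAD = "https://androzoo.uni.lu/api/download"


def read_hashes(path):
    """Return the non-empty lines of a hash list (assumed: one hash per line)."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def download_url(sha256, api_key):
    """Build the download URL for one APK; nothing is downloaded here."""
    query = urlencode({"apikey": api_key, "sha256": sha256})
    return f"{ANDROZOO_DOWNLOAD}?{query}"
```

Calling download_url(h, "YOUR_ANDROZOO_API_KEY") for each h in read_hashes("hashes.txt") yields the URLs to fetch with a tool of your choice.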
python ./dataset_preparation/drebin-feature-extractor/extractor.py \
--input_dir ./androzoo_malware/2013/ \
--result_dir ./output/malware/2013/
python virustotal-downloader.py all_year_malware_hashes.txt
python jsoncompact.py <directory_with_json_files>
avclass -d ./all_year_vt_json/ -hash sha256 -o labels.txt
After downloading and decompiling all the APKs, create LAMDA as follows:
First, split all the .data files into train and test sets using the following code. Make sure that all the .data files are organized into yearly folders as shown below and that you have provided the correct metadata.csv.gz file.
├── malware/
│ ├── 2013/hashes.data ...
│ ├── 2014/hashes.data ...
│ └── ...
├── benign/
│ ├── 2013/hashes.data ...
│ ├── 2014/hashes.data ...
│ └── ...
python LAMDA_get_train_test.py
When the script finishes, the train and test splits are saved as npz files.
For vectorization and final dataset creation, use the following code. Make sure that you have provided the correct directory (the one containing the train and test npz files produced by the previous step).
python vectorization_npz_creation.py
After vectorization, the final LAMDA dataset is saved as npz files. The features are stored in YYYY_X_train.npz, and the metadata (binary label, family label, number of VirusTotal detections, year and month, and hash) is stored in YYYY_meta_train.npz, where YYYY is the four-digit year in which the APK was added to the AndroZoo repository.
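Before using the files, it can help to check what each npz archive actually contains. The sketch below lists array names, shapes, and dtypes without assuming any particular key names. The helper name and the demo arrays (label, year_month) are hypothetical and stand in for the real files; allow_pickle=False is a cautious default and would need to be changed if an archive stores object arrays.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an npz file."""
    summary = {}
    with np.load(path, allow_pickle=False) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary


# Demo on a small stand-in file (hypothetical keys, not the real LAMDA layout).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_meta.npz")
    np.savez(
        demo_path,
        label=np.array([0, 1, 1]),
        year_month=np.array(["2013-01", "2013-02", "2013-02"]),
    )
    summary = describe_npz(demo_path)
    print(summary)
```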
Next, you can optionally split the yearly npz files into monthly npz files. For LAMDA experiments and concept drift detection, you will need them split by year.
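A monthly split can be sketched as a group-by on a per-sample month label. This is an illustration under assumptions, not the repository's script: the function name is hypothetical, and it assumes the metadata can be turned into an array of month strings such as "2013-01" aligned row by row with the feature matrix.

```python
import numpy as np


def split_by_month(features, year_month):
    """Group feature rows by month label; returns {month: rows}."""
    year_month = np.asarray(year_month)
    if features.shape[0] != year_month.shape[0]:
        raise ValueError("features and year_month must have the same length")
    # A boolean mask selects the rows that belong to each month.
    return {str(m): features[year_month == m] for m in np.unique(year_month)}


# Toy data: four samples with two features each.
X = np.arange(8).reshape(4, 2)
months = ["2013-01", "2013-02", "2013-01", "2013-02"]
parts = split_by_month(X, months)
```

Each entry of parts could then be written out with np.savez under a file name of your choosing.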
For LAMDA experiments, follow the instructions below to reproduce the results.
First, download the dataset and convert it to npz:
python download_dataset_and_convert_to_npz.py
Second, concept drift analysis with supervised learning. Use the following code to run the supervised learning experiments on LAMDA under AnoShift-style splits. Make sure you have provided the correct path to the downloaded LAMDA dataset.
python anoshift_experiment_models_separate.py
To run all models with one command, use the following script:
./4_1_anoshift_script_LAMDA.sh
python plot_near_far.py
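An AnoShift-style split trains on an early period and evaluates on later periods grouped by temporal distance (near and far). The sketch below shows only the grouping idea. The function names are hypothetical, and the boundary years used in the example are placeholders, not the ones used in the paper or in the scripts above.

```python
from collections import defaultdict


def assign_split(year, train_end, near_end):
    """Place a year into 'train', 'near', or 'far' (boundaries are inclusive)."""
    if train_end >= near_end:
        raise ValueError("train_end must be earlier than near_end")
    if year <= train_end:
        return "train"
    if year <= near_end:
        return "near"
    return "far"


def group_years(years, train_end, near_end):
    """Return {split_name: sorted list of years}."""
    groups = defaultdict(list)
    for year in sorted(years):
        groups[assign_split(year, train_end, near_end)].append(year)
    return dict(groups)


# Placeholder boundaries, chosen only for illustration.
groups = group_years(range(2013, 2019), train_end=2014, near_end=2016)
print(groups)
```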
To reproduce the histogram, t-SNE, and Jeffreys divergence heatmap plots from the paper, use the following:
python malware_family_historgram.py
python 4_1_visual_analysis_jeffreys_divergence.py
python 4_1_visual_analysis_t_sne.py
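For reference, the Jeffreys divergence between two discrete distributions P and Q is the symmetrized KL divergence, J(P, Q) = KL(P || Q) + KL(Q || P), which simplifies to the sum of (p_i - q_i) * log(p_i / q_i). The sketch below computes it with NumPy. How the plotting script builds the per-year distributions is not shown here, so the smoothing constant eps and the example inputs are assumptions.

```python
import numpy as np


def jeffreys_divergence(p, q, eps=1e-12):
    """Symmetrized KL divergence between two non-negative vectors."""
    # eps avoids log(0) and division by zero (an assumed smoothing choice).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    # Normalize so that both vectors are probability distributions.
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum((p - q) * np.log(p / q)))


# Toy example: feature-frequency counts for two hypothetical years.
year_a = [10, 5, 0, 1]
year_b = [4, 6, 3, 3]
print(jeffreys_divergence(year_a, year_b))
```

The value is zero for identical distributions and symmetric in its arguments, which makes it convenient for a year-by-year heatmap.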
python anoshift_experiment_api_graph.py
4_3_lambda_stability_otdd_analysis.ipynb
4_3_apigraph_stability_otdd_analysis.ipynb
Furthermore, we identify 10 families that are common across all years (2013-2024) and calculate the feature stability of each family from 2013 to 2024. We also report the CADE evaluation results on the test data (2014-2024) in terms of drift and non-drift samples. Details are available in Section 4.4 of the paper.
4_4_stability_analysis_common_families.ipynb
4_4_cade_evaluation_common_families.ipynb
4_5_shap_explanation_monthly_lamda.py
Run this file to generate the top 100 and top 1000 SHAP indices from the LAMDA dataset and store them in top_shap_indices_100_lamda.txt and top_shap_indices_1000_lamda.txt.
4_5_shap_explanation_monthly_apigraph.py
Run this file to generate the top 100 and top 1000 SHAP indices from the APIGraph dataset and store them in top_shap_indices_100_apigraph.txt and top_shap_indices_1000_apigraph.txt.
4_5_shap_explanation_graphs.py
Run this file to generate the graphs shap_explanation_drift_monthly_top_100_features_apigraph.png, shap_explanation_drift_monthly_top_1000_features_apigraph.png, shap_explanation_drift_monthly_top_100_features_lamda.png, and shap_explanation_drift_monthly_top_1000_features_lamda.png.
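One common way to obtain top-k SHAP feature indices is to rank features by their mean absolute SHAP value across samples. The sketch below illustrates that ranking on a small hand-written matrix. It is an assumption about the general approach rather than a description of the scripts above, and the function name is hypothetical. The SHAP values themselves would come from an explainer, which is not shown.

```python
import numpy as np


def top_k_feature_indices(shap_values, k):
    """Return the indices of the k features with the largest mean |SHAP|."""
    # Rows are samples, columns are features.
    importance = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    # Sort in descending order; a stable sort keeps ties in index order.
    order = np.argsort(-importance, kind="stable")
    return order[:k]


# Toy matrix: two samples, three features.
toy_shap = [[0.1, -0.5, 0.0],
            [0.2, 0.4, -0.1]]
top2 = top_k_feature_indices(toy_shap, k=2)
```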