LAMDA: A Longitudinal Android Malware Dataset for Analyzing Concept Drift

1Department of Computer Science, University of Texas at El Paso 2School of Informatics, University of Edinburgh

Abstract

Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift (distributional shifts in benign and malicious samples) leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection.

To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013–2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA’s utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges. The dataset and code are available at https://iqsec-lab.github.io/LAMDA/.

Dataset Preparation

  • Dataset Construction: Over 1 million APKs (2013–2025, excluding 2015) were collected from the AndroZoo repository. A 20% overhead was included to account for download and decompilation failures. APKs were organized into year-based directories: [year]/malware/ and [year]/benign/.
  • Label Assignment: Binary labels were assigned using VirusTotal (VT) results:
    • Benign: vt_detection = 0
    • Malware: vt_detection ≥ 4
    • Uncertain: vt_detection ∈ [1, 3] (discarded)
    50,000 malware and 50,000 benign samples were collected per year (except for years with insufficient data), preserving month-wise distributions to reduce sampling bias.
  • Static Feature Extraction (Drebin-inspired): Each APK was decompiled using apktool to extract static features:
    • From AndroidManifest.xml: permissions, components (activities, services, receivers), hardware features, intent filters
    • From smali code: restricted/suspicious API calls, hardcoded URLs/IPs
  • Family Labeling: AVClass2 was used to standardize malware family labels using VirusTotal reports. Labels were linked to APKs using SHA256 hashes to support multi-class and temporal malware analysis.
  • Vectorization & Preprocessing: Extracted features were vectorized into high-dimensional binary vectors using a bag-of-tokens approach. A global vocabulary (~9.69M tokens) was constructed. Dimensionality was reduced using VarianceThreshold (threshold = 0.001), resulting in 4,561 final features.
  • Data Splitting: Each year’s data was split using stratified sampling:
    • Training: 80%
    • Testing: 20%
    Class balance was maintained within each split.
  • Storage & Format: Final datasets were saved in both .npz (sparse matrix) and .parquet (tabular) formats. Each year’s folder includes:
    • X_train.parquet
    • X_test.parquet
    Metadata columns include: hash, label, family, vt_count, year_month, followed by binary features.
  • Scalability Support: Released global vocabulary, selected features, and preprocessing objects (e.g., VarianceThreshold) to enable integration with ML pipelines, including Hugging Face.
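The labeling, variance-based feature selection, and stratified splitting steps above can be sketched end to end. This is a minimal illustration on synthetic data, not the released pipeline: `assign_label` and the toy feature matrix are hypothetical names, while the VT thresholds, the 0.001 variance threshold, and the 80/20 stratified split follow the description above.

```python
# Sketch of the LAMDA preprocessing steps on synthetic data.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

def assign_label(vt_detection):
    """Map a VirusTotal detection count to a binary label (None = discard)."""
    if vt_detection == 0:
        return 0      # benign
    if vt_detection >= 4:
        return 1      # malware
    return None       # uncertain (1-3): discarded

# Synthetic stand-in for the binary bag-of-tokens matrix.
rng = np.random.default_rng(0)
X = (rng.random((200, 50)) < 0.3).astype(np.uint8)
X[:, 0] = 0  # a constant feature that VarianceThreshold should drop
y = rng.integers(0, 2, size=200)

# Drop near-constant features, as in the vectorization step above.
selector = VarianceThreshold(threshold=0.001)
X_sel = selector.fit_transform(X)

# Stratified 80/20 split, preserving class balance within each split.
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.2, stratify=y, random_state=0
)
print(X_sel.shape, X_train.shape[0], X_test.shape[0])
```

In the released dataset the fitted `VarianceThreshold` object itself is provided, so new samples can be projected into the same 4,561-feature space without refitting.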

Dataset Statistics

Year-wise distribution of total, malware, and benign samples
Year Total Samples Malware Samples Benign Samples
2013 86,431 44,383 42,048
2014 101,183 45,756 55,427
2016 109,193 45,134 64,059
2017 99,144 21,359 77,785
2018 104,292 39,350 64,942
2019 91,050 41,585 49,465
2020 102,073 46,355 55,718
2021 81,155 35,627 45,528
2022 86,416 41,648 44,768
2023 54,354 7,892 46,462
2024 48,427 794 47,633
2025 44,663 23 44,640
Total 1,008,381 369,906 638,475

Year-wise breakdown of malware family distributions in LAMDA
Year New Existing Valid Families # of Singletons # of Unknown
2013 213 0 213 1,550 24
2014 91 140 231 2,482 345
2016 179 196 375 5,861 177
2017 88 119 207 9,063 1,108
2018 153 220 373 20,579 1,242
2019 259 376 635 18,916 22
2020 141 447 588 30,644 25
2021 43 252 295 30,020 23
2022 161 490 651 24,927 4
2023 37 187 224 5,922 15
2024 14 50 64 626 0
2025 1 7 8 14 0
Total 1,380 - - 150,604 2,985

Experiment Results

Figure: t-SNE projection of the LAMDA dataset, showing feature-space evolution.
Figure: Jeffreys divergence heatmaps across years for LAMDA, illustrating feature-wise distributional drift.
Figure: Performance of models across IID, NEAR, and FAR splits for LAMDA (Baseline variant).
Figure: Stability and distribution analysis of malware families.
Figure: Monthly explanation drift on LAMDA, measured by Jaccard and Kendall distances between top-1000 SHAP features.

How to Use the LAMDA Dataset

Follow the steps below to install the library and to load and explore the data in your projects.

1. Install the datasets library

pip install datasets
pip install 'huggingface_hub[hf_xet]'  # required for Xet download support

2. Load the dataset

from datasets import load_dataset

dataset = load_dataset('IQSeC-Lab/LAMDA', 'Baseline')
# Check available splits
print(dataset.keys())

3. Inspect splits & features

# Peek a sample record
print(dataset['train'][0])

4. Usage examples

Convert to pandas DataFrame:

import pandas as pd

df = pd.DataFrame(dataset['train'][:100])
print(df.head())
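Train a baseline model:

Once a split is loaded, the metadata columns (hash, label, family, vt_count, year_month) can be separated from the binary features to fit a classifier. The sketch below uses a tiny synthetic frame with the same column layout in place of a real LAMDA split, and logistic regression as an illustrative baseline; all variable names are our own.

```python
# Sketch: split LAMDA-style metadata from binary features, fit a classifier.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

META_COLS = ["hash", "label", "family", "vt_count", "year_month"]

# Synthetic stand-in for a loaded LAMDA year split.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "hash": [f"h{i}" for i in range(n)],
    "label": rng.integers(0, 2, size=n),
    "family": ["unknown"] * n,
    "vt_count": rng.integers(0, 10, size=n),
    "year_month": ["2013-01"] * n,
})
for j in range(20):                      # binary feature columns
    df[f"feat_{j}"] = rng.integers(0, 2, size=n)

X = df.drop(columns=META_COLS).to_numpy()  # binary features only
y = df["label"].to_numpy()

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape, clf.score(X, y))
```

With the real data, `df` would come from `pd.DataFrame(dataset['train'][:])` (or from the released `.parquet`/`.npz` files), with training on one year and evaluation on later years to observe drift.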

Key Contributions

  • We present LAMDA, a large-scale Android malware benchmark consisting of over 1 million APKs across 1,380 unique families, spanning 12 years (2013 to 2025, excluding 2015). LAMDA is built on static features derived from the Drebin feature set.
  • We conduct longitudinal evaluations using structured temporal splits, analyze per-feature distribution shifts, and perform feature stability analysis across malware families.
  • LAMDA enables explanation-driven drift analysis using SHapley Additive exPlanations (SHAP), supporting investigations into how feature importance changes as malware evolves.
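The explanation-drift measurements compare top-ranked SHAP features between time periods using Jaccard and Kendall distances. The sketch below is one plausible formulation under our own assumptions (random stand-ins for SHAP attributions, magnitude-based ranking, and a `(1 - tau)/2` normalization), not the paper's exact procedure:

```python
# Sketch: Jaccard distance between top-k feature sets and a Kendall-tau
# distance between their rankings, as used for explanation drift.
import numpy as np
from scipy.stats import kendalltau

def topk(importance, k):
    """Indices of the k largest-magnitude attributions, most important first."""
    return list(np.argsort(-np.abs(importance))[:k])

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| between two sets of feature indices."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def kendall_distance(imp_a, imp_b, k):
    """(1 - tau) / 2 over features in the top-k of either period."""
    shared = sorted(set(topk(imp_a, k)) | set(topk(imp_b, k)))
    tau, _ = kendalltau(np.abs(imp_a)[shared], np.abs(imp_b)[shared])
    return (1.0 - tau) / 2.0

# Random stand-ins for two months of per-feature SHAP attributions.
rng = np.random.default_rng(0)
month1 = rng.normal(size=100)
month2 = month1 + 0.1 * rng.normal(size=100)  # mild drift

k = 20
print(jaccard_distance(topk(month1, k), topk(month2, k)))
print(kendall_distance(month1, month2, k))
```

Both distances are 0 when the explanations agree and grow toward 1 as the top features and their ordering diverge, which is what the monthly drift curves track.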

Limitations

  • While LAMDA provides a strong foundation for studying concept drift in malware analysis, we acknowledge a few limitations. It relies exclusively on static features, omitting dynamic behaviors that are observable only at runtime. Although prior work suggests a 10:90 malware-to-benign ratio, LAMDA attempts to maintain a 50:50 ratio, which may be seen as downplaying the role of benign software distributions. However, we argue that LAMDA is constructed as a challenging dataset with greater family diversity and a balanced class distribution. As such, LAMDA will facilitate the development of detectors that are more resilient to distributional shifts and capable of generalizing across a broad spectrum of evolving malware behaviors.

BibTeX

@article{lamda,
  title     = {{LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis}},
  author    = {Md Ahsanul Haque and Ismail Hossain and Md Mahmuduzzaman Kamol and Md Jahangir Alam and Suresh Kumar Amalapuram and Sajedul Talukder and Mohammad Saidur Rahman},
  year      = {2025},
  eprint    = {2505.18551},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CR},
  url       = {https://arxiv.org/abs/2505.18551}
}