LAMDA: A Longitudinal Android Malware Dataset for Analyzing Concept Drift

1Department of Computer Science, University of Texas at El Paso 2School of Informatics, University of Edinburgh

Abstract

Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift (distributional shifts in benign and malicious samples) leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection.

To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013–2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA’s utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges. The dataset and code are available at https://iqsec-lab.github.io/LAMDA/.

Dataset Preparation

  • Dataset Construction: Over 1 million APKs (2013–2025, excluding 2015) were collected from the AndroZoo repository. A 20% overhead was included to account for download and decompilation failures. APKs were organized into year-based directories: [year]/malware/ and [year]/benign/.
  • Label Assignment: Binary labels were assigned using VirusTotal (VT) results:
    • Benign: vt_detection = 0
    • Malware: vt_detection ≥ 4
    • Uncertain: vt_detection ∈ [1, 3] (discarded)
    50,000 malware and 50,000 benign samples were collected per year (except for years with insufficient data), preserving month-wise distributions to reduce sampling bias.
  • Static Feature Extraction (Drebin-inspired): Each APK was decompiled using apktool to extract static features:
    • From AndroidManifest.xml: permissions, components (activities, services, receivers), hardware features, intent filters
    • From smali code: restricted/suspicious API calls, hardcoded URLs/IPs
  • Family Labeling: AVClass2 was used to standardize malware family labels using VirusTotal reports. Labels were linked to APKs using SHA256 hashes to support multi-class and temporal malware analysis.
  • Vectorization & Preprocessing: Extracted features were vectorized into high-dimensional binary vectors using a bag-of-tokens approach. A global vocabulary (~9.69M tokens) was constructed. Dimensionality was reduced using VarianceThreshold (threshold = 0.001), resulting in 4,561 final features.
  • Data Splitting: Each year’s data was split using stratified sampling:
    • Training: 80%
    • Testing: 20%
    Class balance was maintained within each split.
  • Storage & Format: Final datasets were saved in both .npz (sparse matrix) and .parquet (tabular) formats. Each year’s folder includes:
    • X_train.parquet
    • X_test.parquet
    Metadata columns include: hash, label, family, vt_count, year_month, followed by binary features.
  • Scalability Support: Released global vocabulary, selected features, and preprocessing objects (e.g., VarianceThreshold) to enable integration with ML pipelines, including Hugging Face.
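The labeling, variance-based feature selection, and stratified splitting steps above can be sketched end to end. This is a minimal illustration on synthetic data, not the released pipeline: `assign_label` and the toy feature matrix are hypothetical names, while the VT thresholds, the 0.001 variance threshold, and the 80/20 stratified split follow the description above.

```python
# Sketch of the LAMDA preprocessing steps on synthetic data.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

def assign_label(vt_detection):
    """Map a VirusTotal detection count to a binary label (None = discard)."""
    if vt_detection == 0:
        return 0      # benign
    if vt_detection >= 4:
        return 1      # malware
    return None       # uncertain (1-3): discarded

# Synthetic stand-in for the binary bag-of-tokens matrix.
rng = np.random.default_rng(0)
X = (rng.random((200, 50)) < 0.3).astype(np.uint8)
X[:, 0] = 0  # a constant feature that VarianceThreshold should drop
y = rng.integers(0, 2, size=200)

# Drop near-constant features, as in the vectorization step above.
selector = VarianceThreshold(threshold=0.001)
X_sel = selector.fit_transform(X)

# Stratified 80/20 split, preserving class balance within each split.
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.2, stratify=y, random_state=0
)
print(X_sel.shape, X_train.shape[0], X_test.shape[0])
```

In the released dataset the fitted `VarianceThreshold` object itself is provided, so new samples can be projected into the same 4,561-feature space without refitting.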

Dataset Statistics

Year-wise distribution of total, malware, and benign samples
Year Total Samples Malware Samples Benign Samples
2013 86,431 44,383 42,048
2014 101,183 45,756 55,427
2016 109,193 45,134 64,059
2017 99,144 21,359 77,785
2018 104,292 39,350 64,942
2019 91,050 41,585 49,465
2020 102,073 46,355 55,718
2021 81,155 35,627 45,528
2022 86,416 41,648 44,768
2023 54,354 7,892 46,462
2024 48,427 794 47,633
2025 44,663 23 44,640
Total 1,008,381 369,906 638,475

Year-wise breakdown of malware family distributions in LAMDA
Year New Existing Valid Families # of Singletons # of Unknown
2013 213 0 213 1,550 24
2014 91 140 231 2,482 345
2016 179 196 375 5,861 177
2017 88 119 207 9,063 1,108
2018 153 220 373 20,579 1,242
2019 259 376 635 18,916 22
2020 141 447 588 30,644 25
2021 43 252 295 30,020 23
2022 161 490 651 24,927 4
2023 37 187 224 5,922 15
2024 14 50 64 626 0
2025 1 7 8 14 0
Total 1,380 - - 150,604 2,985

Experiment Results

Figure: t-SNE projection of the LAMDA dataset, showing feature-space evolution.
Figure: Jeffreys divergence heatmaps across years for LAMDA, illustrating feature-wise distributional drift.
Figure: Performance of models across IID, NEAR, and FAR splits for LAMDA (Baseline variant).
Figure: Stability and distribution analysis of malware families.
Figure: Monthly explanation drift on LAMDA, measured by Jaccard and Kendall distances between top-1000 SHAP features.

How to Use the LAMDA Dataset

Follow the steps below to install the library and to load and explore the data in your projects.

1. Install the datasets library

pip install datasets
pip install 'huggingface_hub[hf_xet]'  # required for Xet download support

2. Load the dataset

from datasets import load_dataset

dataset = load_dataset('IQSeC-Lab/LAMDA', 'Baseline')
# Check available splits
print(dataset.keys())

3. Inspect splits & features

# Peek a sample record
print(dataset['train'][0])

4. Usage examples

Convert to pandas DataFrame:

import pandas as pd

df = pd.DataFrame(dataset['train'][:100])
print(df.head())
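Train a baseline model:

Once a split is loaded, the metadata columns (hash, label, family, vt_count, year_month) can be separated from the binary features to fit a classifier. The sketch below uses a tiny synthetic frame with the same column layout in place of a real LAMDA split, and logistic regression as an illustrative baseline; all variable names are our own.

```python
# Sketch: split LAMDA-style metadata from binary features, fit a classifier.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

META_COLS = ["hash", "label", "family", "vt_count", "year_month"]

# Synthetic stand-in for a loaded LAMDA year split.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "hash": [f"h{i}" for i in range(n)],
    "label": rng.integers(0, 2, size=n),
    "family": ["unknown"] * n,
    "vt_count": rng.integers(0, 10, size=n),
    "year_month": ["2013-01"] * n,
})
for j in range(20):                      # binary feature columns
    df[f"feat_{j}"] = rng.integers(0, 2, size=n)

X = df.drop(columns=META_COLS).to_numpy()  # binary features only
y = df["label"].to_numpy()

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape, clf.score(X, y))
```

With the real data, `df` would come from `pd.DataFrame(dataset['train'][:])` (or from the released `.parquet`/`.npz` files), with training on one year and evaluation on later years to observe drift.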

Key Contributions

  • We present LAMDA, a large-scale Android malware benchmark consisting of over 1 million APKs across 1,380 unique families, spanning 12 years (2013 to 2025, excluding 2015). LAMDA is built on static features derived from the Drebin feature set.
  • We conduct longitudinal evaluations using structured temporal splits, analyze per-feature distribution shifts, and perform feature stability analysis across malware families.
  • LAMDA enables explanation-driven drift analysis using SHapley Additive exPlanations (SHAP), supporting investigations into how feature importance changes as malware evolves.
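The explanation-drift measurements compare top-ranked SHAP features between time periods using Jaccard and Kendall distances. The sketch below is one plausible formulation under our own assumptions (random stand-ins for SHAP attributions, magnitude-based ranking, and a `(1 - tau)/2` normalization), not the paper's exact procedure:

```python
# Sketch: Jaccard distance between top-k feature sets and a Kendall-tau
# distance between their rankings, as used for explanation drift.
import numpy as np
from scipy.stats import kendalltau

def topk(importance, k):
    """Indices of the k largest-magnitude attributions, most important first."""
    return list(np.argsort(-np.abs(importance))[:k])

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| between two sets of feature indices."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def kendall_distance(imp_a, imp_b, k):
    """(1 - tau) / 2 over features in the top-k of either period."""
    shared = sorted(set(topk(imp_a, k)) | set(topk(imp_b, k)))
    tau, _ = kendalltau(np.abs(imp_a)[shared], np.abs(imp_b)[shared])
    return (1.0 - tau) / 2.0

# Random stand-ins for two months of per-feature SHAP attributions.
rng = np.random.default_rng(0)
month1 = rng.normal(size=100)
month2 = month1 + 0.1 * rng.normal(size=100)  # mild drift

k = 20
print(jaccard_distance(topk(month1, k), topk(month2, k)))
print(kendall_distance(month1, month2, k))
```

Both distances are 0 when the explanations agree and grow toward 1 as the top features and their ordering diverge, which is what the monthly drift curves track.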

Limitations

  • While LAMDA provides a strong foundation for studying concept drift in malware analysis, we acknowledge a few limitations. It relies exclusively on static features, omitting dynamic behaviors that are observable only at runtime. Although prior work suggests a 10:90 malware-to-benign ratio, LAMDA attempts to maintain a 50:50 ratio, which may be seen as downplaying the role of benign software distributions. However, we argue that LAMDA is constructed as a challenging dataset with greater family diversity and a balanced class distribution. As such, LAMDA will facilitate the development of detectors that are more resilient to distributional shifts and capable of generalizing across a broad spectrum of evolving malware behaviors.

BibTeX

@article{lamda,
  title     = {{LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis}},
  author    = {Md Ahsanul Haque and Ismail Hossain and Md Mahmuduzzaman Kamol and Md Jahangir Alam and Suresh Kumar Amalapuram and Sajedul Talukder and Mohammad Saidur Rahman},
  year      = {2025},
  eprint    = {2505.18551},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CR},
  url       = {https://arxiv.org/abs/2505.18551}
}