M2-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining

Rui Lv1,*, Juncheng Mo1,2,*, Tianyi Chu2, Chen Rao1,2, Hongyi Jing1, Jiajie Teng1, Jiafu Chen1,2, Shiqi Zhang1, Liangzi Ding1, Shuo Fang1, Huaizhong Lin2, Ziqiang Dang1 ✉️, Chenguang Ma1 ✉️, Lei Zhao2 ✉️

1 Ant Group, 2 Zhejiang University (* equal contribution; ✉️ corresponding authors)

Demo video of M2-Miner, showcasing (1) the mining process on AndroidControl, (2) the mining process on CAGUI, and (3) GUI agent inference on CAGUI.

ABSTRACT

Graphical User Interface (GUI) agents are pivotal to advancing intelligent human-computer interaction paradigms. Constructing powerful GUI agents requires large-scale annotation of high-quality user-behavior trajectory data (i.e., intent-trajectory pairs) for training. However, manual annotation methods and current GUI agent data mining approaches typically face three critical challenges: high construction cost, poor data quality, and low data richness. To address these issues, we propose M2-Miner, the first low-cost and automated mobile GUI agent data-mining framework based on Monte Carlo Tree Search (MCTS). To improve data mining efficiency and quality, we present a collaborative multi-agent framework comprising InferAgent, OrchestraAgent, and JudgeAgent for guidance, acceleration, and evaluation. To further enhance mining efficiency and enrich intent diversity, we design an intent recycling strategy that extracts additional valuable interaction trajectories. Additionally, we introduce a progressive model-in-the-loop training strategy to improve the success rate of data mining. Extensive experiments demonstrate that the GUI agent fine-tuned on our mined data achieves state-of-the-art performance on several commonly used mobile GUI benchmarks. Our work will be released to facilitate community research.

Overview

Given the urgent demand for high-quality GUI agent data and the laborious nature of manually annotating interaction trajectories, we propose M2-Miner, the first automated mobile GUI agent data-mining framework based on Monte Carlo Tree Search (MCTS), as illustrated in Fig. 1. Building on vanilla MCTS, we employ a collaborative multi-agent framework, comprising an InferAgent, an OrchestraAgent, and a JudgeAgent, to enhance the expansion and simulation phases. This approach reduces cost and improves both efficiency and the mining success rate. Furthermore, we propose an intent recycling strategy to fully extract valid trajectories from the intent trajectory trees. Finally, to further boost the mining success rate, we introduce a model-in-the-loop training strategy that continuously evolves the capabilities of these agents.
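To make the search loop concrete, the following is a minimal sketch of MCTS-style trajectory mining, not the M2-Miner implementation. It assumes a toy environment in which a "screen" is an integer and an action adds to it. The names `mine`, `propose`, `step`, and `judge` are hypothetical: `propose` loosely stands in for an action-proposing agent, `judge` for a completion evaluator, and the usual random rollout is replaced by scoring the newly expanded node directly. The OrchestraAgent and intent recycling are not modeled.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Node:
    state: int                       # toy stand-in for a GUI screen
    action: Optional[int] = None     # action that led here from the parent
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0


def uct(child: Node, c: float = 1.4) -> float:
    """Upper confidence bound for trees; unvisited children go first."""
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore


def trajectory(node: Node) -> List[int]:
    """Collect the actions on the path from the root to this node."""
    actions = []
    while node.parent is not None:
        actions.append(node.action)
        node = node.parent
    return actions[::-1]


def mine(root_state: int,
         propose: Callable[[int], List[int]],
         step: Callable[[int, int], int],
         judge: Callable[[int], float],
         iterations: int = 50,
         max_depth: int = 4) -> List[List[int]]:
    root = Node(state=root_state)
    found: List[List[int]] = []
    for _ in range(iterations):
        # Selection: descend by UCT until a leaf is reached.
        node, depth = root, 0
        while node.children:
            node = max(node.children, key=uct)
            depth += 1
        # Expansion: add every proposed action as a child (assumed policy).
        if judge(node.state) < 1.0 and depth < max_depth:
            for action in propose(node.state):
                node.children.append(Node(step(node.state, action), action, node))
            if node.children:
                node = node.children[0]
        # Evaluation: score the node directly instead of a random rollout.
        reward = judge(node.state)
        if reward >= 1.0:
            path = trajectory(node)
            if path not in found:
                found.append(path)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return found


# Toy task (hypothetical): reach screen 3 from screen 0 using steps of 1 or 2.
def toy_propose(state: int) -> List[int]:
    return [1, 2]


def toy_step(state: int, action: int) -> int:
    return state + action


def toy_judge(state: int) -> float:
    return 1.0 if state == 3 else 0.0


successful = mine(0, toy_propose, toy_step, toy_judge)
```

Every entry in `successful` is an action sequence that the toy judge accepts. In a real setting the proposer and judge would be model calls and `step` would execute an action on a device, which is where the cost of each expansion comes from.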

Infrastructure Framework

We build a customized infrastructure framework that supports mobile agent data mining and model-in-the-loop training. As illustrated in Fig. 2, the framework adopts a layered architecture consisting of the data layer, engine layer, algorithm layer, agent layer, execution layer, and environment layer. This layered design unifies GUI agent execution, intent-trajectory data mining, and model training into a single system, forming a customized model-in-the-loop framework that automatically performs end-to-end data mining and training.

Model-In-The-Loop Training Strategy

During the mining process, we observed that insufficient capabilities of the InferAgent and JudgeAgent led to numerous invalid explorations and inaccurate termination judgments, which resulted in a low mining success rate. Consequently, we design a progressive model-in-the-loop training strategy that iteratively improves the agents' performance. In the warm-up stage, we train the InferAgent and JudgeAgent on public datasets to equip the multi-agent framework with basic capability for trajectory mining. Subsequently, starting from the models obtained in the warm-up stage, we continue training the InferAgent and JudgeAgent through three stages. At each stage, we first generate intents, then use the mining framework to mine trajectory data, and subsequently retrain the model with all intent-trajectory data mined prior to this stage. Fig. 3 illustrates the intent generation process at each stage. This progressive strategy iteratively improves both the mining success rate and the quality of the generated trajectories.
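The staged procedure can be sketched as a simple loop. This is an illustrative outline under stated assumptions, not the authors' training code: `train`, `generate_intents`, and `mine` are hypothetical callables supplied by the caller, a single `model` object stands in for both the InferAgent and the JudgeAgent, and the sketch assumes that retraining at each stage uses the accumulated mined data including the stage that just finished.

```python
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical representation: (intent, list of actions).
Example = Tuple[str, List[str]]


def model_in_the_loop(
    public_data: List[Example],
    train: Callable[[Optional[Any], List[Example]], Any],
    generate_intents: Callable[[int, Any], List[str]],
    mine: Callable[[Any, List[str]], List[Example]],
    num_stages: int = 3,
) -> Tuple[Any, List[Example]]:
    # Warm-up: train from scratch on public datasets.
    model = train(None, public_data)
    mined: List[Example] = []
    for stage in range(1, num_stages + 1):
        # 1) Generate intents for this stage.
        intents = generate_intents(stage, model)
        # 2) Mine trajectories with the current model.
        mined.extend(mine(model, intents))
        # 3) Continue training on everything mined so far (assumption).
        model = train(model, mined)
    return model, mined


# Toy stand-ins so the loop can run end to end (all hypothetical).
def toy_train(model: Optional[dict], data: List[Example]) -> dict:
    version = 0 if model is None else model["version"]
    return {"version": version + 1, "num_examples": len(data)}


def toy_generate_intents(stage: int, model: dict) -> List[str]:
    return [f"stage{stage}-intent{i}" for i in range(2)]


def toy_mine(model: dict, intents: List[str]) -> List[Example]:
    return [(intent, ["tap", "done"]) for intent in intents]


final_model, all_mined = model_in_the_loop(
    [("open settings", ["tap"])], toy_train, toy_generate_intents, toy_mine
)
```

With the toy stand-ins, the model is trained once in warm-up and once per stage, and the mined pool grows by two examples per stage. Whether public data is mixed back in during later stages is not specified in the text above, so the sketch leaves it out.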

Experiments

Compared with AITZ, a dataset of similar scale, our method reduces construction cost by $6,010. In terms of per-image cost, our method is 18 times more cost-effective than the other datasets.

Table 1. Statistics of different datasets.

BibTeX

@article{lv2026m2miner,
      title        = {M2-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining},
      author       = {Lv, Rui and Mo, Juncheng and Chu, Tianyi and Rao, Chen and Jing, Hongyi and Teng, Jiajie and Chen, Jiafu and Zhang, Shiqi and Ding, Liangzi and Fang, Shuo and Lin, Huaizhong and Dang, Ziqiang and Ma, Chenguang and Zhao, Lei},
      journal      = {arXiv preprint arXiv:2602.05429},
      year         = {2026},
      doi          = {10.48550/arXiv.2602.05429},
      url          = {https://doi.org/10.48550/arXiv.2602.05429},
      note         = {Accepted by ICLR 2026}
    }