Detailed Information
Pomelo Fractal Tree Image Generative Data Augmentation Method Using Vision-language Models; [融合视觉语言模型的柚子分形树图像生成增强方法]
Document type: Journal article
English title: Pomelo Fractal Tree Image Generative Data Augmentation Method Using Vision-language Models; [融合视觉语言模型的柚子分形树图像生成增强方法]
Authors: Lai L.; Duan J.; Yang Z.; Yuan H.
Affiliations: [1]College of Engineering, South China Agricultural University, Guangzhou, 510642, China; [2]College of Computer Science, Jiaying University, Meizhou, 514015, China; [3]School of Mechanical Engineering, Guangdong Ocean University, Zhanjiang, 524088, China
Year: 2026
Volume: 57
Issue: 1
Pages: 311
Journal: Nongye Jixie Xuebao/Transactions of the Chinese Society for Agricultural Machinery
Indexed in: Scopus (accession no. 2-s2.0-105026277276)
Language: English
Keywords: few-shot learning; generative data augmentation; pomelo object detection; vision-language models
Abstract: Aiming to address the heavy reliance on large amounts of annotated data in fruit object detection tasks such as pomelo, a pomelo tree image generative data augmentation method was proposed based on vision-language models. The approach required only 3-5 unlabeled real images to generate a large-scale labeled dataset, which can be used to train object detection models and enhance their performance in zero-shot and few-shot scenarios. The method consisted of three main stages. Firstly, real pomelo tree components (including fruits and leaves) were extracted from unlabeled images using the grounded segment anything model (Grounded SAM). Secondly, Stable Diffusion was used to create diverse background images from textual descriptions, increasing the complexity and variability of the training data. Thirdly, a modified fractal tree algorithm was employed to construct structurally diverse pomelo trees, integrating real components with synthetic backgrounds to produce a variety of tree images and corresponding automatic annotations. Experimental results on pomelo object detection using the YOLOv10 model (Nano version) showed that the proposed method improved mAP50-95 performance by 662.3%, 24.9%, 13.7%, 8.8%, and 1.8% when the number of real training images was 0, 8, 16, 32, and 64, respectively. With 221 real and 512 generated images, the model achieved optimal performance: precision was 76.9%, recall was 62.7%, mAP50 was 70.3%, and mAP50-95 was 38.4%. When transferred to orange detection tasks under the same data conditions, performance gains were 212.9%, 16.5%, 14.0%, 5.2%, and 4.1%. With 1302 real and 512 generated images, the model achieved the best overall performance: precision was 90.3%, recall was 87.8%, mAP50 was 94.0%, and mAP50-95 was 54.0%, demonstrating strong generalization ability.
Compared with tree images generated on blank backgrounds, the proposed method consistently outperformed across all training set sizes, whereas the blank-background approach excelled only in the zero-shot setting. Against traditional data augmentation techniques such as mosaic, this method performed better under low-shot conditions in pomelo detection, and although not the best in every individual case for orange detection, it achieved the best overall results under the default configuration of Ultralytics YOLO. In summary, the proposed method effectively mitigated the limitations caused by insufficient labeled data in fruit object detection model training and offered promising practical value and scalability. © 2026 Chinese Society of Agricultural Machinery. All rights reserved.
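To illustrate the third stage, a minimal sketch of a fractal tree generator is given below. This is not the paper's actual modified algorithm; it is a generic recursive branching routine (all names and parameters are illustrative) whose terminal points could serve as anchor positions for pasting the Grounded SAM-extracted fruit and leaf crops onto a synthetic background, with the anchor coordinates doubling as automatic annotation locations.

```python
import math
import random

def fractal_tree(x, y, angle, length, depth, rng, points=None):
    """Recursively generate terminal-branch endpoints of a 2D fractal tree.

    Returns a list of (x, y) endpoints. In a compositing pipeline, each
    endpoint could anchor a real fruit/leaf crop, and the same coordinates
    could be emitted as bounding-box annotations (illustrative sketch only).
    """
    if points is None:
        points = []
    if depth == 0:
        points.append((x, y))  # terminal node: candidate paste anchor
        return points
    # Randomized branch angles give each generated tree a different structure.
    for direction in (-1, 1):
        theta = angle + direction * rng.uniform(math.radians(15), math.radians(35))
        nx = x + length * math.cos(theta)
        ny = y + length * math.sin(theta)
        fractal_tree(nx, ny, theta, length * 0.7, depth - 1, rng, points)
    return points

# A depth-5 binary tree yields 2**5 = 32 terminal anchors.
anchors = fractal_tree(0.0, 0.0, math.pi / 2, 100.0, 5, random.Random(42))
print(len(anchors))  # → 32
```

Seeding the random generator makes each synthetic tree reproducible, while varying the seed, depth, and branch-angle range yields the structural diversity the abstract describes.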
