Viability of Automating Annotation through MediaPipe’s Effectiveness on Pose Points Model Accuracy
DOI: https://doi.org/10.47611/jsrhs.v14i1.8863
Keywords: Automated annotation, MediaPipe, manual annotation, pose estimation, keypoint variance, machine learning datasets, annotation efficiency
Abstract
This research investigates the effectiveness of automated annotation using MediaPipe for human motion recognition tasks, comparing its performance against manually annotated data from the MP2 dataset. By evaluating three machine learning models (a Generative Adversarial Network (GAN), a Dense model, and a Transformer model) on both datasets, we assess the impact of dataset quality and annotation method on model performance. The findings indicate that while models trained on MediaPipe-annotated data generally outperformed those trained on MP2, overall accuracy remained low across all models, highlighting challenges in generalization. The study identifies the need for automated annotations that approach the granularity of manual annotation if performance is to improve. It also suggests that environmental factors such as lighting, background, and camera angle, which affect joint-detection accuracy, contribute to performance inconsistencies. The research further emphasizes the importance of data preprocessing and augmentation, and the potential of combining multi-modal data to improve annotation precision. Ultimately, this study demonstrates that automated annotation offers the scalability needed for large-scale projects but requires refinement in handling complex, dynamic environments to fully realize its potential in machine learning applications.
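The keypoint variance mentioned above can be illustrated with a minimal sketch: given the same frame annotated manually and by an automated tool, compute the per-joint Euclidean distance between the two sets of 2D keypoints and average it. The function name, joint names, and toy pixel coordinates below are illustrative assumptions, not values from the study.

```python
import math

def keypoint_variance(manual, auto):
    """Per-joint Euclidean distance between two annotation sets.

    manual, auto: dicts mapping joint name -> (x, y) in pixel coordinates.
    Returns (per-joint distances, overall mean distance).
    """
    per_joint = {}
    for joint, (mx, my) in manual.items():
        ax, ay = auto[joint]
        per_joint[joint] = math.hypot(mx - ax, my - ay)
    overall = sum(per_joint.values()) / len(per_joint)
    return per_joint, overall

# Toy example: two joints annotated manually and by an automated tool.
manual = {"left_wrist": (120.0, 240.0), "right_knee": (300.0, 410.0)}
auto = {"left_wrist": (123.0, 244.0), "right_knee": (300.0, 410.0)}
per_joint, overall = keypoint_variance(manual, auto)
```

In practice the automated side of this comparison would come from a pose estimator such as MediaPipe Pose, which emits a fixed set of body landmarks per frame; the metric itself is agnostic to how either annotation set was produced.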
Copyright (c) 2025 Dhruv Jena, Daksh Jain, Anthony Mauro

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute and display this article.