NuminaMath 7B: Revolutionizing Math Solving with Integrated Reasoning, Advanced Generative AI Tools, and Python REPL
DOI: https://doi.org/10.23917/saintek.v2i1.15728
Keywords: NuminaMath 7B, large language model, problem solving, chain of thought, AI math olympiad
Abstract
This study assesses the efficacy of NuminaMath 7B, an AI model created to solve mathematical problems. We evaluated the model's accuracy and efficiency against conventional methods through experiments that produced quantitative data, and we collected qualitative data through user surveys and interviews to understand user experiences and pinpoint areas for improvement. Survey results indicated that users found NuminaMath 7B pertinent, effective, and user-friendly, reflected in high average scores for user experience (95), perception of features and interface (90), and additional feedback (85). The model was developed through two phases of fine-tuning, applying the Chain of Thought (CoT) methodology and drawing on the Tool-Integrated Reasoning Agent (ToRA) framework, which enabled it to produce mathematical solutions with logical, detailed explanations. In testing, the model scored 29 out of 50 in the AI Math Olympiad competition, though it struggled with the more intricate problems. These findings underscore the significance of AI technology in mathematics and the substantial potential of such models to support a deeper understanding of mathematical concepts.
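To make the generate-execute-feedback loop described in the abstract concrete, the sketch below shows one way a CoT model with an integrated Python REPL can be driven. It is a minimal illustration under stated assumptions, not the authors' pipeline: the model ID (the public AI-MO/NuminaMath-7B-TIR checkpoint on Hugging Face), the prompt wording, the fenced code-block convention, and the three-round budget are all assumptions for the example.

```python
import contextlib
import io
import re

from transformers import pipeline

FENCE = "`" * 3  # triple backtick, built indirectly so this listing stays well-formed
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

# Assumed setup: the public AI-MO/NuminaMath-7B-TIR checkpoint on Hugging Face.
generator = pipeline("text-generation", model="AI-MO/NuminaMath-7B-TIR", device_map="auto")

def run_python(code: str) -> str:
    """Execute a model-generated code block and capture its stdout."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # demo only: sandbox untrusted generated code in real use
    except Exception as exc:
        return f"Error: {exc}"
    return buffer.getvalue()

def solve(problem: str, max_rounds: int = 3) -> str:
    """Alternate between model generation and Python execution until the
    model answers in plain text or the round budget is exhausted."""
    prompt = f"Problem: {problem}\nSolve step by step, writing Python code when helpful.\n"
    for _ in range(max_rounds):
        completion = generator(prompt, max_new_tokens=512)[0]["generated_text"]
        new_text = completion[len(prompt):]  # pipeline returns prompt + continuation
        prompt = completion
        match = CODE_BLOCK.search(new_text)
        if match is None:
            return new_text  # no code emitted: treat this as the final answer
        result = run_python(match.group(1))
        # Feed the execution result back so the next round can reason over it.
        prompt += f"\n{FENCE}output\n{result}\n{FENCE}\n"
    return prompt

print(solve("What is the sum of the first 100 positive integers?"))
```

Feeding the REPL's stdout back into the prompt mirrors ToRA-style tool integration, in which intermediate computations ground the model's subsequent reasoning; a production system would additionally sandbox execution and parse a structured final answer.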
License
Copyright (c) 2026 Adi Jufriansah, Irwan Akib, Naufal Ishartono, Azmi Khusnani, Tanti Diyah Rahmawati, Edwin Ariesto Umbu Malahina, Osniman Paulina Maure, Nova Tri Romadloni

This work is licensed under a Creative Commons Attribution 4.0 International License.