Explainable Machine Learning for Software Defect Prediction in Large-Scale Code Repositories Using Code Embeddings and Repository-Level Knowledge Graphs

KARTIK KUMARAN

doi:10.64137/31079458/IJCSEI-V2I2P102

Authors

KARTIK KUMARAN, Assistant Professor, Department of AI & ML, Independent Researcher, Coimbatore, Tamil Nadu.

DOI:

https://doi.org/10.64137/31079458/IJCSEI-V2I2P102

Keywords:

Software Defect Prediction, Explainable AI, Code Embeddings, Repository Knowledge Graph, CodeBERT, GraphCodeBERT, Just-in-Time Prediction, DevOps Governance

Abstract

Software defect prediction has moved from static metric classification toward learning-based quality intelligence that must operate across millions of functions, commits, tests, services, and developer interactions. This paper proposes an explainable machine learning framework for large-scale defect prediction that fuses contextual code embeddings, repository-level knowledge graphs, and governance-aware explanation mechanisms into a single lifecycle architecture. The core premise is that source code tokens alone cannot represent the socio-technical context in which defects emerge: dependency changes, test failures, build instability, ownership patterns, issue histories, deployment topology, and review signals also shape risk. Accordingly, the proposed framework encodes functions and patches with transformer-based code representations, constructs a typed repository graph from version-control and DevOps artifacts, and combines these representations through a graph-aware fusion layer that predicts file-, function-, and commit-level defect risk. The design builds on advances in pre-trained programming-language models [1]. It also aligns with lifecycle governance work that connects defect prediction, automated testing, and architecture-centered delivery controls [2]. For regulated and high-assurance domains, the model must produce explanations that are traceable enough to support review triage, audit, and risk acceptance, rather than probability scores alone [3]. The explainability layer therefore integrates local feature attribution, graph rationales, counterfactual repository edits, and confidence calibration so that predictions can be inspected by developers, testers, release managers, and compliance reviewers. A reproducible experimental protocol is specified for temporal validation, cross-project evaluation, ablation analysis, effort-aware ranking, and explanation faithfulness. The paper contributes a blueprint for defect prediction systems that are semantically aware, context-rich, and operationally trustworthy, and it makes no claims about empirical performance ahead of dataset-specific evaluation.
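
The abstract names temporal validation and effort-aware ranking as parts of the evaluation protocol but does not define them here. The following is a minimal Python sketch of one common reading of both ideas, not the paper's own procedure. The `Commit` record, its field names, the use of changed lines as the inspection-effort proxy, and the 20% effort budget are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Commit:
    # Hypothetical record; field names are illustrative, not from the paper.
    commit_id: str
    timestamp: int       # commit time, e.g. Unix seconds
    lines_changed: int   # assumed proxy for inspection effort
    risk: float          # model-predicted defect probability
    defective: bool      # ground-truth label, however it was obtained


def temporal_split(commits: List[Commit], cutoff: int) -> Tuple[List[Commit], List[Commit]]:
    """Train on commits strictly before `cutoff`, test on the rest,
    so that no information from the future leaks into training."""
    train = [c for c in commits if c.timestamp < cutoff]
    test = [c for c in commits if c.timestamp >= cutoff]
    return train, test


def recall_at_effort(commits: List[Commit], effort_fraction: float = 0.2) -> float:
    """Share of defective commits found when commits are inspected in
    descending risk-per-line order until the effort budget is exhausted."""
    total_effort = sum(c.lines_changed for c in commits)
    total_defects = sum(1 for c in commits if c.defective)
    if total_effort == 0 or total_defects == 0:
        return 0.0
    budget = effort_fraction * total_effort
    ranked = sorted(
        commits,
        key=lambda c: c.risk / max(c.lines_changed, 1),
        reverse=True,
    )
    spent, found = 0, 0
    for c in ranked:
        if spent + c.lines_changed > budget:
            break  # the next commit would exceed the inspection budget
        spent += c.lines_changed
        found += 1 if c.defective else 0
    return found / total_defects


# Toy data, invented purely to exercise the functions.
commits = [
    Commit("a", 100, 10, 0.9, True),
    Commit("b", 200, 200, 0.8, False),
    Commit("c", 300, 20, 0.7, True),
    Commit("d", 400, 500, 0.6, True),
    Commit("e", 500, 30, 0.1, False),
    Commit("f", 600, 240, 0.5, False),
]

train, test = temporal_split(commits, cutoff=300)
score = recall_at_effort(test, effort_fraction=0.2)
print([c.commit_id for c in train], [c.commit_id for c in test], score)
```

Ranking by risk per changed line, rather than by raw risk, favors small high-risk commits, which is the usual motivation for effort-aware evaluation. A real protocol would also need to fix how labels are derived and how the cutoff is chosen for each project.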

References

[1] Z. Feng et al., “CodeBERT: A Pre-Trained Model for Programming and Natural Languages,” Findings of the Association for Computational Linguistics: EMNLP 2020, Nov. 2020, doi: https://doi.org/10.18653/v1/2020.findings-emnlp.139.

[2] S. D. Sivva, R. R. Thalakanti, S. S. G. Bandari, and S. D. R. Yettapu, “AI-Driven Decision Intelligence for Agile Software Lifecycle Governance: An Architecture-Centered Framework Integrating Machine Learning Defect Prediction and Automated Testing,” International Journal of Emerging Trends in Computer Science and Information Technology, vol. 4, pp. 167–172, 2023, doi: https://doi.org/10.63282/3050-9246.ijetcsit-v4i4p118.

[3] S. K. Gunda, “Predictive Validation of Banking APIs and Transaction Workflows Using Machine Learning-Based Defect Detection Model,” International Journal of Artificial Intelligence, Data Science, and Machine Learning, vol. 6, no. 1, pp. 284–292, Mar. 2025, doi: https://doi.org/10.63282/3050-9262.ijaidsml-v6i1p133.

[4] S. M. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” Neural Information Processing Systems, 2017. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html

[5] S. Yalamati, “Energy-Efficient Task Offloading in Multi-Tenant Edge Clouds,” 2026 International Conference on Electronic Systems and Intelligent Computing (ICESIC), pp. 379–384, Mar. 2026, doi: https://doi.org/10.1109/icesic67389.2026.11496473.

[6] “Enhancing Reliability in Java Enterprise Systems through Comparative Analysis of Automated Testing Frameworks,” International Journal of Emerging Trends in Computer Science and Information Technology, vol. 4, 2023, doi: https://doi.org/10.63282/3050-9246.ijetcsit-v4i2p115.

[7] D. Guo et al., “GraphCodeBERT: Pre-training Code Representations with Data Flow,” arXiv.org, Sep. 13, 2021. https://arxiv.org/abs/2009.08366

[8] S. S. G. Bandari, S. D. Sivva, and R. R. Thalakanti, “Regulatory Grade Fraud Detection using Explainable Artificial Intelligence with Auditable Decision Pathways and Empirical Validation on Banking Data,” International Journal of Artificial Intelligence, Data Science, and Machine Learning, vol. 5, pp. 139–147, 2024, doi: https://doi.org/10.63282/3050-9262.ijaidsml-v5i3p115.

[9] A. K. K. V. Alluri, “A Systematic Study of Machine Learning Frameworks Enabling Scalable Secure and Explainable Artificial Intelligence in Salesforce CRM Platforms,” 2026 International Conference on Electronic Systems and Intelligent Computing (ICESIC), pp. 396–401, Mar. 2026, doi: https://doi.org/10.1109/icesic67389.2026.11496486.

[10] J. Śliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?,” ACM SIGSOFT Software Engineering Notes, vol. 30, no. 4, pp. 1–5, Jul. 2005, doi: https://doi.org/10.1145/1082983.1083147.

[11] S. K. Gunda, “The Future of Software Development and the Expanding Role of ML Models,” International Journal of Emerging Research in Engineering and Technology, vol. 4, 2023, doi: https://doi.org/10.63282/3050-922x.ijeret-v4i2p113.

[12] “Design and Evaluation of Secure Microservices Architecture for HIPAA-Compliant Prescription Processing on AWS and OpenShift,” International Journal of Artificial Intelligence, Data Science, and Machine Learning, vol. 5, no. 2, Jun. 2024, doi: https://doi.org/10.63282/3050-9262.ijaidsml-v5i2p116.

[13] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks,” Advances in Neural Information Processing Systems, vol. 32, 2019, Available: https://papers.nips.cc/paper/2019/hash/49265d2447bc3bbfe9e76306ce40a31f-Abstract.html

[14] R. R. Thalakanti and S. S. G. Bandari, “Intelligent Continuous Integration and Delivery for Banking Systems using Machine Learning Driven Risk Detection with Real World Deployment Evaluation,” International Journal of AI, BigData, Computational and Management Studies, vol. 5, no. 4, pp. 168–175, Dec. 2024, doi: https://doi.org/10.63282/3050-9416.ijaibdcms-v5i4p118.

[15] T. Raikar, F. Ezeugboaja, S. Bussa, H. Upadhyay, and P. Kalaru, “Ethics of AI-based supply chain optimization: a better balance between efficiency and fairness,” Future Technology, vol. 5, no. 2, pp. 281–296, May 2026, doi: https://doi.org/10.55670/fpll.futech.5.2.26.

[16] T. Hoang, H. K. Dam, Y. Kamei, D. Lo, and N. Ubayashi, “DeepJIT: An End-to-End Deep Learning Framework for Just-in-Time Defect Prediction,” 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), May 2019, doi: https://doi.org/10.1109/msr.2019.00016.

[17] N. Mutyam, “Graph-Based Modeling of Service Dependencies for Predicting Failure Propagation in Distributed Systems,” International Journal of Multidisciplinary Evolutionary Research, vol. 5, no. 1, pp. 113–116, 2024, doi: https://doi.org/10.54660/ijmer.2024.5.1.113-116.

[18] S. K. Gunda, “Comparative Analysis of Machine Learning Models for Software Defect Prediction,” pp. 1–6, Oct. 2024, doi: https://doi.org/10.1109/icpects62210.2024.10780167.

[19] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier,” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, pp. 1135–1144, Aug. 2016, doi: https://doi.org/10.1145/2939672.2939778.

[20] S. Yalamati, “Sparse Matrix Factorization for Scalable Machine Learning in Cloud Environments,” 2025 International Conference on NexGen Networks and Cybernetics (IC2NC), pp. 333–338, Dec. 2025, doi: https://doi.org/10.1109/ic2nc67409.2025.11376338.

[21] S. D. Sivva, “An End-to-End AI-Based Systems Engineering Paradigm for Lifecycle Governance, Predictive Quality Assurance, Automation Economics, and Cybersecurity Intelligence,” Journal of Frontiers in Multidisciplinary Research, vol. 4, no. 1, pp. 600–604, 2023, doi: https://doi.org/10.54660/.jfmr.2023.4.1.600-604.

[22] Y. Kamei et al., “A large-scale empirical study of just-in-time quality assurance,” IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, Jun. 2013, doi: https://doi.org/10.1109/tse.2012.70.

[23] S. R. Gudi, “Monitoring and Deployment Optimization in Cloud-Native Systems: A Comparative Study Using OpenShift and Helm,” 2025 4th International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), pp. 792–797, Sep. 2025, doi: https://doi.org/10.1109/icimia67127.2025.11200594.

[24] A. K. K. Varma Alluri, “Governed Agentic AI for Salesforce CRM Platforms: A Reference Architecture for Data Grounding, Decision Intelligence, Trust Controls, and Lifecycle Reliability,” International Journal of Emerging Trends in Computer Science and Information Technology, vol. 7, pp. 374–382, 2026, doi: https://doi.org/10.63282/3050-9246.ijetcsit-v7i1p153.

[25] S. K. Gunda, “An Exploration of Adaptive Ensemble Approaches in Software Fault Detection: Balancing Accuracy and Robustness,” The First International Conference on Recent Trends in Artificial Intelligence, Cyber Security, and Embedded Systems: ICRTACES2024, Tiruchirappalli, India, vol. 3345, no. 1, Jan. 2026, doi: https://doi.org/10.1063/5.0298093.

[26] B. Siri and Sai, “Replacing AI Agents for Backend,” International Journal of Scientific Research in Engineering and Management, vol. 09, no. 06, pp. 1–8, Jun. 2025, doi: https://doi.org/10.55041/ijsrem.ncft011.

[27] S. R. Gudi, “Enhancing Optical Character Recognition (OCR) Accuracy in Healthcare Prescription Processing using Artificial Neural Networks,” European Journal of Artificial Intelligence and Machine Learning, vol. 4, no. 6, pp. 1–6, Nov. 2025, doi: https://doi.org/10.24018/ejai.2025.4.6.79.

[28] R. R. Thalakanti, “Formalizing feature model integrity: a typing system and refactoring approaches for improving software product line design,” IET Conference Proceedings, vol. 2025, no. 43, pp. 710–717, Feb. 2026, doi: https://doi.org/10.1049/icp.2025.4792.

[29] S. S. G. Bandari, “Machine Learning (ML) based Anomaly Detection in Insurance Industries,” Journal of Information Systems Engineering and Management, vol. 10, no. 32s, pp. 13–21, Apr. 2025, doi: https://doi.org/10.52783/jisem.v10i32s.5182.

[30] S. K. Gunda, “A Hybrid Deep Learning Model for Software Fault Prediction Using CNN, LSTM, and Dense Layers,” Communications in Computer and Information Science, pp. 282–290, Oct. 2025, doi: https://doi.org/10.1007/978-3-032-05144-8_21.

[31] R. R. Thalakanti, “Optimizing Neural Network Architecture for Binary Classification Using Evolutionary Algorithms,” 2025 International Conference on Electronics and Computing, Communication Networking Automation Technologies (ICEC2NT), pp. 1–6, Sep. 2025, doi: https://doi.org/10.1109/icec2nt65402.2025.11380048.

[32] S. Yalamati, “Reinforcement Learning for Dynamic Service Composition in Edge Networks,” 2025 4th International Conference on Applied Artificial Intelligence and Computing (ICAAIC), pp. 1158–1163, Dec. 2025, doi: https://doi.org/10.1109/icaaic64647.2025.11330768.

[33] A. K. K. Varma Alluri, “Using Salesforce CRM and Deep Learning (CNN) Techniques to Improve Patient Journey Mapping and Engagement in Small and Medium Healthcare Organizations,” International Journal of Artificial Intelligence, Data Science, and Machine Learning, vol. 6, 2025, doi: https://doi.org/10.63282/3050-9262.ijaidsml-v6i4p115.

[34] S. R. Gudi, “Deconstructing Monoliths: A Fault-Aware Transition to Microservices with Gateway Optimization using Spring Cloud,” 2025 6th International Conference on Electronics and Sustainable Communication Systems (ICESC), pp. 815–820, Sep. 2025, doi: https://doi.org/10.1109/icesc65114.2025.11212326.

[35] S. K. Gunda, “AI-Enhanced API Reliability Testing for Digital Banking: Improving Accuracy, Resilience, and Integrity in Financial Transaction Processing,” International Journal of Emerging Trends in Computer Science and Information Technology, vol. 6, no. 2, pp. 136–143, May 2025, doi: https://doi.org/10.63282/3050-9246.ijetcsit-v6i2p116.

[36] S. Naik, P. Aitharaju, and Sai, “AI Chatbots in Enterprise Solutions: Transforming Customer Support, Industry-Specific Challenges and Ethical Considerations,” vol. 01, no. 01, pp. 49–59, Jan. 2025, doi: https://doi.org/10.63665/gjis.v1.11.

[37] R. R. Thalakanti, S. S. G. Bandari, and S. D. Sivva, “Federated Learning for Privacy Preserving Fraud Detection across Financial Institutions: Architecture Protocols and Operational Governance,” International Journal of Emerging Research in Engineering and Technology, vol. 5, pp. 108–114, 2024, doi: https://doi.org/10.63282/3050-922x.ijeret-v5i2p111.

[38] S. K. Gunda, “Automatic Software Vulnerabilty Detection Using Code Metrics and Feature Extraction,” 2025 2nd International Conference On Multidisciplinary Research and Innovations in Engineering (MRIE), pp. 115–120, Jul. 2025, doi: https://doi.org/10.1109/mrie66930.2025.11156601.

[39] T. Raikar, “Preserving the clean core principles in SAP systems: Design strategies for integrating AI,” 2026 International Conference on Electronic Systems and Intelligent Computing (ICESIC), pp. 1036–1041, Mar. 2026, doi: https://doi.org/10.1109/icesic67389.2026.11496501.

[40] V. K. R. Mittamidi, “AI/ML Powered Intelligent Root Cause Analysis and Automated Remediation for Multi System Data Integrity Issues,” International Journal of AI, BigData, Computational and Management Studies, vol. 6, pp. 133–141, 2025, doi: https://doi.org/10.63282/3050-9416.ijaibdcms-v6i4p115.

[41] S. Yalamati, “Probabilistic Reasoning in Multi-Agent Reinforcement Learning Systems,” 2025 International Conference on NexGen Networks and Cybernetics (IC2NC), pp. 707–712, Dec. 2025, doi: https://doi.org/10.1109/ic2nc67409.2025.11376303.

[42] “Decision Intelligence Methodology for AI-Driven Agile Software Lifecycle Governance and Architecture-Centered Project Management,” International Journal of Artificial Intelligence, Data Science, and Machine Learning, vol. 4, 2023, doi: https://doi.org/10.63282/3050-9262.ijaidsml-v4i1p112.

[43] R. R. Thalakanti, “AI-Driven API Architectures for Multi-Cloud Enterprises: A Comparative Study of Centralized, Distributed, and Hybrid Deployment Models,” International Journal of Computer Science and Engineering Innovations, vol. 2, no. 1, pp. 60–67, Feb. 2026, doi: https://doi.org/10.64137/31079458/ijcsei-v2i1p108.

[44] “AI-Driven Fax-to-Digital Prescription Automation: A Cloud-Native Framework Using OCR, Machine Learning, and Microservices for Pharmacy Operations,” International Journal of Emerging Research in Engineering and Technology, vol. 5, no. 1, Mar. 2024, doi: https://doi.org/10.63282/3050-922x.ijeret-v5i1p113.

[45] S. K. Gunda, “A Scalable AI-Driven Quality Engineering Architecture for End-To-End Validation of Core Banking, API, and UAT Ecosystems,” American International Journal of Computer Science and Technology, vol. 7, no. 6, pp. 126–138, Dec. 2025, doi: https://doi.org/10.63282/3117-5481/aijcst-v7i6p113.

[46] R. R. Thalakanti, “Convergence Analysis and Implementation of Linear Multistep Methods for Solving Ordinary Differential Equations,” 2025 2nd Asian Conference on Intelligent Technologies (ACOIT), pp. 1–18, Oct. 2025, doi: https://doi.org/10.1109/acoit66109.2025.11436783.

[47] M. Balerao, “A Converged Artificial Intelligence Architecture for Innovation, Software Lifecycle Optimization, and Cybersecurity Risk Mitigation,” International Journal of Multidisciplinary Futuristic Development, vol. 4, no. 1, pp. 117–120, 2023, doi: https://doi.org/10.54660/ijmfd.2023.4.1.117-120.

[48] S. R. Gudi, “Ensuring Secure and Compliant Fax Communication: Anomaly Detection and Encryption Strategies for Data in Transit,” 2025 4th International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), pp. 786–791, Sep. 2025, doi: https://doi.org/10.1109/icimia67127.2025.11200537.

[49] S. K. Gunda, “An Intelligent AI-Driven Framework for Real-Time ATM Transaction Validation, Fraud Detection and Financial Switching Integrity,” International Journal of Emerging Research in Engineering and Technology, vol. 5, pp. 180–191, 2024, doi: https://doi.org/10.63282/3050-922x.ijeret-v5i4p119.

[50] A. K. K. V. Alluri and S. Barde, “AI Powered Decision Intelligence Frameworks for Predictive and Prescriptive Business Optimization in Salesforce Enterprise Platforms,” 2026 International Conference on Electronic Systems and Intelligent Computing (ICESIC), pp. 438–443, Mar. 2026, doi: https://doi.org/10.1109/icesic67389.2026.11496409.

[51] T. Raikar, “High-Performance In-Memory Computing: A Research Study on SAP S/4 HANA Database Layer,” American Journal of Technology, vol. 4, no. 2, pp. 93–113, Dec. 2025, doi: https://doi.org/10.58425/ajt.v4i2.449.

[52] V. K. R. Mittamidi, “Leveraging AI and ML for Predictive Monitoring and Error Mitigation in Change Data Capture Pipelines,” International Journal of Emerging Trends in Computer Science and Information Technology, vol. 6, pp. 104–111, 2025, doi: https://doi.org/10.63282/3050-9246.ijetcsit-v6i3p116.

[53] S. Yalamati, “AI-Augmented Service Fabric for Adaptive Resource Management in Cloud Environments,” 2025 5th International Conference on Ubiquitous Computing and Intelligent Information Systems (ICUIS), pp. 963–968, Nov. 2025, doi: https://doi.org/10.1109/icuis67429.2025.11380548.

[54] S. K. Gunda, “Accelerating Scientific Discovery With Machine Learning and HPC-Based Simulations,” Advances in Systems Analysis, Software Engineering, and High Performance Computing, pp. 229–252, Dec. 2024, doi: https://doi.org/10.4018/978-1-6684-3795-7.ch009.

[55] “EmoVision: An Intelligent Deep Learning Framework for Emotion Understanding and Mental Wellness Assistance in Human Computer Interaction,” International Journal of Artificial Intelligence, Data Science, and Machine Learning, vol. 6, 2025, doi: https://doi.org/10.63282/3050-9262.ijaidsml-v6i4p103.

[56] M. Ukey, S. R. Abbidi, T. K. Kota, T. Raikar, M. Mallepati, and P. J. Adinarayana, “Digital Transformation in Healthcare: Integrating Clinical Research with Data Management Technologies,” 2026 6th International Conference on Recent Trends in Computer Science and Technology (ICRTCST), pp. 886–891, Jan. 2026, doi: https://doi.org/10.1109/icrtcst68392.2026.11545210.

[57] “A Comparative Analysis of Pivotal Cloud Foundry and OpenShift Cloud Platforms,” 2026, doi: https://doi.org/10.37547/tajas/Volume07Issue07-03.

[58] R. R. Thalakanti, “Enhancing Convergence in Fully Connected Neural Networks via Optimized Backpropagation,” 2025 2nd International Conference on Computing and Data Science (ICCDS), pp. 1–6, Jul. 2025, doi: https://doi.org/10.1109/iccds64403.2025.11209625.

[59] “Leveraging Predictive Analytics and Redis-Backed Caching to Optimize Specialty Medication Fulfillment and Pharmacy Inventory Management,” International Journal of AI, BigData, Computational and Management Studies, vol. 5, no. 3, Oct. 2024, doi: https://doi.org/10.63282/3050-9416.ijaibdcms-v5i3p116.

[60] T. Raikar and V. Apelagunta, “Implementing SAP Fiori in S/4HANA Transitions: Key Guidelines, Challenges, Strategic Implications, AI Integration Recommendations,” Journal of Engineering Research and Sciences, vol. 4, no. 11, pp. 1–9, Nov. 2025, doi: https://doi.org/10.55708/js0411001.

[61] V. K. R. Mittamidi, “An Automated AI-Driven Monitoring and Observability Framework for Cloud-Based Data Pipelines by Software Defect Prediction Research,” International Journal of Multidisciplinary Evolutionary Research, vol. 5, no. 1, pp. 109–112, 2024, doi: https://doi.org/10.54660/ijmer.2024.5.1.109-112.

[62] A. K. K. Varma Alluri, “Salesforce CRM Framework for Real Time DeFi Portfolio Intelligence and Customer Engagement Forecasting in Web3 Based Decentralized Finance Ecosystems Using ML Techniques,” International Journal of AI, BigData, Computational and Management Studies, vol. 6, 2025, doi: https://doi.org/10.63282/3050-9416.ijaibdcms-v6i4p111.

[63] S. K. Gunda, “Fault Prediction Unveiled: Analyzing the Effectiveness of RandomForest, LogisticRegression, and KNeighbors,” 2024 2nd International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS), pp. 107–113, Oct. 2024, doi: https://doi.org/10.1109/icssas64001.2024.10760620.

[64] I. Manga, S. D. Sivva, and V. K. Manga, “The Adaptive Intelligence in Cloud Systems: A Unified Architecture for AI Enhanced Observability and Automated Root Cause Analysis,” International Journal of Artificial Intelligence, Data Science, and Machine Learning, vol. 5, pp. 160–166, 2024, doi: https://doi.org/10.63282/3050-9262.ijaidsml-v5i1p115.

[65] S. K. Gunda, “Analyzing Machine Learning Techniques for Software Defect Prediction: A Comprehensive Performance Comparison,” 2024 Asian Conference on Intelligent Technologies (ACOIT), pp. 1–5, Sep. 2024, doi: https://doi.org/10.1109/acoit62457.2024.10939610.
