Abstract
This thesis presents the design and development of a modular voice assistant for real-time, speech-based interaction in academic environments. The system integrates four key components, voice activity detection (VAD), automatic speech recognition (ASR), language-model response generation, and text-to-speech (TTS) synthesis, into a seamless, end-to-end conversational pipeline. Unlike generic voice assistants, it prioritizes domain-specific accuracy and low-latency communication, with a focus on university-level student support. Interaction begins with browser-based VAD, which detects when the user starts and stops speaking, enabling hands-free operation. The captured audio is transcribed with OpenAI's Whisper-Large-V3-Turbo, chosen for its robustness to varied accents and noisy environments. The transcription is then processed by a LLaMA 3.2B model fine-tuned with Low-Rank Adaptation (LoRA). Trained on custom Q&A pairs from the University of Massachusetts Dartmouth's College of Engineering, the model generates responses specific to academic advising, curricula, and campus resources, while context tracking enables coherent multi-turn dialogue. Responses are synthesized with Kokoro 82M, a lightweight neural TTS model known for expressive, low-latency audio output. A Gradio interface manages the conversation, combining voice and text input and output, visual feedback, and message history. This work demonstrates how modern AI tools can be composed into a practical, domain-specific educational assistant. With its modular architecture and open-source foundation, the system provides a scalable framework adaptable to other departments and institutions.
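The pipeline described above (VAD-gated capture, then ASR, context-aware generation, and TTS) can be sketched in a few lines. This is a minimal, hypothetical illustration: the class and function names are invented for this sketch, and simple stubs stand in for the real Whisper, LoRA-fine-tuned LLaMA, and Kokoro components.

```python
# Hypothetical sketch of the modular assistant pipeline; all names are
# illustrative stand-ins, not the thesis's actual implementation.
from dataclasses import dataclass


@dataclass
class Turn:
    """One user/assistant exchange, kept for multi-turn context tracking."""
    user: str
    assistant: str


class VoiceAssistantPipeline:
    """Chains ASR -> LLM -> TTS for each VAD-detected utterance."""

    def __init__(self, asr, llm, tts):
        self.asr = asr          # stand-in for Whisper-Large-V3-Turbo
        self.llm = llm          # stand-in for the LoRA-fine-tuned LLaMA model
        self.tts = tts          # stand-in for Kokoro 82M
        self.history = []       # conversation context for multi-turn dialogue

    def handle_utterance(self, audio):
        text = self.asr(audio)                  # speech -> text
        reply = self.llm(text, self.history)    # text + context -> response
        self.history.append(Turn(text, reply))  # record the exchange
        return self.tts(reply)                  # response -> synthesized audio


# Stub components so the sketch runs end to end:
asr = lambda audio: audio.decode()
llm = lambda text, history: f"[advising reply to: {text}]"
tts = lambda reply: reply.encode()

pipeline = VoiceAssistantPipeline(asr, llm, tts)
out = pipeline.handle_utterance(b"When is the add/drop deadline?")
```

Because each stage is injected as a callable, any one component (say, a different TTS model) can be swapped without touching the rest, which mirrors the modularity the abstract emphasizes.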