Have you ever found it difficult to understand a large, messy codebase? Or wondered how tools that analyze and explore code actually work? In this article, we’ll solve these problems by building a powerful codebase exploration tool from scratch. Using static code analysis and the Gemini model, we’ll create an easy-to-use system that helps developers query, understand, and gain useful insights from their code. Ready to change the way you navigate code? Let’s begin!
Learning Objectives
- How to structure complex software using the object-oriented programming paradigm.
- How to parse and analyze a Python codebase using the Abstract Syntax Tree (AST) module.
- How to integrate Google’s Gemini LLM API into a Python code-analysis application.
- How to build a Typer-based command-line query system for codebase exploration.
This article was published as a part of the Data Science Blogathon.
Table of contents
- The Need for Smarter Code Exploration
- Architecture Overview
- Starting Hands-on Project
- Setup Project Environment
- Implementing the Code
- Query Processing Engine
- Query Handling System
- Command Line App Implementation (CLI)
- Test the Application
- Future Development
- Conclusion
- Frequently Asked Questions
The Need for Smarter Code Exploration
First of all, building such an application gives you a learning boost in software development: it will help you learn how to implement complex software using the object-oriented programming paradigm and master the art of handling larger projects (although this one is not that large).
Second, today’s software projects consist of thousands of lines of code written across many files and folders. Traditional approaches to code exploration, such as grep or an IDE’s search function, fall short when developers need to understand higher-level concepts or relationships within the codebase. AI-powered tools can make significant strides in this realm. Our application allows developers to ask questions about their codebase in plain English and receive detailed, contextual responses.
Architecture Overview
The tool consists of four main components:
- Code Parser: It is the foundation of our system, which is responsible for analyzing Python files and extracting their structure using Python’s Abstract Syntax Tree (AST) module. It identifies classes, methods, functions, and imports. It will create a comprehensive map of the codebase.
- Gemini Client: A wrapper around Google’s Gemini API that handles communication with the LLM. This component manages API authentication and provides a clean interface for sending queries and receiving responses.
- Query Processor: It is the main engine of the tool which is responsible for formatting the codebase context and queries in a way that Gemini can understand and process effectively. It maintains a persistent index of the codebase structure and manages the interaction between the parser and the LLM.
- CLI interface: A user-friendly command-line interface built with Typer, providing commands for indexing codebase, querying code structure, and analyzing stack traces.
Starting Hands-on Project
This section will guide you through the initial steps to build and implement your project, ensuring a smooth start and effective learning experience.
Project Folder Structure
The project folder structure will look similar to this:
```
codebase_explorer/
└── src/
    ├── __init__.py
    ├── indexer/
    │   ├── __init__.py
    │   └── code_parser.py
    ├── query_engine/
    │   ├── __init__.py
    │   ├── query_processor.py
    │   └── gemini_client.py
    ├── main.py
    └── .env
```
Setup Project Environment
Set up the project environment with the following steps:
```bash
# create a new conda env
conda create -n cb_explorer python=3.11
conda activate cb_explorer
```
Install all the necessary libraries:
```bash
pip install google-generativeai google-ai-generativelanguage
pip install python-dotenv typer llama-index
```
Implementing the Code
We will start with understanding and implementing the codebase parsing system. It has two important functions:
- parse_codebase()
- extract_definitions()
Extracting definitions from the Abstract Syntax Tree:
```python
import ast
import os
from typing import Dict, Any


def extract_definitions(tree: ast.AST) -> Dict[str, list]:
    """Extract class and function definitions from AST."""
    definitions = {
        "classes": [],
        "functions": [],
        "imports": []
    }
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            definitions["classes"].append({
                "name": node.name,
                "lineno": node.lineno
            })
        elif isinstance(node, ast.FunctionDef):
            definitions["functions"].append({
                "name": node.name,
                "lineno": node.lineno
            })
        elif isinstance(node, ast.Import):
            for name in node.names:
                definitions["imports"].append(name.name)
    return definitions
```
This is a helper function for parse_codebase(). It takes the abstract syntax tree (AST) of a Python file. The function initializes a dictionary with empty lists for classes, functions, and imports. Then, ast.walk() iterates through all nodes in the AST, identifying classes, functions, imports, and their line numbers, and appending each definition to the definitions dictionary.
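For example, running the helper on a small, made-up snippet illustrates the shape of the returned dictionary (a sketch for illustration only):

```python
import ast

# A made-up snippet, only to illustrate the output shape.
sample_source = """
import math

class Circle:
    def area(self, r):
        return math.pi * r ** 2

def main():
    pass
"""

tree = ast.parse(sample_source)
# Assumes extract_definitions() from above is in scope.
print(extract_definitions(tree))
# 'classes' contains Circle, 'functions' contains area and main
# (with their line numbers), and 'imports' contains 'math'.
```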
Parsing CodeBase
This function scans a directory for Python files, reads their content, and extracts their structure.
```python
import ast
import os
from typing import Dict, Any


def parse_codebase(directory: str) -> Dict[str, Any]:
    """Parse Python files in the directory and extract code structure."""
    code_structure = {}
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(".py"):
                file_path = os.path.join(root, file)
                with open(file_path, "r", encoding="utf-8") as f:
                    try:
                        content = f.read()
                        tree = ast.parse(content)
                        code_structure[file_path] = {
                            "definitions": extract_definitions(tree),
                            "content": content
                        }
                    except Exception as e:
                        print(f"Error parsing {file_path}: {e}")
    return code_structure
```
The function takes the directory path as a string and outputs a dictionary of code structures. The dictionary stores the extracted data for each Python file.
It loops through all subdirectories and files in the given directory. os.walk() provides a recursive way to explore the entire directory tree, and only files ending with the .py extension are processed.
The Python ast module parses each file’s content into an abstract syntax tree (AST), which represents the file’s structure. The resulting tree is then passed to extract_definitions(tree). If parsing fails, an error message is printed, but processing continues for the other files.
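As a quick sanity check, the parser can be pointed at any folder of Python files; the path below is a placeholder:

```python
# Placeholder path; point this at any folder that contains .py files.
structure = parse_codebase("./project_test")

for file_path, details in structure.items():
    defs = details["definitions"]
    print(file_path)
    print("  classes:  ", [c["name"] for c in defs["classes"]])
    print("  functions:", [f["name"] for f in defs["functions"]])
    print("  imports:  ", defs["imports"])
```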
Query Processing Engine
In the query_engine directory, create two files named gemini_client.py and query_processor.py.
Gemini Client
This file wraps Google’s Gemini API behind a small client class:
```python
import os
from typing import Optional
from google import generativeai as genai
from dotenv import load_dotenv

load_dotenv()


class GeminiClient:
    def __init__(self):
        self.api_key = os.getenv("GOOGLE_API_KEY")
        if not self.api_key:
            raise ValueError("GOOGLE_API_KEY environment variable is not set")
        genai.configure(api_key=self.api_key)
        self.model = genai.GenerativeModel("gemini-1.5-flash")

    def query(self, prompt: str) -> Optional[str]:
        """Query Gemini with the given prompt."""
        try:
            response = self.model.generate_content(prompt)
            return response.text
        except Exception as e:
            print(f"Error querying Gemini: {e}")
            return None
```
Here, we define a GeminiClient class to interact with Google’s Gemini AI model. It authenticates with the API using GOOGLE_API_KEY from your .env file. After configuring the API, it provides a query method to generate a response for a given prompt.
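A minimal usage sketch, assuming the script is run from the src directory with GOOGLE_API_KEY set in .env:

```python
from query_engine.gemini_client import GeminiClient

client = GeminiClient()
print(client.query("Summarize what an abstract syntax tree is in one sentence."))
```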
Query Handling System
In this section, we will implement the QueryProcessor class to manage the codebase context and enable querying with Gemini.
```python
import os
import json
from llama_index.embeddings.gemini import GeminiEmbedding
from dotenv import load_dotenv
from typing import Dict, Any, Optional
from .gemini_client import GeminiClient

load_dotenv()

gemini_api_key = os.getenv("GOOGLE_API_KEY")
model_name = "models/embeddings-001"
embed_model = GeminiEmbedding(model_name=model_name, api_key=gemini_api_key)


class QueryProcessor:
    def __init__(self):
        self.gemini_client = GeminiClient()
        self.codebase_context: Optional[Dict[str, Any]] = None
        self.index_file = "./indexes/codebase_index.json"

    def load_context(self):
        """Load the codebase context from disk if it exists."""
        if os.path.exists(self.index_file):
            try:
                with open(self.index_file, "r", encoding="utf-8") as f:
                    self.codebase_context = json.load(f)
            except Exception as e:
                print(f"Error loading index: {e}")
                self.codebase_context = None

    def save_context(self):
        """Save the codebase context to disk."""
        if self.codebase_context:
            try:
                with open(self.index_file, "w", encoding="utf-8") as f:
                    json.dump(self.codebase_context, f, indent=2)
            except Exception as e:
                print(f"Error saving index: {e}")

    def set_context(self, context: Dict[str, Any]):
        """Set the codebase context for queries."""
        self.codebase_context = context
        self.save_context()

    def format_context(self) -> str:
        """Format the codebase context for Gemini."""
        if not self.codebase_context:
            return ""
        context_parts = []
        for file_path, details in self.codebase_context.items():
            defs = details["definitions"]
            context_parts.append(
                f"File: {file_path}\n"
                f"Classes: {[c['name'] for c in defs['classes']]}\n"
                f"Functions: {[f['name'] for f in defs['functions']]}\n"
                f"Imports: {defs['imports']}\n"
            )
        return "\n\n".join(context_parts)

    def query(self, query: str) -> Optional[str]:
        """Process a query about the codebase."""
        if not self.codebase_context:
            return (
                "Error: No codebase context available. Please index the codebase first."
            )
        prompt = f"""
        Given the following codebase structure:
        {self.format_context()}

        Query: {query}

        Please provide a detailed and accurate answer based on the codebase structure above.
        """
        return self.gemini_client.query(prompt)
```
After loading the necessary libraries, load_dotenv() loads environment variables from the .env file, which contains our GOOGLE_API_KEY for the Gemini API.
- The GeminiEmbedding class initializes the embedding-001 model from Google’s servers.
- The QueryProcessor class is designed to handle the codebase context and interact with the GeminiClient. The load_context method loads codebase information from the JSON index file if it exists.
- The save_context method saves the current codebase context into the JSON file for persistence. The set_context method updates the codebase context and immediately saves it using save_context, and the format_context method converts the codebase data into a human-readable string, summarizing file paths, classes, functions, and imports for queries.
- The query method is the most important one: it constructs a prompt using the codebase context and the user’s query, sends it to the Gemini model through GeminiClient, and gets back the response, as sketched below.
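A short usage sketch of the class, assuming it is run from the src directory after the codebase has been parsed:

```python
from indexer.code_parser import parse_codebase
from query_engine.query_processor import QueryProcessor

processor = QueryProcessor()
processor.set_context(parse_codebase("./project_test"))  # parse and persist the index
print(processor.query("Which functions are defined in this project?"))
```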
Command Line App Implementation (CLI)
Create a main.py file in the src folder of the project and follow these steps.
Step 1: Import Libraries
```python
import os
import json
import typer
from pathlib import Path
from typing import Optional
from indexer.code_parser import parse_codebase
from query_engine.query_processor import QueryProcessor
```
Step 2: Initialize typer and query processor
Let’s create the Typer app and a query-processor object from the classes, as sketched below.
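A minimal sketch of this step; the object names app and query_processor are assumptions used throughout the following snippets:

```python
# Create the Typer application and a shared QueryProcessor instance.
app = typer.Typer(help="Explore a Python codebase with Gemini.")
query_processor = QueryProcessor()
```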
Step 3: Indexing the Python Project Directory
Here, the index method will be used as a command in the terminal, and the function will index the Python codebase in the specified directory for future querying and analysis.
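A sketch of the index command based on the description above and below; the exact messages and error handling are assumptions:

```python
@app.command()
def index(directory: str):
    """Index a Python codebase for querying."""
    if not os.path.isdir(directory):
        typer.echo(f"Error: {directory} is not a valid directory")
        raise typer.Exit(code=1)
    try:
        code_structure = parse_codebase(directory)
        query_processor.set_context(code_structure)
        typer.echo(f"Successfully indexed {len(code_structure)} Python file(s)")
    except Exception as e:
        typer.echo(f"Error indexing codebase: {e}")
```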
It will first check if the directory exists and then use the parse_codebase function to extract the structure of Python files in the directory.
After parsing, it saves the parsed codebase structure through query_processor. The whole process sits in a try/except block so that exceptions during parsing are handled gracefully. This prepares the codebase for efficient querying with the Gemini model.
Step 4: Querying the codebase
After indexing, we can query the codebase to understand it or get information about any function in it.
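A sketch of the query command based on the behaviour described below; the exact structure is an assumption:

```python
@app.command()
def query(question: str):
    """Ask a natural-language question about the indexed codebase."""
    # Load the persisted index from disk if it is not already in memory.
    if not query_processor.codebase_context:
        query_processor.load_context()
    response = query_processor.query(question)
    typer.echo(response)
```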
First, it checks whether query_processor has already loaded a codebase context and, if not, tries to load the context from disk. It then uses the query_processor’s query method to process the query. Finally, it prints the LLM’s response to the terminal using the typer.echo() method.
Step 5: Run the Application
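A minimal entry point so that Typer dispatches the commands; this sketch assumes the app object from Step 2:

```python
if __name__ == "__main__":
    app()
```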
Test the Application
To test your hard work, follow the steps below:
- Create a folder named indexes in your project root, where we will put all our index files.
- Create a codebase_index.json file and put it in the previously created indexes folder.
- Then create a project_test folder in the root, where we will store our Python files for testing.
- Create a find_palindrome.py file in the project_test folder and put the code below into it.
Code Implementation
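A small, self-contained palindrome finder you can use as the test file; this is a sketch, and any equivalent implementation works:

```python
def is_palindrome(word: str) -> bool:
    """Return True if the word reads the same forwards and backwards."""
    cleaned = word.lower()
    return cleaned == cleaned[::-1]


def find_palindromes(text: str) -> list:
    """Return all palindromic words (longer than one character) in the text."""
    return [w for w in text.split() if len(w) > 1 and is_palindrome(w)]


if __name__ == "__main__":
    sample = "madam saw a level racecar near the noon market"
    print(find_palindromes(sample))
```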
This file finds the palindromes in a given string. We will index this file and query it from the terminal using the CLI application.
Now, open your terminal, run the commands below, and see the magic.
Indexing the project
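A typical invocation from the project root; the exact path and command name depend on your setup and are assumptions based on the structure above:

```bash
python src/main.py index ./project_test
```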
Output:
You should see a message like "Successfully indexed 1 Python file", and the generated indexes/codebase_index.json will contain the parsed structure of the file: its classes, functions, imports, and source content.
Querying the project
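A sample query invocation; the question text is just an example:

```bash
python src/main.py query "What does the find_palindromes function do?"
```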
Output:
If everything is set up properly, you will see the model’s answer printed in your terminal. You can try it with your own Python files and tell me in the comment section what your output is. THANK YOU for staying with me.
Future Development
This is a foundational prototype that can be extended with many interesting features, such as:
- Integration with IDE plugins for seamless code exploration.
- An AI-driven automated debugging system (I am working on that).
- Support for more popular languages such as JavaScript, Java, TypeScript, and Rust.
- Real-time code analysis and LLM-powered suggestions for improvements.
- Automated documentation using Gemini or Llama 3.
- Local LLM integration for on-device code exploration and feature additions.
Conclusion
The Codebase Explorer helps you understand the practical application of AI in software development tools. By combining traditional static analysis with modern AI capabilities, we have created a tool that makes codebase exploration more intuitive and efficient. This approach shows how AI can augment developer workflows without replacing existing tools, providing a new layer of understanding and accessibility to complex codebases.
All the code used in this article is here.
Key Takeaways
- Structured code parsing is the most important technique for code analysis.
- CodeBase Explorer simplifies code navigation, allowing developers to quickly understand and manage complex code structures.
- CodeBase Explorer enhances debugging efficiency, offering tools to analyze dependencies and identify issues faster.
- Gemini can significantly enhance code understanding when combined with traditional static analysis.
- CLI tools can provide a powerful interface for LLM-assisted code exploration.
Frequently Asked Questions
Q1. How does the tool handle a large codebase?
A. The tool uses a persistent indexing system that parses and stores the codebase structure, allowing for efficient queries without needing to reanalyze the code each time. The index is updated only when the codebase changes.
Q2. Can the tool work offline?
A. The code parsing and index management can work offline, but querying the codebase through the Gemini API requires an internet connection to communicate with external servers. Ollama could be integrated with the tool, making it possible to use an on-device LLM or SLM such as Llama 3 or Phi-3 for querying the codebase.
Q3. How accurate are the LLM-generated responses?
A. The accuracy depends on both the quality of the parsed code context and the capabilities of the Gemini model. The tool provides structured code information to the AI model, which helps improve response accuracy, but users should still verify critical information through traditional means.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.