Use Zoekt for code search
Status | Authors | Coach | DRIs | Owning Stage | Created |
---|---|---|---|---|---|
ongoing | dgruzd, DylanGriffith | DylanGriffith | changzhengliu | devops foundations | 2022-12-28 |
Summary
We have implemented an additional code search functionality in GitLab that is backed by Zoekt, an open source search engine specifically designed for code search. Zoekt is used as an API by GitLab and remains an implementation detail, while the user interface in GitLab has been enhanced with new features enabled by Zoekt’s capabilities.
This integration provides significant improvements over the existing Elasticsearch-based search, including:
- Exact match mode: Returns results that precisely match the search query, eliminating false positives
- Regular expression mode: Supports regex patterns and boolean expressions for powerful code searching
- Multiple line matches: Shows multiple matching lines from the same file in the search results
- Self-registering architecture: Enables simple scaling and management of search infrastructure
Motivation
GitLab code search functionality has historically been backed by Elasticsearch. While Elasticsearch has proven useful for other types of search (issues, merge requests, comments, etc.), it is not ideally suited for code search where users expect matches to be precise (no false positives) and flexible (supporting features like substring matching and regexes).
After investigating our options, we determined that Zoekt is the most suitable well-maintained open source technology for code search. Our research indicated that the fundamental architecture of Zoekt matches what we would implement if we were to build a solution from scratch.
Our benchmarking showed that Zoekt is viable at our scale, and the integration has been successfully deployed to GitLab.com.
Goals
The main goals of this integration have been to implement the following highly requested improvements to code search:
- Exact match (substring match) code searches in advanced search
- Support regular expressions with Advanced Global Search
- Support multiple line matches in the same file
The rollout was designed to catch and resolve scaling or infrastructure cost issues as early as possible, allowing us to pivot if necessary before investing too heavily in this technology.
Non-Goals
The following were not initial goals but could be built upon this solution in the future:
- Improving security scanning features by leveraging fast regex scans across repositories
- Reducing search infrastructure costs (though this may be possible with further optimizations)
- AI/ML features to predict what users might be interested in finding
- Comprehensive code intelligence and navigation features (which would require more structured data)
Proposal
An initial implementation of the Zoekt integration was created to demonstrate the feasibility of using Zoekt as a drop-in replacement for Elasticsearch code searches. This design document outlines the details of the implementation and the steps taken to scale the solution for GitLab.com and self-managed instances.
Design and implementation details
User Experience
When a user performs an advanced search on a group or project where Zoekt is enabled, they can now toggle between two search modes in the UI:
- Exact match mode: Returns results that exactly match the query (default mode)
- Regular expression mode: Supports regex patterns and boolean expressions
Users can select their preferred search mode using a toggle in the UI. The search syntax supports advanced filtering with modifiers like:
- `file:` to filter by filename
- `lang:` to filter by programming language
- `sym:` to search within symbols (methods, classes, etc.)
- and other syntax options
Here’s a screenshot of the new UI:
Key Components
Unified Binary: gitlab-zoekt
Zoekt comes with its own binaries for indexing and searching. Initially we used some of these, and over time we started building our own. We then pivoted to a single binary for both indexing and searching.
We call this unified binary `gitlab-zoekt`, which replaces the previously separate binaries (`gitlab-zoekt-indexer` and `gitlab-zoekt-webserver`). This is a Go codebase that uses public modules from the Zoekt codebase as a library, rather than using the binaries directly. This unified binary can operate in two distinct modes:
- Indexer mode: Responsible for indexing repositories
- Webserver mode: Responsible for serving search requests
Having a unified binary simplifies deployment, operation, and maintenance of the Zoekt infrastructure. The key advantages of this approach include:
- Simplified deployment: Only one binary needs to be built, deployed, and maintained
- Consistent codebase: Shared code between indexer and webserver is maintained in one place
- Operational flexibility: The same binary can run in different modes based on configuration
- Testing mode: The unified binary can run both services simultaneously for testing purposes
Database Models
Zoekt is not a distributed database (like Elasticsearch), or even really a database service (like Postgres). Instead, it is a set of Go modules (and binaries) that interact with index files on disk. It supports creating index files and searching them. Because we needed to build a higher-level distributed, clustered, and replicated search engine on top of it, we needed somewhere to manage the full lifecycle of Zoekt processes and indexes. We chose to store this lifecycle data in Rails; Zoekt processes periodically poll Rails to determine what to do next.
GitLab uses several database models to manage Zoekt:
- `Search::Zoekt::EnabledNamespace`: Tracks which top-level namespaces have Zoekt enabled
- `Search::Zoekt::Node`: Represents a Zoekt server node with information about its capacity, status, and configuration
- `Search::Zoekt::Replica`: Manages replica relationships for high availability
- `Search::Zoekt::Index`: Manages the index state for a top-level namespace, including storage allocation and watermark levels
- `Search::Zoekt::Repository`: Represents a project repository in Zoekt with indexing state
- `Search::Zoekt::Task`: Tracks indexing tasks (index, force_index, delete) that need to be processed by Zoekt nodes
Architecture Overview
graph TD
  User[User] --> GitLab[GitLab Rails Application]
  GitLab <--> DB[(GitLab Database)]
  GitLab <--> ZoektNode1[Zoekt Node 1]
  GitLab <--> ZoektNode2[Zoekt Node 2]
  GitLab <--> ZoektNodeN[Zoekt Node N]
  ZoektNode1 <--> Gitaly[Gitaly]
  ZoektNode2 <--> Gitaly
  ZoektNodeN <--> Gitaly
  subgraph "Zoekt Node"
    ZoektBinary["gitlab-zoekt binary"]
    IndexStorage[(Index Storage)]
    Gateway[NGINX Gateway]
    ZoektBinary --> IndexStorage
    Gateway --> ZoektBinary
  end
  subgraph "Database Models"
    Node[Search::Zoekt::Node]
    Index[Search::Zoekt::Index]
    Task[Search::Zoekt::Task]
    Repository[Search::Zoekt::Repository]
    EnabledNamespace[Search::Zoekt::EnabledNamespace]
    Replica[Search::Zoekt::Replica]
  end
The Zoekt integration consists of several key components working together:
- GitLab Rails Application: Manages which repositories need to be indexed, coordinates with Zoekt nodes
- Zoekt Nodes: Run the `gitlab-zoekt` binary to handle indexing and searching of repositories
- Gitaly: Provides Git repository access to Zoekt for indexing
- Database: Stores metadata about nodes, indices, tasks, and repositories
Indexing Flow
sequenceDiagram
  participant GitLab as GitLab Rails
  participant DB as GitLab Database
  participant Zoekt as Zoekt Node
  participant Gitaly as Gitaly Service
  Note over GitLab: Repository updated or created
  GitLab->>DB: Create zoekt_tasks records
  loop Task Polling
    Zoekt->>GitLab: GET /internal/search/zoekt/:uuid/tasks
    GitLab->>DB: Find pending tasks
    GitLab->>Zoekt: Return tasks to process
  end
  Zoekt->>Gitaly: Fetch repository data
  Zoekt->>Zoekt: Create/update search index
  Zoekt->>GitLab: POST /internal/search/zoekt/:uuid/callback
  GitLab->>DB: Update task status
  GitLab->>DB: Update repository and index state
The indexing process follows these steps:
- When a repository is created or updated, the GitLab Rails application creates `zoekt_tasks` records
- Zoekt nodes (running in indexer mode) periodically pull tasks through the internal API
- Zoekt nodes process the tasks by fetching repository data from Gitaly and creating search indices
- Zoekt nodes send callback notifications to GitLab to update task status
- GitLab updates the appropriate database records (`zoekt_task`, `zoekt_repository`, `zoekt_index`)
`zoekt_task` can be of three different types:
- `index_repo`: Incremental indexing (from the last indexed SHA to the latest SHA of the default branch)
- `force_index_repo` or forced indexing: Full reindex of the repository (deletes existing index files and reindexes everything)
- `delete_repo`: Schedules existing indexed files for deletion
To avoid race conditions, there is a locking mechanism to ensure only one indexing operation occurs for a project at any given time.
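The polling, callback, and per-project locking described above can be sketched as a small in-memory queue. This is only an illustration under stated assumptions: the task type names come from the text, while the `TaskQueue` class, its methods, and the idea of releasing the lock on callback are hypothetical stand-ins for the Rails-side logic.

```python
from dataclasses import dataclass, field

# Task type names are taken from the text; everything else is illustrative.
TASK_TYPES = {"index_repo", "force_index_repo", "delete_repo"}


@dataclass
class TaskQueue:
    pending: list = field(default_factory=list)  # (project_id, task_type), in arrival order
    locked: set = field(default_factory=set)     # projects with a task currently in flight

    def enqueue(self, project_id: int, task_type: str) -> None:
        if task_type not in TASK_TYPES:
            raise ValueError(f"unknown task type: {task_type}")
        self.pending.append((project_id, task_type))

    def poll(self, limit: int) -> list:
        """Hand out up to `limit` tasks, never more than one per project at a time."""
        handed_out, remaining = [], []
        for project_id, task_type in self.pending:
            if len(handed_out) < limit and project_id not in self.locked:
                self.locked.add(project_id)
                handed_out.append((project_id, task_type))
            else:
                remaining.append((project_id, task_type))
        self.pending = remaining
        return handed_out

    def callback(self, project_id: int) -> None:
        """A node reported completion: release the project lock (assumed behavior)."""
        self.locked.discard(project_id)
```

In this sketch, a second task for the same project stays queued until the first one's callback arrives, which is one simple way to get the "one indexing operation per project" guarantee.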
Search Flow
sequenceDiagram
  participant User
  participant GitLab as GitLab Rails
  participant DB as GitLab Database
  participant Zoekt as Zoekt Node (Webserver)
  User->>GitLab: Perform code search
  GitLab->>DB: Check if namespace has Zoekt enabled
  GitLab->>DB: Get online Zoekt nodes
  Note over GitLab: Apply user permissions
  GitLab->>Zoekt: Forward search query to node
  Zoekt->>Zoekt: Process search query against index
  Zoekt->>GitLab: Return search results
  GitLab->>User: Format and present results
The search process follows these steps:
- User performs a search in GitLab UI
- GitLab determines if the search should use Zoekt based on user preferences and enabled namespaces
- If Zoekt is appropriate, GitLab forwards the search to a Zoekt node running in webserver mode
- Zoekt processes the search and returns results
- GitLab formats and presents the results to the user
Communication Flow
The communication between GitLab and Zoekt nodes happens through bidirectional API calls, all secured with appropriate authentication mechanisms.
Authentication Architecture
The Zoekt integration implements a comprehensive authentication system with these key components:
- Indexer → Rails Authentication (JWT): Zoekt indexer authenticates to GitLab Rails using JWT tokens signed with the GitLab shell secret
- Rails → Webserver Authentication (Basic Auth): GitLab Rails authenticates to Zoekt webserver using HTTP Basic Authentication via NGINX
- Future Planned Authentication (JWT): Plans to replace Basic Auth with JWT for Rails → Webserver authentication, mirroring the approach used for Indexer → Rails
This tiered authentication approach ensures secure communication in all directions while maintaining compatibility with GitLab’s existing security patterns.
Task Retrieval API
Zoekt nodes periodically call GitLab’s internal API to:
- Register themselves with GitLab (providing node information like UUID, URL, disk space)
- Retrieve tasks that need to be processed
- Update their status and metrics
GET /internal/search/zoekt/:uuid/tasks
This API is secured with JWT authentication, where:
- The JWT token is generated by the Zoekt indexer using the GitLab shell secret
- The token is included in the `Gitlab-Shell-Api-Request` header
- GitLab Rails validates the token using the same shell secret
This authentication mechanism ensures that only authorized Zoekt nodes can register and retrieve tasks from GitLab.
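As a rough illustration of shared-secret JWT authentication, the sketch below signs and verifies a token using only the Python standard library. The `Gitlab-Shell-Api-Request` header name comes from the text. The HS256 algorithm, the `iss` value, the 60-second lifetime, and all function names are assumptions made for this example and may not match what the indexer or Rails actually use.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_jwt(secret: bytes, claims: dict) -> str:
    """Build a compact JWT signed with HMAC-SHA256 (algorithm is an assumption)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode()) for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_jwt(secret: bytes, token: str) -> dict:
    """Check the signature and expiry with the same shared secret; return the claims."""
    signing_input, _, signature = token.rpartition(".")
    expected = _b64url(hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest())
    if not hmac.compare_digest(signature, expected):
        raise ValueError("invalid signature")
    payload = signing_input.split(".")[1]
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims


def task_request_headers(secret: bytes) -> dict:
    """Headers a node might send when polling for tasks (claims are hypothetical)."""
    issued = int(time.time())
    token = sign_jwt(secret, {"iss": "gitlab-zoekt", "iat": issued, "exp": issued + 60})
    return {"Gitlab-Shell-Api-Request": token}
```

Because both sides hold the same secret, the receiver can recompute the signature and reject any request whose token was signed with a different key.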
Callback API
After processing tasks, Zoekt nodes call GitLab’s callback API to:
- Update task status (success/failure)
- Provide additional information (for example, repository size)
- Report errors or issues
POST /internal/search/zoekt/:uuid/callback
This API also uses JWT authentication with the same mechanism as the Task Retrieval API, ensuring secure bidirectional communication.
This asynchronous callback architecture is a significant improvement over the previous design, which used Sidekiq jobs for indexing operations. By using callbacks instead of blocking Sidekiq jobs, the system gains several important benefits:
- Reduced Sidekiq load: Indexing operations no longer block Sidekiq workers, freeing them for other critical GitLab tasks
- Better scalability: The number of concurrent indexing operations is only limited by Zoekt node capacity, not by Sidekiq worker availability
- Improved reliability: If a node goes down during indexing, it doesn’t leave Sidekiq jobs in an incomplete state
- More efficient resource usage: Long-running indexing tasks don’t consume valuable Sidekiq resources
- Separation of concerns: Zoekt nodes handle indexing independently, reporting back only when completed
This approach allows GitLab to maintain a lightweight coordination role while the computationally intensive work is handled by specialized Zoekt nodes, resulting in better overall system performance and responsiveness.
Search API
GitLab calls the Zoekt webserver API to:
- Execute search queries
- Retrieve search results
- Apply filtering based on user permissions
GET /api/search
In deployed environments (particularly with Helm), this communication is secured with HTTP Basic Authentication configured in NGINX. This provides a simple but effective authentication layer for search requests.
The authentication approach for search is planned to transition to JWT-based authentication in the future, which will provide more granular control and better align with GitLab’s authentication patterns.
Zoekt Infrastructure
Each Zoekt node runs a single `gitlab-zoekt` binary that can operate in both indexer and webserver modes simultaneously. The nodes store `.zoekt` index files on persistent storage for fast searches.
A typical deployment includes:
- The `gitlab-zoekt` binary serving both indexing and search requests
- Universal CTags for symbol extraction
- An internal NGINX gateway for routing requests
Scaling and High Availability
Self-Registering Node Architecture
Zoekt implements a self-registering node architecture inspired by GitLab Runner:
- Zoekt nodes register themselves with GitLab by providing their address, name, and status
- GitLab maintains a registry of nodes with their status, capacity, and assignments
- GitLab manages the shard assignments internally, assigning namespaces to specific nodes
- Nodes that don’t check in for a configurable period can be automatically removed
This architecture makes the system self-configuring and facilitates easy scaling.
Unlike GitLab Runner, Zoekt nodes authenticate with a shared secret managed at the infrastructure level and cannot be registered by users. Self-registration is therefore a convenience for the operator rather than a feature for users.
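A minimal sketch of the check-in and stale-node cleanup idea follows. The `NodeRegistry` class, its field names, and the timeout parameter are hypothetical; the real registry lives in the GitLab database and tracks more information (capacity, assignments, and so on).

```python
import time
from dataclasses import dataclass


@dataclass
class NodeRecord:
    uuid: str
    url: str
    last_seen_at: float


class NodeRegistry:
    """In-memory stand-in for the node registry (illustrative only)."""

    def __init__(self, offline_after_seconds: float):
        self.offline_after_seconds = offline_after_seconds
        self.nodes = {}  # uuid -> NodeRecord

    def check_in(self, uuid: str, url: str, now: float = None) -> None:
        """Register a new node or refresh an existing one on each poll."""
        now = time.time() if now is None else now
        record = self.nodes.get(uuid)
        if record is None:
            self.nodes[uuid] = NodeRecord(uuid, url, now)
        else:
            record.url, record.last_seen_at = url, now

    def remove_stale(self, now: float = None) -> list:
        """Drop nodes that have not checked in within the configured period."""
        now = time.time() if now is None else now
        stale = [
            uuid for uuid, node in self.nodes.items()
            if now - node.last_seen_at > self.offline_after_seconds
        ]
        for uuid in stale:
            del self.nodes[uuid]
        return stale
```

Registration and heartbeat share one code path here: a node that polls is known, and a node that stops polling eventually disappears without operator action.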
Sharding Strategy
- Groups/namespaces are assigned to specific Zoekt nodes for indexing and searching
- GitLab manages the shard assignments internally based on node capacity and load
- When new nodes are added, they can automatically take on new workloads
- If nodes go offline, their work can be reassigned to other nodes
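One way to picture capacity-based assignment is the sketch below, which places a namespace on the online node with the most unclaimed storage. The selection rule, the dictionary layout, and the function name are assumptions for illustration; the text only states that assignment is based on node capacity and load.

```python
def assign_namespace(nodes: dict, needed_bytes: int):
    """Pick an online node with enough free space and record the reservation.

    `nodes` maps a node name to {"usable": int, "reserved": int, "online": bool}.
    Returns the chosen node name, or None when no node can take the namespace.
    """
    candidates = [
        (info["usable"] - info["reserved"], name)
        for name, info in nodes.items()
        if info["online"] and info["usable"] - info["reserved"] >= needed_bytes
    ]
    if not candidates:
        return None
    _, chosen = max(candidates)  # most unclaimed storage wins (assumed policy)
    nodes[chosen]["reserved"] += needed_bytes
    return chosen
```

A newly added node starts with the most unclaimed storage, so under this policy it would naturally receive the next assignments.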
Replication Strategy
graph TD
  GitLab[GitLab Rails Application]
  DB[(Database)]
  GitLab <--> DB
  subgraph "Database Models"
    ReplicaRecord1["Replica Record 1"]
    ReplicaRecord2["Replica Record 2"]
    Index1["Index 1 for Namespace A\n(on Node 1)"]
    Index2["Index 2 for Namespace A\n(on Node 2)"]
  end
  ReplicaRecord1 --> Index1
  ReplicaRecord2 --> Index2
  subgraph "Physical Infrastructure"
    Node1[Zoekt Node 1]
    Node2[Zoekt Node 2]
    Gitaly[Gitaly]
    Node1 <--> Gitaly
    Node2 <--> Gitaly
  end
  Index1 -.-> Node1
  Index2 -.-> Node2
  GitLab --> Node1
  GitLab --> Node2
The replication strategy works at the database record level rather than through actual data synchronization between nodes:
- Independent Indexing: Each Zoekt node independently indexes repositories by fetching data directly from Gitaly
- Multiple Replica Records: For high availability, GitLab can create multiple `Search::Zoekt::Replica` records for a single namespace
- Distributed Indices: Each replica record is associated with an index record that may be assigned to different physical Zoekt nodes
- Fast Indexing: Indexing is efficient (approximately 10 seconds for a large repository like `gitlab-org/gitlab`), making it practical to maintain multiple independent indices
- No Complex Synchronization: This approach eliminates the need for complex index file synchronization between nodes
- Search Load Distribution: GitLab can route search requests to any node that has an index for the relevant namespace
Currently, GitLab typically creates a single replica record per namespace, but the system is designed to support a configurable number of replicas per namespace in the future. This approach provides the following benefits:
- Horizontal Scalability: Add more nodes to handle more namespaces or increase replication
- High Availability: If one node fails, searches can be routed to other nodes with replica indices
- Simple Operation: No complex replication mechanisms to maintain or troubleshoot
- Independent Scaling: Search and indexing capacity can be scaled independently by adding more nodes
This design prioritizes operational simplicity and reliability while still providing the necessary redundancy for high availability.
Since our Zoekt database is not a source of truth (that is, it simply syncs repositories from Gitaly), we do not need to designate specific replicas as a “primary” or “leader”. Instead, we let the replicas independently sync data from Gitaly and assume that each index update picks up the current source of truth.
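Because no replica is a primary, routing a search can be as simple as choosing any online node that holds an index for the namespace. The sketch below assumes a flat list of `(namespace, node)` pairs and random selection; both are illustrative choices, not the documented routing policy.

```python
import random


def nodes_for_search(namespace: str, indices: list, online_nodes: set) -> list:
    """Return the online nodes that hold an index for `namespace`."""
    return sorted({node for ns, node in indices if ns == namespace and node in online_nodes})


def pick_node(namespace: str, indices: list, online_nodes: set, rng=random):
    """Choose one candidate at random; None means no replica is reachable."""
    candidates = nodes_for_search(namespace, indices, online_nodes)
    return rng.choice(candidates) if candidates else None
```

If one node goes offline, it simply drops out of the candidate list and searches continue against the remaining replicas.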
Deployment Options
Kubernetes/Helm
GitLab provides a Helm chart (`gitlab-zoekt`) for Kubernetes deployments with the following features:
- Deploys Zoekt in a StatefulSet with persistent volumes for index storage
- Configurable resource allocation, scaling, and networking options
- Automatic node registration and service discovery
- Gateway component for load balancing and authentication
- Configurable Basic Authentication through NGINX for Rails → Webserver communication
The `gitlab-zoekt` Helm chart has proven to be highly scalable in production environments. On GitLab.com, this deployment is handling over 36 TiB of data, demonstrating its ability to operate at enterprise scale. The chart’s design allows for both horizontal and vertical scaling to accommodate growing code search needs while maintaining performance and reliability.
Docker/Container
Containers are built from the CNG repository with:
- The unified `gitlab-zoekt` binary
- Universal CTags for symbol extraction
- Configurable environment variables for different operating modes
Database Schema
Key database tables include:
- `zoekt_nodes`: Information about Zoekt server nodes
- `zoekt_indices`: Tracks the indexing state for namespaces
- `zoekt_repositories`: Maps GitLab projects to Zoekt indices
- `zoekt_tasks`: Queue of indexing tasks to be processed
- `zoekt_enabled_namespaces`: Configuration for which namespaces use Zoekt
- `zoekt_replicas`: Manages replica relationships for high availability
Database Model Relationships
The following diagram illustrates the relationships between the database models:
classDiagram
  class Node
  class Index
  class Repository
  class Task
  class EnabledNamespace
  class Replica
  Node "1" --> "*" Task : has_many tasks
  Node "1" --> "*" Index : has_many indices
  EnabledNamespace "1" --> "*" Replica : has_many replicas
  Replica "1" --> "*" Index : has_many indices
  Index "1" --> "*" Repository : has_many repositories
  Repository "1" --> "*" Task : has_many tasks
Here’s an example of the database structure for a namespace with multiple replicas and indices:
graph TD
  Namespace["Namespace (gitlab-org)"]
  Replica1["Replica #1"]
  Replica2["Replica #2"]
  Index1A["Index #1A (Node 1)"]
  Index1B["Index #1B (Node 2)"]
  Index2A["Index #2A (Node 3)"]
  Index2B["Index #2B (Node 4)"]
  Repo1["Repository: gitlab"]
  Repo2["Repository: omnibus-gitlab"]
  Repo3["Repository: gitaly"]
  Repo4["Repository: gitlab-runner"]
  Namespace --> Replica1
  Namespace --> Replica2
  Replica1 --> Index1A
  Replica1 --> Index1B
  Replica2 --> Index2A
  Replica2 --> Index2B
  Index1A --> Repo1
  Index1A --> Repo2
  Index1B --> Repo3
  Index1B --> Repo4
  Index2A --> Repo1
  Index2A --> Repo2
  Index2B --> Repo3
  Index2B --> Repo4
In this example:
- A namespace (`gitlab-org`) has Zoekt enabled through an `EnabledNamespace` record
- Two replica records are created for this namespace
- Each replica has two associated index records, each assigned to a different physical node
- Each index contains repositories for multiple projects within the namespace
- Tasks are created for each repository, tracking their indexing state on the respective nodes
This structure enables high availability and load distribution while maintaining a clear organization of the relationship between namespaces, indices, nodes, and repositories.
Current Development
Federated Search Using gRPC
A new gRPC-based federated search capability is being developed to enhance search performance across multiple Zoekt nodes. This feature replaces the previous HTTP-based search proxying by using a more efficient gRPC streaming implementation.
JWT Authentication for Rails → Webserver
In addition to the gRPC improvements, there are plans to implement JWT-based authentication for the Rails → Webserver communication flow. This would replace the current Basic Authentication approach with a more secure JWT implementation that mirrors the existing Indexer → Rails authentication. The goal is to provide a consistent, unified authentication strategy across all communication channels while leveraging GitLab’s existing security infrastructure.
The gRPC federated search offers several advantages:
- More efficient communication: gRPC uses HTTP/2 for transport, providing better performance than HTTP/1.1
- Streaming between Zoekt nodes: Results are streamed between Zoekt nodes as they’re found, allowing the coordinating node to stop requesting results from other nodes once it has gathered enough matches
- Reduced latency: Faster response times, especially for searches across many repositories, achieved through:
  - Concurrent searches across multiple nodes
  - Early termination once enough results are collected
  - More efficient binary protocol with HTTP/2
  - Processing results as they arrive rather than waiting for complete result sets
- Better resource management: More granular control over search processing limits
It’s important to note that while this implementation streams results between Zoekt nodes, the final results are still collected by the coordinating Zoekt node before being returned to Rails. The current implementation does not stream results to Rails. Instead, the performance benefits come from more efficient inter-node communication and the ability to stop searching once sufficient results are found, rather than exhaustively searching all repositories.
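The early-termination idea can be sketched with lazy iterators standing in for per-node result streams. This is a single-threaded simplification: the real design uses concurrent gRPC streams between nodes, and the `federated_search` function and round-robin consumption order are assumptions made for the example.

```python
_DONE = object()  # sentinel marking an exhausted stream


def federated_search(node_streams: list, max_results: int) -> list:
    """Consume per-node result streams in turn and stop once enough matches arrive.

    Each element of `node_streams` is an iterable that yields matches lazily,
    a stand-in for a streaming response from one Zoekt node.
    """
    collected = []
    iterators = [iter(stream) for stream in node_streams]
    while iterators and len(collected) < max_results:
        still_open = []
        for it in iterators:
            match = next(it, _DONE)
            if match is _DONE:
                continue  # this node has no more results
            collected.append(match)
            still_open.append(it)
            if len(collected) >= max_results:
                return collected  # enough matches: stop pulling from every node
        iterators = still_open
    return collected
```

The coordinator returns the complete list only after it stops reading, which mirrors the point above: results are streamed between nodes but delivered to the caller as a single collected response.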
Configuration Options
The GitLab Zoekt integration can be configured through:
- GitLab Admin Settings: Enable/disable indexing and searching, configure concurrent indexing tasks, set auto-deletion settings
- User Preferences: Enable/disable exact code search for individual users
- Zoekt Node Settings: Resource allocation, storage configuration, network settings
- Feature Flags: Control specific features or behaviors of the integration
Rollout Strategy
The rollout strategy has followed these steps:
- Initial availability for the `gitlab-org` group
- Improvements to monitoring and performance
- Expansion to select customers with high code search needs
- Implementation of sharding and replication for scalability
- Gradual rollout to more licensed groups
- Implementation of automatic balancing of shards
- Assessment of costs and performance for broader rollout
- Continued performance improvements
- Availability to the majority of licensed groups on GitLab.com
- General availability to all licensed groups on GitLab.com (pending)
For self-managed instances, administrators can enable Zoekt by installing the required components and enabling the feature in the admin area.
Monitoring and Maintenance
To monitor the health and performance of the Zoekt integration, GitLab provides:
- Admin UI: Shows indexing status, node health, and storage utilization
- Rake Tasks: Tools to check indexing status. For example, `gitlab:zoekt:info`
- Automated Management: Features to automatically delete offline nodes, manage watermark levels, and redistribute indices
- Logging: Detailed logging of indexing operations, search queries, and errors
- Metrics: Performance metrics for indexing and search operations
Watermark Management
The Zoekt integration implements a sophisticated watermark management system that operates at both the node level and index level to ensure efficient storage utilization while preventing resource exhaustion.
Node-Level Watermarks
Each Zoekt node has watermark thresholds based on the percentage of total disk space used:
- Low Watermark (60%): When disk usage exceeds this threshold, GitLab starts taking proactive measures to avoid reaching higher levels
- High Watermark (75%): Signals potential storage pressure and prioritizes rebalancing actions
- Critical Watermark (85%): May pause new indexing operations to prevent node overload while performing evictions
These node-level watermarks are used for overall node health monitoring and to make decisions about task assignment and index reallocation.
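The node-level thresholds above can be expressed as a small classification function. The percentages are taken from the text. The function name, the returned labels, and the choice to treat a threshold as exceeded only when usage is strictly greater are assumptions.

```python
# Thresholds from the text, checked from most to least severe.
NODE_WATERMARKS = (("critical", 0.85), ("high", 0.75), ("low", 0.60))


def node_watermark(used_bytes: int, total_bytes: int):
    """Return the highest watermark a node has exceeded, or None if below all of them."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    for name, threshold in NODE_WATERMARKS:
        if ratio > threshold:
            return name
    return None
```

A scheduler could use the returned label to decide, for example, whether a node should still receive new indexing tasks.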
Index-Level Watermarks
In addition to node-level watermarks, each index within a node has its own watermark levels based on the ratio of used storage to reserved storage:
- Ideal Storage Utilization (60%): Target level for optimal operation
- Low Watermark (70%): Triggers evaluation for potential rebalancing or increasing reserved storage allocation
- High Watermark (75%): Indicates the index is consuming more storage than expected and may prompt additional storage allocation if available
- Critical Watermark (80%): May trigger eviction processes for this specific index or, if storage is available, a significant increase in reserved storage allocation
Each index has an associated `watermark_level` enum state that reflects its current status:
- `healthy`: Operating within expected parameters
- `overprovisioned`: Using less than the ideal storage percentage (has more reserved space than needed)
- `low_watermark_exceeded`: Exceeded the low watermark threshold
- `high_watermark_exceeded`: Exceeded the high watermark threshold
- `critical_watermark_exceeded`: Exceeded the critical watermark threshold
These watermark levels directly influence how storage allocation adjustments are made. Indices in the `ready` state can either increase their reserved storage when hitting higher watermarks (if node storage is available) or decrease their reservations when overprovisioned.
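As an illustration, the mapping from an index's used-to-reserved ratio to its `watermark_level` could look like the sketch below. The enum names and percentages come from the text. The function itself and the exact boundary handling (strict comparisons, with `healthy` covering 60% to 70%) are assumptions.

```python
def index_watermark_level(used_bytes: int, reserved_bytes: int) -> str:
    """Classify an index by the ratio of used storage to reserved storage."""
    if reserved_bytes <= 0:
        raise ValueError("reserved_bytes must be positive")
    ratio = used_bytes / reserved_bytes
    if ratio > 0.80:
        return "critical_watermark_exceeded"
    if ratio > 0.75:
        return "high_watermark_exceeded"
    if ratio > 0.70:
        return "low_watermark_exceeded"
    if ratio < 0.60:
        return "overprovisioned"  # more space reserved than the ideal utilization needs
    return "healthy"
```

For example, an index using 30 GiB of a 100 GiB reservation would be classified as `overprovisioned` here and would be a candidate for shrinking its reservation.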
Storage Reservation Mechanism
The system uses a storage reservation mechanism where:
- Each index maintains a `reserved_storage_bytes` value representing its allocation
- The node tracks its total `usable_storage_bytes` and the sum of all index reservations
- When an index needs more storage, it attempts to claim additional bytes from the node’s unclaimed storage
- Indices in a `ready` state can both increase and decrease their reservations as needed
- Indices in initialization states can only increase their reservations until they’re fully indexed
This reservation system prevents overcommitment of storage while allowing flexible allocation based on actual needs. If a node approaches its critical watermark, indices may be marked for eviction to reclaim space.
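The reservation rules can be sketched as follows. The `usable_storage_bytes` and `reserved_storage_bytes` names and the `ready` state come from the text. The `StorageNode` class, the `adjust` method, the `"initializing"` state label used in the usage note, and the behavior of granting a partial increase when space runs short are hypothetical.

```python
class StorageNode:
    """Tracks per-index reservations against a node's usable storage (illustrative)."""

    def __init__(self, usable_storage_bytes: int):
        self.usable_storage_bytes = usable_storage_bytes
        self.reservations = {}  # index id -> reserved_storage_bytes

    @property
    def unclaimed_storage_bytes(self) -> int:
        return self.usable_storage_bytes - sum(self.reservations.values())

    def adjust(self, index_id: str, target_bytes: int, state: str) -> int:
        """Move an index's reservation toward `target_bytes`; return the new value."""
        current = self.reservations.get(index_id, 0)
        delta = target_bytes - current
        if delta < 0 and state != "ready":
            return current  # only ready indices may shrink their reservation
        if delta > 0:
            delta = min(delta, self.unclaimed_storage_bytes)  # never overcommit the node
        self.reservations[index_id] = current + delta
        return current + delta
```

With a node of 100 units, an index in an initialization state can grow to 60, a second index asking for 60 receives only the remaining 40, and the first index can give space back only once it is `ready`.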
The combination of node-level and index-level watermarks provides a comprehensive approach to storage management, ensuring efficient resource utilization while preventing resource exhaustion at both the node and index levels.
Conclusion
The Zoekt integration significantly improves GitLab’s code search capabilities by providing exact match and regular expression search modes. The architecture is designed to be scalable, self-managing, and resilient, with features like node self-registration, automatic sharding, and high availability through replication.
The unified binary approach simplifies deployment and maintenance, while the bidirectional communication between GitLab and Zoekt nodes enables efficient task distribution and status tracking.
Current development efforts focus on enhancing search performance through gRPC-based federated search and improving overall system scalability to support namespaces containing tens of thousands of projects or more.