Open #42

Improve performance of file indexing for large repositories

Alice Chen opened 22 months ago
3 comments

## Problem

When indexing repositories with more than 100,000 files, the indexing process becomes extremely slow and can take over 30 minutes.

## Expected behavior

Indexing should complete within 5 minutes for repositories of any size.

## Current behavior

The current implementation processes files sequentially and doesn't leverage parallel processing capabilities.

## Proposed solution

- Implement parallel file processing using a worker pool
- Add caching for frequently accessed file metadata
- Use incremental indexing for unchanged files (see the sketch below)
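
A rough sketch of what the incremental check could look like, assuming we persist each file's last-seen modification time between runs (the `IndexCache` alias and `needs_reindex` helper below are hypothetical, not an existing API):

```rust
// Hypothetical incremental-indexing check: only re-index files whose
// modification time differs from the value cached on the previous run.
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Cache persisted between runs: path -> mtime recorded at the last index.
type IndexCache = HashMap<PathBuf, SystemTime>;

fn needs_reindex(path: &Path, cache: &IndexCache) -> io::Result<bool> {
    let mtime = fs::metadata(path)?.modified()?;
    // A file is re-indexed when it is new or its mtime has changed.
    Ok(cache.get(path).map_or(true, |cached| *cached != mtime))
}
```

Unchanged files could then be skipped before they ever reach the worker pool.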

## Additional context

This is blocking adoption for enterprise customers with large monorepos.

Discussion

Bob Smith @bobsmith Mar 16, 2024, 09:15 AM
I've started working on the parallel file processing. Initial benchmarks show a 5x improvement on my test repository with 50k files.
```rust
// Example of the new parallel implementation (rayon)
use rayon::prelude::*;

let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(num_cpus::get())
    .build()
    .expect("failed to build indexing thread pool");
// Run the indexing on the dedicated pool rather than rayon's global one.
pool.install(|| files.par_iter().for_each(|file| index_file(file)));
```
Will have a draft PR ready by EOD.
👍 8 ❤️ 3
Carol Johnson @carolj Mar 17, 2024, 11:20 AM
@bobsmith Great progress! A few considerations:

1. We should add a configuration option for the thread pool size (a possible shape is sketched below)
2. Need to ensure memory usage stays reasonable during parallel processing
3. Consider adding a progress indicator for the UI

I can help with the caching implementation once the parallel processing is in.
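
For point 1, a minimal sketch of what the config option could look like, assuming a serde-based settings struct (`IndexingConfig`, `index_threads`, and the default helper are made-up names):

```rust
// Hypothetical config knob for the worker-pool size; struct and field
// names are illustrative, not the project's actual settings schema.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct IndexingConfig {
    /// Number of indexing worker threads; defaults to the logical CPU count.
    #[serde(default = "default_index_threads")]
    pub index_threads: usize,
}

fn default_index_threads() -> usize {
    num_cpus::get()
}
```

The pool could then be built with `rayon::ThreadPoolBuilder::new().num_threads(config.index_threads)`, so the default matches the current behavior.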
👍 4
Alice Chen @alicechen Mar 18, 2024, 03:30 PM
Thanks for the quick turnaround! @carolj good points about the config option. Let's make sure we also add proper logging so users can debug any issues. I've linked the draft MR (#156) to this issue.
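
A minimal sketch of what that logging could look like, using the `log` facade (the `index_all` wrapper and the crate choice are assumptions, not the existing code):

```rust
// Hypothetical sketch: log progress around the parallel indexing run so
// users can see where time is spent. Crate and function names are assumed.
use log::{debug, info};
use rayon::prelude::*;
use std::path::PathBuf;

fn index_all(files: &[PathBuf]) {
    info!("indexing {} files", files.len());
    files.par_iter().for_each(|file| {
        debug!("indexing {}", file.display());
        // index_file(file); // actual per-file work goes here
    });
    info!("indexing complete");
}
```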
👍 2