Open #42
Improve performance of file indexing for large repositories
## Problem
When indexing repositories with more than 100,000 files, the indexing process becomes extremely slow and can take over 30 minutes.
## Expected behavior
Indexing should complete within 5 minutes for repositories of any size.
## Current behavior
The current implementation processes files sequentially and doesn't leverage parallel processing capabilities.
## Proposed solution
- Implement parallel file processing using a worker pool
- Add caching for frequently accessed file metadata
- Use incremental indexing for unchanged files (see the sketch after this list)
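A rough sketch of how the incremental check could work, assuming metadata from the previous run is kept in a map keyed by path; the `FileMeta` struct and `needs_reindex` helper are illustrative names, not existing indexer code:

```rust
use std::collections::HashMap;
use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Cached metadata from the previous indexing run (illustrative).
#[derive(PartialEq)]
struct FileMeta {
    len: u64,
    modified: SystemTime,
}

/// Returns true if the file changed since the last run and needs re-indexing.
fn needs_reindex(path: &Path, cache: &HashMap<PathBuf, FileMeta>) -> bool {
    let current = match fs::metadata(path) {
        Ok(m) => FileMeta {
            len: m.len(),
            modified: m.modified().unwrap_or(SystemTime::UNIX_EPOCH),
        },
        Err(_) => return true, // metadata unreadable: index defensively
    };
    match cache.get(path) {
        Some(prev) => *prev != current, // re-index only if size or mtime changed
        None => true,                   // file not seen in the previous run
    }
}
```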
## Additional context
This is blocking adoption for enterprise customers with large monorepos.
Discussion
Bob Smith @bobsmith • Mar 16, 2024, 09:15 AM
I've started working on the parallel file processing. Initial benchmarks show a 5x improvement on my test repository with 50k files.
```rust
// Example of the new parallel implementation (rayon worker pool)
use rayon::prelude::*;

let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(num_cpus::get())
    .build()
    .expect("failed to build indexing thread pool");
// Run inside the pool so the configured thread count is actually honored.
pool.install(|| files.par_iter().for_each(|file| index_file(file)));
```
Will have a draft PR ready by EOD.
Carol Johnson @carolj • Mar 17, 2024, 11:20 AM
@bobsmith Great progress! A few considerations:
1. We should add a configuration option for the thread pool size
2. Need to ensure memory usage stays reasonable during parallel processing
3. Consider adding a progress indicator for the UI
I can help with the caching implementation once the parallel processing is in.
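For point 1, a minimal sketch of what a configurable pool size could look like; the `IndexerConfig` struct and the zero-means-auto default are illustrative assumptions, not existing code:

```rust
// Hypothetical config: 0 (the default) means "use all available cores".
#[derive(Default)]
struct IndexerConfig {
    index_threads: usize,
}

fn build_pool(config: &IndexerConfig) -> rayon::ThreadPool {
    let threads = if config.index_threads == 0 {
        num_cpus::get()
    } else {
        config.index_threads
    };
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build()
        .expect("failed to build indexing thread pool")
}
```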
Alice Chen @alicechen • Mar 18, 2024, 03:30 PM
Thanks for the quick turnaround!
@carolj good points about the config option. Let's make sure we also add proper logging so users can debug any issues.
I've linked the draft MR (#156) to this issue.