Open #42

Improve performance of file indexing for large repositories

Alice Chen opened 22 months ago
3 comments

## Problem

When indexing repositories with more than 100,000 files, the indexing process becomes extremely slow and can take over 30 minutes.

## Expected behavior

Indexing should complete within 5 minutes for repositories of any size.

## Current behavior

The current implementation processes files sequentially and doesn't leverage parallel processing capabilities.

## Proposed solution

- Implement parallel file processing using a worker pool
- Add caching for frequently accessed file metadata
- Use incremental indexing for unchanged files (see the sketch below)
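
A rough sketch of what the incremental check could look like, assuming we persist each file's last-seen modification time between runs (the `IndexCache` alias and `needs_reindex` helper below are hypothetical, not an existing API):

```rust
// Hypothetical incremental-indexing check: only re-index files whose
// modification time differs from the value cached on the previous run.
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Cache persisted between runs: path -> mtime recorded at the last index.
type IndexCache = HashMap<PathBuf, SystemTime>;

fn needs_reindex(path: &Path, cache: &IndexCache) -> io::Result<bool> {
    let mtime = fs::metadata(path)?.modified()?;
    // A file is re-indexed when it is new or its mtime has changed.
    Ok(cache.get(path).map_or(true, |cached| *cached != mtime))
}
```

Unchanged files could then be skipped before they ever reach the worker pool.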

## Additional context

This is blocking adoption for enterprise customers with large monorepos.

Discussion

Bob Smith @bobsmith Mar 16, 2024, 09:15 AM
I've started working on the parallel file processing. Initial benchmarks show a 5x improvement on my test repository with 50k files.
```rust
// Example of the new parallel implementation (rayon)
use rayon::prelude::*;

let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(num_cpus::get())
    .build()
    .expect("failed to build indexing thread pool");
// Run the indexing on the dedicated pool rather than rayon's global one.
pool.install(|| files.par_iter().for_each(|file| index_file(file)));
```
Will have a draft PR ready by EOD.
👍 8 ❤️ 3
Carol Johnson @carolj Mar 17, 2024, 11:20 AM
@bobsmith Great progress! A few considerations:

1. We should add a configuration option for the thread pool size (a possible shape is sketched below)
2. Need to ensure memory usage stays reasonable during parallel processing
3. Consider adding a progress indicator for the UI

I can help with the caching implementation once the parallel processing is in.
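
For point 1, a minimal sketch of what the config option could look like, assuming a serde-based settings struct (`IndexingConfig`, `index_threads`, and the default helper are made-up names):

```rust
// Hypothetical config knob for the worker-pool size; struct and field
// names are illustrative, not the project's actual settings schema.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct IndexingConfig {
    /// Number of indexing worker threads; defaults to the logical CPU count.
    #[serde(default = "default_index_threads")]
    pub index_threads: usize,
}

fn default_index_threads() -> usize {
    num_cpus::get()
}
```

The pool could then be built with `rayon::ThreadPoolBuilder::new().num_threads(config.index_threads)`, so the default matches the current behavior.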
👍 4
Alice Chen @alicechen Mar 18, 2024, 03:30 PM
Thanks for the quick turnaround! @carolj good points about the config option. Let's make sure we also add proper logging so users can debug any issues. I've linked the draft MR (#156) to this issue.
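
A minimal sketch of what that logging could look like, using the `log` facade (the `index_all` wrapper and the crate choice are assumptions, not the existing code):

```rust
// Hypothetical sketch: log progress around the parallel indexing run so
// users can see where time is spent. Crate and function names are assumed.
use log::{debug, info};
use rayon::prelude::*;
use std::path::PathBuf;

fn index_all(files: &[PathBuf]) {
    info!("indexing {} files", files.len());
    files.par_iter().for_each(|file| {
        debug!("indexing {}", file.display());
        // index_file(file); // actual per-file work goes here
    });
    info!("indexing complete");
}
```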
👍 2