The most surprising thing about scm generator is that it doesn’t find clusters that already exist; it generates them from your code’s structure and history.

Let’s see it in action. Imagine you have a Git repository containing a few microservices. You’ve been developing them for a while, and the commits span several service directories.

Here’s a simplified git log snippet:

commit abcdef1234567890abcdef1234567890abcdef12
Author: Jane Doe <jane@example.com>
Date:   Thu Oct 26 10:00:00 2023 +0000

    feat: Add user authentication to auth-service

    Reviewed-by: John Smith <john@example.com>

commit fedcba0987654321fedcba0987654321fedcba09
Author: John Smith <john@example.com>
Date:   Thu Oct 26 09:30:00 2023 +0000

    fix: Update database schema for order-service

    Co-authored-by: Jane Doe <jane@example.com>

commit 1234567890abcdef1234567890abcdef12345678
Author: Jane Doe <jane@example.com>
Date:   Thu Oct 26 09:00:00 2023 +0000

    chore: Refactor shared utility functions

    This commit touches shared/.

You also have a CODEOWNERS file, which is crucial: it maps file paths to the teams or individuals responsible for them. scm generator needs this file to be present in your repository, typically at the root or in a .github/ or docs/ directory.

Example CODEOWNERS file:

# This is a comment
*       @general-team
/auth-service/ @auth-team
/order-service/ @order-team
/shared/  @platform-team
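
To make the path-to-owner mapping concrete, here is a minimal Python sketch that reads the example file above and resolves the owners of a path. It is not scm generator’s own parser: the helper names (parse_codeowners, matches, owners_for) and the file auth-service/login.py are made up for illustration, and the matcher is deliberately simplified to the two pattern shapes used in the example. The one rule it does take from real CODEOWNERS behaviour is that the last matching pattern wins.

```python
from typing import List, Tuple

Rule = Tuple[str, List[str]]

def parse_codeowners(text: str) -> List[Rule]:
    """Return (pattern, owners) rules in file order, skipping comments and blanks."""
    rules: List[Rule] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *owners = line.split()
        rules.append((pattern, owners))
    return rules

def matches(pattern: str, path: str) -> bool:
    """Simplified matcher (assumption): only '*' and '/dir/' style patterns."""
    if pattern == "*":
        return True
    if pattern.startswith("/") and pattern.endswith("/"):
        return path.startswith(pattern[1:])
    return path == pattern.lstrip("/")

def owners_for(path: str, rules: List[Rule]) -> List[str]:
    """Scan every rule and keep the last one that matches."""
    result: List[str] = []
    for pattern, owners in rules:
        if matches(pattern, path):
            result = owners
    return result

CODEOWNERS = """\
# This is a comment
*       @general-team
/auth-service/ @auth-team
/order-service/ @order-team
/shared/  @platform-team
"""

rules = parse_codeowners(CODEOWNERS)
print(owners_for("auth-service/login.py", rules))  # ['@auth-team']
print(owners_for("README.md", rules))              # ['@general-team']
```

The catch-all * rule sits first in the file so that the more specific directory rules below it can override it.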

Now, you run scm generator. It doesn’t just look at the files changed in the latest commit. Instead, it analyzes the entire commit history, the authors, the commit messages, and critically, how files are associated with owners over time. It uses this to infer relationships and group related development efforts.
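
As a rough idea of what reading the history involves, the sketch below pulls the author, subject line, and the Reviewed-by / Co-authored-by trailers out of default-format git log text like the snippet above. The Commit class and parse_git_log function are hypothetical names for this illustration, and nothing here is taken from scm generator’s actual implementation.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Commit:
    sha: str
    author: str = ""
    subject: str = ""
    trailers: Dict[str, str] = field(default_factory=dict)

# Only the two trailers that appear in the example log are recognised.
TRAILER = re.compile(r"^(Co-authored-by|Reviewed-by): (.+)$")

def parse_git_log(text: str) -> List[Commit]:
    commits: List[Commit] = []
    for raw in text.splitlines():
        line = raw.strip()
        if raw.startswith("commit "):
            # Header lines are not indented; message lines are.
            commits.append(Commit(sha=raw.split()[1]))
        elif not commits or not line:
            continue
        elif raw.startswith("Author:"):
            commits[-1].author = raw[len("Author:"):].strip()
        elif raw.startswith("Date:"):
            continue
        elif (m := TRAILER.match(line)):
            commits[-1].trailers[m.group(1)] = m.group(2)
        elif not commits[-1].subject:
            # The first non-trailer message line is treated as the subject.
            commits[-1].subject = line
    return commits

LOG = """\
commit abcdef1234567890abcdef1234567890abcdef12
Author: Jane Doe <jane@example.com>
Date:   Thu Oct 26 10:00:00 2023 +0000

    feat: Add user authentication to auth-service

    Reviewed-by: John Smith <john@example.com>
"""

parsed = parse_git_log(LOG)
print(parsed[0].author, "|", parsed[0].subject, "|", parsed[0].trailers)
```

Trailers matter here because they record people who were involved in a change without being its author.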

The core idea is that if multiple developers who are primarily responsible for auth-service are also frequently committing to files within /shared/, scm generator might infer that auth-service and some parts of shared are tightly coupled and should potentially be grouped. It’s not just about what files are changed, but who is changing them and in what context (i.e., which other files they are also touching in related commits).

The output of scm generator isn’t a static list of "clusters." It’s a dynamic set of recommendations, often presented as a graph or a ranked list of potential groupings. For instance, it might suggest:

  • Cluster 1: Authentication Services
    • auth-service/
    • shared/utils/auth.py
    • shared/config/auth_config.yaml
    • Reasoning: High co-occurrence of commits from @auth-team and @platform-team in these files.
  • Cluster 2: Order Processing
    • order-service/
    • shared/db/schema.sql
    • Reasoning: Frequent commits from @order-team and @platform-team to these components.

This process helps you identify logical groupings that might not be obvious from the directory structure alone. It can highlight dependencies, shared responsibilities, and areas where refactoring might improve modularity. In effect, scm generator uses the "social graph" of your codebase, meaning who works on what, to suggest structural groupings.

The most powerful aspect is how it can reveal implicit dependencies or areas of overlap that aren’t explicitly defined. For example, if a developer assigned to auth-service in CODEOWNERS also frequently modifies files in payment-service, scm generator might flag this as a potential area for investigation, suggesting that these two services might be more tightly coupled than initially thought, or that the ownership in CODEOWNERS might need refinement.

When scm generator sees commits that touch files A and B within a short time frame, and the authors of those commits are consistently associated with a particular team or set of responsibilities (as defined by CODEOWNERS over time), it raises a confidence score for grouping A and B. The approach is probabilistic rather than deterministic, which is why the output is presented as "suggestions" rather than absolute truths.
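
One simple way to picture such a confidence score is a Jaccard ratio: the number of commits that touch both files divided by the number that touch either. The sketch below assumes that formula purely for illustration. The source does not say which formula scm generator uses, the sketch ignores who authored each commit, and the function name pair_confidence and the file names are invented.

```python
from typing import Iterable, Set

def pair_confidence(commits: Iterable[Set[str]], a: str, b: str) -> float:
    """Commits touching both paths divided by commits touching either path."""
    both = 0
    either = 0
    for files in commits:
        has_a = a in files
        has_b = b in files
        both += has_a and has_b      # True counts as 1
        either += has_a or has_b
    return both / either if either else 0.0

# Each set holds the files changed by one (hypothetical) commit.
history = [
    {"auth-service/login.py", "shared/utils/auth.py"},
    {"auth-service/login.py", "shared/utils/auth.py"},
    {"auth-service/login.py"},
    {"order-service/orders.py", "shared/db/schema.sql"},
]

score = pair_confidence(history, "auth-service/login.py", "shared/utils/auth.py")
print(round(score, 2))  # 2 of the 3 commits touching either file touch both
```

A score near 1 means the two files almost never change apart; a score of 0 means they have never changed together.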

The underlying mechanism relies on graph-based algorithms. Each file can be thought of as a node. An edge is created between two files if they are frequently modified together in the same commit, or in a series of closely related commits by the same set of developers. The weight of each edge is determined by the frequency and the "ownership" of those commits. scm generator then applies community detection algorithms (like Louvain or Label Propagation) to find dense subgraphs within this weighted, undirected graph. These dense subgraphs represent the clusters.
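
To show what community detection does with such a graph, here is a small, deterministic variant of label propagation in plain Python. It is a sketch under assumptions, not the tool’s algorithm: the edge weights are invented, co-modification is treated as symmetric, nodes are visited in sorted order, and ties go to the alphabetically smallest label (textbook label propagation usually breaks ties at random).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def label_propagation(edges: Dict[Tuple[str, str], float],
                      max_rounds: int = 20) -> List[List[str]]:
    # Build a symmetric adjacency map from the weighted edges.
    adj: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (a, b), w in edges.items():
        adj[a][b] = adj[a].get(b, 0.0) + w
        adj[b][a] = adj[b].get(a, 0.0) + w

    labels = {node: node for node in adj}  # every file starts as its own community
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adj):
            votes: Dict[str, float] = defaultdict(float)
            for nbr, w in adj[node].items():
                votes[labels[nbr]] += w
            # Heaviest label wins; ties go to the smallest label.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break

    groups: Dict[str, List[str]] = defaultdict(list)
    for node, lab in labels.items():
        groups[lab].append(node)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical co-modification weights between the example paths.
edges = {
    ("auth-service/", "shared/utils/auth.py"): 5.0,
    ("auth-service/", "shared/config/auth_config.yaml"): 4.0,
    ("shared/utils/auth.py", "shared/config/auth_config.yaml"): 3.0,
    ("order-service/", "shared/db/schema.sql"): 6.0,
    ("shared/utils/auth.py", "shared/db/schema.sql"): 1.0,
}

clusters = label_propagation(edges)
for group in clusters:
    print(group)
```

With these weights the weak link between shared/utils/auth.py and shared/db/schema.sql is outvoted, so the result is one authentication group and one order-processing group.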

The actual generation of these clusters involves a process that can be computationally intensive, especially for large repositories. It iterates through the commit history, building a co-occurrence matrix of files modified within the same commit or within a defined time window. This matrix is then transformed into an adjacency list or matrix for a graph. The CODEOWNERS file acts as a powerful weighting factor and a way to assign "communities" or "roles" to developers, which in turn influences how file co-occurrences are interpreted. If commits involving files X and Y are consistently made by developers who are primarily owners of X (according to CODEOWNERS), it strengthens the link between X and Y for the purpose of clustering.
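
The matrix-building step can be sketched as follows. Everything specific here is an assumption made for illustration: commits by the same author within one hour are merged into a single session, each file pair in a session adds 1.0 to its weight, and the weight is doubled when the author’s team owns either file. The names Commit, owner_of, cooccurrence, window, and owner_boost are hypothetical, as are the sample commits.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, List, NamedTuple, Set, Tuple

class Commit(NamedTuple):
    timestamp: int   # seconds since the epoch
    author: str
    team: str        # the author's team, assumed to be resolved elsewhere
    files: Set[str]

def owner_of(path: str) -> str:
    """Hypothetical lookup mirroring the example CODEOWNERS rules."""
    for prefix, team in (("auth-service/", "@auth-team"),
                         ("order-service/", "@order-team"),
                         ("shared/", "@platform-team")):
        if path.startswith(prefix):
            return team
    return "@general-team"

def cooccurrence(commits: List[Commit],
                 owner: Callable[[str], str] = owner_of,
                 window: int = 3600,
                 owner_boost: float = 2.0) -> Dict[Tuple[str, str], float]:
    weights: Dict[Tuple[str, str], float] = defaultdict(float)
    # author -> (last timestamp, files touched so far, team)
    open_sessions: Dict[str, Tuple[int, Set[str], str]] = {}

    def flush(files: Set[str], team: str) -> None:
        # Each unordered pair in a session counts once, boosted when the
        # author's team owns either file.
        for a, b in combinations(sorted(files), 2):
            owned = team in (owner(a), owner(b))
            weights[(a, b)] += owner_boost if owned else 1.0

    for c in sorted(commits, key=lambda c: c.timestamp):
        prev = open_sessions.get(c.author)
        if prev is not None and c.timestamp - prev[0] <= window:
            # Same author, close in time: extend the current session.
            open_sessions[c.author] = (c.timestamp, prev[1] | c.files, c.team)
        else:
            if prev is not None:
                flush(prev[1], prev[2])
            open_sessions[c.author] = (c.timestamp, set(c.files), c.team)
    for _, files, team in open_sessions.values():
        flush(files, team)
    return dict(weights)

history = [
    Commit(0, "jane", "@auth-team", {"auth-service/login.py"}),
    Commit(600, "jane", "@auth-team", {"shared/utils/auth.py"}),
    Commit(90000, "jane", "@auth-team",
           {"auth-service/login.py", "shared/utils/auth.py"}),
    Commit(300, "john", "@order-team",
           {"order-service/orders.py", "shared/db/schema.sql"}),
    Commit(200000, "john", "@order-team",
           {"auth-service/login.py", "shared/utils/auth.py"}),
]

weights = cooccurrence(history)
for pair, weight in sorted(weights.items()):
    print(pair, weight)
```

Jane’s first two commits are ten minutes apart, so they count as one session linking the two files even though neither commit touched both. John’s change to the same pair still counts, but at the unboosted weight because his team owns neither file.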

The next logical step after discovering these clusters is to analyze the interactions between them.
