Centre for Science and Technology Studies Centre for Science and Technology Studies 2333AL Leiden Zuid Holland 31715273909
  • Blog
  • Using the Leiden algorithm to find well-connected clusters in networks

Blog archive

This is the archive of our old blog. Please visit our new Leiden Madtrics blog.

Using the Leiden algorithm to find well-connected clusters in networks



Introduction

An exciting development in the field of quantitative science studies is the use of algorithmic clustering approaches to construct article-level classifications based on citation networks. Until recently, most classifications were based on categorizing journals rather than individual articles. This is understandable given the substantial challenges of classifying millions of articles. At CWTS, we now routinely work with article-level classifications. We have dedicated quite some time developing clustering algorithms for creating these classifications. These algorithms have an impact beyond our own research field and are of interest to many network scientists.

You may be surprised to learn that one the most famous clustering algorithms—commonly known as the Louvain algorithm—actually has a major flaw: the clusters it finds can be arbitrarily badly connected. For instance, the Louvain algorithm may group articles together in a cluster even though some of the articles have no citation links with the other articles in the cluster. We here briefly report on a new algorithm that we have developed, which we call the Leiden algorithm. This algorithm guarantees to find well-connected clusters. Even better, it does so much faster than the Louvain algorithm!

Louvain algorithm

The Louvain algorithm is a simple and elegant algorithm that is more efficient than many other network clustering algorithms. When it was introduced in 2008, it was applied to a huge network of more than one hundred million nodes and one billion links. It ranked among the best performing clustering algorithms in comparative studies in 2009 and 2016. The Louvain algorithm searches for high-quality clusters by moving individual nodes—for instance individual articles in a citation network—from one cluster to another in such a way that the quality of the clusters is improved as much as possible. When clusters cannot be improved further by moving individual nodes, the Louvain algorithm does something ingenious: it aggregates the network, so that each cluster in the original network becomes a node in the aggregated network. In the aggregated network, the algorithm then starts to move individual nodes from one cluster to another. By repeating the node movement and aggregation, the Louvain algorithm is able to find high-quality clusters in a short time. Unfortunately, however, this approach also leads to an important flaw, which seems to have gone unnoticed during the past decade.

image1

Sometimes, a node functions as a middle man or a bridge for the rest of its cluster. Without that crucial node, the cluster would not be connected anymore. Since the Louvain algorithm keeps moving nodes from one cluster to another, at some point it may move the crucial node to a different cluster, thereby breaking the connectivity of the original cluster. Perhaps surprisingly, the Louvain algorithm cannot fix this shattered connectivity. The complete breaking of connectivity is the worst thing that can happen to a cluster. It is the most extreme example of a more general problem of the Louvain algorithm: the algorithm can produce clusters that are badly connected and that should have been split up into multiple clusters.

Leiden algorithm

We fix this problem of the Louvain algorithm in our new Leiden algorithm. Similarly to the Smart Local Moving algorithm that was previously developed at CWTS, the Leiden algorithm is able to split clusters instead of only merging them, as is done by the Louvain algorithm. By splitting clusters in a specific way, the Leiden algorithm guarantees that clusters are well-connected. Moreover, the algorithm guarantees more than this: if we run the algorithm repeatedly, we eventually obtain clusters that are subset optimal. This means that it is impossible to improve the quality of the clusters by moving one or more nodes from one cluster to another. This is a strong property of the Leiden algorithm. It states that the clusters it finds are not too far from optimal. Finally, rather than continuously checking for all nodes in a network whether they can be moved to a different cluster, as is done in the Louvain algorithm, the Leiden algorithm performs this check only for so-called unstable nodes. As a result, the Leiden algorithm does not only find higher quality clusters than the Louvain algorithm, it also does so in much less time.

At CWTS, we use the Leiden algorithm to cluster large citation networks. The Louvain algorithm needs more than half an hour to find clusters in a network of about 10 million articles and 200 million citation links. The Leiden algorithm needs only a little over three minutes to cluster this network. Moreover, when run repeatedly, the Leiden algorithm easily finds higher quality clusters than the Louvain algorithm. 

image2

Try it yourself!

We expect the Leiden algorithm to prove useful not only to us at CWTS, but also to many other researchers in both quantitative science studies and network science. During the past decade, thousands of researchers have published papers in which they use the Louvain algorithm. In the future, these researchers could employ the Leiden algorithm.

Together with the paper introducing the Leiden algorithm, we have also released the Java source code of the algorithm on GitHub. We have made quite some effort to ensure that the algorithm is easy to use for everyone. For the more technically inclined, we have created technical documentation and code comments. Grab the source code, run it on your own network data, and let us know what you think of it!


About Vincent Traag

Senior researcher and bibliometric consultant. Vincent's research focuses on complex networks and social dynamics. He holds a Master in sociology and a PhD in applied mathematics, and tries to combine the two in his work.

About Ludo Waltman

Ludo Waltman is professor of Quantitative Science Studies and scientific director at the Centre for Science and Technology Studies (CWTS) at Leiden University. He is a coordinator of the Information & Openness focal area and a member of the Evaluation & Culture focal area. Ludo is co-chair of the Research on Research Institute (RoRI).

About Nees Jan van Eck

Senior researcher, head of data science, and coordinator of the Information & Openness focal area. Nees Jan's research focuses on infrastructures and the development of tools and algorithms to support research assessment, science policy, and scholarly communication.


13 comments

Mandatory fields
  • David Fajardo July 18th, 2023 1:47 pm
    Just a question regarding the names of the algorithms: is the VOS clustering algorithm the same or related to the Leiden methods?Also, can you suggest me some literature to cite this methods?
    Thanks in advance!
    David
    Reply
  • Vincent Traag December 15th, 2020 10:57 am
    There is little known about the theoretical complexity of the Louvain and Leiden algorithms. From numerical experiments, both seem to run in near-linear time in the number of edges. However, the constant factor of the Louvain algorithm is larger than the constant factor of the Leiden algorithm, i.e. it is slower overall.
    For some more challenging graphs, the runtime of the Louvain algorithm seriously degrades however (cf. Figure 7 in our paper). The runtime of the Louvain algorithm then almost becomes quadratic (in the number of nodes), while the runtime of the Leiden algorithm remains near-linear.
    Reply
    • Aj HQ December 15th, 2020 2:50 pm
      what is the way to know the theoretical complexity then?
      Reply
  • Aj HQ December 14th, 2020 3:09 pm
    please provide the time complexity of the algorithm.
    Reply
    • Vincent Traag December 15th, 2020 11:31 am
      There is little known about the theoretical complexity of the Louvain and Leiden algorithms. From numerical experiments, both seem to run in near-linear time in the number of edges. However, the constant factor of the Louvain algorithm is larger than the constant factor of the Leiden algorithm, i.e. it is slower overall.
      For some more challenging graphs, the runtime of the Louvain algorithm seriously degrades however (cf. Figure 7 in our paper). The runtime of the Louvain algorithm then almost becomes quadratic (in the number of nodes), while the runtime of the Leiden algorithm remains near-linear.
      Reply
      • Vlad Oles April 17th, 2023 11:28 pm
        Your last sentence suggests that Leiden algorithm runs in near-linear time wrt the number of nodes, but the 2nd sentence talks about the number of edges instead. Could you please clarify which one is correct?
        Also, do you have any references to the corresponding numerical experiments?
        Thanks!
        Reply
  • Martijn Gösgens February 18th, 2020 8:50 am
    Thank you for making this publicly available! What can you say about the class of functions that this algorithm is able to optimize?
    As far as I understand it, the clustering that results from a stable iteration of Louvain for any function that takes a clustering and graph as input, is guaranteed to be locally optimal w.r.t. merging two clusters and moving a single node.
    Does Leiden need additional constraints on the function to guarantee uniform density?
    Reply
    • Vincent Traag March 4th, 2020 2:58 pm
      Good question!
      It is not clear exactly for what class of objective functions the Leiden algorithm would guarantee uniform density (asymptotically). This is because it is not clear under what conditions Theorem 1 (in the appendix) holds. This theorem ensures that an optimal partition is reachable by a non-decreasing move sequence (starting from a singleton partition). It is of course clear that it holds for modularity and CPM. Instead of using a non-decreasing move sequence, it would be possible to use a move sequence that also allows decreasing moves (e.g. as done in simulated annealing), in which case (trivially) the optimal partition can be reached with some probability (for any objective function). What impact that would have on the speed and quality of the resulting partition is not immediately clear. In other words, something that could be further investigated!
      Reply
  • Avanti Shrikumar August 27th, 2019 7:59 pm
    Thank you for releasing this! Could you discuss the pros & cons (speed, scalability, etc) of the Python/C++ implementation at https://github.com/vtraag/leidenalg vs the Java implementation at https://github.com/CWTSLeiden/networkanalysis?
    Reply
    • Vincent Traag August 28th, 2019 8:35 am
      The Java implementation is faster, but the Python/C++ implementation is more flexible. The Java implementation supports only undirected networks and only optimizes CPM or Modularity. The Python package supports also directed networks (although this is unimportant for CPM), more partition quality functions, networks with negative links and multilayer networks, including temporal community detection (cf. https://doi.org/10.1126/science.1184819). As a consequence it is slower than the Java implementation.
      I plan to make an optimised and specialised implementation of the Leiden algorithm in C in igraph, which should be as fast as at the Java implementation. This should then also result in proper support for the Leiden algorithm in R.
      Reply
      • Avanti Shrikumar August 28th, 2019 6:13 pm
        Thank you. Does the Java implementation also require the network to be read into memory?
        Further, for a network with 20M-100M edges, do you expect the time spent reading in the input file to be a significant bottleneck?
        Reply
        • Vincent Traag September 10th, 2019 9:15 pm
          Yes, both implementations require the network to be read in memory. Reading the input file should not be a large bottleneck. You may provide the Java implementation with a pre-sorted input file, so that it does not need to sort it, which makes reading the input a bit faster. Reading the input file in Python is much more flexible. Please see the information provided in documentation and help files for more details.
          Reply
  • Cristian Colliander October 26th, 2018 1:33 pm
    Superb, many thanks for making this publicly available.
    Reply
Share on:
Subscribe to:
Build on Applepie CMS by Waltman Development