Collaborative Research: SHF: Medium: A Scalable Graph-Based Approach to Clustering

Information

  • NSF Award
  • 2403237
Owner
  • Award Id
    2403237
  • Award Effective Date
    10/1/2024
  • Award Expiration Date
    9/30/2028
  • Award Amount
    $ 364,000.00
  • Award Instrument
    Standard Grant

Clustering algorithms are among the most important modern tools for understanding data. Given data on various entities, clustering algorithms group entities into sets or "clusters" such that similar entities are likely to end up in the same cluster while dissimilar entities tend to end up in different clusters. For example, clustering algorithms can be used to group images together according to the contents of the image. However, modern datasets are so large that many existing clustering algorithms cannot feasibly be used. This project aims to systematically address this situation by way of new clustering algorithms that scale to massive datasets with billions of entities. Clustering is widely used by scientists, companies, and government agencies. The toolkit developed in the project will be open-sourced and will make scalable, high-performance clustering more broadly accessible to scientists and practitioners by improving the efficiency and programming productivity of their clustering tasks. Results from the project will be integrated into courses that the investigators teach, and the researchers will recruit undergraduate students to participate in the project.

This three-institution collaborative project investigates a new approach for clustering pointsets by constructing sparse graphs that preserve relevant properties of the pointset. By carefully leveraging high-quality near-linear work graph clustering algorithms, very large datasets can be clustered with high accuracy in time that is nearly linear in the number of objects in the input. Particular attention will be paid to new algorithms for graph clustering and construction that utilize structure observed in practice, exploit parallelism, and enable dynamism with provable accuracy guarantees. A major contribution of the project will be an end-to-end clustering toolkit for graphs and pointsets that enables clustering to be scaled to inputs with billions of objects. The investigators will collaborate through regular remote meetings and seminars, student visits, joint publications, and annual in-person workshops.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
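
As a rough illustration of the general idea described in the abstract (build a sparse graph over a pointset, then cluster the graph), the following minimal Python sketch may help. It is not the project's toolkit or algorithm: the function names `knn_graph` and `connected_components`, the choice of a k-nearest-neighbor graph, and the use of connected components as the graph clustering step are all assumptions made for illustration. The brute-force neighbor search is quadratic in the number of points, so it shows the pipeline shape only, not the scalability the project targets.

```python
from math import dist  # Euclidean distance, Python 3.8+


def knn_graph(points, k):
    """Return undirected k-nearest-neighbor edges (hypothetical helper).

    Brute force, O(n^2 log n); for illustration only.
    """
    edges = set()
    for i, p in enumerate(points):
        # Rank all other points by distance to p and keep the k closest.
        others = (j for j in range(len(points)) if j != i)
        nearest = sorted(others, key=lambda j: dist(p, points[j]))[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return edges


def connected_components(n, edges):
    """Label each of n vertices by its component using union-find.

    Stands in for a graph clustering step (an assumption, not the
    project's method).
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(i) for i in range(n)]


# Two well-separated groups of three points each (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = connected_components(len(points), knn_graph(points, k=2))
```

With this toy input, points in the same group receive the same label and the two groups receive different labels. The graph has only a handful of edges per point, which is the property that lets near-linear graph algorithms be applied once the graph is built.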

  • Program Officer
    Almadena Chtchelkanova, achtchel@nsf.gov, 7032927498
  • Min Amd Letter Date
    7/12/2024
  • Max Amd Letter Date
    7/12/2024
  • ARRA Amount

Institutions

  • Name
    Massachusetts Institute of Technology
  • City
    CAMBRIDGE
  • State
    MA
  • Country
    United States
  • Address
    77 MASSACHUSETTS AVE
  • Postal Code
    02139-4301
  • Phone Number
    6172531000

Investigators

  • First Name
    Julian
  • Last Name
    Shun
  • Email Address
    jshun@mit.edu
  • Start Date
    7/12/2024 12:00:00 AM

Program Element

  • Text
    Software & Hardware Foundation
  • Code
    779800

Program Reference

  • Text
    MEDIUM PROJECT
  • Code
    7924
  • Text
    HIGH-PERFORMANCE COMPUTING
  • Code
    7942