Collaborative Research: NeTS: Medium: Scalable Crawling of the Web as Experienced by Users

Information

  • NSF Award
  • 2403432
Owner
  • Award Id
    2403432
  • Award Effective Date
    10/1/2024
  • Award Expiration Date
    9/30/2028
  • Award Amount
    $ 400,000.00
  • Award Instrument
    Continuing Grant

Many systems that are central to modern society, such as web search engines, smart assistants, generative AI, and web archives, rely on the ability to automatically load (a.k.a. "crawl") large numbers of web pages quickly. However, the "web crawler" software traditionally used to crawl the web is now insufficient for three reasons. First, many pages require users to be logged in. As a result, a traditional crawler sees only the login page and is blind to content that actual users would see. Second, the number of web pages is ever-increasing, and interactive pages and web applications have significantly increased the amount of computation necessary for a client to identify all the resources on a typical page. In combination, these factors make it significantly more expensive than before either to crawl a large corpus of sites or to recrawl pages frequently to capture changes. Third, many pages are dynamic or interactive, and many use embedded third-party services, such as maps, social media widgets, and language translation, that are either hampered or fail to work on crawled page copies. As a result, systems and studies that rely on content crawled from the web lack visibility into a large portion of the web, are unable to keep up with the rate at which they need to crawl pages, and end up replaying crawled pages with poor fidelity.

To address these challenges, this project will develop Sprinter, a modern web crawler capable of capturing the web and its rich services as seen and experienced by users. Sprinter will crawl any page such that the content crawled is representative of what users see on the page. Its overheads will grow sub-linearly with the number of pages and the frequency of monitoring. Any page crawled using Sprinter will be renderable in a manner that closely approximates the original page, both visually and functionally. To develop Sprinter, the project will make research contributions along three dimensions. First, the project will leverage widespread support for authentication via single sign-on (SSO) providers such as Google and Facebook, and will generate representative browsing profiles from privacy-preserving network traces. Second, to make Sprinter’s crawling efficient, the project will devise techniques to reuse application computations across similar pages and to identify a small representative subset of pages that Sprinter needs to measure at high frequency. Lastly, to enable high-fidelity replay of the crawled copy of a page, the project will develop methods to crawl all of the page’s resources that will be needed to serve any common load of that copy. A major broader impact lies in the research and use cases that Sprinter enables for the community. Further, Sprinter and the results of its crawls will be made available to other researchers.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

  • Program Officer
    Ann Von Lehmen, avonlehm@nsf.gov, 7032924756
  • Min Amd Letter Date
    7/15/2024
  • Max Amd Letter Date
    7/15/2024
  • ARRA Amount

Institutions

  • Name
    University of Southern California
  • City
    LOS ANGELES
  • State
    CA
  • Country
    United States
  • Address
    3720 S FLOWER ST FL 3
  • Postal Code
    90033
  • Phone Number
    2137407762

Investigators

  • First Name
    Calvin
  • Last Name
    Ardi
  • Email Address
    calvin@isi.edu
  • Start Date
    7/15/2024 12:00:00 AM
  • First Name
    Harsha
  • Last Name
    Madhyastha
  • Email Address
    madhyast@usc.edu
  • Start Date
    7/15/2024 12:00:00 AM

Program Element

  • Text
    Networking Technology and Syst
  • Code
    736300

Program Reference

  • Text
    MEDIUM PROJECT
  • Code
    7924