As of 2012, America’s small businesses (any business with fewer than 500 employees) were responsible for a staggering 64% of net new private-sector jobs and employed nearly half of America’s workforce (1).
Yet the health of small business in the U.S. is hampered by a limited understanding of this rich and varied ecosystem. The root cause: a severe lack of reliable, rich, and regular data.
The Common Crawl is an open repository of web crawl data that can be accessed and analyzed by anyone. By processing the entire Common Crawl and applying a pipeline of processing steps to the data, we are able to turn the Web into a dataset of businesses. Read more about the Common Crawl and our process below.
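As a rough illustration of what processing the Common Crawl involves, the sketch below (Python, assuming the warcio library) iterates over a single WARC segment and keeps pages matching a simple keyword heuristic. The keyword filter and the file name are hypothetical placeholders for illustration only, not the project's actual pipeline.

```python
# A minimal sketch: scan one Common Crawl WARC file and yield pages
# whose text contains business-like phrases. The keyword heuristic is
# a hypothetical stand-in for the project's real extraction steps.
from warcio.archiveiterator import ArchiveIterator

BUSINESS_HINTS = ("hours of operation", "contact us", "our services")

def candidate_business_pages(warc_path):
    """Yield (url, html) pairs for response records that look like business sites."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request and metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="replace")
            if any(hint in html.lower() for hint in BUSINESS_HINTS):
                yield url, html

if __name__ == "__main__":
    # WARC segment listings are published at https://commoncrawl.org;
    # the file name here is a placeholder.
    for url, _ in candidate_business_pages("CC-MAIN-example.warc.gz"):
        print(url)
```

At Common Crawl scale this loop would run per-segment across a distributed cluster; the single-file version above only shows the record-level logic.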
This project was conducted as part of the W210 Capstone Course in the MIDS program at UC Berkeley. The team: