Unlocking the Common Crawl to learn about innovative economic activities over space and time.

This project focuses on the development of an open-source tool to utilise web data from the Common Crawl for understanding the emergence and spread of innovative economic activities in the United Kingdom. A large archive of geo-coded ‘.uk’ webpages from 1996 to 2010 was created by the Internet Archive in collaboration with the British Library using methods from data science and computational linguistics (The UK Web Archive 2013). While those outputs are backward-looking and based on older, easily accessible data, we propose a computational pipeline to create newer data streams for continuously generating vital knowledge about local economic activities. By developing tools to mine and subsequently model these data, we address the lack of current, stream-like and sufficiently granular data needed to create insights about the evolution of new economic activities, their location and colocation in space (for instance, industrial clusters), as well as the capacity of businesses to innovate.

To achieve this purpose, we leverage data from the Common Crawl, a constantly expanding large-scale web archive that has provided data dumps of archived webpages, hundreds of terabytes at a time, on a monthly basis from 2008 to the present. We build computational tools to, first, access these data efficiently; second, filter webpages under the UK second-level domain for commercial activities (‘.co.uk’); and third, geolocate the data by applying Named Entity Recognition techniques to geographical references in the text. To geocode the data, we use both structured sources of information, such as postcodes, and less structured textual information, such as references to street and neighbourhood names. Within the project, we address two main computational challenges: (1) the size of these data and the consequent need for big data workflows such as Apache Spark, and (2) the monthly data frequency and the need to build time series to model the evolution of business dynamics and economic activities over space and time. Ultimately, we aim to deliver an accessible, easy-to-use, open-source computational infrastructure to efficiently mine and wrangle monthly Common Crawl data dumps, as well as a dataset of yearly corpora of commercial websites in the UK, including their geolocation, for researchers to utilise freely.
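As an illustration of the access step, the sketch below queries the Common Crawl CDX index API for captures of a ‘.co.uk’ host using Python’s requests library. The crawl label and the host are placeholders rather than the project’s actual configuration, and exhaustive suffix-wide filtering is better done with the columnar index (see the Spark sketch that follows).

```python
import json
import requests

# Query the Common Crawl CDX index for captures of one illustrative .co.uk
# host. The crawl label (CC-MAIN-2024-10) is a placeholder; available crawls
# are listed at https://index.commoncrawl.org/.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

resp = requests.get(
    CDX_ENDPOINT,
    params={
        "url": "*.bbc.co.uk",  # illustrative host; '*.' requests a domain match
        "output": "json",      # one JSON object per line
        "limit": "50",         # keep the example small
    },
    timeout=60,
)
resp.raise_for_status()

# Each line describes one capture: the URL plus the WARC file, byte offset,
# and record length needed to retrieve the page body later.
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record["url"], record["filename"], record["offset"], record["length"])
```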
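For filtering at the scale of a full monthly dump, the columnar URL index that Common Crawl publishes alongside each crawl can be processed with Apache Spark. The PySpark sketch below assumes a Spark session with S3 access already configured; the column names follow the publicly documented cc-index table schema, and the output path is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Sketch: filter the Common Crawl columnar URL index down to .co.uk pages.
# Assumes a Spark session with S3 credentials and the hadoop-aws connector.
spark = SparkSession.builder.appName("cc-couk-filter").getOrCreate()

index = spark.read.parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/")

couk = (
    index
    .where(F.col("crawl") == "CC-MAIN-2024-10")  # one monthly dump (placeholder label)
    .where(F.col("subset") == "warc")            # successfully fetched pages
    .where(F.col("url_host_registered_domain").endswith(".co.uk"))
    .select("url", "warc_filename", "warc_record_offset", "warc_record_length")
)

# Persist the WARC pointers needed to fetch page bodies in a later stage.
couk.write.mode("overwrite").parquet("s3a://my-bucket/couk-index/")  # illustrative path
```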
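Once an index query has yielded WARC pointers, individual page records can be retrieved with HTTP range requests rather than by downloading whole multi-gigabyte archive files. A minimal sketch, assuming the requests and warcio libraries:

```python
import io
import requests
from warcio.archiveiterator import ArchiveIterator

def fetch_warc_record(filename: str, offset: int, length: int) -> bytes:
    """Fetch one archived page body using the WARC pointers from the index."""
    url = f"https://data.commoncrawl.org/{filename}"
    # A byte-range request retrieves only the single gzipped WARC record.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(url, headers=headers, timeout=60)
    resp.raise_for_status()
    # warcio transparently decompresses the gzipped record.
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read()  # raw HTML payload
    return b""
```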
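For the geolocation step, the sketch below illustrates the two strategies described above: structured matching of UK postcodes with a regular expression, and Named Entity Recognition over the page text for place references. It assumes spaCy; the model name is one common choice rather than necessarily the project’s, and the postcode pattern is simplified (the full specification has more edge cases).

```python
import re
import spacy

# One common general-purpose English pipeline; any spaCy model with an NER
# component would work here.
nlp = spacy.load("en_core_web_sm")

# Simplified UK postcode pattern, e.g. "BS1 5HG" or "SW1A 1AA".
POSTCODE = re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}\b")

def extract_geo_references(text: str) -> dict:
    """Pull postcodes and place-name entities out of webpage text."""
    doc = nlp(text)
    places = [
        ent.text
        for ent in doc.ents
        if ent.label_ in {"GPE", "LOC", "FAC"}  # cities/regions, locations, streets
    ]
    return {
        "postcodes": POSTCODE.findall(text.upper()),
        "place_names": places,
    }

print(extract_geo_references(
    "Visit our showroom at 12 Park Street, Bristol BS1 5HG, near Brandon Hill."
))
```

The extracted postcodes can then be resolved to coordinates with a standard lookup table, while the less structured place names require fuzzier matching against gazetteers.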

References

The UK Web Archive (2013), ‘JISC UK Web Domain Dataset (1996–2010)’.

URL: http://data.webarchive.org.uk/opendata/ukwa.ds.2/


Giulia Occhini is a final-year PhD candidate at The Alan Turing Institute and the University of Bristol. Alongside her studies, she currently works part-time as a Research Data Scientist at the University of Bristol. Giulia’s research focuses on building Machine Learning-informed methodological pipelines for economic research. She is particularly interested in applying such pipelines to the study of spatial and demographic inequalities in the context of the ‘intangible economy’.