JGI Seed Corn Funding Project Blog 2022/23: Emmanouil Tranos
Where do websites go to die? Well, fortunately they don’t always die even if their owners stop caring about them. Their ‘immortality’ can be attributed to organisations known as web archives, whose mission is to preserve online content. There are quite a few web archives today with different characteristics – e.g. focusing on specific topics vs. archiving the whole web – but the Internet Archive is the oldest one. Even if you are not familiar with it directly, you might have come across the Wayback Machine, which is a graphical user interface for accessing webpages archived by the Internet Archive.
Although it might be fun to check the aesthetics of a website from the internet’s early days – especially considering the current 1990s revival – one might question the utility of such archives. But some archived websites are more useful than others. Imagine accessing archived websites from businesses located in a specific neighbourhood and analysing the textual descriptions of the services and products these firms offer as they appear on their websites. Imagine being able to geolocate these websites by using information available in the text. Imagine doing this over time. And imagine doing this programmatically for a large array of websites. Well, our past research did that and, therefore, serves as a proof-of-concept for the utility of web archives in understanding the geography of economic activities. Our models successfully used a data set that was well curated by The British Library and the UK Web Archive to understand how a well-known tech cluster – Shoreditch in London – evolved over time. Importantly, we were able to describe the types of economic activities at a much higher level of detail than if we had used more traditional business data.
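To make the geolocation idea a bit more concrete, here is a minimal sketch of one way to pull UK postcodes out of website text. It is an illustration only and not the method used in our research: the regular expression is a deliberately simplified approximation of the UK postcode format, and the function names and sample text are made up for this example.

```python
import re
from collections import Counter

# Simplified, assumed approximation of the UK postcode format:
# outward code (e.g. "EC2A", "BS8"), optional space, inward code (e.g. "3DT").
# It will miss some special cases and can produce false positives.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s?([0-9][A-Z]{2})\b")


def extract_postcodes(text):
    """Return all postcode-like strings in the text, normalised to 'OUT INW'."""
    matches = POSTCODE_RE.findall(text.upper())
    return [f"{outward} {inward}" for outward, inward in matches]


def most_likely_postcode(text):
    """Return the most frequently mentioned postcode, or None if there is none."""
    counts = Counter(extract_postcodes(text))
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical website text, for illustration only.
SAMPLE_TEXT = (
    "Studio 4, 12 Example Street, London EC2A 3DT. "
    "Also in Bristol: bs8 1ss. Find us at EC2A3DT."
)
```

Picking the most frequently mentioned postcode is only one possible heuristic; a website listing several offices would need more careful handling.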
The JGI project provided the opportunity to start looking forward. Our proof-of-concept research was useful in validating the value of such a research trajectory and in revealing the evolving mechanisms of economic activities, but it only covered the 2000-2012 period. The next question is how to use this research framework in a current context.
Before I explain the challenge in doing this, let me tell you about the value of being able to do this. Our current understanding of the typologies of economic activities is based on a system called Standard Industrial Classification (SIC) codes. Briefly, businesses need to choose the SIC code that describes best what they do. Useful as they may be, SIC codes have not been updated since 2007 and, therefore, cannot capture new and evolving economic activities. In addition, there is built-in ambiguity in SIC codes as quite a few of them are defined as “… not elsewhere classified” or “… other than …”. Having a flexible system that can easily provide granular and up-to-date classifications of economic activities within a city or a region can be very useful to a wide range of organisations including local authorities, chambers of commerce and sector-specific support organisations.
The main challenge of building such a tool is data: finding, accessing, filtering and modelling relevant data. Our JGI seedcorn project, together with Rui Zhu and Giulia Occhini, allowed us to pave the path for such a research project. Thanks to the Common Crawl, another web archive which offers all its crawled data openly every two months, we have all the data we need. The problem is that we have much more data than we need, as the Common Crawl crawls and scrapes the whole web, providing a couple of hundred terabytes of data every two months. And that is in compressed format! So, merely accessing these data can be challenging, let alone building a workflow which can do all the steps I mentioned above and – importantly – keep on doing these steps every few months once new data dumps become available.
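To give a flavour of the filtering step, here is a small sketch of how one could cut an index of crawled pages down to the records of interest before downloading any page content. It is not our project code. It assumes that each index line has the form "urlkey timestamp JSON", which is how the Common Crawl URL index files are laid out to the best of my knowledge, and the function names, the ".co.uk" filter and the sample lines are all hypothetical.

```python
import json
from urllib.parse import urlsplit


def parse_index_line(line):
    """Split one index line into its URL key, timestamp and JSON payload."""
    _urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    record = json.loads(payload)
    record["timestamp"] = timestamp
    return record


def keep_record(record, suffix=".co.uk"):
    """Keep successfully fetched HTML pages whose host ends with the suffix."""
    host = urlsplit(record["url"]).hostname or ""
    return (
        host.endswith(suffix)
        and record.get("status") == "200"  # status assumed to be stored as a string
        and record.get("mime") == "text/html"
    )


def filter_index(lines, suffix=".co.uk"):
    """Lazily yield only the matching records, so the full index never sits in memory."""
    for line in lines:
        record = parse_index_line(line)
        if keep_record(record, suffix):
            yield record


# Made-up index lines, for illustration only.
SAMPLE_LINES = [
    'uk,co,example)/ 20230101000000 {"url": "https://example.co.uk/", "mime": "text/html", "status": "200"}',
    'com,example)/ 20230101000001 {"url": "https://example.com/", "mime": "text/html", "status": "200"}',
    'uk,co,example)/logo.png 20230101000002 {"url": "https://example.co.uk/logo.png", "mime": "image/png", "status": "200"}',
]
```

Because filter_index is a generator, the same function could in principle be pointed at a stream of lines read from a compressed index file, and re-run whenever a new data dump becomes available.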
Although we are nowhere close to completing such a big project, the JGI seedcorn funding allowed us to test some of the code and the data infrastructure needed to complete such a task. We are now developing funding proposals for such a large research programme. Although it is a risky endeavour, we are confident that we can find the needle in the haystack and build a dynamic system of typologies of economic activities that is based on open data and reproducible workflows and offers a higher level of detail than current official and traditional data.
Emmanouil Tranos
Professor of Quantitative Human Geography | Fellow at the Alan Turing Institute
e.tranos@bristol.ac.uk | @EmmanouilTranos | etranos.info | LinkedIn