Mapping Historic Hong Kong 

JGI Seed Corn Funding Project Blog 2023/24: Thomas Larkin

Introduction

Mapping Historic Hong Kong (MHHK) is a pilot project designed to spatially organise visual and written archival documents into an interactive platform that maps Hong Kong’s colonial development. 

Map plan for city of Victoria in Hong Kong
Figure 1: Source Map for Hong Kong, 1905. Plan of the City of Victoria, Hong Kong (Corrected for 1905), Directory and Chronicle of China, China Mail Office, Hong Kong, 1905

Project aims

This first stage of the MHHK project aimed to showcase how GIS mapping can reshape the way scholars and the public access Hong Kong’s spatial history and heritage. Our goal for this seed project was to demonstrate how a mapping platform could serve as a ‘container’ for a wide range of spatial and archival historical data, while providing users with an intuitive way to filter information by time and location. We prioritised developing the project’s capacity to link users to partner archives such as the Historical Photographs of China (HPC) project, and to incorporate flexible data formats.

Page from registry files from the Hong Kong public records office
Figure 2: Example of Land Registry files collected by Alex Cheung from the Hong Kong Public Records Office

For our proof of concept we proposed to (1) build a suite of base maps from archival materials, geographically corrected and overlaid using QGIS; (2) identify amongst these maps a viable case-study year to demonstrate the project’s intended capabilities; and (3) conduct sufficient archival research to populate one map with images and registry data so as to give a sense of the project’s functions.

What was achieved

The project progressed in three stages in partnership with Mark McLaren and the Research IT team at the University of Bristol, and with a Research Assistant, Alex Cheung, affiliated with Bristol’s Hong Kong History Centre.

Figure 3: The Public Record Office, where Alex Cheung collected primary documents

1) Thomas Larkin, as project Co-PI, developed a suite of four maps in QGIS which formed the base of the project. This process included scouting viable archival maps (Figure 1) that captured snapshots of Hong Kong’s growth between 1842 and 1997; georeferencing (‘warping’) these maps in QGIS so that they shared the same geographic coordinates and could be precisely overlaid (a minimal example of this step appears after this list); and then converting the information in the archival maps into assets (roads, lots, parks, landmasses, etc.) that corresponded with a database. Larkin then worked with the HPC archivists to isolate a sample year (1905) that had sufficient photographs on file to represent the project’s visual and cross-archival capabilities. We ultimately selected images from within a date range (1905-1915) for this concept phase of the project.

2) With a sample year isolated, Alex Cheung, the project RA, collected archived land registries (Figure 2) from Hong Kong’s Public Records Office (Figure 3) in May 2024. With Cheung’s contributions, we were able to add property data (memorial numbers, lot numbers, lot holders, lot types, etc.) for roughly half the assets in our 1905 map.

Prototype map visual for 1905 City of Victoria, Hong Kong
Figure 4: Mark McLaren’s first prototype build for the proof-of-concept platform
Second prototype map visual for 1905 City of Victoria, Hong Kong
Figure 5: Mark McLaren’s second build for the proof-of-concept platform
Current build for mapping the 1905 City of Victoria in Hong Kong. Includes some street names and, on the left-hand side, photos of Nethersole Hospital and the London Missionary Society
Figure 6: Mark McLaren’s current build for the proof-of-concept platform

3) Once enough images and land registry data were recorded for the relevant 1905 assets, the project was turned over to Mark McLaren of Research IT. McLaren worked through several builds as he received updated maps and data (Figure 4), at one point even experimenting with using AI to locate images (Figure 5), to some effect. The most recent build (Figure 6) offers an effective example of what our platform will be able to do. Users can filter the maps of Hong Kong by date (the shorelines overlaid on a contemporary map tell a particularly effective narrative of historical change), click the assets in the 1905 map to pull up registry data and images associated with the relevant location, and click the image thumbnails to be taken to the HPC archive, where more detailed information can be found. The demonstrator includes a collapsible legend, asset filters, a date slider, and a sidebar that is populated with archival data and image thumbnails when a relevant lot is selected on the map.
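As a technical aside on the warping step flagged in stage 1 above, the snippet below is a minimal sketch of georeferencing a scanned map with Python and GDAL, the open-source library that underpins QGIS. The filenames and ground control points are hypothetical placeholders rather than the project’s actual values: each control point ties a pixel on the scan to its known modern longitude and latitude, and the warp resamples the scan into those shared coordinates so it can be overlaid on the other base maps.

```python
from osgeo import gdal

# Hypothetical ground control points matching features on the scanned
# 1905 map to modern WGS84 coordinates: GCP(lon, lat, elevation, pixel, line)
gcps = [
    gdal.GCP(114.1544, 22.2861, 0, 512, 340),
    gdal.GCP(114.1702, 22.2855, 0, 2890, 355),
    gdal.GCP(114.1628, 22.2788, 0, 1730, 1480),
]

# Attach the control points to the scan, then warp it so it lines up
# precisely with the other base maps (filenames are placeholders)
src = gdal.Open("victoria_1905_scan.tif")
with_gcps = gdal.Translate("/vsimem/victoria_1905_gcps.tif", src,
                           GCPs=gcps, outputSRS="EPSG:4326")
gdal.Warp("victoria_1905_georeferenced.tif", with_gcps,
          dstSRS="EPSG:4326", resampleAlg="bilinear")
```

In QGIS itself this is done interactively through the Georeferencer tool, picking control points by eye and checking the residual errors before warping.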

At the end of this exploratory project, we therefore have a working proof of concept that demonstrates the efficacy of our platform as well as its clear potential to:

  • Incorporate a diverse range of data and spatial information
  • Communicate this information effectively to the public and experts alike
  • Connect archives and provide a visual means of navigating their holdings
  • Collaborate with other spatial projects on the history of Hong Kong

Future plans

The key opportunity arising from this project stems both from its ability to aggregate data from, and drive traffic to, other spatial projects and archives, and from the groundswell of interest in spatial histories amongst scholars of Hong Kong. To this end, MHHK was designed to be collaborative, and a trip to Hong Kong in June 2024 provided the opportunity to meet with future partners and coordinate our discrete spatial history initiatives into a central, collaborative platform of which MHHK would be a pillar. The reception to this plan was overwhelmingly positive, and we are currently exploring avenues for follow-on funding to expand MHHK and other spatial history platforms with colleagues at the University of Hong Kong, Hong Kong Baptist University, and Lingnan University in Hong Kong; the University of Bristol’s Hong Kong History Centre in the UK; and the University of Prince Edward Island’s GeoREACH Lab in Canada.


Contact details & Links

PIs: Thomas M Larkin (University of Prince Edward Island); Robert Bickers (University of Bristol)

Emails:

Project links:

Partner websites:

Children of the 90s and Synthetic Health Data 

JGI Seed Corn Funding Project Blog 2023/24: Mark Mumme, Eleanor Walsh, Dan Smith, Huw Day and Debbie Johnson

What is Children of the 90s? 

Children of the 90s (Co90s) is a multi-generational population-based study following the health and development of nearly 15,000 families living around Bristol, whose children were born in 1991 and 1992. 

Co90s initially recruited its participants during the early stages of each mum’s pregnancy, and captures information prospectively at key time points using self-reported questionnaires, interviews, clinics and electronic health records (EHR).

The Co90s supports about 20 project teams using NHS data at any one time.  

What is Synthetic Data? 

At its most basic, synthetic data is information generated artificially rather than recorded directly from real-world events. It is essentially a computer-generated version of a dataset that contains no real records, thereby preserving privacy and confidentiality.

Privacy vs Fidelity

Generating synthetic data is frequently a balancing act between fidelity and privacy (Figure 1). 

“Fidelity”: how well does the synthetic data represent the real-world data?  

“Privacy”: can personal information be deduced from the synthetic data? 

Blue line with an arrowhead at each end. Left side is High privacy, low fidelity and the right side is low privacy, high fidelity
Figure 1: Privacy versus fidelity
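To make the trade-off concrete, here is a small illustrative sketch (not drawn from the project itself) contrasting the two extremes on a toy dataset. Sampling each column independently sits at the high-privacy, low-fidelity end: no real record is reproduced, but the age-diagnosis relationship vanishes. Resampling whole rows sits at the other end: the relationship is preserved perfectly, but every “synthetic” record is a copy of a real one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy 'real' dataset: diagnosis probability rises with age
n = 10_000
age = rng.integers(20, 90, n)
diagnosis = rng.random(n) < (age - 20) / 100
real = pd.DataFrame({"age": age, "diagnosis": diagnosis})

# High privacy, low fidelity: draw each column independently,
# which destroys the age-diagnosis relationship
synth_indep = pd.DataFrame({
    col: real[col].sample(n, replace=True, ignore_index=True)
    for col in real.columns
})

# High fidelity, low privacy: resample whole rows, so every
# 'synthetic' record is an exact copy of a real one
synth_rows = real.sample(n, replace=True, ignore_index=True)

for name, df in [("real", real),
                 ("independent columns", synth_indep),
                 ("resampled rows", synth_rows)]:
    corr = df["age"].corr(df["diagnosis"].astype(float))
    print(f"{name}: age-diagnosis correlation = {corr:.2f}")
```

Useful synthetic data generators sit somewhere between these two extremes, modelling enough of the joint structure to be realistic without reproducing individual records.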

Why synthetic NHS data: 

EHR data are incredibly valuable and rich data sources, yet there are significant barriers to accessing them, including financial costs and the time taken to complete multiple application forms and have them approved.

Because authentic NHS data is so difficult to access, it is not unusual for researchers to have never worked with, or possibly even seen, this type of data before. They often face a learning curve to understand how the data is structured, what variables are present and how those variables relate to each other.

The journey a project must travel just to get NHS data (Figure 2) typically goes through the following stages:

Multiple coloured boxed with each stage a project has to go through to get NHS data from initial grant application to data access
Figure 2: The stages a project goes through to get NHS data

Each of these stages can take several months, and they are usually sequential. It is not unheard of for projects to run out of time and/or money because of these lengthy timescales.

Current synthetic NHS data: 

Recently, the NHS has released synthetic Hospital Episode Statistics (HES) data (available at https://digital.nhs.uk/services/artificial-data) which is, unfortunately, quite limited for practical purposes. This is because a very simple approach was adopted: each variable is randomly generated independently of all others. While it is possible to infer broadly accurate descriptive statistics for single variables (e.g., age or sex), it is impossible to infer relations between variables (e.g., how the number of cancer diagnoses increases with age). In the terms introduced above, it has high privacy but low fidelity. As shown in the heatmap (Figure 3), we observe practically no association between diagnosis and treatment, because the synthetic NHS data is randomly generated variable by variable.

Heat map with disease groupings on the right side and different treatments on the bottom
Figure 3: Heatmap displaying the relations between disease groupings (right side) and treatment (bottom) from the synthetic NHS data. The colour shadings represent the number of patients (e.g., the darker the shading, the higher the number). The similarity in shading within each diagnosis row shows that treatment and diagnosis were largely independent in this synthetic dataset.
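You can check this kind of variable-by-variable independence yourself with a row-normalised crosstab, which is essentially what the heatmap in Figure 3 visualises. The sketch below uses mock data with hypothetical column names rather than the actual HES download; when columns are generated independently, every diagnosis row shows roughly the same treatment profile.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Mock stand-in for the artificial HES table: each column is drawn
# independently, mimicking the NHS approach (column names are hypothetical)
n = 50_000
hes = pd.DataFrame({
    "diagnosis_group": rng.choice(["cancer", "cardiac", "respiratory", "injury"], n),
    "treatment": rng.choice(["surgery", "chemotherapy", "medication", "physio"], n),
})

# Row-normalised crosstab: with independent columns, every diagnosis row
# has roughly the same treatment profile - the 'flat' pattern of Figure 3
xtab = pd.crosstab(hes["diagnosis_group"], hes["treatment"], normalize="index")
print(xtab.round(3))
```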

What do researchers want from synthetic data? 

We developed an anonymous survey and asked 230 researchers experienced with EHR data what would be important to them when considering using synthetic EHR data. Of the 24 who responded, most were epidemiologists at fellow or professor level. Researchers were then invited to an online discussion group to expand on insights from the survey; seven researchers attended.

Most researchers had more than 3 years of experience using EHRs, both within and outside of cohort studies. Although few had much knowledge of synthetic EHR data, many had heard of it and were interested in its application, particularly as a tool for training and learning about EHRs generally.

The most important issues to researchers (Figure 4) were consistent patient details and having all the additional diagnosis & treatment codes rather than just the main ones: 

Horizontal bar chart showing different desirable quantities in synthetic EHR against the number of responses.
Figure 4: What researchers look for in synthetic EHRs

The most important utility for these researchers was to test/develop code and understand the broad structure of the data, as shown below (Figure 5):

Chart showing the priorities of researchers on a scale from first to last choice
Figure 5: Priorities of researchers when using synthetic data

This was reflected in their main concern: maintaining the utility of the data in the synthetic version through a high level of accuracy and attention to detail.

During the discussion it was recognised that EHRs are “messy” and synthetic data should emulate this, providing an opportunity to prepare for real EHRs. 

Visual showing discussion points about emulating “messy” real data
Emulate “messy” real data discussion visual

Being able to prepare for the use of real EHRs was the main use case for synthetic data. No one suggested using the synthetic data as the analysis dataset in place of the real data.   

Visual showing factors to consider in relation to preparation for using real EHR data
Preparation for using real EHR data visual

It was suggested, in both survey responses and the discussion group, that any synthetic data should be bespoke to the requirements of each project. Further, it was observed that each research project only ever uses a portion of the complete dataset, so synthetic data should likewise be minimised.

“I think any synthetic data set based on any of the electronic health records should be stripped back to the key things that people use, because then the task of making it a synthetic copy [is] less.” (online participant) 

Summary

Following the survey and discussion with researchers familiar with EHRs, a few key points came through:

  • Training – using synthetic data to understand how EHRs work, and to develop code. 
  • Fidelity is important – using synthetic data as a way for researchers to experience using EHRs (e.g. the real data flaws, linkage errors, duplicates).
  • Cost – the synthetic data set, and associated training, must be low cost and easily accessible.  

Next Steps

There is a demand for a synthetic data set with a higher level of fidelity than is currently available, and particularly there is a need for data which is much more consistent over time. 

The Co90s is well placed to respond to this demand, and will look to: 

Ask JGI Student Experience Profiles: Mike Nsubuga

Mike Nsubuga (Ask-JGI Data Science Support 2023-24) 

Embarking on a New Path 

Mike Nsubuga
Mike Nsubuga, first year PhD Student in Computational Biology and Bioinformatics

In the early days at Bristol, even before I began my PhD, I stumbled upon something extraordinary. AskJGI, a university initiative that provides data science support to researchers from all disciplines, caught my attention through a recruitment advert for support data scientists circulated by my PhD supervisor.

My journey started with hesitation. As a brand-new PhD student who had just relocated to the UK, I questioned whether I was ready or suitable for such a role. Despite my reservations, my supervisor saw potential in me and encouraged me to seize this opportunity. Yielding to their encouragement, I applied, not fully realizing then how this decision would profoundly shape both my academic and professional paths.

A World of Opportunities 

Joining AskJGI opened a door to a dynamic world brimming with ideas and innovations. My background in bioinformatics and computational biology meant that working on biomedical queries was particularly rewarding. These projects varied from analyzing protein expression data to studying infectious diseases, allowing me to use data science in meaningful ways. 

Among the initiatives I was involved in was developing models to predict protein production efficiency in cells from their genetic sequences. Our goal was clear: to identify patterns in genetic sequences that indicate protein production efficiency. We employed advanced data analysis and machine learning techniques to achieve effective predictions.

Additionally, I contributed to a project analyzing the severity of dengue infections by using statistical models to identify key biological markers. We pinpointed certain markers as critical for distinguishing between mild and severe cases of the infection. 

These projects showcased the transformative power of data science in understanding and potentially managing diseases, directly impacting public health strategies. 

Making Science Accessible: Community Engagement at City Hall

A highlight of my tenure with AskJGI was participating in Data Science Week at bustling Bristol City Hall. The event was not merely a showcase of data science but an opportunity to demystify complex concepts for the public. Engaging in lively discussions and simplifying intricate algorithms for curious visitors was incredibly fulfilling, especially seeing their excitement as they understood the concepts that are often discussed in our professional circles. 

Audience sitting in City Hall. Some audience members are raising their hands. There is a projector and a speaker at the front of the hall
AI and the Future of Society event as part of Bristol Data Week 2024

Fostering Connections and Gaining Insights 

AskJGI enhanced my technical skills and broadened my understanding of the academic landscape at the University of Bristol. The connections I forged were invaluable, sparking collaborations that would have been unthinkable in the more isolated environment of my earlier academic career. Reflecting on my transformative journey with AskJGI, I am convinced more than ever of the importance of interdisciplinary collaboration and the critical role of data science in tackling complex challenges. I encourage any researcher at the University of Bristol who is uncertain about their next step to explore what AskJGI has to offer. For PhD students looking to get involved, it represents not just a learning opportunity but a chance to make a significant societal impact. 

Unlocking big web archives: a tool to learn about new economic activities over space and time

JGI Seed Corn Funding Project Blog 2022/23: Emmanouil Tranos 

Where do websites go to die? Well, fortunately they don’t always die, even if their owners stop caring about them. Their ‘immortality’ can be attributed to organisations known as web archives, whose mission is to preserve online content. There are quite a few web archives today with different characteristics – e.g. focusing on specific topics vs. archiving the whole web – but the Internet Archive is the oldest one. Even if you are not familiar with it directly, you might have come across the Wayback Machine, which is a graphical user interface for accessing webpages archived by the Internet Archive.

Although it might be fun to check the aesthetics of a website from the internet’s early days – especially considering the current 1990s revival – one might question the utility of such archives. But some archived websites are more useful than others. Imagine accessing archived websites from businesses located in a specific neighbourhood and analysing the textual descriptions of the services and products these firms offer as they appear on their websites. Imagine being able to geolocate these websites by using information available in the text. Imagine doing this over time. And imagine doing this programmatically for a large array of websites. Well, our past research did exactly that and therefore serves as a proof of concept for the utility of web archives in understanding the geography of economic activities. Our models successfully utilised a dataset well curated by the British Library and the UK Web Archive to understand how a well-known tech cluster – Shoreditch in London – evolved over time. Importantly, we were able to do this at a much higher level of detail, in terms of the descriptions of the types of economic activities, than if we had used more traditional business data.
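To give a flavour of what “doing this programmatically” looks like, the sketch below lists archived captures of a single, hypothetical business website using the Wayback Machine’s public CDX API; each capture URL it prints can then be fetched and its text analysed.

```python
import requests

# Query the Wayback Machine's CDX API for snapshots of a hypothetical
# business website captured between 2000 and 2012
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example-shoreditch-firm.co.uk",
        "from": "2000",
        "to": "2012",
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "filter": "statuscode:200",
        "collapse": "timestamp:6",  # keep at most one capture per month
    },
    timeout=30,
)
rows = resp.json()  # first row is the header, the rest are captures

for timestamp, original, _status in rows[1:]:
    # Each archived copy lives at a predictable URL and can be fetched
    # and text-mined like any other web page
    print(f"https://web.archive.org/web/{timestamp}/{original}")
```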

The JGI project provided the opportunity to start looking forward. Our proof-of-concept research was useful in validating the value of such a research trajectory and in revealing the evolving mechanisms of economic activities, but it focused only on the 2000-2012 period. The next question is how to use this research framework in a current context.

Before I explain the challenge in doing this, let me tell you about the value of being able to do it. Our current understanding of the typologies of economic activities is based on a system called Standard Industrial Classification (SIC) codes. Briefly, businesses need to choose the SIC code that best describes what they do. Useful as they may be, SIC codes have not been updated since 2007 and, therefore, cannot capture new and evolving economic activities. In addition, there is built-in ambiguity in SIC codes, as quite a few of them are defined as “… not elsewhere classified” or “… other than …”. Having a flexible system that can easily provide granular and up-to-date classifications of economic activities within a city or a region could be very useful to a wide range of organisations, including local authorities, chambers of commerce and sector-specific support organisations.

The main challenge in building such a tool is data: finding, accessing, filtering and modelling relevant data. Our JGI seedcorn project, together with Rui Zhu and Giulia Occhini, allowed us to pave the path for such a research project. Thanks to the Common Crawl, another web archive which openly releases all its crawled data every two months, we have all the data we need. The problem is that we have much more data than we need, as the Common Crawl crawls and scrapes the whole web, providing a couple of hundred terabytes of data every two months. And that is in compressed format! So even accessing these data can be challenging, let alone building a workflow which can do all the steps I mentioned above and – importantly – keep doing them every few months as new data dumps become available.
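As a rough illustration of how one can work at this scale without downloading entire dumps, the sketch below queries a Common Crawl URL index for a hypothetical domain and then fetches just the byte range of one archived page from its WARC file. The crawl ID and domain are illustrative, and error handling is omitted.

```python
import gzip
import json

import requests

CRAWL = "CC-MAIN-2024-33"  # illustrative crawl id: a new index appears every ~2 months

# 1) Ask the crawl's URL index which pages it holds for a hypothetical domain
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": "example-business.co.uk/*", "output": "json"},
    timeout=60,
)
records = [json.loads(line) for line in resp.text.splitlines()]

# 2) Fetch a single page's bytes from the multi-gigabyte WARC file
#    using an HTTP range request, instead of downloading the whole file
rec = records[0]
start = int(rec["offset"])
end = start + int(rec["length"]) - 1
warc_slice = requests.get(
    f"https://data.commoncrawl.org/{rec['filename']}",
    headers={"Range": f"bytes={start}-{end}"},
    timeout=60,
).content

# 3) Each record is an independently gzipped WARC member, so the slice
#    decompresses on its own: WARC headers first, then the archived HTML
print(gzip.decompress(warc_slice).decode("utf-8", errors="replace")[:500])
```

The real engineering challenge is running this reliably over hundreds of millions of index entries, every crawl, rather than for one page.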

Although we are nowhere close to completing such a big project, the JGI seedcorn funding allowed us to test some of the code and the data infrastructure needed to complete such a task. We are now developing funding proposals for such a large research programme and, although it is a risky endeavour, we are confident that we can find the needle in the haystack and build a dynamic system of typologies of economic activities – based on open data and reproducible workflows – at a higher level of detail than current official and traditional data offer.


Emmanouil Tranos 

Professor of Quantitative Human Geography | Fellow at the Alan Turing Institute

e.tranos@bristol.ac.uk | @EmmanouilTranos | etranos.info | LinkedIn 

Ask JGI Student Experience Profiles: Emma Hazelwood

Emma Hazelwood (Ask-JGI Data Science Support 2023-24) 

Emma Hazelwood
Emma Hazelwood, final year PhD Student in Population Health Sciences

I am a final year PhD student in Population Health Sciences. I found out about the opportunity to support the JGI’s data science helpdesk through a friend who had done this job previously. I thought it sounded like a great way to do something a bit different, especially on those days when you need a bit of a break from your PhD topic.

I’ve learnt so many new skills from working within the JGI. The team are very friendly, and everyone is learning from each other. It’s also been very beneficial, when considering what I want to do after my PhD, to pick up new tools such as Python. I’ve been able to see how the statistical methods that I know from my biomedical background can be used in completely different contexts, which has really changed the way I think about data.

I’ve worked on a range of topics through JGI, which have all been as interesting as they have been different. I’ve helped people with coding issues, thought about new ways to visualise data, and discussed what statistical methods would be most suitable for answering research questions. In particular, I’ve loved getting involved with a project in the Latin American studies department, where I’ve been mapping key locations from conferences throughout the early 20th century onto satellite images, bringing to life the routes that the conference attendees would have taken. 

This has been a great opportunity working with a very welcoming team, and one I’d recommend to anyone considering it!