A Long Way Home – Lion

Recently I finished reading Saroo Brierley’s book A Long Way Home, which was adapted into the feature film Lion. In it he recounts his extraordinary life story, which begins in 1987 when he became separated from his brother at a train station whose name began with ‘B’, a few hours from home in rural India. He was five years old, and his brother Guddu had taken him along to make some extra money sweeping out trains. Guddu left him on a bench in the station to get some sleep while he went to work. The young Saroo woke a little later but could find no trace of Guddu. In his panic he hopped on a train to look for him, the doors shut and, after what he believes was 12-15 hours, he ended up in Kolkata (then Calcutta). There, against all odds, he survived on the streets and was eventually adopted by (from the sounds of it) a wonderful couple in Tasmania. In his 20s he began trying to find his mother and siblings in India, using Google Earth to pinpoint first the town where he boarded the train and, in turn, his hometown.

Howrah Junction Station, Kolkata


His task was an unenviable one: India is the second most populous country in the world, with 1,367,139,484 people (17.5% of the world’s population, according to Wikipedia). Below is a population density map of India to give some context as to how the population is distributed. These figures are present-day; the population of India in 1987, when Saroo became lost, was 819,800,000.

Population Density of India


The whole story piqued my interest, and I wondered whether the search could be made more efficient if we tried it today with the benefit of open data and open-source software. I would like to heavily caveat what follows by stating that I am using software and data that would not have been as extensive, complete or fully featured when Saroo began his search in 2011.

I kept very careful notes whilst reading the book, and below is an exhaustive list of the criteria he had to work with, drawn from his recollections as a five-year-old. The criteria relate both to his hometown and to the town beginning with ‘B’ where he got on the train that eventually took him to Kolkata.

Initial Search Criteria


From the initial criteria above it became clear that a number of them would not be usable in replicating the search, such as that it wasn’t in the colder north of India (too subjective) or that they lived side by side with Muslims (a common occurrence in India). Below is the refined list of criteria that I am going to use to try to replicate the search.

Usable Criteria for Search

Saroo’s methodology started with tracing his steps backwards from Kolkata. He knew he got on the train at a station that sounded like ‘Berampur’ and he thought he was on the train for approximately 12-15 hours. Based on this, he consulted Indian friends of his at college about how to start searching. One friend in particular, Amreen, whose father worked for the Indian Railway in New Delhi, proved helpful. Her father made an educated guess that trains in India in the 1980s travelled at 70-80kph. Based on this, Saroo calculated that he would have travelled about 1,000km in that time. He started searching methodically outwards from Kolkata to try to find the station that began with ‘B’ and then, hopefully, his hometown about an hour away from it.
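As a rough check on that figure, the arithmetic can be sketched in a few lines of Python. The hours and speeds are the ones quoted above; the function name is my own and purely illustrative.

```python
# Hours on the train and train speeds as quoted in the text above.
HOURS_ON_TRAIN = (12, 15)
SPEED_KPH = (70, 80)


def distance_range_km(hours, speeds):
    """Return the (shortest, longest) plausible distance travelled in km."""
    return min(hours) * min(speeds), max(hours) * max(speeds)


low_km, high_km = distance_range_km(HOURS_ON_TRAIN, SPEED_KPH)
midpoint_km = (low_km + high_km) / 2
print(f"Plausible distance: {low_km}-{high_km} km (midpoint {midpoint_km:.0f} km)")
```

The range works out at 840-1,200km, so a search radius of roughly 1,000km sits near the middle of it.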

To recreate the search for Saroo’s hometown I would need access to as much free geographical data as possible, so I turned to OpenStreetMap (OSM) and specifically the downloads available from Geofabrik. I downloaded the Protocolbuffer Binary Format (PBF) file for the whole of India. The first items I was interested in were the railway lines and railway stations. QGIS can load PBF files natively, but the whole of India is a bit of a stretch for it regardless of the computing power available.

BASH Osmium Commands


I used the Osmium tool to extract every railway station and line in India and then GDAL to convert them to the GeoPackage format. The first analysis I undertook was to follow Saroo’s approach and ascertain how many railway stations are within 1,000km of Howrah railway station. To give some context, OSM lists 7,979 railway stations and 108,000km of railway line. The Wikipedia page lists 68,155km; the discrepancy may be accounted for by the inclusion of every siding, historic railway line etc. in the OSM data. The map below shows every railway station and line from OSM.
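For anyone who prefers to drive this step from Python, here is a minimal sketch of how the extraction and conversion might be scripted. The file names and tag filters (`n/railway=station`, `w/railway=rail`) are my assumptions and not necessarily the exact commands used for this post; the sketch only builds and prints the commands, so nothing runs unless you pass them to `subprocess.run` yourself.

```python
import shlex


def build_extract_commands(pbf="india-latest.osm.pbf"):
    """Build argv lists for extracting railway data and converting it.

    Output file names and filter expressions are illustrative assumptions.
    """
    return [
        # Keep only nodes tagged railway=station.
        ["osmium", "tags-filter", pbf, "n/railway=station",
         "-o", "stations.osm.pbf", "--overwrite"],
        # Keep only ways tagged railway=rail.
        ["osmium", "tags-filter", pbf, "w/railway=rail",
         "-o", "lines.osm.pbf", "--overwrite"],
        # Convert the OSM driver's 'points' and 'lines' layers to GeoPackage.
        ["ogr2ogr", "-f", "GPKG", "stations.gpkg", "stations.osm.pbf", "points"],
        ["ogr2ogr", "-f", "GPKG", "lines.gpkg", "lines.osm.pbf", "lines"],
    ]


commands = build_extract_commands()
for cmd in commands:
    # To execute for real: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Filtering on nodes only will miss stations mapped as areas, so the counts from a sketch like this may differ from the ones quoted in this post.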

Railway Stations and Railway Lines


Saroo’s methodology of drawing a 1,000km buffer around Kolkata and working outwards would leave 2,905 stations to search. Below is a map of all of them.
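The same kind of radius filter can be sketched outside a GIS with a great-circle (haversine) distance. This is only an illustration: the Howrah coordinates are approximate and the two sample stations are hypothetical, not taken from the OSM extract.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
HOWRAH = (22.583, 88.343)  # approximate (lat, lon) of Howrah Junction


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def stations_within(stations, centre, radius_km):
    """Return the names of stations no farther than radius_km from centre."""
    return [name for name, pos in stations.items()
            if haversine_km(centre, pos) <= radius_km]


# Hypothetical sample data, for illustration only.
sample = {
    "Station A": (23.583, 88.343),  # one degree of latitude north of Howrah
    "Station B": (22.583, 76.343),  # twelve degrees of longitude west
}
nearby = stations_within(sample, HOWRAH, 1000)
print(nearby)
```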

Every Railway Station within 1,000km of Kolkata


I decided not to use Saroo’s methodology of a buffer distance from Kolkata. I had the entire OSM database for India at my disposal, so the first and easiest step was to find all the railway stations whose names began with ‘B’ and contained ‘p’, ‘u’ and ‘r’. I used QGIS’s built-in Python functionality, PyQGIS, and wrote a small script that uses the regular expression module to find stations matching these criteria.

import re
import time

from qgis.core import QgsField
from qgis.PyQt.QtCore import QVariant

start = time.time()
# Start message
print("Program is starting...")

# Run from the QGIS Python console, where `iface` is already defined
layer = iface.activeLayer()
prov = layer.dataProvider()

# Add the output field only if it does not already exist
FIELD_NAME = "Criteria_Test"
if layer.fields().indexFromName(FIELD_NAME) == -1:
    prov.addAttributes([QgsField(FIELD_NAME, QVariant.String)])
    layer.updateFields()
# Look the field index up rather than hard-coding it
field_index = layer.fields().indexFromName(FIELD_NAME)

# Names that begin with 'b' and then contain 'p', 'u' and 'r' in that order
pattern = re.compile(r'^b.*p.*u.*r', re.IGNORECASE)

# Start layer editing
layer.startEditing()

for feat in layer.getFeatures():
    station_name = str(feat['name'])
    if pattern.search(station_name):
        layer.changeAttributeValue(feat.id(), field_index, "Meets_Criteria")
    else:
        layer.changeAttributeValue(feat.id(), field_index, "No_Match")

# Save the edits and leave editing mode
layer.commitChanges()

end = time.time()

print("Finished running - the program took " + str(round(end - start, 2)) + " seconds")

The above script took 15 seconds to run and narrowed the search field from 7,779 stations to 91, as shown below. I don’t think it is too much of a stretch to take the various ways he thought the name might have been spelled (listed below), pick out the letters common to those spellings and narrow down the search that way.

  1. Burampour
  2. Birampur
  3. Berampur
  4. Bramapour
  5. Berampur

Before and After 'B' Stations
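To sanity-check that idea, the spellings listed above can be tested against a pattern that requires a leading ‘b’ followed by ‘p’, ‘u’ and ‘r’ in that order. This is a standalone sketch with names of my own choosing; it runs in plain Python without QGIS.

```python
import re

# The distinct spellings listed above.
SPELLINGS = ["Burampour", "Birampur", "Berampur", "Bramapour"]

# Begins with 'b', then contains 'p', 'u' and 'r' in that order.
STATION_PATTERN = re.compile(r"^b.*p.*u.*r", re.IGNORECASE)


def matches_b_station(name):
    """True if the station name satisfies the 'B'-town pattern."""
    return STATION_PATTERN.search(name) is not None


# Letters shared by every spelling.
common_letters = set.intersection(*(set(s.lower()) for s in SPELLINGS))

print(sorted(common_letters))
print(all(matches_b_station(s) for s in SPELLINGS))
```

The shared letters turn out to include ‘a’ and ‘m’ as well, so requiring only ‘b’, ‘p’, ‘u’ and ‘r’ is a deliberately looser net than the spellings strictly allow.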

Now that we’ve narrowed down the number of possible ‘B’-towns to 91, the next step is to use the hometown criteria to find the correct location. We know that Saroo’s hometown had a water tower, river, bridge, dam and a fountain in the park near the station. He believed the town’s name began with ‘G’ and sounded like ‘Ginestlay’. I could have proceeded with the other criteria for the ‘B’-town; however, I felt that a dam, bridge, river, fountain and water tower were enough unique entities to try finding the hometown at this stage without dedicating any further resources or time to the ‘B’-town.

Saroo was working on the assumption, from his friend Amreen’s father, that trains travelled at 70-80kph in the mid-80s. As processing power wouldn’t be an issue for what I was trying to do, I decided to use the upper limit of 80kph for the speed of the trains. As the PBF file for the whole of India wouldn’t load correctly in QGIS, I extracted all of the bridges, dams, rivers and water towers separately. Below are maps showing the distribution of each from OSM.

Rivers of India

Dams of India

Bridges of India

Water Towers of India

I omitted pedestrian overpasses and underpasses at this stage as they were too prevalent to help narrow down the location. I now needed to buffer each of the possible ‘B’-towns by the distance the train would have travelled. Saroo knew that he and Guddu had travelled for about an hour that night. At 80kph that is 80km, and I added a small bit of ‘fat’, so I used 100km as the buffer distance. I then dissolved the buffers together and clipped them to the coastline. Saroo thought that he had been on the train for 12-15 hours, so I was thinking it would be useful to draw a buffer around Kolkata of perhaps 6 hours’ travel and omit everything inside it, as he knew he had travelled farther than that.
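The logic of that search area can be expressed as a simple test: a location qualifies if it is within 100km of any ‘B’-town station and at least six hours’ travel (480km at 80kph) from Kolkata. The sketch below is only an approximation of the buffer, dissolve and clip steps done in QGIS. It ignores the coastline clip, treats the Kolkata exclusion as applied, and all coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

TRAIN_SPEED_KPH = 80
B_TOWN_BUFFER_KM = 100                      # about an hour at 80kph plus some 'fat'
KOLKATA_EXCLUSION_KM = 6 * TRAIN_SPEED_KPH  # 6 hours of travel = 480km


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def in_search_area(point, b_stations, kolkata):
    """True if point is near a 'B'-town station and far enough from Kolkata."""
    if great_circle_km(point, kolkata) < KOLKATA_EXCLUSION_KM:
        return False
    return any(great_circle_km(point, s) <= B_TOWN_BUFFER_KM for s in b_stations)


# Hypothetical coordinates, for illustration only.
kolkata = (22.57, 88.36)
b_stations = [(21.0, 76.0)]
print(in_search_area((21.5, 76.2), b_stations, kolkata))
```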

A quick GIF of this process is shown below.

GIF of Process


Below is the final search area with every dam, river, bridge, water tower and train station within 100km of the ‘B’-town stations.

Final Search Area


For the final analysis I wanted to find only the stations that had a water tower, dam, river and bridge within 2km of the station. I picked 2km because I think that is a reasonable distance from a town centre for five-year-old Saroo to have walked regularly enough to remember these features distinctly. I didn’t want to include the ‘B’-town stations themselves, so I simply clipped out the data within 10km of each ‘B’-town. Saroo said that he and Guddu were on the train for about an hour, so 10km seems like a safe distance to clip.

Example of Final Search Area


 

2km Buffers of Each Station


The next step was to find each station that had our relevant criteria within the buffer area. For this I turned to PostGIS (I tried running the SQL query in QGIS but it crashed every time). For the first iteration I decided to use only bridges, water towers and dams, as my logic was that these would prove more conclusive than pedestrian overpasses and underpasses, and rivers, which may have proved too common. Plus, the likelihood of a bridge crossing a river was much greater than it crossing a ravine, gorge etc. There were 3,223 stations within the 100km buffer of each ‘B’-town station, as shown below.

Stations within Buffer Search Area


The SQL query that I ran is below. It ran in about a second and returned each station that had a bridge, a water tower and a dam within 2km, narrowing the number of possible stations from 3,223 to 24 as shown below.

-- Stations whose 2km buffer intersects all three feature types
CREATE VIEW Relevant_Stations 
AS 
SELECT stations_buffered.geom
     , stations_buffered.name
     , stations_buffered.fid 
  FROM stations_buffered
INNER
  JOIN merged_bridge_river_dam 
    ON ST_Intersects(stations_buffered.geom, merged_bridge_river_dam.geom) 
   AND merged_bridge_river_dam.Feature_Ty IN ('Bridge','DAM','W_Tower')
GROUP
    BY stations_buffered.geom
     , stations_buffered.name
     , stations_buffered.fid  
-- Keep a station only if a bridge, a dam and a water tower were all found
HAVING COUNT(DISTINCT merged_bridge_river_dam.Feature_Ty) = 3;

The 24 Stations that have a ‘bridge’, ‘water tower’ and ‘dam’ within 2km.
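For readers without PostGIS, the same "all three feature types within 2km" test can be sketched in plain Python. This is a simplification under stated assumptions: features are treated as points in a projected coordinate system measured in metres (the real OSM features are often lines or polygons), and the sample data is made up.

```python
from collections import defaultdict
from math import hypot

REQUIRED = {"Bridge", "DAM", "W_Tower"}  # feature type labels used in the query
BUFFER_M = 2000


def relevant_stations(stations, features, buffer_m=BUFFER_M):
    """Return stations with every required feature type within buffer_m metres."""
    found = defaultdict(set)
    for name, (sx, sy) in stations.items():
        for kind, (fx, fy) in features:
            if kind in REQUIRED and hypot(fx - sx, fy - sy) <= buffer_m:
                found[name].add(kind)
    # Equivalent of HAVING COUNT(DISTINCT feature type) = 3
    return sorted(name for name, kinds in found.items() if kinds == REQUIRED)


# Hypothetical projected coordinates in metres, for illustration only.
stations = {"Alpha": (0, 0), "Beta": (10000, 0)}
features = [
    ("Bridge", (500, 0)),
    ("DAM", (0, 1500)),
    ("W_Tower", (-1000, 1000)),
    ("Bridge", (10500, 0)),
    ("DAM", (10000, 2500)),  # just outside Beta's 2km buffer
]
matches = relevant_stations(stations, features)
print(matches)
```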

An example of one of the 24 stations that match the criteria is shown below.

Example of Station Matching Criteria


Saroo thought that the place he was from sounded like ‘Ginestlay’ so the next step was to find all the OSM tags for ‘place’ that started with ‘G’. I used the below Osmium command to extract all of these place names:

osmium tags-filter india-latest.osm.pbf place=* -o Placenames.osm.pbf

This extracted 193,057 place names for the whole of India. To ensure that I didn’t miss anything, I converted the polygons and lines to centroids and merged everything together (this took about 2 minutes in QGIS). I then filtered this to every place that began with ‘G’ and used the ‘Select by Location’ tool to narrow the list from 24 to 12 stations. I have included images of these 12 stations below. Some of these obviously were not where Saroo was from, as they formed parts of large cities (such as Lucknow), but I’ve left them in for the sake of showing the full process.
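The filter-then-select step can be sketched as follows. I am assuming here that the selection keeps stations with a ‘G’ place inside their 2km buffer; the coordinates are hypothetical projected metres and the place names are only examples.

```python
from math import hypot


def stations_near_g_places(stations, places, radius_m=2000):
    """Keep stations that have a place starting with 'G' within radius_m."""
    g_places = [xy for name, xy in places
                if name.strip().upper().startswith("G")]
    keep = []
    for station, (sx, sy) in stations.items():
        if any(hypot(px - sx, py - sy) <= radius_m for px, py in g_places):
            keep.append(station)
    return keep


# Hypothetical coordinates in metres, for illustration only.
stations = {"Candidate 1": (0, 0), "Candidate 2": (50000, 0)}
places = [
    ("Ganesh Talai", (800, 600)),   # 1km from Candidate 1
    ("Rampur", (50100, 0)),         # near Candidate 2 but does not start with 'G'
    ("gandhi nagar", (90000, 0)),   # starts with 'g' but near neither station
]
shortlist = stations_near_g_places(stations, places)
print(shortlist)
```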

Candidate Area 1

Candidate Area 2

Candidate Area 3

Candidate Area 4

Candidate Area 5

Candidate Area 6

Candidate Area 7

Candidate Area 8

Candidate Area 9

Candidate Area 10

Candidate Area 11

Candidate Area 12

Conclusion

The station that we were looking for, ‘Khandwa Junction’, with the neighbourhood of ‘Ganesh Talai’, is candidate area 3 above. There are a few observations that I’d like to make regarding the above work. Firstly, I tried to stay as faithful as possible to the criteria that Saroo had to work with. Obviously, I had read the book and I knew the answer, but I hope that the above doesn’t feel reverse-engineered. I can honestly say that I didn’t look at any of the OSM data for his hometown prior to starting this post.

I’m conscious that there is a high likelihood that Saroo’s own search, along with the media coverage of the movie, is the reason the OSM data for Khandwa Junction exists at all. However, all of the above steps are very easily customisable and the criteria easy to change. The parameters could easily be changed to test another theory, such as using only a ‘B’ and ‘R’ for the ‘B’-town station. If we hadn’t found the answer we were looking for on the first pass, we could have used a DEM to find where train lines crossed a gorge outside a ‘B’-town station, or used Python to find all the horseshoe-shaped roads outside ‘B’-town stations. Finally, we could have added the data for pedestrian overpasses and underpasses and rerun the above.

What I can say with certainty is that with the OSM data available today it was very straightforward to reduce the number of stations to be searched from 2,905 (those within 1,000km of Kolkata) to 12. That means searching only 0.41% of the stations that Saroo originally had to sift through. Google Earth was the best tool available to Saroo at the time, but there’s no universe in which it wasn’t a brute-force, sledgehammer-to-crack-a-nut tool. I hope the above shows that it is easy to greatly reduce the number of stations to search through using some decent logic and the power of open-source GIS software and data. I also hope that if anyone else is in a bind similar to Saroo’s, the above might in some way help to demonstrate how to search for the right answer and make it home.

Population Below the Line

We all know that the majority of Australia’s population lives in the eastern states. There’s a question I’ve been thinking about for a while: if you drew a line from just north of Brisbane to just west of Adelaide, what percentage of Australia’s population (including islands and Tasmania) lives below that line? Well, I had a little time on this rainy Sunday so I decided to find out. I wrangled some data out of the ABS’ Table Builder and joined it with the geopackage of the SA2 geography (I chose this as I felt it provided the right spatial granularity without being too fine) and voilà, the answer to the question very few people asked: 82.3%.
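The "below the line" test itself is simple enough to sketch in Python. This is not the workflow I used in QGIS; it is an illustration that treats longitude and latitude as flat map coordinates, extends the line beyond its two endpoints, and uses made-up endpoints, centroids and populations.

```python
def is_below(point, line_start, line_end):
    """True if point (lon, lat) lies south of the line through the endpoints."""
    (x1, y1), (x2, y2) = line_start, line_end
    x, y = point
    # Latitude of the line at the point's longitude (line must not be vertical).
    line_lat = y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    return y < line_lat


def share_below(areas, line_start, line_end):
    """Percentage of total population in areas whose centroid is below the line."""
    total = sum(pop for _, _, pop in areas)
    below = sum(pop for _, centroid, pop in areas
                if is_below(centroid, line_start, line_end))
    return 100 * below / total


# Hypothetical endpoints (lon, lat) and toy centroids with made-up populations.
NORTH_OF_BRISBANE = (153.0, -27.0)
WEST_OF_ADELAIDE = (138.0, -35.0)
areas = [
    ("A", (151.2, -33.9), 500),
    ("B", (144.9, -37.8), 300),
    ("C", (115.9, -32.0), 150),
    ("D", (130.8, -12.5), 50),
]
print(share_below(areas, NORTH_OF_BRISBANE, WEST_OF_ADELAIDE))
```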

Population Below the Line


 

Unpopulated Areas

I was hiking at the weekend and it got me thinking about the unpopulated areas of Ireland. I’ve seen maps made for the 2011 census showing the square kilometres that have no usual resident population but I hadn’t seen one for the 2016 census, so I put together the below. I purposefully omitted Northern Ireland because the data is nine years old. If anybody would like to replicate the below, just leave a comment and I can do a YouTube tutorial or a post on here on how I put it together.

 

The Unpopulated Areas of Ireland


Airbnbs in Ireland

I was reading this Guardian article the other day where they produced maps showing the number of Airbnb listings per 100 dwellings. I thought it was really interesting and I hadn’t seen Airbnb data mapped like that before. I had a few hours to spare yesterday so I set about replicating their method for Ireland. I used the 2016 census electoral divisions (to get the household numbers) and data for Ireland from Inside Airbnb. I think this data is questionable at best because, from the reading I’ve undertaken, it seems to still list properties that were briefly on Airbnb a number of years ago and have long since been removed. However, it is the only data available so I went with it.
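The rate itself is just listings divided by dwellings, multiplied by 100. A minimal sketch, assuming each listing has already been assigned to an electoral division (the identifiers and counts below are invented):

```python
from collections import Counter


def listings_per_100_dwellings(listing_eds, dwellings_by_ed):
    """Return {electoral division: listings per 100 dwellings}.

    Divisions with no dwellings get None rather than a division by zero.
    """
    counts = Counter(listing_eds)
    rates = {}
    for ed, dwellings in dwellings_by_ed.items():
        rates[ed] = 100 * counts[ed] / dwellings if dwellings else None
    return rates


# Hypothetical data: one entry per listing, giving the division it falls in.
listing_eds = ["ED1", "ED1", "ED2", "ED1"]
dwellings_by_ed = {"ED1": 150, "ED2": 400, "ED3": 0}
rates = listings_per_100_dwellings(listing_eds, dwellings_by_ed)
print(rates)
```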

Below is the map. It was made with a combination of Bash, GDAL, QGIS, LibreOffice Calc and Illustrator.

Airbnbs per 100 Dwellings in Ireland


Irish Census 2016 & Privacy

I’ve been looking at the 2016 census results over the last few years and there is a great deal of suppression of values for relevant Small Areas. The CSO suppress results or aggregate them depending on the number of people living in a Small Area. If the population is too small and could lead to individuals being identified, the data is suppressed. They are legally required to undertake this exercise under s33 of the Statistics Act, 1993.

I’ve been looking at a selection of variables and, after reading this piece on the traveller accommodation crisis by RTÉ, I decided to map the percentage of travellers per Small Area. I have all this data in a PostGIS database but I’ll quickly run through how to do it without PostGIS. I downloaded the Small Areas shapefile (generalised to 50m) and the CSV of all of the Small Area values from the CSO here. Instead of using a spreadsheet or QGIS to manually delete the 802 fields I didn’t need, I used the pandas library; the Python code below took 0.3 seconds to run. It opens the relevant CSV, selects only the columns that I need and then strips the first 7 characters from the ‘GEOGID’ string, as these are not needed for the join I’ll do in QGIS later.

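import time

import pandas as pd

start = time.time()

# Read only the columns needed for the map
df = pd.read_csv('SAPS2016_SA2017.csv', usecols=['GUID', 'GEOGID', 'GEOGDESC', 'T1_1AGETT', 'T2_2WIT'])

# Make sure GEOGID is a string, then strip its first 7 characters
df['GEOGID'] = df['GEOGID'].astype(str).str[7:]

# Write the trimmed table without the pandas index column
df.to_csv('SAPS2016_SA2017_New_GEOGID.csv', index=False)

end = time.time()
print(end - start)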

I then opened the shapefile in QGIS, imported the CSV and joined them. This was subsequently exported to a GeoPackage and I used GDAL’s ogr2ogr utility to convert it to GeoJSON in order to upload it to Carto.

ogr2ogr command to convert to GeoJSON

Below is the resultant map, with some formatting of headings undertaken to make it more legible. You can make it full screen using the button on the left. What struck me about this was how, with a small amount of work, it was very easy to accurately visualise the resident locations of one of the most vulnerable groups in society. Obviously this information is useful to local governments, state government agencies, NGOs and so forth, but I question whether this data should be available to the general public, regardless of it being aggregated to the Small Area geography.

 

Ireland’s Social Housing

Housing and all its intricacies have come to dominate the media discourse at home over the last few years. We’ve truly come out the other side of the recession and now the conversation is around the shortage of housing and where that has led us. I’ve been thinking about this recently, and in particular about social housing. I think that most people assume that we built the majority of our social housing in the 1950s and 1960s. Collectively I think we assume we know when social housing was built but not where. This is where the census data can come in. Part 2 of question H3 in census 2016 asks ‘If renting, who is your landlord?’.

I have all of the census 2016 returns for each geographical unit in a PostGIS database so it was a simple exercise to add the households that rent from a local authority or voluntary/co-operative housing body and divide by the total number of houses. One inherent weakness to this method is that it doesn’t capture the social housing tenants that rent from a private landlord.

Ireland Census 2016-Question H3

I added a new column in PostGIS for the percentage of social housing and then symbolised this in QGIS, using its powerful Atlas generation tool. You’ll have to excuse the basemap: I’m aware that it’s a bit difficult to discern, but in the interest of producing this entirely with open-source software I used OpenStreetMap as the basemap.

The next step will be to take the top-ten counties and use the Global Human Settlement Layer as a base to give an approximate indication of what epoch they were built in.

Australia-Durack Electoral Division

Like a lot of people, I spent a great deal of time following the 2019 federal election results. I was (and still am) very impressed with the Australian Electoral Commission’s Tally Room, where results are easily available and downloadable. It was while browsing their site that I came across the Western Australian federal seat of Durack. What piqued my interest is that its stated area is 1,629,858km². Its wiki page states that it’s the largest electoral division in the world that practices compulsory voting. The Guardian have a good article about it, which contains a graph comparing it in size to different countries in the world.

I decided to spend some of my weekend making a map of it. I downloaded the dataset from the Australian Electoral Commission’s website and the country admin data and hillshade from the brilliant Natural Earth. Below is the result; feel free to use it as you’d like.

Durack Electoral Division, WA-Largest Electoral Division in the world that practices compulsory voting.

Perth, Australia

Australia passed the 25 million people mark shortly after 11pm on the 7th of August 2018. This got me thinking: what would a map of Perth look like showing each nationality? Over 28% of Australians were born abroad, so what would this translate to in Perth terms?

I took a quick look online to see if anything already existed. The only thing I could find is the below, from Perth’s Wikipedia page.

One Dot per 100 persons, Perth, Wikipedia

It’s from 2008 and, although a gallant effort, it has a few major problems, most notably the lack of a legend. So I decided to see if I could make something, if not better, then at least as good as the above.

My first job was to source the data. I knew from previously working with ABS data that their pre-built geopackages or datapacks wouldn’t contain the data I needed (question 12 from census ’16), but the geopackages were useful for downloading the geometry that I needed.

Question 12

I needed to use Tablebuilder to collate the data for the geometry that I was going to use. This was the main learning area for me, as I didn’t know enough about which unit of statistical geography I wanted to use for this exercise. Luckily, the ABS have a website where you can compare and contrast each unit.

The ABS had already done the hard work, in that one of their statistical units is ‘Greater Perth’. I used this as my boundary and then chose the SA2 as the statistical unit. I went back to Tablebuilder and tried in vain to make sense of it; I found it very cumbersome and non-intuitive to use at the start, and their introductory videos weren’t of any help. Fortunately, I found an amazing video on YouTube that explained Tablebuilder in great detail. Once I’d watched that everything made sense, and I’m a Tablebuilder convert now!

I then used Tablebuilder to build the exact statistics that I needed (Country of Birth by SA2). I saved the table in Tablebuilder and downloaded it as a CSV file. In QGIS I then joined this with the SA2 geopackage file for WA and clipped it using the Greater Perth boundary that I had also downloaded. I then exported this layer as a new geopackage. I had previously found the top 8 nationalities by country of birth (using Tablebuilder) and then created new fields for each one where each number represented 200 persons born in that country. I then used the Random Points Inside Polygons tool to create random points for each nationality.
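To illustrate what the random points step is doing, here is a small pure-Python sketch of one-dot-per-200-persons sampling. It is not how QGIS implements the tool: it uses simple rejection sampling with a ray-casting point-in-polygon test, rounds the dot count down (my assumption), and the polygon and population are invented.

```python
import random

PERSONS_PER_DOT = 200


def dots_for(population):
    """Number of dots for a population, rounding down (an assumption)."""
    return population // PERSONS_PER_DOT


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) is inside the polygon's vertex ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_points_in_polygon(polygon, count, rng):
    """Draw points in the bounding box and keep those inside the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < count:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points


# Hypothetical L-shaped area with 1,250 residents born in one country.
area = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
dots = random_points_in_polygon(area, dots_for(1250), random.Random(42))
print(len(dots))
```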

Generate Random Points Inside Polygons using QGIS

I then used Adobe Color [sic] to pick a decent colour scheme for the various dots. I used Quick OSM in QGIS to download a layer with the towns in Greater Perth to be used for reference. This took about 10 seconds to do; Quick OSM is really useful.

Quick OSM in QGIS

Lastly, I used Google Fonts to download some nice fonts. I also used some styling effects in QGIS before I exported everything to Inkscape in order to add the text. Below is the finished product. The biggest flaw in what I have done is that there are overlapping points, but I still think it gives a good overall understanding of where people of different nationalities live in Greater Perth.