Mapping

How the mapping was done

This page is centered on the mapping process: the methodology, the code, the data, and all of the issues that arose during the project. As is common in digital humanities, this work fits into the lineage of “failure studies”; however, there are great bones to this project, and the core problem can be summarized simply as “unclean data.”

Map of Crusoe's island.

Methodology

The methodology for this project adapts the method described by Gabay and Vitali in their 2019 paper “A Theatre of Places: Mapping Seventeenth-Century French Theatre.” The first step was to build a small corpus. There are no fixed rules for corpus size, but Reppen notes that as long as a corpus captures “enough of the language for accurate representation” and can answer the research questions, it is sufficient (Reppen, 2022, p. 14). For this project, I use the full text of Robinson Crusoe together with four additional travel narratives. Taken together, these texts provide enough data to make claims about whether there are geographical and travel similarities with Robinson Crusoe, while not attempting to represent all travel accounts ever. Following Gabay and Vitali’s emphasis on a clean corpus, I worked with plain-text editions from Project Gutenberg so that the automatic extraction and mapping pipeline would not be confused by markup or poor OCR.

I also implemented a spelling normalisation step using pyspellchecker, which works token by token. Because regularisation is so finicky with historical language data, I designed the function to be extremely conservative: any word that begins with a capital letter (including non-standard spellings of names and places) is off-limits to the code; only lower-case forms are candidates for correction. As someone coming from Old English studies, I am very aware of the risks of over-aggressive lemmatisers and tokenisers, so preserving capitalisation was essential.
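The capitalisation guard can be sketched in a few lines. Here a toy dictionary and correction table stand in for pyspellchecker's word list (KNOWN and CORRECTIONS are assumptions for this sketch, not the real library); the full function appears in the code section below.

```python
import re

# Toy stand-ins for pyspellchecker's dictionary and correction logic
# (these two tables are illustrative, not the real library data).
KNOWN = {"island", "ship", "voyage", "near", "the"}
CORRECTIONS = {"ilsand": "island", "shipp": "ship"}

word_re = re.compile(r"\w+")

def normalise_spelling(text):
    """Correct only lower-case tokens; capitalised words are never touched."""
    def fix(m):
        w = m.group(0)
        if w[0].isupper():       # names and places keep their (possibly odd) spelling
            return w
        if w.lower() in KNOWN:   # already a known word: keep it
            return w
        return CORRECTIONS.get(w.lower(), w)  # fix only when we have a candidate
    return word_re.sub(fix, text)

print(normalise_spelling("The ilsand near Sonning"))  # -> "The island near Sonning"
```

Note that "Sonning" passes through untouched even though it is not in the dictionary: that is the whole point of the guard.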

The core of the workflow is Named Entity Recognition (NER) and gazetteering. Using spaCy’s English model (en_core_web_sm), I run NER over the original text of each file and collect all entities with labels in {GPE, LOC, FAC} as candidate locations. For each candidate I store the filename, the order in which it appeared, the character start and end offsets, and its latitude and longitude. All of this information is written to a single CSV file, which functions as a gazetteer (a fancy word for a geographic index).
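The filtering and row-building step can be sketched as follows. The `entities` list stands in for spaCy's `doc.ents` (the texts, labels, and offsets here are made up for illustration); the real pipeline in the code section runs the actual model.

```python
import csv
import io

PLACE_LABELS = {"GPE", "LOC", "FAC"}

# Stand-in for spaCy entities: (text, label, char_start, char_end) — illustrative values.
entities = [
    ("York", "GPE", 120, 124),
    ("Friday", "PERSON", 300, 306),   # a PERSON entity is not a candidate location
    ("Brazil", "GPE", 410, 416),
]

rows = []
order_index = 1
for text, label, start, end in entities:
    if label not in PLACE_LABELS:
        continue
    rows.append({
        "filename": "robinson_crusoe.txt",
        "order_index": order_index,
        "place_text": text,
        "char_start": start,
        "char_end": end,
    })
    order_index += 1

# A single CSV of such rows acts as the gazetteer for all files.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

The `order_index` column is what later makes it possible to reconstruct each narrative's path in reading order.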

The next step is geocoding. I use the Nominatim service (OpenStreetMap) with geopy to assign coordinates to each place string, with a small cache to avoid repeat lookups. Fictional places are handled separately: I maintain a list of names such as “Brobdingnag” and “Glubbdubdrib” and assign them fixed coordinates on a single “fictional island” so that they appear on the map but are visually distinct from real-world locations. This follows Gabay and Vitali’s strategy of hand-coding difficult or non-real places. There is a whole literature about handling fictional places in mapping, but for simplicity, I decided to stick with Gabay and Vitali's methodology.
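The lookup order matters: cache first, then the fictional-island list, then the live geocoder, and only successful results are cached so that failures can be retried. A minimal sketch, with a stub `lookup` function standing in for geopy's Nominatim call:

```python
# Fixed coordinates for the shared "fictional island" (values from the config).
FICTIONAL_ISLAND_LAT = 10.0
FICTIONAL_ISLAND_LON = -140.0
fictional_places = {"Brobdingnag", "Glubbdubdrib"}

geo_cache = {}  # place_text -> (lat, lon)

def lookup(name):
    """Stand-in for geopy's Nominatim geocode call (stubbed for this sketch)."""
    return {"London": (51.5074, -0.1278)}.get(name)

def geocode_place(name):
    if name in geo_cache:                 # cache hit: no repeat lookup
        return geo_cache[name]
    if name in fictional_places:          # fictional names go to the island
        coords = (FICTIONAL_ISLAND_LAT, FICTIONAL_ISLAND_LON)
        geo_cache[name] = coords
        return coords
    coords = lookup(name)
    if coords is None:                    # failures are not cached, so they can be retried
        return (None, None)
    geo_cache[name] = coords              # cache only good coordinates
    return coords
```

Because all fictional names share one coordinate pair, they cluster on the map and read immediately as "not real geography."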

From the gazetteer, I started playing around with movement. Using the order index in each text, I built a separate CSV file that simply links one place to the next in reading order. I then used Folium to draw lines between those points, so each narrative turns into a kind of path across the map. At one point I tried to put all of these lines on at once to make a big web of connections. It did work, but the map quickly became too noisy and I had to remove them (much sadness).
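The linking step itself reduces to pairing each row with its successor; in Folium those pairs become PolyLine segments. A minimal sketch with made-up coordinates:

```python
# Points from one file, already sorted by order_index (illustrative values).
points = [
    {"order": 1, "place": "York",   "lat": 53.96, "lon": -1.08},
    {"order": 2, "place": "London", "lat": 51.51, "lon": -0.13},
    {"order": 3, "place": "Sallee", "lat": 34.04, "lon": -6.82},
]

# Each segment links one place to the next in reading order:
# n points yield n - 1 segments.
segments = [
    {
        "from_place": a["place"],
        "to_place": b["place"],
        "coords": [(a["lat"], a["lon"]), (b["lat"], b["lon"])],
    }
    for a, b in zip(points, points[1:])
]

for s in segments:
    print(s["from_place"], "->", s["to_place"])
```

Each file gets its own chain of segments, which is why the full code below groups the gazetteer rows by filename before sorting and linking.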

I did not want to use just any base map; I wanted one that would be at least roughly period-accurate, so I picked the 1790 Cassini world map. This was a great decision for the design of the project, but the computer did not agree at first. The original Cassini file needed to be converted and resized, the geographic boundaries had to be set by hand, and I spent a lot of time tweaking the image.

The Problem

The biggest problem was the data. After processing the text files and extracting just the locations, I had 13,673 locations to work with. This is not a lot of data for a corpus linguistics study, but it is quite substantial for a mapping project, and for a digital humanities project that is attempting to intervene in a literary argument. As the maps stand, they are not authoritative enough to answer the research question that was posed. If the data were more organized and the user could be certain that the locations were accurate, the maps could be used this way. But this is not the case.

I also realized far too late in the game that the model would not extract only the locations that the characters visited: it extracts every location mentioned, including ones that are never visited. This explains many of the strange location dots on almost all of the maps. As for the other strange locations, a few factors could be contributing. Capitalization in eighteenth-century literature is one: the Project Gutenberg editions have fewer "random" capitals than the more faithful editions, but there are still some. A character's name may also happen to match the name of a town or village. Ultimately, it is hard to trust the map without a careful human reader verifying the NER's work.

The last and least devastating problem for this project was access to clean plain-text files of the primary sources. As some of the potential sources for Robinson Crusoe are semi-obscure and less well documented than others, it makes sense that finding a clean version is difficult. Nevertheless, we must endure.

Code

Please enjoy exploring this code. It is a copy of the original notebook, which was created in Google Colab with guidance. See the acknowledgement section for more information. You are welcome to download the code, but it will most likely not work straight out of the box, as it is meant to be run in phases.

Download full mapping script (.py)

# ============================================
# 0. Install required libraries
# ============================================
!pip install pyspellchecker
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install folium

# ============================================
# 1. Imports
# ============================================
import os
import re
import csv
import json
import folium
import spacy
from spellchecker import SpellChecker
from geopy.geocoders import Nominatim

# ============================================
# 2. Config
# ============================================
spacy_model_name = "en_core_web_sm"
place_labels = {"GPE", "LOC", "FAC"}
input_folder = "/content/workdir/input"
output_folder = "/content/workdir/output"
os.makedirs(output_folder, exist_ok=True)
csv_path = os.path.join(output_folder, "locations.csv")

# fictional island for Swift etc.
FICTIONAL_ISLAND_LAT = 10.0   # change if you like
FICTIONAL_ISLAND_LON = -140.0
fictional_places = {
    "Brobdingnag",
    "Glubbdubdrib",
    # add more here
}

# ============================================
# 3. Load input .txt files
# ============================================
files_in = [f for f in os.listdir(input_folder) if f.lower().endswith(".txt")]
if not files_in:
    print("No .txt files in", input_folder)
else:
    print(f"Found {len(files_in)} file(s):", files_in)

# ============================================
# 4. Spelling normalisation
#    (skip capital words, so Sonning stays Sonning)
# ============================================
word_re = re.compile(r"\w+", flags=re.UNICODE)
spell = SpellChecker(distance=1)

def apply_case_pattern(src, tgt):
    if src.isupper():
        return tgt.upper()
    if src[0].isupper() and src[1:].islower():
        return tgt.capitalize()
    return tgt

def normalise_spelling(text):
    def fix(match):
        w = match.group(0)
        # do not change capital words (names, etc.)
        if w[0].isupper():
            return w
        lw = w.lower()
        # keep if known
        if lw in spell:
            return w
        corr = spell.correction(lw)
        if not corr:
            return w
        return apply_case_pattern(w, corr)
    return word_re.sub(fix, text)

# ============================================
# 5. Load spaCy model
# ============================================
print("\nLoading spaCy model:", spacy_model_name)
nlp = spacy.load(spacy_model_name)

# ============================================
# 6. Geocoder (geopy)
#    - timeout
#    - fictional names to island
#    - cache only good coords
# ============================================
geolocator = Nominatim(user_agent="historical_ner_geocoder")
geo_cache = {}  # place_text -> (lat, lon)

def geocode_place(name):
    """Return (lat, lon) for a place string, or (None, None)."""
    # cache first
    if name in geo_cache:
        return geo_cache[name]
    # fictional names first
    if name in fictional_places:
        coords = (FICTIONAL_ISLAND_LAT, FICTIONAL_ISLAND_LON)
        geo_cache[name] = coords
        return coords
    # real-world geocode
    try:
        location = geolocator.geocode(name, timeout=5)
    except Exception as e:
        print("  Geocoding error for", name, ":", e)
        return (None, None)
    if location is None:
        return (None, None)
    coords = (location.latitude, location.longitude)
    # cache only good coords
    geo_cache[name] = coords
    return coords

# ============================================
# 7. Apply NER to each file (on ORIGINAL text)
#    then save normalised text
#    plus extra rule for capital words not PERSON
# ============================================
all_rows = []
for name in files_in:
    in_path = os.path.join(input_folder, name)
    print(f"\nProcessing: {in_path}")

    # read ORIGINAL text
    with open(in_path, "r", encoding="utf-8") as f:
        orig_text = f.read()

    # ---- 7a. NER on original text ----
    doc = nlp(orig_text)
    order_index = 1
    # record place strings already added from NER
    place_strings = set()

    # NER places
    for ent in doc.ents:
        if ent.label_ in place_labels:
            place_text = ent.text
            lat, lon = geocode_place(place_text)
            all_rows.append({
                "filename": name,
                "order_index": order_index,
                "place_text": place_text,
                "char_start": ent.start_char,  # offsets in ORIGINAL text
                "char_end": ent.end_char,
                "latitude": lat,
                "longitude": lon,
            })
            place_strings.add(place_text)
            order_index += 1
    print("  Found", order_index - 1, "NER place entities in original text.")

    # ---- 7b. extra rule: capitalised PROPN not PERSON ----
    # month names we do not want as places
    MONTH_NAMES = {
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"
    }

    # collect token index ranges for PERSON entities
    person_ranges = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            person_ranges.append(range(ent.start, ent.end))

    def token_in_person(tok_i):
        return any(tok_i in r for r in person_ranges)

    # track extra capital candidates we already added
    extra_capitals = set()

    for token in doc:
        t = token.text
        if not t:
            continue
        # must start with capital
        if not t[0].isupper():
            continue
        # simple: letters only
        if not t.isalpha():
            continue
        # skip "I"
        if t == "I":
            continue
        # skip if in a PERSON span
        if token_in_person(token.i):
            continue
        # skip if already a NER place string
        if t in place_strings:
            continue
        # skip if already added by this rule
        if t in extra_capitals:
            continue
        # skip single-letter tokens
        if len(t) == 1:
            continue
        # must be a proper noun (cuts It, The, When, etc.)
        if token.pos_ != "PROPN":
            continue
        # skip months
        if t in MONTH_NAMES:
            continue
        # if we reach here, treat it as candidate place for manual work
        all_rows.append({
            "filename": name,
            "order_index": order_index,
            "place_text": t,
            "char_start": token.idx,
            "char_end": token.idx + len(t),
            "latitude": None,
            "longitude": None,
        })
        extra_capitals.add(t)
        order_index += 1

    print("  Total entities (NER + capital PROPN rule):", order_index - 1)

# ============================================
# 8. Save CSV (with coordinates)
# ============================================
all_rows.sort(key=lambda r: (r["filename"], r["order_index"]))
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=[
            "filename", "order_index", "place_text",
            "char_start", "char_end", "latitude", "longitude",
        ],
    )
    writer.writeheader()
    for row in all_rows:
        writer.writerow(row)

print("\nSaved CSV to:", csv_path)
print("\n✅ All done.")

# ============================================
# 9. Make map from CSV (unchanged)
# ============================================
output_folder = "/content/workdir/output"
csv_path = os.path.join(output_folder, "locations.csv")
map_path = os.path.join(output_folder, "locations_map.html")

points = []
with open(csv_path, "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        lat = row.get("latitude")
        lon = row.get("longitude")
        place = row.get("place_text", "")
        fname = row.get("filename", "")
        order = row.get("order_index", "")
        # skip rows with missing coords
        if not lat or not lon or lat == "None" or lon == "None":
            continue
        try:
            lat = float(lat)
            lon = float(lon)
        except ValueError:
            continue
        points.append({
            "lat": lat, "lon": lon, "place": place,
            "filename": fname, "order": order,
        })

print(f"Loaded {len(points)} point(s) with coordinates.")

if not points:
    print("No valid coordinates in CSV. Stop.")
else:
    mean_lat = sum(p["lat"] for p in points) / len(points)
    mean_lon = sum(p["lon"] for p in points) / len(points)
    m = folium.Map(location=[mean_lat, mean_lon], zoom_start=4)
    for p in points:
        popup_text = (
            f"{p['place']}<br>"
            f"file: {p['filename']}<br>"
            f"order: {p['order']}"
        )
        folium.Marker(
            location=[p["lat"], p["lon"]],
            popup=popup_text,
        ).add_to(m)
    m.save(map_path)
    print("Saved map to:", map_path)

m  # show in notebook

# ============================================
# 9a. Build routes CSV: links between places
# ============================================
locations_path = csv_path  # same as before
routes_path = os.path.join(output_folder, "routes.csv")

# load locations with real coords
loc_rows = []
with open(locations_path, "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        lat = row.get("latitude")
        lon = row.get("longitude")
        # skip rows with missing coords
        if not lat or not lon or lat == "None" or lon == "None":
            continue
        try:
            row["latitude"] = float(lat)
            row["longitude"] = float(lon)
            row["order_index"] = int(row["order_index"])
        except ValueError:
            continue
        loc_rows.append(row)

# group by filename
by_file = {}
for row in loc_rows:
    fname = row["filename"]
    by_file.setdefault(fname, []).append(row)

# build segments
route_rows = []
for fname, rows in by_file.items():
    # sort by order_index inside each file
    rows.sort(key=lambda r: r["order_index"])
    # link each place to the next one
    for i in range(len(rows) - 1):
        a = rows[i]
        b = rows[i + 1]
        route_rows.append({
            "filename": fname,
            "from_order": a["order_index"],
            "to_order": b["order_index"],
            "from_place": a["place_text"],
            "to_place": b["place_text"],
            "from_latitude": a["latitude"],
            "from_longitude": a["longitude"],
            "to_latitude": b["latitude"],
            "to_longitude": b["longitude"],
        })

print(f"Built {len(route_rows)} route segment(s).")

# save routes CSV
with open(routes_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=[
            "filename", "from_order", "to_order",
            "from_place", "to_place",
            "from_latitude", "from_longitude",
            "to_latitude", "to_longitude",
        ],
    )
    writer.writeheader()
    for r in route_rows:
        writer.writerow(r)

print("Saved routes CSV to:", routes_path)

# ============================================
# 9b. Make map from CSV (markers + lines)
# ============================================
output_folder = "/content/workdir/output"
csv_path = os.path.join(output_folder, "locations.csv")
map_path = os.path.join(output_folder, "locations_map.html")

points = []
with open(csv_path, "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        lat = row.get("latitude")
        lon = row.get("longitude")
        place = row.get("place_text", "")
        fname = row.get("filename", "")
        order = row.get("order_index", "")
        # skip rows with missing coords
        if not lat or not lon or lat == "None" or lon == "None":
            continue
        try:
            lat = float(lat)
            lon = float(lon)
            order_int = int(order)
        except ValueError:
            continue
        points.append({
            "lat": lat, "lon": lon, "place": place,
            "filename": fname, "order": order_int,
        })

print(f"Loaded {len(points)} point(s) with coordinates.")

if not points:
    print("No valid coordinates in CSV. Stop.")
else:
    # center on mean
    mean_lat = sum(p["lat"] for p in points) / len(points)
    mean_lon = sum(p["lon"] for p in points) / len(points)
    m = folium.Map(location=[mean_lat, mean_lon], zoom_start=4)

    # markers
    for p in points:
        popup_text = (
            f"{p['place']}<br>"
            f"file: {p['filename']}<br>"
            f"order: {p['order']}"
        )
        folium.Marker(
            location=[p["lat"], p["lon"]],
            popup=popup_text,
        ).add_to(m)

    # lines: one path per file, in text order
    paths_by_file = {}
    for p in points:
        paths_by_file.setdefault(p["filename"], []).append(p)

    for fname, plist in paths_by_file.items():
        plist.sort(key=lambda x: x["order"])
        coords = [(p["lat"], p["lon"]) for p in plist]
        # draw a line if at least two points
        if len(coords) >= 2:
            folium.PolyLine(
                locations=coords,
                tooltip=fname,
            ).add_to(m)

    m.save(map_path)
    print("Saved map to:", map_path)

m  # show in notebook

# ============================================
# 10. Make interactive map with per-text layers
# ============================================
output_folder = "/content/workdir/output"
csv_path = os.path.join(output_folder, "locations.csv")
map_path = os.path.join(output_folder, "locations_map.html")

points = []
with open(csv_path, "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        lat = row.get("latitude")
        lon = row.get("longitude")
        place = row.get("place_text", "")
        fname = row.get("filename", "")
        order = row.get("order_index", "")
        # skip rows with missing coords
        if not lat or not lon or lat == "None" or lon == "None":
            continue
        try:
            lat = float(lat)
            lon = float(lon)
            order_int = int(order)
        except ValueError:
            continue
        points.append({
            "lat": lat, "lon": lon, "place": place,
            "filename": fname, "order": order_int,
        })

print(f"Loaded {len(points)} point(s) with coordinates.")

if not points:
    print("No valid coordinates in CSV. Stop.")
else:
    # center on mean of all points
    mean_lat = sum(p["lat"] for p in points) / len(points)
    mean_lon = sum(p["lon"] for p in points) / len(points)
    m = folium.Map(location=[mean_lat, mean_lon], zoom_start=4)

    # group points by filename
    paths_by_file = {}
    for p in points:
        paths_by_file.setdefault(p["filename"], []).append(p)

    # one FeatureGroup per text
    for fname, plist in paths_by_file.items():
        plist.sort(key=lambda x: x["order"])
        fg = folium.FeatureGroup(name=fname)

        # markers for this text
        for p in plist:
            popup_text = (
                f"{p['place']}<br>"
                f"file: {p['filename']}<br>"
                f"order: {p['order']}"
            )
            folium.Marker(
                location=[p["lat"], p["lon"]],
                popup=popup_text,
            ).add_to(fg)

        # path line for this text
        if len(plist) >= 2:
            coords = [(p["lat"], p["lon"]) for p in plist]
            folium.PolyLine(
                locations=coords,
                tooltip=fname,
            ).add_to(fg)

        fg.add_to(m)

    # add layer control so you can switch routes on/off
    folium.LayerControl().add_to(m)

    m.save(map_path)
    print("Saved map to:", map_path)

m  # show in notebook

# ============================================
# Convert the Cassini TIFF to PNG and get bounds
# ============================================
import rasterio
from rasterio.plot import reshape_as_image
from PIL import Image
import os

tif_path = "/content/workdir/input/CassiniWorldGlobe1790.tif"  # <-- put your real path here
png_path = "/content/workdir/output/cassini_world.png"
os.makedirs(os.path.dirname(png_path), exist_ok=True)

with rasterio.open(tif_path) as src:
    data = src.read()             # (bands, rows, cols)
    img = reshape_as_image(data)  # (rows, cols, bands)
    left, bottom, right, top = src.bounds

Image.fromarray(img).save(png_path)

# bounds for Leaflet / Folium
cassini_bounds = [[bottom, left], [top, right]]
cassini_center = [(bottom + top) / 2, (left + right) / 2]

print("PNG saved:", png_path)
print("Bounds:", cassini_bounds)
print("Center:", cassini_center)

m = folium.Map(location=cassini_center, zoom_start=2, tiles=None)
folium.raster_layers.ImageOverlay(
    name="Cassini 1790",
    image=png_path,
    bounds=cassini_bounds,
    opacity=1,
    interactive=False,
).add_to(m)

# ============================================
# 11. Make map with per-text layers + Cassini base
# ============================================
output_folder = "/content/workdir/output"
csv_path = os.path.join(output_folder, "locations.csv")
map_path = os.path.join(output_folder, "locations_map.html")

points = []
with open(csv_path, "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        lat = row.get("latitude")
        lon = row.get("longitude")
        place = row.get("place_text", "")
        fname = row.get("filename", "")
        order = row.get("order_index", "")
        if not lat or not lon or lat == "None" or lon == "None":
            continue
        try:
            lat = float(lat)
            lon = float(lon)
            order_int = int(order)
        except ValueError:
            continue
        points.append({
            "lat": lat, "lon": lon, "place": place,
            "filename": fname, "order": order_int,
        })

print(f"Loaded {len(points)} point(s) with coordinates.")

if not points:
    print("No valid coordinates in CSV. Stop.")
else:
    # Use the center from the TIFF you loaded earlier
    m = folium.Map(location=cassini_center, zoom_start=2, tiles=None)

    # Cassini PNG layer
    folium.raster_layers.ImageOverlay(
        name="Cassini 1790",
        image=png_path,
        bounds=cassini_bounds,
        opacity=1,
        interactive=False,
    ).add_to(m)

    # optional modern layer
    folium.TileLayer(
        tiles="OpenStreetMap",
        name="Modern",
        overlay=True,
        control=True,
        show=False,
    ).add_to(m)

    paths_by_file = {}
    for p in points:
        paths_by_file.setdefault(p["filename"], []).append(p)

    for fname, plist in paths_by_file.items():
        plist.sort(key=lambda x: x["order"])
        fg = folium.FeatureGroup(name=fname)
        for p in plist:
            popup_text = (
                f"{p['place']}<br>"
                f"file: {p['filename']}<br>"
                f"order: {p['order']}"
            )
            folium.Marker(
                location=[p["lat"], p["lon"]],
                popup=popup_text,
            ).add_to(fg)
        if len(plist) >= 2:
            coords = [(p["lat"], p["lon"]) for p in plist]
            folium.PolyLine(
                locations=coords,
                tooltip=fname,
            ).add_to(fg)
        fg.add_to(m)

    folium.LayerControl().add_to(m)
    m.save(map_path)
    print("Saved map to:", map_path)

# ============================================
# 12. Make map with per-text layers + Cassini base (distributions map)
# ============================================
output_folder = "/content/workdir/output"
csv_path = os.path.join(output_folder, "locations.csv")
map_path = os.path.join(output_folder, "distributions_map.html")

points = []
with open(csv_path, "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        lat = row.get("latitude")
        lon = row.get("longitude")
        place = row.get("place_text", "")
        fname = row.get("filename", "")
        order = row.get("order_index", "")
        if not lat or not lon or lat == "None" or lon == "None":
            continue
        try:
            lat = float(lat)
            lon = float(lon)
            order_int = int(order)
        except ValueError:
            continue
        points.append({
            "lat": lat, "lon": lon, "place": place,
            "filename": fname, "order": order_int,
        })

print(f"Loaded {len(points)} point(s) with coordinates.")

if not points:
    print("No valid coordinates in CSV. Stop.")
else:
    # use Cassini globe center from the TIFF step
    m = folium.Map(
        location=cassini_center,
        zoom_start=2,
        tiles=None,
        crs="EPSG4326",  # lat/lon grid, matches the globe
    )

    # Cassini PNG layer
    folium.raster_layers.ImageOverlay(
        name="Cassini 1790",
        image=png_path,
        bounds=cassini_bounds,
        opacity=1,
        interactive=False,
    ).add_to(m)

    # colour list for different texts
    colour_list = [
        "red", "blue", "green", "purple", "orange", "darkred",
        "lightred", "beige", "darkblue", "darkgreen", "cadetblue",
        "darkpurple", "white", "pink", "lightblue", "lightgreen",
        "gray", "black", "lightgray",
    ]
    colour_by_file = {}
    colour_index = 0

    paths_by_file = {}
    for p in points:
        paths_by_file.setdefault(p["filename"], []).append(p)

    for fname, plist in paths_by_file.items():
        plist.sort(key=lambda x: x["order"])
        # assign a stable colour for this text
        if fname not in colour_by_file:
            colour_by_file[fname] = colour_list[colour_index % len(colour_list)]
            colour_index += 1
        colour = colour_by_file[fname]
        fg = folium.FeatureGroup(name=fname)
        # only points, no lines
        for p in plist:
            popup_text = (
                f"{p['place']}<br>"
                f"file: {p['filename']}<br>"
                f"order: {p['order']}"
            )
            folium.CircleMarker(
                location=[p["lat"], p["lon"]],
                radius=4,
                popup=popup_text,
                color=colour,
                fill=True,
                fill_color=colour,
                fill_opacity=0.9,
            ).add_to(fg)
        fg.add_to(m)

    folium.LayerControl().add_to(m)
    m.save(map_path)
    print("Saved map to:", map_path)