Background & Summary

With the advent of extensive on-line databases1,2,3 of curated historical information about productivity, population, trade, warfare, technology, etc., the scientific understanding of world-wide historical dynamics of states has made significant progress. For example, recent work has identified important causal regularities in the rise, spread, and fall of complex societies utilizing spatially-explicit models and exploring empirical evidence that incorporated geospatial and temporal information4,5,6,7. More traditional historical investigations have also engaged geospatial information, often concentrating on highly localized maps of particular cultures, or regional maps to highlight the spread of languages8 or growth in inter-regional exchange9,10. Similar efforts in economic history have utilized geospatial boundaries to explore the historical rise and spread of critical productive technologies and institutional packages argued to be fundamental to the development of economic growth11,12. Lacking good historical boundary data, however, these latter efforts often use contemporary geospatial borders, which can mask important political developments in the past and can lead to measurement errors.

As these examples suggest, a wide range of studies across numerous disciplines depend upon and would benefit from a more comprehensive digital encoding of world-wide historical political geographies in time. Facilitating comparative analysis across social, spatial, and temporal bounds, including developing correspondences between disparate geographical datasets, requires that the underlying geo-referenced data be represented as points, lines, and polygons in industry-standard digital encodings, such as shapefiles or GeoJSON files.

Various attempts to construct such digital datasets have focused on a particular period or region of the world, digitizing maps at differing resolutions and sampling intervals13,14,15,16. However, these datasets are not comprehensive, even taken together17. Other efforts have sought to compile point-based data, which tend to be somewhat broader in scope but represent only one particular type of spatial information18,19,20,21. Yet other efforts22 catalog geo-referenced images of historical maps for visual inspection and summary, but these limit automated computation of, for example, relative changes in political areas.

Several efforts, such as GeaCron23 or Running Reality24, have created more comprehensive world-wide digital representations of political entities over time. However, at present, their underlying data are not easily available to scholars. As of this writing they either require a license23,25,26, or they are encoded in a proprietary database scheme with limited options for exporting the data in other formats more amenable to computational analysis19. Further, to our knowledge, the references used in their construction are not available.

Here we describe Cliopatria, a comprehensive open-source geospatial digital dataset of worldwide polities (political units independent of higher authority, which can range from city-states to empires, centralized or not1) from 3400BCE to 2024CE. We describe the construction and contents of the initial version of the database, describe its validation, discuss how it compares to other digital databases, and note limitations and important considerations in its use. Subsequent versions of Cliopatria, which will address these limitations (and any inaccuracies), will follow the established review procedures of the Seshat: Global Historical Databank project1.

Methods

We initially created Cliopatria from a set of composite digital illustrations (map images) originally developed by one of us (AT) in 201427. An extensive record was maintained of the documents used in the image set’s construction; these references, organized by modern state region, are listed in Table 1. The final image set consisted of 508 individual images, each associated with a specific year. An example map image, with its associated legend, is shown in Fig. 1. The complete map image set is available as part of the Cliopatria repository28.

Table 1 References used in the construction of the image set for Cliopatria.
Fig. 1

Example image for 1727CE. A partial legend of polities and their associated color is visible in the upper left corner. Modern interior land boundaries are shown in blue. Colored historical political regions are not aligned with those boundaries.

To create these images, the political boundaries for a given year found in the original source maps, typically in bound volumes, were redrawn by hand, as accurately as possible, onto a common digital base map used by all images. Beginning with the Sumerian city states in 3400BCE, each subsequent image was copied from the immediately preceding image and modified by hand to reflect the incremental changes, additions, and deletions of polities in different geographical regions that the literature documented for the later year. Although various general world atlases29,30,31,32 suggested where political change occurred, more specialized regional sources (cited in Table 1) were consulted to confirm or resolve the detailed changes and to identify plausible and mutually consistent borders of abutting polities. A polity was included in an image when one or more written sources attested to its existence and provided an indication of its location and extent in particular years. As a consequence, certain potential pre-historical polities (e.g., the ‘Xia Dynasty’ prior to 1600BCE) are not included. With rare exceptions (e.g., the Vatican, Singapore, various island states), polities occupy at least 5000 km2 and have a duration of at least 50 years.

The image set began in 3400BCE and ended in 2014CE, but we extended the dataset to 2024CE. The images sample the intervening years irregularly, depending on the information in the original sources and the number of events and major border changes that occurred in a given year. Figure 2 shows the time difference (in years) between successive images. In the earliest periods successive images are a few hundred years apart, but the pace of change accelerates and in later periods images are sometimes only one year apart. Figure 2 thus provides a qualitative picture of periods of relative stability of political boundaries compared with those with more frequent changes.

Fig. 2

Original image sampling intervals. The time difference, in years, between images. Century moving average depicted in red. Note log scale.

Most boundaries not associated with explicit treaties (e.g., the Peace of Westphalia in 1648) are necessarily approximate and subject to differing interpretations, even between text sources and digital repositories. In sparsely populated areas (e.g., nomadic confederations), boundaries were drawn conservatively in an attempt to reflect actual settlement patterns outlined in the sources. Further, political boundaries can change within a polity’s lifetime, typically as the result of documented occupation or treaties, and these are reflected in changes in images in the appropriate year (e.g., the widely-attested expansion of the Roman Republic under Julius Caesar into Hispania and Transalpine Gaul circa 50BCE is documented in images for this and preceding years29). The dataset does not currently encode possible border uncertainty and territorial disputes; this is discussed below.

To create the initial GIS dataset from these images, we developed Python code that converted the hand-colored regions on the images into polygons associated with the names in the accompanying legend. Then, with the generous assistance of researchers at the Seshat Databank project1, we extensively reviewed and hand-edited both the names and polygons and their associations to other datasets, notably Seshat, to form the Cliopatria database.

The original images have several advantages that permit the automated conversion to labelled polygons. First, each image uses an identical background image of the world. Land is marked in grey; ocean and lakes in light blue. The background map indicates coastal boundaries in black, modern internal land borders in blue, and some currently disputed borders in red. The separately colored areas of historical polities, however, are not aligned with these modern interior boundaries. The map uses a (somewhat distorted but corrected) spherical Robinson projection (ESRI:53030), which permitted recovery of the approximate latitude and longitude of each image pixel. The dimensions of the image (2400 × 4800 pixels) provide a resolution of approximately 40 km2 at the equator.

Second, all text is restricted to the legend region in the upper left corner of the image and is not embedded in the world map itself. As a rule, the introduction of a new entity is announced in the legend, associated with a small rectangle of its color.

The uniformity of the background image permitted automated expansion of the entity color into adjacent inner border pixels. Initial polygons of uniform colors not associated with the background map were retrieved from the modified raster image. Certain small artifacts (of different colors) resulting from the original illustration process were identified and either associated with a related color (and hence entity) or were removed from the image.
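
The step of retrieving uniformly colored regions from the raster can be illustrated with a minimal sketch. This is not the conversion code used to build Cliopatria, which is not reproduced here; it is a plain flood fill over a toy grid, and the function name `color_regions`, the single-character color codes, and the choice of 4-connectivity are all assumptions made for illustration.

```python
from collections import deque

def color_regions(image, background):
    """Group same-colored, 4-connected pixels into regions.

    image: sequence of rows, each a sequence of color values.
    background: colors belonging to the base map, which are skipped.
    Returns a list of (color, set of (row, col)) in scan order.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            color = image[r][c]
            if seen[r][c] or color in background:
                continue
            # Breadth-first flood fill from this unvisited colored pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = set()
            while queue:
                y, x = queue.popleft()
                pixels.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and image[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((color, pixels))
    return regions

# Toy image (hypothetical): "G" is background, "R" and "B" are entity colors.
toy_image = [
    "GGGGG",
    "GRRGB",
    "GRGGB",
    "GGGGG",
    "BGGGG",
]
regions = color_regions(toy_image, background={"G"})
```

In the toy image the color "B" yields two separate regions, which mirrors how one entity color can correspond to several polygons. Tracing each pixel region into a georeferenced polygon is a further step not shown here.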

Although the initial automated production of polygons from raster images yielded serviceable results, the distortions of the background image and the relatively coarse resolution of the images sometimes yielded polygons that are not aligned with coastal and land region datasets. Further, to eliminate border artifacts from the raster-based images, we automatically smoothed the resulting polygons and their shared borders to a 0.07° resolution. Subsequent releases will address these alignment and resolution issues.

To associate an initial set of entity names with their accompanying colored polygons, we parsed the legend region with optical character recognition (OCR) using the Tesseract library33, retrieving the text associated with each colored rectangle. The OCR process was largely successful but required detailed review and hand-editing to correct parsing artifacts (as when letters were distorted where they overlapped map boundaries) and to restore special characters. The legend area itself is constrained and did not always permit the listing of all the name or color changes in an image. Thus, the initial OCR legend data structure was subsequently edited by hand to add missing polities or disambiguate the names of polities in different regions.

The legend organized the world (and the polities) into four broad regions: Western Eurasia, Eastern Eurasia, Africa, and North/South America. While the location and extent of the latter two regions were clear, there was no clear boundary between Western and Eastern Eurasia. This led to some initial automated mis-assignment of names to polygons, largely in Eastern Europe and in the Transcaucasian region. These were reviewed and corrected by hand.

Each polity was assigned one of 1194 unique colors, with images infrequently reusing the same color for different polities that existed at the same time in different regions of the world or at different times. Because colors were reused at different times and different regions of the world, it was possible for the initial automated process to mislabel polygons. For example, both the Chinese Jin and the Near-Eastern Neo-Assyrian polygons share the same color in the 750BCE image. However, the Neo-Assyrian polygons are partially in the Eastern Eurasia region, which led to automatically (mis)labeling these polygons as ‘Jin’ (or vice versa). To identify these issues we projected each Eurasian polity’s polygons individually (‘by name’) inspecting whether their extent over all their image years was consistent with the historical record; in the example above we would have found that the ‘Jin’ had an erroneous Near-Eastern presence, which was then corrected.

Certain polity names are reused in history at different times, e.g., ‘Jin’ refers to several Chinese states and dynasties over several millennia. Where possible we used known historical names utilized by experts in the relevant historical field to distinguish the different polities (e.g., ‘Western Jin’ from ‘Former Jin’).

In addition to polities, the original images captured the occupation of territories by various leaders (e.g., Julius Caesar, Tokugawa Ieyasu), armies (e.g., the New Model Army, the Red Army), groups (e.g., the English Royalists and Parliamentarians) and the location of certain events (e.g., the Taiping Rebellion, the Sepoy Rebellion). We were often able to associate these entities with a particular polity (e.g., associating Harald Fairhair as the leader of the Old Kingdom of Norway from 866CE to 870CE). Where this was possible, we included the territory as part of the associated polity; otherwise the territory was not included in the current release of Cliopatria, pending further review.

Figure 3 shows the number of polities recorded per image. Substantial changes in the number of polities over a short period of time reflect both sampling choices and the dynamics of empires absorbing and then releasing independent polities over their lifetimes.

Fig. 3

Number of entities depicted in each dataset year. The data reflect both the overall historical increase in the number of polities and varying sampling choices about the scale of polities to be included. A substantial drop in polities typically reflects the expansion and occupation of states by a larger empire; a jump in polities reflects the creation or independence of states after the collapse of an empire. Notable examples of these dynamics are included (red lettering and arrows).

For the initial release we have confined the database to the years recorded in the original image set and have largely respected the original choice of spatial and temporal resolution of political entities, which varies by region and the availability of original maps. Upon review, we improved the names of political entities and we sought to improve the representation of entities of certain areas, notably the Indian subcontinent. These improvements were based on expert historical knowledge provided by Seshat researchers. Subsequent releases will relax these constraints as additional suggestions, reviews, and investigations reflect, in accordance with standard Seshat review procedures1, modifications that increase historical accuracy and capture disputes and uncertainty.

Data Records

The Cliopatria dataset is publicly available in a Zenodo repository28.

Cliopatria is distributed as a single data file, ‘cliopatria.geojson’. This file currently consists of approximately 15 K records structured as shown in Table 2. Data for each entity (e.g., ‘Roman Empire’) is contained in one or more rows, depending on how the associated data about the entity changes. Each row reports the Name of the entity, its polygons (geometry, projection EPSG:4326), and that geometry’s Area (in km2 using equal-area projection EPSG:6933).
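
As a rough illustration of how an area in km2 can be derived from longitude/latitude polygons, the sketch below projects a single ring with a cylindrical equal-area projection and applies the shoelace formula. It is a simplification under stated assumptions: it treats the Earth as a sphere of mean radius, whereas EPSG:6933 is defined on the WGS84 ellipsoid, so results will differ slightly from the Area column. The function name `ring_area_km2` is hypothetical, and holes and multi-part geometries are not handled.

```python
import math

# Mean Earth radius in km; using a sphere is an assumption of this sketch.
EARTH_RADIUS_KM = 6371.0088

def ring_area_km2(ring):
    """Approximate area enclosed by a ring of (lon, lat) pairs in degrees.

    Each vertex is mapped to a cylindrical equal-area plane
    (x = R * lon, y = R * sin(lat)) and the planar shoelace formula is applied.
    """
    points = [
        (EARTH_RADIUS_KM * math.radians(lon),
         EARTH_RADIUS_KM * math.sin(math.radians(lat)))
        for lon, lat in ring
    ]
    twice_area = 0.0
    # Pair each vertex with its successor, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

# A one-degree cell touching the equator, as a closed GeoJSON-style ring.
cell = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
cell_area = ring_area_km2(cell)
```

For production use, a GIS library that implements EPSG:6933 directly would be the safer choice; the point here is only that areas must be computed in an equal-area projection rather than from raw degrees.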

Table 2 Example entity database entries.

Each row indicates the range of years, from FromYear to ToYear, to which the associated row data applies. Years are recorded as integers, negative for BCE, positive for CE. Data, including polygons, for any entity for any year (not just original image years) between 3400BCE and 2024CE can be obtained by finding the row (if any) containing the Name of the entity where the year of interest is between the row’s FromYear and ToYear, inclusive.
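
The lookup rule just described can be sketched in a few lines. The rows below are hypothetical placeholders that reuse the column names from the text (Name, FromYear, ToYear, Area); they are not actual Cliopatria records, and `find_row` is an illustrative helper rather than part of any released tooling.

```python
from typing import Optional

# Hypothetical toy rows; the values are placeholders, not real records.
rows = [
    {"Name": "Polity A", "FromYear": -500, "ToYear": -401, "Area": 120000.0},
    {"Name": "Polity A", "FromYear": -400, "ToYear": -250, "Area": 180000.0},
    {"Name": "Polity B", "FromYear": -450, "ToYear": 100, "Area": 90000.0},
]

def find_row(rows: list, name: str, year: int) -> Optional[dict]:
    """Return the row for `name` whose inclusive year range contains `year`."""
    for row in rows:
        if row["Name"] == name and row["FromYear"] <= year <= row["ToYear"]:
            return row
    return None  # the entity has no record covering that year

match = find_row(rows, "Polity A", -400)   # second "Polity A" row
missing = find_row(rows, "Polity A", -200)  # None: outside every range
```

When reading the GeoJSON file itself, one would presumably take each feature's `properties` object as a row, assuming the columns are stored as feature properties.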

Each row also records an associated Wikipedia page (phrase) describing the entity in those years; the latter URL can be composed by embedding the phrase in “http://en.wikipedia.org/<phrase>”. For certain polities in particular years, an associated Seshat polity id (SeshatID) may be provided; access to the structured data about that polity can be found via the URL “http://seshat-db.org/core/polity/<polity_id>”.
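
A minimal sketch of composing these two URLs follows. The templates are copied from the text above; the helper names are hypothetical, and percent-encoding the phrase with `urllib.parse.quote` assumes the stored phrase is unencoded text, which should be checked against the data.

```python
from urllib.parse import quote

# URL templates as given in the text.
WIKIPEDIA_TEMPLATE = "http://en.wikipedia.org/{phrase}"
SESHAT_TEMPLATE = "http://seshat-db.org/core/polity/{polity_id}"

def wikipedia_url(phrase: str) -> str:
    """Embed a Wikipedia phrase in the template, escaping unsafe characters."""
    return WIKIPEDIA_TEMPLATE.format(phrase=quote(phrase))

def seshat_url(polity_id: str) -> str:
    """Embed a Seshat polity id in the template."""
    return SESHAT_TEMPLATE.format(polity_id=quote(polity_id))

# Placeholder inputs, not values taken from the dataset.
example_wiki = wikipedia_url("Example_Page")
example_seshat = seshat_url("example_id")
```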

In addition to associating an entity with a Seshat polity, some polities were parts of a larger political entity (e.g., the British Raj in India from 1859CE to 1947CE was part of the British Empire); thus polities can also have an associated (supra-) polity. Information about these associated entities is used to form composite polities, which are denoted in the database by enclosing their name in parentheses, e.g., “(Roman Empire)”. In addition, Seshat records some intra-polity relations, such as personal unions and political allegiances, which are also represented in the database as composites. Each entity, for a range of years, will list the composite entries it contributes to, if any, under the MemberOf column; each composite entity will list the member entities that contribute to it under Components. Polygons in geometry for composite entities duplicate those of their members. Examples of associated polity information and some resulting composites are shown in Table 2.
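
Because composite geometries duplicate those of their members, aggregate computations need to avoid counting the same territory twice. The sketch below shows one way to do that using the parenthesized-name convention; the toy rows and helper names are hypothetical, and the code does not rely on the MemberOf or Components columns.

```python
# Hypothetical toy rows; "(Union AB)" stands in for a composite entity.
rows = [
    {"Name": "Polity A", "FromYear": 100, "ToYear": 200, "Area": 50000.0},
    {"Name": "Polity B", "FromYear": 150, "ToYear": 300, "Area": 70000.0},
    {"Name": "(Union AB)", "FromYear": 150, "ToYear": 200, "Area": 120000.0},
]

def is_composite(name: str) -> bool:
    """Composite entities are marked by a name enclosed in parentheses."""
    return name.startswith("(") and name.endswith(")")

def total_member_area(rows: list, year: int) -> float:
    """Sum Area over non-composite rows covering `year`, avoiding duplicates."""
    return sum(
        row["Area"]
        for row in rows
        if not is_composite(row["Name"])
        and row["FromYear"] <= year <= row["ToYear"]
    )

area_175 = total_member_area(rows, 175)  # Polity A + Polity B only
```

Whether to analyze member polities or their composites depends on the research question; the filter simply makes the choice explicit.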

As noted, rows for an entity are added whenever any associated data for the entity changes; typically this happens because the spatial extent of the entity changes over time. There are, however, occasions when a (typically small) polity (e.g., the County of Navarre) is temporarily incorporated into a larger polity (e.g., the Kingdom of France) only for that larger polity to then shrink or collapse and expose the original polity once again. Thus there may be multiple rows for an entity with substantial gaps between the years recorded.
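
Such gaps can be listed with a short helper, sketched below under two assumptions: that the rows for a single entity do not overlap in time, and that consecutive integer years are adjacent (how the BCE/CE boundary is numbered is not addressed here). The rows and the name `coverage_gaps` are hypothetical.

```python
# Hypothetical toy rows for one entity with an interruption in coverage.
rows = [
    {"Name": "Polity C", "FromYear": 900, "ToYear": 1000},
    {"Name": "Polity C", "FromYear": 1200, "ToYear": 1300},
    {"Name": "Polity C", "FromYear": 1301, "ToYear": 1350},
]

def coverage_gaps(rows: list, name: str) -> list:
    """Return (first_missing_year, last_missing_year) pairs for `name`."""
    spans = sorted(
        (row["FromYear"], row["ToYear"]) for row in rows if row["Name"] == name
    )
    gaps = []
    # Compare the end of each span with the start of the next one.
    for (_, prev_to), (next_from, _) in zip(spans, spans[1:]):
        if next_from > prev_to + 1:
            gaps.append((prev_to + 1, next_from - 1))
    return gaps

gaps = coverage_gaps(rows, "Polity C")
```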

Technical Validation

We validated the database largely by visual inspection and comparison against both the original and additional map images. We reviewed the image start and stop years for different entities with the original sources and with other databases, notably Seshat1. We also prepared various statistical summaries of the Cliopatria dataset to compare against previous such computations.

For example, in 1978 Taagepera34 prepared several extensive tables and figures based on his hand measurements of polity area from physical maps. We prepared equivalent tables and figures from our dataset; see Fig. 4 and Tables 3, 4, and 5. Our results are similar to Taagepera’s except that our database lists more steppe nomadic empires and those tend to replace his candidates for the largest empires during the Medieval period.

Fig. 4

Size of polities versus time. The lower curve represents the area of the single largest polity. The upper curve is the sum of the areas of the three largest polities. Compare data to Taagepera34 Figs. 1 and 2. There is no essential difference. Maximum land area of the Earth is 133 M km2, excluding Antarctica; note log scale.
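
The two curves in Fig. 4 correspond to a simple per-year computation, sketched below on hypothetical toy rows. The function name is illustrative, and the sketch assumes the rows passed in have already been restricted to the entities of interest (for example, with composite entities either kept or removed, depending on the analysis).

```python
# Hypothetical toy rows; areas are placeholders in km2.
rows = [
    {"Name": "Polity A", "FromYear": 100, "ToYear": 200, "Area": 5.0e6},
    {"Name": "Polity B", "FromYear": 150, "ToYear": 300, "Area": 3.0e6},
    {"Name": "Polity C", "FromYear": 100, "ToYear": 180, "Area": 1.0e6},
    {"Name": "Polity D", "FromYear": 120, "ToYear": 400, "Area": 0.5e6},
]

def largest_and_top_three(rows: list, year: int) -> tuple:
    """Return (largest area, sum of the three largest areas) for `year`."""
    areas = sorted(
        (row["Area"] for row in rows
         if row["FromYear"] <= year <= row["ToYear"]),
        reverse=True,
    )
    largest = areas[0] if areas else 0.0
    return largest, sum(areas[:3])

at_160 = largest_and_top_three(rows, 160)  # all four polities exist
at_250 = largest_and_top_three(rows, 250)  # only Polity B and Polity D
```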

Table 3 Areas of the World’s Three Largest Empires before 0CE.
Table 4 Areas of the World’s Three Largest Empires after 0CE.
Table 5 The 20 Largest Empires of States That Ever Existed.

Bennett7 observed that the dramatic increase of polity size after 500BCE (also identified by Taagepera34) was not associated with an increase in polity duration. This pattern is confirmed by our more extensive database. Figure 5 shows the size and duration distribution of nearly 700 large-scale, long-duration polities over the 5400 years between 3400BCE and 2024CE. The very large polities exceeding 20 M km2 that arose after 1500CE (e.g., the British Empire) can clearly be seen but, again, the duration of most polities, including the largest, is just a few centuries.

Fig. 5

The distribution of the size and duration of polities from 3400BCE to 2024CE. Size in millions of km2; duration in centuries. Includes only polities that reach at least 100 K km2 and last for at least 50 years. Compare to Bennett7, their Fig. 5(A).
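
A selection like the one in this caption could be reproduced along the following lines. The thresholds (100 K km2 and 50 years) come from the caption; everything else is an assumption of the sketch, in particular that a polity's size is its maximum recorded Area and that its duration runs from its earliest FromYear to its latest ToYear, inclusive. The rows and function name are hypothetical.

```python
# Hypothetical toy rows; values are placeholders.
rows = [
    {"Name": "Polity A", "FromYear": -300, "ToYear": -201, "Area": 80000.0},
    {"Name": "Polity A", "FromYear": -200, "ToYear": -50, "Area": 150000.0},
    {"Name": "Polity B", "FromYear": 10, "ToYear": 40, "Area": 500000.0},
    {"Name": "Polity C", "FromYear": 500, "ToYear": 900, "Area": 60000.0},
]

def large_long_lived(rows: list, min_area: float = 100_000.0,
                     min_years: int = 50) -> dict:
    """Map each qualifying polity name to (max area, duration in years)."""
    stats = {}
    for row in rows:
        first, last, max_area = stats.get(
            row["Name"], (row["FromYear"], row["ToYear"], 0.0))
        stats[row["Name"]] = (
            min(first, row["FromYear"]),
            max(last, row["ToYear"]),
            max(max_area, row["Area"]),
        )
    selected = {}
    for name, (first, last, max_area) in stats.items():
        duration = last - first + 1  # inclusive year count (assumption)
        if max_area >= min_area and duration >= min_years:
            selected[name] = (max_area, duration)
    return selected

selected = large_long_lived(rows)
```

In the toy data only "Polity A" qualifies: "Polity B" is large but short-lived, and "Polity C" is long-lived but small.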

For those Cliopatria polities with associated Seshat database entries, we compared the area of Cliopatria’s entity polygons with Seshat polity territory data, if any, for the specific image years. The Seshat data comprise previously-collected, independent estimates of polities verified by historians and thus can serve as an indication of the variance in sizes present in Cliopatria. (Of course this comparison does not address the specific location and boundary extents for a polity; see below). The results of the comparison are shown in Fig. 6. The match is very good with a nearly 1:1 linear fit explaining nearly 90% of the variance, increasing our confidence in both datasets. We expect this value to increase as discrepancies are investigated. Indeed, performing this comparison identified several recent mis-coded entities in Seshat which had not yet been as thoroughly checked as older data and which are now corrected in Seshat. This demonstrates that Cliopatria, even in its early stages, is able to draw attention to discrepancies between databases leading to resolution. Further, research35 using the Seshat data has demonstrated that polity territory is a key proxy for social scale. The increased quantity and quality of the Cliopatria area data and its higher temporal resolution will permit more comprehensive investigations into the historical dynamics of social scale.

Fig. 6

Comparison of unique paired Cliopatria and Seshat territory size estimates. Note log scales on both axes. Linear fit is shown in orange.
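
For readers who wish to run a comparable check, the sketch below fits a least-squares line to paired area estimates and reports the share of variance explained. Fitting in log10 space is an assumption suggested by the log-scaled axes, not a documented detail of the analysis above; the function name and the input values are hypothetical.

```python
import math

def loglog_fit(xs: list, ys: list) -> tuple:
    """Ordinary least squares of log10(y) on log10(x).

    Returns (slope, intercept, r_squared).
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Placeholder paired estimates in km2 (identical pairs give a perfect 1:1 fit).
areas_a = [1.0e3, 1.0e4, 1.0e5, 1.0e6]
areas_b = [1.0e3, 1.0e4, 1.0e5, 1.0e6]
slope, intercept, r_squared = loglog_fit(areas_a, areas_b)
```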

We also selectively compared Cliopatria’s entity polygons against several other available historical geospatial databases, both as a validation check of the Cliopatria records against previously-released resources and as an indication of discrepancies between Cliopatria and different sources. Overall we find that while Cliopatria is at least comparable in data quality and coverage to other available databases, and often surpasses them in scope and comprehensiveness, there are gaps and disagreements between encodings in certain regions, especially in the existence and extent of Eurasian nomadic steppe empires.

For example, Fig. 7 shows the polygons associated with the Avar Khaganate around 600CE for three different databases, including Cliopatria. While they all agree the Avars in this period occupied much of modern Romania and Hungary, their extent into modern Ukraine and Poland varies widely. This example shows that disagreements between databases can be substantial owing, no doubt, to the underlying procedures and sources referenced in their construction. The prevalence and magnitude of uncertain border locations of historical entities tends to increase into the past. Further, the apparent stability of borders of ancient polities over hundreds of years (as with Old, Middle, and New Kingdom Egypt) may reflect limited historical records rather than the actual stability of the state itself. This is typical of working with historical data, which is often fundamentally uncertain based on the simple paucity of records in addition to differing underlying concepts of border and control.

Fig. 7

Comparison of the Avar Khaganate around 600CE. Putative extent of the Avars according to three different datasets. Background image from Google Earth showing modern state boundaries.

Many of the original source maps, which are themselves drawn by hand, employ different cartographic display techniques (e.g., stipple patterns, blurred edges, etc.) to suggest both the uncertain extent and location of some (but not all) borders. However, every known digital encoding of historical polity data, including Cliopatria, uses industry-standard digital graphical primitives (raster encodings or polygons formed by latitude and longitude pairs) that are unable, by themselves, to capture this uncertainty in a form that could inform display techniques or analytic computations. Further, there is neither consistent discourse among historians about specific historical border uncertainties nor clear estimates of their location or rough magnitudes over time.

In spite of our attempts to reflect the most current historical knowledge, we acknowledge that Cliopatria’s representation of world history reflects only one version of the territories held by past polities. Thus, we warn users that currently unquantified uncertainty exists and that they may need to account for it in their analyses. We hope the availability and improvement of Cliopatria by the scholarly community will yield both improved borders based on documented input from historians and some broadly acceptable encodings of any residual uncertainty or disputes suitable for different computations, even if they are simply explicit alternative representations of the same polities. Indeed, one of the primary motivations of compiling the Cliopatria dataset and providing it as open-source material is to foster such productive, collaborative dialogue with other users and makers of historical geo-spatial information.