Introduction

Advances in informatics have made surname data from population datasets increasingly available, allowing the analysis of large population groups at the country level (Rodríguez-Larralde et al., 1998; Barrai et al., 2000; Rodríguez-Larralde et al., 2000; Rodríguez-Larralde et al., 2003; Barrai et al., 2004; Dipierri et al., 2005; Manni et al., 2005; Dipierri et al., 2011; Rodríguez-Larralde et al., 2011; Longley et al., 2011; Cheshire et al., 2011; Carrieri et al., 2020) and at the continent level (Scapoli et al., 2007; Cheshire et al., 2011). Traditionally, voter records and telephone directories served as databases (Rodríguez-Larralde et al., 2003; Cheshire et al., 2011; Dipierri et al., 2016; Carrieri et al., 2020); more recently, population censuses have been used (Rodríguez-Díaz et al., 2015, 2017; Posch et al., 2024). These represent an excellent source of information for population studies and even for studying the socioeconomic implications of population structure (Posch et al., 2024). Even street names have been used as a source of information for sociological analysis (Creţan and Matthews, 2016). Studies of this type have mainly been carried out by anthropologists and geographers for different purposes, but always in the expectation that a large sample size will minimize possible deviations in the use of surnames as an estimator (Barrai et al., 2000; Rodríguez-Larralde et al., 2003).

When analyzing populations, many conclusions rest on the assumption that each surname arose in a single location and can therefore be used as a monophyletic marker (Rodríguez-Larralde et al., 2003; Manni et al., 2005). Since the first doubts were cast (Rogers, 1991), this assumption has remained the most controversial aspect of isonymy; nevertheless, the most recent studies seem to support the reliability of the method (Sykes and Irven, 2000; Gagnon and Heyer, 2001; Esparza et al., 2006; King et al., 2006; Boattini et al., 2007; Lisa et al., 2007; Mateos, 2007; King and Jobling, 2009a; King and Jobling, 2009b; Alvarez et al., 2010; Rodríguez-Díaz and Blanco-Villegas, 2010; Balanovskaia et al., 2011; Longley et al., 2011; Liu et al., 2012; Dipierri et al., 2016; Toledo et al., 2017; Carrieri et al., 2020; Kamel et al., 2023). Thus, isonymy, which currently has ample bibliographic support, is an excellent, fast, cheap and reliable alternative for the study of human populations, provided that the surnames used are properly selected: even when it is feasible, analyzing a complete population record (including all the surnames) is not always desirable (Cheshire, 2014).

In this sense, Manni et al. (2005) proposed the application of a data mining technique (Self-Organizing Maps) to large biodemographic databases. The aim was to analyze these databases without resorting to genealogical records, in order to unravel historical population processes because, even today, the scarcity of data and information remains a major challenge for research on historical population migrations (Fan et al., 2023). The technique allowed the identification of groups of surnames with the same origin; in other words, it became possible to distinguish the autochthonous names of each zone, also enabling discrimination between monophyletic and polyphyletic surnames. In this way, it was possible to use monophyletic surnames as reliable markers (Manni et al., 2005; Boattini et al., 2010; Rodríguez-Díaz and Blanco-Villegas, 2010; Boattini et al., 2012; Rodríguez-Díaz et al., 2015, 2017; Kamel et al., 2023). The work carried out to date endorses the reliability of the method. Nevertheless, the databases employed had limits that reduced the reach of the conclusions drawn. In some cases, the size of the database was limited owing to the small size of the geographic area addressed (Boattini et al., 2010; Rodríguez-Díaz and Blanco-Villegas, 2010), while in others the surnames had only recently been established there (Manni et al., 2005). The methodology was validated only recently, using the Italian surnames dataset (Boattini et al., 2012). In that work, the validity of the method was checked by comparing the origin identified for each surname with pre-existing databases, with excellent results (Boattini et al., 2012).
However, we are not aware of any large-scale attempt to apply the results obtained from individual surname studies to the demographic analysis of broad regional geographies. Such an application would allow polyphyletic surnames to be excluded from the analyses, which, as Cheshire (2014) indicates, could be of great value. To complete the methodology, it therefore remains to apply it to a specific population and to demonstrate its capacity to reveal the historical internal population movements that have determined the current population structure.

The population chosen here for such purposes was the Spanish one, which is ideal for this type of study. For centuries, the influence of Spain on other countries has been very important, both in Europe and in South America (Mateos and Tucker, 2008). This influence is mainly reflected in the widespread presence of the Spanish surname system in Latin America (Rodríguez-Larralde et al., 2000; Dipierri et al., 2005, 2011; Carrieri et al., 2020), although not only in those countries (Scapoli et al., 2007; Cheshire et al., 2011). Because Spanish surnames are so widespread, and because the inherited Spanish surname system was assimilated into the surname systems of native populations, the methodological and population conclusions derived from this study are broadly applicable to the study of other populations.

In addition, the Spanish population has been subject to certain very special conditions. Geographically, the country is located at the southern end of Europe and is isolated from the rest of the continent by the Pyrenees. So much so that, until well into the twentieth century (1986), external migration was a constant in the history of Spain (Valero-Matas et al., 2014). This isolation has been further exacerbated by prominent orographic contrasts (Bycroft et al., 2019), Spain having a particularly complicated geophysical relief compared with other European countries. This high diversity is also seen in the linguistic field: within the Spanish population there are also different official regional languages and their variants (García, 2007; Goebl, 2010). All these conditions have placed the Spanish population under constant pressure from a variety of factors that have culminated in a particularly conserved structure (Adams et al., 2008; Rodríguez-Díaz et al., 2017). The current genetic diversity is the result of events and linguistic structures deeply rooted in the historical past of the Iberian Peninsula (Bycroft et al., 2019).

Both characteristics of the Spanish population—an inherited surname system widely represented around the world and a particularly well-preserved population structure—make the Spanish population a subject of special interest for a study such as ours.

Given all these conditions, and given that the literature on internal migration in Spain is not very conclusive, other authors (Maza et al., 2019) have used the Spanish case as a kind of experimental laboratory to analyze internal migration in very recent times. This supports the decision to apply this novel methodology to the analysis of historical migrations.

We aim to address the question of whether the methodology proposed by Manni et al. (2005) for identifying the origin of surnames can be useful for analyzing large-scale internal movements within populations and gaining an in-depth understanding of their history and population structure. If satisfactory results are obtained by applying this methodology to the Spanish population, this would validate a new approach for studying migratory movements in populations where genealogical information is nonexistent, inaccessible, or unmanageable. Additionally, it would establish a study protocol applicable to a surname system widely distributed worldwide. Both of these factors would make the potential results broadly generalizable.

Materials and methods

Study area

Spain is located on the Iberian Peninsula at the extreme south-west end of Europe. It has a surface area of 504,645 km² and is surrounded by the sea to the north, south and east. It borders Portugal in the west and France in the north.

Geophysically, Spain is located at a considerable altitude above sea level, with a mean of 660 m. The territory can be considered mountainous in comparison with other European countries.

The population of Spain is currently 47 million people, distributed unevenly across the territory and concentrated mainly in the coastal areas, leaving the interior with a low population density (with the exception of Madrid, the administrative capital).

From the administrative point of view Spain is organized in 15 Regional Autonomies and 47 Provinces. The official language is Spanish but other co-official languages are also spoken in some areas (Catalonia, Galicia and the Basque country).

In populations with such cultural and demographic complexity (Fig. 1), working with surnames offers a clear advantage. Surnames are passed down from one generation to the next within families through a process of inheritance, allowing them to be used as markers for population and lineage. At the same time, however, a surname is ultimately just a word. This means that in diverse populations like Spain, where cultural diversity is well-preserved in terms of language, surnames can directly reflect this linguistic and cultural diversity.

Fig. 1

Provincial map and distribution of co-official languages. Provincial map of the national Spanish territory. It shows the distribution of each co-official language and the identifying number of each province, which will be referenced going forward.

Databases

In the Spanish system of surname transmission, individuals inherit two surnames (Mateos and Tucker, 2008). Everybody inherits the father’s first surname (which becomes their first surname) and the first surname of the mother (which is the individual’s second surname; e.g., Nicolás Fernández García, where the father’s surname is Fernández and the mother’s is García). Since both surnames of each individual were available, the database was constructed using both the first and second surnames, a choice that according to some authors (Pettener et al., 1998; Colantonio et al., 2003; Dipierri et al., 2011; Rodríguez-Larralde et al., 2011; Barrai et al., 2012; Carrieri et al., 2020) doubles the amount of information and contributes to the robustness of the analysis, given that the differences observed on various occasions (as in the present study) between the distributions of the first and second surnames have always been negligible. This is entirely expected if we consider that the surname inherited through the maternal line is actually the surname inherited through the paternal line of the previous generation.

The data on surnames were provided by the Spanish National Statistics Institute (INE) and came from the 2008 census. The database included all the surnames of each municipality as long as they appeared a minimum of five times. The initial database included 56,976,706 entries, corresponding to 87,148 different surnames. The INE database contains a vast amount of information corresponding to the entire Spanish population in 2008, but it has certain limitations. Firstly, it does not include all Spanish surnames; those that are not repeated at least five times in a single municipality are excluded, even though they represent a small portion of the information. Secondly, the database provides a snapshot of the population at a specific point in time, meaning it includes the entire population at that moment without providing information about the bearers of each surname. This characteristic presents certain problems; for instance, we have no way of knowing if all the bearers of a surname are adults or if they represent a single individual and their descendants, indicating a single lineage.

Data correction and treatment

The initial database was revised meticulously. We found repeated surnames, different spellings of the same surname, spaces between words, spelling errors and compound forms. All these faults are a serious problem for statistical processing. To avoid such drawbacks, all surnames were checked against bibliographic (Faure et al., 2001; Solís, 2002) and cartographic sources and were corrected as many times as necessary.

In the next step, we removed all the surnames that did not appear a minimum of 20 times in the database in order to avoid excessive noise in the statistical procedures (Manni et al., 2005; Boattini et al., 2012). In all, once the data had been treated, 51,419,788 records remained (33,753 different surnames). By statistical noise we mean, for example, the potential influence of surnames that have appeared in Spain through recent immigration. Two characteristics allow such surnames to be identified. On the one hand, surnames historically established in Spain often show Castilianized spellings, whereas surnames from recent immigration tend to keep the spellings used in other countries. On the other hand, the geographical distribution of historically established surnames tends to follow clear dispersion patterns, whereas surnames from more recent immigration appear in scattered populations and at low frequencies. In this way, we ensure that we are working exclusively with surnames that are historically established and representative of the Spanish population.

Data processing

After data treatment, a double-entry matrix was created in which the rows (i) corresponded to each surname and the columns (j) to each province. Accordingly, each cell (ij) contained the frequency of surname i in the total population of province j.

We then performed a transformation of the frequencies in two steps (Boattini et al., 2012; Rodríguez-Díaz et al., 2015, 2017):

In the first step, we attempted to prevent the smallest populations from having excessive weight. To accomplish this, we used the expression:

$$f_{ij}=\frac{\mathit{fabs}_{ij}}{\log (\mathit{pop}_{j})}$$

where \(\mathit{fabs}_{ij}\) is the absolute frequency of surname “i” in province “j”, and \(\mathit{pop}_{j}\) is the total population of province “j”.

In the second step, we tried to avoid surname grouping as a function of how numerous they were, using the expression:

$$\mathit{wf}_{ij}=\frac{f_{ij}}{\sum_{j} f_{ij}}$$

where \(f_{ij}\) is the result of the previous expression and the sum runs over all provinces for a given surname.
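The two-step transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the toy counts are hypothetical, the natural logarithm is assumed (the base is not stated), and the sum in the second step is assumed to run over provinces for each surname.

```python
import math

def transform_frequencies(fabs, pop):
    """Two-step transform of a surname x province table of absolute counts.

    fabs: list of rows, one per surname; fabs[i][j] = bearers of surname i in province j
    pop:  list of province population sizes; pop[j]
    """
    weighted = []
    for row in fabs:
        # Step 1: damp the weight of small provinces by dividing by log(pop_j).
        # Natural log is an assumption; the base is not given in the text.
        f = [count / math.log(p) for count, p in zip(row, pop)]
        # Step 2: rescale the surname's row to sum to 1 (assumed: sum over provinces),
        # so that very common and very rare surnames become comparable.
        total = sum(f)
        weighted.append([x / total if total > 0 else 0.0 for x in f])
    return weighted

# Toy data (hypothetical): 2 surnames, 3 provinces.
fabs = [[900, 50, 50], [9, 0, 1]]
pop = [1_000_000, 200_000, 50_000]
wf = transform_frequencies(fabs, pop)
```

Under this reading, the choice of logarithm base has no effect on the final values: changing the base multiplies every first-step value by the same constant, which cancels in the second step.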

Grouping of surnames

The surnames were grouped as a function of their geographic distribution using a Cluster-type data mining procedure, the self-organizing maps of Kohonen, or SOM (Kohonen, 1982; 1984).

SOMs are unsupervised neural learning networks that perform statistical pattern recognition. Here they were used to recognize the patterns in the geographic distribution of the surnames. This procedure allows the surnames to be grouped as a function of their distribution and permits their origins to be identified. Its application in the field of biodemography was developed by Manni et al. (2005), and it is a methodology that has been tested (Boattini et al., 2012) and found to afford good results (Boattini et al., 2010; Rodríguez-Díaz and Blanco-Villegas, 2010; Rodríguez-Díaz et al., 2015, 2017).

In our case, the software used was the “kohonen” package for R (Wehrens and Buydens, 2007). With this software we classified the surnames in a rectangular matrix whose size had to be decided. The matrix should be neither so large that interpretation becomes impossible nor so small that the resulting groups are not representative. After testing different sizes, the criterion adopted here was to use the smallest matrix in which an empty cell appears (Boattini et al., 2011). The appearance of empty cells indicates that all the groups are already representative. In the present case this was a matrix 17 cells wide (17 × 17 = 289 cells), with 1000 repetitions.

In sum, the SOM consisted of an input layer of 33,753 vectors (one vector per surname) and an output layer of 289 cells (groups of surnames with a similar geographic distribution).
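To make the grouping step concrete, the following is a minimal, pure-Python sketch of a self-organizing map. It is not the “kohonen” R package used in the study, and it is far too slow for 33,753 surnames on a 17 × 17 grid. The function names, the learning-rate and radius schedules, and the toy input are all assumptions chosen for illustration.

```python
import math
import random

def best_matching_unit(codebook, v):
    """Index of the cell whose codebook vector is closest to v (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda c: sum((w - x) ** 2 for w, x in zip(codebook[c], v)))

def train_som(vectors, side, epochs, seed=0):
    """Train a side x side SOM with online updates and a shrinking neighbourhood."""
    rng = random.Random(seed)
    # Initialise every cell from a randomly chosen input vector (copied).
    codebook = [list(rng.choice(vectors)) for _ in range(side * side)]
    coords = [(c // side, c % side) for c in range(side * side)]
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(range(len(vectors)))
        rng.shuffle(order)
        for idx in order:
            frac = step / total_steps
            alpha = 0.05 - 0.04 * frac                # illustrative decay 0.05 -> 0.01
            radius = max(side / 2 * (1 - frac), 0.5)  # ends by updating the winner only
            v = vectors[idx]
            b_row, b_col = coords[best_matching_unit(codebook, v)]
            for c, (row, col) in enumerate(coords):
                if math.hypot(row - b_row, col - b_col) <= radius:
                    w = codebook[c]
                    for k in range(len(v)):
                        w[k] += alpha * (v[k] - w[k])
            step += 1
    return codebook

# Toy input (hypothetical): 30 surnames over 3 provinces, each concentrated in one.
rng = random.Random(1)
profiles = []
for i in range(30):
    raw = [rng.random() * 0.1 for _ in range(3)]
    raw[i % 3] += 1.0
    total = sum(raw)
    profiles.append([x / total for x in raw])

codebook = train_som(profiles, side=3, epochs=20)
cells = [best_matching_unit(codebook, p) for p in profiles]  # one cell index per surname
```

Each entry of `cells` is the group assigned to one surname; surnames that share a cell have similar geographic profiles.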

Origin of surnames

Finally, each group of surnames was represented graphically by gradient maps using the ArcGIS 10.0 software, which allows their geographic distribution to be observed.

On looking at these maps, it is possible to identify the origin of each population group as a function of the surname in question. The method is based on the assumption that the closer we approach its origin, the more numerous a population group will be (Manni et al., 2005; Longley et al., 2011; Boattini et al., 2012; Cheshire, 2014). Thus, observing the gradient map of the distribution of each surname (Fig. 2), it may be assumed that the population group bearing that surname will have its origin at the place where the occurrences of that surname are most frequent (Fig. 1). When identifying the origin of each surname, we must consider that we are working with geographic information. This means we only obtain information about the geographic origin of each surname, not the historical moment when it originated or when the movements occurred. The authors of the method (Manni et al., 2005) emphasize this point and question whether the dispersion pattern itself could be used to infer the historical depth of the origin and dispersion of each surname. This could be of particular interest in future developments of the method.
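The "most frequent place" assumption can be illustrated with a small Python sketch. In the study the origin was read from the gradient maps; the automated rule below (take the province with the highest mean weighted frequency in a group) is only a hypothetical stand-in, and the function name and toy profiles are assumptions.

```python
def group_origin(group_profiles):
    """Return (province index, mean weighted frequency) of the peak for one surname group.

    group_profiles: list of per-surname profiles, each a list of weighted
    frequencies over provinces (rows summing to 1).
    """
    n_prov = len(group_profiles[0])
    mean = [sum(p[j] for p in group_profiles) / len(group_profiles)
            for j in range(n_prov)]
    origin = max(range(n_prov), key=lambda j: mean[j])
    return origin, mean[origin]

# Toy group (hypothetical): two surnames concentrated in province 0.
group = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
origin, share = group_origin(group)
```

A low peak value would point to a flat distribution, the situation the study treats as polyphyletic or ambiguous, so the returned share is worth inspecting rather than trusting the index alone.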

Fig. 2: SOM matrix.

Each cell is a gradient map of Spain, representing the geographic distribution of the corresponding group of surnames.

Migration matrices

Once the origin of each surname had been established, the second step was to assume that each surname found outside its origin would, at some time in the past, have moved out of that area (Boattini et al., 2012). Starting out from this, we constructed a migration matrix (Bodmer and Cavalli-Sforza, 1968), with as many rows (“i”) and columns (“j”) as provinces included in the study (47 × 47). Accordingly, each cell (“ij”) reflected the number of times that a surname with an origin in population “i” appeared in population “j”.

This methodology allows the study of the historical population movements that have taken place within a population and what has been the contribution of each of the subpopulations to the structure of the entire population (Boattini et al., 2012).
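The construction of the migration matrix can be sketched as follows. The function name, the surname labels and the counts are hypothetical; the sketch assumes that surnames without an identified origin (polyphyletic or ambiguous) are simply left out.

```python
def migration_matrix(counts, origins, n_provinces):
    """Build an origin x destination matrix of surname bearers.

    counts:  dict surname -> list of bearers per province
    origins: dict surname -> index of the origin province, or None if unknown
    """
    m = [[0] * n_provinces for _ in range(n_provinces)]
    for surname, row in counts.items():
        i = origins.get(surname)
        if i is None:  # no identified origin: excluded from the matrix
            continue
        for j, n in enumerate(row):
            m[i][j] += n  # bearers of a surname from province i now found in province j
    return m

# Toy data (hypothetical): 3 surnames, 3 provinces.
counts = {"A": [80, 15, 5], "B": [10, 60, 30], "C": [40, 40, 40]}
origins = {"A": 0, "B": 1, "C": None}
m = migration_matrix(counts, origins, 3)
```

The diagonal cells count bearers still living in the province of origin, and the off-diagonal cells count inferred movements.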

Historical censuses

The results obtained by isonymy were compared with the historical population. In particular, we performed regressions between the population bearing autochthonous surnames in each province and the provincial population sizes recorded in the historical censuses of the National Institute of Statistics (www.ine.es). To go even further back in time, we consulted the 1787 Floridablanca census at the digital Library of the Royal Academy of History (www.biobliotecadigital.rah.es).
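As an illustration of this kind of comparison, the sketch below computes a Pearson correlation and a least-squares line between two provincial series. The text does not specify the regression model, so ordinary least squares is an assumption, and the function name and numbers are hypothetical toy values, not census data.

```python
import math

def pearson_and_ols(x, y):
    """Return (r, slope, intercept) for the simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = my - slope * mx
    return r, slope, intercept

# Toy series (hypothetical): historical population vs. current autochthonous bearers.
historical = [100.0, 200.0, 300.0, 400.0]
autochthonous = [12.0, 22.0, 32.0, 42.0]
r, slope, intercept = pearson_and_ols(historical, autochthonous)
```

Repeating the calculation once per census year would give one coefficient per year, which is the form of comparison shown across census dates.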

Results

SOM

To analyze the internal movements of the Spanish population, a first and crucial aspect is to identify the origin of each surname and its pattern of dispersion.

These origins were identified by organizing the surnames as a function of their distribution pattern using neural networks. In this way, it was not necessary to identify the origin of each surname individually and study its movements; instead, we were able to study the origin of each group. The Spanish surnames (Fig. 2, Table 1) were organized in 289 groups, of which 27 were identified as groups of polyphyletic surnames, 4 remained blank, and the origin of the remaining 258 groups was identified. In other words, thanks to the use of SOM we were able to determine the origin of 31,752 of the 33,753 surnames (29,289,329 records). Each of the provinces was seen to have at least one group of surnames with their origin in it (all the provinces were represented in the sample of surnames of known origin).

Table 1 Summary table of the SOM. The table shows the number of surnames and the number of data grouped in each cell, together with the origin of each grouping of surnames.

In comparative terms (Table 2), if the size of the population is taken into account Spain has relatively few surnames, which means that a few polyphyletic surnames will represent a greater part of the population.

Table 2 Comparative table showing the main results obtained in Spain (present work), the Netherlands (Manni et al., 2005) and Italy (Boattini et al., 2012).

Characterization of movement

With the origin of each surname identified, we were able to build migration matrices. Their analysis allowed the migratory processes of each province to be characterized. Thus, we determined (Fig. 3) that the western and southern zones of Spain are the ones from which proportionally more people have emigrated over time. By contrast, in the north and east of the Peninsula expansion away from the origin has been much less pronounced.

Fig. 3

% of subjects with surnames original to each province found outside it.
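A percentage of this kind can be read directly off an origin × destination migration matrix. The sketch below is an assumption about how it could be computed, with a hypothetical function name and toy matrix: for each origin province, the bearers found in any other province are divided by all bearers of surnames originating there.

```python
def share_outside_origin(m):
    """Percentage of bearers of each province's surnames found outside that province.

    m[i][j]: bearers of surnames originating in province i that live in province j.
    Returns None for a province with no surnames of identified origin.
    """
    shares = []
    for i, row in enumerate(m):
        total = sum(row)
        shares.append(100.0 * (total - row[i]) / total if total else None)
    return shares

# Toy matrix (hypothetical): 3 provinces.
m = [[80, 15, 5], [10, 60, 30], [0, 0, 0]]
outside = share_outside_origin(m)
```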

Additionally, four areas stand out as favored destinations of population movements: Vizcaya, in the north, and Madrid, in the center (two of the main economic centers of the country); Valencia-Alicante, in the east; and Seville-Malaga, in the south. These are the main receivers of population movements. Broadly, the southern and western zones appear to act as a source of population, while the north-east acts as a sink.

Migration distances

With knowledge of the general characteristics of the internal movements of the Spanish population, we analyzed the distances covered by means of principal component analysis (PCA). In this analysis, all the internal movements were examined on the basis of the distance separating the origin of the surname from the destination. The first aspect revealed is that the model of isolation due to distance is not homogeneous for the whole of the Spanish population.

Four different trends deform this model (Fig. 4). The first corresponds perfectly to what would be expected from a model of isolation due to distance, and includes populations located in the north-west, north, and north-east.

Fig. 4: Principal component analysis.

The circles correspond to the populations where the movements started out from. Triangles represent distances.

In the second group, which is mainly localized around the center of the Peninsula, the model of isolation due to distance is deformed by the high frequency of short-distance movements (1–200 km).

In the third group, formed by populations in the south and west of the Peninsula, it is the medium-distance movements (200–600 km) that deform the model of isolation due to distance.

Finally, the fourth group, corresponding to the periphery of the Peninsula, is characterized by long-distance movements (more than 600 km).

Direction and sense of the movements

With knowledge of the different types of movement (Fig. 4) as a function of distance, we used PCA (Fig. 5) to analyze them separately. In this, we separated two populations (Fig. 5, circles) as a function of the destinations (Fig. 5, triangles) towards which the population has migrated.

Fig. 5: Direction and sense of the migrational movements.

A–C Principal component analyses of the movements by distance class. The circles represent the origins of the movements; the triangles represent destinations. A Short-distance movements (less than 200 km); B medium-distance movements (between 200 and 600 km); C longer-distance movements (more than 600 km); D summary map of the main destinations of each type of movement.

First (Fig. 5A), we studied the short-distance migratory movements (less than 200 km; these represent 18.67% of all movements). Three provinces were apparently the most characteristic destinations of these movements (provinces 7, 23 and 28 in Fig. 1 and Fig. 5A), all of them located in the north-western half of the Peninsula.

Then (Fig. 5B), we analyzed the medium-distance movements (200–600 km), representing 23.65% of the total. Here there seemed to be 7 important destinations, which can be grouped in two: 4 in the south and east of the Peninsula and 3 in the center and north. Each group of destinations mainly received populations from the half of the Peninsula in which it was located: the south-eastern centers received population from the south-east, and those in the north-west received immigrants from the north-west.

Finally (Fig. 5C), we explored long-distance movements (more than 600 km, representing 13.6% of the total). Here it is important to note that the provinces in the center of the country have few destinations at distances of more than 600 km. Accordingly, the migratory movements mainly occurred at the periphery of the country. Two groups of destinations stand out: those situated in the north-east of Spain and those located in the south-west.

The remaining 44% correspond to surnames that have remained in the province of origin.
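The three reported classes add up to 18.67% + 23.65% + 13.6% = 55.92%, which leaves 44.08%, in line with the remaining 44% quoted above. The sketch below shows one way such shares could be derived from a migration matrix and a matrix of inter-province distances. The function name, the class boundaries at exactly 200 and 600 km, and the toy values are assumptions; how distances between provinces were measured is not stated.

```python
def shares_by_distance(m, dist_km):
    """Percentage of bearers per movement class.

    m[i][j]:       bearers of surnames from province i living in province j
    dist_km[i][j]: distance between provinces i and j in km (hypothetical input)
    """
    totals = {"stayed": 0, "short": 0, "medium": 0, "long": 0}
    for i, row in enumerate(m):
        for j, n in enumerate(row):
            if i == j:
                key = "stayed"           # still in the province of origin
            elif dist_km[i][j] < 200:
                key = "short"            # less than 200 km
            elif dist_km[i][j] <= 600:
                key = "medium"           # 200-600 km
            else:
                key = "long"             # more than 600 km
            totals[key] += n
    grand_total = sum(totals.values())
    return {k: 100.0 * v / grand_total for k, v in totals.items()}

# Toy inputs (hypothetical): 3 provinces.
m = [[80, 15, 5], [10, 60, 30], [0, 0, 0]]
dist_km = [[0, 150, 700], [150, 0, 400], [700, 400, 0]]
shares = shares_by_distance(m, dist_km)
```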

In the case of long-distance movements, there has been a transfer of the population between the north-west of Spain and the south-east part.

In the three PCAs performed, one group of destinations emerged on the opposite side from the other destinations and from all the origins (Fig. 5A). In the short- and medium-distance movements, this group is formed precisely by the populations surrounding the three most important destinations in each category of movement. By contrast, in long-distance movements it comprises the populations located in the west. In all three cases, this group represents the populations that have received the fewest migratory movements.

Receiving centers

According to the movements received (Fig. 5D), the receiving centers can be classified in three categories:

  • Centers of national importance. These have received at least two different types of migratory movements. This means that the reach of their “gravitational field” extends across the whole country. Madrid is the only one of these destinations that loses importance in long-distance movements. This is reasonable considering that it is located in the center of the Peninsula.

  • Regional centers. These are destinations whose reach is regional and whose importance is seen only at short and medium distance. Two of these centers are located in the north-west and the migrant population received by them is mainly from that half of the Spanish territory. The third is on the south-east coast, and the movements received are precisely from that region.

  • Long-distance centers. Located on the south-east coast, these have only received long-distance migrations, of less relevance than the previous ones.

Main migratory movements

The main movements within the Spanish population are represented below; specifically, the two major migratory movements received by each province (Fig. 6) and the two major emigrant movements from each province (Fig. 7).

Fig. 6

Map of most important migratory movements of each province.

Fig. 7

Map of most important emigrant movements of each province.

The analysis (Fig. 6) detects the same relevant destinations as the analysis of movements by distance (Fig. 5), with the same fields of attraction. Likewise (Fig. 7), two main emitting sources can be seen: one is the north-west zone of Spain and the other is the south-east. The movements originating in each of these foci also remain in their own half of the country.

These representations also provide information about major “streams” (Fig. 6). The north-western half of the country mainly moved towards the north or the center of the Peninsula, and the south-east moved mainly around the coast and, to a lesser extent, towards the center. Moreover, it seems that these movements followed what might be termed “population corridors”. The two main ones are coastal, following the Cantabrian coast in the north and the Mediterranean coast in the south-east (Figs. 6 and 7). Although less evident and probably less relevant, another corridor can be seen in the west of the country.

Autochthony

The movements of populations alter their composition and hence one of the most interesting parameters to analyze is the autochthony of the populations, or the proportion of surnames present in a given population that have their origin in it (Fig. 8), and its relationship with the movements of the population.

Fig. 8

Percentage of the population with autochthonous surnames in each province.
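Autochthony, as defined above, can be sketched from the same origin × destination matrix by reading it column-wise. The function name and toy matrix are hypothetical, and the denominator is an assumption: here it counts only bearers of surnames with an identified origin, which may differ from the denominator used for the figure.

```python
def autochthony(m):
    """Percentage of each province's residents whose surname originated there.

    m[i][j]: bearers of surnames originating in province i that live in province j.
    The denominator is the column total, i.e. residents of province j bearing
    a surname of identified origin (an assumption of this sketch).
    """
    n = len(m)
    result = []
    for j in range(n):
        col_total = sum(m[i][j] for i in range(n))
        result.append(100.0 * m[j][j] / col_total if col_total else None)
    return result

# Toy matrix (hypothetical): 3 provinces.
m = [[80, 15, 5], [10, 60, 30], [0, 0, 0]]
auto = autochthony(m)
```

In this toy case the third province has residents but no surnames originating in it, so its autochthony is 0%.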

In Spain, there are two zones in which the proportion of autochthonous surnames is especially high. Both zones are located on the coast. The most autochthonous zone in Spain is the Cantabrian coast in the north (which in some cases surpasses 60% of autochthonous surnames). The second one is on the Mediterranean coast, in the south-east.

Above, three corridors were identified in which a large part of the movements can be seen. The two most autochthonous zones of the country correspond precisely to the two coastal corridors. By contrast, the west corridor crosses a much less autochthonous zone.

The rest of Spain shows values ranging from 20 to 40%, with the single exception of Madrid and its surroundings in the center, and Barcelona in the north-east, both showing values that do not surpass 20%.

Historical background

Once we had obtained the data describing the structure of the Spanish population, their relations and internal movements, we were interested in addressing the issue of what kind of historical meaning could be extrapolated from these results. To accomplish this (Fig. 9) we compared the autochthonous population of each province with the historical size of these populations.

Fig. 9: Left: Plot of the variation of the population of each province around the population mean (1.0) for the whole period (1787–2000).

Right: Level of significance between the population currently bearing autochthonous surnames and the historical population size of each province.

On one hand, the size of the provincial populations increased slightly but steadily up to 1950, after which it underwent sharp changes. On the other hand, in the correlations between the autochthonous population and the historical population size two observations are important. The first is that the significance rises with the antiquity of the census and the second is that it was precisely from 1950 that this correlation ceased being significant.

Discussion

SOM

The first part of the study’s objective, testing the methodology for analyzing population movements, requires its use to clarify the origin of each surname. The methodological basis of SOMs is simple: SOMs group surnames according to their geographic distribution in such a way that this can be studied in groups. Each surname group becomes more and more frequent the closer it gets to its origin (Cheshire and Longley, 2012). Accordingly, it is possible to distinguish three basic types of surnames (Manni et al., 2005).

  a. Surnames whose distribution extends throughout the area studied without obeying any apparent pattern. These are considered polyphyletic surnames and they tend to identify many individuals.

  b. Surnames whose distribution shows an ambiguous pattern that does not allow a clear origin to be established. These are ambiguous surnames that, for the rest of the procedure, cannot be considered monophyletic.

  c. Surnames whose distribution follows a clear pattern and whose origin can be established. These are monophyletic surnames, and contain valuable information for the population study.

The first two types of surname do not contain information relating to the origin of their bearers and hence could not be used in the study. Only the third type of surnames (monophyletic), for which we have been able to establish a unique geographic origin, can be used to analyze population movements.

Regarding certain population aspects, it is interesting to compare the raw findings for the three populations in which this methodology has been used (Table 1): Spain (present study), The Netherlands (Manni et al., 2005) and Italy (Boattini et al., 2012). First, in comparative terms, the low diversity of surnames in the Spanish population is noteworthy (Spain: 0.656 surnames/1000 inhabitants; The Netherlands: 6.046/1000; Italy: 6.406/1000). This lower diversity in Spain has been reported in previous works (Rodríguez-Larralde et al., 2003; Scapoli et al., 2007; Adams et al., 2008; Cheshire et al., 2011; Rodríguez-Díaz et al., 2015; 2017) and would be expected in view of its isolated geographical situation (i.e., it is a peninsula at the extreme south-western end of the continent, separated from it by the Pyrenees) and the orographic features that have made it an amalgam of isolated parts. At this point, it seems pertinent to recall the socioeconomic repercussions that have been observed in relation to population diversity, and more specifically, those associated with low surname diversity (Posch et al., 2024).

Continuing with the comparison, the proportion of polyphyletic surnames in Spain (2.62%) is similar to that of The Netherlands, where 1.46% of the surnames are polyphyletic, and differs from that of Italy, where 21.05% are polyphyletic. The fact that there are so few polyphyletic surnames in Spain points to the notion of a highly settled and regionalized population, or at least one with a well-marked structure (Adams et al., 2008).

However, these few polyphyletic surnames represent a huge proportion of the Spanish population (63.12%). By contrast, in The Netherlands a similar percentage of surnames represents a considerably smaller portion of the population (24.29%), and in Italy many more surnames are needed to represent a percentage of the population similar to the Spanish case. The comparison shows that the Spanish population is less diverse than the Dutch one.
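The two percentages compared above (share of surnames that are polyphyletic versus share of the population bearing them) and the diversity index (surnames per 1000 inhabitants) are simple ratios. The following is a minimal sketch of how they could be computed; the record layout, the function names and the toy counts are hypothetical and are not taken from the datasets analyzed here.

```python
from collections import namedtuple

# Hypothetical record: one surname, its SOM-derived class and its number of bearers.
Surname = namedtuple("Surname", ["name", "kind", "bearers"])


def surnames_per_thousand(n_surnames, population):
    """Surname diversity expressed as distinct surnames per 1000 inhabitants."""
    return 1000.0 * n_surnames / population


def polyphyletic_shares(records):
    """Return (% of surnames that are polyphyletic, % of bearers they represent)."""
    total_surnames = len(records)
    total_bearers = sum(r.bearers for r in records)
    poly = [r for r in records if r.kind == "polyphyletic"]
    surname_share = 100.0 * len(poly) / total_surnames
    population_share = 100.0 * sum(r.bearers for r in poly) / total_bearers
    return surname_share, population_share


# Toy data (invented): one widespread surname and four less frequent ones.
toy = [
    Surname("A", "polyphyletic", 600),
    Surname("B", "monophyletic", 150),
    Surname("C", "monophyletic", 100),
    Surname("D", "ambiguous", 100),
    Surname("E", "monophyletic", 50),
]

surname_share, population_share = polyphyletic_shares(toy)
diversity = surnames_per_thousand(len(toy), sum(r.bearers for r in toy))
```

In this toy case a single surname out of five accounts for most of the bearers, which is the same kind of contrast described for Spain: few polyphyletic surnames, many individuals.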

Although this phenomenon could be due to the different ages of the surname systems, which are very old in Spain and Italy (13th and 14th centuries, respectively) and much younger in The Netherlands (19th century), this first impression suggests that the Spanish population shows low diversity, a direct consequence of several factors deriving both from its historical outflow of population to other countries (Encarnación, 2004) and from its great isolation and regionalization.

After identifying the origin of each surname, we can assume that individuals carrying a surname found outside this origin descend from a lineage that left the original population at some point in history. Based on this assumption, we can observe the historical internal movements of the Spanish population, the distances traveled, the direction of the flows, and how these have contributed to the current structure.
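The assumption above can be sketched in a few lines: once each monophyletic surname has an assigned origin, every bearer living elsewhere is counted as a movement from the origin to the province of residence. The function name, the input layout and the province labels (`P1`, `P2`) are hypothetical, and the autochthony measure shown here (bearers of local surnames among all bearers of monophyletic surnames resident in a province) is one plausible definition, not necessarily the one used in this study.

```python
from collections import defaultdict


def emigration_and_autochthony(origin_of, bearers):
    """Summarize movements implied by surname origins.

    origin_of: surname -> province of origin (monophyletic surnames only).
    bearers:   (surname, province of residence) -> number of bearers.
    """
    originated = defaultdict(int)  # bearers of surnames originating in a province
    stayed = defaultdict(int)      # those still living in the province of origin
    residents = defaultdict(int)   # bearers of monophyletic surnames living there
    flows = defaultdict(int)       # (origin, destination) -> bearers

    for (surname, province), count in bearers.items():
        origin = origin_of.get(surname)
        if origin is None:
            continue  # polyphyletic or ambiguous surnames carry no origin
        originated[origin] += count
        residents[province] += count
        if province == origin:
            stayed[origin] += count
        else:
            flows[(origin, province)] += count

    emigrated = {p: 1.0 - stayed[p] / originated[p] for p in originated}
    autochthony = {p: stayed[p] / residents[p] for p in residents}
    return emigrated, autochthony, dict(flows)


# Toy data (invented): two surnames, two provinces.
origin_of = {"Alfa": "P1", "Beta": "P2"}
bearers = {
    ("Alfa", "P1"): 40, ("Alfa", "P2"): 60,
    ("Beta", "P2"): 90, ("Beta", "P1"): 10,
}
emigrated, autochthony, flows = emigration_and_autochthony(origin_of, bearers)
```

With these invented counts, `P1` behaves as an emitter (most bearers of its surname live elsewhere) and `P2` as a receiver, mirroring the two kinds of zone discussed in the next section.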

Migratory movements

An initial look shows that migratory movements are not a homogeneous phenomenon in the Spanish population (Fig. 3). There are two large zones showing very different kinds of behavior: one, spanning the whole of the western and southern zones, from which the majority of the original population (more than 60%) has left, and another, located in the center and north-east, in which the original population (less than 50%) is smaller than the outsider one.

This geographic distribution of migration shows that population movements have by no means been homogeneous and have been geographically asymmetrical (Santiago-Caballero, 2021). The genetic differences between populations are not random but are influenced by the physical characteristics and natural barriers of the terrain, such as mountains and rivers, which have historically limited the movement of people (Bycroft et al., 2019). It seems a priori that part of Spain has emitted population (Emitter) that could have colonized other parts of the country (Receiver). This movement has generally been from more rural populations towards more industrial and affluent areas (Bover and Velilla, 2019). The availability of economic resources often appears to be one of the most important reasons behind migratory movements (O’Brien et al., 2022). The nature of these movements would therefore have governed the structure of the Spanish population and merits a detailed analysis.

Distance, sense and direction of migratory movements

In general, it is considered that populations obey a model of isolation due to distance (Malecot, 1955). In broad terms, this is known to have been the case in Spain (Rodríguez-Larralde et al., 2003; Rodríguez-Díaz et al., 2017). Nevertheless, in Spain, as in Italy (Boattini et al., 2012), the movements are far from homogeneous: they do not decrease uniformly as distance increases, nor do they obey a single model across the whole geography of the country (Fig. 4). On the contrary, the movements can be classified into four groups that can be analyzed individually to see how much, and at what level, each has contributed to the formation of the structure of the Spanish population.

  • Isolation due to distance. It is remarkable (Fig. 4) that the populations that best fit the model of isolation due to distance coincide with those in which a language other than Spanish is spoken (Fig. 1). It would appear that although languages have not played a relevant role in the global structure of the Spanish population (Rodríguez-Díaz et al., 2017), they could have played a secondary role at a lower geographical level, as has been seen for other populations studied (Manni et al., 2004; Boattini et al., 2011).

  • Short-distance movements. These represent a low percentage of all movements (18.87%). This, together with the fact that they have had little influence, means that they have been of less importance in the structure of the population. They are better represented in populations located around important centers (Fig. 5, A and D) in the northwest of the country. The attraction of these centers is so strong that it has altered isolation due to distance.

  • Medium-distance movements. These are the most important (they represent 23.65% of the movements). Most Spanish provinces are separated by this distance range, such that this group is the most representative of interpopulation relations. Spain is seen to be divided into two halves: the northwestern and southeastern halves are not related to each other, and movements occur from the provinces to centers located in the same half.

  • Long-distance movements. These are the least representative movements (13.36%) and have occurred at the periphery because the peripheral populations are the only ones separated by such large distances.
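As an illustration of how origin-to-destination flows could be assigned to the three distance groups above, the following sketch bins each flow by the great-circle distance between province centroids and reports the share of movements in each band. The cut-off values (`short_max`, `medium_max`), the centroid coordinates and the flow counts are hypothetical placeholders, not the thresholds or data used in this study, and the fit to the isolation-due-to-distance model is not covered.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_band(km, short_max=150.0, medium_max=450.0):
    """Assign a distance to a band; both cut-offs are assumed values."""
    if km <= short_max:
        return "short"
    if km <= medium_max:
        return "medium"
    return "long"


def band_shares(flows, centroids):
    """Percentage of moved bearers falling in each distance band.

    flows:     (origin, destination) -> number of bearers.
    centroids: province -> (latitude, longitude) in degrees.
    """
    totals = {"short": 0, "medium": 0, "long": 0}
    for (origin, destination), count in flows.items():
        km = haversine_km(*centroids[origin], *centroids[destination])
        totals[distance_band(km)] += count
    moved = sum(totals.values())
    return {band: 100.0 * n / moved for band, n in totals.items()}


# Toy data (invented): four points on the equator, 1, 3 and 6 degrees apart.
centroids = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (0.0, 3.0), "D": (0.0, 6.0)}
flows = {("A", "B"): 20, ("A", "C"): 50, ("A", "D"): 30}
shares = band_shares(flows, centroids)
```

Centroid-to-centroid distance is only one possible choice; distances between provincial capitals or along transport routes would give different band memberships.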

The nature and distance of these population movements seem to depend on the chronological period. In recent times (20th and 21st centuries), these distances are associated with the employment opportunities offered by each territory (Bover and Velilla, 2019).

Main movements

Detailed analysis of the main movements allowed us to observe the relationship between the Spanish populations, representing the two main destinations (Fig. 6) and the two main origins (Fig. 7) of the movements of each province.

Both the destinations (Fig. 6) and the origins (Fig. 7) reveal the existence of two main migratory arcs. Both migrations have moved along the coast; the first following the Mediterranean and the second following the Cantabrian Sea. It appears that these two arcs are those that have provided the backbone of the Spanish population, dividing it into two halves that would reach the limits of the areas of influence of these arcs. On a second plane, it is possible to observe the existence of a third arc (less relevant) in the west (Figs. 6 and 7), which matches the “Ruta de la Plata” perfectly. This is an ancient communication route that was taken up by the Romans and later used as a transhumance route (Martínez, 2003); at present, it remains an important corridor that runs through the peninsula from north to south and serves as a population exchange route between all the towns along it. The role of transhumance routes as population itineraries has already been evidenced in other environments (Orrù et al., 2018) and, in fact, there are indications that, precisely in this area of the western peninsula, there is mixing of populations with different origins (Adams et al., 2008).

It seems that the presence of these large population movements is what has led to the Spanish population structure already described (Rodríguez-Díaz et al., 2017) and observed again in the present work.

Autochthony

The degree of autochthony varies considerably from one zone to another as a result of the influence of geographic or historical factors that have given rise to different migration patterns (Manni et al., 2005). In the case of Spain, it seems that autochthony is concentrated on the coasts, coinciding with the Mediterranean and Cantabrian arcs, regions that have traditionally experienced lower historical emigration (Valero-Matas et al., 2014).

It also appears that there are two types of corridor. Along the coastal corridors autochthony is very high, while the western corridor is not very autochthonous. A feasible explanation for this phenomenon is that each coastal corridor would have “articulated” its own half of the Spanish population (Rodríguez-Díaz et al., 2017). The Spanish population is divided into two halves and each extends out of a coastal arc. Through such arcs, movements within a single population have occurred (northwestern half/southeastern half), while the western corridor would involve a route of exchange between these two differentiated populations and therefore has more allochthonous population. Recent studies carried out on internal migration in the Spanish population have shown the influence of climatic aspects as factors determining population mobility (Maza et al., 2019), so the displacement within the two arcs described here can be framed along these lines.

Historical background

To validate our results, we have compared them with what is known about the historical Spanish population from available historical records and genetic studies.

With insight into the structure of the population and how it was shaped, the most pressing question is when this process actually occurred. First, the results reported here are consistent with those described by the National Geographic Institute and the National Institute of Statistics (www.ign.es, www.ine.es), and they become even more consistent as the records of the migratory movements become older (the oldest correspond to the decade between 1960 and 1970). This was to be expected from the isonymy methodology used, which reflects the results of a historical process.

A similar situation is found for the comparison between autochthonous surnames and the historical censuses of the provinces (Fig. 8). The fact that the number of bearers of autochthonous surnames correlates better with the population size as the age of the census used increases confirms the notion that autochthonous surnames are a faithful reflection of the original population of each province and is an indication of the precision underlying the identification of the origin of each surname.
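The comparison described above amounts to correlating, census by census, the number of bearers of autochthonous surnames in each province with the historical population size of that province. A minimal sketch using Pearson's coefficient follows; the choice of coefficient, the function names and the toy figures are assumptions made for illustration, and the significance testing reported in Fig. 9 is not reproduced.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_by_census(autochthonous, census):
    """Correlate autochthonous bearers with province size for each census year.

    autochthonous: province -> current bearers of autochthonous surnames.
    census:        year -> {province -> historical population size}.
    """
    provinces = sorted(autochthonous)
    xs = [autochthonous[p] for p in provinces]
    return {
        year: pearson_r(xs, [sizes[p] for p in provinces])
        for year, sizes in census.items()
    }


# Toy data (invented): three provinces, an old and a recent census.
autochthonous = {"P1": 10, "P2": 20, "P3": 30}
census = {
    1787: {"P1": 100, "P2": 200, "P3": 300},
    2000: {"P1": 300, "P2": 100, "P3": 200},
}
r_by_year = correlation_by_census(autochthonous, census)
```

In the invented example the old census tracks the autochthonous counts closely and the recent one does not, which is the qualitative pattern the text attributes to the loss of population stability after 1950.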

Until 1950 (Fig. 9), this correlation is significant. Sometime around then, rural emigration began in Spain and the population lost stability (Fuster and Colantonio, 2002). In fact, on observing the evolution of the Spanish population in each province (Fig. 9), it is clearly seen that all the provinces maintained a stable population subject to gentle growth up to 1950. Then, after that year some provinces suddenly began to lose part of their populations in favor of others and the population sizes changed sharply: population stability had disappeared.

This phenomenon again suggests that the long-distance migratory movements have been a recent phenomenon and is consistent with the notion that the Spanish population is highly conserved (Adams et al., 2008; Rodríguez-Díaz et al., 2017). This conserved structure traces back to the historical events of the Muslim era and the Reconquista, which can be placed between the 9th and 11th centuries (Bycroft et al., 2019), and therefore predates the established surname system. In this scenario, it appears that the Spanish interpopulation relations have been of limited reach, both as regards intensity and distance, and that they have persisted until very recent dates within zones clearly delimited by geographic determinants, which would explain why the Spanish population is clearly divided into two differentiated parts (Rodríguez-Larralde et al., 2003; Cheshire et al., 2011; Rodríguez-Díaz et al., 2017). Traditionally, population movements have been observed inside these zones and there has been little exchange between them until very recent times.

Conclusions

The results obtained demonstrate a reliable methodology that can be used (though not exclusively) in populations with a surname system similar to that of Spain. Additionally, the findings regarding the internal structure of the Spanish population and the origins of Spanish surnames may also be of interest to populations that share surnames of Spanish origin.

The application of this new methodology has allowed us to distinguish the surnames of the Spanish population that have a clear origin (monophyletic) and to use them as geographic markers, so that we have been able to determine the origin of each group of surnames and to highlight their mobility. The coherence with previously reported results, with analyses carried out prior to the migratory movements, and the correlation between autochthony and the oldest censuses point to the precision (bearing in mind the geographical level chosen) of the technique when attempting to identify the origins of the surnames.

Within the Spanish population several types of movement have taken place. Those of short-medium distance have been the most frequent and most determinant in the current structure. This mobility has been confirmed mainly within two geographically differentiated regions. In the northwest the movements have occurred along the Cantabrian arc up to where its influence reached. Symmetrically, in the southeast the population followed the Mediterranean arc, also arriving as far as the reach of its influence. The exchange between both areas has been relatively scarce and has mainly been seen in relation to the west, following the ancient “Ruta de la Plata” corridor.

In light of the stability of the population until relatively recently (1950) and the small share of movements they represent, long-distance movements seem to have been a more recent phenomenon, with a less marked contribution to the population structure.