Matchmaker, Matchmaker, Make Me a Match: Migration of Populations via Marriages in the Past


2014 Korean Physical Society (KPS) Fall Meeting, Statistical Physics Division session [FG-09]

SHL, R. Ffrancon, D. M. Abrams, B. J. Kim (김범준), and M. A. Porter, Phys. Rev. X 4, 041009 (2014).

Sang Hoon Lee (이상훈) Department of Energy Science (에너지과학과)

Sungkyunkwan University (성균관대학교) [email protected]

http://sites.google.com/site/lshlj82

from https://twitter.com/AcademicsSay/status/524571824492150784


KPS meeting author list

Kim from Gyeongju (경주 김) ≠ Kim from Gimhae (김해 김)

Korean family names are subdivided into 본관 (bon-gwan, clans), named after their geographical origins


Tales of two Korean researchers’ origins

Sang Hoon Lee (이상훈): Lee from Hakseong (학성 이)

Beom Jun Kim (김범준): Kim from Gimhae (김해 김)







Fig. 1. Examples of (a) ergodic and (b) non-ergodic clans. We color the regions of South Korea based on the percentage of the population made up of members of the clan in 2000. We use arrows to indicate the origins of the two clans: Gimhae on the left and Ulsan (“Hakseong” is the old name of the city) on the right. In this map, we use the 2010 administrative boundaries [28]. (See the SI for discussions of data sets and data cleaning.)

Fig. 2. (a) Scatter plot of the number of clan entries in jokbo 1 versus the corresponding centroid in 2000 using the gravitational-model flux with parameters $\alpha = 1$ and $\gamma = 0$. We compute the line using a linear regression to find the fitting parameter, $a_G \approx 4.2(2) \times 10^{-10}$ with a 95% confidence interval, to satisfy the expression $N_i = a_G G_{ij}$, where $G_{ij}$ is the gravity-model flux and $N_i$ is the total number of entries from clan $i$ in the jokbo. (b) The same clan entries compared to the radiation model. We compute the line using a linear regression to find the fitting parameter, $a_R \approx 0.049(2)$, to satisfy the expression $N_i = a_R R_{ij}$, where $R_{ij}$ is the radiation-model flux and $N_i$ is the total number of entries from clan $i$ in the jokbo. In both panels, we color the points using the number of administrative regions occupied by the corresponding clans, which we show in Figs. 4(a) and (b). The red markers (outliers) on both panels correspond to the clan corresponding to jokbo 1 (the case $i = j$).


family/origin population density as a percentage of total population density

Really?

from the census data (1985 and 2000, the only two years for which regional data on individual 본관 are available), which list the population of each 본관 in every administrative region of South Korea; provided by the Korean Statistical Information Service (KOSIS): http://kosis.kr/






“ergodic” vs “non-ergodic” 본관 ...


족보 (jokbo, family book) in Korean culture

family tree of (mostly) male descendants

spouses’ 본관 and birth year: “by-product” data sets



Figure 6. Comparison between the actual family books (points) and RGF predictions (lines). [Log-log plot of $P(k)$ versus $k$, with curves labeled 1600–1630, 1900–1990, and Census.]

Figure 7. Power-law exponent $\gamma$ as a function of $M$. The solid line was obtained by analyzing the family books from 1900–1990, and the dotted line is connected with the census data in 2000.

$M_{\mathrm{tot}} = 165\,020$ women each having one family name out of $N_{\mathrm{tot}} = 194$ and where $k_{\max} = 32\,316$ are named Kim. These three numbers $M_{\mathrm{tot}}$, $N_{\mathrm{tot}}$ and $k_{\max}$ uniquely determine $P_M(k)$ within the RGF model, as explained in section 3 [6]. The middle full curve in figure 6 gives the predicted size distribution and the pluses denote the actual (binned) data points. The agreement between the RGF prediction and the data is very good, in particular in view of the fact that the prediction is based solely on the three numbers $M_{\mathrm{tot}}$, $N_{\mathrm{tot}}$ and $k_{\max}$. The prediction for the exponent $\gamma$ in (1) is $\gamma = 1.12$. As explained in section 3, the RGF model allows one to predict $P_M(k)$ for either a smaller $m < M$ or a larger $m > M$. Figure 7 displays the predicted change of $\gamma$ when starting from the data given for the period 1900–1990. The solid curve gives the change when $m$ is decreasing and the dotted line when it is increasing. The fact that $\gamma$ changes with the size of the data set is a fundamental feature of the RGF model and distinguishes it from the usual growth models, which in general give scale-invariant and hence size-independent $\gamma$ [6]. The left full curve in figure 6 is the prediction for the 1600–1630 data only using the data for 1900–1990 and the number $m = 384$, which is the number of women getting married into the ten families during the period 1600–1630. The actual name-frequency distribution for these women is given by the crosses and the agreement is again quite good. In particular, note that the data are indeed consistent with the slightly steeper slope for smaller $k$ caused by a slightly larger $\gamma = 1.22$ (compare figure 7). The rightmost curve in figure 6, in the same way, gives the prediction based on only using the data for the married women in 1900–1990 and the total population size in the year 2000 given by $m = 4.6 \times 10^{7}$. The census data from the year 2000 are also plotted and the agreement between the prediction and the data is again very good. This time the $\gamma = 1.07$ is


S. K. Baek (백승기), P. Minnhagen, and B. J. Kim (김범준), New J. Phys. 13, 073036 (2011) and references therein.




geographical information from 본관: honestly, too valuable pieces of information to overlook (and just be satisfied with the distributions as if we don’t know anything about them), don’t you think? :)

Marriage record in 족보 as a proxy of human migration due to marriage


Studies on human migration patterns: two types of generative models

gravity model: flux

$$G_{ij} = A \, \frac{m_i^{\alpha} m_j^{\beta}}{r_{ij}^{\gamma}},$$

where $m_i$ is the population at $i$ and $r_{ij}$ is the distance between $i$ and $j$, with $r_{ij}^{\gamma}$ controlling the effect of geography (or lack thereof)

J. Q. Stewart and W. Warntz, J. Regional Sci. 1, 99 (1958); W.-S. Jung (정우성), F. Wang, and H. E. Stanley, EPL 81, 48005 (2008).

radiation model: flux

$$R_{ij} = \frac{R_i}{1 - m_i/N} \, \frac{m_i m_j}{(m_i + s_{ij})(m_i + m_j + s_{ij})} \quad (i \to j),$$

where $R_i$ is the total population outgoing from $i$, $s_{ij}$ is the population in the circle of radius $r_{ij}$ centered at $i$ (excluding the source $i$ and destination $j$’s population), and $N$ is the total population

F. Simini, M. C. González, A. Maritan, and A.-L. Barabási, Nature 484, 96 (2012); A. P. Masucci, J. Serras, A. Johansson, and M. Batty, Phys. Rev. E 88, 022812 (2013).
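The two flux formulas can be written down directly as functions. The following is only a minimal sketch: the function and argument names are hypothetical, and the numbers in the usage lines are made up for illustration rather than taken from the census or 족보 data.

```python
def gravity_flux(m_i, m_j, r_ij, alpha, beta, gamma, A=1.0):
    """Gravity-model flux G_ij = A * m_i**alpha * m_j**beta / r_ij**gamma.

    A = 1.0 is a placeholder normalization (an assumption of this sketch).
    With gamma = 0 the distance drops out, so the case r_ij = 0 is allowed.
    """
    return A * m_i**alpha * m_j**beta / r_ij**gamma


def radiation_flux(R_i, m_i, m_j, s_ij, N):
    """Radiation-model flux
    R_ij = R_i / (1 - m_i/N) * m_i*m_j / ((m_i + s_ij) * (m_i + m_j + s_ij)).
    """
    prefactor = R_i / (1.0 - m_i / N)
    return prefactor * (m_i * m_j) / ((m_i + s_ij) * (m_i + m_j + s_ij))


# Illustrative (made-up) numbers only.
g = gravity_flux(2.0, 3.0, 5.0, alpha=1.0, beta=1.0, gamma=0.0)
r = radiation_flux(R_i=10.0, m_i=100.0, m_j=100.0, s_ij=0.0, N=1000.0)
```

Setting `gamma=0.0` reproduces the distance-independent case; the radiation flux has no free exponents, only the intervening population `s_ij`.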

In practice . . .

family name, 본관, population (excerpt):
이(李) 진주 12636
이(李) 진해 490
이(李) 차성 2173
이(李) 창녕 4132
이(李) 창원 719
이(李) 천안 2297
이(李) 철성 4871
이(李) 청도 983
이(李) 청송 814
이(李) 청안 13549
이(李) 청양 743
이(李) 청주 34756
이(李) 청평 656
이(李) 청해 12002
이(李) 춘천 372
이(李) 충남 995
이(李) 충주 4022
이(李) 칠성 1209
이(李) 태안 4084
이(李) 태원 670
이(李) 토산 491
이(李) 통진 227
이(李) 평산 3394
이(李) 평양 579
이(李) 평창 65945
이(李) 평택 1099
이(李) 평해 205
이(李) 풍천 731
이(李) 하동 1052
이(李) 하빈 15058
이(李) 하산 828
이(李) 하음 431
이(李) 학성 20964
이(李) 한산 136615
이(李) 한성 1342
이(李) 한양 1096
이(李) 함경 792
이(李) 함안 37597
이(李) 함양 1851
이(李) 함평 125419
이(李) 함풍 6413
이(李) 함흥 786
이(李) 합천 115462
이(李) 해남 878
이(李) 해령 650
이(李) 해주 2064
이(李) 행주 870
이(李) 헌양 786
이(李) 홍성 1163
이(李) 홍주 14897
이(李) 홍천 566
이(李) 화산 1775
이(李) 화평 1498
이(李) 황주 291

the census data (1985 and 2000) listing the population of each 본관 in every administrative region of South Korea, provided by the Korean Statistical Information Service (KOSIS): http://kosis.kr/



population from 2000 census data

marriage “flux” in 족보

distance between two population centroids in 2000

brides’ 본관 index i

grooms’ 본관 index j: identical (constant) for a single 족보 by definition







Gravity & radiation models

linear fit: marriage flux $N_i \propto G_{ij}$

linear fit: marriage flux $N_i \propto R_{ij}$

$\alpha \approx 1.0749$ and $\gamma \approx -0.0349$ → $\alpha = 1$ and $\gamma = 0$ (to include the case $i = j$)
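The slide does not say how the fits were carried out, so the following is only a sketch under the assumption of ordinary least squares: a slope through the origin for $N_i = a\,G_{ij}$, and a log-log regression for the exponent in $N_i \propto m_i^{\alpha}$ with the distance exponent fixed to zero. The function names and the synthetic data are hypothetical.

```python
import math


def fit_through_origin(x, y):
    """Least-squares slope a for y = a * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_power_law_exponent(m, n):
    """Least-squares slope of log(n) against log(m), i.e. alpha in n ~ m**alpha."""
    lx = [math.log(v) for v in m]
    ly = [math.log(v) for v in n]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


# Synthetic example: fluxes exactly proportional to population.
populations = [10.0, 100.0, 1000.0, 10000.0]
fluxes = [2.0 * v for v in populations]
slope = fit_through_origin(populations, fluxes)
alpha = fit_power_law_exponent(populations, fluxes)
```

On real data the log-log fit needs all entries to be strictly positive, so clans with zero recorded marriages would have to be handled separately.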




No dependence on the distance: “population-product model” with marriage flux $\propto m_i^{\alpha}$ (the 본관 of the 족보 is ergodic, so the husbands could be almost everywhere in the country, which makes the geographic factors irrelevant)












outlier: women from the 본관 of the 족보 itself are severely underrepresented. In fact, marrying a spouse from the same 본관 was long considered taboo in Korean culture and was even illegal in South Korea until 2005!



a simple measure of ergodicity: number of different regions occupied by 본관
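As a minimal sketch of this measure, assume a hypothetical mapping from region name to the clan's population in that region. The threshold of 150 regions is the one quoted in the FIG. 5 caption; everything else (names, data layout) is an assumption made for illustration.

```python
def regions_occupied(population_by_region):
    """Number of administrative regions in which the clan has at least one member."""
    return sum(1 for count in population_by_region.values() if count > 0)


def is_ergodic(population_by_region, threshold=150):
    """Call a clan "ergodic" if it is present in at least `threshold` regions."""
    return regions_occupied(population_by_region) >= threshold


# Made-up toy clan present in two of three regions.
toy_clan = {"region A": 3, "region B": 0, "region C": 1}
n_regions = regions_occupied(toy_clan)
```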

“ergodic” vs “non-ergodic” 본관


FIG. 3. Distribution of the number of different administrative regions occupied by clans. (a) Probability distribution of the number of different administrative regions occupied by a Korean clan in the year 2000. (b) Probability distribution of the number of different administrative regions occupied by the clan of a Korean individual selected uniformly at random in the year 2000. The difference between this panel and the previous one arises from the fact that clans with larger populations tend to occupy more administrative regions. Note that the rightmost bar has a height of 0.17 but has been truncated for visual presentation. (c) Probability distribution of radii of gyration (in km) for clans in 2000. (d) Probability distribution of radii of gyration (in km) for clans of a Korean individual selected uniformly at random in 2000. The difference between this panel and the previous one arises from the fact that clans with larger populations tend to occupy more administrative regions. Solid curves are kernel density estimates (from Matlab R2011a’s ksdensity function with a Gaussian smoothing kernel of width 5).

have jokbo are fairly ergodic, so the variables associated with the $j$ indices (i.e., the grooms) in Eqs. (1) and (2) have already lost much of their geographical precision, which is consistent both with the values $\alpha = 0$ and $\gamma = 0$ (the population product model). Again see the scatter plots in Fig. 2, in which we color each clan according to the number of different administrative regions that it occupies. Note that the three different ergodicity diagnostics are only weakly correlated (see Fig. I2).

Our observations of clan bimodality for Korea contrast sharply with our observations for family names in the Czech Republic, where most family names appear to be non-ergodic [25] (see Fig. I3). One possible explanation of the ubiquity of ergodic Korean names is the historical fact that many families from the lower social classes adopted (or even purchased) names of noble clans from the upper classes near the end of the Joseon dynasty (19th–20th centuries) [20, 52]. At the time, Korean society was very unstable, and this process might have, in essence, introduced a preferential growth of ergodic names.

In Fig. 4, we show the distribution of the diffusion constants that we computed by fitting to Eq. (3). Some of the values are negative, which presumably arises from finite-size effects in ergodic clans as well as basic limitations in estimating diffusion constants using only a pair of nearby years. In Fig. I4, we show the correlations between the diffusion constants and other measures.

FIG. 4. Distribution of estimated diffusion constants (in km²/year) computed using 1985 and 2000 census data and Eq. (3). The solid curve is a kernel density estimate (from Matlab R2011a’s ksdensity function with default smoothing). See the Appendix for details of the calculation of diffusion constants.

C. Convection in Addition to Diffusion as Another Mechanism for Migration

The assumption that human populations simply diffuse is a gross oversimplification of reality. We will thus consider the intriguing (but still grossly oversimplified) possibility of simultaneous diffusive and convective (bulk) transport. In the past century, a dramatic movement from rural to urban areas has caused Seoul’s population to increase by a factor of more than 50, tremendously outpacing Korea’s population growth as a whole [53]. This suggests the presence of a strong attractor or “sink” for the bulk flow of population into Seoul, as has been discussed in rural-urban labor migration studies [54]. The density-equalizing population cartogram [55] in Fig. I5 clearly demonstrates the rapid growth of Seoul and its surroundings between 1970 and 2010.

If convection (i.e., bulk flow) directed towards Seoul has indeed occurred throughout Korea while clans were simultaneously diffusing from their points of origin, then one ought to be able to detect a signature of such a flow. In Fig. 5(a), we show what we believe is such a signature: we observe that the fraction of ergodic clans increases with the distance between Seoul and a clan’s place of origin. This would be unexpected for a purely diffusive system or, indeed, in any other simple model that excludes convective transport. By allowing for bulk flow, we expect to observe that a clan’s members preferentially occupy territory in the flow path that is located geographically between the clan’s starting point and Seoul. For clans that start closer to Seoul, this path is short; for those that start farther away, the longer flow path ought

Distribution of administrative regions occupied by 본관





ergodic

non-ergodic



Surnames, Geoforum 42, 506 (2011).

[25] J. Novotný and J. A. Cheshire, The Surname Space of the Czech Republic: Examining Population Structure by Network Analysis of Spatial Co-Occurrence of Surnames, PLOS ONE 7, e48568 (2012).

[26] An original jokbo usually includes more details about clans, but we only use information that we mention explicitly.

[27] Google Maps, https://developers.google.com/maps/.

[28] In fact, it is known that clans that originated in the southern part of Korea have been more abundant in recent Korean history (including the period that is spanned by our data set) [20].

[29] Korean Statistical Information Service, http://kosis.kr/ (Korean version) and http://kosis.kr/eng/ (English version).

[30] Statistical Geographic Information Service, http://sgis.kostat.go.kr/ (Korean version; no English version available).

[31] J. P. Sethna, Statistical Mechanics: Entropy, Order Parameters and Complexity (Oxford University Press, Oxford, United Kingdom, 2006).

[32] J. van Lith, Ergodic Theory, Interpretations of Probability and the Foundations of Statistical Mechanics, Stud. Hist. Philos. Sci. 32, 581 (2001).

[33] Because we only have ten family books, our data allows only ten distinct possibilities for “clan j” (the family into which a woman marries).

[34] S. Cho, Chapter 8. Traditional Korean Culture, in Cultural and Ethnic Diversity: A Guide for Genetics Professionals, edited by N. L. Fisher (The Johns Hopkins University Press, Maryland, United States, 1996).

[35] F. Simini, M. C. González, A. Maritan, and A.-L. Barabási, A Universal Model for Mobility and Migration Patterns, Nature (London) 484, 96 (2012).

[36] A. P. Masucci, J. Serras, A. Johansson, and M. Batty, Gravity versus Radiation Models: On the Importance of Scale and Heterogeneity in Commuting Flows, Phys. Rev. E 88, 022812 (2013).

FIG. I2. Scatter plots of (a) the distance between the clan-origin location and the population centroid versus the number of administrative regions, (b) the distance between the clan-origin location and the population centroid versus the radius of gyration, and (c) the radius of gyration versus the number of administrative regions. The corresponding Pearson correlation values are (a) $r \approx 0.20$ (from 3 481 clans that include all of the required information; the p-value is $p \approx 5.7 \times 10^{-34}$), (b) $r \approx 0.066$ (from 3 481 clans that include all of the required information; the p-value is $p \approx 9.4 \times 10^{-5}$), and (c) $r \approx -0.26$ (from 3 481 clans; the p-value is $p \approx 1.8 \times 10^{-53}$). Note that correlations over limited ranges may be different and significant: for example, in panel (c), the two metrics for ergodicity are significantly positively correlated when $r_g < 50$ km ($r \approx 0.39$, $p \approx 2.3 \times 10^{-4}$).

FIG. I3. (a) Probability distribution of the number of different administrative regions occupied by a Czech family name in 2009. Note that the leftmost two bars have heights of 0.17 and 0.03 but have been truncated for visual presentation. (This data was initially analysed in Ref. [25].) (b) Probability distribution of the number of different administrative regions occupied by the clan of a Czech individual selected uniformly at random in 2009. The difference between this panel and the previous one arises from the fact that clans with larger populations tend to occupy more administrative regions. (c) Probability distribution of radii of gyration (in km) of Czech family names in 2009. Note that the leftmost bar has a height of 0.11 but has been truncated for visual presentation. (d) Probability distribution of radii of gyration (in km) of Czech family names of a Czech individual selected uniformly at random in 2009. The difference between this panel and the previous one arises from the fact that clans with larger populations tend to occupy more administrative regions. Observe that the distributions in panels (a) and (b) are starkly different from the distributions in panels (a) and (b) from Fig. 3. Solid curves are kernel density estimates (from Matlab R2011a’s ksdensity function with a Gaussian smoothing kernel of width 5).

surnames in Czech Republic

non-ergodic

data from J. Novotný and J. A. Cheshire, PLOS ONE 7, e48568 (2012).

then interpreted as a result of the subsequent resettlement and industrialization-led immigration into these areas, but also of some state policies that have contributed to the spatial concentrations (and often also segregations) of Roma minority groups [23].

An intriguing exception to these explanations is a typical Czech surname Vlček that can also be found in Figure 6 because of its significant revealed relatedness with Wolf. From all of the surnames considered, the name Wolf has been found as the nearest neighbour of Vlček, with the 56.7% probability that one of these surnames concentrates in the region where another one is concentrated. The high co-occurrence of these two surnames in the identical regions seems to be attributable to their common meaning: Vlček literally means “small Wolf” in the Czech language. The bi-lingual naming practices or secular name transformations taking place in these historically multi-ethnic regions (German and Czech) offer the most likely explanation for such commonalities.

Analysis of co-occurrence in municipalities

In the second stage of our analysis we examined the co-occurrence of Czech surnames at the finest spatial level of 6,244 municipalities. We began with the calculation of the pairwise indices of revealed relatedness ($D_{i,j,\mathrm{mun}}$) among 5,660 surnames selected on the basis of the highest revealed relatedness at more aggregate spatial level. This sample of surnames covers almost a half of the Czech male population. Given the significantly higher number of spatial units considered for this second stage of our analysis, the values of $D_{i,j,\mathrm{mun}}$ are generally lower than $D_{i,j,\mathrm{reg}}$ in the first stage, which focused on co-occurrence in 206 micro-regions only. At the same time, the size distribution of these second stage results is even more skewed to the right; the maximum $D_{i,j,\mathrm{mun}}$ (from the total of more than 32 million observations) corresponds to 0.687, while only 0.011% of all observations exceed 50% of the maximum value. These differences between the first and second stage results are understandable and go hand in hand with the expectation that the surname network based on the municipality level calculations will be more fragmented.

This has been confirmed by the fact that a majority of the most significant $D_{i,j,\mathrm{mun}}$ proximity observations occur among relatively rare surnames that are typically concentrated in a few nearby municipalities. This is especially the case of Silesian surnames that account for almost all $D_{i,j,\mathrm{mun}}$ observations at the very top of the distribution of results. As such, in order to get a reasonable network representation, we again had to impose some restrictions in relation to the minimal size of surnames shown as nodes and the strength of links between them. After applying the criteria from the previous section, we found the frequency of at least 150 bearers and the links determined by $D_{i,j,\mathrm{mun}} > 0.23$ to be optimal. The surname network based on these parameters and generated by a weighted force-directed algorithm is depicted in Figure 7 (Figure S5 depicts a high resolution version with labels of individual surnames and the size of nodes scaled by their population size).

In general, the second stage or municipality level surname network has reproduced the macro-division of the Czech surname space identified in the first stage and described above. The proportions between the sizes of the main clusters are however different, with the previously mentioned dominance of the dense group of Silesian surnames (B2). Regarding Moravian surnames, again the commonality of verb-derived surnames emerges, as they form the majority of names in the B1 area of the network. The Bohemian part of the surname space (A) is structured into three main groups of surnames. The A1 cluster comprises some of the most frequent surnames and those prevalent across most of Bohemian regions, whilst the separation from the secondary cluster (A2) is hardly discernible. By contrast, two other core areas are well recognizable and represent northern and eastern Bohemian names (A3) more specifically and surnames concentrated mainly in municipalities in the north-west and west of Bohemia (A4).

The general congruence in macro-structure of the surname networks constructed here and in the first stage of our analysis is an important finding (generally similar macro-structure was also found when the $J_{i,j,\mathrm{reg}}$ and $J_{i,j,\mathrm{mun}}$ were considered instead of the $D_{i,j,\mathrm{reg}}$ and $D_{i,j,\mathrm{mun}}$, respectively). However, the main value of this second stage municipality level exercise should be seen in individual details uncovered with respect to local parts of the surname network. A number of interesting examples of pairs of surnames that have been found as potentially closely related,

Figure 4. Spatial concentration of individual communities of high degree surnames. Individual maps show regional variation in the percentage of high degree surnames from particular core communities (A1, A2, A3, A4, B1, B2) as listed in Table 5 concentrated in a given region. For example, if the percentage of high degree surnames for A1 (the upper left map) corresponds to 100, then all the surnames listed in the A1 group in Table 5 are concentrated in a given region (that is, all of them satisfy $LQ_{i,r} > 1$ for the region in question). doi:10.1371/journal.pone.0048568.g004


More sophisticated measure of ergodicity & simple diffusion dynamics

“clan density anomaly”:

$$\phi_i(\mathbf{r}_k, t) = c_i(\mathbf{r}_k, t) - \frac{m_i(t)}{N(t)} \, \rho(\mathbf{r}_k, t),$$

where $\mathbf{r}_k$ is the two-dimensional coordinate vector of region $k$, $c_i(\mathbf{r}_k, t)$ is clan $i$’s population density in region $k$ at time $t$, $m_i(t)$ is the total population of clan $i$ in the country at time $t$, $N(t)$ is the total population (all clans) in the country at time $t$, and $\rho(\mathbf{r}_k, t)$ is the total population density (all clans) in region $k$ at time $t$.
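A minimal sketch of the anomaly at a single time, assuming the clan densities and total densities are given as two lists with one entry per region (the function name, argument names, and numbers are hypothetical):

```python
def density_anomaly(clan_density, total_density, clan_population, total_population):
    """phi_i(r_k) = c_i(r_k) - (m_i / N) * rho(r_k), evaluated for every region k."""
    share = clan_population / total_population  # m_i / N
    return [c - share * rho for c, rho in zip(clan_density, total_density)]


# Made-up example: the clan is 10% of the country but concentrated in region 0.
phi = density_anomaly(
    clan_density=[2.0, 0.0],
    total_density=[10.0, 10.0],
    clan_population=10.0,
    total_population=100.0,
)
```

A positive entry marks a region where the clan is denser than its national share would suggest, and a negative entry marks a region where it is sparser.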


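clan $i$’s radius of gyration of anomaly:

$$r_g^i = \sqrt{\frac{1}{\phi_{i,\mathrm{tot}}(t)} \sum_k \left[ \lVert \mathbf{r}_k - \mathbf{r}_{i,C}(t) \rVert^2 \, \phi_i(\mathbf{r}_k, t) \, A_k \right]},$$

where $\phi_{i,\mathrm{tot}}(t) = \sum_k \phi_i(\mathbf{r}_k, t) A_k$, $\mathbf{r}_{i,C}(t) = [1/\phi_{i,\mathrm{tot}}(t)] \sum_k \mathbf{r}_k \phi_i(\mathbf{r}_k, t) A_k$, and $A_k$ is the area of region $k$.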
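The centroid and radius of gyration can be sketched as a weighted second moment with weights $\phi_i(\mathbf{r}_k, t) A_k$. One caveat: summed over every region of the country, these weights cancel by construction (the clan densities integrate to $m_i$ and the total densities to $N$), so some restriction of the regions is needed, and the slide does not say which one is used. The sketch below therefore takes the weights as given and simply refuses a non-positive total; all names and numbers are hypothetical.

```python
import math


def weighted_centroid(coords, weights):
    """Return (centroid, total weight) for 2D points with the given weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    cx = sum(x * w for (x, _), w in zip(coords, weights)) / total
    cy = sum(y * w for (_, y), w in zip(coords, weights)) / total
    return (cx, cy), total


def radius_of_gyration(coords, phi, areas):
    """sqrt( sum_k |r_k - r_C|^2 * phi_k * A_k / sum_k phi_k * A_k )."""
    weights = [p * a for p, a in zip(phi, areas)]
    (cx, cy), total = weighted_centroid(coords, weights)
    second_moment = sum(
        ((x - cx) ** 2 + (y - cy) ** 2) * w for (x, y), w in zip(coords, weights)
    )
    return math.sqrt(second_moment / total)


# Made-up example: two regions 2 km apart with equal positive anomaly.
r_g = radius_of_gyration([(0.0, 0.0), (2.0, 0.0)], phi=[1.0, 1.0], areas=[1.0, 1.0])
```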



assuming a simple diffusive process over time: measuring the diffusion constants by numerically applying the partial differential equation for $\phi_i$ at the two endpoints (“boundary conditions”) in time

assuming that the spatial variation of $(m_i/N)\rho$ is much smaller than that of $c_i$

a very simple diffusion model described by Fick’s second law:

the flux of clan members

$$\vec{J}_i \propto \vec{\nabla} \phi_i$$

(individuals move preferentially away from high concentrations of their family),

then

$$\frac{\partial c_i}{\partial t} = \vec{\nabla} \cdot \vec{J}_i \propto \nabla^2 \phi_i$$

(assuming no spatial variation in the constant of proportionality), or

$$\frac{\partial \phi_i}{\partial t} = D_i \nabla^2 \phi_i,$$

with the diffusion constant $D_i$ with dimensions [length$^2$/time].
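The slides do not spell out the numerical scheme, so the following is only one possible sketch of estimating $D_i$ from two snapshots. It assumes, purely for illustration, that the anomaly has been resampled onto a regular square grid with spacing `h`; it approximates $\partial \phi_i / \partial t$ by a finite difference between the two times and $\nabla^2 \phi_i$ by the five-point stencil at the earlier time, then takes the least-squares value of $D_i$ over the interior points. All names are hypothetical.

```python
def laplacian_interior(grid, h):
    """Five-point-stencil Laplacian at interior points of a regular 2D grid."""
    rows, cols = len(grid), len(grid[0])
    lap = {}
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            lap[(i, j)] = (
                grid[i + 1][j] + grid[i - 1][j]
                + grid[i][j + 1] + grid[i][j - 1]
                - 4.0 * grid[i][j]
            ) / h**2
    return lap


def estimate_diffusion_constant(phi_t0, phi_t1, dt, h):
    """Least-squares D in (phi_t1 - phi_t0) / dt = D * laplacian(phi_t0)."""
    lap = laplacian_interior(phi_t0, h)
    num = sum(((phi_t1[i][j] - phi_t0[i][j]) / dt) * value
              for (i, j), value in lap.items())
    den = sum(value * value for value in lap.values())
    return num / den


# Synthetic check: phi = x^2 + y^2 has Laplacian 4, so adding D * 4 * dt
# everywhere should return the D used to build the second snapshot.
size, D_true, dt = 5, 2.5, 15.0
phi0 = [[float(i * i + j * j) for j in range(size)] for i in range(size)]
phi1 = [[value + D_true * 4.0 * dt for value in row] for row in phi0]
D_est = estimate_diffusion_constant(phi0, phi1, dt, h=1.0)
```

Nothing in this estimator forces the result to be positive, which fits the remark in the FIG. 4 discussion that some estimated values come out negative.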


본관 density anomaly of Cho from Haman (함안 조)

FIG. 3. Example of clan anomaly picture for clan 1, 1985 vs. 2000, using Danny’s anomaly definition. Circle is centered at CM position with radius set to radius of gyration. Here diffusive effects are visible.

FIG. 4. Example of clan anomaly picture for Kim from Gimhae, 1985 vs. 2000, using Danny’s anomaly definition. Circle is centered at CM position with radius set to radius of gyration. (Note position and size of circle are reasonable.)

large $\phi$

small $\phi$


distribution of estimated diffusion constants

FIG. 5. Fraction of ergodic clans and distance scales of clans. (a) Fraction of ergodic clans versus distance from clan origin location to Seoul (km). The correlation between the variables is positive and statistically significant. (The Pearson correlation coefficient is $r \approx 0.83$, and the p-value is $p \approx 0.0017$.) For the purpose of this calculation, we call a clan “ergodic” if it is present in at least 150 administrative regions. We estimate this fraction separately in each of 11 equally-sized bins over the displayed range of distances. The gray regions give 95% confidence intervals. (b) Fraction of ergodic clans versus the distance between location of clan origin and the present-day centroid (km). We measure ergodicity as in the left panel, and we estimate the fraction separately for each range of binned distances. (We use the same bins as in the left panel.) The correlation between the variables is significantly positive up to 150 km ($r \approx 0.94$, $p \approx 0.0098$) and is significantly negative ($r \approx -0.98$, $p \approx 2.4 \times 10^{-4}$) for larger distances.
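The binning described in this caption can be sketched as follows. The threshold (150 regions) and the number of bins (11) come from the caption; the 325 km range is read off the figure axis, and the function names, data layout, and toy numbers are assumptions made for illustration.

```python
import math


def fraction_ergodic_by_bin(distances, regions, n_bins=11, max_dist=325.0, threshold=150):
    """Fraction of clans occupying >= threshold regions in each equal-width distance bin.

    Empty bins are reported as None.
    """
    width = max_dist / n_bins
    totals = [0] * n_bins
    ergodic = [0] * n_bins
    for d, n in zip(distances, regions):
        b = min(int(d // width), n_bins - 1)  # clamp the upper edge into the last bin
        totals[b] += 1
        if n >= threshold:
            ergodic[b] += 1
    return [e / t if t else None for e, t in zip(ergodic, totals)]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data: four clans, two bins.
fractions = fraction_ergodic_by_bin(
    distances=[10.0, 20.0, 300.0, 310.0],
    regions=[200, 100, 200, 160],
    n_bins=2,
)
```

The correlation quoted in the caption would then be computed between the bin centers and the per-bin fractions, after dropping any empty bins.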

to contribute to an increased number of administrative regions occupied and hence to a greater aggregate ergodicity. We plot the fraction of ergodic clans versus the distance a clan has moved (which we estimate by calculating distances between clan-origin locations and the corresponding modern clan centroids) in Fig. 5(b). This further supports our claim that both convective and diffusive transport have occurred. In addition to the fraction of ergodic clans, we also compared the radii of gyration r_g to the distance to Seoul and to the distance between location of clan origin and the present-day centroid, as shown in Fig. I6. The latter shows the same tendency as the fraction of ergodic clans, which further supports our claim. We speculate that the absence of statistical significance of the correlation between r_g and the distance to Seoul is caused by a sampling problem: many small clans are not included because their centroids cannot be determined (see Appendix B).

We assume that clans that have moved a greater distance have also existed for a longer time and hence have undergone diffusion longer; we thus also expect such clans to be more ergodic. This is consistent with our observations in Fig. 5(b) for distances less than about 150 km, but it is difficult to use the same logic to explain our observations for distances greater than 150 km. However, if one assumes that long-distance moves are more likely to arise from convective effects than from diffusive ones, then our observations for both short and long distances become understandable: the fraction of moves from bulk-flow effects like resettlement or transplantation is larger for long-distance moves, and they become increasingly dominant as the distance approaches 325 km (roughly the size of the Korean peninsula). We speculate that the clans that moved farther than 150 km are likely to be ones that originated in the most remote areas of Korea, or even outside of Korea, and that they have only relatively recently been transplanted to major Korean population centers, from which they have had little time to spread. This observation is necessarily speculative because the age of a clan is not easy to determine: the first entry in a jokbo (see Table A-I for our ten jokbo) could have resulted from the invention of characters or printing devices rather than from the true birth of a clan [20].

Ultimately, our data are insufficient to definitively accept or reject the hypothesis of human diffusion. However, as our analysis demonstrates, our data are consistent with the theory of simultaneous human “diffusion” and “convection.” Furthermore, our analysis suggests that if the hypothesis of pure diffusion is correct, then our estimated diffusion constants indicate a possible time scale for relaxation to a dynamic equilibrium and thus for mixing in human societies. In mainland South Korea, it would take approximately (100000 km²)/(1.5 km²/year) ≈ 67000 years for purely diffusive mixing to produce a well-mixed society. A convective process thus appears to be playing the important role of promoting human interaction by accelerating mixing in the population. In spite of such limitations, we try to estimate and quantify the centrality of Seoul in the context of a population flow network model and find other distinctive characteristics of the population flow patterns of ergodic and non-ergodic clans. For details, see Appendix H.
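The time-scale estimate quoted above is a one-line dimensional argument, T ≈ A/D, and can be checked directly. This is a minimal sketch that uses only the two numbers given in the text; the function and constant names are illustrative.

```python
AREA_KM2 = 100_000.0           # approximate area of mainland South Korea (from the text)
DIFFUSION_KM2_PER_YEAR = 1.5   # estimated diffusion constant (from the text)

def diffusive_mixing_time_years(area_km2, diffusion_km2_per_year):
    """Time scale T = A / D for purely diffusive spreading over an area A."""
    if diffusion_km2_per_year <= 0:
        raise ValueError("diffusion constant must be positive")
    return area_km2 / diffusion_km2_per_year

t_mix = diffusive_mixing_time_years(AREA_KM2, DIFFUSION_KM2_PER_YEAR)
print(f"{t_mix:.0f} years")  # 66667 years, i.e. roughly 67000 years
```

Because T scales inversely with D, doubling the diffusion constant would only halve the estimate, which still leaves a time scale far longer than recorded history.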

V. CONCLUSIONS

The long history of detailed record-keeping in Korean culture provides an unusual opportunity for quantitative research on historical human mobility and migration, and our investigation strongly suggests that both “diffusive” and “convective” patterns have played important roles in establishing the current distribution of clans in Korea.

By studying the geographical locations of clan origins in jokbo (Korean family books), we have quantified the extent of “ergodicity” of Korean clans as reflected in time series of marriage snapshots. This illustrates the need to investigate the distribution of individual clans in more detail. Additionally, by comparing our results from Korean clans to those from Czech families, we have also demonstrated that these ideas can give insightful indications of different mobility and migration patterns in different cultures. Our ergodicity analysis using modern census data clearly illustrates that there are both ergodic and non-ergodic clans, and we have used these results to suggest two different mechanisms for human migration on long time scales. Many processes involve the attractiveness of the center versus diffusion away from it, so we believe that this type of modeling framework can be applied to many data sets.

A noteworthy feature of our analysis is that we used both data with high temporal resolution but low spatial resolution (jokbo data) and data with high spatial resolution but low temporal resolution (census data). This allowed us to consider

“diffusion” vs “convection” of human migration?

[Fig. 5, panels (a) and (b): fraction of ergodic 본관 (clans), with the following annotations on the slide.]

diffusion part? convection part?

cannot be expected in a purely diffusive system!

Summary and Outlook

• Human migration (long time scale) throughout history by combining two complementary data sets: Korean jokbo (족보) (good temporal resolution) + modern census data (good spatial resolution)

• Marriage pattern (short time scale: “snapshots”) based on jokbo: describable by simple human migration models?

• The “ergodicity” (a result of migration) of clans (본관)

• Diffusive vs convective migration

• More data (more jokbo, census, other countries’ data) → better model/prediction

Thank you for your attention!

SHL, R. Ffrancon, D. M. Abrams, B. J. Kim (김범준), and M. A. Porter, Phys. Rev. X 4, 041009 (2014).

Acknowledgments

Hawoong [Jeong (from Dongrae)]: [(동래)정]하웅 (족보 (jokbo) data), Josef Novotný (Czech family name data)

Mason Porter (University of Oxford)

Beom Jun [Kim (from Gimhae)]: [(김해)김]범준 (Sungkyunkwan University)

Danny Abrams (Northwestern University)

Robyn Ffrancon (University of Gothenburg)



easy to determine. The first entry in a jokbo (see Table I for our ten jokbo) could have resulted from the invention of characters or printing devices rather than from the true birth of a clan [20].

Ultimately, our data are insufficient to definitively accept or reject the hypothesis of human diffusion. However, as our analysis demonstrates, our data are consistent with the theory of simultaneous human “diffusion” and “convection.” Furthermore, our analysis suggests that if the hypothesis of pure diffusion is correct, then our estimated diffusion constants indicate a possible time scale for relaxation to a dynamic equilibrium and thus for mixing in human societies. In mainland South Korea, it would take approximately (100 000 km²)/(1.5 km²/year) ≈ 67 000 years for purely diffusive mixing to produce a well-mixed society. A convective process thus appears to be playing the important role of promoting human interaction by accelerating mixing in the population. Despite the limitations imposed by our data, we try to estimate and quantify the centrality of Seoul using a network-flow model for population, and we find suggestive differences between the flow patterns of ergodic and nonergodic clans. For details, see Appendix H.

V. CONCLUSIONS

The long history of detailed record-keeping in Korean culture provides an unusual opportunity for quantitative research on historical human mobility and migration, and our investigation strongly suggests that both “diffusive” and “convective” patterns have played important roles in establishing the current distribution of clans in Korea.

By studying the geographical locations of clan origins in jokbo (Korean family books), we have quantified the extent of “ergodicity” of Korean clans as reflected in time series of marriage snapshots. This underscores the utility of investigating the location distributions of individual clans. Additionally, by comparing our results from Korean clans to those from Czech families, we have also demonstrated that our approach can give insightful indications of different mobility and migration patterns in different cultures. Our ergodicity analysis using modern census data clearly illustrates that there are both ergodic and nonergodic clans, and we have used these results to suggest two different mechanisms for human migration on long time scales. Many mobility processes involve a balance between diffusive spreading and the attractiveness of one or more central locations (and between more general diffusive and convective fluxes), so we believe that our approach in the present paper will be valuable for many situations.

A noteworthy feature of our analysis is that we used both data with high temporal resolution but low spatial resolution (jokbo data) and data with high spatial resolution but low temporal resolution (census data). This allowed us to consider both the patterns of human movement on short time scales (mobility via individual marriage processes) and their consequences for human locations on long time scales (human migration via clan ergodicity). An interesting further wrinkle would be to compare such mobility-derived time scales for human mixing patterns to genetically-derived patterns [56–59].

From a more general perspective, our research has allowed us to test the idea of using a physical analogy for modeling human migration, an idea put forth (but not quantified) as early as the 19th century [1–10]. Physics-inspired ideas have been very successful for the study of human mobility, which occurs on shorter time scales than human migration, and we propose that Ravenstein was correct when he posited that such ideas are also useful for human migration.

ACKNOWLEDGMENTS

We thank Hawoong Jeong (정하웅) for providing data from Korean family books and Josef Novotný for providing data on surnames in the Czech Republic. We thank Tim Evans for introducing us to helpful references and Erik Bollt, Valentin Danchev, Sandra González-Bailón, James Irish, Philip Kreager, Michael Murphy, and Tommy Murphy for helpful comments and discussions. We thank Marc Barthelemy and Richard Morris for details about their work on constructing flow networks [64], and we thank the anonymous referees for their helpful comments and suggestions. M. A. P. and S. H. L. acknowledge support from the Engineering and Physical Sciences Research Council (EPSRC) through Grant No. EP/J001759/1. B. J. K. was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (No. 2014R1A2A2A01004919). D. M. A. was supported by Grant No. 220020230 from the James S. McDonnell Foundation. S. H. L. did the majority of his work at the University of Oxford.

S. H. Lee and R. Ffrancon contributed equally to this work.

APPENDIX A: JOKBO DATA

In our investigation, we examine ten digitized jokbo that were first studied in Ref. [21]. In Table I in the main text, we give basic information about the ten jokbo, and we now summarize the results of some of our computations.

First, we apply the same gravity-model fit that we used for jokbo 1 to all of the jokbo data, and the results do not deviate much from those for jokbo 1. That is, γ ≈ 0 and α ≈ 1, so we can apply the population-product model with α = 1. The largest deviations in the two parameter values are α ≈ 1.4930 (for jokbo 7) and γ ≈ 0.5377 (for jokbo 6). Interestingly, we could not find any empirical value of γ < 0.6 reported in the literature [4,5,45–49], and it seems to be extremely rare to report any empirical values at all for gravity-model parameters. As one can see in Fig. 6, the

MATCHMAKER, MATCHMAKER, MAKE ME A MATCH: … PHYS. REV. X 4, 041009 (2014)
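The gravity-model fit discussed in Appendix A can be illustrated with a small sketch. It assumes the flux has the form G_ij = (m_i m_j)^α / r_ij^γ, which reduces to the population product for α = 1 and γ = 0, and it fits N_i = a_G G_ij by least squares through the origin. The exact functional form, the fitting routine, the function names, and all numbers below are assumptions for illustration, not the authors' code or data.

```python
import numpy as np

def gravity_flux(m_i, m_j, r_ij, alpha=1.0, gamma=0.0):
    """Assumed gravity-model form: G_ij = (m_i * m_j)**alpha / r_ij**gamma."""
    m_i = np.asarray(m_i, dtype=float)
    r_ij = np.asarray(r_ij, dtype=float)
    return (m_i * m_j) ** alpha / r_ij ** gamma

def fit_through_origin(flux, counts):
    """Least-squares slope a for counts ≈ a * flux (no intercept)."""
    flux = np.asarray(flux, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return float(flux @ counts / (flux @ flux))

# Hypothetical toy data: three bride clans i marrying into one jokbo clan j.
m_i = [1_000, 5_000, 20_000]   # clan populations (made up)
m_j = 3_000                    # population of the jokbo clan (made up)
r_ij = [50.0, 120.0, 300.0]    # centroid distances in km (made up)

flux = gravity_flux(m_i, m_j, r_ij)           # alpha = 1, gamma = 0
counts = 2e-6 * flux                          # synthetic entries with a known slope
a_G = fit_through_origin(flux, counts)
print(a_G)
```

With γ = 0 the distances drop out entirely, which is why the text can call this special case a population-product model.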



introduced in . . .

Sang Hoon Lee/Sungkyunkwan University


Sang Hoon Lee (이상훈), Robyn Ffrancon, Daniel M. Abrams, Beom Jun Kim (김범준), and Mason A. Porter
Phys. Rev. X 4, 041009 (2014)
Published October 16, 2014

Synopsis: Wedding Registries Reveal Migration Paths

Patterns in human movement have been studied using traffic data, mobile phone records, and even dollar bill circulation. A new investigation uses centuries-old Korean family books, called jokbos (족보 in Korean), as a record of emigration by female brides. The analysis of this movement, as well as more recent census data, shows that both diffusion- and convection-like phenomena may play a role in mixing different populations.

Understanding human mobility is important for improving city planning and for responding to outbreaks of disease. Previous modeling work has suggested that the statistical patterns in human movement resemble certain physical phenomena, such as the diffusion of molecules in liquids. However, these patterns are drawn primarily from short-term (a day to at most a year) tracking data, so it’s unclear whether the same models apply to longer-term migrations that affect cultural and genetic dissemination.

In Korea, marriages have traditionally involved the relocation of the bride to her groom’s home. Sang Hoon Lee of Sungkyunkwan University, Korea, and his colleagues analyzed 200 000 of these marriages catalogued in ten jokbos that date as far back as the 13th century. The books don’t provide specific locations, so the researchers inferred migration paths from the geographic regions associated with each bride and groom’s clan names. Because the bride migration rate between regions was primarily dependent on clan population densities, the team tried to explain migration with a model in which Koreans moved randomly away (diffused) from high concentrations of their own clan. However, the estimated diffusion rate was too slow to explain how well certain clans have spread throughout the country. The team concluded that a directed convective-like flow towards the capital city Seoul enhanced the mixing rate. According to the authors, the combination of diffusion and convection may explain other human flow patterns.

This research is published in Physical Review X.

–Michael Schirber
