The big data revolution in economic history

Economic history has always been a strongly data-oriented field of research. As such, the big data revolution should have profound implications for economic historians. Increases in computing power, advances in distributed computing, and new methods developed in fields like machine learning and natural language processing are changing the landscape. Data has become far larger in volume, is more varied, and is routinely linked to other datasets. It is widely believed that these developments will revolutionise the process of scientific discovery (King 2011).

The creation of these large volumes of data is a recent phenomenon, though, so it is not evident that a field studying the past should be part of that development. However, historians are now also faced with new kinds of data, such as large volumes of text or data gathered by crowd-sourcing. Moreover, detailed, promising data on new regions is becoming available (Dong et al. 2015; Fourie 2016) and research is becoming ever more interdisciplinary (Turchin et al. 2015).

Indeed, a number of highly successful big data projects exist in economic history. Clio-Infra has brought together historical macro-data to analyse global inequality and wellbeing (Van Zanden et al. 2014). The census microdata from the North Atlantic Population Project and the Integrated Public Use Microdata Series (Ruggles et al. 2011; Ruggles et al. 2015) are key to recent economic historical research (e.g. Abramitzky et al. 2014; Long and Ferrie 2013). To an extent, even crowd-sourcing has a long tradition in economic history, with volunteer work on the English parish records allowing the construction of the famous Cambridge family reconstitution data (Wrigley and Schofield 1989).

While the field’s record is strong, it is important to ensure that economic history benefits from the new possibilities that are opening up in a world of big data. To this end, this session will bring together scholars working on large datasets and new technologies in economic history and adjacent fields. It will include both methodological papers that reflect on and disseminate best practice and research papers focusing on the actual answers these new techniques can provide.

References

Abramitzky, Ran, Leah Platt Boustan, and Katherine Eriksson. 2014. “A Nation of Immigrants: Assimilation and Economic Outcomes in the Age of Mass Migration.” Journal of Political Economy 122 (3): 467–506.
Dong, Hao, Cameron Campbell, Satomi Kurosu, Wenshan Yang, and James Z. Lee. 2015. “New Sources for Comparative Social Science: Historical Population Panel Data From East Asia.” Demography, May, 1–28.
Fourie, Johan. 2016. “The Data Revolution in African Economic History.” Journal of Interdisciplinary History 47 (2): 1–20.
Long, Jason, and Joseph Ferrie. 2013. “Intergenerational Occupational Mobility in Great Britain and the United States Since 1850.” The American Economic Review 103 (4): 1109–37.
King, Gary. 2011. “Ensuring the Data-Rich Future of the Social Sciences.” Science 331 (6018): 719–21.
Ruggles, Steven, Katie Genadek, Ron Goeken, Josiah Grover, and Matthew Sobek. 2015. Integrated Public Use Microdata Series: Version 6.0 [Machine-Readable Database]. Minneapolis, Minn.: University of Minnesota.
Ruggles, Steven, Evan Roberts, Sula Sarkar, and Matthew Sobek. 2011. “The North Atlantic Population Project: Progress and Prospects.” Historical Methods: A Journal of Quantitative and Interdisciplinary History 44 (1): 1–6.
Turchin, Peter, Rob Brennan, Thomas Currie, Kevin Feeney, Pieter Francois, Daniel Hoyer, Joseph Manning, et al. 2015. “Seshat: The Global History Databank.” Cliodynamics: The Journal of Quantitative History and Cultural Evolution 6 (1).
Wrigley, Edward Anthony, and Roger S. Schofield. 1989. The Population History of England 1541–1871. Cambridge University Press.

Organizer(s)

  • Auke A Rijpma Universiteit Utrecht a.rijpma@uu.nl
  • Pieter PW Francois University of Oxford and University of Hertfordshire pwfrancois79@gmail.com

Session members

  • Johan Fourie, Stellenbosch University
  • Jeanne Cilliers, Lund University
  • Erik Green, Lund University
  • Dieter von Fintel, Stellenbosch University
  • James Z Lee, Hong Kong University of Science and Technology
  • Cameron Campbell, Hong Kong University of Science and Technology
  • Hao Dong, Peking University
  • Patrick Manning, University of Pittsburgh
  • Sangkuk Lee, Ajou University
  • Jong Hee Park, Seoul National University
  • Evan Roberts, University of Minnesota
  • Peter Turchin, University of Connecticut
  • Javier Mejia, Los Andes University
  • Richard Zijdeman, International Institute of Social History
  • Ruben Schalk, Universiteit Utrecht
  • Albert Meroño-Peñuela, Vrije Universiteit Amsterdam
  • Laurens Rietveld, Vrije Universiteit Amsterdam
  • Bram van den Hout, International Institute of Social History
  • Auke Rijpma, Utrecht University

Discussant(s)

Papers

Panel abstract

Economic history has always been a strongly data-oriented field. As such, the big data revolution should have profound implications for economic historians. Increases in computing power, advances in distributed computing, and new methods developed in fields like machine learning and natural language processing are changing the landscape. Data has become far larger in volume, is more varied, and is routinely linked to other datasets. A number of highly successful big data projects exist in economic history. While the field’s record is strong, it is important to ensure that economic history benefits from the new possibilities that are opening up in a world of big data. This session will bring together scholars working on large datasets and new technologies in economic history and adjacent fields, featuring methodological papers that reflect on and disseminate best practice as well as research papers focusing on the actual answers these techniques can provide.

1st half

Big (panel) data: The Cape of Good Hope Panel project

Johan Fourie, Erik Green, Jeanne Cilliers, Auke Rijpma and Dieter von Fintel

The Cape of Good Hope Panel, under construction since 2015, will be an annual longitudinal dataset of a complete settler population over more than 150 years, tracking multiple generations across a period that stretches beyond any one lifetime. The dataset will include inputs such as household size and labour employed and outputs such as grain, wine and stock. By matching this dataset to probate inventories, auction rolls and genealogical information, we will be able to construct, for the first time on this scale, an intergenerational panel that includes measures of income and wealth over more than five generations. The Cape of Good Hope Panel will thus create a ‘laboratory’ where social scientists can identify the causal effects of exogenous demand, supply and institutional shocks on human behaviour, and help us to understand the persistence of wealth and inequality within families, and the capacity for society to change over the long run.

Record linkage and crowd-sourcing: Solving classification problems to create big data

Evan Roberts

The accessibility of machine-readable population data for studying economic and social history has grown significantly in the past two decades, largely as a result of collaboration with genealogists. Yet there are still significant amounts of systematic social data that remain in published analog format, or in archival collections. Thus research with big population data is currently addressing two important and related challenges. Firstly, there is now sufficient big data in some jurisdictions that multi-generational panels of individual life courses can be constructed through machine linkage techniques. Secondly, in areas where population data has not yet been made machine-readable, scholars are beginning to use citizen science (crowd-sourcing) techniques to collect large datasets. Despite some apparent differences, both citizen science and record linkage share the common challenge of large-scale "entity resolution" and "classification".
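
To illustrate how record linkage can be framed as a pairwise classification problem, the following is a minimal Python sketch. It is not the method used in this paper: the field names, the example records, the similarity measure (the standard library's SequenceMatcher) and both thresholds are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity score between two normalised strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def classify_pair(rec_a, rec_b, threshold=0.85, max_year_gap=2):
    """Classify a candidate pair as a match (True) or a non-match (False).

    Both cut-offs are illustrative assumptions, not recommended values.
    """
    if abs(rec_a["birth_year"] - rec_b["birth_year"]) > max_year_gap:
        return False
    name_a = rec_a["first"] + " " + rec_a["last"]
    name_b = rec_b["first"] + " " + rec_b["last"]
    return similarity(name_a, name_b) >= threshold


def link(records_a, records_b, **kwargs):
    """Return the (id_a, id_b) pairs classified as the same person."""
    return [
        (a["id"], b["id"])
        for a in records_a
        for b in records_b
        if classify_pair(a, b, **kwargs)
    ]


# Hypothetical records from two sources, with a spelling variant and
# a one-year discrepancy in the reported year of birth.
source_a = [
    {"id": "a1", "first": "Johannes", "last": "Smit", "birth_year": 1851},
    {"id": "a2", "first": "Maria", "last": "Jansen", "birth_year": 1860},
]
source_b = [
    {"id": "b1", "first": "Johannes", "last": "Smith", "birth_year": 1852},
    {"id": "b2", "first": "Pieter", "last": "Botha", "birth_year": 1851},
]

links = link(source_a, source_b)
print(links)
```

Comparing every pair, as this sketch does, is only practical for small files. The same classification step could equally be applied to competing transcriptions of one record produced by volunteers.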

Producing a High-Ranking Official for Preserving the Mass of a Lineage Status in Joseon Korea

Sangkuk Lee, Seongmin Mun, Jae-Hun Jung

This paper investigates how lineages passed their social status on to the next generation in pre-modern Korea. We construct a large dataset covering multiple generations of 15,000 successful candidates in the civil service examination throughout the Joseon dynasty (1392–1910), based on the Mungwa-bangmok, the list of successful examination candidates. We add multilateral kin information on these 15,000 candidates, both patrilineal and matrilineal, from various related family records, jokbo, and the Annals of the Joseon Dynasty. Using this historical big data, we argue that in Joseon Korea, helping members pass the civil examination was the most conventional way for lineages to increase or sustain their power. Moreover, we present empirical evidence on each lineage’s strategies for maintaining its social and political position through success in the civil examination, and on lineages’ preference for the quality of offices held by descendants over their quantity. In doing so, we test our hypothesis for multigenerational mobility processes at the lineage level rather than the individual level, because success in obtaining a high-ranking office was the product of lineage strategy, not just individual strategy.

Big Projects or Small? Scope, scale, and documentation in historical big-data projects

Patrick Manning

The potential benefits of big-data approaches for economic history are immense. Nevertheless, economic-historical data exhibit heterogeneity on multiple scales, which complicates analysis considerably. Topics of analysis are heterogeneous; digitization of small files is complex; primary data vary by language, space, time, weights and measures; and theory and analysis address multiple variables. This presentation addresses heterogeneity in research by reviewing projects from the scale of CLIO-INFRA to individual big-data projects, with attention to their varying objectives, design, organizational structure, and results. The presentation will also explore efforts at collaboration within and among projects, as well as the challenges of funding big-data research in economic history. The conclusion emphasizes the need for strategic discussion among researchers in economic history, to achieve the maximum benefit in new research results while working with limited resources.

2nd half

Evolutionary Pathways to Statehood: an Analysis with Seshat: Global History Databank

Peter Turchin

Over the past 10,000 years human societies evolved from “simple” (small egalitarian groups integrated by face-to-face interactions) to “complex” (huge anonymous societies with great differentials in wealth and power, extensive division of labor, elaborate governance structures, and sophisticated information systems). One aspect of this “major evolutionary transition” that continues to excite intense debate is the origins and evolution of the state, a politically centralized territorial polity with internally specialized administrative organization. Different theories proposed by early theorists and contemporary social scientists make different predictions about the causal processes driving the rise of state-level social organization. I will use Seshat: Global History Databank to empirically test predictions of several such theories. I will present results of a dynamical regression analysis that estimates how the evolution of specialized governance structures was affected by such factors as social scale (population, territorial expansion), social stratification, provision of public goods, and information systems.

Social Networks and Entrepreneurship. Evidence from a Historical Episode of Industrialization

Javier Mejia

This paper explores the relationship between social networks and entrepreneurship by constructing a dynamic social network from archival records. The network corresponds to the elite of a society in transition to modernity, characterized by difficult geographical conditions, market failures, and weak state capacity: late 19th- and early 20th-century Antioquia (Colombia). With these data, I estimate how the decision to found industrial firms related to the position of individuals in the social network. I find that individuals who were more important in bridging the network were more involved in industrial entrepreneurship. The rationale for these results is that industrial entrepreneurship was a highly complex activity that required a wide variety of complementary resources. Networks operated as substitutes for markets in the acquisition of these resources. Thus, individuals with network positions that favored the combination of a broad set of resources had a comparative advantage in industrial entrepreneurship. I run several tests of this rationale.
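
One common way to quantify how much an individual bridges a network is betweenness centrality: the number of shortest paths between other pairs that pass through that individual. The abstract does not name the measure used, so the choice of betweenness is an assumption here, and the small network and its node names are hypothetical. The sketch below implements Brandes' algorithm for an unweighted, undirected graph using only the Python standard library.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality via Brandes' algorithm.

    `graph` maps each node to the set of its neighbours and is assumed
    to be undirected and unweighted.
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in graph}
        n_paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += n_paths[v] / n_paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical elite network: two tightly knit groups joined by one broker.
network = {
    "landowner_1": {"landowner_2", "broker"},
    "landowner_2": {"landowner_1", "broker"},
    "broker": {"landowner_1", "landowner_2", "merchant_1", "merchant_2"},
    "merchant_1": {"merchant_2", "broker"},
    "merchant_2": {"merchant_1", "broker"},
}

scores = betweenness(network)
print(scores)
```

In this toy network only the broker lies on shortest paths between the two groups, so the broker receives all of the betweenness. A score of this kind could then serve as an explanatory variable for the decision to found a firm.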

Big Data for East Asian Economic and Social History

James Lee, Cameron Campbell, Dong Hao

New historical microdata for East Asia are transforming the socio-economic history of the region. New historical population databases allow for large-scale comparative historical demographic studies in the tradition of the Eurasia Population and Family History Project. In addition, equally large new databases of government personnel and educated elites are ready, or soon will be ready, for public release. We introduce the China Multigenerational Panel Databases and three new databases to show that these new data allow us to extend the study of inheritance from tracing wealth over two generations to a deeper understanding of pedigree over multiple centuries and different historical contexts. They also extend the study of work from entry and exit to a comparative understanding of profession, workplace, and career, that is, the patterns of promotion and demotion, especially in the face of crisis, between different regions and historical periods.

Linked Data for economic and social historians

Albert Meroño-Peñuela, Laurens Rietveld, Auke Rijpma, Ruben Schalk, Bram van den Hout, Roderick van der Weerdt, Richard Zijdeman

Linked Data (LD) is a recent data technology that is finding rapid uptake in business, government, and academia. In this paper we discuss LD’s suitability as a solution to a number of challenges in data usage in economic and social history: sharing requirements, reproducibility, interoperability, data heterogeneity, and large datasets. We evaluate LD by applying it to representative use cases: the Historical Sample of the Netherlands (HSN), various price and income datasets, occupational coding schemes, and an eighteenth-century dataset on migrants to Dutch cities that is currently being extracted from the archives. We further describe an ecosystem of applications for working with these datasets as LD. We find that because of LD’s graph data model, semantics, and use of web technologies for referencing, LD can be a very efficient and powerful tool, especially for research based on the 'long tail' of economic and social history datasets.
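
To make the graph data model concrete, the following minimal Python sketch represents two small datasets as subject-predicate-object triples and combines them through a shared identifier. It is an illustration only, not the tooling described in this paper: the example.org namespace, the predicate names and the records are all hypothetical, and plain tuples stand in for a real Linked Data library.

```python
EX = "https://example.org/"  # hypothetical namespace, not a real vocabulary

# Dataset 1: person records that point to an occupation identifier.
people = {
    (EX + "person/1", EX + "hasOccupation", EX + "occupation/farmer"),
    (EX + "person/2", EX + "hasOccupation", EX + "occupation/weaver"),
}

# Dataset 2: a separately maintained occupational coding scheme.
coding = {
    (EX + "occupation/farmer", EX + "inSector", "agriculture"),
    (EX + "occupation/weaver", EX + "inSector", "textiles"),
}

# Because both datasets use the same identifiers for occupations,
# combining them is simply the union of the two sets of triples.
graph = people | coding


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def sectors_of(graph, person):
    """Follow person -> occupation -> sector across the merged graph."""
    sectors = []
    for _, _, occupation in match(graph, s=person, p=EX + "hasOccupation"):
        for _, _, sector in match(graph, s=occupation, p=EX + "inSector"):
            sectors.append(sector)
    return sectors


print(sectors_of(graph, EX + "person/1"))
```

The point of the sketch is that no schema had to be reconciled before the merge: the shared identifiers carry the link between the person records and the coding scheme.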
