Abstract
This study employs sequence analysis to explore the educational pathways of individuals born in China between 1976 and 1988, a cohort that witnessed substantial educational expansion. The study constructs a typology for classifying these educational trajectories and quantifies the prevalence of each category within the cohort. Utilizing decision tree analysis, the study investigates the relationship between different educational pathways and various background characteristics. Unlike the “waning coefficients” commonly observed in the Mare model and its variants, this approach unveils the substantial influence of cumulative advantage and disadvantage in shaping educational trajectories, a process heavily impacted by individuals’ social backgrounds. Despite some exceptions and complexities, several discernible patterns become apparent. For instance, individuals from rural settings are generally less likely than their urban counterparts to progress along superior educational trajectories throughout their schooling. Moreover, elevated levels of parental education persistently enhance children’s prospects for accessing superior educational pathways, irrespective of their urban or rural origins. This methodology serves as a valuable instrument for scrutinizing the general features and diversity of educational trajectories, providing a complementary perspective to existing research on educational stratification and inequality.
Background and questions
Motivated by a confluence of factors, such as the commercialization of education, initiatives aimed at bolstering domestic demand, and strategies devised to alleviate employment pressures, China has undertaken a policy of educational expansion since the 1990s, culminating in an impressive proliferation in higher education enrollment. Empirical data reveal that enrollments at standard colleges and universities swelled from 1.0836 million to 5.9748 million over the decade spanning from 1998 to 2008, with the gross enrollment rate in higher education institutions surging from 9.8% to 23.3%. By the year 2018, these statistics had escalated even further, with enrollments reaching 7.9099 million and the gross enrollment rate soaring to 48.1%, implying that the popularization of higher education was approaching completion.Footnote 1
In recent decades, sociologists have demonstrated a growing fascination with scrutinizing this unusual social project. Most studies (Yang, 2006; Wu, 2010; Li, 2010; Gruijters, 2022, etc.), with a few exceptions (e.g., Liu, 2006), have predominantly concluded that the expansion policy has exacerbated educational inequality, particularly in higher education. However, beyond the variations in theoretical frameworks and the available data, this conclusion is, to some extent, also a result of the methodologies and analytical models adopted by researchers. Namely, the problem lies in studies either exclusively concentrating on the terminal outcome of educational pathways (specifically, university admittance), or dissecting these trajectories into segmented stages for analysis as though they were separate occurrences. Consequently, the existing literature lacks a holistic perspective on educational trajectories, resulting in limited discussion of their heterogeneity.
Sequence analysis, enhanced by cluster analysis and a tree-based model, will be used in this paper to develop a typology of educational trajectories and expose the relationship between background characteristics and trajectory clusters. Being a nonparametric approach, sequence analysis has the advantage of conceptualizing educational attainment as holistic trajectories rather than as a series of discrete transitions. This approach facilitates the identification of patterns and heterogeneity in the accumulation of advantages/disadvantages throughout the educational trajectories. Such a method is poised to methodologically enrich the analytical toolkit available, thereby contributing to the refinement and expansion of scholarly inquiry within this domain.
Literature review, research ideas, and working hypotheses
Theoretical contexts and methodological dilemmas of educational transition research
Historically, research into educational and status achievement has frequently employed linear regression models that treat the duration of schooling as a continuous dependent variable, as illustrated by seminal works like those of Duncan (1967) and Blau and Duncan (1967). Mare (1979, 1980, 1981) pioneered the educational transition model, deploying a series of logit models applicable to each distinct phase of the educational career. This approach has several asserted advantages, such as conceptually independent probabilities of educational persistence and the ability to estimate models separately for each stage. A pivotal implication derived from the educational transition model is the effect of “waning coefficients,” signaling a diminishing influence of family characteristics on the likelihood of school enrollment as students progress through the educational system, an influence that may diminish to irrelevance after secondary education. This pattern suggests an incremental progression towards equity in the course of educational transitions.Footnote 2 Later sociological inquiries, for example, those by Raftery and Hout (1993), De Graaf and Ganzeboom (1993), Shavit and Blossfeld (1993), and Ayalon and Shavit (2004), have predominantly adopted the Mare model. According to the “maximally maintained inequality” (MMI) hypothesis (Raftery and Hout, 1993), the importance of family background decays to zero in the case of full popularization of certain levels of education.Footnote 3
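For reference, the transition model described above is commonly written as a sequence of conditional logits, one per educational stage. The notation below is introduced here for illustration and is not taken from the cited studies: y_{ik} indicates whether individual i completes transition k, x_i is a vector of background characteristics, and α_k and β_k are stage-specific parameters.

```latex
\operatorname{logit} \Pr\left(y_{ik} = 1 \mid y_{i,k-1} = 1, \mathbf{x}_i\right)
  = \alpha_k + \boldsymbol{\beta}_k^{\top} \mathbf{x}_i,
  \qquad k = 1, \dots, K
```

In this notation, “waning coefficients” correspond to the elements of β_k shrinking toward zero as k increases.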
Methodologically, the Mare model presents two interrelated problems. Firstly, the logit model’s distributional assumptions regarding heteroskedasticity can leave coefficients affected by unobserved heterogeneity, even when that heterogeneity is uncorrelated with the observed covariates. This issue poses significant challenges in distinguishing between genuine coefficients (β) and scaled estimates (β/σ), as noted by Mood (2010) and Holm and Jæger (2011). Consequently, this difficulty complicates the comparison of coefficients across different models that operate on varying scales, leading to potential misinterpretations of the effects being studied. Heterogeneous choice models (Allison, 1999; Hauser and Andrew, 2006; Williams, 2009) were proposed to address the problem, but they are sensitive to model misspecification (Williams, 2010; Mare, 2006). Scholars such as Keele and Park (2006) and Williams (2009) suggested that it was preferable to estimate standard choice models without accounting for heteroskedasticity if the source of heteroskedasticity remains ambiguous.Footnote 4
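The scaling problem can be made explicit with the standard latent-variable formulation of a binary logit. The symbols y*, ε, and Λ are introduced here for illustration only; ε is assumed to follow a standard logistic distribution and Λ denotes its cumulative distribution function.

```latex
y_i^{*} = \mathbf{x}_i^{\top} \boldsymbol{\beta} + \sigma \varepsilon_i,
\qquad
y_i = \mathbf{1}\left[y_i^{*} > 0\right],
\qquad
\Pr\left(y_i = 1 \mid \mathbf{x}_i\right)
  = \Lambda\!\left(\mathbf{x}_i^{\top} \frac{\boldsymbol{\beta}}{\sigma}\right)
```

Only the ratio β/σ enters the likelihood, so two transitions with the same β but different residual scales σ yield different estimated coefficients.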
Secondly, the educational transition model faces a severe issue with selectivity-based endogeneity, a concern raised chiefly by econometricians. Cameron and Heckman (1998, 2001) criticized the Mare model for causing dynamic selection bias due to the omission of unobserved heterogeneity and the “myopia” of agents. Hence, the “waning coefficients” effect could merely be a statistical artifact. Their development of a dynamic discrete choice model (DDCM) revealed that family background has long-term effects on educational attainment. Furthering this critique, Holm and Jæger (2011) underscore the necessity of conceptualizing the entirety of an educational pathway as a selective process in order to circumvent biased estimations resulting from selectivity issues. They introduced an advanced probit choice model that accommodated correlated residuals among transitions. The model revealed that the Mare model underrepresents the impact of family background, thus reinforcing the constant inequality assumption.
The predominant methodological framework in educational attainment research, particularly for discrete choice models, is deeply rooted in econometric principles. This paradigm presumes that individuals behave as rational agents, endeavoring to optimize their utility in accordance with economic theory. DDCM exemplifies a single-agent model that necessitates strong assumptions, such as rational agents maximizing discounted utility expectations at all stages, optimal decisions adhering to a steady-state Markov process (Rust, 1987), and the availability of transition-specific instrumental variables (IVs) for model identifiability, which may prove to be either unrealistic or difficult to operationalize. It is hard to differentiate between sample selection bias and scale effect or to distinguish “state dependence” from unobserved heterogeneity (Lucas, 2001; Mare, 2006; DiPrete and Eirich, 2006). Mare (2011) argued that rectifying unobserved heterogeneity without an underlying model would result in misleading estimates.
The evolution of methodological approaches has significantly contributed to theoretical advancements in this domain, with several sociologists refining analysis models in alignment with Mare’s contributions. Lucas (2001) proposed “effectively maintained inequality” (EMI), illustrating how the influence of social background persists, even in the context of universal education, by treating “dropout” as a potential outcome within an ordered probit regression model. Buis (2011) developed the sequential logit model within a more integrated framework, which is capable of addressing the entirety of educational inequality as a weighted sum of inequality across various transitions.Footnote 5
Lucas’ adoption of a probit model, which assumes the normality of residuals, circumvents the aforementioned comparability issues. However, the exclusion of dropouts due to the risk of not entering subsequent stages may still introduce bias into the estimates as a result of dynamic selectivity—a limitation also applicable to Buis’ model. Subsequent studies by Buis (2017) and Lucas (2001) reveal the “waning coefficients” trends within given cohorts, observable both in general and specifically prior to the transition from senior high school to college. Furthermore, these models encounter limitations in addressing the complexities of educational pathways because of their reliance on dichotomous bifurcation and fixed nesting structures.
In the realm of sociology, the pursuit of more sophisticated modeling techniques to grapple with unobserved heterogeneity and introduce greater structural flexibility is ongoing. Breen and Jonsson (2000) demonstrated the substantial influence of family origin on educational outcomes through the application of multinomial transition models within two latent classes, signifying a burgeoning interest in the use of heterogeneity to bolster causal inference. While such strategies are gaining traction as a means to verify analytical robustness, reliance on latent class analysis as a supplementary tool does not ensure the primary model’s capacity to adequately address potential sources of heterogeneity. In response to these econometric challenges, Karlson (2011) formulated a bias-corrected multinomial logit model incorporating alternative-specific IVs. His findings suggest a consistent underestimation of family background’s impact on educational transitions by conventional models. Nevertheless, these innovative models do not entirely escape from foundational assumptions, such as the IIA, which the presence of unobserved heterogeneity often invalidates—a concern Karlson himself raised regarding his own model. Further complicating the matter is the fact that IV estimates, conceived as local average treatment effects (LATE), are only pertinent for a subset known as “compliers,” with the proportion of such latent groups typically remaining elusive even if the monotonicity condition is met. Moreover, these novel applications in educational choice models fall short of incorporating the rigorous statistical tests that are standard in traditional IV estimations. Despite the progression from binary to multinomial frameworks, these models continue to wrestle with the complexity of capturing all stages of education and the array of choices presented at each pivotal transition.Footnote 6
Principal findings and unresolved problems of relevant empirical studies in China
The majority of research on educational inequality in modern China employs the logit model, focusing primarily on college enrollment as the outcome variable (e.g., Li, 2010; Li and Lu, 2015; Yang and Zhang, 2020). Additionally, certain investigations target college students to delineate the influence of background characteristics on access to different levels of universities (Ye and Ding, 2015; Wu, 2017). Nonetheless, these models fall short of adequately probing the intricacies of educational inequality throughout the evolving landscape of educational trajectories. Furthermore, the reference category utilized within these models is a mish-mash of multiple educational attainments with non-university endpoints, thus obfuscating distinct educational pathways. Over an extended period, Liang et al. (2012) found diverse family backgrounds among students with elite higher education. Concurrently, in a focused review of their work, Ying and Liu (2015) critiqued that the key high school systemFootnote 7 entrenched inequality between urban and rural education.Footnote 8
Event history analysis (EHA) offers a solution to censoring bias in educational attainment studies, yet its application remains rare in China. Liu (2006) employed the Cox proportional hazard model to investigate differences in higher education achievement among various risk groups, with a primary emphasis on final outcomes. For analyzing the multiple phases of educational careers, multi-state EHA emerges as a more fitting approach. However, the classic assumptions of EHA are contested by the dependence and heterogeneity present among repeated events, necessitating a generalized correlation structure for transition risks. Nevertheless, the complexity of multi-state EHA increases significantly with the inclusion of numerous frailty parameters, leading to potential identification problems (Bijwaard, 2014). Additionally, EHA tends to oversimplify the relationship between interconnected events in the life course, thereby constraining its effectiveness in illuminating the critical aspects of earlier trajectories that affect later events (Rossignon et al., 2018).
Studies employing the Mare model to explore educational attainment in China (Li, 2006; Guo and Wu, 2010; Wu, 2013a, 2013b; Li, 2014a; Yang and Lin, 2014; Pang, 2016, etc.) have yielded significant findings, largely supporting the theoretical framework of EMI. However, “waning coefficients” lingered in their results, whereby the influence of background variables consistently diminishes as students progress through educational stages, and often becomes statistically insignificant during the transition from senior high school to college. Tang (2016) pointedly stated that as students ascend to higher levels of education, the school’s grade replaces SES and cultural background as the primary contributor. Paradoxically, Gruijters’ (2022) study, which used the sequential logit model to examine China’s educational expansion, found that inequalities declined for the most recent cohort. Wu (2010) utilized a multi-group logit model to analyze educational attainment differences between 1990 and 2000, resulting in distinct findings. Generally, “waning coefficients” are observed; there is a single exception concerning the coefficients of household registration status (hukou) between the two years, as detailed in Table 8 of the original text. However, it is crucial to acknowledge that, in this instance, the sample was restricted to populations living in rural areas. An earlier study (Guo and Wu, 2010) that applied Lucas’s (2001) model likewise found significantly positive coefficients of background variables, but only for the final educational transition in the last period. However, it is essential to note that the coefficients in the aforementioned results might have been understated due to the potential “waning coefficients” effect, which could intensify if educational expansion heightens the disparity linked to family backgrounds in reality.
These seemingly contradictory conclusions, whether supporting EMI or MMI, remain subject to further debate and require examination through the lens of an innovative methodology.Footnote 9
Utilizing the same data as in this paper, Hao et al. (2014) employed growth mixture modeling (GMM) to analyze educational attainment, delineating four latent classes to explore heterogeneity. However, when employing GMM to investigate highly selective phenomena such as educational attainment, it becomes imperative to address the challenge of missing values caused by unobserved heterogeneity. Although some findings align with the outcomes of the present research, some coefficients, such as those for rural schooling experience, especially in certain latent classes, may be underestimated due to missing data caused by selection bias.Footnote 10 Furthermore, the approach of conceptualizing the educational trajectory as a continuous variable presents shortcomings in capturing the multiplicity of educational pathways. This methodology falls short in distinguishing between academic and vocational education tracks, differentiating between key and non-key schools, and recognizing various levels of higher education.
Methodological advantages of sequence analysis
The concept of advantage/disadvantage accumulation plays a pivotal role within the life course framework in elucidating the emergence of inequality. An analytical model falls short in accurately depicting the dynamics of inequality without incorporating the diversity inherent in the cumulative process (Allison et al., 1982; Allison, 1999, p. 313; Dannefer, 2003; DiPrete and Eirich, 2006; Dannefer, 2009). However, the Mare model posits a scale-invariant feature at each educational transition, a presumption that renders capturing these effects challenging in the aforementioned studies.Footnote 11
Magnusson (2001) suggests, from a holistic interactionistic lens, that complex systems are marked by their irreducibility and indecomposability. Critiques have emerged regarding the disparity between the theoretical focus on holism and the empirical reliance on generalized linear models; the latter imposes linear assumptions that stand in stark contrast to the principles of interactionism (Bergman and Magnusson, 1997; Bauer and Shanahan, 2007). These are the “difficulties deep down” (Wittgenstein, 1980, p. 48e) of this field. Xie (2011) recognized two types of heterogeneity-induced biases in educational transition studies, termed “outcome incommensurability” and “population incommensurability,” which are actually connected to the basic claim of holistic interactionism. He posited that they were inherent problems that could not be easily remedied by better statistical models, and thus resorted to the expedient use of the sequential logit model. Nonetheless, to address these issues effectively without succumbing to undue pessimism, it is essential to embrace the Wittgensteinian strategy of “tearing out by the roots” and “start thinking of these matters in a new way” (Wittgenstein, 1980, p. 48e).
Von Hayek (1989) emphasized the inherent challenges in precisely predicting “phenomena of organized complexity,” suggesting that only pattern recognition is feasible. In this regard, the person-centered approach offers an alternative that emphasizes individual uniqueness, complexity of interactions, variability of individual changes, generalization of patterns, and finiteness (Sterba and Bauer, 2010). This approach captures the high level of interaction and non-linear relationships in dynamic processes by identifying homogenous subgroups and preserving the complex dynamics of the variable system, which can be seen as surface outputs of underlying processes that accumulate over time and trigger transitions between states (Halpin, 2019).Footnote 12
As a typical person-centered approach, sequence analysis (SA) was introduced into the social sciences from computer science and biostatistics by Abbott (Abbott, 1983; Abbott and Forrest, 1986). SA brings “process” back into sociological theory and empirical research by using a “narrative positivism” (Abbott, 1988, 1992) or “story” approach (Cornwell, 2015). SA does not require any assumptions about the life course, avoids the methodological pitfalls connected with simple statistical aggregation of heterogeneous types, and enables straightforward translation of concepts from the life course perspective (Courgeau, 2018; Vanhoutte et al., 2019).
Research hypotheses
Owing to the absence of previous studies on educational inequality that employ SA, this study has to rely on established theories for the formulation of formalized hypotheses. However, given the discord between current theoretical findings and methodologies, the hypotheses will be both crafted and examined on a distinct methodological base, albeit with some superficial similarities.
Boudon (1974) and Mare (1981) postulated that assessing education’s impact on equality necessitated an analysis of the changes in educational opportunities that accompany the expansion of education. According to the tenets of modernization theory, the proliferation of educational access is believed to trigger a rise in the general level of educational attainment, and the achieved principle supplanting the ascribed principle, thereby diminishing educational inequalities (Treiman, 1970; Boudon, 1974; Treiman and Yip, 1989). Empirical studies from China, such as those by Liu (2006), lend support to this assertion. Based on this, Hypothesis 1 is proposed: with the universality of compulsory education and the growth of higher education, there is an augmentation of overall educational opportunities, a decline in the probability of terminating education prematurely, and an increased likelihood of pursuing higher education trajectories.
Educational inequality, which is rooted in broader social inequality, does not yield consistent results across social subgroups. Higher classes maintain advantages, while opportunities for lower classes do not increase unless a certain level of education is saturated. With the expansion of educational access, disparities in education tend to manifest more significantly in terms of qualitative discrepancies rather than quantitative imbalances (Raftery and Hout, 1993; Lucas, 2001). This phenomenon is corroborated by Shavit and Blossfeld’s (1993) international comparative study, as well as by empirical evidence from China (Li, 2006, 2010; Wu, 2010, etc.). Social subgroups exhibit significant differences in their educational trajectories. Based on Pfeffer’s (2008) conceptualization of educational inequality linking individuals’ educational attainment to their parents’ highest education level, Hypothesis 2 is raised: individuals with highly educated parents experience more stable and high-quality educational trajectories.
The urban–rural dualist structure in China plays a considerable role in fostering the nation’s socioeconomic stratification. This systemic division results in stark disparities in educational opportunities for urban versus rural inhabitants (Li, 2014b; Wang, 2014). Therefore, Hypothesis 3 is formulated: rural individuals are more likely to experience terminated educational trajectories and have limited access to education trajectory types with cumulative advantages compared to their urban counterparts.
Methodological considerations
Data profile
This study draws on data from the 2008 Chinese General Social Survey (CGSS 2008), featuring a sample size of 6000 individuals.Footnote 13 Rather than engaging in cohort comparative analysis, it delves into the dynamics of educational trajectory shifts. To mitigate excessive heterogeneity, the study focuses particularly on the 1976–1988 birth cohort. The educational trajectories of this cohort are encompassed within a period beginning with the enactment of the integrated college enrollment policy in 1994 and culminating at the termination point of the survey in 2008, thereby encapsulating an entire cycle of education of this cohort.Footnote 14 Consequently, it is feasible to conduct empirical observations of their comprehensive educational trajectories.Footnote 15 Following thorough programmatic and manual scrutiny, 21 samples with educational sequences containing irreparable logical inconsistencies were excluded. This vetting process yielded a refined sample comprising 1305 valid respondents, which corresponded to 3915 person-stage records formatted in long format.
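The reported counts imply three person-stage records per respondent (1305 × 3 = 3915). A minimal sketch of such a wide-to-long reshaping is given below; the column names `stage_1` to `stage_3`, the state labels, and the example rows are hypothetical placeholders, since the three stages are not named in this section.

```python
# Hypothetical stage columns; the source does not name the three stages.
STAGES = ("stage_1", "stage_2", "stage_3")


def wide_to_long(wide_rows):
    """Expand one-row-per-person records into one-row-per-person-stage records."""
    long_rows = []
    for row in wide_rows:
        for order, stage in enumerate(STAGES, start=1):
            long_rows.append({"id": row["id"], "stage": order, "state": row[stage]})
    return long_rows


# Illustrative respondents only (not CGSS data).
wide = [
    {"id": 1, "stage_1": "elementary", "stage_2": "junior_high", "stage_3": "senior_high"},
    {"id": 2, "stage_1": "elementary", "stage_2": "junior_high", "stage_3": "left_school"},
]
long_records = wide_to_long(wide)
```

Each respondent contributes exactly `len(STAGES)` rows, so the long table always has three times as many rows as the wide one.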
Variables and measurements
The utilization of a life history calendar is indispensable for the conduct of sequence analysis. Within the CGSS 2008 survey, an education history table is incorporated, meticulously chronicling the academic trajectories of respondents. This encompasses start and end dates and the categories of educational institutions attended, among other pertinent details. These data are converted into a spell format and defined as a state sequence object, as needed for sequence analysis. Educational levels are recoded to distinctly identify varying tiers of educational establishments, including universities, colleges, as well as junior and senior high schools. The two highest types of higher education institutions are amalgamated, culminating in a schema comprising 21 distinct types of educational states, including one state that accounts for an empty phase between two education stages (see Figs. 2 and 3 for more information).Footnote 16
The locale of respondents’ residence at the age of 14 serves as a surrogate for household registration. Because the data lack information on past family income, and the parental occupation codes recorded for when respondents were aged 14 have a high proportion of missing values (59.5% for fathers and 77.7% for mothers), the educational attainment of parents is used as the principal metric for family background.Footnote 17 The highest education level achieved by the parents is coded as a continuous variable representing years of schooling. For instance, elementary school typically corresponds to 6 years of education, while junior high school corresponds to 9 years. The survey instrument features a 10-item scale with five levels to gauge academic performance at age 14, which predominantly evaluates attitudes toward learning, efficacy, and adaptation to school life. The aggregate of these items produces a composite measure.Footnote 18 Gender and age are incorporated into the model. The statistical analyses proceed without weighting, as the study only uses data from a single cohort.
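The conversion from spell records to a state sequence can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the TraMineR routine used in the study: spells are assumed to be (begin age, end age exclusive, state) tuples, and the labels `gap`, `pre`, and `out` are hypothetical names for the empty phase between two stages and for time before and after schooling.

```python
GAP = "gap"  # assumed label for an empty phase between two education stages


def spells_to_sequence(spells, start_age, end_age, before="pre", after="out"):
    """Return one state per age in [start_age, end_age) from education spells."""
    first_begin = min(begin for begin, _, _ in spells)
    last_end = max(end for _, end, _ in spells)
    sequence = []
    for age in range(start_age, end_age):
        # Use the first spell that covers this age, if any.
        state = next((label for begin, end, label in spells if begin <= age < end), None)
        if state is None:
            if age < first_begin:
                state = before   # not yet enrolled
            elif age >= last_end:
                state = after    # left the education system
            else:
                state = GAP      # interruption between two spells
        sequence.append(state)
    return sequence


# Illustrative respondent with a one-year break before senior high school.
example_spells = [(7, 13, "elementary"), (13, 16, "junior_high"), (17, 20, "senior_high")]
example_sequence = spells_to_sequence(example_spells, start_age=7, end_age=22)
```

Treating the interruption as its own state keeps the sequences aligned on age, which is what allows them to be compared position by position later on.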
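The two derived measures described above can be sketched as follows. Only the 6-year and 9-year values come from the text; the other entries in the mapping, the level names, and the assumption that each of the ten items is scored from 1 to 5 are illustrative guesses rather than the survey's actual coding.

```python
# Years of schooling per level; only "elementary" and "junior_high" are from the text,
# the remaining entries are assumed for illustration.
YEARS_OF_SCHOOLING = {"none": 0, "elementary": 6, "junior_high": 9, "senior_high": 12}


def parental_years(father_level, mother_level):
    """Years of schooling of the more highly educated parent."""
    return max(YEARS_OF_SCHOOLING[father_level], YEARS_OF_SCHOOLING[mother_level])


def composite_score(items):
    """Sum a 10-item scale; each item is assumed to be scored 1 to 5."""
    if len(items) != 10:
        raise ValueError("expected exactly 10 items")
    if any(not 1 <= value <= 5 for value in items):
        raise ValueError("item scores must lie between 1 and 5")
    return sum(items)
```

For example, `parental_years("elementary", "junior_high")` gives 9, and a respondent answering 3 on every item obtains a composite of 30.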
Table 1 provides details of the background variables utilized in the model for the 1976–1988 birth cohort. The distribution of the sample is generally normal, although it exhibits a slightly higher urbanization rate compared to the actual situation.
Causal inference basis of sequence analysis and auxiliary model
The study’s fundamental positions are as follows: (1) Life course sequence data can be interpreted as a collection of condensed individual biographies, rich in information beyond observable metrics. Determining unobserved heterogeneity in this context is dynamic, multifaceted, and emergent. This complexity necessitates a holistic interactionist perspective, which views educational trajectories as comprehensive processes from their very inception, incorporating unobserved heterogeneity into the morphological fabric of these trajectories. This perspective recognizes the inherent nonlinearity, interdependence, and adaptability of educational experiences, emphasizing the importance of understanding these elements in conjunction rather than isolation.
(2) The analysis consistently encompasses the entire sample, a methodological choice aimed at mitigating bias stemming from the selection of the sample. A person-centered approach is adopted, meticulously documenting the complete educational trajectories of individuals within a cohort, while acknowledging disruptions or terminations as distinct states. This approach allows for a comprehensive assessment of the entire sample, paralleling the benefits of prospective studies. As Vanhoutte et al. (2019) suggest, focusing on a single event inherently excludes individuals who have not experienced it. SA addresses this by examining the trajectory in its entirety, ensuring the inclusion of individuals who may not have been at risk of encountering the event.
(3) Although it is not feasible to assume that unobserved variables or inherent discrepancies, such as IQ, are randomly distributed across the population, the study ensures that the classification and correspondence during the clustering sequence are sufficient to satisfy the ignorability assumption by maximizing discernible differences among various educational trajectory categories. This methodology aligns with the traditional approach of conditioning through stratification in causal analysis (Rosenbaum, 2002; Morgan and Winship, 2015). However, it is considered superior to the reliance on propensity score matching that uses external explanatory variables. Provided these conditions are met, the study argues that it is possible to circumvent issues of selectivity-based endogeneity when examining the prevalence of diverse educational trajectory types within distinct characteristic subgroups. Footnote 19
Specifically, the sequence analysis is executed programmatically using the TraMineR and TraMineRextras packages within the R language. These packages cluster state sequences according to the optimal matching distance. This investigation intentionally circumvents the multinomial logit model to prevent regression to a variable-centered approach and to avoid the “curse of dimensionality” that arises when integrating a wide array of interaction terms into the model. Instead, the research employs a conditional inference tree model—a tree-based methodology recognized as a supervised learning algorithm conducive to causal inference (Athey and Imbens, 2016; Wager and Athey, 2018; Brand et al., 2021). By utilizing techniques such as heterogeneity maximization and adaptive nearest neighbor matching within recursive partitioning, these models enable the segmentation of the dataset into distinct sub-samples. This approach provides a flexible and interactive mechanism for addressing confounders, thereby improving the accuracy of heterogeneous causal effect estimations.
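To make the notion of an optimal matching distance concrete, the sketch below implements the classic dynamic-programming version with a user-supplied substitution cost and a constant insertion/deletion cost. It is an illustration only: the study relies on TraMineR, and the function name `om_distance`, the state codes, and the cost values are assumptions introduced here.

```python
def om_distance(seq_a, seq_b, sub_cost, indel=1.0):
    """Classic optimal matching (edit) distance between two state sequences."""
    n, m = len(seq_a), len(seq_b)
    # table[i][j] = cheapest way to turn the first i states of seq_a
    # into the first j states of seq_b.
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * indel
    for j in range(1, m + 1):
        table[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = seq_a[i - 1], seq_b[j - 1]
            substitution = 0.0 if a == b else sub_cost(a, b)
            table[i][j] = min(
                table[i - 1][j] + indel,             # delete a state from seq_a
                table[i][j - 1] + indel,             # insert a state from seq_b
                table[i - 1][j - 1] + substitution,  # match or substitute
            )
    return table[n][m]


def constant_cost(a, b):
    """Assumed flat substitution cost for the example."""
    return 2.0


# E = elementary, J = junior high, S = senior high (illustrative codes).
d_same_length = om_distance(["E", "E", "J", "S"], ["E", "J", "J", "S"], constant_cost)
d_truncated = om_distance(["E", "J", "S"], ["E", "J"], constant_cost)
```

The resulting pairwise distances form the matrix that a clustering algorithm then partitions into trajectory types.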
Research findings
Descriptive statistics
This study delineates the educational progression by creating a Sankey diagram (Fig. 1), which employs educational sequence data to graphically elucidate educational trajectories and verify data veracity. The illustration portrays the educational continuum as a multifaceted construct characterized by sequential dynamism, branching heterogeneity, and pronounced selectivity, reflecting the complex dynamics inherent in the educational process. It prominently highlights the significant repercussions of the 1990s educational reforms on the educational trajectories of the 1976–1988 born cohort. These reforms are associated with an augmented propensity for higher educational attainment, as evidenced by the rise in enrollment rates at tertiary institutions such as junior colleges and universities, as well as a notable increase in senior high school participation.
The figure presents a comprehensive Sankey diagram generated using gvisSankey, which visualizes the educational pathways of the 1976–1988 birth cohort. The diagram validates the educational sequence data and highlights the dynamic and heterogeneous nature of educational progression, capturing the complex interplay of sequential dynamism, branching heterogeneity, and pronounced selectivity in shaping the cohort's educational trajectories.
Sequence analysis
Figure 2 depicts chronograms (state distribution plots) illustrating the educational trajectory of the 1976–1988 birth cohort, revealing shifts in the prevalence of various educational levels across distinct age brackets. The trend observed in the data indicates that the majority of the cohort transitions from elementary to junior and senior high school with increasing age. Concurrently, there is a rise in the percentage of the cohort engaging in informal education as the cohort exits the formal education system.
The figure illustrates the distribution of educational attainment among the 1976–1988 birth cohort using the seqdplot function. It documents shifts in the prevalence of various education levels across different ages. The graphically displayed structural patterns demonstrate a fluid progression from elementary to middle and high school education as the birth cohort ages, alongside more complex shifts within higher education.
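The state distribution behind a chronogram can be sketched in a few lines. The Python example below is only an illustration of the computation, not the study's code: the function name `state_distribution`, the state codes, and the toy sequences are all hypothetical, and it assumes equal-length sequences with one state per age.

```python
from collections import Counter

def state_distribution(sequences):
    """Return, for each position (age), the share of sequences in each state."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    distribution = []
    for t in range(length):
        counts = Counter(seq[t] for seq in sequences)
        total = sum(counts.values())
        distribution.append({state: n / total for state, n in counts.items()})
    return distribution

# Hypothetical toy data (not from CGSS2008):
# E = elementary, J = junior high, S = senior high, O = out of school
toy = ["EEJJSS", "EEJJOO", "EEJOOO", "EEJJSS"]
dist = state_distribution(toy)
for age_index, shares in enumerate(dist):
    print(age_index, shares)
```

Stacking these per-position shares as bars, one bar per age, gives the kind of plot described above.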
Cluster analysis
A substitution cost matrix, derived from the transition rates observed in the aggregate sequence pattern, is used to compute a distance matrix among sequences with the OMloc method. For the cohort in question, the exponential cost parameter (expcost) within OMloc is set to zero.Footnote 20 Subsequently, hierarchical clustering employing the Ward method is conducted on this distance matrix, and 16 clusters are chosen based on their practical relevance and goodness of fit.Footnote 21 Figure 3 presents the state distribution for the various sequence types, as delineated by the seqdplot function, each manifesting a unique attribute. The clustering outcome accounts for 68.8% of the discrepancy, and Table 2 details the proportions and the salient features of each cluster. The clusters demonstrate a high degree of fit and distinctiveness, mirroring the combined effects of the key school system and the bifurcation of academic and vocational tracks.
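To make the distance step concrete, the following Python sketch derives substitution costs from transition rates and computes a plain optimal matching distance by dynamic programming. It is a simplified illustration under stated assumptions: it uses the common transition-rate cost 2 - p(b|a) - p(a|b) and a constant indel cost, and it does not implement the localized OMloc variant or its expcost parameter. All function names and the toy sequences are hypothetical.

```python
from collections import defaultdict

def transition_rates(sequences):
    """Estimate p(b | a) from moves between adjacent positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    rates = {}
    for a, row in counts.items():
        total = sum(row.values())
        rates[a] = {b: n / total for b, n in row.items()}
    return rates

def substitution_cost(rates, a, b):
    """Transition-rate cost: 2 - p(b|a) - p(a|b); identical states cost 0."""
    if a == b:
        return 0.0
    return 2.0 - rates.get(a, {}).get(b, 0.0) - rates.get(b, {}).get(a, 0.0)

def om_distance(x, y, rates, indel=1.0):
    """Plain optimal matching distance (assumed constant indel cost)."""
    n, m = len(x), len(y)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,  # delete a state from x
                d[i][j - 1] + indel,  # insert a state from y
                d[i - 1][j - 1] + substitution_cost(rates, x[i - 1], y[j - 1]),
            )
    return d[n][m]

# Hypothetical toy sequences (not from CGSS2008)
toy = ["EEJJSS", "EEJJOO", "EEJOOO", "EEJJSS"]
rates = transition_rates(toy)
matrix = [[om_distance(a, b, rates) for b in toy] for a in toy]
```

The resulting pairwise matrix is the kind of input a hierarchical clustering routine such as Ward's method would take.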
The figure reveals the diverse state distributions across a range of sequence types using the seqIplot function, each displaying distinct traits. These characteristics highlight the complex interplay of the key school system and the bifurcation between academic and vocational tracks. The visualization provides a detailed insight into the heterogeneity of educational pathways within the cohort, emphasizing the influence of both structural factors and individual choices.
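The share of discrepancy that a partition accounts for, such as the 68.8% reported above for the 16-cluster solution, is commonly expressed in sequence analysis as a pseudo-R². The Python sketch below assumes that definition, with the discrepancy of a group taken as the sum of its pairwise distances divided by twice the group size. Whether the study computed its figure in exactly this way is an assumption, and the function names and toy matrix are hypothetical.

```python
def discrepancy(dist, members):
    """Sum-of-squares analogue: pairwise distances within a group / (2 * group size)."""
    n = len(members)
    return sum(dist[i][j] for i in members for j in members) / (2 * n)

def pseudo_r2(dist, labels):
    """Share of total discrepancy accounted for by the partition in `labels`."""
    everyone = list(range(len(labels)))
    total = discrepancy(dist, everyone)
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    within = sum(discrepancy(dist, members) for members in groups.values())
    return 1 - within / total

# Hypothetical distance matrix with two tight pairs of sequences
toy_dist = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 1],
    [4, 4, 1, 0],
]
r2 = pseudo_r2(toy_dist, [0, 0, 1, 1])
print(round(r2, 3))
```

A partition that places every sequence in one cluster yields 0, and one that isolates every sequence yields 1, so the measure has to be weighed against the number of clusters.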
The 16 identified trajectory types can be classified into three principal categories: (1) The general education pathway category encompasses Clusters 3, 4, 5, 8, 10, 11, 12, and 13, charting educational progressions from diverse tiers of secondary education to assorted junior colleges and universities. Clusters 4 and 8 are particularly noteworthy, as they represent the highest echelons of educational attainment at each respective stage, thereby exhibiting a significant cumulative advantage. Collectively, this category accounts for 33.0% of the overall composition. (2) The education termination category, which includes Clusters 1, 2, 9, 15, and 16, is characterized by a pattern of cumulative disadvantage. It represents the majority of cases, comprising 51.1% of the total. (3) The alternative pathways category is primarily defined by vocational education trajectories, including Clusters 6, 7, and 14. These trajectories typically involve educational institutions such as technical schools, vocational high schools, and specialized technical secondary schools. This category constitutes 15.9% of the total trajectories observed.
Cumulative advantages and disadvantages manifest distinctly across various educational trajectory clusters. For instance, Cluster 8 exhibits a higher representation of students from elite high schools compared to peer clusters, while a substantial segment also stems from less prestigious secondary institutions. Typically, students hailing from high-ranking schools are more likely to gain entrance into subsequent elite educational levels, although there are exceptions where this trend is reversed, known in some contexts as “counterattacks” (ni xi). This pattern is also observed in Clusters 4 and 11, where the likelihood of such reversals seems to be more evident. Conversely, Clusters 9 and 15 are characterized by individuals who either departed from key junior high schools prematurely or who, despite graduating from key senior high schools, did not proceed to higher education. These instances account for a smaller fraction and are often labeled as “antitypes” or “white spots” within the terminology of SA (Bergman et al., 2003). These findings reveal that educational trajectories are intricate and non-linear, profoundly influenced by path dependency. They are prone to feedback mechanisms that have the potential to magnify initial disparities. However, there is considerable variability, resulting in a diverse educational landscape. Such complexity necessitates the implementation of advanced and adaptable analytical methods within the realm of educational research.
Conditional inference tree
This study utilizes the conditional inference tree (CIT) algorithm by Hothorn, Hornik, and Zeileis (2006), tailored to our analytical needs. As a non-parametric model, the CIT allows data to shape model structure and complexity, avoiding predefined parameters. It excels in handling complex, nonlinear data without distributional assumptions. The CIT algorithm effectively manages multi-valued categorical outcomes and explanatory variables, particularly in scenarios lacking a primary treatment assignment. It enables detailed analysis of educational trajectories and rigorous significance testing and mitigates selectivity-based endogeneity by preserving unconfoundedness achieved from the trajectory clustering process. By applying the CIT, this study conducts a nuanced classification of educational trajectories without imposing a uniform causal model, offering a sophisticated alternative to traditional measures like the average treatment effect (ATE) and ensuring sensitivity to each trajectory’s unique features and contexts.
The CIT algorithm selects variables for splitting using permutation-based significance tests, which circumvents biases associated with traditional information measures, like the Gini index or information gain.Footnote 22 This process involves selecting the variable with the smallest p-value for each split and continues until no significant independent variables remain. Although this method embodies a conservative stance, it effectively determines an optimal tree size without the need for post-pruning or cross-validation techniques.Footnote 23
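The variable-selection step can be illustrated with a small Python sketch. It is a simplified stand-in rather than the algorithm of Hothorn, Hornik, and Zeileis (2006): it assumes categorical predictors, uses a Pearson chi-square statistic with a Monte Carlo permutation p-value, and omits the multiplicity adjustment and the search for the split point. The function names, the threshold `alpha`, and the toy data are hypothetical.

```python
import random
from collections import Counter

def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical variables."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    stat = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            stat += (joint.get((x, y), 0) - expected) ** 2 / expected
    return stat

def permutation_p_value(xs, ys, n_perm, rng):
    """Share of label permutations whose statistic reaches the observed one."""
    observed = chi_square(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi_square(xs, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def select_split_variable(predictors, outcome, alpha=0.05, n_perm=500, seed=0):
    """Pick the predictor with the smallest p-value; stop (None) if none is significant."""
    rng = random.Random(seed)
    p_values = {
        name: permutation_p_value(values, outcome, n_perm, rng)
        for name, values in predictors.items()
    }
    best = min(p_values, key=p_values.get)
    return (best if p_values[best] <= alpha else None), p_values

# Hypothetical toy data: trajectory type tracks residence but not gender
predictors = {
    "residence": ["rural"] * 20 + ["urban"] * 20,
    "gender": ["f", "m"] * 20,
}
outcome = ["terminated"] * 20 + ["general"] * 20
best, p_values = select_split_variable(predictors, outcome)
print(best, p_values)
```

Applied recursively to each resulting subset, this stopping rule is what lets the tree size be determined without post-pruning.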
In line with the established criteria, the dataset was partitioned into eight distinct leaf (terminal) nodes, as depicted in Fig. 4 and Table 3. A considerable variation in educational trajectory types was observed across these leaf nodes, exhibiting significant global or local effects for all predictive factors except gender. Table 3 details the distribution of the 16 educational trajectories among the leaf nodes. The key findings are as follows.
The figure offers a comprehensive decision tree diagram, mapping out the diverse educational pathways of the 1976–1988 birth cohort. This dataset, strategically bifurcated into eight unique leaf nodes using established criteria, uncovers a remarkable range of variation in these educational routes. Beyond the factor of gender, all other predictive elements exert global or local effects, providing a nuanced understanding of the determinants shaping these educational trajectories.
First of all, the decision tree that outlines the educational trajectories for the entire sample bifurcates primarily along the urban/rural divide. This suggests that, among the explanatory variables considered, the urban/rural distinction is the most significant factor in differentiating educational trajectories. Individuals living in rural regions at the age of 14 are more likely than those in urban areas to be classified within Cluster 2, highlighting the greater prevalence of compulsory education in urban settings.
Furthermore, parental education level significantly affects the educational outcomes of children in both urban and rural contexts, substantiating Hypothesis 2. It demonstrates a consistent and considerable influence on the quality of educational pathways, encompassing both exemplary and poor outcomes. This factor is crucial in determining whether offspring will follow the prevailing educational routes, though its impact on the entry into distinctive educational paths diverges between urban and rural regions.
In general, rural inhabitants more frequently conform to educational patterns characterized by Cluster 2. Despite the less pronounced morphological differentiation in rural areas compared to urban ones, notable systematic differences are evident across the three leaf nodes. Leaf Node 3, representing the most educationally disadvantaged subgroup, has the highest portion of Cluster 2 at 46.8%, the largest among all nodes. The main distinction within its predecessor node lies in whether parental education exceeds 6 years, synonymous with completing elementary education. If parents have 6 or fewer years of education, there is a roughly 50% chance that their children will be categorized within Leaf Node 3.
Among urban youth, the likelihood of falling into Leaf Node 8 escalates when parental educational achievement does not exceed nine years, which corresponds to finishing middle school or below. While this node predominantly consists of Cluster 2 members, the proportion is smaller at 20.4% compared to its rural equivalents, such as Leaf Nodes 3, 5, and 6. Furthermore, this leaf node encompasses a larger percentage of superior educational trajectory types relative to those observed in rural settings.
With respect to the expansion of higher education, urban populations possess a distinct advantage in accessing elevated educational trajectories such as junior college (Cluster 5) and second-best or lowest higher education trajectories like Clusters 11 and 13. The progression through educational levels adheres to a social choice mechanism that aligns with a survival pattern, where transition rates differ among different social strata, as delineated by Müller and Karle (1993). This evidence underscores a more pronounced “survivor effect” within the rural populace. The group of rural elite students lags behind their urban counterparts across all echelons of identified formal higher education trajectories. Institutions such as junior colleges, undergraduate colleges, and universities at the prefectural level play an essential role in facilitating the pursuit of higher education for urban students with average endowments. The significance is especially pronounced when considering that these students are viewed within a competitive framework alongside their rural peers, who have undergone more stringent selection processes. Hence, Hypothesis 3 is broadly corroborated.
It is critical to understand that within this model, the “age” variable does not simply signify the chronological age but rather encapsulates the period effect associated with the birth year of the cohort, amidst the structural shift occasioned by the expansion of education.Footnote 24 For rural individuals with more than six years of parental education, “age” plays a substantial role in the determination of entering distinct educational trajectories. Younger rural pupils, born more recently and commencing their education later, are more inclined to progress to Leaf Node 5, which encompasses 31.0% of Cluster 2, with Clusters 5 and 8 also representing noteworthy proportions. Conversely, those over 22 are more apt to advance to Leaf Node 6, where Cluster 2 constitutes 44.9%. This indicates that younger rural students are less prone to discontinuing their elementary schooling and tend to adhere to stable educational trajectories. Overall, despite the aforementioned distinctions, it is the frequency of rural students advancing to Cluster 2 that overwhelmingly surpasses that of any urban node, regardless of the background combinations that characterize these nodes.
Analogous to the patterns observed within rural nodes, among the urban demographic segment, those aged 27 or under, categorized within Parent Node 12, exhibit a greater propensity to transition into Leaf Node 13. This node is predominantly influenced by Cluster 4, with Cluster 8 having the most substantial representation across the nodes. In contrast, individuals aged over 27 tend to proceed to Leaf Node 14, which is distinguished by a preponderance of Cluster 12. These patterns mirror a period effect specific to the 1976–1988 birth cohort, which can be attributed to the widespread implementation of compulsory education and the subsequent enlargement and normalization of higher education throughout the 1990s. Consequently, Hypothesis 1 finds general corroboration through these observations.
The findings reveal that the expansion of higher education exerts a more pronounced effect on younger individuals hailing from rural locales and minor urban centers. An evaluation of pivotal age demarcations—22 and 27 years respectively—demonstrates that adolescents from small towns accrue more benefits than those from rural areas. In stark contrast, this policy wields a diminished influence on larger and mid-sized urban areas, a consequence of their pre-existing advantages in educational accessibility. The semblance of equity in educational proliferation across the urban–rural spectrum is a direct result of the historical impediments faced by individuals from rural settings and small towns in securing educational opportunities. The observed phenomenon implies that the reduction of educational disparities within broader socio-spatial frameworks is largely explained by this compensatory effect.
The results of the study elucidate the emergence of a distinctly well-educated echelon. This segment predominantly consists of individuals from substantial urban areas who exhibit enhanced adaptability in educational pursuits and whose parentage often includes access to higher education. The genesis of this group lies at the intersection of accessible opportunities, familial backgrounds, and individual diligence. In addition, the study identifies a distinct category termed the “small-town swot” (xiao zhen zuo ti jia) within Parent Node 12. This group bears resemblance to the concept of the “privileged poor” as delineated by Jack (2019), and carries significant implications for the field of educational sociology.Footnote 25 The term, which originated in an interest group of a well-known Chinese online community, expresses a complex socio-psychological identification in a self-deprecating tone. This articulation subtly underscores the dissonance between the group’s academic achievements and their actual social standing, a phenomenon situated within the framework of China’s exam-centric educational system and urban-rural dualism.
The findings indicate that the methodology employed in this study is particularly sensitive to identifying unique subgroups within the populace. It is imperative, however, to acknowledge that variable-centered models exhibit inherent constraints in their capacity to comprehensively capture specific cohorts. This constraint arises from these models’ propensity to dismember holistic educational trajectories, thus restricting the study to isolated segments and fostering sample bias.
The learning situation at age 14 emerges as a notable differentiator for urban youths whose parents have attained more than 9 years of maximum schooling. Within this demographic, the likelihood of transitioning into elite educational pathways is notably higher, exemplified by Leaf Node 10, where 20.0% of participants are from Cluster 5. This likelihood is particularly high in urban populations where this indicator exceeds 35 (i.e., the average value). Residents from prefecture-level cities and provincial capitals or municipalities have the highest proportion of individuals entering Leaf Node 15.Footnote 26 Cluster 8 enjoys a higher percentage within this node, especially relative to Leaf Node 14, which is significantly linked to informal higher education. Leaf Node 13, on the other hand, is marked by Clusters 4 and 8, which are notable for their smaller scale and more homogeneous composition. In summary, a better learning situation substantially improves the likelihood of pursuing academic trajectories that culminate in higher education.
The tree-based model utilized in this study demonstrates variability in the extent of differentiation and specialization of educational trajectories between urban versus rural areas. This variation poses challenges in terms of comparability, as urban demographic groups are more likely to exhibit dominance in certain trajectory types within particular leaf nodes. Nonetheless, even the worst-case scenario for urban populations, as illustrated by Leaf Node 8, displays a greater propensity for individuals to enter advantageous educational trajectories and avoid inferior ones compared to all rural nodes. Additionally, there are clear differences within Cluster 16 when comparing rural and urban areas. For example, in rural areas, a higher percentage (21.2%) of individuals have parents with 6 years or less of education, as seen in Leaf Node 3.Footnote 27
In addition to the previously discussed findings, several other observations are particularly significant in a holistic sense and thus deserve further emphasis and clarification:
Firstly, an examination of Table 3 reveals disparities in the distribution within Cluster 1, particularly between Leaf Nodes 3 and 15, and others (Leaf Nodes 5, 6, 8, 10, 13), with the exception of Leaf Node 14. Specifically, the prevalence of Cluster 1 at Leaf Nodes 3 and 15 is estimated to be around 5%, whereas at the alternate leaf nodes, it ranges between 10% and 15%. This pattern ostensibly suggests a balanced distribution of educational pathways across urban and rural regions. However, this parity is restricted solely to instances of non-success in college entrance examinations and is unidirectional in nature.
Despite their superficial similarity, these proportions have completely different meanings. The equivalent presence of Cluster 1 in Nodes 3 and 15 denotes fundamentally disparate realities. A larger fraction of rural students end their formal education before reaching high school and thus never face the possibility of failing the college entrance examination. Consequently, this situation affords urban students a relative advantage in fulfilling their educational paths, thereby creating an educational divide with their rural counterparts, notwithstanding the demographic predominance of rural inhabitants at that time. These observations superficially concur with the Mare model’s principles but are derived through a different methodology. From a holistic perspective, there is no evidence that educational inequality decreases progressively with transitions or that a “tail-raising” effect of inequality occurs during the transition from senior high school to college. Except for a small percentage of special types, educational trajectories, as they are typically mapped, demonstrate persistent inequality, characterized by the cumulative amplification of either advantages or disadvantages.
Secondly, when considering the overall picture, the homogeneity within populations from different levels of urban or rural residence should not be overstated. Urban populations have better access to lower-level higher education trajectories with full-time or part-time junior colleges. Nevertheless, a considerable proportion of the population in urban areas does not continue their education beyond mandatory or senior high levels. Conversely, with regard to variations among variables, this study differs from existing studies, revealing that the urban–rural dichotomy and familial background influence individuals differently across varied educational trajectories. When examined from a holistic interactionistic standpoint, the causal links of the underlying mechanisms are conditional and heterogeneous. Moreover, when scrutinizing the impact of parents’ educational levels on the academic achievements of their offspring, pronounced disparities are predominantly observed within Clusters 2, 5, and 8. Conversely, the termination type of senior high school (Cluster 1), various vocational education types (Clusters 6 and 7), and informal higher education types (Clusters 12, 13, and 14) exhibit less variability. Clusters 9 and 15 represent two unique types of reversals, with negligible differences across nearly all nodes. This pattern implies that the nexus between the education levels of parents and the educational success of children is multifaceted and conditional, incorporating elements of intergenerational statistical regression effects.
Thirdly, echoing Cornwell’s (2015, p. 34) critique of “general linear reality”, the CIT reveals that certain variables play a pivotal role in differentiating between types of educational trajectories within urban settings, yet they do not hold the same significance in rural contexts. On one hand, this implies that rural children experience more pervasive educational failures compared to urban children. On the other hand, the academic achievements of rural students appear to hinge more heavily on unobserved factors such as ability or specific contingencies, such as making the right choices during admission to higher education. Replicating these individual successes poses a challenge, as they tend to follow less predictable patterns than those observed in urban environments. In other words, their educational careers are fraught with greater uncertainties. In this regard, by comparison, urban populations rely more on achieved factors for access to quality educational trajectories. The prevailing conditions in rural areas fail to provide adequate support for academically diligent students to pursue top-tier educational opportunities. This fundamental issue lies at the heart of the educational divide between urban and rural areas.
Discussions and conclusions
Summary of findings and implications
Educational trajectories are complex and involve selective mechanisms. However, the Mare model and its adaptations struggle to incorporate the elements of selectivity and diversity that characterize educational careers. The emergence of “waning coefficients” underscores this deficiency, except when particular analyses are confined to localized effects or take into account extended time frames or significant temporal changes that conceal the extent of bias caused by selectivity. Although the preponderance of research signals an increase in educational inequality, the identification of “waning coefficients” casts doubt on this assertion. Moreover, the assumptions integrated into models designed for bias correction in discrete selection are often overly simplistic and fall short in confronting the elaborate and heterogeneous nature of educational pathways.
Utilizing a person-centered, holistic statistical methodology, the synergistic application of sequence clustering and tree-based modeling provides an efficacious means to dissect complex educational trajectories. This approach requires minimal presumptions and adopts a naturalistic perspective, furnishing a comprehensive yet nuanced portrait of educational progression. Research reveals that educational disparities manifest early and accumulate over time, enduring throughout the span of one’s educational career, and are particularly pronounced at the extreme ends of the educational trajectory typology. Educational attainment, viewed as a filtration process, induces a selection bias that conventional educational transition models struggle to address. In this approach, however, selectivity becomes a fulcrum that research can capitalize on. This allows for the counterbalancing of unobserved heterogeneity, thereby enabling a more credible evaluation of the causal interplay between individuals’ background characteristics and their educational achievements.Footnote 28 Based on the above methodology, this study has enhanced the differentiation of individual cases, particularly those exhibiting unobserved heterogeneity, by implementing holistic trajectory matching and clustering. In conjunction, a tree-based model with manifest covariates has been employed for explicit distinction. This dual strategy enhances the differentiation of between-group differences across both case and variable dimensions and helps to mitigate the effects of unobserved heterogeneity. This approach aids in identifying comparable subgroups, resulting in more accurate estimations.
Beyond several research hypotheses, our approach has discerned differential impacts of educational expansion policies across diverse demographic segments. It has pinpointed particular cohorts, including the highly-educated elite and “small-town swots,” while also revealing the multifaceted ways in which background factors influence entry into distinct educational pathways. Some pathways may be more accessible to specific backgrounds; however, this does not uniformly apply across all trajectories. It can be seen that, despite senior high school termination being equally distributed at each node, this localized equality does not necessarily translate into systemic equality. The “counterattack” phenomenon, although prevalent across multiple trajectories, does not exhibit enough distinction to constitute a separate category. Conversely, transitions from key to non-key schools are identified as two distinct types, presenting a seemingly random pattern not tied to particular characteristics. These findings suggest the presence of dynamic mechanisms such as compensatory or statistical regression effects within the trajectories.
In the investigation of educational stratification through quantitative methods, sociologists should leverage their discipline’s intrinsic holistic perspective and comparative advantage, rather than merely emulating econometric methodologies. The examination of micro-mechanisms and decision-making processes within educational trajectories is crucial, necessitating the integration of both analytical (for instance, as illustrated by Holm et al., 2019) and qualitative (as explored by Walther et al., 2015) approaches. To effectively dissect the procedural complexities of educational stratification, it is crucial to embrace a holistic viewpoint. An overarching perspective is as important as an in-depth understanding of specific mechanisms, aiding in the formation of accurate causal inferences at an emergent level.
Limitations and perspectives of research
SA, a methodology enjoying a resurgence of interest, is garnering increasing recognition as an alternative approach to examining social processes. However, this approach is not devoid of limitations, such as challenges associated with integrating time-varying variablesFootnote 29 and the data-driven nature of cluster analysis. SA can present methodological intricacies, especially when dealing with expansive datasets containing numerous elements, potentially yielding complex patterns. The effectiveness of this method relies heavily on the availability of comprehensive data. Moreover, the procedure is computationally demanding, with advanced analyses of extensive and intricate datasets necessitating significant computational resources and processing time.
Owing to the small sample size of this study, some findings may be insufficiently robust and require more data for verification. Utilization of big data holds the potential to improve classification accuracy and facilitate more comprehensive trajectory matching, thereby reducing confounding factors when integrated with other typological methods. Furthermore, it is crucial to acknowledge that sequences are noteworthy not merely in their morphological structure but also in the intricate dynamics they encompass. In the age of complex social science, the progression of research, especially in relation to trajectory dynamics, necessitates an integration of mechanistic analysis with data science algorithms. Promising exploratory avenues encompass methodologies such as graph theory, simulation, and deep learning, among others.
Finally, it is important to declare that this study aims to address “waning coefficients” and provide an alternative approach, rather than dismissing the long-term or mechanistic effects revealed by existing studies. Solving all methodological problems in one paper is not possible, but this approach may inspire further exploration.
Data availability
The raw datasets analyzed in the course of this study are publicly available and can be accessed through the CGSS2008 repository, located at http://www.cnsda.org/index.php?r=projects/view&id=34288661. Additionally, the datasets specifically generated for this study are provided in the supplementary information section.
Notes
The Chinese Ministry of Education’s official website provides the statistics for each year, see http://www.moe.gov.cn/jyb_sjzl/.
Mare (1980) proposed that non-randomness in school dropout led to a decrease in family background effect as students advanced in their education due to increased homogeneity in unobservable factors. He refined this by stating that “waning coefficients” primarily occurred within cohorts rather than between them. Buis (2011) undertook a sensitivity analysis and discovered that the father’s education impact across transitions was vulnerable to unobserved heterogeneity, and the cross-birth cohort effect was relatively robust but underestimated.
Müller and Karle (1993) ascribed the occurrence of “waning coefficients” to children’s increasing economic and social independence, an explanation that may be considered a misapplication of life course theory due to its lack of methodological substantiation. The tension between the continuity of education trajectories and segmented modeling underscores the critical need for a matching methodology for this issue.
Williams (2007) contended that if the variance of residuals diminished across transitions, analogous to the typical reduction observed in variances, then scholars such as Mare, Hauser, and Andrew potentially overstated the influence of SES in subsequent transitions. Conversely, the heterogeneous choice model indicated a rise in residual variance through the transitions. This presents a pivotal challenge in discerning whether the apparent decline in the SES effect, when variance decreases, is authentic, or whether residual variance increases because the effect of omitted factors rises.
Buis’ sequential logit model represents a variant of the nested logit approach, structuring choices into “nests” to mitigate the Independence of Irrelevant Alternatives (IIA) problem. Despite this innovation, it is contingent upon specific presumptions, including the assumption that error terms are independent across these nests. In light of this, mixed logit models and latent class analyses are increasingly favored in contemporary research for their capacity to amalgamate the advantages inherent in a person-centered approach.
Sociological investigations into education, as demonstrated in the works of Bourdieu and Passeron (1990), Cuconato and Walther (2015), and Willis (2017), reveal that educational processes are intricate, filled with paradoxes, and characterized by misunderstandings and complicity. It is overly simplistic to presume that individuals invariably engage in rational decision-making aimed at optimizing their net educational benefits throughout educational transitions.
In the 1950s, as part of a state-led elitist strategy designed to propel rapid industrialization, China instituted the key school system. This initiative was further solidified in the mid-1990s with the creation of 1000 exemplary high schools on a national scale. This system, unparalleled elsewhere, is occasionally rendered in English as “famous” or “prominent” schools. Of late, the term “super high school” has surfaced, denoting key high schools that dominate access to the nation’s most esteemed universities.
Employing these models could result in survivor bias or Berkson’s paradox, potentially exaggerating the phenomenon known as the “privileged poor” (han men gui zi). Pearl (2013) characterizes Berkson’s paradox as a form of selection bias, originating from the circumstance that the sample is only reflective of a subset of the broader population.
The classic Mare model was applied to the same sample, and the results, as expected, support the observed trend of “waning coefficients.” This also holds true for the variable concerning admission to key secondary schools, as evidenced in Attached Table 1 provided in the supplementary information.
Longitudinal models, including GMM, struggle with data that are missing not at random, which arise from unobserved heterogeneity because education is a selective process. Neither ignoring the missing values nor imputing them offers a flawless solution. Mainstream estimation methods such as FIML and REML assume that data are MCAR or MAR; with NMAR data they falter, and results may be biased because intertwined unobserved factors cause the missingness.
Cornwell (2015, p. 34) contended that the “general linear reality” presupposed that all samples shared the same causal processes and mechanisms, and predictors had the same explanatory power for all units, among other assumptions. Adding only time-lagged variables was inadequate in capturing the effects of earlier stages or the entire sequence, which prompted the development of SA.
Akin to GSS, CGSS is not designed as a panel survey. The 2008 dataset stands out as the singular wave providing an in-depth educational life history, which permits transformation into longitudinal or sequence data as required.
Choosing the birth cohort spanning from 1976 to 1988 enables this study to include individuals who witnessed both the initiation of educational expansion and the brief preceding period (pre-1998). This methodology ensures that the dataset encompasses the phase of most vigorous policy implementation, thereby providing a more comprehensive insight into the effects of these changes. If the study were to align with the regularization period of educational expansion, it might not capture the previously pronounced dynamic manifestations of educational inequality.
Within the complete dataset, there are 492 instances lacking educational data, which presumably correspond to individuals who have not received formal education. Notably, a minimal number of these cases pertain to the cohort born after 1976; thus, they have been omitted from the analytical process. Additionally, various coding inaccuracies were rectified by imposing specific conditions during the recoding phase. For a few individuals who had not yet completed their tertiary education, the year of the survey was designated as the endpoint. This does not affect the study’s results, given that the primary emphasis is on the collegiate experience, and the attainment of graduate degrees is very rare within the studied population. Most individuals would have graduated under the “admitting rigorously while graduating leniently” criterion prevalent in Chinese colleges.
This research focuses on a cohort that has completed its educational career. The educational trajectories scrutinized herein are defined by their empirical and observational nature. Despite the fact that certain individuals in the cohort completed their education post-primary school, these cases also constitute a form of complete educational sequence. They ought not to be misconstrued as instances of missing data. Therefore, concerns pertaining to “missing” states necessitating imputation are unwarranted.
These variables exhibit a significant correlation. Li’s (2014b) study revealed that, when accounting for the father’s level of education and the distinction between urban and rural settings, the father’s occupation did not exert a notable impact on their children’s educational opportunities.
Items 6, 8, and 9 of this scale, which were reverse scored, were recoded to align with the directionality of the other items. A rating of 3 represents a neutral position. Consequently, the scale has become a measurement in the positive direction, with a Cronbach’s alpha of 0.6477.
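The recoding and reliability steps described in this note can be illustrated with a minimal sketch. It assumes a 5-point scale (consistent with 3 being the neutral rating), so that a reverse-scored rating r becomes 6 − r. The function names and the response matrix are hypothetical toy material, not the study's data or code.

```python
from statistics import variance

REVERSED_ITEMS = {6, 8, 9}   # 1-based item numbers named in the note
SCALE_MAX = 5                # assumption: 5-point scale with 3 as the midpoint


def reverse_code(row, reversed_items=REVERSED_ITEMS, scale_max=SCALE_MAX):
    """Flip reverse-scored items so that all items point in the same direction."""
    return [
        (scale_max + 1 - r) if (i + 1) in reversed_items else r
        for i, r in enumerate(row)
    ]


def cronbach_alpha(rows):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    items = list(zip(*rows))                      # one tuple of ratings per item
    item_var = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var / total_var)


# Toy responses from four hypothetical respondents on nine items.
responses = [
    [4, 4, 5, 3, 4, 2, 4, 1, 2],
    [2, 3, 2, 2, 3, 4, 2, 4, 5],
    [3, 3, 3, 3, 3, 3, 3, 3, 3],
    [5, 4, 4, 5, 5, 1, 5, 2, 1],
]
recoded = [reverse_code(r) for r in responses]
print(round(cronbach_alpha(recoded), 4))
```

Recoding before computing alpha matters because items left in the opposite direction lower the variance of the total score and therefore depress the coefficient.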
Due to limited space and the applied research nature of this study, a separate methodological article will address the formal representation, simulation study, and sensitivity analysis of this topic.
Hollister (2009) found that the OMloc method offers a more effective approach than traditional OM in social research, particularly when indel costs are low. This methodology reduces the incidence of excessive approximate substitutions and enhances the fit for regression models that follow.
In order to preserve the information pertaining to educational trajectories, the computation of distances was conducted without the application of a normalization parameter. This approach maintained distinct intermittent attributes and the diversity of sequence lengths, albeit at the expense of a marginal increase in variance. While normalization has the potential to diminish variance, empirical evidence suggests that it does not significantly alter the clustering configuration.
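The effect of skipping normalization can be seen in a small sketch. It implements classic optimal matching with constant costs, not the OMloc variant used in the study, and the state codes (P primary, J junior secondary, S senior secondary, C college), the cost values, and the division by the longer sequence length are all assumptions chosen for illustration.

```python
def om_distance(a, b, indel=1.0, sub=2.0):
    """Classic optimal matching (edit) distance with constant costs."""
    n, m = len(a), len(b)
    # d[i][j] holds the cheapest cost of turning a[:i] into b[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(
                d[i - 1][j] + indel,      # delete a[i-1]
                d[i][j - 1] + indel,      # insert b[j-1]
                d[i - 1][j - 1] + cost,   # match or substitute
            )
    return d[n][m]


short = "PPPPPPJJJ"            # hypothetical: left school after junior secondary
long_ = "PPPPPPJJJSSSCCCC"     # hypothetical: continued through college

raw = om_distance(short, long_)
normalized = raw / max(len(short), len(long_))   # one possible normalization
print(raw, normalized)
```

The raw distance grows with the number of extra years spent in school, which is the length information the note wishes to keep; dividing by sequence length compresses that difference.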
The education trajectory data were randomly divided into training and test datasets at a ratio of 7:3. The model achieved a prediction accuracy of 31.5% on the training data and 33.9% on the test data. This slight increase in accuracy from training to testing suggests that the model does not suffer from overfitting and highlights the importance of unobserved factors in the analysis. However, our goal is not high prediction accuracy, but rather an interpretable model, considering the limited number of explanatory variables.
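The split-and-score procedure in this note can be sketched as follows. This is a generic illustration with hypothetical function names and toy labels, not the study's R code; the seed and the rounding rule for the cut point are assumptions.

```python
import random


def train_test_split(records, train_share=0.7, seed=0):
    """Shuffle a copy of the records and cut it at the requested share."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, observed):
    """Share of cases whose predicted cluster equals the observed cluster."""
    hits = sum(p == o for p, o in zip(predicted, observed))
    return hits / len(observed)


# Ten hypothetical case identifiers split 7:3.
train, test = train_test_split(range(10))
print(len(train), len(test))

# Toy predicted and observed cluster labels for four cases.
print(accuracy(["c5", "c16", "c5", "c2"], ["c5", "c16", "c2", "c2"]))
```

Comparing the accuracy computed on the training cases with the accuracy on the held-out cases is what supports the overfitting remark: a model that had memorized the training data would score noticeably worse on the test set.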
The random forest algorithm, when applied to the identical dataset, achieved an accuracy rate of 30.1%. This result suggests that the use of more complex algorithms may not be necessary. Furthermore, an assessment of the mean decrease in accuracy due to each feature—gender, age, urban/rural divide, the highest level of parental education, and learning status—revealed importance values of 2.0, 11.6, 171.8, 47.5, and 43.5, respectively, within the random forest framework.
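The idea behind "mean decrease in accuracy" can be sketched with a simple permutation scheme: shuffle one feature, re-predict, and record how much accuracy falls. Random forest implementations typically compute this per tree on out-of-bag cases and may rescale the result, which this sketch does not attempt. The `predict` rule, the feature names, and the toy rows are hypothetical.

```python
import random


def accuracy_of(predict, rows, labels):
    """Share of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, feature, repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy_of(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original.
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
        drops.append(baseline - accuracy_of(predict, permuted, labels))
    return sum(drops) / repeats


# Toy data: the outcome depends on urban origin only, never on gender.
rows = [{"urban": i % 2 == 0, "gender": i % 3 == 0} for i in range(8)]
labels = ["college" if r["urban"] else "junior" for r in rows]


def predict(row):
    """Hypothetical fitted rule standing in for a trained model."""
    return "college" if row["urban"] else "junior"


print(permutation_importance(predict, rows, labels, "urban"))
print(permutation_importance(predict, rows, labels, "gender"))
```

A feature the model never uses yields an importance of zero, while a feature that drives the prediction yields a clear drop, which mirrors the contrast between the small value for gender and the large value for the urban/rural divide reported in the note.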
In this study, the variable “age” is not treated as a time-varying covariate because it acts as an extrinsic variable, with its effects already pre-incorporated into the trajectory configuration. Operationally, age functions as an inverse measure of birth year; thus, the earliest birth year in the cohort is labeled “1,” with subsequent codes assigned sequentially according to the survey year’s chronological frame.
Please refer to https://en.wikipedia.org/wiki/Small-town_Swot for further information.
Given its highest proportion of Cluster 5 and the absence of any further significant differentiating factors, this node is predominantly characterized by Cluster 5.
To compare the outcomes of a given educational trajectory within two distinct terminal nodes, a two-sample test for proportion equality can be utilized. The findings pertaining to Cluster 16, when comparing Node 3 with others, are uniformly statistically significant. Additional similar tests will not be detailed herein.
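A two-sample test for equality of proportions of the kind mentioned here can be sketched with the standard pooled z-statistic. The counts below are invented for illustration and are not the node frequencies from the study; the function name is hypothetical.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))   # two-sided tail area of the standard normal
    return z, p_value


# Hypothetical counts: members of one cluster in two terminal nodes.
z, p = two_proportion_z_test(30, 200, 10, 200)
print(z, p)
```

The normal approximation is reasonable when each node contains enough cases in and out of the cluster; for very small nodes an exact test would be a safer choice.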
Recent scholarly endeavors in qualitative research, particularly those involving in-depth case approaches targeting specific segments within higher education, such as “study God” and “small-town swots”, have yielded results that deviate from quantitative analyses. Sequence analysis represents a methodological synthesis that integrates the quantitative focus on obtaining representativeness and deducing causality with the qualitative emphasis on a comprehensive and complex investigation of cases.
The multi-state SA approach, which integrates an event history model, was introduced by Studer, Struffolino, and Fasang (2018) as a means to tackle this issue. However, this method is not utilized in the current study because the pertinent data are unavailable and the investigation does not require it.
References
Abbott A (1983) Sequences of social events: concepts and methods for the analysis of order in social processes. Hist Methods 16(4):129–147
Abbott A (1988) Transcending general linear reality. Sociol Theory 6(2):169–186
Abbott A (1992) From causes to events: notes on narrative positivism. Sociol Methods Res 20:428–445
Abbott A, Forrest J (1986) Optimal matching methods for historical sequences. J Interdiscip Hist 16(3):471–494
Allison PD (1999) Comparing logit and probit coefficients across groups. Sociol Methods Res 28(2):186–208
Allison PD, Long JS, Krauze TK (1982) Cumulative advantage and inequality in science. Am Sociol Rev 47(5):615–625
Athey S, Imbens G (2016) Recursive partitioning for heterogeneous causal effects. Proc Natl Acad Sci USA 113(27):7353–7360
Ayalon H, Shavit Y (2004) Educational reforms and inequalities in Israel: the MMI hypothesis revisited. Sociol Educ 77:103–120
Bauer DJ, Shanahan MJ (2007) Modeling complex interactions: person-centered and variable-centered approaches. In: Little TD, Bovaird JA, Card N (eds) Modeling contextual effects in longitudinal studies. Routledge, pp. 255–283
Bergman LR, Magnusson D (1997) A person-oriented approach in research on developmental psychopathology. Dev Psychopathol 9(2):291–319
Bergman LR, Magnusson D, El Khouri BM (2003) Studying individual development in an interindividual context: a person-oriented approach. Psychology Press
Bijwaard GE (2014) Multistate event history analysis with frailty. Demogr Res 30:1591–1620
Blau PM, Duncan OD (1967) The American occupational structure. Free Press, New York
Boudon R (1974) Education, opportunity, and social inequality: changing prospects in Western society. Wiley, New York
Bourdieu P, Passeron JC (1990) Reproduction in education, society and culture. Sage Publications
Brand JE, Xu JH, Koch B, Geraldo P (2021) Uncovering sociological effect heterogeneity using tree-based machine learning. Sociol Methodol 51(2):189–223
Breen R, Jonsson JO (2000) Analyzing educational careers: a multinomial transition model. Am Sociol Rev 65(5):754–772
Buis ML (2011) The consequences of unobserved heterogeneity in a sequential logit model. Res Soc Stratif Mobil 29(3):247–262
Buis ML (2017) Not all transitions are equal: the relationship between effects on passing steps in a sequential process and effects on the final outcome. Sociol Methods Res 46(3):649–680
Cameron SV, Heckman JJ (1998) Life cycle schooling and dynamic selection bias: models and evidence for five cohorts of American males. J Political Econ 106(2):262–333
Cameron SV, Heckman JJ (2001) The dynamics of educational attainment for black, Hispanic, and white males. J Political Econ 109(3):455–499
Cornwell B (2015) Social sequence analysis: methods and applications. Cambridge University Press
Courgeau D (2018) Do different approaches in population science lead to divergent or convergent models? In: Ritschard G, Studer M (eds) Sequence analysis and related approaches: innovative methods and applications. Springer, Cham, Switzerland, pp. 15–33
Cuconato M, Walther A (2015) ‘Doing transitions’ in education. Int J Qual Stud Educ 28(3):283–296
Dannefer D (2003) Cumulative advantage/disadvantage and the life course: cross-fertilizing age and social science theory. J Gerontol: Soc Sci 58(6):327–337
Dannefer D (2009) Stability, homogeneity, agency: cumulative dis-/advantage and problems of theory. Swiss J Sociol 35(2):183–210
De Graaf PM, Ganzeboom HB (1993) Family background and educational attainment in the Netherlands for the 1891–1960 birth cohorts. In: Shavit Y, Blossfeld HP (eds) Persistent inequality: changing educational attainment in thirteen countries. Westview Press, Boulder, CO, pp. 75–99
DiPrete TA, Eirich GM (2006) Cumulative advantage as a mechanism for inequality: a review of theoretical and empirical developments. Annu Rev Sociol 32:271–297
Duncan B (1967) Education and social background. Am J Sociol 72(4):363–372
Gruijters RJ (2022) Trends in educational stratification during China’s great transformation. Oxf Rev Educ 48(3):320–340
Guo MC, Wu XG (2010) Trends in educational stratification in reform-era China, 1981–2006. In: Suter C (ed) Inequality beyond globalization: economic changes, social transformations, and the dynamics of inequality. LIT Verlag Münster, pp. 335–360
Halpin B (2019) Introduction to sequence analysis. In: Blossfeld HP, Rohwer G, Schneider T (eds) Event history analysis with Stata. Routledge, pp. 282–307
Hao LX, Hu A, Lo J (2014) Two aspects of the rural–urban divide and educational stratification in China: a trajectory analysis. Comp Educ Rev 58(3):509–536
Hauser RM, Andrew M (2006) Another look at the stratification of educational transitions: the logit response model with partial proportionality constraints. Sociol Methodol 36(1):1–26
Hollister M (2009) Is optimal matching suboptimal? Sociol Methods Res 38(2):235–264
Holm A, Jæger MM (2011) Dealing with selection bias in educational transition models: the bivariate probit selection model. Res Soc Stratif Mobil 29(3):311–322
Holm A, Hjorth-Trolle A, Jæger MM (2019) Signals, educational decision-making, and inequality. Eur Sociol Rev 35(4):447–460
Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat 15(3):651–674
Jack AA (2019) The privileged poor: how elite colleges are failing disadvantaged students. Harvard University Press
Karlson KB (2011) Multiple paths in educational transitions: a multinomial transition model with unobserved heterogeneity. Res Soc Stratif Mobil 29(3):323–341
Keele L, Park DK (2006) Difficult choices: an evaluation of heterogeneous choice models. Paper presented at the 2004 meeting of the American Political Science Association. pp. 2–5
Li CL (2010) Expansion of higher education and inequality in opportunity of education: a study on effect of “kuozhao” policy on equalization of educational attainment. Sociol Stud 3:82–113. (in Chinese)
Li CL (2014a) Educational experience and inequality of opportunity among the post-80s generation—with comments on “the silent revolution”. Soc Sci China 4:66–77. (in Chinese)
Li CL (2014b) The changing trend of educational inequality in China (1940–2010): reexamining the urban–rural gap on educational opportunity. Sociol Stud 2:65–89. (in Chinese)
Li DX, Lu HX (2015) The study of the relativity between higher education entrance chance acquirement and family’s capital: based on CFPS data binary logit regression analysis. Glob Educ 4:50–60. (in Chinese)
Li Y (2006) Institutional change and educational inequality: mechanisms in educational stratification in urban China (1966–2003). Soc Sci China 4:97–109. (in Chinese)
Liang C, Li ZQ, Zhang H, Li L, Ruan DQ, Kang WL, Yang SH (2012) A silent revolution: research on family backgrounds of students of Peking University and Soochow University (1952–2002). Soc Sci China 1:98–118. (in Chinese)
Liu JM (2006) Expansion of higher education in China and inequality in entrance opportunities: 1978–2003. Chin J Sociol 3:158–179. (in Chinese)
Lucas SR (2001) Effectively maintained inequality: education transitions, track mobility, and social background effects. Am J Sociol 106(6):1642–1690
Magnusson D (2001) The holistic-interactionistic paradigm: some directions for empirical developmental research. Eur Psychol 6(3):153–162
Mare RD (1979) Social background composition and educational growth. Demography 16(1):55–71
Mare RD (1980) Social background and school continuation decisions. J Am Stat Assoc 75(370):295–305
Mare RD (1981) Change and stability in educational stratification. Am Sociol Rev 46(1):72–87
Mare RD (2006) Response: statistical models of educational stratification—Hauser and Andrew’s models for school transitions. Sociol Methodol 36(1):27–37
Mare RD (2011) Introduction to symposium on unmeasured heterogeneity in school transition models. Res Soc Stratif Mobil 29(3):239–245
Mood C (2010) Logit regression: why we cannot do what we think we can do, and what we can do about it. Eur Sociol Rev 26(1):67–82
Morgan SL, Winship C (2015) Counterfactuals and causal inference. Cambridge University Press, New York
Müller W, Karle W (1993) Social selection in educational systems in Europe. Eur Sociol Rev 9(1):1–23
Pang SM (2016) Market transition, educational differentiation and urban-rural inequality in Chinese higher education (1977–2008). Chin J Sociol 5:155–174. (in Chinese)
Pearl J (2013) Linear models: a useful “microscope” for causal analysis. J Causal Inference 1(1):155–170
Pfeffer FT (2008) Persistent inequality in educational attainment and its institutional context. Eur Sociol Rev 24(5):543–565
Raftery AE, Hout M (1993) Maximally maintained inequality: expansion, reform, and opportunity in Irish education, 1921–75. Sociol Educ 66(1):41–62
Rosenbaum PR (2002) Observational studies. Springer, New York
Rossignon F, Studer M, Gauthier JA, Le Goff JM (2018) Sequence history analysis (SHA): estimating the effect of past trajectories on an upcoming event. In: Ritschard G, Studer M (eds) Sequence analysis and related approaches: innovative methods and applications. Springer, Cham, Switzerland, pp. 83–100
Rust J (1987) Optimal replacement of GMC bus engines: an empirical model of Harold Zurcher. Econometrica 55(5):999–1033. https://doi.org/10.2307/1911259
Shavit Y, Blossfeld HP (1993) Persistent inequality: changing educational attainment in thirteen countries. Westview Press, Boulder, CO
Sterba SK, Bauer DJ (2010) Matching method with theory in person-oriented developmental psychopathology research. Dev Psychopathol 22(2):239–254
Studer M, Struffolino E, Fasang AE (2018) Estimating the relationship between time-varying covariates and trajectories: the sequence analysis multistate model procedure. Sociol Methodol 48(1):103–135
Tang JC (2016) “Lost at the starting line”: a reconsideration of educational inequality in China, 1978–2008. J Chin Sociol 3(1):1–18
Treiman DJ (1970) Industrialization and social stratification. Sociol Inq 40(2):207–234
Treiman DJ, Yip KB (1989) Educational and occupational attainment in 21 countries. In: Kohn M (ed) Cross-national research in sociology. SAGE Publications, pp. 373–394
Vanhoutte B, Wahrendorf M, Prattley J (2019) Sequence analysis of life history data. In: Liamputtong P (ed) Handbook of research methods in health social sciences. Springer, Singapore, pp. 935–954
Von Hayek FA (1989) The pretence of knowledge. Am Econ Rev 79(6):3–7
Wager S, Athey S (2018) Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc 113(523):1228–1242
Walther A, Warth A, Ule M, du Bois-Reymond M (2015) “Me, my education and I”: constellations of decision-making in young people’s educational trajectories. Int J Qual Stud Educ 28(3):349–371
Wang Q (2014) Rural students are being left behind in China. Nature 510(7506):445–445
Williams R (2009) Using heterogeneous choice models to compare logit and probit coefficients across groups. Sociol Methods Res 37(4):531–559
Williams R (2010) Fitting heterogeneous choice models with oglm. Stata J 10(4):540–567
Williams R (2007) Estimating heterogeneous choice models with Stata. Paper presented at west coast Stata users’ group meetings, Stata users group. http://repec.org/wcsug2007/rw_WCSUG2007.pdf. Accessed 25 Oct 2022
Willis P (2017) Learning to labour: how working class kids get working class jobs. Routledge, New York
Wittgenstein L (1980) Culture and value. University of Chicago Press, Chicago
Wu XG (2010) Economic transition, school expansion and educational inequality in China, 1990–2000. Res Soc Stratif Mobil 28(1):91–108
Wu XG (2017) Higher education, elite formation and social stratification in contemporary China: preliminary findings from the Beijing college students panel survey. Chin J Sociol 3(1):3–31
Wu YX (2013a) The keypoint school system, tracking, and educational stratification in China, 1978–2008. Sociol Stud 4:179–202. (in Chinese)
Wu YX (2013b) Educational opportunities for rural and urban residents in China, 1978–2008: inequality and evolution. Soc Sci China 34(3):58–75
Xie Y (2011) Values and limitations of statistical models. Res Soc Stratif Mobil 29(3):343–349
Yang DP (2006) Access to higher education: widening social class disparities. Tsinghua J Educ 1:19–25. (in Chinese)
Yang L, Zhang TJ (2020) Family background, key schools and education attainment. Educ Econ 2020(5):33–44. (in Chinese)
Yang QM, Lin J (2014) Is educational expansion enough to achieve educational equity? and the impact of higher education reform on educational equity at the end of the 20th century. Manag World 8:55–67. (in Chinese)
Ye XY, Ding YQ (2015) Expanding Chinese higher education: quality and social stratification. Chin J Sociol 35(3):193–220. (in Chinese)
Ying X, Liu YS (2015) “Silent revolution” is exaggerated rhetoric: some idea exchange with Liang Chen, Li Zhongqing, et al. Chin J Sociol 35(2):81–93. (in Chinese)
Acknowledgements
The authors express their gratitude to all contributors for their helpful suggestions. Special appreciation is extended to anonymous reviewers for their valuable comments and recommendations. It is emphasized that the authors bear sole responsibility for the content of this paper.
Author information
Contributions
The corresponding author was responsible for the conceptualization and design of the study, the creation of R programming code for sequence analysis, and the drafting and subsequent revisions of the manuscript. The contributing author collaborated on the research question and plan, identified suitable datasets, assisted with data processing, and engaged in the refinement of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
As the present study utilized publicly accessible data from the CGSS2008 repository (http://www.cnsda.org/index.php?r=projects/view&id=34288661), no specific ethical approval was deemed necessary.
Informed consent
The necessity for obtaining informed consent was waived for this study as it employed publicly available data from the CGSS2008 repository (http://www.cnsda.org/index.php?r=projects/view&id=34288661).
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bi, X., Liu, X. From “transitions” to “trajectories”: towards a holistic interactionistic analysis of educational inequality in contemporary China. Humanit Soc Sci Commun 11, 918 (2024). https://doi.org/10.1057/s41599-024-03421-7
DOI: https://doi.org/10.1057/s41599-024-03421-7