Background and questions

Motivated by a confluence of factors, such as the commercialization of education, initiatives aimed at bolstering domestic demand, and strategies devised to alleviate employment pressures, China has undertaken a policy of educational expansion since the 1990s, culminating in an impressive proliferation in higher education enrollment. Empirical data reveal that enrollments at standard colleges and universities swelled from 1.0836 million to 5.9748 million over the decade spanning from 1998 to 2008, with the gross enrollment rate in higher education institutions surging from 9.8% to 23.3%. By the year 2018, these statistics had escalated even further, with enrollments reaching 7.9099 million and the gross enrollment rate soaring to 48.1%, implying that the popularization of higher education was approaching completion.Footnote 1

In recent decades, sociologists have shown growing interest in this unusual social project. Most studies (Yang, 2006; Wu, 2010; Li, 2010; Gruijters, 2022, etc.), with a few exceptions (e.g., Liu, 2006), have concluded that the expansion policy has exacerbated educational inequality, particularly in higher education. However, beyond differences in theoretical frameworks and available data, this conclusion is, to some extent, also a product of the methodologies and analytical models that researchers have adopted. Specifically, studies either concentrate exclusively on the terminal outcome of educational pathways (namely, university admission) or dissect these trajectories into segmented stages and analyze them as though they were separate occurrences. Consequently, the existing literature lacks a holistic perspective on educational trajectories, and its discussion of heterogeneity among those trajectories remains limited.

Sequence analysis, enhanced by cluster analysis and a tree-based model, will be used in this paper to develop a typology of educational trajectories and expose the relationship between background characteristics and trajectory clusters. Being a nonparametric approach, sequence analysis has the advantage of conceptualizing educational attainment as holistic trajectories rather than as a series of discrete transitions. This approach facilitates the identification of patterns and heterogeneity in the accumulation of advantages/disadvantages throughout the educational trajectories. Such a method is poised to methodologically enrich the analytical toolkit available, thereby contributing to the refinement and expansion of scholarly inquiry within this domain.

Literature review, research ideas, and working hypotheses

Theoretical contexts and methodological dilemmas of educational transition research

Historically, research on educational and status attainment frequently employed linear regression models that treat years of schooling as a continuous dependent variable, as illustrated by seminal works such as Duncan (1967) and Blau and Duncan (1967). Mare (1979, 1980, 1981) pioneered the educational transition model, which applies a series of logit models, one to each distinct phase of the educational career. This approach has several asserted advantages, such as conceptually independent probabilities of educational persistence and the ability to estimate models separately for each stage. A pivotal implication of the educational transition model is the effect of “waning coefficients”: the influence of family characteristics on the likelihood of school enrollment diminishes as students progress through the educational system, and may fade to irrelevance after secondary education. This pattern suggests an incremental progression towards equity in the course of educational transitions.Footnote 2 Later sociological inquiries, for example, those by Raftery and Hout (1993), De Graaf and Ganzeboom (1993), Shavit and Blossfeld (1993), and Ayalon and Shavit (2004), have predominantly adopted the Mare model. According to the “maximally maintained inequality” (MMI) thesis (Raftery and Hout, 1993), the importance of family background decays to zero once a given level of education becomes fully universal.Footnote 3
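For concreteness, the transition model described above is conventionally written as a sequence of conditional logits. The notation below is introduced here for illustration and does not appear in the original text: y_{ik} equals 1 if individual i completes transition k, x_i denotes family background, and α_k and β_k are transition-specific parameters.

```latex
\[
\operatorname{logit}\,\Pr\left(y_{ik} = 1 \mid y_{i,k-1} = 1,\ x_i\right)
  = \alpha_k + \beta_k x_i, \qquad k = 1, \dots, K .
\]
```

In this notation, “waning coefficients” corresponds to estimates of β_k that shrink in magnitude as k increases.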

Methodologically, the Mare model presents two interrelated problems. Firstly, because the logit model fixes the residual variance by assumption, its coefficients are affected by unobserved heterogeneity even when the omitted factors are uncorrelated with the included covariates. This makes it difficult to distinguish genuine coefficients (β) from scaled estimates (β/σ), as noted by Mood (2010) and Holm and Jæger (2011). Consequently, coefficients cannot be compared straightforwardly across models that operate on different scales, which invites misinterpretation of the effects being studied. Heterogeneous choice models (Allison, 1999; Hauser and Andrew, 2006; Williams, 2009) were proposed to address the problem, but they are sensitive to model misspecification (Williams, 2010; Mare, 2006). Scholars such as Keele and Park (2006) and Williams (2009) suggested that it is preferable to estimate standard choice models without accounting for heteroskedasticity if the source of heteroskedasticity remains ambiguous.Footnote 4
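The scaling problem can be stated in a standard latent-variable form. The symbols below are illustrative assumptions rather than the authors' notation: ε_{ik} follows a standard logistic distribution, Λ is the logistic distribution function, and σ_k is the unknown residual scale at transition k.

```latex
\[
y_{ik}^{*} = \alpha_k + \beta_k x_i + \sigma_k \varepsilon_{ik}, \qquad
y_{ik} = \mathbf{1}\left[\, y_{ik}^{*} > 0 \,\right],
\]
\[
\Pr\left(y_{ik} = 1 \mid x_i\right)
  = \Lambda\!\left(\frac{\alpha_k + \beta_k x_i}{\sigma_k}\right)
  \quad\Longrightarrow\quad
  b_k = \frac{\beta_k}{\sigma_k} .
\]
```

Only the ratio b_k is identified from the data, so a difference between b_k and b_{k+1} may reflect a change in β, a change in σ, or both.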

Secondly, the educational transition model faces a severe problem of selectivity-based endogeneity, which worries econometricians. Cameron and Heckman (1998, 2001) criticized the Mare model for incurring dynamic selection bias through its omission of unobserved heterogeneity and the “myopia” of its agents. Hence, the “waning coefficients” effect could merely be a statistical artifact. Their dynamic discrete choice model (DDCM) revealed that family background has long-term effects on educational attainment. Furthering this critique, Holm and Jæger (2011) underscored the need to conceptualize the entire educational pathway as a selective process in order to avoid the biased estimates that result from selectivity. They introduced an advanced probit choice model that accommodates correlated residuals across transitions. Their results indicated that the Mare model understates the impact of family background, thus reinforcing the constant inequality assumption.

The predominant methodological framework in educational attainment research, particularly for discrete choice models, is deeply rooted in econometric principles. This paradigm presumes that individuals behave as rational agents who seek to optimize their utility in accordance with economic theory. DDCM exemplifies a single-agent model that requires strong assumptions: that rational agents maximize expected discounted utility at every stage, that optimal decisions follow a steady-state Markov process (Rust, 1987), and that transition-specific instrumental variables (IVs) are available to identify the model. These assumptions may prove either unrealistic or difficult to operationalize. It is hard to differentiate between sample selection bias and scale effects, or to distinguish “state dependence” from unobserved heterogeneity (Lucas, 2001; Mare, 2006; DiPrete and Eirich, 2006). Mare (2011) argued that correcting for unobserved heterogeneity without an underlying model would result in misleading estimates.

The evolution of methodological approaches has significantly contributed to theoretical advancements in this domain, with several sociologists refining analysis models in alignment with Mare’s contributions. Lucas (2001) proposed “effectively maintained inequality” (EMI), illustrating how the influence of social background persists, even in the context of universal education, by treating “dropout” as a potential outcome within an ordered probit regression model. Buis (2011) developed the sequential logit model within a more integrated framework, which is capable of addressing the entirety of educational inequality as a weighted sum of inequality across various transitions.Footnote 5

Lucas’ adoption of a probit model, which assumes the normality of residuals, circumvents the aforementioned comparability issues. However, the exclusion of dropouts due to the risk of not entering subsequent stages may still introduce bias into the estimates as a result of dynamic selectivity—a limitation also applicable to Buis’ model. Subsequent studies by Buis (2017) and Lucas (2001) reveal the “waning coefficients” trends within given cohorts, observable both in general and specifically prior to the transition from senior high school to college. Furthermore, these models encounter limitations in addressing the complexities of educational pathways because of their reliance on dichotomous bifurcation and fixed nesting structures.

In sociology, the pursuit of more sophisticated modeling techniques that handle unobserved heterogeneity and allow greater structural flexibility is ongoing. Breen and Jonsson (2000) demonstrated the substantial influence of family origin on educational outcomes by applying multinomial transition models within two latent classes, signifying a burgeoning interest in the use of heterogeneity to bolster causal inference. While such strategies are gaining traction as a means of verifying analytical robustness, relying on latent class analysis as a supplementary tool does not ensure that the primary model adequately addresses potential sources of heterogeneity. In response to these econometric challenges, Karlson (2011) formulated a bias-corrected multinomial logit model incorporating alternative-specific IVs. His findings suggest that conventional models consistently underestimate the impact of family background on educational transitions. Nevertheless, these innovative models do not entirely escape foundational assumptions such as the independence of irrelevant alternatives (IIA), which the presence of unobserved heterogeneity often invalidates, a concern that Karlson himself raised regarding his own model. A further complication is that IV estimates, conceived as local average treatment effects (LATE), pertain only to the subset known as “compliers,” and the proportion of such latent groups typically remains elusive even if the monotonicity condition is met. Moreover, these novel applications in educational choice models do not incorporate the rigorous statistical tests that are standard in traditional IV estimation. Despite the progression from binary to multinomial frameworks, these models still struggle to capture all stages of education and the array of choices presented at each pivotal transition.Footnote 6

Principal findings and unresolved problems of relevant empirical studies in China

The majority of research on educational inequality in modern China employs the logit model, focusing primarily on college enrollment as the outcome variable (e.g., Li, 2010; Li and Lu, 2015; Yang and Zhang, 2020). Additionally, certain investigations target college students to delineate the influence of background characteristics on access to different tiers of universities (Ye and Ding, 2015; Wu, 2017). Nonetheless, these models fall short of adequately probing how educational inequality unfolds along educational trajectories. Furthermore, the reference category used in these models mixes multiple educational attainments with non-university endpoints, thus obscuring distinct educational pathways. Examining an extended period, Liang et al. (2012) found diverse family backgrounds among students in elite higher education. In a focused review of their work, Ying and Liu (2015) argued that the key high school systemFootnote 7 entrenched inequality between urban and rural education.Footnote 8

Event history analysis (EHA) offers a solution to censoring bias in educational attainment studies, yet its application remains rare in China. Liu (2006) employed the Cox proportional hazard model to investigate differences in higher education achievement among various risk groups, with a primary emphasis on final outcomes. For analyzing the multiple phases of educational careers, multi-state EHA is a more fitting approach. However, the classic assumptions of EHA are challenged by the dependence and heterogeneity present among repeated events, which call for a generalized correlation structure for transition risks. Adding such a structure greatly increases the complexity of multi-state EHA, because numerous frailty parameters must be included, and this can lead to identification problems (Bijwaard, 2014). Additionally, EHA tends to oversimplify the relationship between interconnected events in the life course, which limits its ability to show how earlier parts of a trajectory affect later events (Rossignon et al., 2018).

Studies employing the Mare model to explore educational attainment in China (Li, 2006; Guo and Wu, 2010; Wu, 2013a, 2013b; Li, 2014a; Yang and Lin, 2014; Pang, 2016, etc.) have yielded significant findings, largely supporting the theoretical framework of EMI. However, “waning coefficients” lingered in their results: the influence of background variables consistently diminishes as students progress through educational stages and often becomes statistically insignificant at the transition from senior high school to college. Tang (2016) pointedly stated that as students ascend to higher levels of education, the school’s grade replaces socioeconomic status (SES) and cultural background as the primary contributor. By contrast, Gruijters’ (2022) study, which used the sequential logit model to examine China’s educational expansion, found that inequalities declined for the most recent cohort. Wu (2010) used a multi-group logit model to analyze differences in educational attainment between 1990 and 2000 and reached distinct findings. “Waning coefficients” are generally observed, with a single exception concerning the coefficients of household registration status (hukou) between the two years, as detailed in Table 8 of the original text. It should be noted, however, that the sample in this instance was restricted to populations living in rural areas. In an earlier study that applied Lucas’s (2001) model (Guo and Wu, 2010), the coefficients of background variables were significantly positive, but again only for the final educational transition of the last period. Moreover, the coefficients in the aforementioned results might have been understated because of the potential “waning coefficients” effect, and the understatement could be more severe if educational expansion in fact heightens the disparity linked to family background. These seemingly contradictory conclusions, whether supporting EMI or MMI, remain subject to further debate and require examination through the lens of an innovative methodology.Footnote 9

Utilizing the same data as in this paper, Hao et al. (2014) employed growth mixture modeling (GMM) to analyze educational attainment, delineating four latent classes to explore heterogeneity. However, when employing GMM to investigate highly selective phenomena such as educational attainment, it becomes imperative to address the challenge of missing values caused by unobserved heterogeneity. Although some findings align with the outcomes of the present research, some coefficients, such as those for rural schooling experience, especially in certain latent classes, may be underestimated due to missing data caused by selection bias.Footnote 10 Furthermore, the approach of conceptualizing the educational trajectory as a continuous variable presents shortcomings in capturing the multiplicity of educational pathways. This methodology falls short in distinguishing between academic and vocational education tracks, differentiating between key and non-key schools, and recognizing various levels of higher education.

Methodological advantages of sequence analysis

The concept of advantage/disadvantage accumulation plays a pivotal role within the life course framework in elucidating the emergence of inequality. An analytical model cannot accurately depict the dynamics of inequality unless it incorporates the diversity inherent in the cumulative process (Allison et al., 1982; Allison, 1999, p. 313; Dannefer, 2003; DiPrete and Eirich, 2006; Dannefer, 2009). However, the Mare model posits a scale-invariant feature at each educational transition, a presumption that makes these effects difficult to capture in the aforementioned studies.Footnote 11

Magnusson (2001) suggests, from a holistic interactionist perspective, that complex systems are marked by their irreducibility and indecomposability. Critiques have emerged regarding the gap between the theoretical focus on holism and the empirical reliance on generalized linear models; such models impose linear assumptions that stand in stark contrast to the principles of interactionism (Bergman and Magnusson, 1997; Bauer and Shanahan, 2007). These are the “difficulties deep down” (Wittgenstein, 1980, p. 48e) of this field. Xie (2011) recognized two types of heterogeneity-induced bias in educational transition studies, termed “outcome incommensurability” and “population incommensurability,” which are in fact connected to the basic claim of holistic interactionism. He posited that they were inherent problems that could not easily be remedied by better statistical models, and he therefore resorted to the expedient use of the sequential logit model. Nonetheless, to address these issues effectively without succumbing to undue pessimism, it is essential to embrace the Wittgensteinian strategy of “tearing out by the roots” and to “start thinking of these matters in a new way” (Wittgenstein, 1980, p. 48e).

Von Hayek (1989) emphasized the inherent challenges in precisely predicting “phenomena of organized complexity,” suggesting that only pattern recognition is feasible. In this regard, the person-centered approach offers an alternative that emphasizes individual uniqueness, complexity of interactions, variability of individual changes, generalization of patterns, and finiteness (Sterba and Bauer, 2010). This approach captures the high level of interaction and non-linear relationships in dynamic processes by identifying homogenous subgroups and preserving the complex dynamics of the variable system, which can be seen as surface outputs of underlying processes that accumulate over time and trigger transitions between states (Halpin, 2019).Footnote 12

As a typical person-centered approach, sequence analysis (SA) was introduced into the social sciences from computer science and biostatistics by Abbott (Abbott, 1983; Abbott and Forrest, 1986). SA brings “process” back into sociological theory and empirical research through “narrative positivism” (Abbott, 1988, 1992) or a “story” approach (Cornwell, 2015). SA does not require any assumptions about the life course, avoids the methodological pitfalls connected with simple statistical aggregation of heterogeneous types, and enables straightforward translation of concepts from the life course perspective (Courgeau, 2018; Vanhoutte et al., 2019).

Research hypotheses

Owing to the absence of previous studies on educational inequality that employ SA, this study has to rely on established theories for the formulation of formalized hypotheses. However, given the discord between current theoretical findings and methodologies, the hypotheses will be both crafted and examined on a distinct methodological base, albeit with some superficial similarities.

Boudon (1974) and Mare (1981) postulated that assessing education’s impact on equality requires analyzing the changes in educational opportunities that accompany educational expansion. According to modernization theory, wider educational access is believed to raise the general level of educational attainment and to lead the principle of achievement to supplant the principle of ascription, thereby diminishing educational inequalities (Treiman, 1970; Boudon, 1974; Treiman and Yip, 1989). Empirical studies from China, such as Liu (2006), lend support to this assertion. On this basis, Hypothesis 1 is proposed: with the universalization of compulsory education and the growth of higher education, overall educational opportunities increase, the probability of terminating education prematurely declines, and the likelihood of pursuing higher education trajectories rises.

Educational inequality, which is rooted in broader social inequality, does not yield consistent results across social subgroups. Higher classes maintain advantages, while opportunities for lower classes do not increase unless a certain level of education is saturated. With the expansion of educational access, disparities in education tend to manifest more significantly in terms of qualitative discrepancies rather than quantitative imbalances (Raftery and Hout, 1993; Lucas, 2001). This phenomenon is corroborated by Shavit and Blossfeld’s (1993) international comparative study, as well as by empirical evidence from China (Li, 2006, 2010; Wu, 2010, etc.). Social subgroups exhibit significant differences in their educational trajectories. Based on Pfeffer’s (2008) conceptualization of educational inequality linking individuals’ educational attainment to their parents’ highest education level, Hypothesis 2 is raised: individuals with highly educated parents experience more stable and high-quality educational trajectories.

The urban–rural dualist structure in China plays a considerable role in fostering the nation’s socioeconomic stratification. This systemic division results in stark disparities in educational opportunities for urban versus rural inhabitants (Li, 2014b; Wang, 2014). Therefore, Hypothesis 3 is formulated: rural individuals are more likely to experience terminated educational trajectories and have limited access to education trajectory types with cumulative advantages compared to their urban counterparts.

Methodological considerations

Data profile

This study draws on data from the 2008 Chinese General Social Survey (CGSS 2008), which has a sample size of 6000 individuals.Footnote 13 Rather than engaging in cohort comparative analysis, it examines the dynamics of shifts in educational trajectories. To limit excessive heterogeneity, the study focuses on the 1976–1988 birth cohort. The educational trajectories of this cohort fall within a period that begins with the enactment of the integrated college enrollment policy in 1994 and ends with the survey in 2008, and thus cover an entire cycle of education for this cohort.Footnote 14 Consequently, it is feasible to observe their complete educational trajectories empirically.Footnote 15 After programmatic and manual checks, 21 cases whose educational sequences contained irreparable logical inconsistencies were excluded. This vetting yielded a refined sample of 1305 valid respondents, corresponding to 3915 person-stage records in long format.

Variables and measurements

A life history calendar is indispensable for sequence analysis. The CGSS 2008 survey includes an education history table that records respondents’ academic trajectories in detail, including the start and end dates of each spell and the category of educational institution attended, among other details. These data are converted into spell format and then defined as a state sequence object, as required for sequence analysis. Educational levels are recoded to distinguish the tiers of educational institutions, including universities, colleges, and junior and senior high schools. The two highest types of higher education institutions are merged, yielding a schema of 21 distinct educational states, including one state that accounts for an empty phase between two education stages (see Figs. 2 and 3 for more information).Footnote 16
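The conversion from spells to a state sequence can be sketched as follows. This is a minimal pure-Python illustration, not the authors' R/TraMineR code: it assumes a sequence indexed by year of age, and the state labels, the `spells_to_sequence` helper, and the example spells are all hypothetical. Only the idea of a dedicated state for an empty phase between two education stages comes from the text.

```python
from typing import List, Tuple

# Hypothetical state labels; only the "gap" idea (an empty phase between
# two education stages) is taken from the text.
NOT_STARTED = "not_started"
GAP = "gap"
LEFT = "left_education"

Spell = Tuple[int, int, str]  # (first_age, last_age, state), ages inclusive


def spells_to_sequence(spells: List[Spell], start_age: int, end_age: int) -> List[str]:
    """Expand education spells into one state per year of age."""
    if not spells:
        raise ValueError("at least one spell is required")
    first_begin = min(begin for begin, _, _ in spells)
    last_end = max(end for _, end, _ in spells)
    sequence = []
    for age in range(start_age, end_age + 1):
        covering = [state for begin, end, state in spells if begin <= age <= end]
        if covering:
            # If spells overlap, the one listed last is used.
            sequence.append(covering[-1])
        elif age < first_begin:
            sequence.append(NOT_STARTED)
        elif age > last_end:
            sequence.append(LEFT)
        else:
            sequence.append(GAP)
    return sequence


# Illustrative respondent: age 16 falls between two education stages.
example_spells = [
    (7, 12, "elementary"),
    (13, 15, "junior_high"),
    (17, 19, "senior_high"),
]
example_sequence = spells_to_sequence(example_spells, start_age=7, end_age=21)
```

Treating exit from education as its own state, rather than truncating the sequence, keeps every respondent's sequence the same length.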

The locale of respondents’ residence at the age of 14 serves as a proxy for household registration. Because the family background items lack information on past income, and the parental occupation codes at respondent age 14 have a high proportion of missing values (59.5% for fathers and 77.7% for mothers), parental educational attainment is taken as the principal measure of family background.Footnote 17 The highest education level achieved by the parents is coded as a continuous variable representing years of schooling. For instance, elementary school corresponds to 6 years of education and junior high school to 9 years. The survey instrument includes a 10-item scale with five levels to gauge academic performance at age 14, which predominantly evaluates attitudes toward learning, efficacy, and adaptation to school life. The sum of these items forms a composite measure.Footnote 18 Gender and age are also incorporated into the model. The statistical analyses proceed without weighting, as the study uses data from only a single cohort.
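The two recodings described above can be illustrated with a short pure-Python sketch. Only the values of 6 years for elementary school and 9 years for junior high school come from the text. The remaining year values, the 1 to 5 item coding, and the function names are assumptions made for illustration.

```python
from typing import Optional, Sequence

# Years of schooling per level. Only "elementary" and "junior_high" are
# stated in the text; every other value is an illustrative assumption.
YEARS_OF_SCHOOLING = {
    "none": 0,             # assumption
    "elementary": 6,       # from the text
    "junior_high": 9,      # from the text
    "senior_high": 12,     # assumption
    "junior_college": 15,  # assumption
    "university": 16,      # assumption
}


def parental_education_years(father: Optional[str], mother: Optional[str]) -> Optional[int]:
    """Return the years of schooling of the more educated parent."""
    known = [level for level in (father, mother) if level is not None]
    if not known:
        return None  # both parents missing
    return max(YEARS_OF_SCHOOLING[level] for level in known)


def composite_score(items: Sequence[int]) -> int:
    """Sum a 10-item scale, assuming each item is coded from 1 to 5."""
    if len(items) != 10:
        raise ValueError("the scale has exactly 10 items")
    if any(not 1 <= item <= 5 for item in items):
        raise ValueError("each item must lie between 1 and 5")
    return sum(items)
```

Under the assumed 1 to 5 coding, the composite ranges from 10 to 50.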

Table 1 provides details of the background variables utilized in the model for the 1976–1988 birth cohort. The distribution of the sample is generally normal, although it exhibits a slightly higher urbanization rate compared to the actual situation.

Table 1 Statistical description of variables.

Causal inference basis of sequence analysis and auxiliary model

The study’s fundamental positions are as follows: (1) Life course sequence data can be interpreted as a collection of condensed individual biographies, rich in information beyond observable metrics. Determining unobserved heterogeneity in this context is dynamic, multifaceted, and emergent. This complexity necessitates a holistic interactionist perspective, which views educational trajectories as comprehensive processes from their very inception, incorporating unobserved heterogeneity into the morphological fabric of these trajectories. This perspective recognizes the inherent nonlinearity, interdependence, and adaptability of educational experiences, emphasizing the importance of understanding these elements in conjunction rather than isolation.

(2) The analysis consistently encompasses the entire sample, a methodological choice aimed at mitigating bias stemming from the selection of the sample. A person-centered approach is adopted, meticulously documenting the complete educational trajectories of individuals within a cohort, while acknowledging disruptions or terminations as distinct states. This approach allows for a comprehensive assessment of the entire sample, paralleling the benefits of prospective studies. As Vanhoutte et al. (2019) suggest, focusing on a single event inherently excludes individuals who have not experienced it. SA addresses this by examining the trajectory in its entirety, ensuring the inclusion of individuals who may not have been at risk of encountering the event.

(3) Although it is not feasible to assume that unobserved variables or inherent differences, such as IQ, are randomly distributed across the population, the study aims to ensure that the classification and correspondence produced by sequence clustering are sufficient to satisfy the ignorability assumption by maximizing discernible differences among educational trajectory categories. This methodology aligns with the traditional approach of conditioning through stratification in causal analysis (Rosenbaum, 2002; Morgan and Winship, 2015), and it is considered preferable to propensity score matching that relies on external explanatory variables. Provided these conditions are met, the study argues that issues of selectivity-based endogeneity can be circumvented when examining the prevalence of diverse educational trajectory types within distinct characteristic subgroups.Footnote 19

Specifically, the sequence analysis is executed programmatically using the TraMineR and TraMineRextras packages within the R language. These packages cluster state sequences according to the optimal matching distance. This investigation intentionally circumvents the multinomial logit model to prevent regression to a variable-centered approach and to avoid the “curse of dimensionality” that arises when integrating a wide array of interaction terms into the model. Instead, the research employs a conditional inference tree model—a tree-based methodology recognized as a supervised learning algorithm conducive to causal inference (Athey and Imbens, 2016; Wager and Athey, 2018; Brand et al., 2021). By utilizing techniques such as heterogeneity maximization and adaptive nearest neighbor matching within recursive partitioning, these models enable the segmentation of the dataset into distinct sub-samples. This approach provides a flexible and interactive mechanism for addressing confounders, thereby improving the accuracy of heterogeneous causal effect estimations.
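To make the notion of an optimal matching (OM) distance concrete, the sketch below implements the classic dynamic-programming form of OM in pure Python. It is an illustration under stated assumptions, not the TraMineR implementation used in the study: the `om_distance` and `constant_cost` names, the indel cost of 1, and the constant substitution cost of 2 are all hypothetical choices, and no localized or data-driven cost scheme is applied here.

```python
from typing import Callable, Sequence


def om_distance(
    a: Sequence[str],
    b: Sequence[str],
    sub_cost: Callable[[str, str], float],
    indel: float = 1.0,
) -> float:
    """Minimum total cost of turning sequence a into sequence b using
    insertions, deletions, and substitutions (classic OM)."""
    n, m = len(a), len(b)
    # prev[j] holds the cost of aligning the first i-1 states of a
    # with the first j states of b.
    prev = [j * indel for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j] + indel,                             # delete a[i-1]
                curr[j - 1] + indel,                         # insert b[j-1]
                prev[j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitute
            )
        prev = curr
    return prev[m]


def constant_cost(x: str, y: str) -> float:
    """Assumed substitution cost: 0 for identical states, 2 otherwise."""
    return 0.0 if x == y else 2.0


# Two illustrative five-position sequences that differ in one position.
seq_a = ["E", "E", "J", "J", "S"]
seq_b = ["E", "E", "J", "S", "S"]
distance_ab = om_distance(seq_a, seq_b, constant_cost)
```

The pairwise distances obtained this way form the matrix on which a clustering algorithm can operate.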

Research findings

Descriptive statistics

This study delineates educational progression with a Sankey diagram (Fig. 1), which uses educational sequence data to depict educational trajectories graphically and to verify the data. The illustration portrays the educational continuum as a multifaceted construct characterized by sequential dynamism, branching heterogeneity, and pronounced selectivity. It highlights the significant repercussions of the 1990s educational reforms for the educational trajectories of the 1976–1988 birth cohort. These reforms are associated with a greater propensity for higher educational attainment, as evidenced by the rise in enrollment rates at tertiary institutions such as junior colleges and universities, as well as a notable increase in senior high school participation.

Fig. 1: Educational trajectories of the 1976–1988 birth cohort.

The figure presents a comprehensive Sankey diagram generated using gvisSankey, which visualizes the educational pathways of the 1976–1988 birth cohort. The diagram validates the educational sequence data and highlights the dynamic and heterogeneous nature of educational progression, capturing the complex interplay of sequential dynamism, branching heterogeneity, and pronounced selectivity in shaping the cohort's educational trajectories.

Sequence analysis

Figure 2 depicts chronograms (state distribution plots) illustrating the educational trajectory of the 1976–1988 birth cohort, revealing shifts in the prevalence of various educational levels across distinct age brackets. The trend observed in the data indicates that the majority of the cohort transitions from elementary to junior and senior high school with increasing age. Concurrently, there is a rise in the percentage of the cohort engaging in informal education as the cohort exits the formal education system.
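The quantity plotted in a chronogram is the share of each state at each position of the sequence. A minimal pure-Python sketch of that computation follows. The `state_distribution` helper and the toy sequences are hypothetical, and the sketch makes no claim to reproduce the output of seqdplot.

```python
from collections import Counter
from typing import Dict, List, Sequence


def state_distribution(sequences: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """For each position, return the proportion of sequences in each state."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    distribution = []
    for position in range(length):
        counts = Counter(seq[position] for seq in sequences)
        distribution.append(
            {state: count / len(sequences) for state, count in counts.items()}
        )
    return distribution


# Four illustrative three-position sequences.
toy_sequences = [
    ["elementary", "junior_high", "senior_high"],
    ["elementary", "junior_high", "left"],
    ["elementary", "junior_high", "left"],
    ["elementary", "left", "left"],
]
toy_distribution = state_distribution(toy_sequences)
```

Stacking these per-position proportions along the age axis gives the kind of cross-sectional picture that a state distribution plot conveys.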

Fig. 2: Sequence distribution of education trajectories of the 1976–1988 birth cohort (seqdplot).

The figure illustrates the distribution of educational attainment among the 1976–1988 birth cohort using the seqdplot function. It documents shifts in the prevalence of various education levels across different ages. The graphically displayed structural patterns demonstrate a fluid progression from elementary to middle and high school education as the birth cohort ages, alongside more complex shifts within higher education.

Cluster analysis

A substitution cost matrix, derived from the transition rates observed in the aggregate sequence pattern, is used to compute a distance matrix among sequences with the OMloc method. For the cohort in question, the exponential cost parameter (expcost) within OMloc is set to zero.Footnote 20 Hierarchical clustering with the Ward method is then conducted on this distance matrix, and 16 clusters are chosen on the basis of practical relevance and goodness of fit.Footnote 21 Figure 3 presents the state distribution for the various sequence types, as delineated by the seqdplot function, each manifesting a unique attribute. The clustering outcome accounts for 68.8% of the discrepancy, and Table 2 details the proportions and the salient features of each cluster. The clusters demonstrate a high degree of fit and distinctiveness, mirroring the combined effects of the key school system and the bifurcation of academic and vocational tracks.
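Two ingredients of this step can be sketched in pure Python. The first function pair derives substitution costs from transition rates using the common rule cost(i, j) = c − p(j|i) − p(i|j) with c = 2. The second computes the share of discrepancy explained by a partition as 1 − SS_within / SS_total from a distance matrix. Both are illustrations under assumptions: the function names and toy data are hypothetical, whether the reported 68.8% rests on raw or squared distances is not stated in the text, and the sketch implements neither OMloc nor Ward clustering.

```python
from collections import Counter
from itertools import product


def transition_rates(sequences):
    """Estimate p(next = j | current = i) from adjacent positions."""
    pair_counts, origin_counts = Counter(), Counter()
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            pair_counts[(current, following)] += 1
            origin_counts[current] += 1
    states = sorted({state for seq in sequences for state in seq})
    return {
        (i, j): pair_counts[(i, j)] / origin_counts[i] if origin_counts[i] else 0.0
        for i, j in product(states, repeat=2)
    }


def substitution_costs(rates, cval=2.0):
    """Symmetric costs: frequent transitions make two states cheaper to swap."""
    return {
        (i, j): 0.0 if i == j else cval - rates[(i, j)] - rates[(j, i)]
        for (i, j) in rates
    }


def pseudo_r2(dist, labels):
    """Share of discrepancy explained by a partition of the cases."""
    def sum_of_squares(members):
        # Sum of pairwise distances among members, divided by group size.
        total = sum(dist[i][j] for a, i in enumerate(members) for j in members[a + 1:])
        return total / len(members)

    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    ss_total = sum_of_squares(list(range(len(labels))))
    if ss_total == 0:
        raise ValueError("all cases are identical")
    ss_within = sum(sum_of_squares(members) for members in groups.values())
    return 1.0 - ss_within / ss_total


# Toy data: two short sequences and a four-case distance matrix.
toy_costs = substitution_costs(transition_rates([["E", "E", "J"], ["E", "J", "J"]]))
toy_dist = [[0, 1, 4, 4], [1, 0, 4, 4], [4, 4, 0, 1], [4, 4, 1, 0]]
toy_r2 = pseudo_r2(toy_dist, ["A", "A", "B", "B"])
```

In the toy example the two clusters are tight and far apart, so most of the total discrepancy lies between clusters rather than within them.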

Fig. 3: Sixteen sub-types of sequence distribution of educational trajectories of the 1976–1988 birth cohort (seqIplot).

The figure reveals the diverse state distributions across a range of sequence types using the seqIplot function, each displaying distinct traits. These characteristics highlight the complex interplay of the key school system and the bifurcation between academic and vocational tracks. The visualization provides a detailed insight into the heterogeneity of educational pathways within the cohort, emphasizing the influence of both structural factors and individual choices.

Table 2 Description of 16 education trajectory clusters and characteristics of the 1976–1988 birth cohort.

The 16 identified trajectory types can be classified into three principal categories: (1) The general education pathway category encompasses Clusters 3, 4, 5, 8, 10, 11, 12, and 13, charting educational progressions from diverse tiers of secondary education to assorted junior colleges and universities. Clusters 4 and 8 are particularly noteworthy, as they represent the highest echelons of educational attainment at each respective stage, thereby exhibiting a significant cumulative advantage. Collectively, this category accounts for 33.0% of the overall composition. (2) The education termination category, which includes Clusters 1, 2, 9, 15, and 16, is characterized by a pattern of cumulative disadvantage. It represents the majority of cases, comprising 51.1% of the total. (3) The alternative pathways category is primarily defined by vocational education trajectories, including Clusters 6, 7, and 14. These trajectories typically involve educational institutions such as technical schools, vocational high schools, and specialized technical secondary schools. This category constitutes 15.9% of the total trajectories observed.
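The three-way grouping can be written down as a small lookup structure and checked for consistency. The sketch below uses only the cluster numbers and shares reported above; the dictionary layout and function name are illustrative.

```python
# Cluster membership and shares as reported in the text above.
CATEGORIES = {
    "general education pathway": {"clusters": [3, 4, 5, 8, 10, 11, 12, 13], "share": 33.0},
    "education termination": {"clusters": [1, 2, 9, 15, 16], "share": 51.1},
    "alternative pathways": {"clusters": [6, 7, 14], "share": 15.9},
}


def check_partition(categories, n_clusters=16):
    """Check that every cluster is assigned exactly once and sum the shares."""
    assigned = [c for info in categories.values() for c in info["clusters"]]
    covers_all = sorted(assigned) == list(range(1, n_clusters + 1))
    total_share = round(sum(info["share"] for info in categories.values()), 1)
    return covers_all, total_share


print(check_partition(CATEGORIES))
```

The check confirms that the three categories partition all 16 clusters and that the reported shares sum to 100%.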

Cumulative advantages and disadvantages manifest distinctly across various educational trajectory clusters. For instance, Cluster 8 exhibits a higher representation of students from elite high schools compared to peer clusters, while a substantial segment also stems from less prestigious secondary institutions. Typically, students hailing from high-ranking schools are more likely to gain entrance into subsequent elite educational levels, although there are exceptions where this trend is reversed, known in some contexts as “counterattacks” (ni xi). This pattern is also observed in Clusters 4 and 11, where the likelihood of such reversals seems to be more evident. Conversely, Clusters 9 and 15 are characterized by individuals who either departed from key junior high schools prematurely or who, despite graduating from key senior high schools, did not proceed to higher education. These instances account for a smaller fraction and are often labeled as “antitypes” or “white spots” within the terminology of SA (Bergman et al., 2003). These findings reveal that educational trajectories are intricate and non-linear, profoundly influenced by path dependency. They are prone to feedback mechanisms that have the potential to magnify initial disparities. However, there is considerable variability, resulting in a diverse educational landscape. Such complexity necessitates the implementation of advanced and adaptable analytical methods within the realm of educational research.

Conditional inference tree

This study utilizes the conditional inference tree (CIT) algorithm by Hothorn, Hornik, and Zeileis (2006), tailored to our analytical needs. As a non-parametric model, the CIT allows data to shape model structure and complexity, avoiding predefined parameters. It excels in handling complex, nonlinear data without distributional assumptions. The CIT algorithm effectively manages multi-valued categorical outcomes and explanatory variables, particularly in scenarios lacking a primary treatment assignment. It enables detailed analysis of educational trajectories and rigorous significance testing and mitigates selectivity-based endogeneity by preserving unconfoundedness achieved from the trajectory clustering process. By applying the CIT, this study conducts a nuanced classification of educational trajectories without imposing a uniform causal model, offering a sophisticated alternative to traditional measures like the average treatment effect (ATE) and ensuring sensitivity to each trajectory’s unique features and contexts.

The CIT algorithm selects variables for splitting using permutation-based significance tests, which circumvents biases associated with traditional information measures, like the Gini coefficient or information gain.Footnote 22 This process involves selecting the variable with the smallest p-value for each split and continues until no significant independent variables remain. Although this method embodies a conservative stance, it effectively determines an optimal tree size without the need for post-pruning or cross-validation techniques.Footnote 23
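The selection rule can be illustrated with a simplified sketch. It is not the algorithm of Hothorn et al. (2006) itself: a Monte Carlo permutation test on a chi-square statistic stands in for the conditional inference framework, a Bonferroni adjustment and a 0.05 threshold are assumed, and the data are hypothetical.

```python
import random
from collections import Counter


def chi_square(x, y):
    """Pearson chi-square statistic for two categorical variables."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    stat = 0.0
    for a in cx:
        for b in cy:
            expected = cx[a] * cy[b] / n
            stat += (cxy[(a, b)] - expected) ** 2 / expected
    return stat


def permutation_p_value(x, y, n_perm=999, rng=None):
    """Share of outcome permutations with a statistic at least as large as observed."""
    rng = rng or random.Random(0)
    observed = chi_square(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if chi_square(x, y_perm) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def select_split_variable(predictors, y, alpha=0.05, n_perm=999, seed=0):
    """Pick the predictor with the smallest adjusted p-value, or None to stop."""
    rng = random.Random(seed)
    raw = {name: permutation_p_value(x, y, n_perm, rng) for name, x in predictors.items()}
    adjusted = {name: min(1.0, p * len(raw)) for name, p in raw.items()}  # Bonferroni (assumed)
    best = min(adjusted, key=adjusted.get)
    return (best if adjusted[best] <= alpha else None), adjusted


# Hypothetical toy data: residence separates the outcome perfectly, gender not at all.
outcome = ["termination"] * 20 + ["general"] * 20
predictors = {
    "residence": ["rural"] * 20 + ["urban"] * 20,
    "gender": ["f", "m"] * 20,
}
chosen, p_adjusted = select_split_variable(predictors, outcome)
print(chosen, p_adjusted)
```

In a full tree the same test is repeated within each resulting subgroup, and a branch stops growing once `select_split_variable` returns `None`. This is how significance testing doubles as the stopping rule, which is why no separate pruning step is needed.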

In line with the established criteria, the dataset was partitioned into eight distinct leaf (terminal) nodes, as depicted in Fig. 4 and Table 3. A considerable variation in educational trajectory types was observed across these leaf nodes, exhibiting significant global or local effects for all predictive factors except gender. Table 3 details the distribution of the 16 educational trajectories among the leaf nodes. The key findings are as follows.

Fig. 4: Decision tree graph of educational trajectories of the 1976–1988 birth cohort.

The figure offers a comprehensive decision tree diagram, mapping out the diverse educational pathways of the 1976–1988 birth cohort. This dataset, strategically bifurcated into eight unique leaf nodes using established criteria, uncovers a remarkable range of variation in these educational routes. Beyond the factor of gender, all other predictive elements exert global or local effects, providing a nuanced understanding of the determinants shaping these educational trajectories.

Table 3 Summary of responses for clusters of educational trajectories by terminal nodes (%).

First of all, the decision tree that outlines the educational trajectories for the entire sample bifurcates primarily along the urban/rural divide. This suggests that, among the explanatory variables considered, the urban/rural distinction is the most significant factor in differentiating educational trajectories. Individuals who are in rural regions at the age of 14 are more likely to be classified within Cluster 2, as opposed to those in urban areas, highlighting the greater prevalence of compulsory education in urban settings.

Furthermore, parental education level significantly affects the educational outcomes of children in both urban and rural contexts, substantiating Hypothesis 2. It demonstrates a consistent and considerable influence on the quality of educational pathways, encompassing both exemplary and poor outcomes. This factor is crucial in determining whether offspring will follow the prevailing educational routes, though its impact on the entry into distinctive educational paths diverges between urban and rural regions.

In general, rural inhabitants more frequently conform to educational patterns characterized by Cluster 2. Despite the less pronounced morphological differentiation in rural areas compared to urban ones, notable systematic differences are evident across the three leaf nodes. Leaf Node 3, representing the most educationally disadvantaged subgroup, has the largest share of Cluster 2 among all nodes, at 46.8%. The main distinction within its predecessor node lies in whether parental education exceeds 6 years, which is equivalent to completing elementary education. If parents have 6 or fewer years of education, there is a roughly 50% chance that their children will be categorized within Leaf Node 3.

Among urban youth, the likelihood of falling into Leaf Node 8 escalates when parental educational achievement does not exceed nine years, which corresponds to finishing middle school or below. While this node predominantly consists of Cluster 2 members, the proportion is smaller at 20.4% compared to its rural equivalents, such as Leaf Nodes 3, 5, and 6. Furthermore, this leaf node encompasses a larger percentage of superior educational trajectory types relative to those observed in rural settings.

With respect to the expansion of higher education, urban populations possess a distinct advantage in accessing elevated educational trajectories such as junior college (Cluster 5) and second-best or lowest higher education trajectories like Clusters 11 and 13. The progression through educational levels adheres to a social choice mechanism that aligns with a survival pattern, where transition rates differ among different social strata, as delineated by Müller and Karle (1993). This evidence underscores a more pronounced “survivor effect” within the rural populace. The group of rural elite students lags behind their urban counterparts across all echelons of identified formal higher education trajectories. Institutions such as junior colleges, undergraduate colleges, and universities at the prefectural level play an essential role in facilitating the pursuit of higher education for urban students with average endowments. The significance is especially pronounced when considering that these students are viewed within a competitive framework alongside their rural peers, who have undergone more stringent selection processes. Hence, Hypothesis 3 is broadly corroborated.

It is critical to understand that within this model, the “age” variable does not simply signify the chronological age but rather encapsulates the period effect associated with the birth year of the cohort, amidst the structural shift occasioned by the expansion of education.Footnote 24 For rural individuals with more than six years of parental education, “age” plays a substantial role in the determination of entering distinct educational trajectories. Younger rural pupils, born more recently and commencing their education later, are more inclined to progress to Leaf Node 5, which encompasses 31.0% of Cluster 2, with Clusters 5 and 8 also representing noteworthy proportions. Conversely, those over 22 are more apt to advance to Leaf Node 6, where Cluster 2 constitutes 44.9%. This indicates that younger rural students are less prone to discontinuing their elementary schooling and tend to adhere to stable educational trajectories. Overall, despite the aforementioned distinctions, it is the frequency of rural students advancing to Cluster 2 that overwhelmingly surpasses that of any urban node, regardless of the background combinations that characterize these nodes.
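The splits described so far can be read as explicit decision rules. The sketch below encodes only the branches stated in the text, namely the rural branch and the first urban split on parental education; the remaining urban splits are deliberately left unresolved, and the function and variable names are illustrative.

```python
# Share of Cluster 2 (%) in each leaf node, as reported in the text.
CLUSTER2_SHARE = {3: 46.8, 5: 31.0, 6: 44.9, 8: 20.4}


def assign_leaf(residence_at_14, parental_education_years, age):
    """Return the leaf node for the branches described in the text, else None."""
    if residence_at_14 == "rural":
        if parental_education_years <= 6:
            return 3
        return 5 if age <= 22 else 6
    if parental_education_years <= 9:
        return 8
    return None  # further urban splits are not encoded in this sketch


for profile in [("rural", 6, 25), ("rural", 9, 20), ("rural", 9, 30), ("urban", 9, 25)]:
    leaf = assign_leaf(*profile)
    print(profile, "-> Leaf Node", leaf, "| Cluster 2 share:", CLUSTER2_SHARE[leaf])
```

Laid out this way, the comparison in the text is easy to see: every rural leaf has a higher Cluster 2 share than the least advantaged urban leaf, Leaf Node 8.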

Analogous to the patterns observed within rural nodes, among the urban demographic segment, those aged 27 or under, categorized within Parent Node 12, exhibit a greater propensity to transition into Leaf Node 13. This node is predominantly influenced by Cluster 4, with Cluster 8 having the most substantial representation across the nodes. In contrast, individuals aged over 27 tend to proceed to Leaf Node 14, which is distinguished by a preponderance of Cluster 12. These patterns mirror a period effect specific to the 1976–1988 birth cohort, which can be attributed to the widespread implementation of compulsory education and the subsequent enlargement and normalization of higher education throughout the 1990s. Consequently, Hypothesis 1 finds general corroboration through these observations.

The findings reveal that the expansion of higher education exerts a more pronounced effect on younger individuals hailing from rural locales and minor urban centers. An evaluation of pivotal age demarcations—22 and 27 years respectively—demonstrates that adolescents from small towns accrue more benefits than those from rural areas. In stark contrast, this policy wields a diminished influence on larger and mid-sized urban areas, a consequence of their pre-existing advantages in educational accessibility. The semblance of equity in educational proliferation across the urban–rural spectrum is a direct result of the historical impediments faced by individuals from rural settings and small towns in securing educational opportunities. The observed phenomenon implies that the reduction of educational disparities within broader socio-spatial frameworks is largely explained by this compensatory effect.

The results of the study elucidate the emergence of a distinctly well-educated echelon. This segment predominantly consists of individuals from substantial urban areas who exhibit enhanced adaptability in educational pursuits and whose parents have often had access to higher education. The genesis of this group lies at the intersection of accessible opportunities, familial backgrounds, and individual diligence. In addition, the study identifies a distinct category termed the “small-town swot” (xiao zhen zuo ti jia) within Parent Node 12. This group bears resemblance to the concept of the “privileged poor” as delineated by Jack (2019), and carries significant implications for the field of educational sociology.Footnote 25 The term, which first emerged in an interest group of a well-known Chinese online community, is used to express a complex socio-psychological identification in a self-deprecating tone. It subtly underscores the dissonance between the group’s academic achievements and their actual social standing, a phenomenon situated within the framework of China’s exam-centric educational system and urban–rural dual system.

The findings indicate that the methodology employed in this study is particularly sensitive to identifying unique subgroups within the populace. It is imperative, however, to acknowledge that variable-centered models exhibit inherent constraints in their capacity to comprehensively capture specific cohorts. This constraint arises from these models’ propensity to dismember holistic educational trajectories, thus restricting the study to isolated segments and fostering sample bias.

The learning situation at age 14 emerges as a notable differentiator for urban youths whose parents have attained more than 9 years of maximum schooling. Within this demographic, the likelihood of transitioning into elite educational pathways is notably higher, exemplified by Leaf Node 10, where 20.0% of participants are from Cluster 5. This likelihood is particularly high in urban populations where this indicator exceeds 35 (i.e., the average value). Residents from prefecture-level cities and provincial capitals or municipalities have the highest proportion of individuals entering Leaf Node 15.Footnote 26 Cluster 8 enjoys a higher percentage within this node, especially relative to Leaf Node 14, which is significantly linked to informal higher education. Leaf Node 13, on the other hand, is marked by Clusters 4 and 8, which are notable for their smaller scale and more homogeneous composition. In summary, a better learning situation substantially improves the likelihood of pursuing academic trajectories that culminate in higher education.

The tree-based model utilized in this study demonstrates variability in the extent of differentiation and specialization of educational trajectories between urban versus rural areas. This variation poses challenges in terms of comparability, as urban demographic groups are more likely to exhibit dominance in certain trajectory types within particular leaf nodes. Nonetheless, even the worst-case scenario for urban populations, as illustrated by Leaf Node 8, displays a greater propensity for individuals to enter advantageous educational trajectories and avoid inferior ones compared to all rural nodes. Additionally, there are clear differences within Cluster 16 when comparing rural and urban areas. For example, in rural areas, a higher percentage (21.2%) of individuals have parents with 6 years or less of education, as seen in Leaf Node 3.Footnote 27

In addition to the previously discussed findings, several other observations are particularly significant in a holistic sense and thus deserve further emphasis and clarification:

Firstly, an examination of Table 3 reveals disparities in the distribution of Cluster 1, particularly between Leaf Nodes 3 and 15 and the others (Leaf Nodes 5, 6, 8, 10, 13), with the exception of Leaf Node 14. Specifically, the prevalence of Cluster 1 at Leaf Nodes 3 and 15 is estimated to be around 5%, whereas at the other leaf nodes it ranges from 10% to 15%. This pattern ostensibly suggests a balanced distribution of educational pathways across urban and rural regions. However, this parity is restricted solely to instances of non-success in college entrance examinations and is unidirectional in nature.

Despite the superficial similarity in proportions, the equivalent presence of Cluster 1 in Nodes 3 and 15 denotes fundamentally disparate realities. A larger fraction of rural students end their formal education before reaching high school and thus never have the chance to fail the college entrance examination. Consequently, this situation affords urban students a relative advantage in fulfilling their educational paths, thereby creating an educational divide with their rural counterparts, notwithstanding the demographic predominance of rural inhabitants at that time. These observations superficially concur with the principles of the Mare model but are derived through a different methodology. From a holistic perspective, there is no evidence that educational inequality decreases progressively with transitions or that a “tail-raising” effect of inequality occurs during the transition from senior high school to college. Except for a small percentage of special types, educational trajectories, as they are typically mapped, demonstrate persistent inequality, characterized by the cumulative amplification of either advantages or disadvantages.

Secondly, when considering the overall picture, the homogeneity within populations from different levels of urban or rural residence should not be overstated. Urban populations have better access to lower-level higher education trajectories with full-time or part-time junior colleges. Nevertheless, a considerable proportion of the population in urban areas does not continue their education beyond mandatory or senior high levels. Conversely, with regard to variations among variables, this study differs from existing studies, revealing that the urban–rural dichotomy and familial background influence individuals differently across varied educational trajectories. When examined from a holistic interactionistic standpoint, the causal links of the underlying mechanisms are conditional and heterogeneous. Moreover, when scrutinizing the impact of parents’ educational levels on the academic achievements of their offspring, pronounced disparities are predominantly observed within Clusters 2, 5, and 8. Conversely, the termination type of senior high school (Cluster 1), various vocational education types (Clusters 6 and 7), and informal higher education types (Clusters 12, 13, and 14) exhibit less variability. Clusters 9 and 15 represent two unique types of reversals, with negligible differences across nearly all nodes. This pattern implies that the nexus between the education levels of parents and the educational success of children is multifaceted and conditional, incorporating elements of intergenerational statistical regression effects.

Thirdly, echoing Cornwell’s (2015, p. 34) critique of “general linear reality”, the CIT reveals that certain variables play a pivotal role in differentiating between types of educational trajectories within urban settings, yet they do not hold the same significance in rural contexts. On one hand, this implies that rural children experience more pervasive educational failures compared to urban children. On the other hand, the academic achievements of rural students appear to hinge more heavily on unobserved factors such as ability or specific contingencies, such as making the right choices during admission to higher education. Replicating these individual successes poses a challenge, as they tend to follow less predictable patterns than those observed in urban environments. In other words, their educational careers are fraught with greater uncertainties. In this regard, by comparison, urban populations rely more on achieved factors for access to quality educational trajectories. The prevailing conditions in rural areas fail to provide adequate support for academically diligent students to pursue top-tier educational opportunities. This fundamental issue lies at the heart of the educational divide between urban and rural areas.

Discussions and conclusions

Summary of findings and implications

Educational trajectories are complex and involve selective mechanisms. However, the Mare model and its adaptations struggle to incorporate the elements of selectivity and diversity that characterize educational careers. The emergence of “waning coefficients” underscores this deficiency, except when particular analyses are confined to localized effects or take into account extended time frames or significant temporal changes that conceal the extent of bias caused by selectivity. Although the preponderance of research signals an increase in educational inequality, the identification of “waning coefficients” casts doubt on this assertion. Moreover, the assumptions integrated into models designed for bias correction in discrete selection are often overly simplistic and fall short in confronting the elaborate and heterogeneous nature of educational pathways.

Utilizing a person-centered, holistic statistical methodology, the combined application of sequence clustering and tree-based modeling provides an effective means to dissect complex educational trajectories. This approach requires minimal assumptions and adopts a naturalistic perspective, furnishing a comprehensive yet nuanced portrait of educational progression. The analysis reveals that educational disparities manifest early and accumulate over time, enduring throughout the span of one’s educational career, and are particularly pronounced at the extreme ends of the educational trajectory typology. Educational attainment, viewed as a filtration process, induces a selection bias that conventional educational transition models struggle to address. In this approach, however, selectivity becomes a fulcrum that research can capitalize on. This allows for the counterbalancing of unobserved heterogeneity, thereby enabling a more reliable evaluation of the causal interplay between individuals’ background characteristics and their educational achievements.Footnote 28 Based on the above methodology, this study has enhanced the differentiation of individual cases, particularly those exhibiting unobserved heterogeneity, by implementing holistic trajectory matching and clustering. In conjunction, a tree-based model with manifest covariates has been employed for explicit distinction. This dual strategy sharpens between-group differences across both case and variable dimensions and helps to mitigate the effects of unobserved heterogeneity. It also aids in identifying comparable subgroups, resulting in more accurate estimations.

Beyond several research hypotheses, our approach has discerned differential impacts of educational expansion policies across diverse demographic segments. It has pinpointed particular cohorts, including the highly-educated elite and “small-town swots,” while also revealing the multifaceted ways in which background factors influence entry into distinct educational pathways. Some pathways may be more accessible to specific backgrounds; however, this does not uniformly apply across all trajectories. It can be seen that, despite senior high school termination being equally distributed at each node, this localized equality does not necessarily translate into systemic equality. The “counterattack” phenomenon, although prevalent across multiple trajectories, does not exhibit enough distinction to constitute a separate category. Conversely, transitions from key to non-key schools are identified as two distinct types, presenting a seemingly random pattern not tied to particular characteristics. These findings suggest the presence of dynamic mechanisms such as compensatory or statistical regression effects within the trajectories.

In the investigation of educational stratification through quantitative methods, sociologists should leverage their discipline’s intrinsic holistic perspective and comparative advantage, rather than merely emulating econometric methodologies. The examination of micro-mechanisms and decision-making processes within educational trajectories is crucial, necessitating the integration of both analytical (for instance, as illustrated by Holm et al., 2019) and qualitative (as explored by Walther et al., 2015) approaches. To effectively dissect the procedural complexities of educational stratification, it is crucial to embrace a holistic viewpoint. An overarching perspective is as important as an in-depth understanding of specific mechanisms, aiding in the formation of accurate causal inferences at an emergent level.

Limitations and perspectives of research

SA, a methodology enjoying a resurgence of interest, is garnering increasing recognition as an alternative approach to examining social processes. However, this approach is not devoid of limitations, such as challenges associated with integrating time-varying variablesFootnote 29 and the data-driven nature of cluster analysis. SA can present methodological intricacies, especially when dealing with expansive datasets containing numerous elements, potentially yielding complex patterns. The effectiveness of this method relies heavily on the availability of comprehensive data. Moreover, the procedure is computationally demanding, with advanced analyses of extensive and intricate datasets necessitating significant computational resources and processing time.

Owing to the small sample size of this study, some findings may be insufficiently robust and require more data for verification. Utilization of big data holds the potential to improve classification accuracy and facilitate more comprehensive trajectory matching, thereby reducing confounding factors when integrated with other typological methods. Furthermore, it is crucial to acknowledge that sequences are noteworthy not merely in their morphological structure but also in the intricate dynamics they encompass. In the age of complex social science, the progression of research, especially in relation to trajectory dynamics, necessitates an integration of mechanistic analysis with data science algorithms. Promising exploratory avenues encompass methodologies such as graph theory, simulation, and deep learning, among others.

Finally, it is important to declare that this study aims to address “waning coefficients” and provide an alternative approach, rather than dismissing the long-term or mechanistic effects revealed by existing studies. Solving all methodological problems in one paper is not possible, but this approach may inspire further exploration.