Introduction

Epidemic models that explicitly incorporate network structure have gained traction during the COVID-19 pandemic1,2,3,4,5. Including information about the intricate network structure of the population allows for better predictions of the shape of the epidemic curve, the regions or population groups likely to be infected given observed cases, and the types of contacts relevant for transmission. All three are highly relevant for public health responses.

Despite advances in network-based epidemic modeling during the COVID-19 pandemic, a significant gap remains in measuring the relative impact of different types of school and family contacts on transmission dynamics. Previous studies on influenza6,7 and on SARS-CoV-2 (refs. 8,9) highlight the importance of schools as bridges for disease transmission between households, schools and other social contact areas. Indeed, in an attempt to contain the spread of the virus, governments around the world closed schools, resulting in large learning losses, especially among students from less-educated families10.

At the same time, studies in diverse countries such as the United Kingdom, Australia, and Singapore emphasize that transmission within schools can be managed with interventions such as physical distancing, air filtering and rapid isolation, and that household contacts remain a more prominent pathway for transmission11,12,13,14,15,16,17. For instance, Cordery et al.13 observed minimal school-based transmission when precautions were in place, contrasting with high secondary transmission rates in households, likely due to prolonged close contact and viral shedding. Furthermore, in Wales, Thompson et al.11 found that while students faced increased risks of infection from peers in their immediate year groups, the total number of cases in a school was not associated with an increased risk for staff or pupils. Similarly, Macartney et al.12 observed low SARS-CoV-2 transmission rates in Australian educational settings, suggesting that schools did not contribute significantly to COVID-19 spread when effective case-contact testing and epidemic management strategies were in place.

Our study adds to this body of research by examining the impact of family and school contacts on COVID-19 transmission among Dutch students using detailed population-level registry data from the Netherlands. We specifically examine students who transitioned from primary to secondary school in 2021, a transition that takes place at age 12. Focusing on this transition allows us to distinguish whether infections occur primarily at school, through social ties inherited from primary school, or through non-school interactions such as community transmission. Specifically, we match pairs of students who attended primary school together in 2020 and attended either separate secondary schools or the same secondary school in 2021. Given the large student segregation at schools18, we expect students who attended the same primary school to be similar to each other. We compare these groups of students to a reference group of students who did not attend primary or secondary school together. We then calculate the probability of temporally associated infections for the different groups as a function of the distance between the students’ homes. Next, we compare our results to the probabilities of temporally associated infections among different family members (siblings, parent-child, co-parents) living in the same house or at varying distances. Finally, we run a series of multilevel regressions to understand the heterogeneity between schools.

Our findings show that family ties contribute strongly to the spread of SARS-CoV-2. While attending the same school increased the probability of temporally associated infections from 0.5% to 1.6%, the probability of associated infections was much higher for family members living in the same house (25–50%) and even for family members living at different addresses (around 10%). During the period studied in this paper, temporally associated infections in primary schools were rare. These results align with previous literature showing a high frequency of secondary transmission in households11,13 and low frequency of transmission in schools. Examining heterogeneity at the school level, we found that factors such as the distance between the students’ homes, school size, the median income of the postcode area of the school, and school denomination explained only 3% of the variance in outcomes. Most of the variance manifested at the individual level (60%) and at the school level (35%).

The paper proceeds as follows: Section 2 details the data, the matching of students and the analysis of the data. Section 3 details the probability of temporally associated infections for different subgroups. Section 4 concludes and discusses the potential of administrative data for epidemic studies.

Data and methods

Main datasets and network construction

Our analysis integrates two main datasets from Statistics Netherlands (CBS): the COVID-19 PCR-test data and the population network data. All CBS datasets can be linked to one another at the individual level through a unique identifier. Below we provide a short summary of the data processing steps. A detailed explanation of all datasets and variables can be found in the online Supplementary Information.

The first dataset is the COVID-19 test dataset, which includes all PCR-tests conducted by municipal health services in the Netherlands outside of a hospital setting between June 2020 and September 2021. Schools were open for the majority of the period studied (Fig. 1). Since reinfections were unusual for the period studied, we retained for each person the first recorded infection.
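The step of keeping only the first recorded infection per person can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout `(person_id, test_date, is_positive)` and the function name `first_infections` are assumptions made for the example.

```python
from datetime import date


def first_infections(tests):
    """Return {person_id: date of the earliest positive test}."""
    first = {}
    for person_id, test_date, is_positive in tests:
        if not is_positive:
            continue  # negative tests never define an infection date
        if person_id not in first or test_date < first[person_id]:
            first[person_id] = test_date
    return first


# Hypothetical test records: (person_id, test_date, is_positive)
tests = [
    ("A", date(2020, 10, 3), False),
    ("A", date(2020, 11, 12), True),
    ("A", date(2021, 4, 2), True),   # later positive test, ignored
    ("B", date(2021, 1, 20), True),
]
first = first_infections(tests)
```

People without any positive test simply do not appear in the resulting mapping.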

Fig. 1

Number of SARS-CoV-2 infections. Infections are measured by the municipal health services using PCR-tests, and displayed for students attending primary (gray line) and secondary schools (black line), aggregated per week (using a rolling window) over the time studied. To preserve the privacy of those individuals and in line with CBS regulations, only weeks with at least 10 cases are shown. School closures are shown at the bottom of the figure. Time periods where secondary schools were closed are marked in red: June 1st–15th (2020) and December 15th (2020)–March 1st (2021). Time periods where secondary schools were open are marked in yellow (open with restrictions, March 1st–April 26th (2021)) and blue (open without restrictions). Primary schools were open from February 8th–March 1st (2021).

The second dataset is the Person Network dataset, which contains formal relationships (i.e., family, school and household relationships recorded officially by the government) between individuals, connecting the entire population of the Netherlands. These connections indicate “a highly increased probability that two individuals interact socially”19. Administrative networks offer a novel opportunity for researchers studying social processes, since they do not suffer from common drawbacks of studies based on surveys, digital trace data, or contact tracing, such as non-response bias, selection bias, or social desirability effects19. Furthermore, these data are readily available in the Netherlands as well as in many other countries, and could lower the burden of additional data collection efforts to inform policy decisions in a pandemic situation.

We constructed school networks using educational records from primary and secondary schools. Educational records connect students to their schools, year of education, and program tracks. We included only students who did not attend special schools (schools serving students with special needs, such as blind or deaf individuals), where infection dynamics are likely to be different. To compare infections arising from school, family and non-school interactions (Sections 2.2 and 2.3) we focus on students transitioning to secondary school in September 2021. For the multilevel regression analysis (Section 2.4) we focus on students registered in primary schools. This approach allows us to compare transmission dynamics across distinct social environments by examining both primary and transitioning secondary students.

To analyze the role of family ties in the transmission of SARS-CoV-2, we collected all family pairs in the following categories: Full-siblings, co-parents (two adults being the parents of the same child) and parent-child. We extracted family networks using data derived from parent-child records19. For example, siblings are recorded if they share at least a parent or if their parents are partners. Different types of siblings—such as half-siblings (who share one biological parent) and step-siblings (who have no biological parents in common and are related through their parents’ relationship)—may have very different levels of closeness. Some might grow up together, some might grow up in different houses (especially those who share the same father), while others might become siblings as adults and have less frequent contact. To better estimate the probability of temporally associated co-infections in siblings that are likely to keep regular contact, we focus only on full-siblings.

Finally, for the multilevel regression analyses we classified schools according to their denomination. The school denomination denotes the type of school and is correlated with attitudes towards COVID-19. In the Netherlands, parents have the right to choose schools that match their values. A majority of schools are Christian (either Protestant, Catholic, Evangelic or Reformist), while around one third are public schools10. Other denominations include, for example, Islamic and Anthroposophic schools. We also included in the regression analyses the school size, the median income of the school’s neighborhood (at the 4-digit postcode level, an administrative area equivalent to a neighborhood, with an average population size of 4,314 and a maximum population of 28,190; ref. 20), and the distance between the home addresses of each pair of students. For privacy reasons, the location of each house is only known at a resolution of 100 m × 100 m. We assigned all individuals to their last known address before 2021 and kept individuals who remained living in the Netherlands throughout 2021. We estimated the distance between two households as the Euclidean distance plus 52 meters, the average distance between two random points in a 100 m × 100 m square. This implies that students living in the same 100 m × 100 m cell are estimated to live at a distance of 52 meters. Students living in the same household (sharing the same house ID) were set at a distance of 0 meters.
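The 52-meter offset can be checked numerically. The sketch below compares a Monte Carlo estimate with the known closed-form mean distance between two uniform random points in a square, and then applies the offset as described above. The function names and the representation of a home as the `(x, y)` coordinates of its grid cell in meters are assumptions made for illustration.

```python
import math
import random

CELL_SIZE_M = 100.0  # resolution at which home locations are known


def mean_distance_in_square(side, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the mean distance between two random points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x1, y1 = rng.uniform(0, side), rng.uniform(0, side)
        x2, y2 = rng.uniform(0, side), rng.uniform(0, side)
        total += math.hypot(x2 - x1, y2 - y1)
    return total / n_samples


def analytic_mean_distance(side):
    """Closed form: side * (2 + sqrt(2) + 5 * ln(1 + sqrt(2))) / 15."""
    return side * (2 + math.sqrt(2) + 5 * math.asinh(1)) / 15


def estimated_distance(cell_a, cell_b, same_household=False, offset=52.0):
    """Euclidean distance between grid cells plus the within-cell offset."""
    if same_household:
        return 0.0
    return math.hypot(cell_a[0] - cell_b[0], cell_a[1] - cell_b[1]) + offset


mc_offset = mean_distance_in_square(CELL_SIZE_M)
exact_offset = analytic_mean_distance(CELL_SIZE_M)  # about 52.14 m
```

Two students in the same cell but different households get `estimated_distance((0, 0), (0, 0))`, which is 52 meters, in line with the rule stated above.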

Matching students in groups of increasing level of contact

We first analyzed the role of schools in the transmission of SARS-CoV-2 by focusing on students transitioning from primary school to secondary school in 2021 (illustrated in Fig. 2). Focusing on this transition allows us to distinguish whether infections mainly occur at school or through non-school interactions such as community transmission. We create four groups of student pairs representing increasing levels of contact:

Group 1 (Baseline): Pairs of students who did not attend the same primary or secondary school. Since we are interested in a comparison group of pairs of students living near each other, we oversampled pairs of students living within the same municipality.

Group 2 (Same background): Pairs of students who attended the same primary school (and will have a similar social background) but not the same secondary school.

Group 3 (Same school, different program track): Pairs of students who attended both primary and secondary schools together but were not in the same program track in secondary school, i.e., they attend different classrooms.

Group 4 (Same school, same program track): Pairs of students who attended both primary and secondary schools together and were in the same program track in secondary school.

Fig. 2

Illustration of the different types of student pairs, with increasing level of expected contact. Each student in the data is paired with different types of students. The example shows the process for one student named Alex. Group 1: Students who attend a different primary school from Alex and a different secondary school. Group 2: Students who attend the same primary school as Alex and a different secondary school. Group 3: Students who attend the same primary and secondary school as Alex, but are placed in a different program track in the secondary school. Group 4: Students who attend the same primary and secondary school as Alex, and are placed in the same program track in the secondary school.

Finally, we created three separate categories for twins, which we identify as pairs of students living in the same household and attending the same school year. Twins 2 (Same background): Pairs of twins who attended the same primary school but not the same secondary school. Twins 3 (Same school, different program track): Pairs of twins who attended both primary and secondary schools together but were not in the same program track in secondary school, i.e., they attend different classrooms. Twins 4 (Same school, same program track): Pairs of twins who attended both primary and secondary schools together and were in the same program track in secondary school. Unsurprisingly, we were unable to create a group Twins 1 (Baseline), since there were fewer than 10 twin pairs in the studied cohort who did not attend the same primary school.
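The group definitions above can be expressed as a small classification function. This is a sketch under stated assumptions: the `Student` record, its field names and the example track labels are hypothetical, and pairs that attended different primary schools but the same secondary school are returned as `None` because they are not part of any group described in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Student:
    student_id: str
    primary_school: str
    secondary_school: str
    track: str
    household: str


def classify_pair(a, b):
    """Return 'Group k' or 'Twins k' (k = 1..4), or None if no group applies."""
    same_primary = a.primary_school == b.primary_school
    same_secondary = a.secondary_school == b.secondary_school
    if not same_primary and not same_secondary:
        level = 1  # baseline
    elif same_primary and not same_secondary:
        level = 2  # same background
    elif same_primary and a.track != b.track:
        level = 3  # same school, different program track
    elif same_primary:
        level = 4  # same school, same program track
    else:
        return None  # different primary, same secondary: not described
    # All students belong to the same cohort, so a shared household
    # identifies the pair as twins.
    prefix = "Twins" if a.household == b.household else "Group"
    return f"{prefix} {level}"


alex = Student("alex", "P1", "S1", "havo", "H1")
```

The function can return the label `Twins 1`, but as noted above that category was too small to be reported.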

Probability of temporally associated infection

We calculated the probability of temporally associated infections within a 14-day period for each group as a function of the distance between the places of residence of two students or two family members. To preserve the individuals’ privacy, results can only be exported from the secure computer of CBS in groups of at least 10 individuals. Because of this, we calculated the number of student or family pairs and the number of temporally associated infected pairs for the following distance bins: 0m, 0-300m, 300-1000m, 1000-3000m, 3000-10,000m, 30,000+. Each bin excludes its left boundary (e.g., 0–300m includes distances greater than 0m and up to 300m), except for the 0m bin, which represents individuals living in the same household. The 14-day period was chosen based on the approximately 7-day incubation and generation periods for SARS-CoV-2 (ref. 21), but the results are robust to changes of this threshold.

The probability of temporally associated infections within a distance bin d is calculated as \(P_{\text {temp}}(d) = \frac{\sum _{i,j} I_{ij}(d)}{\sum _{i,j} N_{ij}(d)},\) where \(I_{ij}(d)\) equals one if both members of the student or family pair (ij) living within the distance bin d tested positive within a 14-day period (and zero otherwise), and \(N_{ij}(d)\) equals one if the pair (ij) lives within the distance bin d (and zero otherwise), so that \(\sum _{i,j} N_{ij}(d)\) is the total number of student or family pairs within the distance bin d.
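The calculation of \(P_{\text {temp}}(d)\) can be sketched in a few lines. The code below is illustrative only: the pair layout `(id_i, id_j, distance_m)` and all function names are assumptions, the 14-day window is implemented as an absolute difference of at most 14 days between first positive tests, and the bin edges include a 10,000–30,000m bin that is assumed here to make the bins contiguous.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date

# Right-closed bins; the 10,000-30,000m bin is an assumption of this sketch.
EDGES_M = [0, 300, 1000, 3000, 10_000, 30_000]
LABELS = ["0m", "0-300m", "300-1000m", "1000-3000m",
          "3000-10,000m", "10,000-30,000m", "30,000m+"]


def distance_bin(distance_m):
    """Map a distance to its bin; each bin excludes its left boundary."""
    return LABELS[bisect_left(EDGES_M, distance_m)]


def temporally_associated(date_i, date_j, window_days=14):
    """True if both people tested positive within the window of each other."""
    if date_i is None or date_j is None:
        return False
    return abs((date_i - date_j).days) <= window_days


def p_temp(pairs, first_infection):
    """Return {bin label: share of pairs with temporally associated infections}."""
    n_pairs = defaultdict(int)
    n_infected = defaultdict(int)
    for i, j, distance_m in pairs:
        label = distance_bin(distance_m)
        n_pairs[label] += 1
        if temporally_associated(first_infection.get(i), first_infection.get(j)):
            n_infected[label] += 1
    return {label: n_infected[label] / n_pairs[label] for label in n_pairs}


# Hypothetical toy data
first_infection = {"a": date(2021, 3, 1), "b": date(2021, 3, 10),
                   "c": date(2021, 3, 20)}
pairs = [("a", "b", 0), ("a", "c", 120), ("b", "c", 250), ("c", "d", 5000)]
result = p_temp(pairs, first_infection)
```

In the real setting, bins with fewer than 10 individuals could not be exported, so a production version would also need to suppress small cells.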

The number of student and family pairs (\(\sum _{i,j} N_{ij}(d)\)) and the number of temporally associated infections (\(\sum _{i,j} I_{ij}(d)\)) are given in Tables 1 and 2 for school and family ties, respectively.

Table 1 Number of student pairs (N) and student pairs co-infected within 14 days of each other (N_inf) as a function of distance and background. G1: Pairs of students who did not attend the same primary (in 2020) or secondary schools (2021). G2–4: Attended the same primary school in 2020 and (G2) did not attend the same secondary school in 2021; (G3) attended the same secondary school but a different program track; (G4) attended the same secondary school and the same program track. Note that results can only be exported from the secure computer of CBS in groups of at least 10 individuals.
Table 2 Number of family pairs (N) and family pairs co-infected within 14 days of each other (N_inf) as a function of distance and type of family tie.

Statistical analysis of school and municipality heterogeneity

In our second analysis we focused on the factors driving the transmission dynamics between students. We conducted a multilevel regression analysis, where we modeled temporally associated infections between student pairs using logistic regressions, with parameters estimated via maximum likelihood estimation.

Multilevel models can accurately estimate regression parameters in situations where the data are hierarchically structured and thus violate the assumption of independence of observations. Furthermore, they allow variance to be attributed to the respective levels in the data structure.

We constructed models accounting for a three-level structure: individuals, schools, and municipalities (Dutch: gemeenten). This enabled us to identify the extent to which the school context contributes to temporally associated infection events, separating it from influences at the individual and municipal level.

We added several explanatory variables at each level to explain variability in temporally associated infection probabilities. At the individual level, we added the distance between student pairs as a predictor of an associated infection. Due to the skewed distribution of the variable, and to aid convergence in parameter estimation, we took its natural logarithm, centered, and z-scaled it (i.e., subtracted the mean and divided by the standard deviation to normalize the values). The school-level predictors were the number of students (indicating the size of the school), the median income of the school’s 4-digit postcode area, and the school’s denomination (if any). School size and income were centered and z-scaled for the same reasons as the distance variable.
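The transformation of the distance variable can be illustrated with a short sketch. The function name is hypothetical, and two details are assumptions: the sample standard deviation is used for scaling, and the input is required to be strictly positive, since the text does not say how pairs at a distance of 0 meters were treated before taking the logarithm.

```python
import math
import statistics


def log_z_scale(values):
    """Natural log, then center and scale to unit standard deviation."""
    if any(v <= 0 for v in values):
        # Handling of 0 m pairs is not specified in the text.
        raise ValueError("log transform requires strictly positive values")
    logged = [math.log(v) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (assumption)
    return [(x - mean) / sd for x in logged]


# Hypothetical distances in meters between four student pairs
z = log_z_scale([52.0, 300.0, 1000.0, 5000.0])
```

After this transformation, a regression coefficient for distance refers to a change of one standard deviation in log distance.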

A first model served as the baseline, decomposing the variance at the different levels by including random intercepts for schools and municipalities, but not including any predictors:

$$\begin{aligned} y_{ism} = \gamma _{000} + v_{0m} + u_{0sm} + e_{ism}, \end{aligned}$$
(1)

where i denotes an individual, s a school, and m a municipality. In the equation, \(\gamma _{000}\) is the overall intercept and \(v_{0m}, u_{0sm}, e_{ism}\) represent the error terms at the municipality, school, and individual level, respectively.

We then estimated a second model including the random intercepts introduced above as well as the predictors at the individual and school level:

$$\begin{aligned} y_{ism} = \gamma _{000} + \gamma _{p00} X_{pism} + \gamma _{0q0} Z_{qsm} + v_{0m} + u_{0sm} + e_{ism}, \end{aligned}$$
(2)

where \(X_{pism}\) corresponds to the (\(p=1\)) predictor at the individual level: the logged distance between students. \(Z_{qsm}\) represents the (\(q=3\)) predictors at the school level: school size, median income, and school denomination.

Finally, we included random slopes (\(u_{psm}\)) at the school level for the distance between student pairs (\(X_{pism}\)):

$$\begin{aligned} y_{ism} = \gamma _{000} + \gamma _{p00} X_{pism} + \gamma _{0q0} Z_{qsm} + u_{psm} X_{pism} + v_{0m} + u_{0sm} + e_{ism}. \end{aligned}$$
(3)

Significance of predictors was assessed using Wald tests at \(\alpha =0.05\). Significant differences between the models were determined by likelihood-ratio tests. Explained variance at different levels of the models was calculated according to the method proposed by McKelvey & Zavoina22, which relates the systematic variance of the model introduced by the predictor variables to the residual variance at all levels. Model coefficients and variances were furthermore rescaled following Hox et al.23 (p. 125) to enable comparison of explained variance across models (see also Hox et al.23, pp. 121–125, for more clarification on variance calculation and rescaling procedures in multilevel models for dichotomous outcomes).

Instead of running the three models for all the different groups of student pairs introduced in Section 2.2, we restricted this part of the study to students who attended primary education together (in the same class year) in 2021. Furthermore, the analyses were based on a 5-percent sample of all schools in the data, with an inclusion probability proportional to school size. This was done in order to make the models computationally feasible, while still including a substantial number of different schools from various areas. The sampling yielded a dataset of 2,509,927 observations representing student pairs, grouped in 312 schools and 174 municipalities. While there is a large class imbalance, with 0.1% of the student pairs temporally co-infected, we follow the advice of recent research and do not correct for it24. Coefficient estimates are stable under high class imbalance as long as there are enough observations in the minority class, and corrections tend to miscalibrate the models24.
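Sampling schools with inclusion probability proportional to size can be done in several ways, and the text does not specify which scheme was used. The sketch below uses systematic probability-proportional-to-size sampling as one common option. The function name, the school identifiers and the sizes are hypothetical.

```python
import random


def pps_systematic_sample(sizes, fraction, seed=0):
    """Systematic PPS sample of school ids; sizes maps id -> student count."""
    n = max(1, round(fraction * len(sizes)))
    total = sum(sizes.values())
    step = total / n
    start = random.Random(seed).uniform(0, step)
    points = [start + k * step for k in range(n)]

    sample, cumulative, k = [], 0.0, 0
    for school_id, size in sizes.items():
        cumulative += size
        # Select the school whose cumulative-size interval contains a point.
        while k < n and points[k] < cumulative:
            if not sample or sample[-1] != school_id:
                sample.append(school_id)
            k += 1
    return sample


# Hypothetical population of 100 schools with 100 to 1,090 students
sizes = {f"school_{i}": 100 + 10 * i for i in range(100)}
sample = pps_systematic_sample(sizes, fraction=0.05)
```

Because larger schools occupy a longer stretch of the cumulative size scale, they are more likely to contain one of the equally spaced selection points.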

The statistical models were run using the lme4 library in R. The Python and R code documenting the performed steps of all data processing and analysis procedures is available at https://github.com/jgarciab/covid_schools/.

Results and discussion

Shared school and classroom environments

We first analyzed the role of schools in the transmission of SARS-CoV-2 by focusing on students transitioning from primary school to secondary school (see Methods Section 2.2), which allowed us to better separate school from non-school social interactions such as community transmission.

We found that the probability of temporally associated infection was 0.52% (95% confidence interval (CI): 0.49–0.56%) for the baseline group, 1.11% (CI: 1.01–1.20%) for students in group 3, and 1.65% (CI: 1.52–1.77%) for students in group 4. Compared with the baseline group (group 1), attending the same primary school and the same program track in secondary school (group 4) increased the probability of associated infections significantly by 213% (CI: 183–247%, see Fig. 3A). Attending secondary school in a different program track (group 3)—thus not sharing a classroom so frequently—increased the probability of associated infections significantly by 111% (CI: 89–135%, see Fig. 3A).

Fig. 3

Attending the same school increases the probability of temporally associated infection. (A) Increase in the probability of temporally associated infection, compared to the baseline (G1), for student pairs in the same program track (G4), same school but different program track (G3), and different school but same background (G2). Error bars indicate 95% confidence intervals. (B) Probability of temporally associated infection for the four different groups of student pairs, as a function of the distance between the students’ homes. Note the logarithmic horizontal axis and that the \(<0.3\,\text{km}\) distance bin excludes individuals living in the same household. Groups have slightly different horizontal offsets to avoid overlapping error bars.

Students who attended primary school together but not secondary school together (group 2) had only a slightly higher probability of temporally associated infections (CI: 0.62–0.69%) compared to the baseline (0.52%). This small difference indicates that social ties inherited from primary school had little impact on this probability. Moreover, for all groups of students, the distance between students’ houses had only a minor effect on the probability of associated infection (Fig. 3B).

Shared household and family contexts

After assessing the increase in the probability of temporally associated infections for students attending the same school, we examined how this probability compares to the probability for individuals of the same family.

Fig. 4

Comparison of the probability of temporally associated infections in school and family networks. (A) Increase in temporally associated infection rate, compared to the baseline (Twins2: siblings attending a different school but having attended the same primary school), for twins in the same program track (Twins4), and same school but different program track (Twins3). Error bars indicate 95% confidence intervals. (B) Probability of temporally associated infection as a function of the distance between the individuals’ homes for twin pairs (purple), sibling pairs (magenta), parent-child pairs (green) and co-parents (orange). Groups have slightly different horizontal offsets to avoid overlapping error bars. Note the logarithmic horizontal axis. The grayed area corresponds to pairs living in the same household and a break in the logarithmic axis.

For twins living in the same household, attending the same school modestly increased the probability of temporally associated infection: 33% (28 out of 84 pairs) for twins attending different secondary schools, compared to 39% (24 out of 62 pairs) for twins attending the same school and a different track, and 50% (54 out of 107 pairs, CI: 41–60%) for twins attending the same school and program track (Fig. 4A). Living in the same household results in a large increase in the probability of temporally associated infection (Fig. 4B). The estimated probabilities ranged from 23% for sibling pairs and parent-child pairs to 50% (CI: 41–60%) for twins attending the same program track in secondary school. This increased risk is presumably due to prolonged exposure if there is an active case. The probabilities for family pairs are much larger than the estimated 1.6% for students in group 4 (attending the same year group in primary and secondary school). This finding aligns with current scientific knowledge highlighting the key role of household transmission in the spread of SARS-CoV-2 (see, for example, refs. 8,9,11,13).
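A confidence interval for a proportion based on pair counts can be sketched as follows. The text does not state which interval method was used, so the Wilson score interval here is an assumption. For 54 co-infected pairs out of 107 it gives roughly 41–60%, which agrees with the interval quoted for twins in the same program track.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Twins attending the same school and program track: 54 of 107 pairs
low, high = wilson_interval(54, 107)
```

The Wilson interval stays within 0 and 1 even for small counts, which makes it a reasonable default for the small twin groups.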

Among individuals who do not share a household, family relationships are highly predictive of the probability of temporally associated infections (Fig. 4B). The probability of temporally associated infections for parent-child, co-parents and siblings living in different but nearby households is 7–12%. This probability decreases with distance (Fig. 4B), as social interactions between family members are more likely when they live close together.

School and municipality heterogeneity in the probability of temporally associated infections

We finally investigated the determinants of temporally associated infections through a series of multilevel regression models (Table 3), which explicitly attribute variance to different levels of observational units (see Section 2.4). The results of Model 1 (including random intercepts for schools and municipalities but no predictors) serve as a baseline to assess the variance explained by the predictors. The majority of the total variance in this three-level structure manifested at the individual level, with \(\frac{3.29}{3.29+1.93+0.25} = 0.60\) (i.e., 60% of the variance is due to individual-level differences). The school-level variance made up a share of 0.35, which left 0.05 to the municipality level.
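The variance shares follow directly from the three variance components. In a logistic model the individual-level residual variance is fixed at \(\pi ^2/3 \approx 3.29\), the variance of the standard logistic distribution. The sketch below reproduces the arithmetic using the school and municipality variances quoted above; the function name is hypothetical.

```python
import math

# Variance of the standard logistic distribution, about 3.29
LOGISTIC_RESIDUAL_VAR = math.pi**2 / 3


def variance_shares(school_var, municipality_var,
                    residual_var=LOGISTIC_RESIDUAL_VAR):
    """Share of total variance at each level of a three-level logit model."""
    total = residual_var + school_var + municipality_var
    return {
        "individual": residual_var / total,
        "school": school_var / total,
        "municipality": municipality_var / total,
    }


shares = variance_shares(school_var=1.93, municipality_var=0.25)
```

The school and municipality shares correspond to intraclass correlations at those levels.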

Table 3 Results of logistic multi-level regression models. Statistically significant coefficients are marked as *** \(p<0.001\), ** \(p<0.01\), * \(p<0.05\). Standard errors are displayed next to the coefficients in parentheses. The school denominations were translated from Dutch. Codes of the original variable from the CBS dataset INSCHRWPOTAB are displayed in square brackets. The reference category of denomination is Specialized non-denominational education [ABZ] (e.g., Montessori). Numerical variables were centered and z-scaled.

We then included predictors indicating the residential distance of student pairs, the size of the school they attended, the median income of the postcode area of the school, and the school’s denomination (Model 2). The random effects remained the same as in Model 1—i.e., including random intercepts for schools and municipalities.

The distance between student pairs’ homes was found to significantly decrease the temporally associated infection probability. This is in line with the results of the preceding probability analysis, which identified student pairs living in the same household as facing the highest risk of temporally associated infection. We also found significant decreases in associated infection probabilities in Islamic and Protestant schools as compared to the reference category. While Islamic schools are only a small minority of all schools, Protestant schools are, together with public and Catholic schools, among the largest denominations. However, our analysis does not distinguish whether the results are driven by a lower spread of the virus or by a lower propensity to test. School size and median income in the postcode area did not show a significant association with temporally associated infections.

In total, the predictors were able to explain 3% of the overall variance. This can be calculated by the share of variance in the linear predictor compared to the total variance of Model 2, or comparing the total variances of Model 2 and Model 1. Looking at variance reduction at all levels individually, variance decreased by 3 percent at the individual level, 6.2 percent at the school level, and, most notably, by 48 percent at the municipality level. While the predictors explained a large share of variability at the municipality level, the variability was very low to begin with (5% of the total). A likelihood-ratio test confirmed a significant improvement in model fit (\(\chi ^2(13) = 65.24, p < 0.001\)) of Model 2 over Model 1.

Finally, in Model 3, we included random slopes of student-pair distance at the school level (i.e., the effect of distance was allowed to vary by school). A term for intercept-slope covariance was also included, modeling the strength of the distance effect depending on the average probability by school. These parameters substantially increased model fit, meaning that the association between the distance between the students’ residences and the temporally associated infection probability was indeed dependent on the specific school. This is also indicated by the significant result of the likelihood-ratio test comparing Model 3 to Model 2 (\(\chi ^2(2) = 200.16, p < 0.001\)).
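The p-value of this likelihood-ratio test can be verified by hand, because the chi-square survival function has a closed form for even degrees of freedom. The sketch below is a worked check, not the authors' code, and the function name is hypothetical. The two degrees of freedom correspond to the two added parameters, the random-slope variance and the intercept-slope covariance.

```python
import math


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df.

    Uses the identity sf(x) = exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form only valid for even, positive df")
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k)
                                 for k in range(df // 2))


# Model 3 vs Model 2: likelihood-ratio statistic of 200.16 on 2 df
p_value = chi2_sf_even_df(200.16, df=2)
```

For odd degrees of freedom, such as the 13 used when comparing Model 2 to Model 1, a general-purpose statistics library would be needed instead of this closed form.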

To conclude, the variance decomposition of the multi-level models showed that the vast majority of variability in the data results from differences at the individual level (60%) and the school level (35%). While we could find significant effects of student distance and school denomination on temporally associated infection probability, these effects could explain only 3 percent of the overall variance. Possible omitted factors driving differences in this probability could be families’ attitudes towards COVID-19, or prevention measures implemented at the school level.

Conclusion

In this paper we investigate the impact of schools and families on the temporal association of SARS-CoV-2 infections among students during the period from June 2020 to September 2021. This is made possible by integrating population-scale networks and PCR test result data using registry data from Statistics Netherlands.

Our results show that living together at home is the most significant factor correlated with two individuals testing positive within a 14-day period, underscoring the importance of household transmission in the spread of the virus. Both social ties inherited from primary school and geographical distance were found to have little effect on the probability of both students testing positive within a 14-day period. This suggests that either social ties with classmates in primary school are weakened after students move to secondary school, or that COVID restriction strategies targeting non-school social networks were highly effective. Future studies could estimate the effect of school and non-school restrictions on students’ contacts and SARS-CoV-2 transmission.

In contrast with the low impact of social ties inherited from primary school, shared school and classroom environments were found to significantly increase the likelihood that both students would test positive within a 14-day period. Although the likelihood of temporally associated infections in schools was low, it should be noted that even small increases in the transmission rate in schools can lead to larger outbreaks, since the transmission rate is linearly related to the reproduction number25 and a large proportion of children’s contacts are expected to occur in schools. The observed increase in temporally associated infections from 0.6 to 1.6% may lead to reproduction numbers above one infectee per infected when schools reopen26. These insights into the transmission dynamics of SARS-CoV-2 within Dutch families and educational institutions can inform future use of network models and provide insights for possible interventions, such as school closures.

The analysis presented in this paper has a limitation that opens up fruitful avenues for future research. Governments around the world introduced several interventions to reduce transmission, including (partially) closing schools and workplaces, and requiring masks in confined spaces. Due to the wide variety of measures at schools, workplaces, and public spaces27, we did not examine the role of school closures. Using data from Statistics Netherlands and a similar methodological approach, further research could explore the effectiveness of various interventions. Furthermore, a similar approach merging infection results data and population-scale data could be used to understand the effects of school and family connections for other diseases.