Abstract
Large-scale networks have been instrumental in shaping how we think about social systems, and have undergirded many foundational results in mathematical epidemiology, computational social science, and biology. However, many of the social systems through which diseases spread, information disseminates, and individuals interact are inherently mediated through groups, whose interactions are known as higher-order interactions. A gap exists between higher-order models of group formation and spreading processes and the data necessary to validate these mechanisms. Similarly, few datasets bridge the gap between pairwise and higher-order network data. Through its open API, the Bluesky social media platform is an ideal laboratory for observing social ties at scale. Not only does Bluesky contain pairwise following relationships, but it also contains higher-order social ties known as “starter packs”, which are user-curated lists designed to promote social network growth. We introduce “A Blue Start”, a large-scale network dataset comprising 39.7M user accounts, 2.4B pairwise following relationships, and 365.8K groups representing starter packs. This dataset will be an essential resource for the study of higher-order networks.
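As a rough illustration of the two layers described above, the following sketch stores follows as directed pairs and starter packs as sets of accounts (hyperedges), then measures how densely the members of each pack follow one another. The account IDs, pack names, and helper functions are hypothetical; nothing here reflects the dataset's actual file format or the authors' analysis code.

```python
from itertools import permutations

# Toy data with hypothetical account IDs (not drawn from the dataset).
# Pairwise layer: directed (follower, followed) pairs.
follows = {("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")}
# Higher-order layer: each starter pack is a set of member accounts.
starter_packs = {
    "pack1": {"a", "b", "c"},
    "pack2": {"c", "d"},
}


def pack_sizes(packs):
    """Return the number of member accounts in each group (hyperedge size)."""
    return {name: len(members) for name, members in packs.items()}


def internal_follow_density(members, edges):
    """Fraction of ordered member pairs (u, v) in which u follows v."""
    pairs = list(permutations(sorted(members), 2))
    if not pairs:
        return 0.0
    return sum(pair in edges for pair in pairs) / len(pairs)


sizes = pack_sizes(starter_packs)
densities = {
    name: internal_follow_density(members, follows)
    for name, members in starter_packs.items()
}
print(sizes)
print(densities)
```

Keeping the two layers in separate structures keyed by the same account IDs makes it straightforward to ask questions that need both, such as whether accounts grouped together in a pack also follow each other.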
Data availability
The dataset is hosted on the Social Media Archive (SOMAR) at the Inter-university Consortium for Political and Social Research (ICPSR) at https://doi.org/10.3886/ICPSR300499.
Code availability
The code used to analyze the starter pack and following networks is available on GitHub (https://github.com/nwlandry/a-blue-start) and is archived on Zenodo (Ref. 74).
References
Cattuto, C. et al. Dynamics of Person-to-Person Interactions from Distributed RFID Sensor Networks. PLOS ONE 5, e11596, https://doi.org/10.1371/journal.pone.0011596 (2010).
Ebel, H., Mielsch, L.-I. & Bornholdt, S. Scale-free topology of e-mail networks. Physical Review E 66, 035103, https://doi.org/10.1103/PhysRevE.66.035103 (2002).
Barabási, A.-L. et al. Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications 311, 590–614, https://doi.org/10.1016/S0378-4371(02)00736-7 (2002).
González, M. C., Hidalgo, C. A. & Barabási, A.-L. Understanding individual human mobility patterns. Nature 453, 779–782, https://doi.org/10.1038/nature06958 (2008).
Newman, M. E. J. Assortative Mixing in Networks. Physical Review Letters 89, 208701, https://doi.org/10.1103/PhysRevLett.89.208701 (2002).
Newman, M. E. J. Mixing patterns in networks. Physical Review E 67, 026126, https://doi.org/10.1103/PhysRevE.67.026126 (2003).
Vaquero, L. M. & Cebrian, M. The rich club phenomenon in the classroom. Scientific Reports 3, 1174, https://doi.org/10.1038/srep01174 (2013).
Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 7821–7826, https://doi.org/10.1073/pnas.122653799 (2002).
Morstatter, F., Pfeffer, J., Liu, H. & Carley, K. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. Proceedings of the International AAAI Conference on Web and Social Media 7, 400–408, https://doi.org/10.1609/icwsm.v7i1.14401 (2013).
Campan, A., Atnafu, T., Truta, T. M. & Nolan, J. Is Data Collection through Twitter Streaming API Useful for Academic Research? In 2018 IEEE International Conference on Big Data (Big Data), 3638–3643, https://doi.org/10.1109/BigData.2018.8621898 (2018).
Li, Q. et al. How Much Data Do You Need? Twitter Decahose Data Analysis. In The 9th International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (2016).
Mislove, A., Lehmann, S., Ahn, Y.-Y., Onnela, J.-P. & Rosenquist, J. Understanding the Demographics of Twitter Users. Proceedings of the International AAAI Conference on Web and Social Media 5, 554–557, https://doi.org/10.1609/icwsm.v5i1.14168 (2011).
Ginsberg, J. et al. Detecting influenza epidemics using search engine query data. Nature 457, 1012–1014, https://doi.org/10.1038/nature07634 (2009).
Kreiss, D. Seizing the moment: The presidential campaigns’ use of Twitter during the 2012 electoral cycle. New Media & Society 18, 1473–1490, https://doi.org/10.1177/1461444814562445 (2016).
Gleeson, J. P., O’Sullivan, K. P., Baños, R. A. & Moreno, Y. Effects of Network Structure, Competition and Memory Time on Social Spreading Phenomena. Physical Review X 6, 021019, https://doi.org/10.1103/PhysRevX.6.021019 (2016).
Gerbaudo, P. Tweets and the Streets: Social Media and Contemporary Activism. https://library.oapen.org/handle/20.500.12657/30772 (Pluto Press, 2012).
Jackson, S. J., Bailey, M. & Welles, B. F. #HashtagActivism: Networks of Race and Gender Justice (MIT Press, 2020).
Tufekci, Z. Twitter and Tear Gas: The Power and Fragility of Networked Protest (Yale University Press, 2017).
Li, M. Cross-Platform Social Media Usage and Behavior. Ph.D. thesis, University of Michigan http://deepblue.lib.umich.edu/handle/2027.42/195325 (2024).
Quelle, D., Denker, F., Garg, P. & Bovet, A. Why Academics Are Leaving Twitter for Bluesky http://arxiv.org/abs/2505.24801 (2025).
Seckin, O. C. et al. The Rise of Bluesky http://arxiv.org/abs/2504.12902 (2025).
Palmer, A. Twitter CEO Jack Dorsey has an idealistic vision for the future of social media and is funding a small team to chase it https://www.cnbc.com/2019/12/11/twitter-ceo-jack-dorsey-announces-bluesky-social-media-standards-push.html (2019).
McCue, M. How the Open Social Web Will Change Everything, with Bluesky’s Jay Graber https://dot-social.simplecast.com/episodes/jay-graber (2024).
Kleppmann, M. et al. Bluesky and the AT Protocol: Usable Decentralized Social Media. In Proceedings of the ACM Conext-2024 Workshop on the Decentralization of the Internet, DIN ’24, 1–7, https://doi.org/10.1145/3694809.3700740 (Association for Computing Machinery, New York, NY, USA, 2024).
Bluesky PBC. Federation Architecture https://docs.bsky.app/docs/advanced-guides/federation-architecture (2025).
Bluesky PBC. Repository https://atproto.com/specs/repository (2025).
Bluesky Team. Moderation in a Public Commons https://bsky.social/about/blog/6-23-2023-moderation-proposals (2023).
Jeong, U., Jiang, B., Tan, Z., Bernard, H. R. & Liu, H. Descriptor: A Temporal Multi-network Dataset of Social Interactions in Bluesky Social (BlueTempNet). IEEE Data Descriptions 1, 71–79, https://doi.org/10.1109/IEEEDATA.2024.3474640 (2024).
Failla, A. & Rossetti, G. “I’m in the Bluesky Tonight”: Insights from a year worth of social data. PLOS ONE 19, e0310330, https://doi.org/10.1371/journal.pone.0310330 (2024).
Balduf, L. et al. Looking AT the Blue Skies of Bluesky. In Proceedings of the 2024 ACM on Internet Measurement Conference, IMC ’24, 76–91, https://doi.org/10.1145/3694809.3700740 (Association for Computing Machinery, New York, NY, USA, 2024).
Balduf, L. et al. Bootstrapping Social Networks: Lessons from Bluesky Starter Packs. Proceedings of the International AAAI Conference on Web and Social Media 19, 178–192 (2025).
Bond, R. M. et al. A 61-million-person experiment in social influence and political mobilization. Nature 489, 295–298, https://doi.org/10.1038/nature11421 (2012).
Zhang, K., Yu, Q., Lei, K. & Xu, K. Characterizing Tweeting Behaviors of Sina Weibo Users via Public Data Streaming. In Li, F., Li, G., Hwang, S.-w., Yao, B. & Zhang, Z. (eds.) Web-Age Information Management, 294–297, https://doi.org/10.1007/978-3-319-08010-9_32 (Springer International Publishing, Cham, 2014).
Takac, L. & Zabovsky, M. Data analysis in public social networks. In International Scientific Conference and International Workshop Present Day Trends of Innovations, 1 (2012).
Backstrom, L., Huttenlocher, D., Kleinberg, J. & Lan, X. Group formation in large social networks: Membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, 44–54, https://doi.org/10.1145/1150402.1150412 (Association for Computing Machinery, New York, NY, USA, 2006).
Mahoney, M. W., Dasgupta, A., Leskovec, J. & Lang, K. J. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6, 1474, https://doi.org/10.1080/15427951.2009.10129177 (2009).
Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P. & Bhattacharjee, B. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC ’07, 29–42, https://doi.org/10.1145/1298306.1298311 (Association for Computing Machinery, New York, NY, USA, 2007).
Rossi, R. & Ahmed, N. The Network Data Repository with Interactive Graph Analytics and Visualization. Proceedings of the AAAI Conference on Artificial Intelligence 29, https://doi.org/10.1609/aaai.v29i1.9277 (2015).
Kunegis, J. KONECT: The Koblenz network collection. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion, 1343–1350, https://doi.org/10.1145/2487788.2488173 (Association for Computing Machinery, New York, NY, USA, 2013).
Leskovec, J. & Sosič, R. SNAP: A General-Purpose Network Analysis and Graph-Mining Library. ACM Trans. Intell. Syst. Technol. 8, 1:1–1:20, https://doi.org/10.1145/2898361 (2016).
Fu, X., Yu, S. & Benson, A. R. Modelling and analysis of tagging networks in Stack Exchange communities. Journal of Complex Networks 8, cnz045, https://doi.org/10.1093/comnet/cnz045 (2021).
Landry, N. W., Amburg, I., Shi, M. & Aksoy, S. G. Filtering higher-order datasets. Journal of Physics: Complexity 5, 015006, https://doi.org/10.1088/2632-072X/ad253a (2024).
Landry, N. W., Young, J.-G. & Eikmeier, N. The simpliciality of higher-order networks. EPJ Data Science 13, 1–20, https://doi.org/10.1140/epjds/s13688-024-00458-1 (2024).
Chodrow, P. S. Configuration models of random hypergraphs. Journal of Complex Networks 8, https://doi.org/10.1093/comnet/cnaa018 (2020).
Guimerà, R., Uzzi, B., Spiro, J. & Amaral, L. A. N. Team Assembly Mechanisms Determine Collaboration Network Structure and Team Performance. Science 308, 697–702, https://doi.org/10.1126/science.1106340 (2005).
Wuchty, S., Jones, B. F. & Uzzi, B. The Increasing Dominance of Teams in Production of Knowledge. Science 316, 1036–1039, https://doi.org/10.1126/science.1136099 (2007).
Wu, L., Wang, D. & Evans, J. A. Large teams develop and small teams disrupt science and technology. Nature https://doi.org/10.1038/s41586-019-0941-9 (2019).
Shi, F. & Evans, J. Surprising combinations of research contents and contexts are related to impact and emerge with scientific outsiders from distant disciplines. Nature Communications 14, 1641, https://doi.org/10.1038/s41467-023-36741-4 (2023).
Chowdhary, S., Gallo, L., Musciotto, F. & Battiston, F. Team careers in science: Formation, composition and success of persistent collaborations http://arxiv.org/abs/2407.09326 (2024).
Newman, M. E. J. Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences 101, 5200–5205, https://doi.org/10.1073/pnas.0307545100 (2004).
Iacopini, I., Karsai, M. & Barrat, A. The temporal dynamics of group interactions in higher-order social networks. Nature Communications 15, 7391, https://doi.org/10.1038/s41467-024-50918-5 (2024).
Gallo, L., Zappalà, C., Karimi, F. & Battiston, F. Higher-order modeling of face-to-face interactions http://arxiv.org/abs/2406.05026 (2024).
Benson, A. R., Abebe, R., Schaub, M. T., Jadbabaie, A. & Kleinberg, J. Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences 115, E11221–E11230, https://doi.org/10.1073/pnas.1800683115 (2018).
Stehlé, J. et al. High-Resolution Measurements of Face-to-Face Contact Patterns in a Primary School. PLOS ONE 6, e23176, https://doi.org/10.1371/journal.pone.0023176 (2011).
Mastrandrea, R., Fournet, J. & Barrat, A. Contact Patterns in a High School: A Comparison between Data Collected Using Wearable Sensors, Contact Diaries and Friendship Surveys. PLOS ONE 10, e0136497, https://doi.org/10.1371/journal.pone.0136497 (2015).
Vanhems, P. et al. Estimating Potential Infection Transmission Routes in Hospital Wards Using Wearable Proximity Sensors. PLOS ONE 8, e73970, https://doi.org/10.1371/journal.pone.0073970 (2013).
Young, J.-G., Petri, G. & Peixoto, T. P. Hypergraph reconstruction from network data. Communications Physics 4, 1–11, https://doi.org/10.1038/s42005-021-00637-w (2021).
Lizotte, S., Young, J.-G. & Allard, A. Hypergraph reconstruction from uncertain pairwise observations. Scientific Reports 13, 21364, https://doi.org/10.1038/s41598-023-48081-w (2023).
Landry, N. W. et al. XGI: A Python package for higher-order interaction networks. Journal of Open Source Software 8, 5162, https://doi.org/10.21105/joss.05162 (2023).
Coll, M. et al. HIF: The hypergraph interchange format for higher-order networks http://arxiv.org/abs/2507.11520 (2025).
Smith, A., Landry, N., Kumar, S., Foucault Welles, B. & Amburg, I. A Blue Start: A large-scale pairwise and higher-order social network dataset https://doi.org/10.3886/ICPSR300499 (2026).
LaRock, T. & Lambiotte, R. Encapsulation structure and dynamics in hypergraphs. Journal of Physics: Complexity 4, 045007, https://doi.org/10.1088/2632-072X/ad0b39 (2023).
Aksoy, S. G., Joslyn, C., Ortiz Marrero, C., Praggastis, B. & Purvine, E. Hypernetwork science via high-order hypergraph walks. EPJ Data Science 9, 1–34, https://doi.org/10.1140/epjds/s13688-020-00231-0 (2020).
Liu, X. T. et al. High-order Line Graphs of Non-uniform Hypergraphs: Algorithms, Applications, and Experimental Analysis. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 784–794, https://doi.org/10.1109/IPDPS53621.2022.00081 (IEEE Computer Society, 2022).
Amburg, I., Kleinberg, J. & Benson, A. R. Planted hitting set recovery in hypergraphs. Journal of Physics: Complexity 2, 035004, https://doi.org/10.1088/2632-072X/abdb7d (2021).
Yoon, S.-e., Song, H., Shin, K. & Yi, Y. How Much and When Do We Need Higher-order Information in Hypergraphs? A Case Study on Hyperedge Prediction. In Proceedings of The Web Conference 2020, WWW ’20, 2627–2633, https://doi.org/10.1145/3366423.3380016 (Association for Computing Machinery, New York, NY, USA, 2020).
Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports 9, 5233, https://doi.org/10.1038/s41598-019-41695-z (2019).
Oremus, W. Bluesky, a trendy rival to X, finally opens to the public. The Washington Post https://www.washingtonpost.com/technology/2024/02/06/bluesky-launch-public-jay-graber/ (2024).
Weatherbed, J. Elon Musk and Brazil are beefing over X https://www.theverge.com/2024/4/8/24124156/brazil-investigates-elon-musk-x-twitter-obstruction-of-justice (2024).
Goodman, L. A. Snowball Sampling. The Annals of Mathematical Statistics 32, 148–170, https://www.jstor.org/stable/2237615 (1961).
McKinney, W. Data Structures for Statistical Computing in Python. scipy https://doi.org/10.25080/Majora-92bf1922-00a (2010).
pandas development team. Pandas-dev/pandas: Pandas. Zenodo https://doi.org/10.5281/zenodo.3509134 (2020).
Polars Development Team. Polars: Blazingly fast DataFrame library (2025).
Smith, A., Amburg, I. & Landry, N. nwlandry/a-blue-start https://doi.org/10.5281/zenodo.18436227 (2026).
Acknowledgements
N.W.L. acknowledges support from the University of Virginia Prominence-to-Preeminence (P2PE) STEM Targeted Initiatives Fund, SIF176A Contagion Science. A.H.S. acknowledges support from the National Science Foundation Graduate Research Fellowship Program under Grant No. 1938052. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. Pacific Northwest National Laboratory is operated by Battelle Memorial Institute under Contract DE-ACO6-76RL01830. PNNL Information Release Number: PNNL-SA-211224. The authors would like to thank Tommaso Bertola and Manlio De Domenico for pointing out issues with a previous version of the dataset and being courageous early adopters.
Author information
Contributions
A.H.S.: project conception, data acquisition, data processing, data analysis, writing. I.A.: data analysis, writing. S.K.: writing. B.F.W.: project conception, writing. N.W.L.: project conception, data acquisition, data processing, data analysis, writing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Smith, A.H., Amburg, I., Kumar, S. et al. A Blue Start: A large-scale pairwise and higher-order social network dataset. Sci Data (2026). https://doi.org/10.1038/s41597-026-06920-1
DOI: https://doi.org/10.1038/s41597-026-06920-1


