Abstract
This study presents a new artificial intelligence (AI) literacy scale for comprehensive assessment of the concept across adult populations, regardless of the setting in which it is applied: the SAIL4ALL. The scale contains 56 items distributed across four different themes [(1) What is AI? (a: Recognizing AI, Understanding Intelligence and Interdisciplinarity; b: General vs. Narrow AI); (2) What can AI do?; (3) How does AI work?; and (4) How should AI be used?] and has two different response formats (true/false and 5-point Likert scale), each of which can be applied depending on the context. The study provides quantitative evidence of psychometric quality in three different UK samples. It also presents evidence of internal structure validity through confirmatory factor analysis and adequate internal consistency for most of the scales and formats. Moreover, it shows measurement invariance tested for gender and education level. Finally, the study also assesses the relationship of AI literacy with external measures, examining the nomological network. SAIL4ALL demonstrates positive evidence of psychometric quality, and serves as a valuable tool for determining both actual and perceived knowledge of AI, thus guiding educational, organizational, and institutional AI literacy initiatives.
Introduction
Artificial intelligence (AI) is reshaping many aspects of human life, including education and work, affecting industries such as engineering, agriculture, politics, and the media (Gil de Zúñiga et al., 2024; Ng et al., 2023). Recent advancements in AI, especially in the area of generative AI and foundation models (Feuerriegel et al., 2024), are reigniting debates on artificial general intelligence and attracting interest from scientists across fields (Xu et al., 2024). At an individual level, the increasing presence of AI-powered applications, such as generative AI chatbots (e.g., ChatGPT), means that users increasingly transition from being AI novices to natural users (Wang et al., 2023). In computing, AI is often defined as a set of algorithms that emulate human intelligence (Sarker, 2022). In the social sciences and humanities, it is broadly seen as a framework for autonomous systems designed to replicate human judgment in areas like perception and decision-making (Appiahene et al., 2022). Rai et al. (2019) define AI as the capacity of machines to perform cognitive functions such as perception, reasoning, and learning.
Given the rapid evolution of AI, a challenge faced in many societies is an insufficient understanding of its conceptual and technological underpinnings (Almatrafi et al., 2024). A deficit in the essential literacy needed to effectively engage with and reflect on the technology can impede responsible uses of AI (Long & Magerko, 2020), fostering misunderstandings or improper applications (Heyder and Posegga, 2021) and hindering the implementation of efficient education programs. For instance, in the context of embodied AI systems such as social robots, research has shown that many people lack realistic understandings of both the capabilities and limitations of such technologies, leading to overtrust (Aroyo et al., 2021; Booth et al., 2017; Robinette et al., 2016), which can have serious consequences. Moreover, unrealistic fear or skepticism towards AI and related technologies, so-called ‘algorithm aversion’ (Dietvorst et al., 2015), can prevent people from leveraging the full potential of AI. The AI literacy required to realistically evaluate such technologies is hence an increasingly important asset.
AI literacy has attracted considerable research attention (e.g., Almatrafi et al., 2024; Laupichler, Aster and Raupach, 2023; Laupichler, Aster, Haverkamp et al., 2023; Long and Magerko, 2020; Pinski and Benlian, 2023, 2024; Wang et al., 2023; Weber et al., 2023) even though it is not a new phenomenon (Stolpe and Hallström, 2024). Nowadays AI literacy is widely discussed in policy circles as well. For example, article 4 of the European AI Act requires that “[p]roviders and deployers of AI systems shall take measures to ensure, to their best extent, a sufficient level of AI literacy of their staff and other persons dealing with the operation and use of AI systems on their behalf.”
However, psychometrically robust and comprehensively evaluated instruments to assess AI literacy are scarce and pose several limitations. One reason for this might be the confusion as to what AI literacy entails. Another might be the relative recency of the topic. Moreover, with the exception of Weber et al. (2023), existing measures of AI literacy are not psychometrically sound and rely on self-reported knowledge rather than factual knowledge. Existing scale development attempts also frequently lack comprehensive descriptions of the tools used, thus hindering replication by other researchers (Laupichler, Aster and Raupach, 2023).
According to Wang et al. (2023), AI literacy research is essential for several key reasons. First, it sheds light on human-AI interaction research by influencing individuals’ conceptions of AI products, which is crucial for understanding interaction dynamics. Secondly, it offers a more accurate definition of user competence in AI use. Finally, AI literacy serves as a foundation for improving AI education by providing a comprehensive framework for curriculum design. Consequently, a standardized framework is essential to delineate foundational literacy among individuals, organizations, and systems (Schüller, 2022, p. 478). Such a framework facilitates the precise targeting of policies that foster AI literacy, establishing a benchmark to evaluate their effectiveness.
Given that such instruments are only in their infancy, a robust AI literacy scale needs to have sound psychometric properties and be able to reliably measure the general adult population’s understanding of AI (Wang et al., 2023; Weber et al., 2023). It should also be applicable across different research domains (Laupichler, Aster and Raupach, 2023).
In this article, we document the development of an AI literacy scale for non-expert adults. In contrast to most existing scales (e.g., Carolus et al., 2023; Laupichler, Aster and Raupach, 2023; Pinski and Benlian, 2023; Wang et al., 2023; see Lintner, 2024 for an overview of 16 AI literacy scales, of which only three – including ours – are performance-based, while the remaining 13 rely on self-reported responses), our approach probes for factual knowledge and is intended to be generally applicable to anyone regardless of the setting in which it is applied (e.g., educational, professional). Unlike earlier attempts to measure AI literacy, we focus on knowledge rather than skills, experiences or attitudes, and also avoid specific subject areas (e.g., Pinski and Benlian, 2023). This relatively narrower understanding of AI literacy yields more accurate measurements, thus supporting follow-up research with greater methodological clarity. We base our scales on the theoretical framework proposed by Long and Magerko (2020), the most widely accepted and comprehensive contribution in the field of AI literacy, which itself relies on a review of previous research (Laupichler, Aster and Raupach, 2023). Our research strictly adheres to the recommendations of the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014).
Developing a solid and robust psychometric scale to measure AI literacy across diverse populations is essential from both theoretical and practical perspectives. As AI becomes increasingly integrated into daily life, individuals’ ability to understand, interact with, and critically evaluate these technologies becomes fundamental. Such tools not only provide precise assessments of current competencies but also guide the development of educational interventions and policies that promote equitable and effective integration of AI into society.
From a theoretical standpoint, creating such a scale is justified by the need to conceptualize and standardize AI literacy as a multidimensional construct encompassing both technical knowledge and ethical competencies (Long and Magerko, 2020; Markus et al., 2025). As highlighted in this article, existing measurement instruments incorporate various dimensions of the construct. However, our newly developed scale allows for progress not only in the comprehensive evaluation of the knowledge necessary for effective interaction with AI systems but also in its definition, characterization, and measurement with robustness and confidence. Because the scale can be used with diverse population samples, it also enables measures and observations to be unified and objectively compared, avoiding the fragmentation inherent in instruments designed for specific professional or educational environments or for evaluating particular interventions, which limits the understanding and development of AI literacy.
Practically, a well-designed psychometric scale facilitates the identification of gaps in AI literacy, enabling the design of more effective educational and training interventions. Moreover, as an objective and validated instrument, it can serve as a significant predictor of both the level of knowledge at a given time and the performance of students and professionals in AI-supported tasks, surpassing measures that lack psychometric guarantees. Furthermore, implementing such a scale will allow for an objective evaluation of AI knowledge, considering specific central sub-competencies. We emphasize that, unlike previous attempts focusing on skills, experiences, or attitudes, this proposal emphasizes factual knowledge about AI, avoiding specific subject areas and providing methodological clarity.
Artificial intelligence literacy
According to UNESCO (n.d.), literacy involves recognizing, understanding, and using different materials or resources across various contexts, enabling individuals to achieve their goals and participate in society. Literacy is a concept that extends to various domains, including finance, health, and science (Carolus et al., 2023), while it also has a variety of subtypes centered on information technology, such as digital literacy, media literacy, information literacy, technology literacy, information technology literacy, social media literacy and digital interaction literacy (Carolus et al., 2023).
In this paper, we will focus on AI literacy, for which a universally accepted definition has yet to be developed (Laupichler, Aster and Raupach, 2023; Schüller, 2022). Existing definitions vary, among others, in terms of the precise quantity and arrangement of competencies (Carolus et al., 2023). However, many scholars (e.g., Laupichler, Aster and Raupach, 2023) recognize the foundational contribution by Long and Magerko (2020), who consolidated existing research on AI literacy and developed a comprehensive competency-based approach to the topic. These authors define AI literacy as “a set of competencies that enables individuals to evaluate AI technologies critically; communicate and collaborate effectively with AI; and use AI as a tool online, at home, and in the workplace” (p. 2). Following Chiu et al. (2024), literacy is about knowing, while competency is about applying the knowledge in a beneficial way. However, the two are strongly connected and not always easy to separate in practice. In turn, Wang et al. (2023, p. 1324) define AI literacy as “the ability to properly identify, use, and evaluate AI-related products under the premise of ethical standards.” Likewise, Ng et al. (2021), in an extensive literature review, synthesize the existing definitions and propose that AI literacy encompasses four aspects: (1) knowing and understanding AI, (2) applying AI, (3) evaluating and creating AI, and (4) AI ethics (p. 505). Most recently, Weber et al. (2023, p. 5) summarized ten existing definitions of AI literacy, including the three presented above, as well as those by Cetindamar et al. (2022), Mikalef and Gupta (2021), Hermann (2022), Kandlhofer and Steinbauer (2018), Dai et al. (2020), Chiu et al. (2021) and Pinski and Benlian (2023). They conclude that AI literacy is “a set of socio-technical competencies of humans that shape relevant types of human-AI interaction” (Weber et al., 2023, p. 6).
We acknowledge Weber et al.’s (2023) overarching and synthesizing definition as helpful, in particular the necessity of a socio-technical perspective, where AI literacy combines social and technical elements. Their approach highlights the necessity of evaluating AI literacy through objective measures that encompass this dual focus, facilitating a more accurate assessment of individuals’ abilities to engage with AI systems in various contexts. However, we adopt Long and Magerko’s (2020) five main themes as a starting point for a more fine-grained conceptualization: (1) What is AI?; (2) What can AI do?; (3) How does AI work?; (4) How should AI be used?; and (5) How do people perceive AI? Each of these five areas or themes of AI literacy is formed by different competences. In our case, we focus on themes 1–4 because our goal is to create a scale to measure factual knowledge as opposed to people’s opinions or perceptions. Thus, our definition is as follows: AI literacy is a person’s factual knowledge of AI, including the competence to critically evaluate AI technologies, understand their mechanisms, and recognize their ethical implications. This includes a solid understanding of what AI is, its capabilities and limitations, basic knowledge about how AI systems operate, and familiarity with the ethical frameworks guiding their use.
At the time of writing, the most comprehensive scales are those by Wang et al. (2023), Laupichler, Aster and Raupach (2023), Carolus et al. (2023), Pinski and Benlian (2023), Weber et al. (2023), and Markus et al. (2025). Table A1 (see Supplementary Materials) provides an overview of these, including sample items. Except for Weber et al. (2023) and Markus et al. (2025), all these scales assess perceived knowledge and competence, rather than actual knowledge. Moreover, despite an emphatic call for researchers to use psychometrically sound questionnaires to measure AI literacy (Laupichler, Aster, Haverkamp et al., 2023), some of the scales lack a theoretical foundation (e.g., Carolus et al., 2023), suffer from incomplete or no empirical validation (e.g., Laupichler, Aster and Raupach, 2023; Laupichler, Aster, Haverkamp et al., 2023), are too short to confidently assess knowledge (e.g., Wang et al., 2023), are used to measure the results of education programs (e.g., De Souza, 2021; Yim and Su, 2024; Zhang et al., 2024), target different developmental stages of young people (e.g., Biagini et al., 2023; Chiu et al., 2021; Hwang et al., 2023; Kim and Lee, 2022; Ng et al., 2023; Su and Yang, 2024; Wang et al., 2023; Yim and Su, 2024; Xia et al., 2023), use small samples (Carolus et al., 2023; Weber et al., 2023), are dedicated to specific training programs or professions (Laupichler, et al., 2024; Sperling et al., 2024), or fail to measure the concept in a holistic way (Yuan, Tsai, and Chen, 2024).
As highlighted by Weber et al. (2023) and Lintner (2024), most existing scales do not assess the respondent’s actual knowledge of AI, and instead only inquire about their perceived knowledge (e.g., Laupichler et al., 2023) or self-reported skills (e.g., Wang et al., 2023). Some scales also include experience (e.g., Pinski and Benlian, 2023), which is another matter entirely. Indeed, Weber et al. (2023) argue that “subjective measurements may not serve as a good proxy for evaluating a person’s AI literacy” (p. 8). To address this issue, Weber et al. (2023) developed an objective (i.e., knowledge-based) AI literacy scale that differentiates between users and creators. The latter encompasses individuals who develop and program AI systems, while the former refers to those who only use them. However, the distinction is fuzzy and hard to uphold with concrete items: certain users might have a sophisticated technical understanding, while some creators could have a domain-specific and hence limited one. AI literacy primarily concerns the competencies of non-experts (Laupichler, Aster and Raupach, 2023) who have no formal AI training, who merely use AI applications, and who are not engaged in their development. Most adults who interact with advanced digital technologies can be categorized as non-experts. Hornberger et al. (2023) also introduce a knowledge-based and objective AI literacy assessment based on a 31-question test, whose items are roughly aligned with Long and Magerko’s (2020) 17 competencies. However, unlike our scale, their approach is geared towards university students rather than the general population. Zhang et al. (2024) similarly focused on students, though younger ones (middle school). Their scale (the AI Literacy Concept Inventory Assessment, AI-CI) of 20 multiple-choice items includes four dimensions: AI general concepts, logic systems, machine learning general concepts, and supervised learning. As such, this scale has a more technical scope and mostly neglects socio-legal issues. The multi-modal implementation (several of the items are image-based) also complicates its use in follow-up studies. Most recently, Markus et al. (2025) presented a comprehensive and psychometrically sound AI literacy scale, the AI Competency Objective Scale (AICOS). Dividing AI literacy into six dimensions, including one on generative AI, it comprises a full version with 51 items and a short version with 18 items. AICOS is closest to our approach, as it targets the general population, uses a knowledge-based approach, and builds on a pre-existing conceptual foundation of AI literacy, namely the six-dimensional framework of Carolus et al. (2023) and Annapureddy et al. (2025) (understand AI, apply AI, detect AI, create AI, AI ethics, generative AI). In contrast to our scale, AICOS uses a four-option response format in which each statement has four response options, one of which is true. Unlike our SAIL4ALL scale, which was built in English, AICOS was created and launched in German and then translated into English. Finally, several items in AICOS, especially in the full version, are complex, leading to a median response time of over an hour in their survey (62.78 min) and making reuse of the full scale impractical in most scenarios. Nevertheless, together with our SAIL4ALL scale presented here, Markus et al.’s (2025) AICOS represents the most thorough attempt to measure AI literacy objectively and holistically, and the short version is much more feasible to implement.
Objectives, research questions, and hypotheses
To address previous limitations and create a more robust measurement tool, we developed an AI literacy scale suitable for general audiences across diverse contexts (e.g., educational, professional). To evaluate the psychometric properties of this new measure, several hypotheses must be tested, according to the validity argument of the American Educational Research Association (AERA, 2014). First, the appropriateness of the response format needs to be examined. The choice and number of response options, along with the inclusion of neutral midpoints, can significantly affect both data collection and interpretation (Asamoah et al., 2024). Therefore, it is essential to explore how score interpretations differ between two-point and five-point response formats. Based on this, we propose the following research question (RQ):
RQ1: How does the number of response options in a Likert scale (e.g., five-point vs. two-point) affect response variability and psychometric properties such as reliability and validity?
Additionally, the nomological network of relationships between the test and external constructs must be examined (Lim, 2024). At the time of study design, no existing scales could serve as a gold standard for SAIL4ALL. Therefore, the relationship between SAIL4ALL scores and other relevant constructs will be explored. Based on this, we propose the following three hypotheses:
H1: Participants with greater acceptance of AI will have higher AI literacy scores.
H2: Participants with greater fear of AI will have lower AI literacy scores.
H3: Participants with greater affinity towards AI will have higher AI literacy scores.
The proposed hypotheses posit that acceptance and affinity towards AI are positively correlated with AI literacy, whereas fear is negatively correlated. This aligns with existing literature that associates positive attitudes and interest in AI with higher levels of AI literacy. For instance, a study by Reyes et al. (2024) found that university students’ positive attitudes towards AI were associated with higher AI literacy scores. Similarly, Laupichler et al. (2024) reported a strong positive correlation between AI literacy and positive attitudes towards AI among medical students. Furthermore, Çayak (2024) found a significant positive correlation between teachers’ positive attitudes towards AI and their AI literacy levels, and a significant negative correlation between negative attitudes and AI literacy.
Lastly, we aim to explore potential group differences based on participants’ demographic characteristics, particularly gender and educational level. These represent two variables frequently examined in studies related to technology literacy. In terms of gender, the persistent disparity in STEM and technical fields such as computer science has led scholars to emphasize the need for integrating gender considerations into AI literacy research (Casal-Otero et al., 2023). Furthermore, systemic gender biases embedded in the development and application of AI technologies make gender a critical analytical lens in AI education (West et al., 2019; Xia et al., 2023). Accordingly, several studies have investigated gender differences in AI literacy across various populations, including students from different educational levels (e.g., Yim, 2024; Tan and Tang, 2025), instructors (e.g., Özden et al., 2025; Salhab, 2024), and the general public (e.g., Hossain et al., 2025).
Educational attainment has also been widely recognized as a factor influencing AI literacy (e.g., Long and Magerko, 2020; Laupichler et al., 2022). Although conclusive evidence on the specific conditions under which education affects AI literacy remains limited, recent research has increasingly examined educational level as a predictive variable (e.g., Hossain et al., 2025). Since AI is a technical and social subject, where developments are strongly shaped by foundational and applied research and where higher education systems play an important role, we expect there to be an association between education and AI literacy.
To this end, we pose the following RQs:
RQ2: Does participants’ gender influence their responses on the AI literacy scale?
RQ3: Do individuals with higher levels of education exhibit greater AI literacy than those with lower levels of education?
Material and methods
Our process spanned three analytical phases, following widely used recommendations in the literature (e.g., AERA et al., 2014; Muñiz and Fonseca-Pedrero, 2019). Phase 1 (Scale development) includes the definition of the construct and its dimensions, the generation of the item pool, the assessment of the items’ appropriateness by a panel of experts, and the creation of the scale used in Phase 2. Subsequently, Phase 2 (Pilot test) provides initial evidence of psychometric quality and leads to the development of a second version. Phase 3 establishes further quantitative evidence of the psychometric quality of this second version through a comprehensive, cross-sectional examination of its properties, culminating in the presentation of the definitive scale. Figure 1 shows the entire scale development process, accompanied by a detailed description. Ethical approval was obtained from the National University of Singapore for all phases (see "Ethical approval" section).
Phase 1: scale development
To identify and define the domain, we first conducted a comprehensive literature review on the development of AI literacy and similar scales. As mentioned, we identified Long and Magerko’s (2020) AI literacy framework as the most suitable conceptual foundation due to its clarity and comprehensive assessment of previous AI literacy literature, thus offering a suitably holistic approach to the matter. We relied heavily on the structure of this framework to develop an initial pool of items for 13 (out of 17) dimensions/competencies of four of Long and Magerko’s (2020) five themes of AI literacy (see Table A2). We made several adaptations to the 17 competencies proposed by Long and Magerko to align the framework with our study’s operational requirements. First, Competency 6 (Imagine Future AI) was removed, as it could not be reliably operationalized using a standardized scale: its inherently speculative and abstract nature made it difficult to translate into measurable survey items without compromising validity. Second, we merged three closely related competencies (i.e., Competency 11 (Data literacy), Competency 12 (Learning from data), and Competency 13 (Critically interpreting data)) into a single, broader category labeled “Data understanding” (corresponding to Competency 10 in Table A2). This decision was based on conceptual overlap among the three and a need for parsimony in the measurement model, as all three involve interpreting and deriving meaning from data in ways that are deeply interrelated. Additionally, Competency 15 (Sensors) was combined with Competency 14 (Action and reaction), reflecting the integrated nature of sensing technologies and responsive systems. This consolidation acknowledges that understanding how systems perceive inputs (through sensors) and act upon them (action and reaction) often forms a unified conceptual and practical skill. Finally, we excluded Competency 17 (Programmability) from our framework, as its technical specificity falls outside the focus of our study and would likely not be relevant or interpretable for participants without specialized programming knowledge. Moreover, since ethics are a fundamental part of AI literacy, we also drew on specialized contributions in this area (e.g., Huang et al., 2022) to give these aspects particular emphasis. We hence propose a set of 13 competencies defining AI literacy and their corresponding items (Version 1, 113 in total), which are detailed in the results section.
After the initial round of item development, eight experts separately evaluated the appropriateness of these dimensions and their constituents as indicators of AI literacy between January and April 2023. All were contacted by the researchers on the basis of their AI expertise and were compensated with approximately 200 USD at the time of data collection. The AI experts were three women and five men, representing diverse specialized fields (i.e., two experts in computer science, one in computer physics, one in ethics, one in law, one in data protection, one in data analysis and psychometrics, and one applied social researcher with knowledge of computer science) and working in six different countries (i.e., Italy, Singapore, Spain, Switzerland, Turkey and the United Arab Emirates). It is important to note that many existing scales predominantly involve experts from a single domain, such as computer science, or PhD students who are often at the early stages of their careers, and who are predominantly male (e.g., Carolus et al., 2023; Wang et al., 2023).
For the expert feedback stage, an Excel spreadsheet was devised to evaluate AI literacy across different levels of analysis. This included an introductory page in which the 13 competences were presented and defined to offer the experts a general overview of the dimensions of the AI literacy construct. Each competence, along with its definition and associated items, was then organized into its own dedicated tab. For each item, the experts assessed three aspects that were organized into columns: Representativeness (i.e., “Is this item representative of the dimension that is being evaluated?”; No = 0; It can be improved = 1; Yes = 2), Clarity (i.e., “Is this item clear?”; No = 0; It can be improved = 1; Yes = 2) and Difficulty (i.e., “Assess the level of difficulty of this item for the general population”; Easy = 1 to Difficult = 10). A column titled “Improved item” included the following instruction: “Write an improved version of this item for the general population if necessary”. The experts also reported their overall opinion about the dimension by answering two questions [i.e., “Are all the facets of this dimension being evaluated?” (No = 0; Yes = 1); “If the answer above is no, how can it be improved?”].
Analysis began with an evaluation of the inter-rater agreement among the eight raters using the intraclass correlation coefficient for the clarity, representativeness, and difficulty of the items. In cases of uncertainty, and following the criteria proposed by Landis and Koch (1977) for interpretation, the items were carefully reviewed by the three authors in accordance with the feedback from the experts. This resulted in a new version with fewer items, which was then subjected to a pilot test (Version 2).
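For readers who wish to reproduce this step, the following R sketch illustrates the inter-rater agreement check on a hypothetical ratings matrix (one row per item, one column per expert); the object names and simulated ratings are placeholders, not the study data.

```r
library(psych)

# Hypothetical stand-in for the experts' ratings: one row per item, one column per
# rater; here, clarity ratings coded 0 = No, 1 = It can be improved, 2 = Yes.
set.seed(1)
clarity_ratings <- matrix(sample(0:2, 113 * 8, replace = TRUE), nrow = 113, ncol = 8)

# Intraclass correlation coefficients for the eight raters; with a fixed panel rating
# every item, the two-way average-measures coefficients (e.g., ICC2k) are of interest.
psych::ICC(clarity_ratings)
```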
Phase 2: pilot test
After providing their informed consent, participants were asked to assess the factual accuracy of 103 statements by selecting either TRUE or FALSE based on their understanding of AI. We used Prolific (Palan and Schitter, 2018) to collect a total of 501 responses to Version 2 of the scale from residents of the UK. Upon completing the survey, respondents were reimbursed with £2. The median response time was 11 min, 59 s.
Of these 501 respondents, 49.7% identified as women, 49.7% as men, with the remaining participants identifying as non-binary. Their ages ranged from 18 to 78 (M = 42.2, SD = 13.3). Regarding educational attainment, the distribution was diverse: 38.9% held a Bachelor’s or equivalent degree, 21.6% had completed higher secondary education, 13.4% possessed a Master’s or equivalent degree, 10.8% had a post-secondary non-tertiary education, 6.4% had a lower secondary education, 4.2% had short-cycle tertiary education, 4% held a PhD or equivalent, and 0.8% had only completed primary education.
To analyze the data, we initially recoded inverse items so that a value of 1 always reflects a correct answer. As our themes comprise different competencies, we first examined whether each competence was unidimensional. Since the themes “What Can AI Do?” and “How Should AI Be Used?” each correspond to a single competence, we address them separately in later sections.
We assessed the suitability of our data for analysis by checking the Kaiser–Meyer–Olkin (KMO) measure and Bartlett’s test of sphericity. To explore dimensionality, we conducted a parallel analysis using tetrachoric correlations, followed by an exploratory factor analysis (EFA) using the R lavaan package (Rosseel, 2012) with tetrachoric correlations and the WLSMV (weighted least square mean and variance adjusted) estimator. To assess goodness-of-fit indexes (GOFI), we considered comparative fit index (CFI) and Tucker-Lewis index (TLI) values greater than 0.95 and a root mean square error of approximation (RMSEA) less than 0.05 as excellent, and CFI and TLI values greater than 0.90 and RMSEA less than 0.08 as adequate (Hu and Bentler, 1999; Xia and Yang, 2019). During this process, we reviewed item content, difficulty, and discrimination, deleting items with low correlations or factor loadings in each competence. We report reliability indicators including Cronbach’s alpha (α), ordinal alpha (αo), and categorical omega (ωc; see Doval et al., 2023).
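As an illustration of this analysis pipeline, the R sketch below runs the same sequence of checks (KMO, Bartlett’s test, parallel analysis on tetrachoric correlations, a single-factor model estimated with WLSMV, and reliability estimates) on simulated binary data for one hypothetical competence; the item names and data are stand-ins, and categorical omega would in practice be computed from the fitted categorical model (e.g., via semTools).

```r
library(psych)
library(lavaan)

# Hypothetical stand-in for one competence: 0/1-scored responses to 10 items
set.seed(2)
latent <- rnorm(501)
items  <- as.data.frame(sapply(1:10, function(i) as.integer(0.8 * latent + rnorm(501) > 0)))
names(items) <- paste0("it", 1:10)

# Suitability of the correlation matrix for factoring
tet <- psych::tetrachoric(items)$rho
psych::KMO(tet)
psych::cortest.bartlett(tet, n = nrow(items))

# Parallel analysis on tetrachoric correlations to suggest the number of factors
psych::fa.parallel(items, cor = "tet", fa = "fa")

# Single-factor model for the competence, treating items as ordered
# (tetrachoric correlations) and estimating with WLSMV
model <- paste("F1 =~", paste(names(items), collapse = " + "))
fit   <- lavaan::cfa(model, data = items, ordered = names(items), estimator = "WLSMV")
fitMeasures(fit, c("cfi.scaled", "tli.scaled", "rmsea.scaled"))

# Reliability: Cronbach's alpha on raw scores; ordinal alpha on the tetrachoric matrix
psych::alpha(items)
psych::alpha(tet)
```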
Phase 3: quantitative evidence of psychometric quality
Phase 3 involved two distinct samples. Sample 2 used the same TRUE/FALSE response format as in Phase 2. Sample 3 used a 5-point Likert scale that incorporated a measure of confidence level (1 = False with high confidence; 5 = True with high confidence). Both samples were collected in the UK, this time with Prolific’s representative sample option.
Sample 2. A total of 619 responses were obtained. Respondents who had participated in the pilot test were not eligible. Of the respondents in Sample 2, 51.1% (n = 316) identified as female, 48.1% (n = 298) as male, and 0.9% (n = 5) as non-binary or transgender. Their mean age was 45.8 (SD = 15.2, Range = 20–79). Regarding educational attainment, 39.1% (n = 242) held a Bachelor’s degree or equivalent, 24.6% (n = 152) had completed upper secondary education, 19.9% (n = 123) had pursued postgraduate studies, 11.5% (n = 71) held an HNC or HND, and 0.3% (n = 2) had only completed primary school. In terms of household income, 15.2% (n = 94) reported an income lower than 20 K Pounds Sterling a year, 47% (n = 291) up to 50 K, 27.6% (n = 171) up to 90 K, and 10.2% (n = 63) reported an income higher than 90 K Pounds Sterling.
Sample 3. We received 393 responses from Sample 3. Again, respondents who had participated in the earlier data collections (Sample 1 and Sample 2) were not eligible. Of the respondents in Sample 3, 50.9% (n = 199) identified as female, 48.3% (n = 189) as male, 0.5% (n = 2) as transgender, and one person preferred not to disclose their gender. The mean age of respondents was 46.3 (SD = 15.4, Range = 18–83). 36.3% (n = 142) held a Bachelor’s degree or equivalent, 8.7% (n = 34) had completed lower secondary education, 23.0% (n = 90) had completed upper secondary education, 21.5% (n = 84) had pursued postgraduate studies, and 10.5% (n = 41) held an HNC or HND. 21.0% (n = 82) reported a household income lower than 20 K Pounds Sterling a year, 43.2% (n = 169) up to 50 K, 25.1% (n = 98) up to 90 K, and 10.1% (n = 42) reported an income higher than 90 K.
To provide evidence of internal structure validity, we conducted confirmatory factor analysis (CFA) for the scales in both samples. Following the recommendations of Doval et al. (2023), we utilized tetrachoric correlations and the WLSMV estimator for Sample 2, given the dichotomous nature of the data. For Sample 3, which employed a 5-point Likert scale, we used Pearson correlations, with the MLR estimator for the first two themes and the ULS estimator for the other, due to the higher number of items. We assessed GOFI using the same criteria as in Phase 2.
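The following R sketch illustrates this CFA setup with lavaan for the two-factor “What is AI?” model under both response formats; the simulated data sets, item names, and population loadings are hypothetical stand-ins for Samples 2 and 3.

```r
library(lavaan)

# Hypothetical stand-ins for the two samples: 14 items loading on two correlated factors
simulate_items <- function(n, likert = FALSE) {
  f1  <- rnorm(n)
  f2  <- 0.5 * f1 + sqrt(0.75) * rnorm(n)
  raw <- cbind(sapply(1:10, function(i) 0.7 * f1 + rnorm(n)),
               sapply(1:4,  function(i) 0.7 * f2 + rnorm(n)))
  out <- if (likert) apply(raw, 2, function(x) as.numeric(cut(x, 5))) else (raw > 0) * 1
  out <- as.data.frame(out)
  names(out) <- paste0("it", 1:14)
  out
}
set.seed(3)
sample2 <- simulate_items(619)                 # binary (TRUE/FALSE scored 0/1)
sample3 <- simulate_items(393, likert = TRUE)  # 5-point Likert

model_what_is_ai <- '
  RUI =~ it1 + it2 + it3 + it4 + it5 + it6 + it7 + it8 + it9 + it10
  GvN =~ it11 + it12 + it13 + it14
'

# Sample 2: items declared as ordered, so lavaan uses tetrachoric correlations with WLSMV
fit_s2 <- cfa(model_what_is_ai, data = sample2,
              ordered = paste0("it", 1:14), estimator = "WLSMV")

# Sample 3: items treated as continuous (Pearson correlations), robust ML estimation
fit_s3 <- cfa(model_what_is_ai, data = sample3, estimator = "MLR")

fitMeasures(fit_s2, c("cfi.scaled", "tli.scaled", "rmsea.scaled"))
fitMeasures(fit_s3, c("cfi.robust", "tli.robust", "rmsea.robust"))
```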
To examine differences between relevant groups, such as gender and education level, we first evaluated measurement invariance among participants across these groups (Fairness in Fig. 1). This involved conducting multi-group CFA using the aforementioned estimators. For Sample 2, we evaluated configural invariance (i.e., same factor structure between groups) and scalar invariance (i.e., same factor loadings and thresholds between groups). In Sample 3, we assessed configural, metric (i.e., same factor loadings between groups), and scalar invariance. We gauged differences by considering changes in chi-square and p-values, as well as changes of 0.01 or more in CFI and RMSEA (see Chen, 2007).
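A minimal sketch of this multi-group invariance sequence, assuming the Likert-format data and model from the previous sketch plus a hypothetical gender variable, is shown below; for the dichotomous format, the scalar step would constrain thresholds rather than intercepts.

```r
# Reuses `model_what_is_ai` and `sample3` from the previous sketch and adds a
# hypothetical grouping variable.
set.seed(4)
sample3$gender <- sample(c("female", "male"), nrow(sample3), replace = TRUE)

configural <- cfa(model_what_is_ai, data = sample3, group = "gender", estimator = "MLR")
metric     <- cfa(model_what_is_ai, data = sample3, group = "gender", estimator = "MLR",
                  group.equal = "loadings")
scalar     <- cfa(model_what_is_ai, data = sample3, group = "gender", estimator = "MLR",
                  group.equal = c("loadings", "intercepts"))

# Chi-square difference tests, plus changes in CFI/RMSEA against the 0.01 criterion
lavTestLRT(configural, metric, scalar)
sapply(list(configural = configural, metric = metric, scalar = scalar),
       fitMeasures, fit.measures = c("cfi.robust", "rmsea.robust"))
```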
Finally, to assess relationships between the variables, after studying measurement invariance, we first compared mean scores according to the gender and level of education of the participants. For gender, we included only female and male participants due to the limited number declaring another gender, while for education, we assessed two distinct groups: lower education (individuals without university studies) and higher education (individuals with university studies). We evaluated differences using the t-test with Welch correction and assessed effect size using Cohen’s d (small effect = 0.20; medium effect = 0.50; large effect = 0.80; Cohen, 1988). For discriminant and convergent relations, we utilized Pearson’s correlation, considering values higher than 0.30 as medium effects and values greater than 0.50 as large effects (Cohen, 1988).
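These comparisons can be illustrated with the short R sketch below, which uses hypothetical score and grouping vectors; the Cohen’s d computation uses a simple pooled-SD form, and the external measure is a simulated placeholder.

```r
set.seed(5)
n           <- 393
gender      <- factor(sample(c("female", "male"), n, replace = TRUE))
education   <- factor(sample(c("lower", "higher"), n, replace = TRUE))
theme_score <- rnorm(n, mean = 3.8, sd = 0.4)                    # stand-in for a theme mean score
ati_score   <- 0.3 * as.numeric(scale(theme_score)) + rnorm(n)   # stand-in for an external measure

# Welch's t-test (R's default for a two-sample t-test) for gender and education
t.test(theme_score ~ gender)
t.test(theme_score ~ education)

# Cohen's d with a simple pooled-SD denominator (0.20 small, 0.50 medium, 0.80 large)
group_means <- tapply(theme_score, gender, mean)
group_sds   <- tapply(theme_score, gender, sd)
unname(diff(group_means) / sqrt(mean(group_sds^2)))

# Convergent/discriminant relations via Pearson correlations (0.30 medium, 0.50 large)
cor.test(theme_score, ati_score, method = "pearson")
```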
For evidence of validity based on the relationship with external variables, we included two constructs. First, the Attitudes towards Artificial Intelligence (ATAI) scale proposed by Sindermann et al. (2021), which includes 5 items with response options ranging from 1 (Totally agree) to 11 (Totally disagree) (e.g., “I fear Artificial Intelligence”). For Sample 2 we obtained favorable evidence of validity for a two-factor model except for RMSEA and TLI (X2[df] = 50.85 [4], CFI = 0.95, TLI = 0.88, RMSEA = 0.14 [0.10, 0.17]), while for Sample 3 the fit values were excellent (X2[df] = 7.88 [4], CFI = 0.99, TLI = 0.98, RMSEA = 0.05 [0.00, 0.09]). Internal consistency reliability was adequate for the acceptance factor (Sample 2: α/ω = 0.74; Sample 3: α/ω = 0.81) and the fear factor (Sample 2: α = 0.76, ω = 0.77; Sample 3: α = 0.72, ω = 0.75).
We also included the Affinity for Technology Interaction (ATI) scale proposed by Franke et al. (2019) and further validated by Lezhnina and Kismihók (2020). This is a 9-item scale rated on a 6-point Likert scale (1 = Totally disagree; 6 = Totally agree). An example item is: “I like to occupy myself in greater detail with technical systems”. For both samples we obtained positive evidence for a one-factor model of the ATI except for the RMSEA (Sample 2: X2[df] = 241.30 [20], CFI = 0.99, TLI = 0.98, RMSEA = 0.13 [0.11, 0.15]; Sample 3: X2[df] = 241.30 [20], CFI = 0.99, TLI = 0.98, RMSEA = 0.13 [0.11, 0.15]). Internal consistency reliability was adequate for Sample 2 and Sample 3 when considering omega (α = 0.65, ω = 0.81).
Results
Phase 1: scale development
To identify and define the scale domain, we based our work on Long and Magerko (2020). We conceptualized our scale along four areas of knowledge that define AI literacy with different competencies: 1. What is AI? 2. What can AI do? 3. How does AI work? 4. How should AI be used?
To develop the scales, we considered that each of these competencies should be represented in the broader dimension. As for item format, we considered the true/false format to be the most suitable as we were assessing factual statements. In total, the initial formulation of the scale contained 131 items across 13 competencies that were themselves nested in the four aforesaid areas of knowledge/themes (henceforth themes). The themes and competencies are outlined in Table A3 (see Supplementary Material).
In terms of the expert evaluation of the subject matter, we initially assessed interrater reliability. A detailed reliability analysis is provided in Table A3 (Supplementary materials). From the initial pool of 131 items, 39.4% (n = 41) remained unchanged, while 26.9% (n = 28) required adjustments. 24.0% (n = 25) of items were excluded, and consensus could not be reached for 9.6% (n = 10), primarily the items related to the theme “How should AI be used?” Some examples of these changes are provided in the Supplementary material (Table A4). Following the experts’ suggestions, existing items were modified, and new items were created to address perceived gaps. This was especially crucial for Theme 4, which had issues with construct representativeness, leading to a complete overhaul in accordance with Long and Magerko’s (2020) definitions. As a result of this phase, Version 2 of the scale was produced, comprising 103 items across the 13 competencies.
Phase 2: pilot test
Table A5 presents the results of the one-dimensional EFA for individual competences, excluding competence 12 (refer to 5.2.3). As shown, the results are favorable, but the reliability indicators have low values. In this initial phase, following a meticulous examination of item difficulty and discrimination, 23 items were excluded.
What is AI?
The results of the parallel analysis indicated either a one-dimensional or a two-factor solution for the EFA, both of which align with the theoretical framework. Through successive modeling, we identified items with low factor loadings. After reviewing their content, we removed items whose exclusion did not significantly impact the representativeness of these competences. Ultimately, we evaluated two EFA solutions: one comprising a single factor with 14 items (factor loadings ranging from 0.19 to 0.67; X2 = 105.37[77], p = 0.02, CFI = 0.92, TLI = 0.91, RMSEA [90% CI] = 0.03 [0.01, 0.04]; α = 0.59, ω = 0.61) and one comprising a two-factor solution (also with 14 items). In the two-factor solution, we obtained one factor including items from three competences (i.e., Recognizing AI, Understanding Intelligence and Interdisciplinarity of AI, labeled RUI) and one including the General vs. Narrow items. For the first factor, loadings ranged from 0.42 to 0.68, except for one item of C3 (“Computer vision is an example of interdisciplinary AI technology”) that had a loading of 0.21. We decided to keep it due to the representativeness of this competence in the scale. As for the second factor, loadings ranged from 0.56 to 0.78. GOFI were excellent (X2 = 85.65[77], p = 0.21, CFI = 0.97, TLI = 0.97, RMSEA [90% CI] = 0.02 [0.00, 0.03]). Internal consistency reliability was low, which could be explained by the low variance of the items (RUI: α = 0.51, ωc = 0.52; General vs. Narrow: α = 0.50, ωc = 0.51).
What can AI do?
The results of parallel analysis suggested either two or three factors. We opted for a two-factor model, as it aligns more closely with the theoretical background. Consequently, we present a scale comprising four items for AI strengths and four for AI weaknesses (X2 = 47.46[26], p = 0.01, CFI = 0.98, TLI = 0.98, RMSEA [90% CI] = 0.02 [0.00, 0.03]). For the strengths factor, we obtained an ωc of 0.56 and an α of 0.56, and for the weaknesses factor we obtained an ωc of 0.49 and an α of 0.46.
How does AI work?
On initial inspection, 12 items were deemed unsuitable and subsequently removed. Parallel analysis indicated a one-factor solution for this dimension. After exploring the initial solution, four items were removed. Given that our aim for this dimension was to encompass all theoretical competences, we reviewed item content and difficulty, leading to the exclusion of an additional four items. The final version of this dimension comprises 23 items, representing all competencies. GOFI were adequate (X2 = 328.75[230], p < 0.001, CFI = 0.94, TLI = 0.93, RMSEA [90% CI] = 0.03 [0.03, 0.04]). We obtained adequate internal consistency reliability estimates for the general score (ωc = 0.75, α = 0.72).
How should AI be used?
Parallel analysis suggested the presence of one or two dimensions. However, no theoretically compatible solution was found for a two-factor model. For the single-factor solution, we conducted three iterative analyses, systematically excluding items, resulting in a final version comprising 10 items, where at least one item of each ethics domain (e.g., transparency, accountability, regulation, privacy, trust, freedom, justice, dignity) remained. The final solution was deemed adequate (X2 = 45.71[35], p = 0.11, CFI = 0.97, TLI = 0.97, RMSEA [90% CI] = 0.02 [0.00, 0.04]) and internal consistency reliability values were as follows: ωc = 0.70, α = 0.66.
Considerations for Phase 3
The pilot phase and subsequent analysis of results led to the development of an AI literacy tool with four distinct themes. The first scale, “What is AI?”, comprises two dimensions: RUI (10 items) and General vs. Narrow (4 items), totaling 14 items. The second scale, “What can AI do?”, also features two dimensions, evaluating the strengths (5 items) and weaknesses (4 items) of AI, totaling 9 items. The third scale, “How does AI work?”, consists of 23 items measuring a unidimensional scale. Lastly, “How should AI be used?” is also unidimensional, featuring 10 items.
While the evidence for the internal structure validity of the test scores remains robust, the reliability of the scales is often below the desirable threshold for research. This might be attributed in part to the binary (true/false) response scale used and the low variability found. Therefore, in Phase 3, data were also collected using a response scale with a greater number of categories, in addition to data collection with the original true/false response format.
Phase 3: quantitative evidence of psychometric quality
Table 1 presents GOFI for the measurement models in Phase 3. Descriptive statistics and internal consistency reliability coefficients for the mean scores derived from the final models are provided in Table 2.
What is AI?
We evaluated the two models previously considered in Phase 2. Initially, we examined a one-factor model, which exhibited adequate GOFI in both samples. Subsequently, we tested the two-factor solution proposed in Phase 2, comprising the RUI and General vs. Narrow competences. Factor loadings for both models are depicted in Fig. A1. While both samples demonstrated adequate GOFI and similar factor loadings, internal consistency reliability was higher with the 5-point Likert response scale.
As detailed in Table 2, for Sample 2, the mean scores were 0.86 and 0.87 for the factors, indicating that, on average, individuals answered most items correctly. Conversely, for the Likert scale, the mean scores were 3.82 and 3.45, suggesting that, on average, respondents had a moderate level of confidence in their responses. Consequently, the interpretation of test scores differs depending on the scale response utilized.
What can AI do?
Based on the findings from Phase 2, we examined a model featuring two factors (strengths and weaknesses). However, this model yielded unsatisfactory GOFI for Sample 2 and could not be tested for Sample 3 due to convergence issues. Negative variances were observed in the items of the weakness dimension, prompting further investigation into the independence of the two dimensions. As shown in Table 1, GOFI for both samples were excellent for the strengths dimension but unacceptable for the weakness dimension. Factor loadings were consistently high and uniform across both samples (see Fig. A2 for standardized factor loadings). Once again, internal consistency reliability was superior for Sample 3, with values reaching acceptable levels.
How does AI work?
The model initially proposed in Phase 2 proved to be suitable for both samples. The standardized factor loadings are detailed in Table A6. The internal consistency reliability estimates are adequate for both samples, with Sample 3 demonstrating slightly better results. Across both samples, the participants exhibit a strong understanding of how AI works.
How should AI be used?
Finally, the ethics scale exhibits excellent GOFI for both samples (see Table A7 for standardized factor loadings). However, internal consistency reliability is not adequate for Sample 2. Considering the mean scores of both samples, participants generally possess a good understanding of how AI should be used.
Measurement Invariance
Table A8 (Supplementary materials) provides the measurement invariance results for both samples, considering the participants’ gender and level of education. For the dimension “What is AI?”, metric and scalar invariance were achieved for both Sample 2 and Sample 3 with respect to gender. However, metric invariance was not achieved across education groups, suggesting that factor loadings vary with education level. In the strengths dimension of “What can AI do?” and in “How should AI be used?”, metric and scalar invariance were established for gender and education in both samples. Therefore, the interpretation of scores is comparable across groups. Unfortunately, invariance in the dimension “How does AI work?” could not be studied due to convergence issues in the models.
Evidence of validity based on relationships with external variables
Evidence of discriminant and convergent validity
Table 3 presents correlations between the developed scales and external measures. The purpose of the first set of correlations (1–5) is to establish evidence of discriminant validity, demonstrating that although the developed scales assess AI literacy, they are distinct from each other. Second, we aim to provide convergent evidence by examining correlations with externally used instruments.
Correlations exhibit consistent directionality across both the dichotomous response scale and the 5-point scale, with generally higher correlations observed in the latter. In terms of discriminant relationships, correlations tend to be low, except for F1 with “How should AI be used?” (Sample 2: r = 0.51; Sample 3: r = 0.63) and General vs. Narrow with “How should AI be used?” (Sample 2: r = 0.32; Sample 3: r = 0.41).
In terms of the expected convergent relationships, correlations for Sample 2 are generally low. However, for Sample 3, moderate positive correlations are observed between RUI and ATAI acceptance, moderate negative correlations between RUI and ATAI fear, and moderate positive correlations between “What can AI do?” and ATAI acceptance.
Differences between gender and level of education
Table 4 displays gender differences for the variables examined in both samples. Across both samples, men tend to score higher than women in RUI and “How does AI work?” In Sample 2, the effect size is moderate, while in Sample 3, it is large. Furthermore, in Sample 3, differences are also observed in “General vs. Narrow,” with men obtaining higher scores than women. No differences were detected in the remaining dimensions. However, it is important to interpret these group differences with caution, as measurement invariance could not be examined for all dimensions.
Table 5 presents group differences by education. In both samples, individuals with higher levels of education tend to score higher in RUI, with the effect size more pronounced in Sample 3 than in Sample 2. However, invariance cannot be established for this variable in Sample 3, so the results should be interpreted with caution.
Additionally, in Sample 2, differences are observed in “What can AI do?”. Individuals with lower levels of education score higher (reverse scale), indicating lower knowledge compared to those with higher levels of education. Similarly, individuals with higher levels of education tend to score higher in “How does AI work?”
Discussion and conclusion
Summary
The final version of the SAIL4ALL scale consists of four independent themes, each measuring a specific aspect of AI literacy, based on the conceptual framework of Long and Magerko (2020). These include: (1) What is AI? with two subdimensions: Recognizing AI, Understanding Intelligence and Interdisciplinarity (RUI; 10 items), and General vs. Narrow AI (4 items); (2) What can AI do? measuring perceived strengths (5 items) and weaknesses (4 items); (3) How does AI work? comprising a unidimensional 23-item scale assessing knowledge of underlying mechanisms; and (4) How should AI be used? consisting of 10 items that cover key ethical principles. The SAIL4ALL scale exists in two versions: a binary (true/false) format for assessing factual correctness and a 5-point Likert scale capturing both correctness and respondent confidence. Depending on the research aims and context, each of the four themes may be administered independently or in combination. However, calculating an overall literacy score is not recommended, as the multidimensional nature of the construct precludes the assumption of unidimensionality across all themes. The final items for each theme, along with their wording and scoring instructions, are available in the Supplementary Materials (see section “Supplementary information” below).
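To make the scoring concrete, the following R sketch shows how per-theme mean scores might be computed under either response format; the item names, the choice of reverse-keyed items, and the reverse-keying rules are hypothetical placeholders, and the authoritative scoring instructions remain those in the Supplementary Materials.

```r
# Compute a mean score for one theme under either response format.
# `responses`: data frame of item responses; `reversed`: names of reverse-keyed items.
score_theme <- function(responses, reversed = character(0), likert = FALSE) {
  x <- responses
  if (likert) {
    x[reversed] <- 6 - x[reversed]   # assumes 1-5 responses flipped for false statements
  } else {
    x[reversed] <- 1 - x[reversed]   # assumes TRUE = 1 / FALSE = 0 coding
  }
  rowMeans(x, na.rm = TRUE)          # themes are scored as means; no overall total is computed
}

# Example with a hypothetical four-item binary theme, two items reverse-keyed
set.seed(6)
theme_items <- as.data.frame(matrix(rbinom(20 * 4, 1, 0.7), ncol = 4,
                                    dimnames = list(NULL, paste0("q", 1:4))))
score_theme(theme_items, reversed = c("q2", "q4"))
```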
Implications and recommendations
The development and validation of the SAIL4ALL scale advances the research and understanding of AI literacy in general (Almatrafi et al., 2024; Zhang et al., 2025) and its measurement (Lintner, 2024) in particular. A particular strength is the focus on factual knowledge as opposed to self-reported skills, experiences, or attitudes. This aligns with a very recent turn in AI literacy measurement that acknowledges the importance of assessing actual knowledge and competencies rather than perceived and self-reported ones (Hornberger et al., 2023; Markus et al., 2025; Weber et al., 2023; Zhang et al., 2024; see Lintner, 2024 for a review of AI literacy measurement). A meticulous three-phase methodology, including expert feedback, pilot testing, and quantitative analysis following established scale development guidelines (AERA et al., 2014), produced a robust measurement tool that can be used in general populations like the ones examined in this research. This is a further strength in comparison to other scales, which often target specific populations such as students (Hornberger et al., 2023; Zhang et al., 2024) or experts (Weber et al., 2023 and their differentiation between user AI literacy and creator/evaluator AI literacy). Thus, our SAIL4ALL can be more broadly and flexibly re-used.
The themes can be used in a combined manner or each in isolation. By integrating feedback from diverse experts and refining the scale through pilot testing and quantitative analysis, we align with the foundational framework of Long and Magerko (2020), which presents a holistic, socio-technical understanding that carefully considers ethical and social aspects, including an understanding of the importance of data in AI. Based on the psychometric evidence obtained, specifically the demonstrated unidimensionality of three of these themes and the two-factor dimensionality of “What can AI do?”, it is methodologically sound to use any one of the themes independently in a research study to observe the specific dimensions. Alternatively, researchers may choose to administer all four scales. Thus, it is possible to obtain five differential scores.
The findings on measurement invariance across gender and education offer significant theoretical implications for the study of AI literacy. The attainment of metric and scalar invariance for gender in most dimensions suggests that AI literacy, as measured by this scale, is invariant across genders. However, the lack of metric invariance across different levels of education, particularly in the “What is AI?” dimension, implies that academic background influences how individuals understand or interpret AI. This finding suggests that educational strategies may need to be tailored to address AI literacy differences effectively. Additionally, our ability to gather discriminant evidence and to examine the relationship with technology-related constructs enhances the theoretical understanding of AI literacy by clarifying its boundaries and connections with related constructs, such as AI acceptance and fear. These insights contribute to a more nuanced theory of AI literacy, emphasizing the importance of considering both demographic and psychometric factors in its measurement. It also implies that emotional responses to AI, individuals’ psychological characteristics, or relationships with technology could be considered in further developments of the scale.
To optimize the effectiveness of SAIL4ALL, key insights from this study should be considered. CFA evidence shows that both response formats support the theoretical dimensions outlined by Long and Magerko (2020), with no clear preference for either. The 5-point Likert scale demonstrated better internal consistency reliability than the 2-point scale, a result expected given the increased response variability typically linked to higher reliability (e.g., Nunnally and Bernstein, 1978). While high internal consistency reliability is preferable quantitatively, the 2-point scale may be more appropriate in some contexts due to its simplicity. Regardless of the chosen scale, users should note that interpretation of results will differ based on the format.
This study highlights the importance of considering invariance analysis results before assessing group differences. Invariance was observed across gender and education for most dimensions, indicating that factor loadings and item relationships are consistent and that scores can be reliably compared. However, invariance could not be established for the “How does AI work?” dimension, and metric invariance across education levels was not achieved for “What is AI?”; future studies should explicitly address these cases.
Regarding validity evidence based on external variables, gender differences were found in two dimensions across both response formats and in one dimension specific to the Likert format. The effect size was notably larger in the Likert case, with men scoring higher than women. Additionally, differences in educational level were observed across most scales, except for the “General vs. Narrow” dimension, with individuals holding higher education levels consistently scoring higher. Future research should continue examining the influence of gender and education level on perceptions of AI, as consistent differences suggest that these demographic factors may systematically shape how individuals conceptualize and engage with technology. Deeper exploration of these patterns can guide the development of more inclusive AI literacy frameworks and communication strategies that are responsive to the needs of diverse populations.
Finally, regarding ATAI and ATI scores, correlations were stronger for the 5-point scale than for the 2-point scale. The initial hypotheses are supported: acceptance of and affinity for AI show a positive relationship with the SAIL4ALL scales, whereas fear of AI exhibits a negative relationship.
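The nomological-network checks reported here rest on simple bivariate correlations between theme scores and the external measures. The sketch below illustrates that computation on simulated placeholder vectors; the variable names and effect sizes are hypothetical, not results from our samples.

```python
import numpy as np
from scipy.stats import pearsonr

# Simulated placeholders for one SAIL4ALL theme score and two external measures.
rng = np.random.default_rng(2)
sail_theme = rng.normal(0, 1, 250)
ati_affinity = 0.4 * sail_theme + rng.normal(0, 1, 250)   # expected positive association
atai_fear = -0.3 * sail_theme + rng.normal(0, 1, 250)     # expected negative association

for label, external in [("ATI affinity", ati_affinity), ("ATAI fear", atai_fear)]:
    r, p = pearsonr(sail_theme, external)
    print(f"{label}: r = {r:.2f}, p = {p:.3f}")
```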
This leads us to assert that AI literacy is not a unidimensional construct and requires multiple scales for a thorough assessment. Consequently, SAIL4ALL covers various themes and cannot be used to calculate a single overall AI literacy score. Based on the representative sample, norms for the UK adult population can be established using the descriptive statistics provided in Table 2.
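If such norms are used, an individual raw score can be located relative to the normative distribution. The sketch below assumes approximate normality and uses placeholder values rather than the actual means and standard deviations reported in Table 2.

```python
from scipy.stats import norm

def percentile_from_norms(raw_score: float, norm_mean: float, norm_sd: float) -> float:
    """Approximate percentile rank of a raw theme score, assuming the normative
    distribution is roughly normal (a simplification for illustration only)."""
    z = (raw_score - norm_mean) / norm_sd
    return 100 * norm.cdf(z)

# Placeholder norms -- substitute the descriptive statistics from Table 2.
print(round(percentile_from_norms(raw_score=42, norm_mean=38.0, norm_sd=5.0), 1))
```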
The choice between a two-point (true/false) response scale and a five-point Likert scale in research should be driven by the type of information sought. While the two-point scale assesses knowledge, it may encourage random responses from participants lacking actual understanding, as is common in knowledge tests. In contrast, the five-point Likert scale allows participants to indicate their level of confidence, providing a different interpretation of the data. However, as our findings suggest, some scales, particularly those developed for the true/false format, exhibit limitations such as low reliability.
Our scale offers educators, organizations, policymakers, and researchers a tool to assess and enhance AI literacy among adults in the UK. The scale’s emphasis on factual knowledge can inform curriculum development and educational initiatives aimed at improving the understanding and responsible use of AI. For policymakers, the findings underscore the importance of investing in AI education and literacy programs that go beyond mere interface familiarity and use, addressing AI’s underlying mechanisms and ethical considerations.
Limitations
While this study provides valuable insights into AI literacy measurement, it comes with some limitations. The focus on factual knowledge, while reducing subjectivity, may overlook the nuanced understanding that comes from practical experience and critical reflection on AI use.
Furthermore, this study used participant samples recruited through Prolific. While Prolific provides tools for creating samples representative of certain national populations based on demographics such as age, sex, and ethnicity, it is important to acknowledge limitations that may affect the generalizability of our findings. Firstly, Prolific’s participant pool comprises individuals who have self-selected to join the platform, which may introduce selection bias. These participants are typically more technologically adept, which could influence responses, particularly in studies related to AI, where familiarity with technology may impact perceptions and understanding. Secondly, while efforts were made to obtain a demographically representative sample, certain subgroups (e.g., older adults) may still be underrepresented. This underrepresentation may limit the applicability of our results to these populations. Additionally, the reliance on online data collection inherently excludes individuals without reliable internet access or those less comfortable with digital platforms. This digital divide may further skew the sample, affecting the external validity of our findings. Given these considerations, while our study provides valuable insights into AI literacy, caution should be exercised when generalizing these results to the broader population. Future research should aim to include more diverse sampling methods, such as incorporating offline recruitment strategies, to enhance the representativeness and generalizability of findings.
Moreover, while our sample was diverse, it was confined to the UK context. To help address this limitation, we intentionally avoided culturally specific references and incorporated feedback from experts with varied cultural backgrounds during scale development. This is important because SAIL4ALL aims to be applicable across settings. However, the SAIL4ALL scale may require additional psychometric testing or adaptation before being applied in other cultural settings. The representative nature and internal diversity of the two independently collected UK samples allow for cautious generalization of the findings in this context. By testing the scale across distinct participant groups, we were able to assess the replicability of its structure and properties, which reinforces the robustness of the instrument and its suitability for a range of educational and social environments. Using representative samples also enhanced confidence in the scale’s broader applicability, as they capture key demographic variables such as gender and education level. While we acknowledge that international generalization is not yet possible without further cross-cultural validation, the consistency observed across the two UK samples provides strong evidence of the scale’s relevance at the national level. Future studies should also examine differences between samples completing the two-point and the five-point formats.
On the other hand, at the time the scale was developed, no other psychometrically validated measures of AI literacy were available. Consequently, and acknowledging this as a key limitation of the study, it was not possible to thoroughly assess the discriminant validity of the SAIL4ALL.
Likewise, the relatively high mean scores observed in our study suggest that participants generally reported elevated levels of AI literacy. This raises concerns regarding the scale’s sensitivity in discriminating between individuals with medium and high levels of AI literacy. Such potential ceiling effects may limit the instrument’s effectiveness in detecting nuanced differences at the upper end of the literacy spectrum. Future research should consider refining the scale to enhance its discriminative capacity among higher-literacy individuals, possibly by incorporating more challenging items or alternative assessment methods.
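One simple way to screen for the ceiling effects discussed above (a rough diagnostic rather than the procedure used in this study) is to inspect the share of respondents near the maximum possible score together with the distribution's skewness; the scores below are simulated placeholders.

```python
import numpy as np
from scipy.stats import skew

def ceiling_rate(scores: np.ndarray, scale_min: float, scale_max: float,
                 band: float = 0.10) -> float:
    """Proportion of respondents scoring within the top `band` share of the
    possible score range -- a crude screen for ceiling effects."""
    threshold = scale_max - band * (scale_max - scale_min)
    return float(np.mean(scores >= threshold))

# Hypothetical theme scores on a 0-14 true/false subscale.
rng = np.random.default_rng(3)
scores = np.clip(rng.normal(12, 2, 400), 0, 14)
print(ceiling_rate(scores, scale_min=0, scale_max=14), round(skew(scores), 2))
```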
Subsequent studies might also want to use our scale to test AI literacy levels in settings such as work or higher education. Moreover, the integration of qualitative assessments could offer deeper insights into how individuals interpret and engage with AI technologies, complementing the quantitative measures presented. Finally, longitudinal studies could explore how AI literacy evolves over time, particularly as AI technologies and their societal implications continue to develop.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request. The complete list of items and their content are available in the Supplementary Material.
References
Almatrafi O, Johri A, Lee H (2024) A systematic review of AI Literacy conceptualization, constructs, and implementation and assessment efforts (2019–2023). Comput Educ Open 6:100173. https://doi.org/10.1016/j.caeo.2024.100173
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014) Standards for Educational and Psychological Testing. American Educational Research Association
Annapureddy R, Fornaroli A, Gatica-Perez D (2025) Generative AI literacy: twelve defining competencies. Digit Gov Res Pract 6(1):1–21. https://doi.org/10.1145/3685680
Appiahene P, Domfeh EA, Andoh B (2022) Definitions of Artificial Intelligence: a review. TechRxiv. https://doi.org/10.22541/au.164670471.11415616/v1
Aroyo AM, de Bruyne J, Dheu O, Fosch-Villaronga E, Gudkov A, Hoch H, Jones S, Lutz C, Sætra H, Solberg M, Tamò-Larrieux A (2021) Overtrusting robots: Setting a research agenda to mitigate overtrust in automation. Paladyn J Behav Robot 12(1):423–436. https://doi.org/10.1515/pjbr-2021-0029
Asamoah NAB, Turner RC, Lo W-J, Crawford BL, McClelland S, Jozkowski KN (2024) Evaluating item response format and content using partial credit trees in scale development. J Survey Stat Method. Advance online publication. https://doi.org/10.1093/jssam/smae028
Biagini G, Cuomo S, Ranieri M (2023) Developing and validating a multidimensional AI literacy questionnaire: operationalizing AI literacy for higher education. In: Proceedings of the first international workshop on high-performance artificial intelligence systems in education 1–1 CEUR. https://ceur-ws.org/Vol-3605/1.pdf
Booth S, Tompkin J, Pfister H, Waldo J, Gajos K, Nagpal R (2017) Piggybacking robots: human-robot overtrust in university dormitory security. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction 426–434 ACM. https://doi.org/10.1145/2909824.3020211
Carolus A, Koch MJ, Straka S, Latoschik ME, Wienrich C (2023) MAILS—Meta AI literacy scale: development and testing of an AI literacy questionnaire based on well-founded competency models and psychological change- and meta-competencies. Comput Hum Behav Artif Hum 1(2):100014. https://doi.org/10.1016/j.chbah.2023.100014
Casal-Otero L, Catala A, Fernández-Morante C, Taboada M, Cebreiro B, Barro S (2023) AI literacy in K-12: a systematic literature review. Int J STEM Educ 10(1):29. https://doi.org/10.1186/s40594-023-00418-7
Çayak S (2024) Investigating the relationship between teachers’ attitudes toward artificial intelligence and their artificial intelligence literacy. J Educ Technol Online Learn 7(4):367–383. https://doi.org/10.5455/jetol.2024.12.367-383
Cetindamar D, Kitto K, Wu M, Zhang Y, Abedin B, Knight S (2022) Explicating AI literacy of employees at digital workplaces. IEEE Trans Eng Manag 71:810–823. https://doi.org/10.1109/TEM.2021.3138503
Chen FF (2007) Sensitivity of goodness of fit indexes to lack of measurement invariance. Struct Equ Modeling Multidiscip J 14(3):464–504. https://doi.org/10.1080/10705510701301834
Chiu TK, Ahmad Z, Ismailov M, Sanusi IT (2024) What are artificial intelligence literacy and competency? A comprehensive framework to support them. Comput Educ Open 6:100171. https://doi.org/10.1016/j.caeo.2024.100171
Chiu Y-T, Zhu Y-Q, Corbett J (2021) In the hearts and minds of employees: a model of pre-adoptive appraisal toward artificial intelligence in organizations. Int J Inf Manag 60:102379. https://doi.org/10.1016/j.ijinfomgt.2021.102379
Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. Psychology Press
Dai Y, Chai C-S, Lin P-Y, Jong MS-Y, Guo Y, Qin J (2020) Promoting students’ well-being by developing their readiness for the artificial intelligence age. Sustainability 12(16):6597. https://doi.org/10.3390/su12166597
De Souza CEC (2021) What if AI is not that fair? Understanding the impact of fear of algorithmic bias and AI literacy on information disclosure [Master’s thesis, Handelshøyskolen BI]. https://hdl.handle.net/11250/2826778
Dietvorst BJ, Simmons JP, Massey C (2015) Algorithm aversion: people erroneously avoid algorithms after seeing them err. J Exp Psychol: Gen 144(1):114–126. https://doi.org/10.1037/xge0000033
Doval E, Viladrich C, Angulo-Brunet A (2023) Coefficient alpha: the resistance of a classic. Psicothema 35(1):5–20. https://doi.org/10.7334/psicothema2022.321
Feuerriegel S, Hartmann J, Janiesch C, Zschech P (2024) Generative AI. Bus Inf Syst Eng 66(1):111–126. https://doi.org/10.1007/s12599-023-00834-7
Franke T, Attig C, Wessel D (2019) A personal resource for technology interaction: development and validation of the Affinity for Technology Interaction (ATI) scale. Int J Hum–Comput Interact 35(6):456–467. https://doi.org/10.1080/10447318.2018.1456150
Gil de Zúñiga H, Goyanes M, Durotoye T (2024) A scholarly definition of artificial intelligence (AI): advancing AI as a conceptual framework in communication research. Political Commun 41(2):317–334. https://doi.org/10.1080/10584609.2023.2290497
Hermann E (2022) Artificial intelligence and mass personalization of communication content—An ethical and literacy perspective. N Media Soc 24(5):1258–1277. https://doi.org/10.1177/14614448211022702
Heyder T, Posegga O (2021) Extending the foundations of AI literacy. In ICIS 2021 Proceedings, 9. https://aisel.aisnet.org/icis2021/is_future_work/is_future_work/9
Hornberger M, Bewersdorff A, Nerdel C (2023) What do university students know about Artificial Intelligence? Development and validation of an AI literacy test. Comput Educ Artif Intell 5:100165. https://doi.org/10.1016/j.caeai.2023.100165
Hossain Z, Biswas MS, Khan NA, Khan G (2025) Artificial intelligence literacy among South Asian library and information science students: socio-demographic influences and educational implications. IFLA J. https://doi.org/10.1177/03400352251331468
Hu L, Bentler PM (1999) Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Modeling Multidiscip J 6(1):1–55. https://doi.org/10.1080/10705519909540118
Huang C, Zhang Z, Mao B, Yao X (2022) An overview of artificial intelligence ethics. IEEE Trans Artif Intell 4(4):799–819. https://doi.org/10.1109/TAI.2022.3194503
Hwang HS, Zhu LC, Cui Q (2023) Development and validation of a digital literacy scale in the artificial intelligence era for college students. KSII Trans Internet Inf Syst 17(8):2241–2258. https://doi.org/10.3837/tiis.2023.08.016
Kandlhofer M, Steinbauer G (2018) A driving license for intelligent systems. Proc AAAI Conf Artif Intell 32(1):7954–7955. https://doi.org/10.1609/aaai.v32i1.11399
Kim S-W, Lee Y (2022) The artificial intelligence literacy scale for middle school students. J Korea Soc Comput Inf 27(3):225–238. https://doi.org/10.9708/jksci.2022.27.03.225
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174. https://doi.org/10.2307/2529310
Laupichler MC, Aster A, Raupach T (2023) Delphi study for the development and preliminary validation of an item set for the assessment of non-experts’ AI literacy. Comput Educ: Artif Intell 4:100126. https://doi.org/10.1016/j.caeai.2023.100126
Laupichler MC, Aster A, Meyerheim M, et al. (2024) Medical students’ AI literacy and attitudes towards AI: a cross-sectional two-center study using pre-validated assessment instruments. BMC Med Educ 24:401. https://doi.org/10.1186/s12909-024-05400-7
Laupichler MC, Aster A, Schirch J, Raupach T (2022) Artificial intelligence literacy in higher and adult education: a scoping literature review. Comput Educ Artif Intell 3:100101. https://doi.org/10.1016/j.caeai.2022.100101
Laupichler MC, Aster A, Haverkamp N, Raupach T (2023) Development of the “Scale for the assessment of non-experts’ AI literacy”—an exploratory factor analysis. Comput Hum Behav Rep 12:100338. https://doi.org/10.1016/j.chbr.2023.100338
Lezhnina O, Kismihók G (2020) A multi-method psychometric assessment of the affinity for technology interaction (ATI) scale. Comput Hum Behav Rep. 1:100004. https://doi.org/10.1016/j.chbr.2020.100004
Lim WM (2024) A typology of validity: Content, face, convergent, discriminant, nomological, and predictive validity. J Trade Sci Advance online publication. https://doi.org/10.1108/JTS-03-2024-0016
Lintner T (2024) A systematic review of AI literacy scales. npj Sci Learn 9(1):50. https://doi.org/10.1038/s41539-024-00264-4
Long D, Magerko B (2020) What is AI literacy? Competencies and design considerations. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Paper 598, pp. 1–16). ACM. https://doi.org/10.1145/3313831.3376727
Markus A, Carolus A, Wienrich C (2025) Objective measurement of AI literacy: development and validation of the AI competency objective scale (AICOS). arXiv preprint arXiv:2503.12921. https://doi.org/10.48550/arXiv.2503.12921
Mikalef P, Gupta M (2021) Artificial intelligence capability: conceptualization, measurement calibration, and empirical study on its impact on organizational creativity and firm performance. Inf Manag 58(3):103434. https://doi.org/10.1016/j.im.2021.103434
Muñiz J, Fonseca-Pedrero E (2019) Ten steps for test development. Psicothema 31(1):7–16. https://doi.org/10.7334/psicothema2018.291
Ng DTK, Leung JKL, Chu SKW, Qiao MS (2021) Conceptualizing AI literacy: an exploratory review. Comput Educ: Artif Intell 2:100041. https://doi.org/10.1016/j.caeai.2021.100041
Ng TK, Wu W, Chu S, Leung J (2023) Artificial intelligence (AI) literacy questionnaire with confirmatory factor analysis. In: International Conference on Advanced Learning Technologies (ICALT) (pp. 233–235). IEEE. https://doi.org/10.1109/ICALT58122.2023.00074
Nunnally JC, Bernstein I (1978) Psychometric theory. McGraw-Hill
Özden M, Örge Yaşar F, Meydan E (2025) The relationship between pre-service teachers’ attitude towards artificial intelligence (AI) and their AI literacy. Pegem J Educ Instr 15(3):121–131. https://doi.org/10.14527/pegegog.2025.011
Palan S, Schitter C (2018) Prolific.ac—A subject pool for online experiments. J Behav Exp Financ 17:22–27. https://doi.org/10.1016/j.jbef.2017.12.004
Pinski M, Benlian A (2024) AI literacy for users—a comprehensive review and future research directions of learning methods, components, and effects. Comput Hum Behav Artif Hum 2(1):100062. https://doi.org/10.1016/j.chbah.2024.100062
Pinski M, Benlian A (2023) AI literacy—towards measuring human competency in artificial intelligence. In Proceedings of the 56th Hawaii International Conference on System Sciences (pp. 165–175). https://doi.org/10.24251/HICSS.2023.021
Rai A, Constantinides P, Sarker S (2019) Next-generation digital platforms: toward Human-AI hybrids. MIS Q 43(1):3–9
Reyes R, Mariñas JM, Tacang JR, Asis LJ, Sayman JM, Flores CC, Sumatra K (2024) The relationship between attitude towards AI and AI literacy of university students. Int J Multidiscip Stud High Educ 1(1):37–46. https://doi.org/10.70847/587958
Robinette P, Li W, Allen R, Howard AM, Wagner AR (2016) Overtrust of robots in emergency evacuation scenarios. In 11th ACM/IEEE International Conference on Human-Robot Interaction (pp. 101–108). ACM. https://doi.org/10.1109/HRI.2016.7451740
Rosseel Y (2012) lavaan: an R package for structural equation modeling. J Stat Softw 48(2):1–36. https://doi.org/10.18637/jss.v048.i02
Salhab R (2024) AI literacy across curriculum design: Investigating college instructors’ perspectives. Online Learn 28(2):22–47. https://doi.org/10.24059/olj.v28i2.4426
Sarker IH (2022) AI-based modeling: Techniques, applications, and research issues towards automation, intelligent, and smart systems. SN Comput Sci 3(1):158. https://doi.org/10.1007/s42979-022-01043-x
Schüller K (2022) Data and AI literacy for everyone. Stat J IAOS 38(2):477–490. https://doi.org/10.3233/SJI-220941
Sindermann C, Sha P, Zhou M, Wernicke J, Schmitt HS, Li M, Sariyska R, Stavrou M, Becker B, Montag C (2021) Assessing the attitude towards artificial intelligence: Introduction of a short measure in German, Chinese, and English Language. Künstliche Intell 35:109–118. https://doi.org/10.1007/s13218-020-00689-0
Sperling K, Stenberg C-J, McGrath C, Åkerfeldt A, Heintz F, Stenliden L (2024) In search of artificial intelligence (AI) literacy in teacher education: a scoping review. Comput Educ Open 6:100169. https://doi.org/10.1016/j.caeo.2024.100169
Stolpe K, Hallström J (2024) Artificial intelligence literacy for technology education. Comput Educ Open 6:100159. https://doi.org/10.1016/j.caeo.2024.100159
Su J, Yang W (2024) AI literacy curriculum and its relation to children’s perceptions of robots and attitudes towards engineering and science: an intervention study in early childhood education. J Comput Assist Learn 40(1):241–253. https://doi.org/10.1016/j.chbah.2024.100062
Tan Q, Tang X (2025) Unveiling AI literacy in K-12 education: a systematic literature review of empirical research. Interactive Learn Environ 1–17. https://doi.org/10.1080/10494820.2025.2482586
UNESCO Institute for Statistics (n.d.) Literacy. https://uis.unesco.org/node/3079547 (accessed 4.9.24)
Wang B, Rau P-LP, Yuan T (2023) Measuring user competence in using artificial intelligence: validity and reliability of artificial intelligence literacy scale. Behav Inf Technol 42(9):1324–1337. https://doi.org/10.1080/0144929X.2022.2072768
Weber P, Pinski M, Baum L (2023) Toward an objective measurement of AI literacy. In PACIS 2023 Proceedings, Paper 60. https://aisel.aisnet.org/pacis2023/60
West SM, Whittaker M, Crawford K (2019) Discriminating systems: Gender, race and power in AI. AI Now Institute. https://ainowinstitute.org/discriminatingsystems.html
Xia Q, Chiu TKF, Chai CS (2023) The moderating effects of gender and need satisfaction on self-regulated learning through Artificial Intelligence (AI). Educ Inf Technol 28:8691–8713. https://doi.org/10.1007/s10639-022-11547-x
Xia Y, Yang Y (2019) RMSEA, CFI, and TLI in structural equation modeling with ordered categorical data: the story they tell depends on the estimation methods. Behav Res Methods 51:409–428. https://doi.org/10.3758/s13428-018-1055-2
Xu R, Sun Y, Ren M, Guo S, Pan R, Lin H, Sun L, Han X (2024) AI for social science and social science of AI: A survey. Inf Process Manag 61(3):103665. https://doi.org/10.1016/j.ipm.2024.103665
Yim IHY (2024) Artificial intelligence literacy in primary education: An arts-based approach to overcoming age and gender barriers. Comput Educ Artif Intell 7:100321. https://doi.org/10.1016/j.caeai.2024.100321
Yim IHY, Su J (2024) Artificial intelligence (AI) learning tools in K-12 education: a scoping review. J Comput Educ 1–39. https://doi.org/10.1007/s40692-023-00304-9
Yuan CW, Tsai HYS, Chen YT (2024) Charting competence: a holistic scale for measuring proficiency in artificial intelligence literacy. J Educ Comput Res. https://doi.org/10.1177/07356331241261206
Zhang H, Perry A, Lee I (2024) Developing and validating the artificial intelligence literacy concept inventory: an instrument to assess artificial intelligence literacy among middle school students. Int J Artif Intellig Educ 1–41. https://doi.org/10.1007/s40692-023-00304-9
Zhang S, Ganapathy Prasad P, Schroeder NL (2025) Learning about AI: a systematic review of reviews on AI literacy. J Educ Comput Res 1–31. https://doi.org/10.1177/07356331251342081
Acknowledgements
We want to thank the experts who provided feedback on an earlier version of our scale, namely Ruxandra Cojocaru, Eduard Fosch Villaronga, Mohammad Neamul Kabir, Ivan Savin, Eduardo Garcia-Garzon, Anto Čartolovni and Gianluigi Riva. Data collection was funded by the Research Council of Norway, project 275347 “Future Ways of Working in the Digital Economy”.
Author information
Authors and Affiliations
Contributions
All authors wrote the main manuscript. All authors were involved in the data collection. MTS-S initiated the project and had the overall idea for the conceptualization. AA-B was responsible for the data analysis and reporting. CL was responsible for securing funding for the data collection. All authors reviewed and approved the manuscript. All authors contributed to the revisions.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
This study received approval from the Institutional Review Board (IRB) of the National University of Singapore (NUS) under protocol number NUS-IRB-20221222, granted on December 12, 2022. It was conducted in accordance with NUS ethical guidelines for research involving human participants and complied with all relevant regulations and standards, including the Declaration of Helsinki. The approval covered the administration of online expert and participant surveys, including procedures for recruitment, informed consent, data collection, reimbursement and analysis, data handling and storage, as well as the use of anonymized responses and quotations in academic publications and presentations.
Informed consent
The informed consent procedure for this study was approved by the IRB of the National University of Singapore (NUS), also under the mentioned protocol (NUS-IRB-20221222) and approved on December 12, 2022. Prior to data collection, all expert and participant survey respondents were informed about the study’s aims, the minimal risks involved, the anonymous nature of their participation, the reimbursement, their right to withdraw at any time, and the procedures for secure data handling and storage. Consent was indicated by their voluntary decision to proceed with the online survey after receiving this information. No vulnerable individuals were involved in the study.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Soto-Sanfiel, M.T., Angulo-Brunet, A. & Lutz, C. The scale of artificial intelligence literacy for all (SAIL4ALL): assessing knowledge of artificial intelligence in all adult populations. Humanit Soc Sci Commun 12, 1618 (2025). https://doi.org/10.1057/s41599-025-05978-3
