Background & Summary

We introduce a novel behavioral database containing accuracy and speed data on visual word recognition in healthy Spanish-speaking adults, covering an extensive and diverse set of Spanish verbs that virtually comprises the entire Spanish verb lexicon. To collect these data, we used a visual lexical decision task, in which participants were presented with a letter string on a computer screen and pressed a key to indicate whether the string was a real Spanish word. In our case, the words were verbs and the pseudowords were pseudoverbs. The open-access data files encompass both raw and processed data on word recognition accuracy and latencies for 4,565 verbs and 4,565 pseudoverbs. Word prevalence scores are also included in the dataset.

Orthographic transparency in reading

Orthographic transparency plays a crucial role in reading research, revealing distinct effects among languages, particularly those that vary along the continua of orthographic transparency and metrical systems. Readers of deep orthographies tend to rely more on lexical-semantic processing, whereas readers of shallow orthographies engage more extensively in sublexical phonological decoding, although reading in any orthography involves the joint activation of both routes1,2,3. Notably, English exemplifies a rather opaque orthography, characterized by highly inconsistent grapheme-phoneme correspondences, a lack of consistent stress assignment, and unclear syllabic segmentation. In contrast, Spanish features an almost fully transparent orthographic system with highly consistent grapheme-phoneme correspondences. Stress position and syllabification can be reliably determined from orthography thanks to clear stress rules and written stress marks, which also signal irregular stress patterns when applicable. Given these variations, effects observed in word processing in one language may not readily generalize to others. Consequently, language-specific behavioral studies on word reading are essential to further our understanding of visual word recognition across orthographies.

Differential effects in nouns and verbs

Some studies4,5,6 have reported behavioral and neuropsychological dissociations between verb and noun processing, indicating faster recognition latencies for nouns than for verbs. This noun advantage cannot be attributed solely to variables such as frequency, word form, or morphological complexity7; rather, it is more closely associated with semantic dimensions8. Drawing on neuropsychological evidence, the theoretical framework known as embodiment or semantic grounding posits that the semantics of action words, essentially verbs, are embodied in the sensorimotor neural circuits activated during the performance of the actions in question9,10. Moreover, lexical decision times have rarely been analyzed by grammatical category (verb vs. noun) in Spanish, so the question naturally arises of whether results obtained with nouns can be directly generalized to words of other grammatical categories, such as verbs. New data on verb recognition can therefore contribute to more in-depth analyses of the convergences and divergences in verb/noun processing.

Study of infinitive and inflected forms of verbs

The relevance of the infinitive form lies in its status as the standard lemma used in Spanish dictionaries and its widespread use in a substantial body of studies involving isolated word processing and normative research on psycholinguistic variables11,12,13. However, there is also empirical work examining the processing of inflected, finite, and non-finite verb forms, both in isolation and in sentence contexts14,15,16,17. These studies have received comparatively limited attention, highlighting the need for further research on the cognitive processing of verb morphology beyond the infinitive.

Databases on visual lexical decision

Methodological and theoretical advantages of large databases have been highlighted in numerous previous works18,19,20. Essentially, megastudies enable the exploration or confirmation of the contribution of a large number of factors to behavioral data, ensuring the meaningful generalizability of findings across an entire language. The noteworthy English Lexicon Project21, featuring chronometric data on word naming and visual lexical decision for 40,481 American English words, has counterparts in the lexical decision task for Dutch with 14,000 words22, French with 38,840 words23, British English with 28,730 words24, and Spanish with over 45,000 words25. These, along with other ongoing megastudies, are listed on the Lexique website26.

To date, only one previous laboratory-based study27 has reported visual recognition times for a large number of Spanish words (see details in Table 1). The SPALEX database25 gathered data from an online lexical decision task in which participants responded to 100 items each, comprising 70 words and 30 pseudowords, without any time constraints. After filtering for participants from Spain, there were 1,048,576 observations, of which 809,592 (77.2%) were correct responses. After selecting only correct responses with RTs between 200 and 2000 ms, 723,555 observations remained (i.e., 8.9% were considered outliers), with RT statistics reported in Table 1. A more recent online study12 collected and made available lexical decision RTs for a large word set (see also Table 1 for details). It is worth noting that, although SPALEX virtually covers the entire Spanish lexicon and was released in 2018, its impact has been relatively limited compared to comparable megastudies in English21,24, French23, and Dutch22. Only a few published papers have made effective use of SPALEX data28,29,30. One possible reason for this relatively low uptake is that SPALEX does not actually provide reliable data for the full set of words it includes. First, as shown above, the dataset exhibits concerningly high mean RTs compared with other Spanish megastudies12,27, which could reduce sensitivity to experimental effects31. Second, of the 83,622 participants in the SPALEX subsample from Spain, only 53,014 met the criterion of 80% accuracy with at least 20 items responded to (i.e., 36.6% of participants scored below that threshold). Participants with low accuracy may have failed to attend properly to the task, lacked sufficient motivation, misunderstood the task instructions, or been affected by other factors that render their responses unreliable as measures of lexical access and processing. These concerns are especially relevant in crowdsourced experiments, where researchers cannot directly monitor testing conditions. The 80% accuracy threshold has been widely adopted in other lexical decision megastudies12,22,23,24, where participant exclusion rates ranged from 6.4%12 to 13%24. Third, when applying an 80% accuracy threshold per participant and requiring at least 10 responses per item, SPALEX provides valid RT observations for 3,571 of the 4,562 infinitive verbs present in the word set of the current study (i.e., 78.2% of the verbs analyzed).
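For illustration, the participant- and item-level filters just described can be sketched in a few lines of pandas code; the column names here (participant, item, accuracy, rt) are hypothetical placeholders, not SPALEX's actual field names or analysis code.

```python
# A minimal pandas sketch of the filters just described (hypothetical
# column names; not SPALEX's actual field names or analysis code).
import pandas as pd

def filter_spalex_style(trials: pd.DataFrame) -> pd.DataFrame:
    # Keep participants with at least 20 responses and >= 80% accuracy.
    stats = trials.groupby("participant")["accuracy"].agg(["count", "mean"])
    keep = stats[(stats["count"] >= 20) & (stats["mean"] >= 0.80)].index
    trials = trials[trials["participant"].isin(keep)]
    # Keep correct responses with latencies between 200 and 2000 ms.
    trials = trials[(trials["accuracy"] == 1) & trials["rt"].between(200, 2000)]
    # Keep items that retain at least 10 valid responses.
    counts = trials["item"].value_counts()
    return trials[trials["item"].isin(counts[counts >= 10].index)]
```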

Table 1 Summary of previous mega-studies on visual lexical decision in Spanish.

These three databases12,25,27 will be utilized to validate the current database.

Word prevalence

Word prevalence estimates the proportion of people familiar with a word, calculated from accuracy in a lexical decision task. This relatively new variable32 exerts a significant facilitating effect on word recognition and production latencies, beyond the impact of other well-established variables such as word frequency, age-of-acquisition, and length. Despite its importance, prevalence has only been studied as a predictive factor in Dutch32 and English33. It has been suggested32,33 that word prevalence operates similarly to subjective frequency, serving as a measure that compensates for inaccuracies in word frequency estimates based on text corpus counts. The word prevalence effect can therefore be interpreted in a manner akin to the frequency effect. Additionally, word prevalence has been used as an estimator of word knowledge34.

Utility

New data on accuracy, chronometric records, and prevalence scores for an extensive list of verbs and pseudoverbs can streamline stimulus selection for other researchers, enabling the creation of new virtual or pilot experiments and the exploration of novel research questions. Another use is replication and the search for convergent evidence for previous results, both of which require new data and improved methods. Moreover, the existence of SpaVerb-WN35, a homologous database containing word reading accuracy and RTs for the same set of words, facilitates comparative studies on verb reading and recognition in Spanish. The open availability of these data is also a valuable resource for scientists working on language processing, attention, and memory, as well as for the development of computational models of word recognition. At the clinical level, this database would be particularly useful for selecting materials for rehabilitation, because both accuracy and response time (RT) data help determine which words may pose greater or lesser challenges for patients with language impairments.

Methods

Participants

A total of 267 native Spanish speakers participated in this study, nine of them from Hispanic American countries and the remainder from Spain. Participants had a mean age of 21.1 years (range: 18–51; SD: 3.8), and 228 (84%) were women. All were enrolled as undergraduate, master’s, or doctoral students at the Faculty of Psychology of the University of Murcia (N = 162) or the University of Oviedo (N = 105) in Spain. Participants took part in the experiment either for academic credit or as volunteers. Each participant first completed and signed the informed consent form, filled out a short questionnaire about their age, gender, handedness, and native language, and then proceeded to the task. All participants had normal or corrected-to-normal vision and did not exhibit reading, speech, or neurological disorders at the time of the task. Each participant was assigned a participant number (not derived from any personal data) in order to dissociate their task responses from any personally identifying data. Once compensation for participation had been assigned, all individual identification data were destroyed. The study was approved by the ethics committee of the University of Murcia (reference number 3105/2020).

We employed two methods to estimate the total number of required responses. First, we conducted an a priori simulation-based power analysis for linear mixed models following the guidelines provided by Kumle et al.36. We took their Scenario 1 as our starting point, in which the simulation is based on a well-powered design. The study chosen as the starting point35 contains reading-aloud times for the same set of verbs analyzed in the present study. We assumed that the significant effects found in that study35 through multiple regression analysis can be extrapolated to any similar analysis with the present data. We initially generated a linear mixed-effects model (LMM) that included all the significant effects previously identified in visual lexical decision25,27,33, namely: word age-of-acquisition (AoA)11, frequency (on the Zipf scale)37, neighborhood size (i.e., number of orthographic substitution neighbors)37, length (i.e., number of letters), and the interactions between AoA and motor content (i.e., ratings of the amount of mobility that the action described by the verb entails)13, between AoA and length, between frequency and length, between frequency and neighborhood size, and between frequency and AoA. All variables were standardized (z-scored). The terms for the interactions between length and frequency and between frequency and AoA were not significant (at p < 0.05), so we fitted a new model without those terms, in which all fixed factors were significant. With this model, we simulated 300 replications of the experiment at four different sample sizes (20, 25, 30, and 35 responses, i.e., participants, per item) with a criterion t-value of 2.000 (α ≈ 0.05). Results are shown in Table 2 and demonstrate that a sample size of 25 or more participants per item can replicate most of the effects found in the reference model (4 out of 7 effects with at least 87% power, and 5 out of 7 with at least 79%). From a cost-benefit perspective, further increasing the sample size is an inefficient way to raise the power for the remaining effects to acceptable levels. The power analysis aligns with common practice in the field, which advocates a minimum of 20–25 responses per item in databases with thousands of items.
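The logic of this simulation-based approach can be illustrated with a deliberately simplified sketch. Our actual analysis fitted linear mixed models following Kumle et al.36 (whose accompanying tools are R packages); the toy Python version below instead simulates item-level data under assumed standardized effect sizes, refits an ordinary regression on each of 300 replications, and counts how often each |t| exceeds the 2.0 criterion.

```python
# Simplified illustration of simulation-based power estimation (not the
# mixed-model analysis actually used): simulate, refit, count |t| >= 2.0.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2024)
n_items, n_reps = 500, 300
# Assumed fixed effects (intercept plus two standardized predictors);
# the real reference model had seven terms.
betas = np.array([0.0, -0.15, 0.05])

def simulated_power(n_per_item: int) -> np.ndarray:
    hits = np.zeros(len(betas))
    for _ in range(n_reps):
        X = sm.add_constant(rng.standard_normal((n_items, 2)))
        # Item means get noisier when fewer responses are averaged per item.
        item_noise = rng.standard_normal(n_items) / np.sqrt(n_per_item)
        y = X @ betas + 0.5 * rng.standard_normal(n_items) + item_noise
        hits += np.abs(sm.OLS(y, X).fit().tvalues) >= 2.0
    return hits / n_reps  # proportion of significant replications per term

for n in (20, 25, 30, 35):
    print(n, simulated_power(n).round(2))
```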

Table 2 Percentage of significant replications detected in the power analysis.

Complementary to the power analysis, we also took into account the recommendations of Brysbaert and Stevens38, who advocate registering a minimum of 1,600 observations per condition in repeated-measures designs. Although our study does not follow a repeated-measures design, this guideline is informative for evaluating the potential of the database for future simulation-based analyses. In our scenario, encompassing 4,565 items and a potential sample of 25 participants per item, we would amass over 114,000 observations. Factoring in an estimated 10% loss of observations after data trimming (i.e., roughly 102,000 valid responses), a prediction model with up to 64 parameters could be constructed (for example, when simulating a repeated-measures experiment, assuming that the relevant variables are approximately evenly distributed across experimental conditions and not nested), thus ensuring a minimum of 1,600 observations per parameter.
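The arithmetic behind these figures is straightforward:

```python
# Observation-count arithmetic from the paragraph above.
n_items, n_per_item = 4565, 25
total = n_items * n_per_item   # 114,125 raw observations
valid = int(total * 0.90)      # ~102,712 left after an estimated 10% trimming loss
max_params = valid // 1600     # 64 parameters at >= 1,600 observations each
print(total, valid, max_params)  # 114125 102712 64
```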

Materials

We employed a set of 4,565 Spanish verbs in either the infinitive form (ending in -ar, -er, -ir, such as abrazar ‘to hug’, peinar ‘to comb’, fallecer ‘to die’, suceder ‘to happen’, abrir ‘to open’, or rendir ‘to surrender’) or pronominal form (i.e., infinitive + the reflexive pronoun -se, such as asustarse ‘to get scared’ or peinarse ‘to comb one’s hair’), sourced from a motor-content database13. The complete set of terms also possessed pre-existing scores for the most prominent psycholinguistic variables, including AoA11, word frequency (Zipf), length (letters), neighborhood size37, and first-syllable frequency35.

It is worth noting that infinitive forms in Spanish offer a lexically rich yet morphologically neutral representation of verbal meaning. Linguistically, the infinitive consists of two components: a lexical-semantic base (the root) and a minimal grammatical marker (the inflectional suffix -ar, -er, or -ir). The Spanish infinitive lacks morphological marking for gender, number, case, person, mood, tense, and aspect39. Unlike fully inflected forms (such as cantamos, where -amos encodes specific grammatical features like person, number, and tense), the infinitive conveys the core conceptual meaning of the action (e.g., cantar ‘to sing’) without indicating who performs the action, when it takes place, or how it unfolds. As for the selected pronominal forms, pronominal infinitives consist of an infinitive verb followed by the unstressed reflexive pronoun -se (e.g., cansarse ‘to get tired’, levantarse ‘to get up’). The pronoun -se does not function as a syntactic argument but serves as a morphosyntactic marker that is part of the verb’s lexical entry. While it agrees in person and number with the subject when the verb is conjugated, in the infinitive it remains invariable and attached to the verb stem. Pseudowords were created with our own algorithm40. The objective was to create pseudoverbs with orthographic and morphological properties similar to those of the verbs employed. Essentially, the procedure is grounded in trigram frequency41. The corpus used for calculating type and token frequencies of trigrams was the list of 4,565 verbs itself. We then used the trigram sequence of a verb as an input seed to generate a pool of potential pseudoverbs matched in length (i.e., number of letters) with the given verb. For each verb, a list of pseudoverb candidates was generated by sequencing all possible combinations of mutually consistent trigrams. The Euclidean distance between the vector of trigram frequencies of the verb and that of each candidate pseudoverb was then calculated. The candidate with the minimum distance was selected, provided that it was not an existing word or a pseudohomophone in any Spanish dialect42 and that it did not have an acceptable meaning derivable from morphological rules (e.g., despactar could mean to undo a pact). Pseudoverbs rejected on these criteria were replaced by the next pseudoverb in the Euclidean distance ranking. Some of the pseudoverbs used in the experiment were poritar, disturar, protoner, zamper, recarcir, acorrir, delearse, and asoclarse.
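The sketch below re-implements the core idea of the generation procedure in Python on a toy corpus (it is not the published algorithm40, and the tiny verb list is only for demonstration): trigram token frequencies are counted over the verb list itself, length-matched candidates are built by chaining trigrams that overlap in two letters, and the candidate whose positional trigram-frequency vector lies closest in Euclidean distance to the seed verb’s is selected.

```python
# Toy re-implementation of the pseudoverb-generation idea (not the
# published algorithm40; the six-verb corpus is for demonstration only).
from collections import Counter, defaultdict
import math

verbs = ["abrazar", "peinar", "fallecer", "suceder", "abrir", "rendir"]
lexicon = set(verbs)  # in practice: full word lists plus pseudohomophones

def trigrams(word):
    return [word[i:i + 3] for i in range(len(word) - 2)]

freq = Counter(t for v in verbs for t in trigrams(v))  # token trigram counts
by_prefix = defaultdict(list)
for t in freq:
    by_prefix[t[:2]].append(t)  # index trigrams by their first two letters

def candidates(length, limit=50_000):
    """Strings of the given length built from mutually consistent trigrams."""
    found = set()
    def grow(s):
        if len(found) >= limit:
            return
        if len(s) == length:
            found.add(s)
            return
        for t in by_prefix.get(s[-2:], ()):
            grow(s + t[2])
    for t in freq:
        grow(t)
    return found

def freq_vector(word):
    return [freq[t] for t in trigrams(word)]

def make_pseudoverb(verb):
    pool = [c for c in candidates(len(verb)) if c not in lexicon]
    return min(pool, default=None,
               key=lambda c: math.dist(freq_vector(verb), freq_vector(c)))

print(make_pseudoverb("peinar"))  # e.g., a length-matched pseudostring
```

With the full 4,565-verb corpus the candidate space is far richer, and the dialect and morphology checks described above would further prune the pool.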

The selected verbs and pseudoverbs were randomly divided into 18 blocks, each containing 506 or 508 items (50% verbs and 50% pseudoverbs), and these blocks were distributed among the participants. Additionally, two extra verbs and two pseudoverbs were used as warm-up trials at the start of the task. Furthermore, 10 additional pseudoverbs were presented as fillers, one at a time, at the beginning of each round following the breaks during the task.

Procedure

The lexical decision task was conducted in soundproof booths at the laboratories of the Faculties of Psychology of the University of Oviedo and the University of Murcia. Item presentation and response recording were performed with the DMDX software43. Data collection took place in individual sessions. Participants were seated approximately 60 cm from the monitor (15.6 to 17 inches), on which items were presented one by one in lowercase 11-point Arial letters, in black on a white background. They were instructed to respond as quickly and accurately as possible. Right-handed participants pressed the ‘M’ key for a word (i.e., verb) response and the ‘Z’ key for a pseudoword (i.e., pseudoverb) response, while left-handed participants did the opposite. Verbs were therefore always responded to with the dominant hand, in order to optimize accuracy and minimize the RT of verb responses.

The sequence of events in each experimental trial began with an asterisk serving as a fixation point at the center of the screen for 500 ms. Immediately afterward, an item appeared and remained on the screen until the participant pressed a key or until 1500 ms had elapsed (see Fig. 1). Four warm-up trials were presented at the beginning of each block. Each participant then encountered the experimental items in a randomized order, with breaks every 50 stimuli whose duration the participants decided themselves. The average duration of the task was around 20 minutes. Fifty participants completed one block, 212 completed two blocks, 7 completed three, and only one completed four. Each participant performed a maximum of one block per day, and each block was presented to at least 25 participants.

Fig. 1

Sequence and timing of events in each experimental trial of the lexical decision task. After a fixation point, a verb (e.g., asesorar, ‘to advise’) or a pseudoverb (e.g., recidinar) was randomly shown on the screen. The figure shows the response-key configuration for right-handed participants; left-handed participants had the opposite configuration. Response time (RT) is the interval between t0 and tr.

Prevalence calculation

The lexical prevalence calculation was based on item response theory to correct for response bias in word recognition (i.e., correctly responding ‘word’ to an item one does not actually know) and followed the procedure described by Keuleers et al.32. We employed a logistic LMM with accuracy as the criterion variable. In this model, lexicality (i.e., word or pseudoword) was included as a fixed effect, and the intercepts of participants, items, and blocks, along with the lexicality slope for participants, constituted the random structure. Including these slopes allows the model to account for individual differences in response bias.

Word prevalence was determined as the mean of the predicted scores—in terms of accuracy probability—of the model for each verb, essentially representing the mean item difficulty derived from the model. The prevalence probability for each word reflects a population-level tendency to recognize the word, independent of individual participant bias. This improves the precision and generalizability of the prevalence estimates. Additionally, the logit transformation of the predicted values was calculated, as this variable exhibits lower skewness.
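A sketch of this model, using the pymer4 package (a Python interface to R’s lme4) and the column names of the raw data file described in Data Records, could look as follows; treat it as an approximation of the procedure rather than the exact analysis script.

```python
# Sketch of the prevalence model (an approximation, not the exact script),
# using pymer4 and the raw-file column names described in Data Records.
import pandas as pd
from pymer4.models import Lmer
from scipy.special import logit

df = pd.read_csv("SpaVerb-LD_raw_database.csv")

# Logistic LMM: lexicality (item_type) as fixed effect; random intercepts
# for participants, items, and blocks; by-participant lexicality slopes to
# absorb individual differences in response bias.
model = Lmer(
    "ACC ~ item_type + (1 + item_type | participant) + (1 | item) + (1 | block)",
    data=df,
    family="binomial",
)
model.fit()

# Word prevalence: mean predicted accuracy probability per verb, plus its
# less-skewed logit transform (model.fits holds the fitted probabilities).
df["fit"] = model.fits
words = df[df["item_type"] == "word"]
preval_prob = words.groupby("item")["fit"].mean()
preval_logit = logit(preval_prob)
```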

Data Records

The raw dataset44 derived from the lexical decision task is openly accessible in CSV and XLSX formats (see the files labeled ‘SpaVerb-LD_raw_database’ at https://doi.org/10.17605/OSF.IO/5CPSH). Each row in the file represents the response of a particular participant to a specific item. The column headings and contents are as follows: participant, a string denoting a unique individual code (codes starting with ‘UMU’ identify participants at the University of Murcia and those starting with ‘OVI’ participants at the University of Oviedo); hand, a letter indicating the individual’s dominant hand (L = left; R = right); age, a number indicating the individual’s age (in years); gender, a letter indicating the individual’s gender (M = man; W = woman); native, a number indicating the individual’s native dialect of Spanish (1 = European Spanish, 2 = American Spanish); block, a number indicating the block of items (from 1 to 18) to which the item belongs; part_block, a composite string concatenating participant, an underscore, and block; item_id, a number indicating a unique item identification code (codes beginning with 1 are verbs, and those beginning with 9 are pseudoverbs); item, a string with the spelling of the verb or pseudoverb; item_type, a string indicating the type of item, either word (i.e., verb) or pseudoword (i.e., pseudoverb); ACC, a number encoding response accuracy (1 = correct, 0 = incorrect); RT, a number indicating the RT (in ms) as coded by DMDX (positive latencies correspond to correct responses, negative latencies to incorrect responses, and ‘−1500’ to timeout responses); preval_prob, a number indicating the prevalence probability estimated from the LMM described in Prevalence calculation; preval_logit, the logit transformation of preval_prob; preval_prob_item, the mean preval_prob for a specific item (i.e., word prevalence); and preval_logit_item, the mean preval_logit for a specific item (i.e., logit word prevalence).
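As a usage example, the DMDX RT convention described above can be decoded when loading the raw file (file name as published; the derived column names are our own):

```python
# Loading sketch for the raw file: decoding the DMDX RT convention
# (positive RT = correct, negative RT = incorrect, -1500 = timeout).
# The derived column names (timeout, rt_ms) are our own.
import pandas as pd

raw = pd.read_csv("SpaVerb-LD_raw_database.csv")
raw["timeout"] = raw["RT"] == -1500          # no response within 1500 ms
raw["rt_ms"] = raw["RT"].abs()               # latency regardless of accuracy
correct_rts = raw.loc[(raw["ACC"] == 1) & ~raw["timeout"], "rt_ms"]
print(correct_rts.describe())
```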

The file with the filtered RTs, as described below under data cleaning, is also available in CSV and XLSX formats (see the files labeled ‘SpaVerb-LD_RT_filtered’ at https://doi.org/10.17605/OSF.IO/5CPSH). Each row in the file represents the response of a particular participant to a specific item. The column headings and contents are the same as in the raw data file described above, except for the additional column RT_filtered, which contains the RT (in ms) of correct responses with outliers filtered out.

Technical Validation

We initially outline the data cleaning procedure for subsequent analyses. Following that, reliability and validity analyses of the dataset44 are presented.

Basic data-cleaning

A total of 252,879 responses were collected from 267 participants for the 4,565 verbs and an equal number of pseudoverbs. Each response was automatically categorized as correct or incorrect by the software. We computed the percentage of correct responses for each of the 499 individual records (i.e., each participant’s performance in a given block of items; M = 82.6%, Mdn = 83.5%, SD = 6.7, Min = 36.0%, Max = 94.4%) and identified as outliers those records with accuracy values more than 1.5 interquartile ranges below the median of the entire sample. As a result, eight records (comprising 4,008 responses, 1.6% of the total) were excluded from the main dataset44. The highest accuracy among the removed records was 66.8%. The remaining 248,871 responses, grouped into 491 individual records with a minimum of 24 responses per item, constitute the raw data records of accuracy and RT.

To facilitate data management for researchers focusing specifically on lexical decision times, a new dataset44 was generated by keeping only correct responses and applying cut thresholds common in psycholinguistics. First, the mean RT (in ms) of correct responses was computed for each individual record (M = 786, Mdn = 786, SD = 103, Min = 546, Max = 1134) in the raw data records. Then, 9,564 responses with extremely fast (RT < 200 ms or 2 SD below the record’s mean) or slow (RT > 1500 ms or 2 SD above the record’s mean) latencies were excluded from the dataset44, accounting for 4.6% of the total correct responses. Following these steps, the filtered dataset44 comprised 197,610 recognition latencies for verbs and pseudoverbs.
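For reference, the two trimming steps can be reproduced approximately with the following pandas sketch (not the original script; it assumes the raw-file columns described in Data Records):

```python
# Approximate reproduction of the two trimming steps (not the original
# script), assuming the raw-file columns described in Data Records.
import pandas as pd

raw = pd.read_csv("SpaVerb-LD_raw_database.csv")

# Step 1: drop individual records (participant x block) whose accuracy
# falls more than 1.5 interquartile ranges below the sample median.
acc = raw.groupby("part_block")["ACC"].mean()
q1, q3 = acc.quantile([0.25, 0.75])
cutoff = acc.median() - 1.5 * (q3 - q1)
raw = raw[raw["part_block"].isin(acc[acc >= cutoff].index)]

# Step 2: keep correct responses only, then drop latencies outside
# 200-1500 ms or beyond 2 SDs of each record's own mean RT.
correct = raw[raw["ACC"] == 1].copy()
g = correct.groupby("part_block")["RT"]
m, s = g.transform("mean"), g.transform("std")
keep = correct["RT"].between(200, 1500) & correct["RT"].between(m - 2 * s, m + 2 * s)
filtered = correct[keep]
```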

The files containing trial-level data, one with raw response accuracy and latencies and the other with RT from only correct responses and without outliers, are available in the data records section.

Reliability analyses

We calculated the standard error of the mean (SEM) and the coefficient of variation as precision measures for the accuracy rate and RT in each block of items (see Tables 3, 4). The SEM ranged from 0 to 0.1 by items and from 0.05 to 0.06 by blocks for accuracy rate, while for RT it ranged from 3.6 to 301.6 by items and from 38.5 to 47.8 by blocks. The mean standard error values were highly homogeneous across blocks for both accuracy and RT. Similarly, coefficients of variation varied minimally across blocks, ranging from 0.40 to 0.49 for accuracy and from 0.23 to 0.26 for RT.

Table 3 Standard error, coefficient of variation, and ICC values for accuracy rate in each block of items.
Table 4 Standard error, coefficient of variation, and ICC values for RT in each block of items.

We additionally assessed the reliability of the accuracy and RT measures by calculating intraclass correlation coefficients (ICC) for all blocks of items (see Tables 3, 4). Across all blocks, only good and excellent ICC values were obtained, with minimum mean ICCs of 0.87 for accuracy and 0.85 for RT (each in a single block).
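A sketch of how such block-level reliability measures can be computed from the filtered file is shown below. The ICC variant shown (a two-way, average-measures ICC via the pingouin package) is an assumption for illustration; note that the design is balanced within blocks, since every participant assigned to a block responded to all of its items.

```python
# Sketch of the block-level reliability measures for RT (the same logic
# applies to accuracy). The ICC variant shown is an illustrative choice.
import pandas as pd
import pingouin as pg

filtered = pd.read_csv("SpaVerb-LD_RT_filtered.csv")

for block, d in filtered.groupby("block"):
    by_item = d.groupby("item")["RT_filtered"].agg(["mean", "std", "count"])
    sem = (by_item["std"] / by_item["count"] ** 0.5).mean()  # mean SEM by items
    cv = (by_item["std"] / by_item["mean"]).mean()           # mean CV by items
    icc = pg.intraclass_corr(data=d, targets="item", raters="participant",
                             ratings="RT_filtered", nan_policy="omit")
    icc2k = icc.set_index("Type").loc["ICC2k", "ICC"]
    print(block, round(sem, 1), round(cv, 2), round(icc2k, 2))
```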

In summary, the data reveal consistently high accuracy and reliability scores for each block of items.

Validity analyses

To assess criterion validity, we compared our data with those reported in previous studies. For accuracy, means by items correlated at r(3562) = 0.618, p < 0.001, with their equivalents (after retaining only Spanish participants) from SPALEX25. The correlation between mean RT by items from that study (after retaining only Spanish participants and RTs between 200 and 2000 ms) and our data (after applying the cut thresholds described under data cleaning) was r(3556) = 0.421, p < 0.001. The correlation between mean RT by items reported in the prior laboratory-based megastudy on lexical decision27 and our data (same cut thresholds applied) was r(529) = 0.665, p < 0.001. When mean accuracy and RTs (filtered between 200 and 2000 ms) were correlated by item with those from the other online study12, high correlations were observed: r(1027) = 0.635, p < 0.001, for accuracy, and r(1027) = 0.710, p < 0.001, for RTs.

In addition, three analyses were conducted to assess construct validity. First, the present data were compared with their equivalents in a prior word naming task35, as accuracy and speed in reading words aloud should be positively correlated with those in visual word recognition. The item-level correlations for accuracy rate and mean RT were both moderate in size, r(4560) = 0.457 and r(4560) = 0.353, respectively, both p < 0.001. Second, we checked whether the previously reported relationship between word prevalence and RT (r = −0.53 in Dutch32 and r = −0.51 in English33) was also present in our data. The correlation between our word prevalence values (logit transformation) and mean RT by items (same cut thresholds applied) was r(4565) = −0.472, p < 0.001. Third, we performed a comparative analysis to assess the capacity of the present dataset and three additional lexical decision datasets12,25,27 to detect a set of well-known psycholinguistic effects. To this end, we ran four separate item-level linear regression analyses: one using all items from the present database, and one for each of the other three datasets using the corresponding overlapping items (data from Latin American native Spanish speakers were excluded from this analysis). Each model included 12 standardized (z-scored) predictor variables (seven main effects and five second-order interactions) and used log-transformed RTs to reduce skewness. Results from the ANOVA are summarized in Table 5. The SpaVerb-LD (present study) model showed the highest adjusted R², indicating the best overall explanatory power, and the highest overall mean effect size (0.041). Importantly, the SpaVerb-LD model detected 11 of the 12 effects as significant (the other models detected only 5–6) and was the only model to capture the effect of motor content, a semantic variable that may play a crucial role in verb processing. In contrast, the model based on SPALEX showed the lowest adjusted R² and the lowest overall mean effect size (0.019), indicating the weakest model fit and least accurate predictions among the four.
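The structure of these item-level models can be sketched as follows. The file and predictor names are hypothetical; the five interactions mirror the power-analysis model above, while the particular choice of seven main effects is our assumption for illustration.

```python
# Skeleton of one item-level regression (illustrative only: hypothetical
# file and predictor names; 7 main effects + 5 interactions = 12 predictors).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

items = pd.read_csv("item_level_predictors.csv")        # hypothetical file
preds = ["aoa", "zipf", "neighbors", "length", "syl_freq", "motor", "prevalence"]
items[preds] = (items[preds] - items[preds].mean()) / items[preds].std()  # z-score
items["log_rt"] = np.log(items["mean_rt"])              # reduce RT skewness

model = smf.ols(
    "log_rt ~ aoa * motor + aoa * length + zipf * length"
    " + zipf * neighbors + zipf * aoa + syl_freq + prevalence",
    data=items,
).fit()
print(model.rsquared_adj)      # explanatory power, as compared in Table 5
print(anova_lm(model, typ=2))  # per-effect ANOVA results
```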

Table 5 Comparison of regression models from four different lexical decision time datasets.

Overall, the dataset44 shows solid criterion and construct validity, offering robust and well-balanced regression performance across key criteria: high explanatory power, low prediction error, acceptable collinearity, and the ability to detect a wide range of effects—including semantic variables—making it particularly suitable for lexical-semantic analyses (Table 5).