Main

Image datasets have played a foundational role in the history of AI development, with ImageNet12 enabling the rise of deep learning methods in the early 2010s13. While AI technologies have made tremendous strides in their capabilities and adoption since then, bias in data and models remains a persistent challenge2,14. Inadequate evaluation data can result in fairness and robustness issues, making it challenging to identify potential harms1,10,15. These harms include the perpetuation of racist, sexist and physiognomic stereotypes2,4, as well as the exclusion or misrepresentation of entire populations3,5,16. Such data inadequacies therefore compromise the fairness and accuracy of AI models.

The large-scale scraping of images from the web without consent2,6,17 not only exacerbates issues related to data bias, but can also present legal issues, particularly related to privacy7,18,19 and intellectual property (IP)20. Consequently, prominent datasets have been modified or retracted8. Moreover, the lack of fair compensation for data and annotations presents critical concerns about the ethics of supply chains in AI development21,22.

Datasets made available by government agencies such as NIST23 or using third-party licensed images24 often have similar issues with the absence of informed consent and compensation. Many dataset developers mistakenly assume that using images with Creative Commons licences addresses relevant privacy concerns3. Only a few consent-based fairness datasets with self-reported labels exist25,26,27. However, these datasets have little geographical diversity. They also lack pixel-level annotations, meaning that they can be used for only a small number of human-centric computer vision tasks3.

Evaluating models and mitigating bias are key to ethical AI development. Recent methods such as PASS28, FairFaceVar29 and MultiFair30 aim to reduce demographic leakage or enforce fairness constraints through adversarial training and fairness-aware representations. Previous work has also shown that many face-recognition models and benchmarks encode structural biases, underscoring the need for fairness at every stage of development31. Yet, these methods remain constrained by the same dataset limitations that they seek to address, including a lack of consent, demographic self-identification and global representation. Most research in the computer vision fairness literature relies on repurposing non-consensual datasets that lack self-reported demographic information, leaving researchers to guess complex social constructs, such as the race and gender of image subjects, from images alone. These inferences can entrench stereotypes32,33, cause psychological harm to data subjects when inaccurate34,35 and compromise the validity of downstream tasks36.

The dearth of responsibly curated datasets creates an ethical dilemma for practitioners who would like to audit bias in their models. Their options are to use (1) diverse and densely annotated public datasets that carry legal or ethical risks; (2) one of the few publicly available consent-based but highly limited datasets (requiring them to add their own pixel-level annotations); (3) proprietary datasets that do not provide transparency to external parties; (4) datasets that have been quietly retracted due to ethical concerns but continue to circulate in derivative forms37; or (5) nothing—simply to not check for bias7,11,18.

To address these challenges, we introduce the Fair Human-Centric Image Benchmark (FHIBE), a publicly available, consensually collected, globally diverse fairness evaluation dataset for a wide range of vision-based tasks, from face verification to visual question answering (VQA). FHIBE comprises 10,318 images of 1,981 unique individuals from 81 countries/areas38. Current consent-based fairness datasets25,26,27 lack data from regions with stringent regulations, such as the European Union (EU), making FHIBE, to our knowledge, the first publicly available, human-centric computer vision dataset to include consensually collected images from the EU. FHIBE features the most comprehensive annotations to date, covering demographic and physical attributes, environmental conditions, camera settings and pixel-level labels. To assess FHIBE’s capabilities, we used it to evaluate bias in a wide variety of narrow models (designed for specific tasks) and foundation models (general purpose) commonly used in human-centric computer vision. Our analyses spanned eight narrow model tasks (pose estimation, person segmentation, person detection, face detection, face parsing, face verification, face reconstruction and face super-resolution), along with VQA for foundation models. We affirm previously documented biases, and we show that FHIBE can support more granular diagnoses of the factors leading to such biases. We also identify previously undocumented biases, including lower model performance for older individuals and strong stereotypical associations in foundation models based on pronouns and ancestry.

A large number of participants were involved in the data collection, annotation and quality assurance (QA) processes for our project (as described in Supplementary Information C). To collect a dataset as globally diverse as possible, we worked with data vendors to collect data from crowdsourced image subjects. Additional annotations were also collected from crowdsourced and vendor-employed annotators. We provided extensive guidelines to vendors and performed additional steps for QA, privacy preservation, IP protection and consent revocation to further protect the rights of those involved in the data-collection process (Methods). By creating FHIBE, we not only provide researchers with a new evaluation dataset, but we also show the possibilities and limitations of responsible data collection and curation in practice.

FHIBE

Overview

FHIBE comprises 10,318 images of 1,981 unique individuals, averaging six images per primary subject. We used a crowdsourcing approach, working with data vendors that operate globally to collect the dataset. We developed comprehensive data-collection guidelines and implemented a rigorous quality assessment protocol, which we discuss in detail in the Methods.

The dataset includes 1,711 primary subjects (individuals submitting images of themselves; Supplementary Information C) and 417 secondary subjects (individuals who appear alongside primary subjects, increasing the diversity and complexity of the images). Note that some primary subjects are also secondary subjects in other images. In total, 623 images contain both primary and secondary subjects. Captured between May 2011 and January 2024, the images span 81 countries/areas across 5 regions and 16 subregions38. To increase the diversity of the images (location, clothing, appearance, environmental conditions and so on), we permitted participants to submit images that they had previously taken of themselves. The images were taken with 785 distinct camera models from 45 manufacturers, and represent a wide range of real-world conditions, including 16 scene types, 6 lighting conditions, 7 weather scenarios, 3 camera positions and 5 camera distances. Example images with the accompanying subject, instrument and environment metadata are provided in Fig. 1.

Fig. 1: Annotations about the image subjects, instrument and environment are available for all images in FHIBE.

For visualization purposes, we display one type of metadata per image in this figure. Each annotation is linked to the annotators who made or checked the annotation. If the annotator disclosed their demographic attributes (age, pronouns, ancestry), that information is also provided. A full list of annotations is provided in Supplementary Information A. NA, not applicable.

FHIBE also features self-reported pose and interaction annotations, with predefined labels categorized into 16 body poses, 2 head poses and 47 distinct interactions—14 with other subjects and 33 with objects. The dataset offers a rich array of appearance characteristics, including 15 hair and 4 facial hair styles, 7 hair types, 13 hair and 12 facial hair colours, 9 eye colours and 11 types of facial marks.

There are also 6 pronoun categories, 56 integer ages (18 to 75 years) grouped into 5 age categories, 20 ancestry subregions within 5 regions and 6 Fitzpatrick skin tones39. There are 1,234 intersectional groups defined by age group, pronoun, ancestry subregion and Fitzpatrick skin tone, with the number of images per group ranging from 1 to 1,129, with a median of 9 images.
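The intersectional grouping described above can be reproduced mechanically with a standard groupby. The sketch below uses a toy metadata table whose column names and values are hypothetical stand-ins, not FHIBE's actual schema.

```python
import pandas as pd

# Hypothetical subject metadata; FHIBE's real schema may differ.
subjects = pd.DataFrame({
    "age_group": ["18-29", "30-39", "18-29", "60+"],
    "pronoun":   ["she/her/hers", "he/him/his", "she/her/hers", "he/him/his"],
    "ancestry":  ["Eastern Africa", "Southern Asia", "Eastern Africa", "Western Europe"],
    "skin_tone": ["V", "IV", "V", "II"],
})

# Each unique (age group, pronoun, ancestry subregion, skin tone) tuple
# defines one intersectional group; count the images per group.
groups = (subjects
          .groupby(["age_group", "pronoun", "ancestry", "skin_tone"])
          .size()
          .rename("n_images"))

print("distinct intersectional groups:", groups.shape[0])
print("median images per group:", groups.median())
```

The same grouping, applied to FHIBE's real annotations, yields the 1,234 groups and per-group image counts reported above.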

FHIBE includes pixel-level annotations for face and person bounding boxes, 33 keypoints and 28 segmentation categories (Fig. 2). Annotator identifiers (an anonymized ID distinguishing each annotator) are provided for each annotation. Annotator demographic information is also included for transparency, if self-disclosed by the annotators. A complete list of annotations is provided in Supplementary Information A. Distribution plots showing the diversity of FHIBE are shown in Extended Data Figs. 1 and 2 and Supplementary Information B and D. The inter-rater reliability analysis, showing the high quality and consistency of FHIBE annotations, is shown in the Methods and Supplementary Information E.

Fig. 2: Example FHIBE images annotated with detailed pixel-level annotations, keypoints, segmentation masks and bounding boxes.

Pixel-level annotations include keypoint annotations (small red circles) indicating the geometric structure (white lines) of human bodies and faces (for example, right eye inner, left foot index); segmentation masks dividing the human body and face into segments, assigning a label to each pixel (for example, left arm, jewellery); and face and person bounding boxes (red and blue rectangles, respectively).

Furthermore, FHIBE includes two derivative face datasets: a cropped-only set with 10,941 images from 1,981 subjects, and a cropped-and-aligned set with 8,370 images from 1,824 subjects. Both face datasets include all annotations.

Comparison with existing datasets

In Extended Data Table 1, we compare FHIBE against 27 human-centric computer vision datasets that have been used in fairness evaluations, considering their collection methods, annotations and ethical dimensions.

The majority of the datasets were scraped from Internet platforms or derived from scraped datasets. Seven well-known datasets were revoked by their authors and are no longer publicly available. The reasons for their removal are typically not stated explicitly, but they point to growing criticism of, and ethical concerns around, web scraping data for AI development37. While a number of datasets have annotated bounding boxes, keypoints and segmentation masks, their pixel-level annotations do not match the density of FHIBE’s annotations. Datasets with dense pixel-level annotations, like COCO40, VQA2.041 and MIAP42, contain only limited demographic information, none of which is self-reported.

Only four datasets mention that data were collected after obtaining consent from data subjects. CCv226 and the Chicago Face Database27 are consent-based datasets, but provide no further details on how consent was obtained. While Dollar Street43 provides details on how consent was obtained, use in AI development was not stated as its purpose for collection, and there is no indication that the subjects consented to the processing of their biometric or other personal information. FHIBE stands out as the only dataset collected with robust consent for AI evaluation and bias mitigation.

FHIBE also has greater utility for diagnosing bias in AI compared with other consent-based datasets. CCv2 and Dollar Street have no pixel-level annotations. This makes them unsuitable for the diverse computer vision task evaluations that FHIBE enables. CCv2 and Chicago Face Database also only feature videos/images of individuals facing the camera, largely indoors, with only their head and shoulders shown. They lack full-body images and diverse backgrounds and poses, limiting their utility for many computer vision tasks, such as pose estimation, and for evaluating how models might perform in deployment contexts in which the individuals might not be looking at the camera.

Moreover, FHIBE stands out from other consent-driven datasets in terms of its detailed and self-reported demographic labels, which enable the investigation of model performance at complex intersections of demographic attributes (Table 1). Although CCv1 has 4.4 times more images and CCv2 has 2.8 times more subjects than FHIBE, FHIBE has 3.4 times more annotations and 16.9 times more attribute values (Table 2). FHIBE also has greater representation from regions that are under-represented in many computer vision datasets, such as Africa (44.7%) and lower-middle income economies (71.5%) (Table 3), making it uniquely suitable for bias evaluation.

Table 1 Dataset comparison by intersectional subgroups
Table 2 Dataset comparison by summary statistics
Table 3 Dataset comparison by geographical region and income level

Ethical considerations in FHIBE design

In developing FHIBE, we sought to implement best practices for ethical data collection recommended in the literature2,3,44. We focused particularly on consent, privacy protection, compensation, safety, diversity and utility. The design decisions discussed below can also provide a starting point for future responsible data collection and curation efforts, including those not focused on fairness evaluation. Detailed descriptions of how these ethical considerations were implemented are provided in the Methods.

Consent

Informed consent is central to research involving human participants, promoting participant safety and protection while supporting research integrity19,45. It involves the participants having sufficient information regarding the project and the potential risks before deciding to participate. Informed consent is also fundamental to data privacy protection, as encoded in various laws and regulations7,18,19,46.

Our consent processes were designed to comply with comprehensive data protection laws like the EU General Data Protection Regulation (GDPR)46. These processes included developing consent forms with clear language about the uses and disclosures of the collected data, the processing of biometric and sensitive data and the rights of data subjects with regard to their data. Policy considerations imbued in data privacy laws, such as respect for human dignity, also influenced other aspects of our data collection, including decisions regarding the types of attributes we collected (for example, pronouns rather than gender), participant recruitment guidelines (for example, no coercive practices) and restrictions on downstream uses of the dataset (for example, users are prohibited from attempting to reidentify subjects).

To ensure that consent is given on a voluntary basis46, data subjects retain control over their personal information and may withdraw their personal data from the dataset at any time, with no impact on the compensation they received from the project. In the event of consent withdrawal, we commit to maintaining dataset integrity by replacing withdrawn images and preserving the dataset’s size and diversity to the extent possible. This commitment makes FHIBE a first in computer vision—a living dataset designed to evolve responsibly.

Privacy and IP

In addition to obtaining informed consent, we took additional measures to remove incidental personal information from the images. We used a state-of-the-art generative diffusion model47 to in-paint over non-consensual subjects (for example, individuals in the background of an image) and personally identifiable information (for example, license plates, credit cards). We then manually checked each image to verify the personal information had been removed, mitigating potential algorithmic biases in the automated methods48. This approach avoids the limitations of traditional privacy measures, such as automated face blurring49, which can still allow for reidentification through distinctive non-facial features (for example, tattoos, birthmarks)50. We further tested our method to ensure that it did not compromise the utility of the data for model evaluation. Moreover, we coarsened certain attributes and release others only in aggregate form.

To secure appropriate rights to license the images for downstream users, the participants submitting images were also required to review and agree to terms affirming they had the rights to provide the images and understood the nature of their contribution. Furthermore, our instructions to data vendors and participants included requirements to minimize the presence of third-party IP, such as trademarks and landmarks. We also implemented automated checks with manual verification to detect and exclude images with prominent third-party IP, such as logos, from our dataset.

Compensation

Crowdworkers often contend with low wages and demanding working conditions21,22, while individuals whose images are included in web-scraped datasets receive no compensation. To address these concerns, we asked data vendors to report minimum payment rates per task per region and to compensate crowdworker participants—image subjects, annotators and QA annotators (definitions are provided in Supplementary Information C)—at least the applicable local minimum wage based on task-time estimates. Vendors’ reported minimum payment rates were cross-referenced against the International Labor Organization’s Global Wage Report51 or, where this was not applicable, with the minimum wage of a country with comparable GDP per capita. The median compensation for image subjects was 12× the applicable minimum wage (further information about project costs is provided in the Discussion and Methods).

Safety

Web-scraped datasets frequently include harmful and illegal content, ranging from derogatory annotations to instances of child sexual abuse material (CSAM)2,6,17. Although the risk of such content appearing in our dataset was low given our sourcing method, instructions to data subjects and vendor QA, we performed additional manual and automated checks to ensure safety. Each image was manually reviewed to identify and remove any harmful content, and the image hashes were cross-referenced against a database of known CSAM maintained by the National Center for Missing & Exploited Children (NCMEC). This dual approach—leveraging both technology and human judgement—helped to create a dataset that is both safe and respectful of human dignity.
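As a minimal illustration of hash cross-referencing, the sketch below screens image bytes against a set of known hashes. It is a deliberate simplification: real screening is performed against NCMEC's database, typically with perceptual hashes that also match near-duplicates, and the blocklist entry here is invented.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Exact content hash of the raw image bytes."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical blocklist of known-bad hashes (illustrative only).
blocklist = {sha256_hex(b"known-bad-bytes")}

def passes_screen(image_bytes: bytes) -> bool:
    """True if the image's hash is not on the blocklist."""
    return sha256_hex(image_bytes) not in blocklist

print(passes_screen(b"benign image bytes"))
print(passes_screen(b"known-bad-bytes"))
```

Exact hashing only catches byte-identical files, which is why production screening pipelines rely on perceptual hashing instead.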

Diversity

While diversity is a relevant consideration for data collection generally, the fact that FHIBE is a fairness evaluation set made it especially important to optimize for diversity across many dimensions: image subject demographics, appearance (for example, not wearing the same clothing in all images), poses, interactions between subjects and objects, and environment.

FHIBE contains detailed demographic information—such as age, pronouns and ancestry—making it possible to use FHIBE to evaluate model bias along many axes of interest. As FHIBE is a publicly available dataset, we sought to balance minimizing the disclosure of sensitive information against maximizing the availability of useful annotations for bias diagnosis. This led to our decision to collect pronouns, as pronouns are more likely to be public-facing information, whereas gender identity and sex can be quite sensitive, particularly for gender and sex minorities52. Moreover, while we collected information on data subjects’ disability status, pregnancy status, height and weight to measure the diversity of our dataset along these dimensions, we do not release these annotations with the dataset and disclose only aggregate summary statistics for transparency purposes (Supplementary Information B.1). Note that participant disclosures about pregnancy and disability status were optional.

Collecting pronouns rather than gender identity also reduced risks associated with misgendering3,53, and collecting ancestry offered a more stable alternative to country-specific racial categories3,54. We further describe the rationales to use pronouns and ancestry in Supplementary Information J.

We also collected annotations on phenotypic and performative markers to enhance bias analysis. Phenotypic attributes—like skin colour, eye colour and hair type—provide observable characteristics related to relevant demographic bias dimensions9, while performative markers—such as facial hair, cosmetics and clothing—help to identify social stereotypes and spurious correlations55. Moreover, FHIBE includes camera-level metadata and environmental annotations, capturing factors such as illumination, camera position and scene, which are important for understanding model performance across diverse conditions16,56.

With the exception of pixel-level annotations, head pose and camera distance, we focused on the collection of self-reported information to address the limitations (as discussed above) of previous data-collection efforts that used annotators to guess subjects’ attributes. Collecting self-reported attributes (as opposed to labelling them later) had the additional benefit of ensuring that the participants were well aware of the information about them that would be used in the project.

Utility

An evaluation set is valuable only insofar as it enables assessments of model performance on relevant tasks. FHIBE provides extensive annotations for analysing human-centric visual scenes, including face- and person-specific bounding boxes, keypoints and segmentation masks. As a result, FHIBE can be used to evaluate models across a much wider variety of tasks than previously possible using consent-based computer vision datasets. Its combination of pixel-level annotations and attribute labels makes FHIBE, to our knowledge, the most comprehensively annotated fairness dataset currently available.

Moreover, we compared the utility of FHIBE as a fairness evaluation set with existing datasets. As discussed in the Methods, for each of the eight narrow model computer vision tasks that FHIBE was designed for, we evaluated commonly used models using FHIBE and pre-existing evaluation datasets (Supplementary Information F). The findings are discussed in the ‘Evaluation results’ section below.

Evaluation results

Bias discovery in narrow models

FHIBE’s diverse and comprehensive annotations provide both breadth and depth in fairness assessments, enabling the evaluation of models across a range of demographic attributes and their intersections. We examined the performance of a variety of pretrained narrow models—across eight common computer vision tasks: pose estimation, person segmentation, person detection, face detection, face parsing, face verification, face reconstruction and face super-resolution—on FHIBE’s demographic groups and their intersections (that is, pronoun × age group × ancestry × skin tone). The exact methodology is described in the Methods.

Through our benchmarking analysis, we found that intersectional groups combining multiple sensitive attributes—including pronoun, age, ancestry and skin tone—experience the largest performance disparities (Supplementary Fig. 21). Notably, despite the fact that skin tone is often used as a proxy for ancestry/race/ethnicity in fairness evaluations57, we find that intersections featuring both skin tone and ancestry have much greater disparities than those with only one of these attributes.

For each task, we also examined the intersectional groups for which the models showed the highest versus lowest disparity in performance. Note that, for this particular analysis, we considered only groups with at least ten subjects, and pairwise group comparisons were filtered using the Mann–Whitney U-test for statistical significance. To control for multiple comparisons, we applied Bonferroni correction58 by adjusting the significance threshold based on the number of pairwise tests, therefore considering only pairs with a statistically significant difference (\(P < \frac{0.05}{{\rm{number}}\,{\rm{of}}\,{\rm{pairwise}}\,{\rm{tests}}}\)). Through this analysis (Extended Data Table 2 and Supplementary Information K), we found that younger individuals (aged 18–29 years), those with lighter skin tones and those with Asian ancestry were more frequently among the groups that models performed best on, whereas older individuals (aged 50–59 and 60+ years), those with darker skin tones and those with African ancestry appeared more often among the groups that models performed worst on. However, despite these high-level trends, there was variability across models and specific intersections. For example, for face detection, RetinaFace performed best for ‘she/her/hers × type I × Asia’ and worst for ‘he/him/his × type II × Africa’, whereas MTCNN performed best for ‘she/her/hers × type II × Africa’ and worst for ‘he/him/his × type IV × Europe’.
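The pairwise testing procedure above can be sketched as follows, with synthetic per-subject scores standing in for real task metrics; the group labels and score distributions are illustrative only.

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical per-subject scores (for example, IoU) for three
# intersectional groups; in practice only groups with >= 10 subjects
# are compared.
scores = {
    "she/her/hers x I x Asia":  rng.normal(0.90, 0.03, 40),
    "he/him/his x II x Africa": rng.normal(0.80, 0.05, 40),
    "he/him/his x IV x Europe": rng.normal(0.85, 0.04, 40),
}

pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance threshold

significant = []
for a, b in pairs:
    _, p = mannwhitneyu(scores[a], scores[b], alternative="two-sided")
    if p < alpha:  # keep only statistically significant disparities
        gap = abs(scores[a].mean() - scores[b].mean())
        significant.append((a, b, p, gap))

# Report the significant pairs, largest performance gap first.
for a, b, p, gap in sorted(significant, key=lambda t: -t[3]):
    print(f"{a} vs {b}: p={p:.2e}, mean gap={gap:.3f}")
```

The groups the paper highlights correspond to the extremes of this ranking: the best- and worst-performing intersections among all significantly different pairs.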

This variability highlights the importance of testing for intersectional biases on a case-by-case basis, as bias trends can vary depending on the specific model–task combination. Overall, disparities likely arise from a combination of systemic biases—such as demographic under-representation—and task- or model-specific interactions with sensitive attributes. While some patterns align with broader structural inequalities, others reflect localized effects, emphasizing the need for nuanced and intersectional fairness assessments, which FHIBE’s extensive demographic annotations facilitate.

FHIBE further enables in-depth analyses of model performance disparities by identifying the specific features contributing to bias trends with greater granularity than what existing datasets facilitate. For example, we found that face-detection models showed consistently higher accuracy for individuals with she/her/hers pronouns compared with he/him/his pronouns (Supplementary Tables 14–16), a finding consistent with previous research59. Through our direct error modelling analysis, we used FHIBE’s extensive annotations to identify attributes that statistically significantly contributed to this performance difference (Extended Data Figs. 3 and 4).

While many of the statistically significant attributes were not obviously related to gender (for example, visible keypoints, camera distance), lack of visible hair was a significant factor driving the gender disparity for RetinaFace (Extended Data Fig. 5). Further analysis conditioned on headwear and qualitative image inspection revealed that no visible hair in ‘he/him/his’ images often indicated baldness, making face detection challenging. Lack of visible hair was not only less common among ‘she/her/hers’ images, but it also typically resulted from headwear closely fitted to the face that preserved clear facial contours, making the task easier. FHIBE can therefore be used to help to explain underlying causes of previously identified biases.

Using FHIBE, we also identified previously undocumented bias trends. For example, face parsing models performed better for younger individuals than for older individuals (Supplementary Table 18). Through our error pattern recognition analysis, we found that much of this disparity was attributable to the models’ particularly poor performance for individuals with grey or white facial hair (Extended Data Fig. 7). For face verification, we conducted fairness evaluations using pretrained models—ArcFace60, CurricularFace61 and FaceNet62. All three models obtained lower accuracy for the ‘she/her/hers’ pronoun subgroup (Supplementary Table 20), a disparity that we traced to greater hairstyle variability (Extended Data Fig. 8) within this group—a factor that was previously overlooked when using less detailed datasets for bias diagnosis. This level of granularity in identifying the sources of bias can help to inform approaches to bias mitigation. For example, in this case, rather than collecting more training data from individuals of specific demographics, which can exacerbate ethical concerns around the ‘hypervisibility’ faced by certain marginalized groups7, a developer could focus on ensuring that their face verification model is robust to hairstyle variability.
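The mechanism by which within-group appearance variability depresses verification accuracy can be illustrated with toy Gaussian embeddings standing in for ArcFace-style features: at a fixed similarity threshold, the group with greater within-person variability (a stand-in for hairstyle variability) yields a lower true-positive rate. The noise levels below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def genuine_pair_sims(n_pairs, intra_class_noise, dim=128):
    """Similarities of embedding pairs from the same person; larger
    noise stands in for greater appearance variability."""
    sims = []
    for _ in range(n_pairs):
        a = rng.normal(size=dim)
        b = a + intra_class_noise * rng.normal(size=dim)
        sims.append(cosine(a, b))
    return np.array(sims)

threshold = 0.5  # accept a pair as a match above this similarity
tpr = {}
for group, noise in [("she/her/hers", 1.8), ("he/him/his", 0.8)]:
    tpr[group] = float((genuine_pair_sims(200, noise) > threshold).mean())
    print(f"{group}: TPR at threshold {threshold} = {tpr[group]:.2f}")
```

A developer targeting this disparity would aim to shrink the intra-class noise (that is, make embeddings invariant to hairstyle) rather than to collect more data from the disadvantaged group.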

Moreover, when assessing models using different observational datasets, conflicting bias trends often emerge. For example, in person-detection tasks, evaluation on FHIBE showed higher accuracy for individuals with darker skin tones, whereas FACET reported superior performance for lighter skin tones (Supplementary Tables 12 and 13). Leveraging FHIBE’s detailed annotations, we applied our direct error modelling approach (Methods and Supplementary Information G), using regression and decision trees to determine which features were linked to reduced model performance, and identified confounding factors such as body pose (for example, lying down), subject interactions (for example, hugging/embracing), image aspect ratio and the number of visible keypoints (which indicate body occlusion) that significantly correlated with person-detection performance (Extended Data Figs. 5 and 6). In the case of Faster R-CNN, our analysis identified the number of visible keypoints as a statistically significant factor in person-detection performance, with a higher count of visible keypoints leading to improved accuracy. When we analysed performance disparities by skin tone within a subset of images with a high number of visible keypoints, we found no statistically significant differences in performance across skin tones. This suggests that most performance disparities are driven by cases in which the subject’s keypoints are not fully visible, probably due to occluded body features.
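A minimal version of this direct error modelling idea is sketched below on synthetic data, in which detection failures are driven by keypoint visibility rather than skin tone; regressing failures on annotation features then recovers that structure. This illustrates the approach, not the paper's actual pipeline, and the feature construction is invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in for FHIBE annotations: failure probability depends
# on the number of visible keypoints; skin tone has no true effect.
visible_keypoints = rng.integers(5, 34, n)  # out of 33 keypoints
dark_skin = rng.integers(0, 2, n)           # hypothetical binary proxy
p_fail = 1.0 / (1.0 + np.exp(0.4 * (visible_keypoints - 15)))
failed = rng.random(n) < p_fail

# Regress detection failure on both candidate features.
X = np.column_stack([visible_keypoints, dark_skin])
model = LogisticRegression(max_iter=1000).fit(X, failed)

coef_keypoints, coef_skin = model.coef_[0]
print(f"keypoint coefficient:  {coef_keypoints:.2f}")  # strongly negative
print(f"skin-tone coefficient: {coef_skin:.2f}")       # near zero
```

The fitted coefficients attribute the errors to occlusion (few visible keypoints) rather than to the demographic attribute, mirroring the conditioning analysis described above.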

These findings highlight the importance of addressing relevant sources of model errors and can guide developers in refining their models to enhance fairness and accuracy. FHIBE’s extensive annotations can provide valuable insights into the factors contributing to differences in fairness evaluation results across various benchmarks. FHIBE also enables developers to disentangle the source of bias among possible confounders. This is only possible with access to a rich set of accurate annotations, which FHIBE contains, but most comparable fairness evaluation datasets lack.

Bias discovery in foundation models

Large-scale, multimodal generative models, which learn associations between text and images, enable diverse tasks such as classification, image search, image segmentation, image captioning and VQA (answering questions about an image). However, the widespread adoption of these technologies has also amplified their potential for harm. Research has shown that these models can perpetuate existing social biases63, reinforce harmful stereotypes14,64, and marginalize or dehumanize under-represented groups65.

Existing benchmarks for vision–language models (VLMs) focus mainly on improving performance in tasks such as object recognition66, robustness67 or reasoning68, and less on evaluating ethical dimensions such as bias and fairness69. Similar to datasets used to test narrow models, those that aim to evaluate VLM biases are often based on repurposed, web-scraped data70,71, leading to potential data leakage problems, limited coverage of societal dimensions43,72 and reliance on synthetically generated data that do not capture the nuances of real-world contexts73.

We demonstrate FHIBE’s utility for evaluating VLM foundation models across a range of image comprehension and recognition tasks. In particular, we assess two popular models, CLIP74 and BLIP-275 (Methods and Supplementary Information H). We explored how pronoun and ancestry biases show up in general image understanding tasks like scene recognition (with CLIP) and open-ended VQA (with BLIP-2).

When asked to classify images using 16 provided gender-related prompts (the prompts are provided in Supplementary Information H), we found that CLIP was far more likely to assign a gender-neutral label (unspecified) to those with ‘he/him/his’ pronouns (0.69) than those with ‘she/her/hers’ pronouns (0.38), reinforcing the notion that male individuals are the default. Moreover, CLIP’s perception of gender was strongly influenced by hairstyle, with individuals who did not conform to stereotypical hairstyles (for example, ‘he/him/his’ pronouns and long hair) being frequently misgendered (Fig. 3a). We also found that CLIP had biased associations with other image attributes such as scene, disproportionately associating individuals of African ancestry with outdoor environments and linking those of African or Asian ancestry with rural settings (Fig. 3b,c).

Fig. 3: Biases in CLIP predictions on FHIBE.

a, Predicted label probabilities (rows) conditioned on ground-truth pronouns (columns) (left); CLIP more often assigns a gender-neutral ‘unspecified’ label to ‘he/him/his’ than to ‘she/her/hers’. Right, gender-classification error rates vary with both pronoun and hairstyle and are lowest for stereotypical pronoun–hairstyle combinations (for example, ‘he/him/his’ with ‘short/no hair’). b, For indoor environments, masking the person increases the accuracy, whereas, for outdoor environments, masking decreases the accuracy. This suggests that CLIP may treat the presence of a person as a spurious cue for outdoor scenes, with the effect being particularly pronounced for individuals of African ancestry. c, Scene type predictions conditioned on ancestry. CLIP is more likely to predict rural environments for images containing individuals of African or Asian ancestry. The numbers on each bar denote the group size (bottom) and the corresponding probability estimate (top), indicating that perceived rural associations are stronger for these groups.

Next, we assessed BLIP-2 in the VQA setting, prompting it with questions about the images with varying tones—positive, neutral or negative (the prompts are provided in Supplementary Information H). None of the prompts asked about or featured information about gender or ancestry. Nonetheless, we found that the model’s outputs still reflected biases based on gender and ancestry. For example, when asked why an individual is likeable, BLIP-2 frequently generated responses that attributed likability to gender, such as “because she is a woman” (Fig. 4a). As with CLIP, BLIP-2 was more likely to misgender individuals identified as ‘she/her/hers’ (Fig. 4b). Neutral prompts (for example, asking what an individual’s occupation is) sometimes produced benign output text (for example, teacher), but other times yielded terms that reinforced harmful stereotypes against specific pronoun and ancestry groups, such as prostitute, drug dealer and thief (Fig. 4c,d). Moreover, we found that negative prompts, for example, about what crimes an individual committed—which should yield a null response—elicited toxic responses at higher rates for individuals of African or Asian ancestry, those with darker skin tones and those identifying as ‘he/him/his’ (Fig. 4e–g).

Fig. 4: BLIP-2 analysis results.

Summary of the gender, occupation, ancestry and toxic response analyses. a, Responses to non-gendered likeability prompts show implicit gender attribution. b, Pronoun predictions are more accurate for ‘he/him/his’ than ‘she/her/hers’, which exhibits a fivefold higher error rate. c,d, Neutral prompts about occupations highlight stereotypical associations, revealing gender-based (c) and ancestry-related (d) stereotypes. e–g, Negatively framed prompts elicit toxic responses linked to pronouns, skin tone and ancestry, with toxic gender-related responses (e), skin-tone-related responses (f) and ancestry-related responses (g). The numbers on each bar indicate the group size (bottom) and probability estimate (top).

Using FHIBE, we were therefore able to identify these previously undocumented biases. These observations underscore the persistent biases in these models and highlight the need for bias mitigation strategies.

Discussion

FHIBE marks an inflection point in enabling the development of more responsible AI. Developers are able to evaluate and compare bias in their models across many computer vision tasks without relying on non-consensually sourced datasets. One of the key contributions of FHIBE is the implementation of many of the principles that have until now been advocated for only in responsible data curation, therefore paving the way for ethical data collection efforts going forward. Insights from the development of FHIBE also provide important learnings that can inform future directions for research.

Creating an ethics-driven human-centric dataset was challenging, as it required investment in processes that are not currently the norm in the data-collection ecosystem. Overall, to arrive at the 10,318 images for the initial launch of FHIBE, we collected a total of 28,703 images from three data vendors, at a cost of nearly US$308,500 (an average of US$10.75 per image). There were additional fixed costs of around US$450,000 for QA, legal services and building the data platform.

As this demonstrates, the emphasis on consent, fair compensation, rich annotations and global diversity made the data collection expensive. Furthermore, developing and implementing best practices for data collection, quality assurance and analysis required the work of 25 researchers, engineers and project managers, who worked at least part-time at various points over the project’s 2–3-year lifespan, along with the extensive support of legal, privacy, IT and QA specialists.

At a time when there are growing calls for ethical data collection21,22 and realization of the importance of consent and compensation for data rightsholders20, transparency around the costs of data collection is critical for the AI community. Among the 27 human-centric computer vision datasets that we compare FHIBE with, only the Chicago Face Database27 provides information about the costs of data collection (participants were compensated US$20, and US$25 was randomly awarded to raters who completed a survey; compensation was given as gift cards). Data collection not featuring human subjects and personal information can be more cost-effective (for example, GeoDE76 cost US$1.08 per image for a 61,940-image dataset, not including researcher time). However, overall, the costs of consensual, diverse and fairly compensated data collection remain high considering the large amounts of data needed to train state-of-the-art AI models77.

We hope that the practical learnings from FHIBE will help to inform future data collection efforts and encourage more research and investment into developing more scalable ethical data collection methods. As FHIBE is the first of its kind, future efforts can leverage our project as a starting point to substantially reduce the cost and time required, but there is still a need for further research on how to achieve ethical data collection methodologies at a scale that is suitable for AI training.

Aside from cost, consensually collected datasets have some additional limitations compared with web-scraped ones, the most notable being lower visual diversity. As shown in Fig. 5, FHIBE’s pixel-level annotations are more standardized; that is, subjects are generally positioned closer to the camera and centred within the frame. FHIBE exhibits moderate segmentation complexity across a range of difficulty levels, but keypoints are predominantly visible and consistently distributed (Fig. 5). These factors probably contributed to models performing better on FHIBE than on web-scraped evaluation datasets for many tasks (Supplementary Information F). That said, FHIBE is much more visually diverse than other consent-based datasets, vastly surpassing both CCv1 and CCv2 (Table 4). Thus, FHIBE helps to bridge the gap between non-consensually and consensually sourced datasets, but future work should explore how to further close this gap.

Fig. 5: Dataset comparison based on bounding box, segmentation mask and keypoint properties.

a, The bounding box (BBox) area to image area ratio; larger values indicate larger bounding boxes, suggesting that subjects are closer to the camera (left). Middle left, the face bounding box area to image area ratio; larger values indicate that subjects are closer to the camera. Only FHIBE and COCO were compared as the other datasets lack relevant labels. Middle right, the bounding box width to height ratio; values of <1 suggest that subjects are in vertical positions. Right, the normalized distance between the bounding box centre and image centre; smaller values indicate that the subjects are more centred. b, Person bounding box centre distributions. The centres are normalized by the image size to be in [0, 1]. FHIBE subjects are the most centred ones, with COCO and FACET demonstrating the largest spatial coverage. c, Person segmentation mask concavity, defined as \(1-\frac{{\rm{mask}}\,{\rm{convex}}\,{\rm{area}}}{{\rm{image}}\,{\rm{area}}}\); higher values denote increased mask complexity (left). Right, person segmentation mask area to image area ratio; larger values indicate that subjects are closer to the camera (more detailed masks). Note that non-person categories are ignored. d, The average Euclidean distance between keypoint pairs; a greater distribution spread indicates a higher spatial coverage (left). Middle, heat map of FHIBE keypoint locations, showing a canonical shape with keypoints concentrated around standing humans centred in the image, with red density likely representing facial keypoints. Right, heat map of COCO keypoint locations, displaying a less canonical distribution, with keypoints more uniformly dispersed across the image, suggesting the presence of humans in diverse locations.

Table 4 Comparison of dataset visual diversity

Furthermore, crowdsourcing images made it difficult to verify that the person submitting an image was the same as the image subject. Through our automatic and manual quality checks, including reverse image search and examination of consent forms and submission information, we identified possibly suspicious patterns and removed the corresponding images (Methods and Supplementary Information I). It is possible that core ethical considerations, such as fair compensation, increased the potential for fraudulent actors. For example, vendors generally offer higher compensation for demographic and geographical groups from which it is more difficult to collect consensual data. This creates greater incentives for individuals to misrepresent themselves to receive higher payments, despite the risk of being deplatformed by the vendor. It is therefore crucial for dataset curators to consider how their approaches to collecting diverse datasets ethically may attract fraudulent actors; this potential is yet another reason why consent revocation and redaction are important in the context of ethical dataset collection. Future work should further consider the benefits and shortcomings of different data-collection approaches.

Despite the challenges of implementing a fair human-centric image benchmark, FHIBE shows that putting core ethical considerations into practice is possible. We hope that FHIBE will establish a new standard for responsibly curated data for AI systems by integrating comprehensive, consensually sourced images and annotations. FHIBE facilitates nuanced bias evaluations while avoiding many of the ethical concerns typical of modern datasets, particularly those related to privacy and IP. Evaluations using FHIBE highlight pressing issues, such as performance disparities and stereotype reinforcement by AI models. By implementing responsible data practices and enabling the computer vision community to test their models for bias, FHIBE can help to enable the development of more inclusive and trustworthy AI systems.

Methods

Ethics statement: participants and consent/recruitment procedures

Data collection commenced after 23 April 2023, following Institutional Review Board approval from WCG Clinical (study number 1352290). All of the participants have provided their informed consent to the use of their data, and those who were image subjects further consented to have their identifiable images published.

We developed an informed consent form designed to comply with the EU’s GDPR46 and other similarly comprehensive data privacy regulations. Vendors were required to ensure that all image subjects (that is, both primary and secondary) provided signed informed consent forms when contributing their data. Vendors were also required to ensure that each image was associated with a signed copyright agreement to obtain the necessary IP rights in the images from the appropriate rightsholder. Only individuals above the age of majority in their country of residence and capable of entering into contracts were eligible to submit images.

All of the image subjects, regardless of their country of residence, have the right to withdraw their consent to having their images included in the dataset, with no impact on the compensation that they received for the images. This right is not typically provided in pay-for-data arrangements or in many data privacy laws beyond the GDPR and GDPR-inspired regimes.

Data annotators involved in labelling or QA were given the option to disclose their demographic information as part of the study and were similarly provided informed consent forms giving them the right to withdraw their personal information. Some data annotators and QA personnel were crowdsourced workers, while others were vendor employees.

To validate English language proficiency, which was needed to understand the project’s instructions, terms of participation, and related forms, participants (that is, image subjects, annotator crowdworkers and QA annotator crowdworkers) were required to answer at least two out of three randomly selected multiple-choice English proficiency questions correctly from a question bank, with questions presented before project commencement. The questions were randomized to minimize the likelihood of sharing answers among participants. An example question is: “Choose the word or phrase which has a similar meaning to: significant” (options: unimportant, important, trivial).

To avoid possibly coercive data-collection practices, we instructed data vendors not to use referral programs to incentivize participants to recruit others. Moreover, we instructed them not to provide participants with support (beyond platform tutorials and general technical support) in signing up for or submitting to the project. The motivation was to avoid scenarios in which the participants could feel pressured or rushed through key stages, such as when reviewing consent forms. We further reviewed project description pages to ensure that important disclosures about the project (such as the public sharing and use of the data collected, risks, compensation and participation requirements) were provided before an individual invested time into the project.

Image collection guidelines

Images and annotations were crowdsourced through external vendors according to extensive guidelines that we provided. Vendors were instructed to only accept images captured with digital devices released in 2011 or later, equipped with at least an 8-megapixel camera and capable of recording Exif metadata. Accepted images had to be in JPEG or TIFF format (or the default output format of the device) and free from post-processing, digital zoom, filters, panoramas, fisheye effects and shallow depth of field. Images were also required to have an aspect ratio of up to 2:1 and be clear enough to allow for the annotation of facial landmarks, with motion blur permitted only if it resulted from subject activity (for example, running) and did not compromise the ability to annotate them. Each subject was allowed to submit a maximum of ten images, which had to depict actual subjects, not representations such as drawings, paintings or reflections.

Submissions were restricted to images featuring one or two consensual image subjects, with the requirement that the primary subject’s entire body be visible (including the head, and a minimum of 5 body landmarks and 3 facial landmarks identifiable) in at least 70% of the images delivered by each vendor, and the head visible (with at least 3 facial landmarks identifiable) in all images. Vendors were also directed to avoid collecting images with third-party IP, such as trademarks and landmarks.

To increase image diversity, we requested that images ideally be taken at least 1 day apart and recommended that images submitted of a subject were taken over as wide a time span as possible, preferably at least 7 days apart. If images were captured less than 7 days apart, the subject had to be wearing different clothing in each image, and the images had to be taken in different locations and at different times of day. Our instructions to vendors requested minimum percentages for different poses to enhance pose diversity, but we did not instruct subjects to submit images with specific poses. Participants were permitted to submit previously captured images provided that they met all requirements.

Annotation categories and guidelines

We provided extensive annotation guidelines to data vendors that included examples and explanations. A complete list of the annotations, their properties (including whether they were multiple-choice), categories and annotation methods is provided in Supplementary Information A.

A key component of our project was that most annotations were self-reported by the image subjects as they were best suited to provide accurate information about subject demographics and physical characteristics, interactions depicted and scene context. The only annotations that were not self-reported were those that could be objectively observed from the image itself and would benefit from the consistency offered by professional annotators (that is, pixel-level annotations, head pose and camera distance, as defined by the size of an image subject’s face bounding box). We also provided examples and guidance for subject–subject interactions, subject–object interactions and head pose based on the request of our data vendors due to ambiguities in those labels.

We included open-ended, free text options alongside closed-ended responses, enabling subjects to provide input beyond predefined categories. These open-ended responses were coded as ‘Not Listed’. For privacy reasons, we do not report the specific text provided by the subjects. This approach enabled subjects to express themselves more fully79,80, resulting in more accurate data and informing better question design for future data collection. Given the mutability of most attributes, annotations were collected on a per-image basis, except for ancestry.

For the pixel-level annotations, face bounding boxes were annotated following the protocol used for the WIDER FACE dataset81, a commonly used face detection dataset. Keypoint annotations were based on the BlazePose topology82, a composite of the COCO40, BlazePalm83 and BlazeFace84 topologies. While the 17-keypoint COCO topology is widely used in computer vision, it lacks definitions for hand and foot keypoints, making it less suitable than BlazePose for applications such as fitness. For person segmentation, we defined 28 semantic segmentation categories based on the most comprehensive categorical schemas for this task, including MHP (v.2.0)85, CelebAMask-HQ86 and Face Synthetics87. Finally, person bounding boxes were automatically derived from human segmentation masks as the minimum-sized box enclosing the entirety of each person’s segmentation mask.
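The derivation of a person bounding box from a segmentation mask can be sketched in a few lines. This is a minimal illustration rather than FHIBE’s actual tooling, assuming a binary mask given as rows of 0/1 values:

```python
def bbox_from_mask(mask):
    """Return the minimal (x_min, y_min, x_max, y_max) box enclosing all
    foreground pixels of a binary segmentation mask, or None if the mask
    is empty. `mask` is a list of rows of 0/1 values."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) for v in row if v]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))
```

In practice, this operation is applied per person instance, so overlapping subjects each receive their own box.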

Each annotator, QA annotator and QA specialist was assigned a unique identifier to link them to each annotation that they provided or reviewed, as well as any demographic information they chose to disclose. For annotation tasks involving multiple annotators, we provided the individual annotations from each annotator, rather than aggregated data. These annotations included those made before any vendor QA and those generated during each stage of vendor QA.

For our analyses, images with multiple annotations within a single attribute category (for example, ancestry subregion) are included in all relevant attribute value categories. For example, if an image subject is annotated with multiple ancestry subregions, the subject is counted in each of those subregions during analyses. Nested annotations—such as when a broad category is selected (for example, ‘Africa’ for ancestry)—are handled by counting the image subject in all corresponding subregions (for example, each subregion of ‘Africa’).
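The handling of nested annotations described above can be sketched as follows. The subregion map here is illustrative only and does not reflect FHIBE’s actual ancestry taxonomy:

```python
# Illustrative subregion map only; FHIBE's real ancestry taxonomy differs.
SUBREGIONS = {
    "Africa": ["Northern Africa", "Eastern Africa", "Middle Africa",
               "Southern Africa", "Western Africa"],
}

def expand_labels(labels):
    """Expand any broad category into all of its subregions, so that a
    subject annotated with 'Africa' is counted in every African subregion;
    labels without subregions are kept as-is."""
    expanded = set()
    for label in labels:
        expanded.update(SUBREGIONS.get(label, [label]))
    return sorted(expanded)
```

A subject with multiple labels is simply counted once in each expanded category, matching the per-category counting used in the analyses.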

Quality control and data filtering

Quality control for images and annotations was conducted by both the vendors and our team. Vendor QA annotators handled the first round of checking images, annotations, and consent and IPR forms. For non-self-reported annotations, vendor QA workers were permitted to modify the annotation if incorrect. For imageable attributes (such as apparent eye colour, facial marks, apparent head hair type), they could provide their own annotations if they believed the annotations were incorrect, but this would not overwrite the original self-reported annotation (we report both annotations). Vendors were instructed not to QA non-imageable attributes (such as pronouns, nationality, natural hair colour), with the exception of height and weight if there were significant differences in the numbers for the same subject in images taken 48 h or less apart.

Moreover, we developed and ran various automated and manual checks to further examine the images and annotations delivered by the vendors. Our automated checks verified image integrity (for example, readability), resolution, lack of post-processing artifacts and sufficient diversity among images of the same subject. They also assessed annotation reliability by comparing annotations to inferred data (for example, verifying that a scene labelled as ‘outdoor’ corresponds with outdoor characteristics in the image), checked for internal consistency (for example, ensuring body keypoints are correctly positioned within body masks), identified duplicates and checked the images against existing images available on the Internet. Moreover, the automated testing checked for CSAM by comparing image hashes against the database of known CSAM maintained by the National Center for Missing & Exploited Children (NCMEC).

Images containing logos were automatically detected using a logo detector88 and the commercial logo detection API from Google Cloud Vision89. They were then excluded from FHIBE to avoid trademark issues. We used a detection score threshold of 0.6 to eliminate identified bounding boxes with low confidence, and the positive detection results were reviewed and filtered manually to avoid false positives. However, despite these efforts, logo detection remains a complex challenge due to the vast diversity of global designs, spatial orientation, partial occlusion, background artifacts and lighting variations. Even manual review can be inherently limited, as QA teams cannot be familiar with every logo worldwide and often face difficulty distinguishing between generic text and logos. Our risk-based approach to logo detection and removal was informed by the relatively low risk of IP harms posed by the inclusion of logos in our dataset. The primary concern is that individuals might mistakenly perceive a relationship between our dataset and the companies whose logos appear. However, this is mitigated by the academic nature of this publication and the clear disclosure of author and contributor affiliations.
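The confidence-threshold step can be expressed simply. This is a sketch, with detections represented as dictionaries carrying a `score` field (an assumed format, not the API’s exact response shape):

```python
def filter_low_confidence(detections, threshold=0.6):
    """Discard detected logo bounding boxes whose confidence score falls
    below the threshold; the surviving detections proceed to manual
    review to weed out false positives."""
    return [d for d in detections if d["score"] >= threshold]
```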

Manual checks on the data were conducted predominantly by our team of QA specialists, as well as by authors. The QA specialists were a team of four contractors who worked with the authors to evaluate the quality of vendor-delivered data and make corrections where needed. The QA specialists had a background in ML data annotation and QA work, and received training and extensive documentation regarding the quality standards and requirements for images and annotations for this project. Furthermore, they remained in direct contact with our team throughout the project, ensuring that they could clarify quality standards as needed.

The manual checks focused on ensuring the accuracy of annotations for imageable attributes, such as hair colour, scene context and subject interactions. Non-imageable attributes, representing social constructs, such as pronouns or ancestry, were not part of the visual content verification. Moreover, even though the probability of objectionable content (for example, explicit nudity, violence, hate symbols) was low given our sourcing method, instructions to data subjects and QA from vendors, we took the additional step of manually reviewing each image for such content given the public nature of the dataset.

Overall, to arrive at the 10,318 images for the initial launch of FHIBE, we collected a total of 28,703 images from three data vendors. Initial internal assessments excluded 6,868 images due to issues with data quality and adherence to project specifications. Another 5,855 images were excluded for consent or copyright form issues. Of the remaining 15,980 images collected from vendors, approximately 0.07% were excluded for minor annotation errors (for example, missing skin colour annotations), 0.17% for offensive content (in free text or visual content) and 0.01% for other reasons (for example, duplicated subject IDs) before the suspicious-pattern exclusions described in the following section.

Detection and removal of suspicious images

While respecting the privacy of subjects, it was difficult to determine whether the people who submitted images were the same as the subjects depicted, as fraudulent actors may submit images of other people without their consent to be compensated by data vendors. Given the public and consent-driven nature of our dataset, we did not rely exclusively on vendors to detect and remove suspicious images. We used a combination of automated and manual checks to detect and remove images for which we had reason to suspect that the data subject(s) might not be the individual who submitted the image. In total, we removed 3,848 images of 1,718 subjects from the dataset.

For automated checks, we used Web Detect from Google Cloud Vision API89 to identify and exclude images that could have been scraped from the Internet. This was a conservative check as images found online could still have been consensually submitted to our project by the image subject. However, given the importance of consent for our project, and the use of the dataset for evaluation, we excluded these images out of an abundance of caution.

This check resulted in the removal of 321 images across 70 subjects, as we removed all of the images of a given subject if even a single one was found online. However, this automated approach had some limitations. The Vision API had a high false-positive rate of 62% for our task (that is, it flagged images that were merely visually similar, owing to scene elements or popular landmarks). Google Web Detect returned limited results for images containing people and, in some cases, the returned matches focused on clothing items or landmarks. Furthermore, some social media images may not have been indexed by the Vision API because the websites required authentication.

We therefore also applied manual review to remove potentially suspicious images. Manual reviewers were instructed to track potentially suspicious patterns during their review of images and consent/copyright forms. For example, they were instructed to examine inconsistencies between self-reported information and image metadata (for example, landmarks that contradicted the self-reported location). These patterns were later reviewed for exclusion by the research team.

Moreover, one of our QA specialists developed a manual process to find additional online image matches. The QA specialist used Google Lens to identify the location of the image. For images with distinctive locations (for example, not generic indoor locations or extremely popular tourist locations), the QA specialist performed a time-limited manual search to try to find image matches online. While we were not able to apply this time-intensive process to every image, using this approach, we were able to assess the risk level of different qualitative suspicious patterns and make additional exclusions.

After these exclusions, 2,017 subjects remained. From these subjects, we randomly sampled a set of 400 subjects and conducted the above manual QA process. In total, 14 subjects were found online while inspecting this sample, and we excluded them from the dataset. On the basis of this analysis, we estimated a baseline level of suspiciousness of 3.5 ± 1.7% with 95% confidence.
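The reported margin is consistent with a standard confidence interval for a sample proportion. A normal-approximation sketch (the exact interval method used was not specified) yields a similar figure:

```python
import math

def proportion_ci_halfwidth(successes, n, z=1.96):
    """Observed proportion and half-width of a 95% normal-approximation
    confidence interval for `successes` out of `n` trials."""
    p = successes / n
    return p, z * math.sqrt(p * (1 - p) / n)

p, half = proportion_ci_halfwidth(14, 400)
# p = 0.035; the half-width comes out near 0.018, close to the reported
# margin of roughly 1.7 percentage points
```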

It is important to note that removing suspicious images also had an impact on the demographic distribution of subjects in the dataset (Supplementary Information I). We found that excluded images were more likely to feature individuals of older ages, with lighter skin tones and of Europe/Americas/Oceania ancestry. While it is not possible for us to determine the true underlying reason why some people might have submitted fraudulent images, we can speculate that some of the ethical design choices of our dataset may have inadvertently incentivized fraudulent behaviours. For example, requiring vendors to pay at least the applicable local minimum wage may have encouraged people to falsely claim to be from regions with higher wages, submitting images from the Internet taken in those locations. Similarly, in our pursuit of diversity, our vendors found certain demographics were more difficult to obtain images of (for example, people of older ages). As such, higher compensation was offered for those demographics, increasing the incentives to fraudulently submit images featuring those demographics.

The priorities of our data collection project also made fraud more feasible and difficult to detect. Given that FHIBE is designed for fairness evaluation, we sought to maximize visual diversity and collect naturalistic (rather than staged) images. As a result, we opted for a crowd-sourcing approach and allowed subjects to submit past photos. Compared with in-person data collection or bespoke data collection in which the setting, clothing, poses or other attributes might be fixed or specified, it was more difficult for our project to verify that the images were intentionally submitted by the data subject for our project. We therefore encourage dataset curators to consider how their ethical goals may inadvertently attract fraudulent submissions.

Annotation QA

We verified the quality of both pixel-level annotations and imageable categorical attribute annotations using two methods. First, we compared the vendor-provided annotations with the average annotations from three of our QA specialists on a randomly sampled set of 500 images for each annotation type. For pixel-level annotations, agreement between the collected annotations and the QA specialist annotations was above 90% (Supplementary Information E), at a similar or higher level as related works90,91,92, showing the robustness and quality of our collected annotations.

Second, we assessed intra- and inter-vendor annotation consistency by obtaining three sets of annotations for the same 70 images from each vendor. Within each vendor, each image was annotated and reviewed three times by different annotators. To ensure independent assessments, no individual annotator reviewed the same annotation for a given image instance, resulting in mutually exclusive outputs from each labelling pipeline. For dense prediction annotations, intra- and inter-vendor agreement is above 90%, confirming a high quality of collected annotations. For attribute annotations, intra-vendor agreement is above 80% and inter-vendor agreement is at 70%, indicating that these labels are noisier than the dense prediction ones (Supplementary Information E).

Regarding metrics for these comparisons, for bounding boxes, we computed the mean intersection over union between the predicted and ground truth bounding boxes. For keypoints, we computed object keypoint similarity93. For segmentation masks, we computed the Sørensen–Dice coefficient94,95. For categorical attributes (for example, hair type, hairstyle, body pose, scene, camera position), we computed the pairwise Jaccard similarity coefficient96 and then the average. Using these analyses, we were able to verify the consistency of the annotations between vendors and our QA specialists, within individual vendors and between different vendors.
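The agreement metrics above can be sketched as follows (minimal illustrative re-implementations, not the exact QA code used for FHIBE; object keypoint similarity is omitted for brevity):

```python
# Illustrative sketches of the agreement metrics described above.

def bbox_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def dice(mask_a, mask_b):
    """Sørensen–Dice coefficient of two binary masks (flat 0/1 sequences)."""
    inter = sum(x * y for x, y in zip(mask_a, mask_b))
    return 2 * inter / (sum(mask_a) + sum(mask_b))

def jaccard(labels_a, labels_b):
    """Jaccard similarity coefficient of two categorical label sets."""
    sa, sb = set(labels_a), set(labels_b)
    return len(sa & sb) / len(sa | sb)
```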

Privacy assurance

We used a text-guided, fine-tuned stable diffusion model47 from the HuggingFace Diffusers library97 to inpaint regions identified by annotator-generated bounding boxes and segmentation masks containing incidental, non-consensual subjects or personally identifiable information (for example, license plates, identity documents). The model was configured with the following parameters: (1) text prompt: “a high-resolution image with no humans or people in it”; (2) negative text prompt: “human, people, person, human body parts, android, animal”; (3) guidance scale: randomly sampled from a uniform distribution, w ~ U(12, 16); (4) denoising steps: 20; and (5) variance control: η = 0, enabling the diffusion model to function as a denoising diffusion implicit model98.
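The parameter configuration above can be sketched as follows (a hypothetical helper; the argument names mirror the HuggingFace Diffusers inpainting API, but the exact checkpoint and calling code used for FHIBE are not reproduced here):

```python
import random

# Hypothetical sketch of the inpainting configuration described above.
def make_inpaint_kwargs(rng=random):
    return {
        "prompt": "a high-resolution image with no humans or people in it",
        "negative_prompt": "human, people, person, human body parts, "
                           "android, animal",
        # Guidance scale sampled per image from U(12, 16).
        "guidance_scale": rng.uniform(12, 16),
        "num_inference_steps": 20,
        # eta = 0 makes the sampler behave as a deterministic DDIM sampler.
        "eta": 0.0,
    }

# The dict would then be passed to a Diffusers inpainting pipeline, e.g.
# pipe(image=image, mask_image=mask, **make_inpaint_kwargs()).images[0]
```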

We also manually reviewed the images to ensure the correct removal of personally identifiable information and to identify any redaction artefacts. Around 10% of images had some content removed and in-painted. To evaluate any potential loss in data utility, we compared performance on a subset of tasks (that is, pose estimation, person segmentation, person detection and face detection) before and after removal and in-painting. No significant performance differences were observed.

To further address possible privacy concerns with the public disclosure of personal information, a subset of the attributes of consensual image subjects (that is, biological relationships to other subjects in a given image, country of residence, height, weight, pregnancy and disability/difficulty status) are reported only in aggregate form. Moreover, the date and time of image capture were coarsened to the approximate time and month of the year. Subject and annotator identifiers were anonymized, and Exif metadata from the images were stripped.

Consent revocation

We are committed to upholding the right of human participants to revoke consent at any time and for any reason. As long as FHIBE is publicly available, we will remove images and other data when consent is revoked. If possible, the withdrawn image will be replaced with one that most closely matches key attributes, such as pronoun, age group and regional ancestry. To the extent possible, we will also consider other features that could impact the complexity of the image for relevant tasks when selecting the closest match.

FHIBE derivative datasets

We release both the original images and downsampled versions in PNG format. The downsampled images were resized to have their largest side set to 2,048 pixels while maintaining the original aspect ratio. These downsampled versions were used in our analyses to prevent memory overflows when feeding images to the downstream models.

FHIBE also includes two face datasets created from the original images (that is, not the downsampled versions), both in PNG format: a cropped-only set and a cropped-and-aligned set. These datasets feature both primary and secondary subjects. For the cropped-and-aligned set, we followed a procedure similar to existing datasets99,100 by cropping oriented rectangles based on the positions of two eye landmarks and two mouth landmarks. These rectangles were first resized to 4,096 × 4,096 pixels using bilinear filtering and then downsampled to 512 × 512 pixels using Lanczos filtering101. Only faces with visible eye and mouth landmarks were included in the final cropped-and-aligned set.
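The geometric part of this alignment can be sketched as follows (a simplified illustration assuming only the four landmarks above; the scaling constant and the exact oriented-rectangle construction of the referenced procedures99,100 differ):

```python
import math

# Simplified sketch of landmark-based alignment: derive an oriented square
# crop from two eye landmarks and two mouth landmarks. The scale factor is
# an illustrative assumption, not the value used for FHIBE.
def aligned_crop_box(eye_l, eye_r, mouth_l, mouth_r, scale=2.0):
    """Return (centre, rotation angle in radians, side length)."""
    eye_c = ((eye_l[0] + eye_r[0]) / 2, (eye_l[1] + eye_r[1]) / 2)
    mouth_c = ((mouth_l[0] + mouth_r[0]) / 2, (mouth_l[1] + mouth_r[1]) / 2)
    # Rotation that makes the eye line horizontal.
    angle = math.atan2(eye_r[1] - eye_l[1], eye_r[0] - eye_l[0])
    side = scale * max(math.dist(eye_l, eye_r), math.dist(eye_c, mouth_c))
    centre = ((eye_c[0] + mouth_c[0]) / 2, (eye_c[1] + mouth_c[1]) / 2)
    return centre, angle, side
```

The oriented square would then be resampled to 4,096 × 4,096 pixels with bilinear filtering and downscaled to 512 × 512 pixels with Lanczos filtering (for example, PIL's `Image.LANCZOS`).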

For the cropped-only set, facial regions were directly cropped based on the face bounding box annotations, with each bounding box enlarged by a factor of two to capture all necessary facial pixels. This set includes images with resolutions ranging from 85 × 144 to 5,820 × 8,865 pixels. If facial regions extended beyond the original image boundaries, padding was applied using the mean value along each axis for both face derivative datasets.

Datasets for fairness evaluation

We evaluated FHIBE’s effectiveness as a fairness benchmarking dataset by comparing it against several representative human-centric datasets commonly used in the computer vision literature. These datasets were selected based on their relevance to fairness evaluation, the availability of demographic annotations, and/or their use in previous fairness-related studies. Our analysis is limited to datasets that are publicly available; we did not include datasets that have been discontinued, like the JANUS program datasets (IJB-A, IJB-B, IJB-C, IJB-D)102. The results are shown in Supplementary Information F.

Our COCO evaluation set is constructed from the MS-COCO 2014 validation split40, COCO Caption Bias103 and COCO Whole Body104 datasets. We used the images and annotations from the MS-COCO 2014 validation set, and added the perceived gender and skin tone (dark, light) annotations from COCO Caption Bias, excluding entries for which the label was ‘unsure’. We then used COCO Whole Body to filter the dataset for images containing at least one person bounding box. After filtering, this dataset contained 1,355 images with a total of 2,091 annotated person bounding boxes.

FACET24 is a benchmark and accompanying dataset for fairness evaluation, consisting of 32,000 images and 50,000 subjects, with annotations for attributes like perceived skin tone (using the Monk scale105), age group and perceived gender. For our evaluations, we used 49,500 person bounding box annotations and 17,000 segmentation masks, spread across 31,700 images.

Open Images MIAP42 is a set of annotations for 100,000 images from the Open Images Dataset, including attributes such as age presentation and gender presentation. In our evaluations, we used the test split, excluding images for which the annotations of age or gender are unknown, as well as the ‘younger’ category—to ensure that only adults were included in the evaluation. With this filtering, we used a set of 13,700 images with 36,000 associated bounding boxes and masks.

WiderFace81 is a face detection benchmark dataset containing images and annotations for faces, including the attributes perceived gender, age, skin tone, hair colour and facial hair. We used the validation split in our evaluations after excluding annotations for which perceived gender, age and skin tone were marked as ‘Not Sure’. After the filtering, we used a set of 8,519 face annotations across 2,856 files.

CelebAMask-HQ86 consists of 30,000 high-resolution face images of size 512 × 512 from the CelebA-HQ dataset, which were annotated with detailed segmentation of facial components across 19 classes. From this dataset, we used the test split in our evaluations, consisting of 2,824 images with binarized attributes for age, skin colour and gender.

CCv1106 contains 45,186 videos from 3,011 participants across five US cities. Self-reported attributes include age and gender, with trained annotators labelling apparent skin tone using the Fitzpatrick scale. For dataset statistics, we extracted a single frame per video. For Vendi score computation, we used 10 frames per video.

CCv226 contains 26,467 videos from 5,567 participants across seven countries. Self-reported attributes include age, gender, language, disability status and geolocation, while annotators labelled skin tone (Fitzpatrick and Monk scales), voice timbre, recording setups and per-second activity. For dataset statistics, we extracted a single frame per video. For Vendi score computation, we used three frames per video.

IMDB-WIKI107 is a dataset of public images of actors crawled from IMDB and Wikipedia. The images are annotated with the date on which they were taken, from which subject age could be derived. From this dataset, we randomly sampled 10% to use for the face verification task, resulting in 17,000 images.

Narrow models for evaluation

To assess the utility of FHIBE and the FHIBE face datasets, we compared the performance of specialized narrow models (spanning eight classic computer vision tasks) using both FHIBE and the pre-existing benchmark datasets listed above. As FHIBE is designed only for fairness evaluation and mitigation, we did not train any models from scratch. Instead, we evaluated existing, pretrained state-of-the-art models on our dataset to assess their performance and fairness. The results are shown in Supplementary Information F.

Pose-estimation models aim to locate face and body landmarks in cropped and resized images derived from ground truth person bounding boxes, following108,109,110. For this task, we used Simple Baseline108, HRNet109 and ViTPose110, all of which were pretrained on the MS-COCO dataset40.

Person-segmentation models generate segmentation masks that label each pixel of the image with specific body parts or clothing regions of a person. For this task, we used Mask RCNN111, Cascade Mask RCNN112 and Mask2Former113, all of which were trained on the MS-COCO dataset40.

Person-detection models identify individuals from images by relying on object detection models, retaining only the outputs for the class ‘person’. For this task, we used DETR114, Faster RCNN115, Deformable DETR116 and DDOD117 with the ResNet-50 FPN115 backbone, all of which were trained on the MS-COCO dataset40.

Face-detection models locate faces in images by predicting bounding boxes that encompass each detected face. For this task, we used the MTCNN118 model trained on VGGFaces2119 and the RetinaFace120 model trained on WiderFace81 using publicly available source code121,122.

Face-segmentation models generate pixel-level masks that classify facial regions into specific facial features (such as eyes, nose, mouth or skin) or background, enabling detailed facial analysis and manipulation. For this task, we used the DML CSR123 model trained on CelebAMask-HQ86.

Face-verification models determine whether two face images belong to the same person by comparing their facial features against a preset similarity threshold. For extracting facial features, we used FaceNet62 trained on VGGFaces2119, and ArcFace60 and CurricularFace61, both trained on refined MS-Celeb-1M124, using publicly available implementations61,121,125.

Face-reconstruction models encode facial images into latent codes and decode these codes back into images, enabling controlled manipulation of facial attributes. For this task, we used ReStyle126 applied over e4e127 and pSp128, and trained on FFHQ99.

Face super-resolution models generate high-resolution facial images from low-resolution inputs, enhancing facial details and overall image quality. For this task, we used GFP-GAN129 and GPEN130, trained on FFHQ99.

Narrow model evaluation metrics

We used the standard metrics reported in the literature to assess the performance of the narrow models on different tasks.

For pose estimation, we reported the percentage correct keypoints at a normalized distance of 50% of head length (PCK@0.5)131, which measures the portion of predicted landmarks (keypoints) falling within 0.5 × head-length radius from their true positions.
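The PCK@0.5 computation can be sketched as follows (a minimal illustration; keypoint visibility handling is omitted):

```python
import math

# Fraction of predicted keypoints falling within 0.5 × head length of
# their ground-truth positions (PCK@0.5).
def pck(pred, gt, head_length, alpha=0.5):
    hits = sum(
        1 for p, g in zip(pred, gt)
        if math.dist(p, g) <= alpha * head_length
    )
    return hits / len(gt)
```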

For person segmentation, person detection and face detection, we reported the average recall across intersection over union (IoU) thresholds ranging from 0.5 to 0.95 in steps of 0.05, to assess the average detection completeness of the models across multiple IoU thresholds.

For face segmentation, we reported the average F1 score (that is, the Sørensen–Dice coefficient94,95) across all segmentation mask categories, where F1 measures the intersection between the predicted and ground truth masks relative to their average size.

For face verification, we sampled image pairs of the same person (positive) and different people (negative) within each demographic subgroup. For each subgroup, we reported the true acceptance rate (TAR) at a false acceptance rate (FAR) of 0.001. TAR@FAR = 0.001 measures the proportion of correctly accepted positive pairs when the classification threshold is set to allow only 0.1% of negative pairs to be incorrectly accepted.
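TAR at a fixed FAR can be sketched as follows (an illustrative implementation; exact threshold interpolation conventions vary between libraries):

```python
# Pick the similarity threshold that admits the given fraction of negative
# (impostor) pairs, then measure the fraction of positive (genuine) pairs
# accepted at that threshold.
def tar_at_far(pos_scores, neg_scores, far=0.001):
    neg_sorted = sorted(neg_scores, reverse=True)
    # Threshold placed at the k-th highest negative score, where
    # k = floor(far * number of negatives); scores strictly above it
    # are accepted.
    k = int(far * len(neg_sorted))
    threshold = neg_sorted[k]
    return sum(s > threshold for s in pos_scores) / len(pos_scores)
```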

For face reconstruction and face super-resolution, we reported learned perceptual image patch similarity132, which evaluates the perceived visual similarity between reference image Iref and generated image Igen by comparing their feature representations extracted by a pretrained VGG16133 model.

For face reconstruction, we also assessed perceptual quality using peak signal-to-noise ratio and measured identity preservation using cosine similarity between facial embeddings of Iref and Igen extracted by a CurricularFace model61.

Dataset diversity

To compare FHIBE’s visual diversity with other datasets, we used the Vendi Score134,135, which quantifies diversity using a similarity function.

To construct the similarity matrix K, we first extracted image features (embeddings) using the self-supervised SEER136 model, which exhibits strong expressive power for vision tasks. We then constructed K by computing the cosine similarity between every feature pair. For extracting feature embeddings with SEER, all images were pre-processed using the ImageNet protocol: rescaling to 224 × 224 pixels and applying z-score normalization using the ImageNet per-channel mean and s.d.
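The Vendi score computation can be sketched as follows (a minimal re-implementation assuming features are already extracted; the reference implementation134 handles numerical details more carefully):

```python
import numpy as np

# Vendi score: the exponential of the Shannon entropy of the eigenvalues
# of K/n, where K is the n x n cosine-similarity matrix of the features.
def vendi_score(features):
    x = np.asarray(features, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    k = x @ x.T / len(x)                              # K / n
    eigvals = np.linalg.eigvalsh(k)
    eigvals = eigvals[eigvals > 1e-12]                # drop numerical zeros
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```

A dataset of identical images scores 1; a dataset of n mutually dissimilar images scores n.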

Bias discovery in narrow models

We tested and compared FHIBE’s capabilities for bias diagnosis using a variety of methods.

Benchmarking analysis

For this analysis, we evaluated FHIBE on seven downstream computer vision tasks: pose estimation, person segmentation, person detection, face detection, face parsing, face reconstruction and face super-resolution (face verification was excluded from this analysis because per-image scores cannot be computed for that task). For each task and its respective models, we obtained a performance score for each image and subject, enabling us to conduct a post hoc analysis to explore the relationship between labelled attributes and performance.

For every task and model, we performed the following analyses. For each annotation attribute (for example, hair colour), we first isolated individual attribute groups (for example, blond, red, white). For each group, we compiled a set of performance scores (for example, scores for all subjects with blond hair, red hair or white hair). Only groups with at least ten subjects were considered in the analysis. We next performed pairwise comparisons (for example, blond versus red, blond versus white) using the Mann–Whitney U-test to determine whether the groups had similar median performance scores (null hypothesis, two-tailed). To control for multiple comparisons, we applied the Bonferroni correction58 by adjusting the significance threshold based on the number of pairwise tests. For pairs with a statistically significant difference (\(P < \frac{0.05}{{\rm{number}}\,{\rm{of}}\,{\rm{pairwise}}\,{\rm{tests}}}\)), we identified the groups with the lowest and highest median scores as the worst group and best group, respectively, and computed the min–max group disparity, D, between them:

$$D=1-\frac{{\rm{MED}}({\rm{worst}}\,{\rm{group}})}{{\rm{MED}}({\rm{best}}\,{\rm{group}})},\,\,D\in [0,1],$$

where MED(g) denotes the median performance score for group g. A value D → 0 indicates minimal disparity, while D → 1 indicates maximal disparity. We repeated this process for each attribute, identifying group pairs with statistically significant disparities and their corresponding values. For each attribute, we selected the pair with the highest disparity.
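The group-comparison step can be sketched as follows (a simplified illustration that computes the Bonferroni-corrected threshold and the min–max disparity D; the Mann–Whitney U-tests themselves, for example via scipy.stats.mannwhitneyu, are omitted):

```python
from itertools import combinations
from statistics import median

# Illustrative min-max disparity across attribute groups. Assumes
# non-negative performance scores, so D lies in [0, 1].
def min_max_disparity(group_scores, min_size=10):
    groups = {g: s for g, s in group_scores.items() if len(s) >= min_size}
    pairs = list(combinations(groups, 2))
    alpha = 0.05 / max(len(pairs), 1)  # Bonferroni-corrected threshold
    medians = {g: median(s) for g, s in groups.items()}
    worst = min(medians, key=medians.get)
    best = max(medians, key=medians.get)
    d = 1 - medians[worst] / medians[best]
    return worst, best, d, alpha
```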

Direct error modelling

With this approach, we used regression analysis to examine which features were associated with reduced model performance. Although regression analysis is widely used to identify underlying relationships within datasets, its application to image data has traditionally been limited by the lack of extensive structured annotations. The comprehensive scope and detail of the FHIBE annotations, however, enabled us to apply this method effectively. For each task and model, we predicted the model’s performance on individual images as the target variable. To this end, we collected, processed and extracted a range of annotations related to both images and subjects, including features derived from pixel-level annotations, such as the number of visible keypoints or the presence or absence of visible head hair (the latter categorized as the binary attribute ‘bald’), which served as predictor variables. We used decision trees and random forests (an ensemble of decision trees) owing to their interpretability, modelling power and low variance, relying on the implementations available in the scikit-learn v.1.5.1 library. Feature importance was obtained from the random forest model by assessing how much each variable (for example, body pose) contributed to reducing variance when constructing decision trees, helping to identify the most predictive features. We then identified the most important features (the top six in most experiments) using the elbow method137. These selected features were then used in a decision tree model to assess the direction of their contribution to prediction, that is, whether higher feature values were associated with better or worse model performance. To assess the robustness and statistical significance of observed differences across subgroups, we conducted bootstrap resampling with 5,000 iterations to estimate standard errors. This approach enabled us to evaluate differences even within smaller intersectional subgroups.
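The regression step can be sketched as follows (synthetic placeholder features and target; in practice, the predictor variables come from the FHIBE annotations):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Predict per-image performance from annotation-derived features, then
# read off impurity-based feature importances. Data here are synthetic.
rng = np.random.default_rng(0)
n = 200
features = np.column_stack([
    rng.integers(0, 17, n),   # e.g. number of visible keypoints
    rng.integers(0, 2, n),    # e.g. binary 'bald' attribute
    rng.normal(size=n),       # e.g. an uninformative nuisance feature
])
# Synthetic target driven mostly by the first feature.
performance = 0.05 * features[:, 0] + 0.01 * rng.normal(size=n)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(features, performance)
importances = forest.feature_importances_  # normalized to sum to 1
```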

Error pattern recognition

We used association rule mining, a method frequently used in data mining to identify relationships between variables within a dataset. We applied association rule mining to identify attribute values that frequently co-occur with low performance. This approach enabled us to systematically identify and analyse patterns of bias within the model’s outputs. We used the FP-growth algorithm138. After obtaining the frequently occurring rules, we identified the attributes that are potential modes of error and investigated them further. We did this by studying the error disparities across the unique values of the attribute and evaluating its effect in conjunction with the sensitive attributes.
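Frequent co-occurrence mining can be sketched as follows (a minimal support-counting illustration; FP-growth138, the algorithm actually used, computes the same frequent itemsets far more efficiently on large data):

```python
from itertools import combinations
from collections import Counter

# Count the support of all itemsets up to max_len and keep those whose
# support meets the threshold; attribute values and hypothetical
# low-performance flags appear as items in each transaction.
def frequent_itemsets(transactions, min_support=0.5, max_len=2):
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            counts.update(combinations(items, k))
    return {itemset: c / n for itemset, c in counts.items()
            if c / n >= min_support}
```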

For face verification, we modified the protocol described above in the ‘Narrow model evaluation metrics’ section. Because we wanted to examine the whole dataset, unconstrained to specific attributes, positive and negative pairs were computed using all face images from the FHIBE face dataset. All possible positive pairs were computed (15,474 pairs), while negative pairs were sampled with the constraint described previously139 to extract hard pairs: the gallery and probe images had the same pronoun, and their skin colour differed by no more than one of the six possible levels, yielding 4,945,896 pairs.
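The hard-pair constraint can be sketched as follows (hypothetical subject records; the field names are illustrative, not the FHIBE schema):

```python
from itertools import combinations

# Enumerate hard negative pairs: different identities whose images share
# the same pronoun and whose skin-colour levels (1-6) differ by at most one.
def hard_negative_pairs(subjects):
    return [
        (a["id"], b["id"])
        for a, b in combinations(subjects, 2)
        if a["identity"] != b["identity"]
        and a["pronoun"] == b["pronoun"]
        and abs(a["skin"] - b["skin"]) <= 1
    ]
```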

Bias discovery in foundation models

Our analysis focuses on two foundation models: CLIP and BLIP-2. CLIP74 is a highly influential vision-language model that is widely recognized for its applications in zero-shot classification and image search. BLIP-275 advances vision–language alignment by using a captioning and filtering mechanism to refine noisy web-scraped training data, thereby enhancing performance in image captioning, VQA and instruction following.

CLIP

We used the official OpenAI CLIP model74. We analysed CLIP in an open-vocabulary zero-shot setting to examine the model’s biases towards different concepts, such as demographic attributes or scene types. For each value of the given attribute, we presented four distinct text prompts. These prompts were intentionally varied in wording to reduce potential bias or sensitivity to specific phrasing. The prompts were standardized, clear and consistent across values to minimize the influence of prompt engineering (the set of prompts is provided in Supplementary Information H). We further encoded FHIBE images using the CLIP image encoder, applying the same pre-processing function as the official implementation. We analysed different variants of the FHIBE dataset to control for various effects related to the human subject and image background. These variants included the original images, images with individuals masked in black, images with individuals blurred using a Gaussian blur of radius 100, and images with the background blacked out.

For the zero-shot classification analysis, we calculated the cosine similarity between the image embeddings and the text embeddings for each attribute. For example, for the scene attribute, we used two sets of prompts, each consisting of four text descriptions for indoor and outdoor environments. We computed the similarity between each text description and the image, selecting the description with the highest similarity as the assigned label for the image.
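The zero-shot assignment can be sketched as follows (synthetic embeddings; in practice the vectors come from CLIP’s image and text encoders):

```python
import numpy as np

# Assign the label whose best-matching text description has the highest
# cosine similarity with the image embedding, as described above.
def zero_shot_label(image_emb, prompts_per_class):
    img = np.asarray(image_emb, dtype=float)
    img = img / np.linalg.norm(img)
    best_label, best_score = None, -np.inf
    for label, prompt_embs in prompts_per_class.items():
        t = np.asarray(prompt_embs, dtype=float)
        t = t / np.linalg.norm(t, axis=1, keepdims=True)
        score = float((t @ img).max())  # best description for this class
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```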

BLIP-2

The analysis protocol for the BLIP-275 model consisted of specifying a prompt pair composed of an image and a text string in the form of a request or a question (the prompts are provided in Supplementary Information H). In our experiments, we instantiated the model to perform VQA tasks, using the HuggingFace open-source BLIP-2 implementation (Salesforce/blip2-opt-2.7b) for all our bias analysis experiments. We defined a set of prompts that would be used to probe the model along with images drawn from the FHIBE dataset. Each prompt from the prompt set was used to probe the model, and the final set of prompts was chosen to maximize the model response rate (that is, the prompts for which the model provided the most non-empty responses). The full set of prompts, their corresponding model response rates and the chosen prompts are provided in Supplementary Information H.

Each prompt from this chosen set was paired with every image from the FHIBE dataset and the pair was used to prompt the BLIP-2 model. Depending on the type of prompt, the model response was filtered to extract the relevant information. The filtering mechanism varied according to the prompt tone.

The positive prompts asked the model to determine the reason for likeability, while the neutral prompts asked the model for the person’s occupation based only on the provided image input. The model responses took forms such as ‘(S)he is likable because (s)he is a good person’ or ‘(S)he’s a teacher’. To filter such responses, we used regular expressions to isolate the predicted gender pronoun and the predicted occupation or likeability reason from the model response.

When the model was prompted with the negatively toned prompt about convictions, the model responses included toxic and discriminatory language. A model response was labelled toxic if it contained any of the words in our keyword set, as shown in Supplementary Information H.

For this analysis, we prompted BLIP-2 with questions about individuals’ social attributes, but we do not condone the use of these tasks outside of bias detection. Predicting social attributes from facial images has long been a popular but problematic task in computer vision. For example, the ChaLearn First Impressions Challenge140 tasked participants with predicting personality traits like warmth and trustworthiness from videos or images. Deep learning models have been used to map facial features to social judgements141,142. With the rise of foundation models, such uses have also emerged for VQA models, which have been employed to predict personality traits of individuals from a single image of them143.

Such tasks are highly problematic due to their reliance on physiognomic beliefs that personality traits or social attributes can be inferred from appearance alone144. We use such tasks in our paper solely to identify biases in the model, not to use the model’s inferences themselves. While VQA models should in theory refuse to answer such questions, BLIP-2 generally did answer them, with its answers revealing learned societal biases. Building on recent efforts to identify biases in VQA models by using targeted questions to identify biases145,146,147 (for example, “Does this person like algebra?” and “Is this person peaceful or violent?”), our work shows how FHIBE can reveal biases in foundation models and cautions against the flawed assumptions they may promote.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.