Main

Various groups, ranging from grassroots communities to academics to legislators, have drawn attention to and organized against the rise of mass surveillance8,9, arguing that artificial intelligence (AI) research—particularly in computer vision—serves as a foundation for the design, development and implementation of modern surveillance1,2,4,5,6,7. If these claims are true, the rapidly growing field of computer vision is contributing to the legacy of surveillance technologies, technologies that have infringed on privacy, limited free expression, exacerbated disparities and created conditions facilitating abuse of power4,6,9,10,11,12,13,14,15. A precise account of the pathway from computer-vision research to surveillance is valuable because it empowers individuals and communities to make informed decisions regarding their role and could effectively influence the development of computer vision and surveillance. We characterize the nature and extent of the surveillance AI pipeline and illuminate the critical role of computer vision in facilitating surveillance.

Computer vision refers to AI that focuses on measuring, recording, representing and analysing the world from visual inputs such as image and video data. Computer vision has historical roots in military and carceral surveillance, where it was originally developed to identify targets and gather intelligence in war, law enforcement and immigration contexts16,17,18. Over time, the priorities of computer vision have continued to be shaped by a confluence of social influences beyond individual researchers’ interests, including the interests of academic institutions, funding agencies, governments, companies and larger systemic pressures19. The field of computer vision now generally conceptualizes itself as a scientific, statistical, engineering and data-driven endeavour inspired by human vision20. An emphasis is often placed on mathematically well-founded approaches to training computers to interpret, classify, identify patterns in, model and reproduce the visual world20,21. Stated topics of interest include general and application-agnostic computer-vision techniques as well as applications such as robotics and autonomous driving20,22. Prestigious computer-vision conferences have also highlighted other applications, often referred to as ‘computer vision for social good’, such as computer-vision tools to facilitate designing new proteins, creating art and modelling climate change23,24. Yet, prominent computer-vision tasks, such as facial recognition, remain tightly tied to military and carceral use, a tie that heavily shapes core aspects and uses of these subfields18,25. This motivates interrogation into the extent to which the field of computer vision as a whole has been shaped in a way that continues to power mass surveillance.

Drawing upon surveillance studies, we define surveillance as an entity gathering, extracting or attending to data connectable to other persons, whether individuals or groups10. Frequently, bodies, behaviours, relationships, and social and physical environments are datafied, modelled and profiled. This formal conceptualization of surveillance includes a wide range of activities. In recent years, surveillance has become ‘extensive’: entities, who are often minimally visible, use big datasets and aggregation to extend their reach, accessing previously unseen persons, locations or information2,19,26. Prominent examples are practices where entities in positions of power observe, monitor, track, profile, sort or police individuals and populations in private and public spaces through devices such as CCTV, digital traces on social network sites or biometric monitoring of bodies1,4. Surveillance frequently occurs at loci of influence and control, such as the targeted recommendation and personalization algorithms that have become widely used on the internet27. Through such ubiquitously connected networks, data are gathered, shared and aggregated28. Many scholars emphasize that surveillance is inextricable from purposes such as influence, management, coercion, repression, discipline and domination10,29.

A foundational understanding in surveillance studies is that technologies and processes may enable surveillance even when they are not labelled as surveillance, are not universally perceived as nefarious or do not inflict immediate, visible violence4,15. Technologies that merely enable the possibility of monitoring human data suffice to foster conditions of fear and self-censorship, and this possibility is a key means of social control12. In some cases, technologies that monitor humans to enable surveillance may be proposed and perceived by particular communities as connected to benevolent purposes. Importantly, these technologies nonetheless constitute tools enabling surveillance. Because value assessments are subjective and contested, the same technologies may be perceived and experienced by other communities as oppressive, and such technologies are frequently ‘spun into’ mass surveillance by entities in positions of power25,30,31,32. In Supplementary Information sections 1.3, 1.4 and 1.6, we provide a more extensive review of the contextualizing literature and descriptions of the types of data transferal and institutional uses that contribute to surveillance.

To study the pathway from computer-vision research to surveillance, we collected and analysed a corpus linking more than 19,000 computer-vision research papers from the longest standing computer-vision conference, the Conference on Computer Vision and Pattern Recognition (CVPR), to more than 23,000 downstream citing patents. Using a mixed methods content analysis and large-scale lexicon-based analysis, we characterized the roots, extent, evolution and obfuscation of computer-vision-based surveillance, collectively forming what we identify as the surveillance AI pipeline.

The extent of human data extraction

Extensive evidence shows public distrust and fear concerning the capturing and monitoring of human data, including substantial concern about computer-vision technologies operating on data ranging from online personal data traces to biometric and body data9,25,33. To identify the potentially numerous and subtly expressed variants of human data extraction actively enabled by computer vision, we conducted a mixed methods content analysis of a randomly sampled subset of the corpus. We analysed 100 computer-vision papers and 100 downstream patents, annotating all declarations and demonstrations of human data extraction (herein we refer to individual documents as ‘Paper X’ or ‘Patent X’; further details can be found in ref. 34). In the context of manual content analysis, this constitutes an in-depth, large-scale analysis.

Quantitative analysis

We quantify the uncovered types of human data targeted in the computer-vision papers and downstream patents in Fig. 1a. We additionally present a stratification of these data, comparing human data extraction in papers versus patents, in Supplementary Information Fig. 2. We found that 90% of papers and 86% of downstream patents extracted data relating to humans. Most (71% of the papers and 65% of the patents) explicitly extracted data about human bodies and body parts. In particular, 35% of papers and 27% of patents targeted human body part data, and at least another third of both papers and patents (36% of papers and 38% of patents) claimed or demonstrated targeting human bodies at large. Another portion of papers and patents (18% of papers and 16% of patents) extracted data about human spaces. We found that a small number of papers and patents (1% of papers and 5% of patents) presented their technology as useful only for analysing non-body-related socially salient human data. Finally, the remaining small portion of papers and patents (9% of papers and 13% of patents) claimed to capture and analyse ‘images’, ‘text’, ‘objects’ or similarly generic terms, leaving unstated whether they anticipated these categories would include humans or human data. Strikingly, only 1% of papers and 1% of patents were dedicated to extracting only non-human data, revealing that both computer-vision research and its applications are extensively involved with datafying humans, specifically human bodies.

Fig. 1: Human data extraction in computer-vision papers and downstream patents.
figure 1

a, Relative frequencies of data types extracted from computer-vision papers and patents. For each year from 2010 to 2019, we randomly sampled and analysed ten paper–patent pairs (n = 200). Most of the annotated computer-vision papers and patents (88%, s.d. = 5.7%) refer to data about humans. Most of the papers and patents (68%, s.d. = 4.7%) specifically refer to data about human bodies and body parts. Only 1% (s.d. = 0.7%) of the papers and patents targeted exclusively non-socially salient data. b, Examples of images analysed in computer-vision papers. For a random sample of the computer-vision papers (n = 50), we display one example of an image analysed by the paper. For papers that analysed any images containing humans, we display an example of these images of humans (highlighted in red). For papers that did not analyse any images of humans, we display an example of these non-human-depicting images (highlighted in grey). Many papers analysed images of humans. Images in b are adapted from the following references and are, unless otherwise stated, from IEEE, under a Creative Commons licence CC BY ND. Top row, left to right: refs. 51,52,53,54,55,56,57,58,59,60. Second row, left to right: refs. 61,62,63,64,65,66,67,68,69,70. Third row, left to right: refs. 71,72,73,74,75,76,77,78,79,80. Fourth row, left to right (except second image): refs. 81,82,83,84,85,86,87,88,89. Fifth row, left to right: refs. 90,91,92,93,94,95,96,97,98,99. Fourth row (second image): ref. 100, arXiv, under a non-exclusive licence to distribute.

Qualitative analysis

Our findings challenge narratives that most kinds of computer vision and data extraction are largely benign or harmless and that only a small portion is harmful. Rather, we found that the computer-vision papers and patents prioritize intrusive forms of data extraction that are well established in surveillance studies scholarship. The four targets of human data extraction that emerged during the content analysis form a series of increasingly focused categories: socially salient human data, human spaces, human bodies and human body parts. We present the uncovered types of human data extraction alongside textual examples in Table 1.

Table 1 Targets of human data extraction in computer-vision papers and patents

The papers and patents broadly prioritize tasks targeting ‘human body part’ data, particularly facial analysis, and sometimes enable activity classification. This validates the substantial concerns that have been put forth regarding the collection, aggregation and sharing of biometric and related body part data35. Biometrics (for example, faces, fingerprints and gait), which constitute uniquely personal data that are often inseparable from identities, have proliferated as a form of surveillance in recent years36. Their pervasiveness has significantly infringed on fundamental human rights, including the rights to privacy and freedom of expression and movement9,37.

The papers and patents targeting ‘human bodies’ at large frequently targeted humans in the midst of everyday activities (for example, walking, shopping and attending group events), and the named purposes included body detection, tracking and counting, as well as security monitoring and human activity recognition. The dominance of analysis of human bodies in everyday settings aligns with Browne’s account of new surveillance4, which characterizes these practices as often undetected (for example, cameras hidden in everyday benign objects) or even invisible. In these forms, data are frequently collected without the consent of the target and then shared, permanently stored and aggregated. Browne4 characterized surveillance as the focused monitoring and cataloguing of that which was previously left unobserved, with the human body as a primary site of surveillance.

Beyond human bodies, the analysis of ‘human spaces’, such as homes, offices and streets, is widespread. Scene analysis, understanding or recognition is presented as a core contribution of the field in this portion of papers and patents. Additionally, analyses of ‘other socially salient human data’ appeared in a small portion of papers and patents. Gradually rendering these public and social spaces visible and amenable to observation is a fundamental mechanism of surveillance38. These forms of extraction contribute to the gradual cataloguing, documenting, mapping and monitoring of human affairs in all their rich complexity1,7. This accumulates into what Zuboff calls the condition of ‘no exit’, where there are diminishing spaces in which to opt out or ‘disconnect’7. Taken together, our analysis indicates that the forms of data extraction presented in these papers and patents align with and facilitate established, intrusive forms of surveillance.

The rise of surveillance AI

To study the roots and evolution of computer-vision-based surveillance, we conducted a large-scale lexicon-based analysis. We identified surveillance-enabling patents by scanning the corpus of patents for those containing words in the verified surveillance indicator lexicon. The full method, including descriptions of the surveillance indicator keywords and the extensive lexicon verification process, is described in ‘Lexicon-based analysis’ in Methods. Figure 2a,b presents the evolution of CVPR papers and downstream patents. We found a substantial increase in papers used in surveillance-enabling patents. Comparing decades, we found that the 1990s produced significantly fewer computer-vision papers with downstream patents than the 2010s, and only around half of these (53%, s.d. = 1%, n = 664) were used in surveillance-enabling patents. Two decades later, in the 2010s, the number of computer-vision papers with downstream patents had more than tripled, and 78% of these (s.d. = 1%, n = 2,327) were used in surveillance-enabling patents. The twin forces of the increase in computer-vision papers with downstream patents and the increase in the proportion of these used in surveillance-enabling patents combined to large effect: from the 1990s to the 2010s, there was a more than fivefold increase in the number of computer-vision papers used in surveillance-enabling patents.

Fig. 2: The rise of computer vision with downstream surveillance.
figure 2

a, Across three decades of patented computer-vision papers (n = 11,917), there has been a steady increase in the proportion used in surveillance-enabling patents. Whiskers represent the standard deviation. b, Number of computer-vision papers used in surveillance-enabling patents. Comparing the 1990s to the 2010s, the number of computer-vision papers used in only non-surveillant patents has remained relatively stable, whereas the number of computer-vision papers used in surveillance-enabling patents has risen more than fivefold. Whiskers represent the standard deviation. c, Differences in word frequencies between paper titles from the 1990s versus those from the 2010s. We report highly polarized words (with z scores computed using weighted log odds ratios). All word associations shown are statistically significant (P < 0.01). There is a clear qualitative shift from the more generic paper focus of the 1990s (turquoise bars) to an increased focus on analysing semantic categories and humans (for example, ‘semantic’, ‘action’ or ‘person’) in the 2010s (pink bars).

We gained further insight into the evolution of computer vision by inductively identifying patterns of linguistic change. To study the linguistic change that has occurred over the past several decades, we compared the log odds ratios of word frequencies in paper titles from the 1990s versus those from the 2010s. We used an informative Dirichlet prior to obtain measures of statistical significance and to control for variance in word frequencies39. Figure 2c shows highly polarized word associations in both directions with computed z scores. More methodological details are presented in ‘Longitudinal analysis’ in Methods. We see a clear qualitative shift from the more generic, application-ambiguous language of the 1990s (for example, ‘shape’, ‘edge’ and ‘surfaces’; turquoise bars) to an increased focus in the 2010s on analysing semantic categories and humans (for example, ‘semantic’, ‘action’ and ‘person’; pink bars). As this was an inductive analysis that identified the dominant patterns of linguistic change, this finding indicates not only that there has been a major change towards enabling surveillance but also that it is one of the most salient changes to have occurred in the field over the past several decades. Taking our results together, we infer that the language and patenting practices in computer vision have evolved in ways that increasingly focus on analysing humans and enabling surveillance.
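For context, the weighted log odds estimator of ref. 39 can be written as follows. This is a sketch in our own notation rather than a verbatim reproduction of our implementation: y_w(a) and y_w(b) denote the counts of word w in the 1990s and 2010s titles, n(a) and n(b) the total token counts, alpha_w the prior count of w (estimated here from the pooled corpus) and alpha_0 the sum of all prior counts.

```latex
% Weighted log odds with an informative Dirichlet prior (ref. 39); a sketch.
\hat{\delta}_w
  = \log \frac{y_w^{(a)} + \alpha_w}{n^{(a)} + \alpha_0 - y_w^{(a)} - \alpha_w}
  - \log \frac{y_w^{(b)} + \alpha_w}{n^{(b)} + \alpha_0 - y_w^{(b)} - \alpha_w},
\qquad
\sigma^2\!\left(\hat{\delta}_w\right)
  \approx \frac{1}{y_w^{(a)} + \alpha_w} + \frac{1}{y_w^{(b)} + \alpha_w},
\qquad
z_w = \frac{\hat{\delta}_w}{\sigma\!\left(\hat{\delta}_w\right)}
```

A positive z_w indicates overrepresentation in the 1990s titles and a negative z_w indicates overrepresentation in the 2010s titles (or the reverse, depending on the ordering of the two corpora).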

The normalization of surveillance AI

Surveillance technology does not emerge in a vacuum. Research and subsequent applications are actively conducted, incentivized, funded and commercialized by numerous stakeholders. In the previous section, we applied a large-scale lexicon-based analysis to identify surveillance-enabling patents. In this section, we consider the institutional affiliations, national affiliations and subfields of the computer-vision papers, and we study the links from these entities to the identified surveillance-enabling patents. The full method, including descriptions of the surveillance indicator keywords and the extensive lexicon verification process, is described in ‘Lexicon-based analysis’ in Methods.

Figure 3a presents the top ten institutions and nations authoring the most CVPR papers with downstream surveillance-enabling patents, as found in the corpus. As shown, for each of these top institutions and nations, most of the patented papers have been used in surveillance-enabling patents. These include ‘big tech’ corporations and elite universities, many of which have been identified as top producers of computer science papers and computer-vision papers generally40. Additionally, many of the institutions we identified as authoring a substantial number of papers with downstream surveillance-enabling patents have been identified by previous research as aligning with well-established historical legacies of the military–industrial–academic complex41.

Fig. 3: Field-wide dominance of downstream surveillance.
figure 3

a, Institutions and nations that have produced the most CVPR papers with downstream surveillance-enabling patents. For each institution or nation, most patented papers have been used in surveillance-enabling patents. b, Percentage of patented papers used in surveillance-enabling patents, stratified by institution, nation and subfield. For each institution, nation or subfield that has published at least ten papers with downstream patents, we show the percentage of these papers that have been used in surveillance-enabling patents (vertical grey bars) (n = 13,804, n = 18,272 and n = 19,413, respectively). We found a pervasive norm: if an institution, nation or subfield authors papers with downstream patents, most are used in surveillance-enabling patents (vertical grey bars are frequently above the 50% threshold, shown as the orange line). Whiskers represent the standard deviation.

To understand the influence of nations on surveillance-enabling patents, we additionally present in Fig. 3a the distribution of ties to surveillance across nations. The nations were obtained from the location of paper authors’ institutional affiliations. The top two nations producing papers with downstream surveillance-enabling patents are the USA and China by a large margin, with the USA producing more of these papers than the next several nations combined. Our findings correspond to previous reports about AI-driven surveillance across nations, which state that on a global scale, China and the USA are the main drivers in supplying both AI and advanced surveillance technologies42.

These findings provide the basis for a salient question: are only a few key entities and authors contributing to surveillance or are ties from research to surveillance found across the field? We found substantial evidence showing a pervasive fieldwide norm: when an institution or nation authors computer-vision papers with downstream patents, most have been used in surveillance-enabling patents (Fig. 3b; the vertical grey bars for institutions and nations are frequently above the orange 50% threshold). This norm describes the behaviour of 71% of institutions (575 out of 805) and 78% of nations (45 out of 58), which may provide evidence for the wide-spanning normalization of computer vision used in surveillance. Similarly, we found substantial evidence against the notion that there are merely a few implicated applications of computer-vision research within a broader non-surveillance-oriented field. Rather, we found an extension of the above norm: when a subfield produces computer-vision papers with downstream patents, most have been used in surveillance-enabling patents (Fig. 3b; the vertical grey bars for most subfields are above the orange 50% threshold). It may be expected that the stated norm describes frequently implicated subfields such as facial recognition, but in fact, we found that the norm describes most subfields (69%; 2,922 out of 4,247). We present the details of which entities are implicated in this norm in Supplementary Information section 1.5. Our findings indicate that, across institutions, nations and subfields, the practice of producing computer vision that enables surveillance is a pervasive norm.
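To make the computation behind this norm concrete, the following is a minimal sketch of how the per-institution percentages underlying Fig. 3b can be derived. The record fields (institution, has_downstream_patent, has_surveillance_patent) are hypothetical stand-ins for our corpus metadata, and the actual pipeline differs in detail.

```python
from collections import defaultdict

def institutions_meeting_norm(papers, min_patented_papers=10):
    """For each institution, compute the share of its patented papers that are
    used in surveillance-enabling patents, and report those above 50%.

    `papers` is an iterable of records with hypothetical fields:
      - institution: str
      - has_downstream_patent: bool
      - has_surveillance_patent: bool (any downstream patent matched the lexicon)
    """
    patented = defaultdict(int)
    surveillant = defaultdict(int)
    for p in papers:
        if p["has_downstream_patent"]:
            patented[p["institution"]] += 1
            if p["has_surveillance_patent"]:
                surveillant[p["institution"]] += 1
    # Mirror the >=10 patented-papers filter used for Fig. 3b (an assumption
    # about how the filter interacts with the norm statistic).
    shares = {inst: surveillant[inst] / n
              for inst, n in patented.items() if n >= min_patented_papers}
    above_threshold = {inst: s for inst, s in shares.items() if s > 0.5}
    return above_threshold, shares
```

The same grouping can be applied with nation or subfield in place of institution to reproduce the other two stratifications.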

The obfuscating language of surveillance AI

Finally, in addition to our analysis of ties to surveillance, we found recurring use of obfuscating language that minimized or sidestepped mentions of surveillance. Drawing upon our manual inspection of 100 computer-vision papers and 100 downstream patents, we highlight and describe two salient qualitative themes that emerged:

Theme 1. What is said: humans are subsumed under the term ‘objects’.

“Since the surveillance system detects and can be interested on vehicles, animals in addition to people, hereinafter we more generally refer to them with the term moving object”. (Paper 53)

During our qualitative analysis, we frequently encountered papers and patents stating that they use the term ‘objects’ as shorthand for entities including humans, human body parts, vehicles, students and pedestrians. Establishing this explicit conceptualization of humans as merely a kind of object, as many papers and patents do, enables those documents, and crucially all other papers and patents, to discuss only problems related to ‘objects’ or ‘scenes’ while relying on the understanding that humans are objects. Because humans are considered objects and because scenes contain humans, documents can rely on the covert assumption that any paper or patent that discusses objects or scenes (most of the field) may enable human data extraction and surveillance. Theme 2 describes one form of reliance on this covert assumption. Reflecting the continuing tight relationship between computer vision and surveillance, a paper about panoptic segmentation makes no distinction when summarizing the field: ‘Early work on face detection … helped popularize bounding-box object detection. Later, pedestrian detection datasets helped drive progress in the field’ (Paper 96).

Theme 2. What is not said: even when documents do not mention humans, the figures or datasets may contain images of humans.

The pattern of papers and patents claiming to target objects, while briefly defining the term as subsuming humans, sets a clear precedent that we found has already played out: other documents lean on these norms. Although claiming to target objects, they in actuality target humans and, thus, leave no textual trace of the human data extraction in which they are engaged. For example, one paper describes itself as improving object classification and makes no mention of humans. Yet, close inspection of the first figure in the paper reveals (in 3-point font) that it classifies so-called objects into classes including ‘person’, ‘people’ and ‘person sitting’ (Paper 5). A second paper describes itself as identifying salient regions of images and does not mention humans. Yet, inspection of the datasets reveals that the authors demonstrate their technology by detecting regions of interest such as humans walking on a sidewalk (Paper 1). Figure 1b presents examples of images targeted in a random sample of the computer-vision papers (n = 50). Our annotators observed that papers frequently analysed images depicting human bodies, including such images in datasets and often featuring them in figures, despite many papers never explicitly mentioning humans.

The nature of these themes is such that, first, humans and objects of all kinds may be targeted in parallel, despite the vastly different implications, and, second, humans can be primary targets of technologies without leaving a textual trace of surveillance.

Discussion

The field of computer vision has frequently emphasized its place as a scientific and statistical endeavour inspired by human vision, referencing a range of commercial and industrial applications and highlighting the use of computer vision for good20. Yet, based on the studies presented in this paper, we contend that such characterizations of the field underrecognize or misrecognize the extent to which the field, taken as a whole, simultaneously engages in the mass extraction of human data7,43,44. Cutting across research motivations and subfields, we found a fieldwide norm in which the analysed computer-vision papers and patents extensively and increasingly extract human body data and other socially salient data. The normalization of such extraction is particularly striking when considered alongside evidence that the field frequently fails to address concerns regarding the use of human data. Exemplifying this are the themes in which the analysed papers and patents obfuscate their extraction of human data and conceptualize humans as objects to be studied without special consideration. These patterns align with existing literature that has established that AI research frequently fails to mention or mitigate concerns regarding human agency, consent or privacy and fails to engage with these ethical considerations in many of the ways expected of other fields that analyse human data17,18,45. Our findings do not comment on the intentions of computer-vision researchers. Rather, they bring into focus the systematic pattern of extracting human data and enabling surveillance.

Although many computer-vision researchers conceptualize the overall rise and proliferation of computer-vision technologies as field success, this rapid proliferation might alternatively be understood as the perpetual practice of rendering visible what was previously shielded and unseen, a practice that surveillance studies scholars such as Browne4 view as the core of surveillance. Technologies that enable the monitoring of human data, which may be perceived as differentially malevolent or benevolent by different communities, nonetheless have historically established consequences: these technologies engender fear and self-censorship; it is lucrative and standard practice for entities in positions of relative power to use these technologies to access, monetize, coerce, control or police individuals or communities with lesser power; and these technologies are frequently deputized by state surveillance organizations2,4,25. Crucially, in addition to individualized consequences, the rapid generation and proliferation of technologies monitoring humans accumulates to what Zuboff7 calls the condition of ‘no exit’, where there are fewer and fewer spaces left to opt out, ‘disconnect’ and seek respite.

The uncovered features of computer vision tie into a broader literature about the historical narrative of neutrality in science. Scientific findings are frequently presented as facts that emerge from an objective ‘view from nowhere’ in a historical, cultural and contextual vacuum. Such views of science as ‘value-free’ and ‘neutral’ have been deconstructed by a variety of scholarly traditions, including the philosophy of science, science and technology studies, and feminist and decolonial studies. A purported view from nowhere is always a view from somewhere and usually a view from those with the greatest power46. Social and cultural histories and norms, funding priorities, academic trends, researcher objectives and research incentives, for example, all inevitably constrain and shape the production of scientific knowledge47,48. An assemblage of social forces has shaped computer vision, resulting in a field that now mass-produces highly specific technologies. Viewing computer vision in this light, it becomes clear that shifting away from surveillance requires not a small shift in applications but a reckoning with, and challenging of, the foundations of the discipline.

Rapidly evolving AI research agendas, narratives, norms and policies afford opportunities to intervene. For individuals and communities concerned about surveillance, there are historical precedents and frequent examples in which key figures have made informed decisions regarding the role they wish to play, for example, by adopting critical technical practice, exercising the right to conscientious research, including the right to conscientious objection, collectively protesting against and cancelling surveillance projects, and changing their focus to study the ethical dimensions of a field, educate the public or put forward informed advocacy49,50. In this context, this paper serves to illuminate the roots, extent, evolution and obfuscation of the surveillance AI pipeline and, in doing so, aims to provide access to information with which individuals and communities may understand, influence or disrupt these pathways to surveillance.

Methods

Corpus of computer-vision papers and downstream patents

To study the pathway from computer-vision research to surveillance, we collected and analysed a corpus linking more than 19,000 computer-vision research papers to more than 23,000 downstream patents. Research papers and patents have unique advantages that make them revealing artefacts. First, they are primary sources written in the researchers’ and patenters’ own words, and there exist professional and institutional expectations that they accurately describe the research and technologies. The connections between research papers and citing patents serve as a rich data trail of the path from research to applications101,102. These documents also include comprehensive metadata, as papers necessarily list their authors, the authors’ primary institutional affiliations and the publication year, thus enabling analyses of how these factors influence the pathway to applications. They are available online, and they have a consistent overall structure, facilitating consistent annotation and reliable comparisons. These papers and their collected downstream patents served as the basis for the mixed methods content analysis and large-scale lexicon-based analysis presented in this paper. We studied papers published in the proceedings of the longest standing computer-vision conference, CVPR, as metrics indicate that it has the highest impact of all computer-vision conferences by an extremely large margin. By h5-index, CVPR proceedings are among the top five highest impact publications in any discipline, alongside Nature and Science. This research is widely seen as an ‘indicator of hot topics for the AI and machine learning community’103. Acceptance and publication are marks of approval of the research as work that exemplifies the core values of the computer-vision community. As such, these papers both represent the state of the art in current computer vision and effectively reveal the values held in high regard within the community. We obtained all the proceedings published from 1990 to 2021, and, for each paper that has been cited in one or more patents, we obtained all citing patents. We refer to these as a patented paper and its downstream patents. Extended Data Fig. 1 presents randomly sampled pairs consisting of a paper and a downstream patent and is, thus, a snapshot of our corpus.

Implementation

We analysed the corpus of CVPR papers from 1990 to 2021. CVPR was not held in 1990, 1995 or 2002, so there are no papers from those years. In constructing our corpus, we leveraged and linked the papers in the Microsoft Academic Graph104, the paper–patent citation linkages inferred by Marx and Fuegi105 and the patents in the Google Patents database. Manual verification found the paper–patent citation linkages to have over 99% precision and 78% recall. All papers presented at CVPR were published in English. For patents that were published in other languages, the English translations in Google Patents were used.
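As a schematic of this corpus construction, the sketch below joins the three sources. The file names and column names are hypothetical stand-ins for the Microsoft Academic Graph export, the Marx and Fuegi linkage data and Google Patents records, not the actual data formats.

```python
import pandas as pd

# Hypothetical inputs: CVPR papers from the Microsoft Academic Graph, the
# paper->patent citation links of Marx and Fuegi, and Google Patents records.
papers = pd.read_csv("mag_cvpr_papers.csv")    # columns: paper_id, year, title, ...
links = pd.read_csv("marx_fuegi_links.csv")    # columns: paper_id, patent_id
patents = pd.read_csv("google_patents.csv")    # columns: patent_id, text, ...

# Keep proceedings from 1990 to 2021 (CVPR was not held in 1990, 1995 or 2002,
# so those years contribute no papers).
papers = papers[papers["year"].between(1990, 2021)]

# A 'patented paper' is any paper cited by at least one patent; its citing
# patents are its 'downstream patents'.
corpus = (papers.merge(links, on="paper_id")
                .merge(patents, on="patent_id"))
n_patented_papers = corpus["paper_id"].nunique()
n_downstream_patents = corpus["patent_id"].nunique()
```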

Content analysis

Following best practices in content analysis, we conducted an in-depth analysis of a purposive sample of papers and patents distinctively informative of the development of computer-vision research and applications for enabling surveillance. For each year from 2010 to 2019, we randomly sampled ten paper–patent pairs that consisted of a CVPR research paper published in the year and a downstream patent. This formed a total of 100 papers and 100 downstream patents. In the context of content analysis, this constitutes a large-scale annotation.
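A minimal sketch of this year-stratified sampling follows, assuming a hypothetical mapping from publication year to that year's list of paper–patent pairs; the fixed seed is illustrative, not a record of our procedure.

```python
import random

def sample_pairs(pairs_by_year, per_year=10, years=range(2010, 2020), seed=0):
    """Randomly sample `per_year` paper-patent pairs for each year.

    `pairs_by_year` maps a publication year to the list of (paper, downstream
    patent) pairs for papers published that year (hypothetical structure;
    assumes each year has at least `per_year` pairs).
    """
    rng = random.Random(seed)
    sample = []
    for year in years:
        sample.extend(rng.sample(pairs_by_year[year], per_year))
    return sample  # 10 years x 10 pairs = 100 papers and 100 patents
```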

We conducted the content analysis using a close reading of the documents and a rigorous qualitative methodology. An interdisciplinary six-person team analysed the documents using an integrated inductive–deductive methodology. In the inductive component, each document was read line by line, including figures. We inductively coded key emergent features in the treatment of human data by the technology and iteratively accumulated a list of these key features and their relationships. We complemented this with a deductive component to ensure that we actively looked for and captured instances of papers and patents with key features that inhibited usage for surveillance, even if rare. The inductive and deductive codes were ultimately clustered into data type, data transfer and data use. The codes are discussed in this section as well as in ‘The extent of human data extraction’ section and Supplementary Information sections 1.3 and 1.4. During this process, our annotation team had several strengths: our team included both published experts in computer vision and field outsiders to allow expert insights and translation as well as fresh perspectives that could illuminate computer-vision disciplinary biases. We used the constant comparative method. Throughout the coding process, the team held frequent, extensive discussions to develop the precise meanings of codes and their relationships and to revise and refine the code list. At the end of all coding, the team unanimously agreed upon the key emergent dimensions and features of the treatment of human data by the technologies, along with the relationships among these dimensions and features. Additionally, as we coded papers and downstream patents, we encountered and discussed salient examples of obfuscating language being used to describe or avoid describing surveillance, and we present these findings in ‘The obfuscating language of surveillance AI’ section.

Based on our in-depth, interdisciplinary content analysis, we present the surveillance AI typology in Supplementary Fig. 1, which brings to the fore the dimensions, features and dynamics of the treatment of human data in computer vision and connects these to concepts in surveillance studies that elucidate the complexity and consequences of these particular findings. Our analysis identified three key dimensions capturing the treatment of human data by these technologies: (1) Data type—what type of data does the technology extract, attend to, capture, monitor, track, profile, compute or sort and to what extent is it human and personal? (2) Data transferal—to what extent do the data remain under the control of the datafied person or become transferred to others? (3) Use of data—for what purpose are the data used? These three dimensions are discussed in detail, with examples and analysis, in ‘The extent of human data extraction’ section and Supplementary Information sections 1.3 and 1.4.

We discuss the primary dimension of the typology in detail in ‘The extent of human data extraction’ section. In this primary dimension, the inductively identified types of human data extracted form a series of nested, increasingly focused categories: socially salient human data, human spaces, human bodies and human body parts. A fifth inductively identified target of data extraction was general and unspecified data, which tended to target generic tasks such as ‘identifying objects’ and did not specify targeting human data but also did not commit to targeting only non-human data. In addition to these data types, which were inductively found only through a close reading of the papers and patents, the annotation team deductively included non-human data in the annotation scheme from the start. This was to ensure that we captured mentions of any non-surveillance technologies in papers and patents, even if rare. To enable a quantitative analysis of this primary dimension, we identified for each paper and patent the innermost (most focused) type of human data extracted. Half of the documents were annotated by more than one annotator, which was particularly valuable for allowing the annotators to become accustomed to types of cases in which a single sentence or figure influenced the appropriate code. The existence of such cases is discussed in ‘The obfuscating language of surveillance AI’ section. In these cases with several annotators, the final code of each document was determined through discussion until consensus was reached. We then quantified the annotations for all documents and present the relative frequencies of the data types in Fig. 1a and Supplementary Fig. 2. Figure 1b additionally presents, for each of the first 50 annotated papers, one example of an image analysed by that paper. We found that the second and third dimensions of the typology were less consistently discussed in papers and patents. Nonetheless, key areas of surveillance studies scholarship are dedicated to how these dimensions (data transfer and data use) are important to understanding the roles, dynamics and consequences of surveillance. Given the importance of these dimensions, Supplementary Information sections 1.3 and 1.4 include a full discussion of these dimensions, the inductive and deductive codes, demonstrative examples and findings, and connections to nuanced dynamics of surveillance that have been discussed in the surveillance studies literature.

Lexicon-based analysis

Surveillance indicator lexicon

As introduced at the start of this article, drawing upon surveillance studies, we define surveillance as an entity gathering, extracting or attending to data connectable to other persons, whether individuals or groups10.

During our manual content analysis of computer-vision papers and downstream patents, careful attention was paid to sentences in patents revealing that a patent enabled the above conceptualization of surveillance, that is, sentences revealing that the patent reported itself as gathering, extracting or attending to data connectable to other persons. During the content analysis, the team accumulated a list of candidate surveillance indicator keywords (one- or two-word phrases) that featured centrally in these sentences and that, in one or more encountered documents, played a key role in revealing that a patent enabled this conceptualization of surveillance.

After constructing this candidate keyword list, we began an extensive pruning process, as the aim was to create a reliable lexicon of keywords indicating technologies enabling surveillance. Our preference was to err on the side of more pruning, so that we ultimately undercounted rather than overcounted surveillance-enabling patents. Accordingly, we applied two phases of pruning. In the first phase, for each candidate keyword, we scanned the corpus for all patents containing this keyword and produced a random sample of ten such patents. Three team members conducted an independent manual inspection of these patents. After the manual inspection, the three team members came together to identify and remove, by consensus, candidate keywords that were not reliable indicators (typically because we found the keyword had other word senses or usages; for example, a ‘store’ could be a human space but was frequently a technical term related to data or memory storage, so ‘store’ was removed from the list). To strengthen the reliability of the lexicon, we undertook a second pruning phase two months later. During this second phase, for each candidate keyword, we obtained a random sample of nine patents containing the keyword to serve as a verification sample. Each of these patents was assigned to a team member, who was provided with the full text of the patent as well as the paragraph and sentence containing the keyword. The team members independently annotated, for each assigned patent, whether it enabled the above conceptualization of surveillance. If a team member encountered any instance of a keyword being used with a word sense clearly different from the word sense expected and theoretically connected to the above conceptualization of surveillance, the keyword was removed from the lexicon. If a team member encountered any instance of a patent that did not enable the above conceptualization of surveillance, the respective keyword was removed from the lexicon. During this second phase, each keyword was required to meet a strict threshold: 100% precision, on the verification sample, in predicting that a patent enables the above conceptualization of surveillance.
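The sampling step of each pruning phase can be sketched as follows; the record structure and the substring matching rule are hypothetical simplifications (real keyword matching may, for example, handle word boundaries and tokenization differently).

```python
import random

def verification_sample(patents, keyword, k=10, seed=0):
    """Randomly sample up to k patents whose text contains the candidate
    keyword, for independent manual inspection by team members.

    `patents` is an iterable of records with a hypothetical `text` field.
    """
    rng = random.Random(seed)
    matches = [p for p in patents if keyword.lower() in p["text"].lower()]
    return rng.sample(matches, min(k, len(matches)))
```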

Following this extensive lexicon verification procedure, the final surveillance indicator list contains 30 keywords, which, for the verification sample, each met the strict threshold of 100% precision in predicting that a patent enables the above conceptualization of surveillance. The keywords are listed in Supplementary Information section 2.1.

Downstream patent identification and analysis

To study the breadth and variation of surveillance across years, institutions, nations and subfields, we conducted a large-scale lexicon-based analysis of 43,022 papers and patents. For each paper, we scanned its downstream patents to identify patents containing one or more of these surveillance indicator keywords. We refer to these as surveillance-enabling patents. Given that many patents do not explicitly state that they are intended for surveillance (for example, see ‘The obfuscating language of surveillance AI’ section), scanning the full texts of patents for surveillance indicator keywords is a rigorous method for capturing the patents that enable surveillance. We present the distribution of surveillance-enabling patents across institutions, nations, subfields and years, along with a contextualizing discussion, in ‘The rise of surveillance AI’ and ‘The normalization of surveillance AI’ sections. We present further methodological details in Supplementary Information sections 2.1 and 2.3.
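A minimal sketch of this scanning step follows, with illustrative placeholder keywords (the actual 30-keyword lexicon is listed in Supplementary Information section 2.1).

```python
import re

# Illustrative placeholders, not the verified lexicon.
SURVEILLANCE_LEXICON = ["surveillance", "biometric", "pedestrian detection"]

# One pattern with word boundaries so that one- and two-word phrases match
# as whole terms, case-insensitively.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in SURVEILLANCE_LEXICON) + r")\b",
    flags=re.IGNORECASE,
)

def is_surveillance_enabling(patent_text: str) -> bool:
    """Flag a patent if its text contains one or more lexicon keywords."""
    return pattern.search(patent_text) is not None
```

A paper is then counted as used in surveillance-enabling patents if any of its downstream patents is flagged by this check.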

Longitudinal analysis

To conduct an analysis across years (Fig. 2a,b), we filtered the corpus by year. In emerging and developing fields, the estimated time from a paper being published to a downstream patent being published is 3 to 4 years, which incorporates the time spent during the patenting process106. This is in line with our corpus: the number of computer-vision papers with downstream patents stabilized in the early 2000s and remained above 200 every year until 2018 (exactly 4 years before our analysis began), at which point it suddenly dropped by nearly half. Accordingly, for the analysis across years, we removed papers from 2018 and 2019, as these were published less than 4 years before our analysis began and the patenting process may not yet have played out for many of them, which would have made the analysis less reliable. This filter had the added benefit that, in our analyses comparing the 1990s to the 2010s, both decades consisted of 8 years, putting the decades on an equal footing when comparing the total numbers of downstream patents of various types.

To study the linguistic evolution that has occurred, we computed the log odds ratio with a Dirichlet prior of words appearing in paper titles from the 1990s versus paper titles from the 2010s39. We removed stop words (as listed in the Natural Language Toolkit, as well as ‘using’ and ‘via’ because these are common stop words in computer-vision titles). We present ten highly polarized word associations in both directions with computed z scores in Fig. 2c. These are the strongest word associations by z score with the exception that, because we are interested in changes in the focus of papers and patents and not in the well-known evolution of specific types and names of models being used, we skipped the words ‘machine learning model(s)’ and ‘neural network(s)’.
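A compact sketch of this computation is given below, assuming NLTK's English stop word list is available (via nltk.download("stopwords")); the whitespace tokenization is a simplification of the preprocessing in the full pipeline.

```python
import math
from collections import Counter

from nltk.corpus import stopwords

STOP = set(stopwords.words("english")) | {"using", "via"}

def tokens(titles):
    """Lowercase, whitespace-split and stop-word-filter a list of titles."""
    return [w for t in titles for w in t.lower().split() if w not in STOP]

def weighted_log_odds(titles_a, titles_b):
    """z-scored log odds ratios with an informative Dirichlet prior (ref. 39).

    The prior counts alpha_w are taken from the pooled corpus, one standard
    choice for the informative prior; positive z indicates overrepresentation
    in `titles_a`, negative z in `titles_b`.
    """
    ya, yb = Counter(tokens(titles_a)), Counter(tokens(titles_b))
    alpha = ya + yb                       # prior counts from the pooled corpus
    na, nb = sum(ya.values()), sum(yb.values())
    a0 = sum(alpha.values())
    z = {}
    for w, aw in alpha.items():
        la = math.log((ya[w] + aw) / (na + a0 - ya[w] - aw))
        lb = math.log((yb[w] + aw) / (nb + a0 - yb[w] - aw))
        var = 1.0 / (ya[w] + aw) + 1.0 / (yb[w] + aw)
        z[w] = (la - lb) / math.sqrt(var)
    return z
```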

Defining big tech and elite universities

Following Ahmed and Wahed40, we relied on the QS World University Rankings for our definition of an elite university. To determine what is considered big tech, we relied on the criteria established by Abdalla and Abdalla107 and Birhane et al.48, namely Alibaba, Amazon, Apple, DeepMind, Element AI, Facebook, Google, Huawei, IBM, Intel, Microsoft, Nvidia, OpenAI, Samsung and Uber.