Introduction

Biopsy-based histology is a pillar of medical practice and has revealed insights into the diagnosis, progression, and treatment of disease. However, the need for biopsies fundamentally constrains access to the important cellular information revealed by histology. From a clinical decision-making perspective, the suspicion of disease must be sufficiently high to warrant a biopsy procedure, and choosing not to biopsy lesions risks the potential oversight of subtle or early-stage malignancies. Moreover, the need to process surgical samples onto glass slides, interpret the results, and generate a report can take days to weeks, and remote access to images of the tissue can be limited unless extra steps are taken to digitize the glass slides. In clinical research, biopsies inherently prevent longitudinal observation and introduce a barrier to timely participant recruitment. Real-time, non-destructive digital image generation could improve access to in vivo cellular information, expanding patient care and research opportunities.

Multiple methods to improve the observation and digitization of tissue anatomy have emerged, especially in the field of dermatology, where the skin’s accessibility enables ease of study. Three major approaches include: first, proxy technologies, sometimes paired with digital analysis, such as spectroscopy1, clinical photography2 and dermoscopy3, which have a low burden to deploy but do not image the underlying histology; second, digital pathology of scanned, hematoxylin- and eosin-stained (H&E) slides, which improves access to digital images of pathology, but does not shorten the processing time or reduce the invasiveness of biopsy; and third, noninvasive high-resolution imaging methods, such as optical coherence tomography (OCT)4,5, reflectance confocal microscopy (RCM)6, line-field confocal OCT (LC-OCT)7, and multiphoton microscopy (MPM)8 which have generated excitement and evidence supporting their ability to better identify disease without the need for biopsy. However, each noninvasive high-resolution imaging modality alone possesses important limitations that may impede widespread adoption.

An ideal noninvasive clinical imaging solution would view the skin in cross-section (across skin layers) and promote visualization of molecular composition, both important features of H&E histopathology. OCT can rapidly produce cross-sectional images with millimeter scale fields of view, although traditionally with limited optical resolution in skin9. Moreover, as a monochromatic modality, OCT is unable to provide information on the molecular composition of skin. Traditional OCT is available for medical use in the United States (US) with Food & Drug Administration (FDA) clearance (VivoSight Dx, Michelson Diagnostics).

RCM with near-infrared (NIR) light has gained traction as an important source of information for dermatological decision-making10. RCM is a monochromatic modality that detects backscattered photons from illuminated tissue. High resolution in scattering tissues arises from the placement of a small pinhole (~ 10 µm) in a conjugate focal plane of the illumination that prevents out-of-focus photons from reaching the detector (finite-finite conjugates). Raster scanning across the field of view (FOV) pixel by pixel while recording the resulting reflected signal generates the images. With high optical resolution (< 1.25 µm horizontal resolution), physicians can discern cellular morphology and architecture of skin without biopsy for evaluation of skin disease11,12,13. Limitations of RCM images include an en face (in the horizontal plane) orientation and lack of molecular contrast, increasing the need for specialized training in clinical practice10. Traditional RCM is available for medical use in the US with FDA clearance (Vivascope 1500, Caliber I.D.)14.

LC-OCT is an emerging technique that combines optical design elements of traditional OCT with RCM to yield improved image resolution (axial: 1.1 µm, lateral: 1.3 µm) with the option to visualize images in either the vertical or horizontal planes9. LC-OCT further benefits from a FOV and imaging depth (1200 µm × 500 µm) that is greater than RCM but less than traditional OCT. Like RCM and traditional OCT, LC-OCT’s monochromatic nature is unable to provide molecular compositional detail and contrast. LC-OCT has received CE Mark and the advantages of using LC-OCT to detect dermatologic malignancy have been reviewed15.

MPM with pulsed NIR light is another method to provide complementary information about in vivo skin histology. MPM detects fluorescence originating from a nonlinear optical excitation event produced by focused laser illumination. High resolution is achieved through the highly localized occurrence of nonlinear fluorescence or harmonic generation at the focus, negating the need for a pinhole16. Traditional MPM systems raster scan the focused illumination in a similar manner to RCM, experiencing the same limitation of an en face orientation. MPM, however, offers a distinct advantage over OCT, RCM, and LC-OCT in that various molecules within the skin emit differing wavelengths of light when excited, revealing contrast between tissues of differing molecular composition17. In particular, H&E and MPM modalities both highlight the presence of collagen, which is a molecule modified by and implicated in important normal and pathological processes. MPM imaging of skin cancers and other skin disorders has been extensively studied, both ex vivo and in vivo8,18,19,20,21. In addition, MPM, as developed by JenLab GmbH, has received CE Mark (Conformité Européenne [European Conformity] certification)22.

We have developed a noninvasive, cross-sectionally scanned, multimodal microscopy (cross-modal) system that addresses many of the challenges of in vivo histology (Fig. 1, Supplementary Video S1). This technology combines MPM and RCM in the same handheld architecture using a single Class 1M, 780 nm ultrafast pulsed NIR light source. Cross-modal generates real-time, cross-sectional, skin images in color using four simultaneous optical channels: pinhole-free reflectance confocal microscopy (pf-RCM); second harmonic generation (SHG); two-photon autofluorescence, short (2PS); and two-photon autofluorescence, long (2PL). A clinically deployable cross-modal system was recently evaluated and cleared for clinical use by the FDA (VIO System, Enspectra Health).

Fig. 1

A benchtop, handheld imaging system for multimodal, cross-sectional images. (a) Handheld wand. (b) System hub, including display and wrapped cable. (c) Example H&E section of skin with the same FOV as that of cross-modal images, and (d) example cross-modal composite image of normal skin. FOV setting; standard. Example H&E and cross-modal are from different participants. Scale bars: 100 µm. (e) Schematic of how oblique scanning produces a slanted transection of the point spread function (PSF) for improved cross-sectional resolution. A-PSF, axial PSF; L-PSF, lateral PSF; O-PSF, oblique PSF; and angle, θ (30°). The resulting image is represented as a vertical projection (VP) such that the vertical axis of the images represents a true axial depth. (f) Schematic of the internal layout of the handheld wand and light path (dashed line). (g) Schematic of the compact reflection chamber within the wand used to reject out-of-focus light for pf-RCM.

Results

Practical clinical use: miniature integrated system with cross-scanned multimodal imaging

Achieving widespread clinical adoption of a noninvasive dermatology imaging solution will depend substantially on the system’s size and mobility, and the familiarity of image features to interpreting clinicians. To promote usability and allow the device to be carried by one person (Supplementary Fig. S1), the entire cross-modal system was integrated into a form with overall dimensions of 40 × 33 × 30 cm, and a total weight under 10 kg. Cross-modal combines technologies such as miniaturized fiber delivery of ultrafast pulsed laser excitation, multi-spectral signal generation and processing, microelectromechanical systems (MEMS)-actuated laser scanning, solid-state silicon photomultipliers (SiPM), and direct cross-sectional imaging with subcellular resolution. The handheld wand, which contains the laser scanning microscope and dual collection pathways for the RCM and MPM signals, is of a similar size and weight to portable ultrasound imagers at 160 × 50 × 40 mm and 285 g23, which allows for freehand operation.

The cross-modal system’s light source is a small ultrafast fiber laser (50 × 135 × 194 mm; < 2.2 kg; < 80 fs) with low overall power requirements (Toptica Photonics, AG). This laser provided many advantages: a robustly aligned system necessary for portability, device flexibility (cable versus gantry arm), and improved light transmission from laser to tissue. A common challenge with fiber delivery of ultrafast pulses is pulse broadening due to dispersion. To overcome this challenge, we used a 1560 nm fiber laser with a miniature frequency doubling module in the handheld wand itself to create the 780 nm light for signal excitation.

Combining RCM and MPM in the cross-modal format required a strategy beyond simply miniaturizing traditional laser scanning architectures. To enable miniaturization, we based the optical design on maintaining a small beam waist (SBW) of approximately 1.3 mm throughout the entire wand and performing most optical manipulations in the infinity space. This approach minimized aberrations as lenses functioned primarily in the paraxial region. For the objective, we designed a custom lens configuration with an intermediate numerical aperture (NA) of approximately 0.6 and an entrance pupil matching the SBW to negate the need for beam expansion. Intermediate NA objectives have demonstrated superior deep MPM signal generation in scattering media by delivering more photons to the focus via a shorter pathway24.

A key advantage of the SBW design was implementing lateral scanning with a small MEMS mirror (Mirrorcle Technologies Inc.) versus galvanometers. Relative to the size of the cross-modal wand, galvanometers are bulky and suffer the additional disadvantage of shifting the optical axis during scanning due to mirror separation. Unlike paired galvanometers, the dual-axis MEMS mirror rotates about intersecting axes that keep the optical path concentric with the mirror regardless of scan angle. Additionally, the MEMS scanner was very compact (2 mm diameter mirror; 15 × 15 × 1.57 mm printed circuit board [PCB]; 2 g) and delivered a satisfactory frame rate due to its small mass and high resonant frequency.

SBW design further promoted system miniaturization by enabling efficient fiber collection of the generated signal photons. The collected photons converge at the small objective pupil where a dichroic mirror separates MPM signals from de-scanned RCM signals into a pair of flexible fibers for delivery to the hub (Supplementary Fig. S2). This avoided the need to position the collection optics and sensors directly behind the objective inside the wand, which could hinder flexibility and accessibility. The slight loss in transmission through the fibers was offset by the miniaturization gained by locating the SiPMs (Hamamatsu Photonics K.K.) and supporting electronics remotely.

Within the hub, we filtered and routed collected light to separate optical channel sensors based on known emission spectra of dermatologically relevant molecules, primarily reduced nicotinamide adenine dinucleotide phosphate (NADPH), keratin, collagen, elastin, and melanin25 to provide more molecular composition information than a single monochromatic image. Combining RCM, SHG, and two-photon autofluorescence (2PA) into a single system with multiple channels as we have implemented here has been previously identified as a promising strategy to generate complementary information about cellular morphology and architecture26,27. Each cross-modal optical channel is named for the optical phenomenon that generated the light: RCM (780 nm), second harmonic generation (SHG, 390 nm), and 2PA divided into short (2PS, 414–510 nm) and long (2PL, 510–650 nm) wavelength emissions (Supplementary Fig. S3). The shared excitation source ensured spatial registration across the RCM, SHG, 2PS, and 2PL signals, yielding four simultaneous monochromatic images (Fig. 2). The inherently decoupled data from RCM and MPM signals can be viewed independently, e.g. monochromatically, or in pseudocolor combination to reveal complementary details of structure and composition in live tissue.
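
The channel definitions above can be summarized as a small lookup. The following is a minimal sketch, not the system's software: the function name `channel_for`, the 5 nm tolerance around the two single-wavelength channels, and the choice to assign exactly 510 nm to 2PL are all illustrative assumptions. Only the wavelengths themselves come from the text.

```python
# Wavelengths (nm) are taken from the text; everything else is hypothetical.
EXCITATION_NM = 780.0

def channel_for(wavelength_nm, line_tolerance_nm=5.0):
    """Return the optical channel that would receive light of this wavelength.

    line_tolerance_nm is an assumed acceptance width for the two
    single-wavelength channels (RCM at 780 nm, SHG at 390 nm).
    """
    # Reflected light keeps the excitation wavelength.
    if abs(wavelength_nm - EXCITATION_NM) <= line_tolerance_nm:
        return "RCM"
    # Second harmonic generation emits at exactly half the excitation wavelength.
    if abs(wavelength_nm - EXCITATION_NM / 2.0) <= line_tolerance_nm:
        return "SHG"
    # Two-photon autofluorescence is split into short and long bands.
    if 414.0 <= wavelength_nm < 510.0:
        return "2PS"
    if 510.0 <= wavelength_nm <= 650.0:
        return "2PL"
    # Light outside every band is not assigned to a channel.
    return None

for nm in (390.0, 450.0, 560.0, 780.0):
    print(nm, channel_for(nm))
```

Because all four channels share one excitation focus, a pixel's four values describe the same tissue location, which is what allows the grayscale images to be combined into a registered pseudocolor composite.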

Fig. 2

Simultaneous multimodal, four-channel imaging with cellular resolution. (a–e) An in vivo, four-channel capture of normal human skin (Fitzpatrick skin type [FST] 4) with each channel shown individually as a grayscale image: (a) 2PS, (b) 2PL, (c) RCM, (d) SHG, and (e) composite. All four channels are collected simultaneously in vivo for registered features (f–i). Zoom of the dash-bounded box: (f) 2PS, (g) 2PL, (h) RCM, and (i) SHG. (a–i) Scale bars, 100 µm. FOV setting; standard. (j) In vivo, cross section detail of summed 2PS and 2PL signal in the basal layer of the epidermis. Intercellular spaces and filamentous structures (white arrows) within the epidermis are smaller than 2 µm diameter white circles. FOV setting; zoom. (k) Representative image of a 0.5 µm fluorescent bead. Scale bar, 15 µm. FOV setting; zoom. (l) Representative intensity plot through bead axis used to calculate system resolution.

Practical clinical use: freehand control and direct skin contact

We committed to freehand imaging to provide the user with flexibility in choosing regions to image, the ability to capture a multitude of images across broad regions or lesions, and to avoid processes such as restraining the participant and securing the objective lens to the skin. We drew inspiration from the ubiquity and intuitive operation of handheld, freehand ultrasound as a model for clinical ease of use. While motion artifacts are typically a substantial risk when freehand imaging small, cellular-scale features, we found that simple friction and a high frame rate limited motion artifacts sufficiently to make freehand imaging in living skin feasible (Fig. 1d).

The cross-modal system employed an internal depth-focusing mechanism to enable scanning in the z-direction while keeping the probe tip in static contact with the skin28, a key requirement for freehand imaging. Unlike traditional laser scanning architectures that scan through the tissue depth (z-scanning) by varying an axial gap between the terminal end of the objective and the tissue surface, the cross-modal system shifted the z-position of the focus remotely from within the microscope. Without a gap between the probe tip and the skin, there was no relative motion between the tissue and the optical pathway, eliminating the need for rigid fixation to participants or the use of elaborate stabilization mechanisms (Supplementary Fig. S4). We optimized the frictional interface at the skin surface by using a disposable, terminal silicone cover with high optical clarity combined with glycerin as a tacky, optical coupling fluid. Additionally, a single-use sticker helped center the probe tip over regions of interest (Supplementary Fig. S5).

We leveraged the high resonance of the MEMS mirror to further support stable freehand imaging. Lines were scanned at approximately 1,200 Hz with the overall frame rate determined by the number of lines scanned per frame. We implemented a lower resolution “scan” mode at 4.5 frames/s that was sufficient for real-time navigation to identify features of interest. Once the operator identified a region of interest, a high-resolution image was captured by averaging four successive scanned lines to reduce noise. Each captured image of 1024 × 768 pixels spans a FOV of 400 µm × 300 µm with an acquisition time of approximately 2.56 s (Supplementary Fig. S6).
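
The stated acquisition time follows from the line rate. The sketch below checks the arithmetic under one interpretation that is consistent with the numbers in the text: that the 768-pixel dimension is the slow axis, and that each image row is the average of four line sweeps. The function and constant names are illustrative.

```python
LINE_RATE_HZ = 1200.0      # approximate MEMS line rate from the text
CAPTURE_ROWS = 768         # assumed slow-axis row count of a captured image
LINES_AVERAGED = 4         # line sweeps averaged per row in capture mode
SCAN_FRAME_RATE_HZ = 4.5   # frame rate of the lower-resolution "scan" mode

def frame_time_s(rows, sweeps_per_row, line_rate_hz=LINE_RATE_HZ):
    """Time to acquire one frame when every row needs sweeps_per_row line sweeps."""
    return rows * sweeps_per_row / line_rate_hz

def lines_per_frame(frame_rate_hz, line_rate_hz=LINE_RATE_HZ):
    """Number of line sweeps that fit into one frame at the given frame rate."""
    return line_rate_hz / frame_rate_hz

capture_time = frame_time_s(CAPTURE_ROWS, LINES_AVERAGED)
print(f"capture: {capture_time:.2f} s per frame")
print(f"scan mode: about {lines_per_frame(SCAN_FRAME_RATE_HZ):.0f} line sweeps per frame")
```

Under these assumptions, a capture uses 3,072 line sweeps and takes 2.56 s, while scan mode has roughly 267 sweeps available per frame, which illustrates the trade between update rate for navigation and sampling density for capture.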

Cross-modal imaging artifacts and errors were occasionally observed, and their causes were identified. Examples of images affected by intentional motion, bubbles in immersion fluid, incorrect or absent use of the silicone probe cover, too much ambient light, or light-blocking substances on the surface of the skin are shown in Supplementary Fig. S7.

Enhanced imaging performance: pf-RCM mechanism utilizing finite-infinite conjugates

To simplify and improve the resilience of confocal optical alignment, we implemented a pf-RCM mechanism that produces high-resolution in vivo reflectance images in skin. Unlike traditional RCM, pf-RCM rejects out-of-focus light by collecting photons in the de-scanned infinity space of the objective, operating in a finite-infinite conjugate configuration. The de-scanned reflected beam passes through multiple reflections in a double-mirrored chamber, accruing sufficient distance for out-of-focus light to diverge from the collimated path before reaching the collection fiber (Fig. 1g). The reflection chamber kept the overall mechanism compact and within the envelope of the wand housing. Out-of-focus rejection improved with increased distance up to approximately the Rayleigh length, beyond which the beam begins to diverge linearly. Because the Rayleigh length scales by the square of the beam diameter, we de-expanded the collimated beam to reduce the number of reflections within the chamber.
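
The quadratic dependence on beam size can be made concrete with the ideal Gaussian-beam relation z_R = π·w0²/λ, where w0 is the beam waist radius. The sketch below is illustrative only: it assumes the 1.3 mm SBW figure is a beam diameter, assumes an ideal Gaussian beam in air, and uses a hypothetical four-fold de-expansion that is not stated in the text.

```python
import math

WAVELENGTH_M = 780e-9  # reflected light shares the excitation wavelength

def rayleigh_length_m(beam_diameter_m, wavelength_m=WAVELENGTH_M):
    """Rayleigh length of an ideal Gaussian beam, z_R = pi * w0**2 / wavelength."""
    waist_radius_m = beam_diameter_m / 2.0
    return math.pi * waist_radius_m ** 2 / wavelength_m

SBW_DIAMETER_M = 1.3e-3   # assumed to be a diameter
DE_EXPANSION = 4.0        # hypothetical factor, for illustration only

full = rayleigh_length_m(SBW_DIAMETER_M)
reduced = rayleigh_length_m(SBW_DIAMETER_M / DE_EXPANSION)

print(f"Rayleigh length at full beam size:  {full:.2f} m")
print(f"Rayleigh length after de-expansion: {reduced * 100:.1f} cm")
print(f"path length shortened by a factor of {full / reduced:.0f}")
```

Under these assumptions the full-size beam would need a path on the order of meters, whereas shrinking the beam by a factor k shortens the required path by k², which is why de-expansion reduces the number of chamber reflections needed.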

Pf-RCM provides optical alignment advantages over traditional RCM. By effectively eliminating axial sensitivity in alignment, it was possible to couple the reflected signal into the collection fiber using a single adjustment mirror for lateral positioning, simplifying assembly and manufacturing. The de-scanned beam and collection fiber are orders of magnitude larger than a traditional pinhole, reducing the necessary precision in the alignment actuator. Moreover, the mechanism was robust to thermal perturbations which had little effect on the overall pathlength in pf-RCM.

Enhanced imaging performance: coordinated axial and lateral scanning for high-resolution cross-sectional imaging without z-stacking

Cross-modal presents a cross-sectional orientation of skin histology (across skin layers) to allow visualization and identification of skin layers in a familiar image orientation for interpreting dermatologists and pathologists. We scanned lateral lines with the MEMS scanner while continuously shifting the line depth in the z-direction using a previously described approach for remote scanning with an afocal lens arrangement28. Control software coordinated the continuous z-direction movements with the scanning mirror waveform to directly scan a plane through the tissue cross-section. This strategy allowed us to image through the full focal range in the skin within a single frame interval (2.56 s) rather than building a z-stacked sequence of multiple images. An additional advantage is there are no spatial gaps in the depth scan, unlike the coarser sampling in a z-stack.

One of the primary challenges with cross-sectional imaging is axial resolution. While both axial and lateral resolution worsen with lower NA, axial resolution suffers more severely. Assessing histopathology requires an image resolution near 1.0 µm to visualize cells and nuclei. At the excitation wavelength of the cross-modal system (780 nm), this resolution is possible in the lateral direction with the NA near 0.4 but requires NA > 1.0 in the axial direction. Increasing the objective NA above 1.0 becomes increasingly challenging when balancing FOV, off-axis aberrations, probe tip diameter, and signal collection for sensing.
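
The asymmetry between lateral and axial resolution can be estimated with common textbook approximations for a diffraction-limited focus: lateral FWHM ≈ 0.51·λ/NA and axial FWHM ≈ 0.88·λ/(n − √(n² − NA²)). The sketch below is a rough estimate only. These are linear-optics approximations that ignore the additional narrowing from two-photon excitation, and the refractive index of 1.33 is an assumed water-like value rather than a system parameter from the text.

```python
import math

WAVELENGTH_UM = 0.780     # excitation wavelength from the text
REFRACTIVE_INDEX = 1.33   # assumed water-like medium

def lateral_fwhm_um(na, wavelength_um=WAVELENGTH_UM):
    """Approximate lateral FWHM of a diffraction-limited focus."""
    return 0.51 * wavelength_um / na

def axial_fwhm_um(na, n=REFRACTIVE_INDEX, wavelength_um=WAVELENGTH_UM):
    """Approximate axial FWHM; valid only for na < n."""
    return 0.88 * wavelength_um / (n - math.sqrt(n * n - na * na))

for na in (0.4, 0.6, 1.0, 1.2):
    print(f"NA {na:.1f}: lateral {lateral_fwhm_um(na):.2f} um, "
          f"axial {axial_fwhm_um(na):.2f} um")
```

With these approximations, the lateral width falls just below 1 µm at NA 0.4, while the axial width remains above 1 µm at NA 1.0 and is several micrometers at NA 0.6, in line with the qualitative argument above.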

We developed a method to produce high resolution cross-sections with an intermediate NA by scanning in a slanted plane, creating an image from the intersected 3-dimensional PSF. Because the PSF is shaped like an elongated ellipsoid, even at modest angles, the effective PSF dimension was substantially reduced compared to pure axial imaging. To produce the effect, we simultaneously scanned the y-axis of the MEMS mirror in tandem with the z-stage to scan at an oblique angle between the horizontal (xy) and vertical (xz) orientations. A scan angle near 30° from vertical achieved a balance of increased resolution without distorting the appearance of the cross-section. Using this approach, we built three cross-modal microscopes using the same manufacturing procedure and measured resolution by imaging sub-resolution 0.50 µm fluorescent beads (n ≥ 3 beads per system). The mean cross-sectional full width at half maximum (FWHM) values for the three cross-modal systems were 0.94 µm, 1.03 µm, and 0.77 µm, fine enough to resolve subcellular features such as nuclei.
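
The geometric effect of slanting the scan plane can be illustrated with a simple model. The sketch below assumes a Gaussian ellipsoidal PSF and computes its FWHM along a line tilted by θ from the vertical axis, which gives 1/√(sin²θ/w_lateral² + cos²θ/w_axial²). The model, the function name, and the example lateral and axial widths (0.5 µm and 3.0 µm) are illustrative assumptions, not measured values from the system.

```python
import math

def oblique_fwhm_um(lateral_fwhm_um, axial_fwhm_um, angle_from_vertical_deg):
    """FWHM of a Gaussian ellipsoidal PSF along a line tilted from the vertical axis.

    At 0 degrees the result equals the axial FWHM;
    at 90 degrees it equals the lateral FWHM.
    """
    theta = math.radians(angle_from_vertical_deg)
    inverse_square = ((math.sin(theta) / lateral_fwhm_um) ** 2
                      + (math.cos(theta) / axial_fwhm_um) ** 2)
    return 1.0 / math.sqrt(inverse_square)

# Hypothetical PSF dimensions chosen only to show the trend.
LATERAL_UM = 0.5
AXIAL_UM = 3.0

for angle in (0, 10, 20, 30, 45):
    width = oblique_fwhm_um(LATERAL_UM, AXIAL_UM, angle)
    print(f"{angle:2d} deg from vertical: {width:.2f} um")
```

In this model most of the improvement arrives within the first few tens of degrees, because the narrow lateral dimension dominates the sum as soon as the scan line has a modest horizontal component. That is consistent with choosing a moderate angle that sharpens the image while keeping the view close to a true cross-section.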

While cross-modal’s inherent optical resolution is sufficient to capture subcellular features, image quality also depends on the number of photons collected. The endogenous fluorophores that provide molecular contrast in cross-modal imaging generally produce few photons. We developed a dual FOV scanning method to balance the need to survey wide skin regions with the need to visualize cellular detail. The user can toggle between the standard FOV (400 µm × 300 µm; “standard”), and a zoomed FOV (133 µm × 100 µm; “zoom”) at will during active imaging. Toggling between settings ensures accurate targeting of regions of interest for enhanced detail while freehand imaging. The “zoom” setting increases the dwell time due to the proportionally slower point-scan speed, yielding more photons and improving signal from small cellular features (Fig. 3). The “zoom” setting further increases the pixel density three-fold, improving spatial sampling of the point spread function. In the “zoom” setting, cellular nuclei, thin fibrils, and variations in cytoplasmic pigmentation are readily visible. The maneuverability of the cross-modal wand compensates for the smaller FOV by enabling dynamic freehand imaging to effectively scan larger skin regions.
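
The sampling consequences of the two FOV settings can be sketched numerically. This example assumes that both settings use the same 1024-pixel image width and the same line rate, which is consistent with the three-fold figure above but is not stated explicitly. The names are illustrative.

```python
PIXELS_WIDE = 1024  # image width in pixels, assumed identical for both settings
FOV_WIDTH_UM = {"standard": 400.0, "zoom": 133.0}

def pixel_pitch_um(setting):
    """Distance in tissue covered by one pixel along the fast axis."""
    return FOV_WIDTH_UM[setting] / PIXELS_WIDE

def relative_dwell(setting):
    """Dwell time per unit length relative to the standard setting.

    With a fixed line rate, a narrower line is swept proportionally more slowly.
    """
    return FOV_WIDTH_UM["standard"] / FOV_WIDTH_UM[setting]

for setting in ("standard", "zoom"):
    print(f"{setting}: {pixel_pitch_um(setting):.3f} um/pixel, "
          f"{relative_dwell(setting):.1f}x dwell per unit length")
```

Under these assumptions, the standard setting samples at roughly 0.39 µm per pixel, close to half of a resolution near 0.9 µm, whereas the zoom setting samples about three times more finely along each axis and spends about three times longer over each feature.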

Fig. 3

Tradeoff between resolution and FOV addressed with user-selected FOV in freehand, live skin imaging. In a single session of freehand imaging of living skin, the user captured an image of interest with (a) a FOV optimized for signal-to-noise ratio for improved clarity of small features (“zoom”; 133 µm × 100 µm). Cell nuclei in the epidermis and thin fibrils in the dermis are resolved. Variations in cytoplasmic color from green to yellow indicate variations in melanin concentration. White circles, 1 µm diameter. (b) While freehand imaging the same region of skin (approximate location of dotted rectangle), the user captured a larger FOV (“standard”; 400 µm × 300 µm), which captures a different view and context of histologic features. Scale bars, 100 µm.

Validation of clinical performance in living, normal human skin: measures of dermal elastosis and epidermal pigmentation

To evaluate cross-modal’s in vivo performance, we conducted noninvasive imaging on 122 human participants throughout the development of the technology (Supplementary Table S1). A standardized device configuration and methodology were employed to image 49 normal participants across a diversity of age, sex, and FST and three distinct anatomical sites: the forehead, dorsal forearm, and ventral forearm. We then analyzed the signal composition of the epidermis and dermis according to participant demographics. We identified and masked pixels associated with the dermis with a custom macro (FIJI29) based on the presence of SHG signal and depth of imaging.

We initially observed that sun-exposed skin of participants with FST 1–3 and age > 49 compared to age < 30 was correlated with a diminished SHG signal and an augmented green autofluorescence signal in the dermis. This pattern is consistent with the displacement of collagen by elastin, a hallmark of photodamage. To quantify these changes, SAAID (SHG to autofluorescence aging index of dermis) based on signal intensity was calculated for each imaging location30. Notably, as SAAID values decreased, qualitative visualization of elastosis in cross-modal images became more pronounced (Fig. 4).
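
For orientation, SAAID is commonly defined in the literature as (SHG − AF)/(SHG + AF), where AF is the autofluorescence intensity, so values range from −1 (autofluorescence only) to +1 (SHG only). The sketch below applies that definition to masked pixel intensities. Summing intensities over the dermal mask before taking the ratio, the flat-list data layout, and the function name are assumptions made for illustration, not a description of the analysis macro.

```python
def saaid(shg, autofluorescence, dermis_mask):
    """SHG-to-autofluorescence aging index over the masked (dermal) pixels.

    shg, autofluorescence: flat sequences of pixel intensities.
    dermis_mask: flat sequence of truthy/falsy flags, one per pixel.
    Returns None when the masked region contains no signal.
    """
    shg_sum = sum(s for s, keep in zip(shg, dermis_mask) if keep)
    af_sum = sum(a for a, keep in zip(autofluorescence, dermis_mask) if keep)
    total = shg_sum + af_sum
    if total == 0:
        return None
    return (shg_sum - af_sum) / total

# Toy example: the first pixel is outside the dermal mask and is ignored.
shg_pixels = [50, 30, 10]
af_pixels = [5, 10, 30]
mask = [False, True, True]
print(saaid(shg_pixels, af_pixels, mask))
```

A collagen-rich dermis with strong SHG gives a value near +1, and the index falls toward −1 as autofluorescence from elastin becomes dominant, matching the direction of the age-related trend described above.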

Fig. 4

Demonstration of imaging across a variety of subjects with normal skin. Images of skin from sun-exposed, dorsal forearm in female participants with FST 2 or 3 of different ages: (a) age 24, (b) age 51, and (c) age 81. White triangles point to examples of green autofluorescence within the dermis. FOV setting; standard. Scale bar, 100 µm. SAAID values of dorsal forearm and ventral forearm dermis for (d, e) FST 1 and 2 (age < 30, n = 3; age > 49, n = 8); (f, g) FST 3 (age < 30, n = 3; age > 49, n = 11); and (h, i) FST 4 and 5 (age < 30, n = 4; age > 49, n = 3). **, p < 0.01, effect size (ES) > 1.3; NS, not significant, p ≥ 0.05. (e, g, i) Linear trends of SAAID by age. Data points are skin regions of individual participants; ventral forearm by open circles with solid trendlines (R²: 0.77 [FST 1–2], 0.57 [FST 3], < 0.001 [FST 4–5]) and dorsal forearm by solid triangles with dotted trendlines (R²: 0.53 [FST 1–2], 0.47 [FST 3], 0.04 [FST 4–5]).

SAAID values were negatively correlated with FST, consistent with a previous report31. This is likely due to melanin in the epidermis absorbing backscattered SHG signal disproportionately compared to 2P signal, which would diminish SAAID values. A second factor that could influence diminished SAAID values is the presence of melanin in the region that has been segmented as dermis, either caused by small errors in the automatic estimation of the dermal–epidermal junction or the presence of melanin in the dermis (e.g., hair shafts). Because melanin fluoresces at the same wavelength as elastin, it is captured in the metric meant to estimate the presence of fluorescent elastin. To counteract the second factor, we introduced a modest erosion (< 10 µm) of dermal masking to avoid erroneously including epidermal regions in dermal SAAID calculations. We also calculated SAAID exclusively with 2PS signal values, which we reasoned would be less influenced by melanin than 2PL signal values.
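
Mask erosion of this kind can be sketched as a standard binary morphological operation. The example below is a minimal pure-Python illustration, not the FIJI macro used in the analysis. The 4-neighbor structuring element, the treatment of the image border as dermis, and the function names are assumptions.

```python
def erode_once(mask):
    """Shrink the True region of a 2D boolean mask by one pixel (4-neighborhood)."""
    rows, cols = len(mask), len(mask[0])

    def is_dermis(r, c):
        # Positions outside the image count as dermis so that only real
        # tissue boundaries are eroded, not the edge of the frame.
        if r < 0 or r >= rows or c < 0 or c >= cols:
            return True
        return bool(mask[r][c])

    return [
        [
            is_dermis(r, c)
            and is_dermis(r - 1, c) and is_dermis(r + 1, c)
            and is_dermis(r, c - 1) and is_dermis(r, c + 1)
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def erode(mask, iterations):
    """Apply erode_once repeatedly; each pass removes one boundary layer."""
    for _ in range(iterations):
        mask = erode_once(mask)
    return mask

# Toy mask: top row is epidermis (False), the rows below are dermis (True).
toy_mask = [
    [False, False, False],
    [True, True, True],
    [True, True, True],
    [True, True, True],
]
for row in erode(toy_mask, 1):
    print(row)
```

Each pass removes the layer of dermal pixels that touches the epidermis, so the number of passes sets how far the mask retreats from the estimated dermal–epidermal junction. Converting an erosion distance in micrometers into passes would require the pixel size of the image.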

Proof-of-concept clinical performance in live human skin: seborrheic keratosis and basal cell carcinoma

We sought to assess the feasibility of cross-modal imaging in participants with suspected skin cancer. We captured cross-modal images of two lesions, ultimately diagnosed as seborrheic keratosis (SK) and basal cell carcinoma (BCC), before surgical biopsy (Fig. 5).

Fig. 5

Cross-modal imaging of lesional skin with paired histopathology from the same lesion. Seborrheic keratosis (a) H&E and (b) cross-modal. White arrows, elongated rete ridge; asterisks (*), pseudohorn cyst. Basal cell carcinoma (c) H&E and (d) cross-modal. White arrows, basaloid nodule. H&E histopathology is from the same lesion as cross-modal, but not the same plane or location within the lesion. Images (a–d) at the same scale, scale bar: 100 µm. (b, d) FOV setting; standard.

SK is a benign skin lesion, for which surgical removal is not medically required, but it is frequently biopsied due to the difficulty in clinically distinguishing SK from malignancy. SK, therefore, represents an important disease state to characterize with noninvasive imaging. In a cross-modal image of SK, we observed elongated, pigmented rete ridges and a putative pseudohorn cyst (Fig. 5b) consistent with associated H&E histopathology (Fig. 5a) and the diagnosis of SK.

BCC is the most common form of skin cancer and requires accurate diagnosis to guide appropriate treatment. Noninvasive identification of BCC could hasten definitive treatment for patients. In a cross-modal image of BCC (Fig. 5d), we observed an intradermal cellular nodule and a very thin epidermis, consistent with associated H&E histopathology (Fig. 5c) and the diagnosis of BCC. These images provide a first look at how cross-modal technology could be used to evaluate potentially malignant lesions noninvasively at the point-of-care.

Discussion

Cross-modal technology could offer important advantages for dermatology care and clinical research. For dermatology, it could provide a noninvasive evaluation for patients, reduce the barriers to inspecting inflammatory or benign-appearing lesions, shorten the time to definitive treatment, enable point-of-care decision-making, help to evaluate patient suitability for and assess response to noninvasive therapies, and assist in estimating malignant margins before surgery.

Cross-modal could facilitate clinical research by improving the recruitment of participants who are reluctant to undergo biopsies or would be excluded by wound healing limitations, and by enabling longitudinal investigation of the same lesion, which could reveal dynamic changes in skin histology over time. The use of noninvasive skin imaging complements traditional histopathology by enabling histology of functional or sensitive skin, increasing use of comparative histological information from control participants and distributed skin locations within participants beyond what would be ethical to biopsy, and supporting studies involving rare diseases and rare skin types. Noninvasive histology could also enable study designs that would otherwise be prohibitive or impractical with biopsies, such as using noninvasive histological criteria to screen participants for study suitability or to personalize study interventions (e.g., dose), which could improve the success rate of interventions in clinical studies. Finally, quantitative, noninvasive assessments of keratin, collagen, and elastin could benefit aesthetic medicine.

Cross-scanning obliquely improves resolution while still providing a cross-sectional view through the layers of the skin. Cross-modal’s measured oblique resolution using the zoom FOV setting (0.91 µm ± 0.06 µm) is finer (numerically smaller) than the reported axial and lateral resolutions of other commercialized noninvasive imaging technologies32. Oblique cross-scanning could be readily applied to other laser scanning imaging modalities, including fluorescence lifetime imaging microscopy (FLIM), coherent anti-Stokes Raman scattering (CARS), and traditional confocal microscopy, especially in applications in which the temporal and/or spatial resolution of z-stacks is not sufficient for visualizing biological strata.

One limitation of this cross-modal system is its small FOV (400 µm × 300 µm), a constraint shared with other high-resolution, in vivo imaging systems33, especially compared to the large FOVs (several millimeters) possible with traditional histopathology using a brightfield microscope. The ability to see a large region of a skin lesion at once provides important assessment benefits to interpreting physicians, including the presence or absence of lesion symmetry, the prevalence of histologic findings throughout the lesion, and visualization of larger structures (e.g., the hair and follicle complex) that could be difficult to assess without the entire structure visible in one frame. Brightfield microscopes achieve customizable FOVs by providing a selection of interchangeable objective lenses, which is a strategy that could be developed for use in future versions of cross-modal technology to provide FOV customization. We prototyped a strategy to capture a portion of the benefits of a large FOV by stitching consecutively captured tiles into linear mosaics (Supplementary Fig. S8). Stitched mosaics, usually of spatially adjacent, high-resolution tiles, deliver many of the benefits of a large FOV. Tiles captured consecutively in time, in contrast, although not necessarily spatially adjacent, could still provide information about histology feature prevalence and, with user technique, could provide information about lesion symmetry and large features. For example, we found that users could follow a hair shaft through the layers of the skin with freehand imaging (Supplementary Fig. S9).

Additional limitations of the cross-modal system include a maximum imaging depth near 300 µm due to its use of reflectance confocal microscopy and multiphoton microscopy, the requirement to image with minimal ambient light, and inaccessible skin regions due to tortuous three-dimensional geometry (e.g., some regions of the ear). With the expanding landscape of noninvasive imaging technologies, we foresee opportunities to leverage complementary modalities to more comprehensively analyze aspects of skin health. For example, LC-OCT excels at imaging larger fields of view at greater depths whereas cross-modal provides molecular composition with a smaller form factor that provides more flexibility for study settings.

The inherently digital nature of cross-modal’s images presents a unique opportunity for the application of artificial intelligence (AI) to enhance the interpretation and utility of this imaging modality. The high-resolution, cross-sectional images produced by cross-modal, which provide detailed information about cellular morphology and molecular composition, could benefit from AI-driven analysis. Analytical outputs need not be restricted to diagnoses; they could include automated image quality improvements to assist human readers, automated digital “stains” to highlight biological features or regions of interest, or curation of related, biopsy-proven cases for human review. This synergy could lead to noninvasive diagnoses, personalized treatment plans, and a deeper understanding of skin pathologies.

Finally, this technology’s utility could extend beyond dermatological applications. Adaptation of cross-modal for intraoperative or minimally invasive imaging of other organs would be a natural progression of this technology. Furthermore, the SBW optical principles described here are ripe for endoscopic integration. Such an adaptation could provide clinicians with unprecedented views during endoscopic procedures, facilitating better diagnosis and treatment of internal pathologies. These potential applications underscore the versatility and impact of cross-modal imaging, suggesting a future where dynamic, real-time histological imaging becomes accessible and routine.

Methods

System overview: hub

The system hub contains the laser light source, touchscreen tablet, optical sensors, system software, and electronics. The hub chassis mechanically anchors the major optical and electrical components in a compact and robust configuration weighing less than 10 kg. The hub has a cradle for securing the wand and a spool for wrapping the cable for safe transport. The touchscreen tablet is mounted on the front of the chassis for visibility and access to the software controls. The chassis contains a metal back plate that serves as a mount and heat sink for the air-cooled ultrafast laser. The ultrafast laser delivers 780 nm light to the wand via a 2 m optical fiber with pulse durations < 80 fs and a repetition rate of 100 MHz. A pair of 2 m long signal fibers return the RCM and MPM signals to the optical sensors within the hub. The MPM fiber couples to a collection box containing lenses, dichroic mirrors, and spectral filters to separate and route the SHG, 2PS, and 2PL signals to three separate SiPM sensors while the RCM fiber couples directly to a fourth SiPM sensor.

System overview: wand and cable

The wand is tethered to the hub by an excitation fiber from the laser, an electronics cable that provides the scanning waveforms for the MEMS mirror and Z stage, and a pair of 2 m long signal fibers that return collected photons back to the optical sensors within the hub. The optical pathway of the wand was custom-designed to meet the requirements for freehand imaging of the skin in a miniature form factor. Similarly, the optomechanical mounts and alignment elements were custom-designed to minimize size. Coatings of the optical lenses and elements were selected to maximize the transmission of the 780 nm light for excitation and the visible spectrum of the MPM signals for collection.

The optical design centers around a custom-designed infinity-corrected objective using commercially available aspheric lenses. The MEMS scanner is positioned at the back focal plane of the objective to produce a near telecentric scan and minimize distortion when scanning through the tissue depth. Before light reaches the MEMS mirror, an afocal telescope arrangement with a moveable lens enables remote focusing as described in a previous application28. A miniature linear actuator translates the moveable lens along the optical axis in synchronization with the MEMS scanning waveform to yield the cross-sectional scan pattern.

The pf-RCM mechanism receives back-reflected 780 nm light from the skin via a beam splitter in the optical pathway. Immediately following the beam splitter, a de-expanding telescope reduces the beam waist of the collimated confocal beam. A custom-designed alignment mirror actuator reflects the de-expanded light into the reflection chamber and then into the collection fiber. We used a corner cube reflector within the reflection chamber to negate sensitivity to mechanical misalignment due to manufacturing and assembly tolerances.

Image construction from signals

The handheld wand scans the focused laser through the tissue to construct an image from the pointwise intensities of the captured signal photons. The x-axis of the MEMS mirror employs a bidirectional sinusoidal scan pattern at 600 Hz yielding an effective line scan frequency of 1200 Hz. The lateral scan range is set at 400 µm and verified using a calibrated Ronchi grid. The z-axis actuator scans in a symmetric triangle waveform and produces an overall focal shift of 300 µm in the tissue. We verify the range of the z-scan by imaging a 150 µm thick glass cover slip. The MEMS mirror y-axis is driven with a synchronized triangle wave generated from the waveform used on the z-axis scan. The range of the y-scan is approximately 200 µm and is verified using a calibrated Ronchi grid.

During a capture, we oversample the SiPMs during the line scan and then re-bin and average the measurements into the final pixel locations to correct for the sinusoidal variation in velocity. Because a sinusoidal scan moves more slowly near its turning points, the pixels towards the edges of the image contain more samples and less overall noise. The final image of 400 µm × 300 µm is mapped to 1024 × 768 pixels (0.39 µm/pixel). This process yields four spatially synchronized monochromatic images, one for each SiPM. To further reduce noise in the image, we apply a median filter and then a Gaussian blurring kernel to each image. We generate the pseudocolor composite image by assigning each signal fully or in part to the red, green, or blue channel of an RGB image file: the SHG signal to blue, the 2PS signal to green, and the 2PL signal to red. We further add one-third of the RCM signal to each of the red, green, and blue channels as a grayscale undertone. The final data output for one capture is a set of five images: a grayscale image for each of the RCM, SHG, 2PS, and 2PL signals, and a pseudocolor composite image. All five images are exported in DICOM (Digital Imaging and Communications in Medicine) format along with a duplicate of the pseudocolor composite in PNG format and a duplicate four-channel stack of the monochromatic images in TIFF format.
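The re-binning and channel-assignment steps can be illustrated with a minimal Python sketch. It is not the system software. The phase model (one sweep running from one turning point of the sinusoid to the other), the function names, and the assumption that signals are normalized to the range 0 to 1 and clipped after the RCM undertone is added are all illustrative choices that the text above does not specify.

```python
import math


def rebin_sinusoidal_line(samples, n_pixels):
    """Average time-uniform samples from one sinusoidal sweep into equal-width pixels.

    Assumption: the samples span exactly one sweep between the two turning
    points of the mirror, i.e. a phase of -pi/2 to +pi/2.
    """
    n = len(samples)
    sums = [0.0] * n_pixels
    counts = [0] * n_pixels
    for i, value in enumerate(samples):
        phase = -math.pi / 2 + math.pi * (i + 0.5) / n
        position = (math.sin(phase) + 1.0) / 2.0  # 0..1 across the line
        pixel = min(int(position * n_pixels), n_pixels - 1)
        sums[pixel] += value
        counts[pixel] += 1
    # Pixels that received no samples are reported as NaN.
    means = [s / c if c else float("nan") for s, c in zip(sums, counts)]
    return means, counts


def _clip01(value):
    """Clip to the 0..1 range (clipping is an assumption of this sketch)."""
    return max(0.0, min(1.0, value))


def composite_pixel(rcm, shg, tps, tpl):
    """Return (red, green, blue) for one pixel from the four signals.

    2PL -> red, 2PS -> green, SHG -> blue, plus one-third of RCM on each
    channel as a grayscale undertone, as described in the text.
    """
    undertone = rcm / 3.0
    return (_clip01(tpl + undertone), _clip01(tps + undertone), _clip01(shg + undertone))


means, counts = rebin_sinusoidal_line([1.0] * 1000, 10)
print(counts)  # edge pixels collect more samples than central pixels
print(composite_pixel(rcm=0.75, shg=0.5, tps=0.25, tpl=0.0))
```

The per-pixel sample counts returned by the first function make the edge effect visible: with uniform sampling in time, the outermost pixels collect several times more samples than the central ones.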

Optical characterization

To quantify optical performance, we measured the cross-modal’s ability to generate signals, collect signals, and render images with high resolution by imaging 0.5 µm sub-resolution beads. We measured the FWHM of bead intensity profiles. We established a maximum FWHM limit of 1.2 µm for any completed device to ensure we could resolve micron-scale features, such as cellular nuclei, and generate efficient multiphoton signals.

This test uses a commercial test slide containing fluorescently stained microspheres (F36909, ThermoFisher Scientific, Waltham, MA, US). The beads are stained with fluorophores that are excitable only in the visible spectrum, meaning they can only be visualized if the cross-modal system successfully generates multiphoton excitation. Confirming the presence of 2PL signals verifies that the laser is functioning properly, all optical fibers are intact, the imaging optics within the wand are correctly arranged and aligned, and the laser is tightly focused.

To ensure optical performance was consistent across devices, we built and tested 3 different cross-modal systems using the same assembly procedure. We then measured the FWHM for at least 3 beads per system and confirmed the measurements were beneath the 1.2 µm limit.
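As an illustration of the acceptance check, the following Python sketch estimates the FWHM of a one-dimensional bead intensity profile by linear interpolation at the half-maximum crossings and compares it with the 1.2 µm limit. The baseline choice (the profile minimum), the interpolation method, and the function names are assumptions of this sketch; the text does not specify how the profiles were fitted.

```python
import math

FWHM_LIMIT_UM = 1.2  # acceptance limit stated in the text


def fwhm(positions, intensities):
    """Estimate the full width at half maximum of a single-peaked profile.

    Assumptions: the baseline is the profile minimum, and the half-maximum
    crossings are found by linear interpolation between neighboring samples.
    """
    base = min(intensities)
    peak = max(range(len(intensities)), key=lambda i: intensities[i])
    half = base + (intensities[peak] - base) / 2.0

    def crossing(inside, outside):
        # Interpolate between a sample at or above half maximum and its
        # neighbor below half maximum.
        x0, y0 = positions[inside], intensities[inside]
        x1, y1 = positions[outside], intensities[outside]
        return x0 + (half - y0) * (x1 - x0) / (y1 - y0)

    left = peak
    while left > 0 and intensities[left - 1] >= half:
        left -= 1
    right = peak
    while right < len(intensities) - 1 and intensities[right + 1] >= half:
        right += 1
    if left == 0 or right == len(intensities) - 1:
        raise ValueError("profile does not fall below half maximum on both sides")
    return crossing(right, right + 1) - crossing(left, left - 1)


def passes_fwhm_limit(width_um, limit_um=FWHM_LIMIT_UM):
    """True when a measured FWHM is beneath the acceptance limit."""
    return width_um < limit_um


# Synthetic Gaussian profile (sigma = 0.4 um) standing in for a bead image.
xs = [i * 0.05 - 3.0 for i in range(121)]
ys = [math.exp(-x * x / (2 * 0.4 ** 2)) for x in xs]
width = fwhm(xs, ys)
print(round(width, 3), passes_fwhm_limit(width))
```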

Human participants

In vivo imaging was performed in Mountain View, CA. Human participants were enrolled according to the principles of the Declaration of Helsinki and Good Clinical Practice guidelines, and the protocol was approved by a third-party institutional review board (Salus IRB, Austin, TX, US). All participants or their guardians provided written informed consent before undergoing study-specific tests or evaluations. The study was registered on clinicaltrials.gov (NCT05410964).

The primary eligibility criteria were that the participant be 2–90 years old; not have any general health condition or systemic disease that may represent, in the opinion of the Principal Investigator, a potential increased risk associated with device use; not be currently infected with a communicable skin infection (e.g., shingles or methicillin-resistant S. aureus), excluding local and minimally pathogenic or non-pathogenic infections distant from the imaging location(s) (e.g., warts, acne); not have any known allergies to any materials used in the preparation of skin and/or device use; and not have a temporary or permanent electrical implanted medical device.

Participants (N = 49 healthy: 20 female, 29 male; N = 2 suspected disease: 2 female) were enrolled from December 2021 to July 2022 as a subset of the total N = 122 participants of the study using a standardized device configuration and methodology. Detailed participant demographics are reported in Supplementary Table S1. FST was estimated for participants by a questionnaire administered by non-clinician study staff.

No identifying information of study participants has been included in this publication. Photographic identifying information of a study author has been included with permission (Supplementary Video S1).

In vivo cross-modal imaging

Four-channel cross-modal imaging was performed in at least three anatomical locations per healthy participant (central forehead, dorsal forearm, and ventral forearm) or, for participants with suspected disease, at the site of suspected disease. Images with known artifacts (e.g., as in Supplementary Fig. S7) were excluded. Glycerin was used as an immersion fluid for all participants. For healthy participants, we recorded a minimum of 10 images per location in each participant, yielding a total of 1,951 images. The 10 or more images per location provided cross-sectional data across an effective distance of at least 4 mm (10 images × 400 µm lateral FOV).

Biopsy and histopathology

Participants with suspected or known skin disease who were undergoing biopsy as routine standard-of-care were recruited from a community dermatology clinic (Golden State Dermatology; Walnut Creek, CA and Mountain View, CA). Cross-modal imaging was performed before biopsy. Duplicate sections (“recuts”) of H&E formalin-fixed, paraffin-embedded skin were de-identified, mounted on glass slides, and provided to the study team. The slides were scanned and digitized (Aperio, Leica Biosystems, Histowiz, NY, US). De-identified pathology reports were also provided.

Computational segmentation of skin layers and metric quantification

The living epidermis and dermis were computationally estimated using custom macros in FIJI29. The top 50 µm of each image was cropped, as this region was dominated by signal from debris on the surface of the skin (not skin) or from the stratum corneum (non-living skin). Additionally, for all images, regions with signal below a pre-specified level were excluded from analysis, since ratios of signals in these regions are likely to be dominated by noise. The custom macros enriched for the targeted skin regions but did not usually capture the entire intended regions or fully exclude adjacent regions. Masked regions of interest used to calculate metrics were saved with the images. Finally, instead of calculating SAAID for each image and finding the mean or median of the images for a particular participant-anatomical region, we normalized SAAID across the entire masked area of all the images of a participant-anatomical region.

Dermis estimation: Dermal location was estimated by detecting regions where the difference between SHG signal (correlated with collagen presence) and 2PL signal (usually associated with melanin, but also present in regions of general saturation) was above a pre-determined threshold. Additionally, the bottom 105 µm regions of the images were assumed to be dermis.

Epidermis estimation: Epidermis was identified by detecting regions in the top 150 µm that were absent of SHG signal (associated with collagen and the dermis) above a pre-specified threshold. Because some of the stratum corneum is excluded from the epidermal mask, the epidermis estimation is enriched for regions of living epidermis.
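The masking rules above can be sketched in Python as follows. This is an illustration only, not the FIJI macros used in the study. The threshold values, the definition of "low signal" as the sum of SHG and 2PL, the measurement of the top 150 µm from the uncropped image top, and all function and parameter names are hypothetical. Images are plain nested lists with row 0 at the skin surface.

```python
def segment_layers(shg, tpl, pixel_um=0.39, crop_top_um=50.0,
                   epidermis_depth_um=150.0, dermis_bottom_um=105.0,
                   dermis_threshold=0.5, shg_absent_threshold=0.1,
                   min_signal=0.05):
    """Return (dermis_mask, epidermis_mask) as nested lists of booleans.

    shg and tpl are 2D lists of equal shape (rows x columns) holding the
    SHG and 2PL intensities. All thresholds are hypothetical placeholders
    for the pre-specified values mentioned in the text.
    """
    n_rows = len(shg)
    crop_rows = round(crop_top_um / pixel_um)        # debris / stratum corneum
    epi_rows = round(epidermis_depth_um / pixel_um)  # search zone for epidermis
    bottom_start = n_rows - round(dermis_bottom_um / pixel_um)

    dermis, epidermis = [], []
    for r in range(n_rows):
        d_row, e_row = [], []
        for s, p in zip(shg[r], tpl[r]):
            # Exclude the cropped top band and low-signal pixels everywhere.
            usable = r >= crop_rows and (s + p) >= min_signal
            # Dermis: SHG exceeds 2PL by a threshold, or the pixel lies in
            # the bottom band that is assumed to be dermis.
            d_row.append(usable and ((s - p) > dermis_threshold or r >= bottom_start))
            # Epidermis: within the top zone and lacking SHG signal.
            e_row.append(usable and r < epi_rows and s < shg_absent_threshold)
        dermis.append(d_row)
        epidermis.append(e_row)
    return dermis, epidermis
```

With the stated pixel size of 0.39 µm, the 50 µm crop corresponds to roughly the first 128 rows of a 768-row image.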

SAAID was calculated as previously reported30 as the difference of intensity of SHG and 2PA divided by the intensity sum of SHG and 2PA. Because our cross-modal device has two 2PA channels (2PS and 2PL), we needed to choose how to represent the 2PA signal in the SAAID formula. We calculated SAAID for the metrics shown in Fig. 4 using the intensity of the 2PS signal because it is less sensitive to epidermal pigmentation than SAAID calculated with 2PL. SAAID was calculated only within the region of the image estimated to be in the dermis.
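Written out, the index is SAAID = (SHG − 2PA) / (SHG + 2PA), which ranges from −1 (autofluorescence only) to +1 (SHG only). The Python sketch below shows one possible reading of the pooled normalization described earlier, in which SHG and 2PS intensities are summed over the dermis-masked pixels of all images for a participant-anatomical region before the ratio is taken. That pooling interpretation, the flat-list data layout, and the function names are assumptions of this sketch rather than details confirmed by the text.

```python
def saaid(shg, tpa):
    """SAAID from an SHG intensity and a two-photon autofluorescence intensity."""
    return (shg - tpa) / (shg + tpa)


def pooled_saaid(images):
    """SAAID pooled over the masked pixels of several images.

    images: iterable of (shg_pixels, tps_pixels, dermis_mask) tuples, each a
    flat list of equal length. Pooling intensities before taking the ratio
    is an assumed interpretation of the normalization.
    """
    total_shg = 0.0
    total_tps = 0.0
    for shg_pixels, tps_pixels, mask in images:
        for s, t, keep in zip(shg_pixels, tps_pixels, mask):
            if keep:
                total_shg += s
                total_tps += t
    return saaid(total_shg, total_tps)


region_images = [
    ([4.0, 2.0], [1.0, 1.0], [True, True]),
    ([6.0, 9.0], [2.0, 5.0], [True, False]),  # second pixel lies outside the mask
]
print(pooled_saaid(region_images))
```

Pooling in this way weights each image by its masked area and signal level, unlike a mean of per-image SAAID values.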

Statistics

Central metrics of groups were represented with mean values. All error bars and in-text variability values (e.g., ± value) represent the standard error of the mean. Trends were represented with scatter plots and were fit with linear regression trend lines, in which each plotted data point represents a metric from a participant. For each trend line, the coefficient of determination (R²) was reported to estimate the strength of the relationship between the independent and dependent variables.
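For reference, the standard error of the mean and the coefficient of determination for a simple linear fit can be computed as in the following Python sketch. It assumes the sample standard deviation (n − 1 denominator) and an ordinary least-squares fit; the function names are illustrative and not taken from the study's analysis code.

```python
import math


def sem(values):
    """Standard error of the mean, using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


def linear_fit_r2(x, y):
    """Ordinary least-squares line through (x, y).

    Returns (slope, intercept, r_squared), where r_squared is
    1 - SS_residual / SS_total.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


print(sem([1, 2, 3, 4, 5]))
print(linear_fit_r2([1, 2, 3], [1, 3, 2]))
```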

Pairwise comparisons between the two age groups (Fig. 4d,f,h) were performed with two-tailed Student’s t-tests. Hedges’ g was used to estimate effect size (ES) of significantly different group values.
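A minimal Python sketch of the two-group comparison follows. It computes the Student's t statistic with pooled variance and Hedges' g using the common small-sample correction factor J = 1 − 3 / (4(n1 + n2) − 9). Whether the study used this approximate correction or the exact gamma-function form is not stated, so that choice is an assumption, as are the function names. The p-value is omitted because it requires the t distribution; a statistics library such as SciPy (scipy.stats.ttest_ind) provides it.

```python
import math


def _mean_and_variance(values):
    """Mean and sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def pooled_sd(a, b):
    """Pooled standard deviation of two independent groups."""
    _, var_a = _mean_and_variance(a)
    _, var_b = _mean_and_variance(b)
    n_a, n_b = len(a), len(b)
    return math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))


def student_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    mean_a, _ = _mean_and_variance(a)
    mean_b, _ = _mean_and_variance(b)
    n_a, n_b = len(a), len(b)
    t = (mean_a - mean_b) / (pooled_sd(a, b) * math.sqrt(1.0 / n_a + 1.0 / n_b))
    return t, n_a + n_b - 2


def hedges_g(a, b):
    """Hedges' g: Cohen's d scaled by the approximate correction factor J."""
    mean_a, _ = _mean_and_variance(a)
    mean_b, _ = _mean_and_variance(b)
    d = (mean_a - mean_b) / pooled_sd(a, b)
    j = 1.0 - 3.0 / (4.0 * (len(a) + len(b)) - 9.0)
    return d * j


younger = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up values for illustration only
older = [2.0, 3.0, 4.0, 5.0, 6.0]
print(student_t(younger, older), hedges_g(younger, older))
```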