Introduction

China's railway sector has proposed an Intelligent Railway Passenger Station (IRPS)1 blueprint, which explicitly states that station operation management should be implemented on the basis of digital twin technology. A digital twin station replicates the physical space of the station in cyberspace. The first step toward a digital twin station is constructing a static exterior model of the railway station and its internal entities; for dynamic targets, the process begins with a static model of the target in a particular state. This static exterior model serves as the cornerstone of the station's digital twin, transforming the physical entities of the station's various production elements2 into computer-recognizable three-dimensional models based on their visual appearance and static characteristics. These models express the geometric dimensions, textures, and initial spatial positions of the production elements. The process requires creating a three-dimensional model in cyberspace that is consistent with the architectural structure and equipment layout of the real world.

The primary technique for constructing three-dimensional models of passenger stations is manual modeling3. For large-scale stations, such as superclass stations, the passenger terminal buildings are substantial, with some exceeding 100,000 square meters. The design of these stations often follows aesthetic principles, typically adopting symmetrical forms both laterally and longitudinally. Key facilities and equipment such as gate indicator lights and static directional signs, however, are not completely symmetrical. During data organization, the symmetry of the structural decorations and the asymmetry of the equipment can easily lead to confusion about which side the data came from, resulting in inaccurate modeling. Moreover, on-site manual data collection demands significant labor and time, producing an immense workload.

To address these issues, this paper, inspired by Fan et al.4, introduces a novel modeling method for static models of passenger station elements, termed the Mobile vehicle-Sparse Sampling-Colmap-Resolution adjustment-Gaussian Splatting (MSCRAGS) method. The method uses a mobile vehicle with sparse sampling to extract a frame stream and employs a multi-view stereo pipeline, Gaussian splatting, and resolution adjustment. It automates the collection of passenger station production element data and generates high-fidelity static models of these elements. It also enables the synchronized collection and alignment of station exterior color and geometric dimensions, reducing the labor intensity of manual data collection and minimizing the workload of subsequent data alignment, thereby improving the efficiency of initial data collection for station static exterior modeling. The contributions of this paper are as follows:

  • It analyzes the production elements that need to be modeled at passenger stations and proposes a flexible control concept for static models.

  • It designs autonomous mobile vehicle hardware that replaces manual camera-based data collection.

  • It designs a sparse sampling scheme with multi-angle data capture that records the initial colors and appearances of production elements from multiple heights and angles.

  • It proposes a multi-resolution control and rendering method for static models, enabling the three-dimensional reconstruction of production elements at varying levels of detail.

The rest of the paper is structured as follows: Section “Related work” reviews related work on railway reconstruction methods. Section “Flexible control of digital twin models for station” introduces the components of station production elements and the requirements for their flexible control. The proposed MSCRAGS method for reconstructing static exterior models is described in detail in Section “Method”. The experimental procedures and results are presented in Section “Experiment”. Finally, conclusions are drawn in Section “Conclusion”.

Related work

Railway station reconstruction technology

Railway reconstruction facilitates information management: Building Information Modeling (BIM) technology is widely adopted for new station construction, while photogrammetry is commonly used for large-scale outdoor railway station scenes.

BIM technology in railway station

Liu3 created an integrated three-dimensional information management platform for entire railway bureaus, combining foundational Geographic Information System (GIS) geospatial data with BIM-derived models of fixed railway operational equipment and facilities. Wang et al.5 developed a BIM-based Virtual Reality immersive training scenario for underground high-speed railway station evacuations during construction, which included processing BIM materials, scene decomposition and transmission, longitudinal and lateral expansion, normal direction unification, and scene baking. Lou et al.6 developed a visualization and control platform for the Nanning North Station building project, integrating BIM data, drone aerial photogrammetry, and topographical data; the platform aligns and coordinates these datasets based on their relative positions, creating an integrated BIM platform. Hao et al.7 employed Revit architectural modeling software and the Unity3D engine to construct a large-scale railway route map for exploration, enabling a railway train driving training simulator. Yang et al.8 employed BIM and optimized graph convolution algorithms to construct intelligent transportation node models, enhancing transportation system performance and information processing efficiency. However, modeling techniques based on BIM and GIS predominantly rely on manual modeling, which involves a substantial workload.

Reconstruction of real-world railway station outdoor

Zhu et al.9,10 proposed a methodology for the 3D reconstruction of real-world railway scenes that partitions the spatial scale into different resolution elements for 3D modeling across all life cycle stages, from planning and design to construction and maintenance, which facilitates the definition of building syntax, parameters, models, and textures. Liu et al.11 developed a real-world railway communication construction system equipped with functional modules for real-time display, maintenance management, emergency inquiry, and intelligent analysis; the system integrates visualization of real scenes, fusion of multi-source data collection, and specialized communication maintenance functions. Wang et al.12 utilized location search and drone attitude analysis to project and dynamically fuse original drone imagery, facilitating rapid image matching and yielding a high-precision real-world 3D technology. Fan et al.4 employed drone oblique photography to capture the current status of land use around high-speed railway stations and analyzed the land development potential surrounding the stations based on a real-world 3D platform. Su et al.13 conducted aerial orthophotography of the Su-Hu Intercity Railway using drones, resulting in a 190 \(\text{km}^2\) real-world 3D model. The use of real-scene modeling techniques for large outdoor railway scenes, such as those involving unmanned aerial vehicles, offers new perspectives for indoor station modeling technology.

High-fidelity reconstruction

Beyond real-world scenes, three-dimensional scene representation techniques include point clouds, voxels, and triangular meshes. DeepPano14 utilizes deep neural networks to extract features from panoramic images and reconstruct them into 3D point cloud representations. Tan et al.15 proposed a novel video-based deep differentiation segmentation neural network for foreign object detection in real-world urban rail stations, effectively capturing the subtle shape features of car door and platform seams. AtlasNet16, a neural network-based method, generates 3D meshes from point cloud data using an encoder-decoder architecture that incorporates both local and global feature maps to produce detailed, high-quality meshes. 3DShapeNets17 transforms 3D voxels into a binary probability distribution using a convolutional deep belief network. Mildenhall et al. introduced the Neural Radiance Field (NeRF)18, which employs implicit neural scene representations and volumetric rendering to achieve high-quality view synthesis. Subsequent developments such as NeRF++19, Mip20, Mip36021, 3D Gaussian Splatting (3DGS)22, and related tools23,24,25 have further advanced NeRF technology. These methods provide new technical means for digital twin passenger stations, enabling 1:1 mirroring of physical-world stations.

Flexible control of digital twin models for station

Railway passenger stations in the physical world manage passenger transport operations and organize the rapid and safe boarding and alighting of passengers2. Station management personnel need to oversee the entire process of serving passengers within the station, requiring not only broad overall control of various areas but also detailed monitoring of key operational areas. This leads to different levels of monitoring granularity for different objects at the station, and digital twin stations must adapt to staff requirements for such flexible control. These objects are referred to as the production elements of the station, which comprise six categories: personnel, trains, equipment, environment, station buildings, and business, each containing a diverse range of subcategories. Except for business, which has no tangible entity, all production elements are represented by tangible entities.

Composition of the static exterior model

The six production elements can be categorized based on their intended use into daily operations and emergency management. Production elements specific to emergency management are not required for daily operations; however, in emergencies, these elements are utilized in addition to those needed for regular daily operations.

Daily operational elements provide passengers with standard transportation services, detailed as follows:

  • Personnel. This category includes passengers and their luggage, as well as staff members.

  • Equipment. This is categorized into ticketing equipment, security inspection devices, electromechanical equipment, passenger service devices, and facilities.

    • Ticketing equipment. This category encompasses manual real-name verification ticket machines, real-name verification gates, columnar ticket machines, gate-style automatic ticket machines, seat-number dispensers, receipt printers, ticket replenishment machines, integrated ticket sale/return machines, and police certification devices.

    • Security inspection devices. This category encompasses security gates and scanners.

    • Electromechanical equipment. This category includes escalators, elevators, air conditioning, heating radiators, ventilation, water supply systems, and sewage extraction devices.

    • Passenger service devices. This category comprises entrance screens, advertising screens, ticket checking screens, platform displays, departure and arrival announcement screens, intelligent inquiry machines, triangular screens, exit screens, check-in screens, remaining ticket screens, ticket window screens, cameras, lighting, platform intrusion detection devices, local area broadcasting devices, loudspeakers, Bluetooth tags, wheelchairs, and handheld terminals.

    • Facilities. This category encompasses blind walkway signs, static signs, and directional signage.

  • Trains. This category includes originating trains, terminating trains, passing trains, and shuttle trains.

  • Environment. This refers to various sensors that monitor environmental conditions, including smoke detectors, air quality sensors, temperature sensors, humidity sensors, noise sensors, and brightness sensors.

  • Station buildings. These are the architectural components of the station, including entrance gates, waiting rooms, ticket checking areas, corridors, platforms, tracks, exit gates, subway connection areas, comprehensive service centers, nursing rooms, business and leisure areas, children’s play areas, cultural reading zones, and comprehensive service areas (military waiting areas, medical aid rooms), safety instruction signs, entrance guidance signs, floor schematic guidance signs, comprehensive service desks, restrooms, washrooms, and drinking fountains rooms.

Emergency management operational elements provide the necessary resources for staff during emergency responses, detailed as follows:

  • Personnel. This includes railway policemen, doctors, firefighters, and other external rescue forces.

  • Equipment. Comprises fire sprinkler systems, fire hoses, fire hydrants, rolling shutter doors, water pumps, portable fire extinguishers, box-type fire extinguishers, fire alarms, disinfectant sprayers, barrier tapes, protective gear, and emergency megaphones.

  • Trains. Rescue trains.

  • Station buildings. Fire engine access.

Static exterior model flexible control

Flexible Control26 is a key concept in industrial manufacturing, referring to the ability to adapt control strategies based on changing production demands. This concept offers innovative approaches to managing passenger stations by addressing various production elements across different levels of management. Flexible control of passenger station static models involves simplifying (lightweighting) and refining (detailing) the three-dimensional models according to specific management needs. In passenger station management, areas with high business density require detailed control, whereas less busy areas do not. This flexible control approach is designed to meet varying application scenarios and performance requirements.

Model fidelity adjustment, involving both the lightweighting of high-precision models and the refinement of low-precision models, is achieved by altering the quantity of points, lines, and surfaces as well as the resolution of textures. This process creates a hierarchy of details with varying geometric face counts and texture resolutions, enabling variations in the model's level of detail. In accordance with surveying and mapping standards27,28, the flexible control of the static model for passenger stations is divided into three levels: lightweight, standard, and fine. Flexible control of station buildings should be implemented from four aspects: Architectural Complexity (AC), Ground Complexity (GC), Elevation Accuracy (EA), and Texture Detail (TD). The specifics are presented in Table 1; the categories I, II, and III in the table follow the guidelines outlined in standards27,28.

Table 1 The classification of static exterior model.

Method

We propose the Mobile vehicle-Sparse Sampling-Colmap-Resolution adjustment-Gaussian Splatting (MSCRAGS) method for station reconstruction. Mobile-vehicle-based modeling integrates sparse sampling theory to collect appearance color data of passenger transportation production elements from multiple heights and angles. This foundational data is then fed into a structure-from-motion multi-view stereo pipeline to obtain sparse point cloud information on the geometric dimensions, color appearance, and other aspects of the production elements. The foundational data is adjusted in resolution to accommodate the requirements of flexible control, and Gaussian splatting is employed to render the sparse point cloud information, resulting in high-fidelity reconstruction of production elements. The overall structure is illustrated in Fig. 1.

Fig. 1
figure 1

MSCRAGS architecture.

Overall mobile vehicle hardware design

The overall hardware architecture of the mobile vehicle integrates various sensors and power systems, consisting of LiDAR, motion cameras, an industrial computer, an external mobile power supply, and the vehicle base. The overall structure is illustrated in Fig. 2a, and the final assembled mobile vehicle in Fig. 2b,c.

Fig. 2
figure 2

The physical mobile vehicle.

The components and their functions are as follows:

  • LiDAR. By emitting laser pulses and measuring the time required for the signal to return, LiDAR precisely measures the geometric dimensions of production elements, obtaining high-precision 3D point cloud data for map construction.

  • Motion cameras. The motion cameras capture continuous sequences of images, collecting texture and color information of production elements for the virtual static model. Supporting high-frame-rate recording with image stabilization, the motion camera model employed in this paper is a commercially available action camera capable of recording video streams at resolutions up to 5312×2988, facilitating the capture of detailed information.

  • Mobile platform. The mobile platform is equipped with an omnidirectional intelligent wire-controlled chassis, featuring a four-wheel drive and steering system that allows lateral movement and on-the-spot rotation, enhancing maneuverability and flexibility. It runs the Robot Operating System29 and is particularly suited to performing stable rotations around the production elements of the virtual static model.

  • Power. For long-duration tasks or remote operations, a reliable power supply system is essential for ensuring the stable functioning of the vehicle, powering devices such as industrial computers and monitors through portable power sources.

Mobile vehicle-based sparse sampling for point cloud collection

Using the mobile vehicle hardware, initial data collection combines continuous vehicle movement with a sparse sampling approach to gather exterior color data of production elements from multiple heights and angles. This foundational data is then input into a structure-from-motion multi-view stereo pipeline, yielding sparse point cloud information on the geometric dimensions and color appearance of the passenger service elements.

Mobile vehicle raw data collection

Data collection for constructing a static exterior model of a tangible production element in a passenger station begins by pinpointing the intended location of the model on the mobile vehicle's map and marking it. Routes circumnavigating this point are then established, along with corresponding speeds, and a data collection run is initiated. The motion camera's height is then adjusted for multiple collection passes. Upon completion, a series of appearance video streams containing production element data for the passenger station is obtained.

Sparse sampling for frame stream extraction

Geometric and texture information of production elements is collected via mobile vehicles. When collecting geometric information, the entire length, width, and height dimensions of the production element must be fully covered. For appearance information, the color data of the production element must be obtained from all angles: front, back, left, right, top, and bottom. Thus, the collection of geometric dimensions and appearance color information for station production elements is divided into horizontal and vertical perspectives.

In the horizontal perspective, the action camera is kept at a fixed height while the mobile vehicle circumnavigates the production element along a predetermined route. For the vertical perspective, the action camera is adjusted to a new height and the vehicle circumnavigates again. For geometric dimension collection, one complete circumnavigation at a single height is sufficient; for full vertical coverage, the action camera must adjust its height multiple times.

The number of vertical collection passes (HC) required for a station production element is given by Eq. (1), where F is the field-of-view angle of the action camera, H is the height of the production element, and d is the distance from the action camera to the element; HC is obtained by rounding up.

$$\begin{aligned} HC \ge \left\lceil \frac{H}{2d \tan \left( \frac{F}{2}\right) } \right\rceil \end{aligned}$$
(1)
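As a concrete illustration, Eq. (1) can be evaluated with the following minimal Python sketch; the function and variable names are ours, not part of the original toolchain:

```python
import math

def vertical_passes(height_m: float, distance_m: float, fov_deg: float) -> int:
    """Number of vertical collection passes HC per Eq. (1).

    height_m:   height H of the production element
    distance_m: distance d from the action camera to the element
    fov_deg:    camera field-of-view angle F, in degrees
    """
    coverage = 2 * distance_m * math.tan(math.radians(fov_deg) / 2)  # height covered per pass
    return math.ceil(height_m / coverage)

# Example: a 4 m tall sign filmed from 1.5 m away with a 90-degree FOV camera
print(vertical_passes(4.0, 1.5, 90.0))  # -> 2 passes
```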

For texture information collection, the video stream captured by the action camera is used. If every frame of the video stream were used, sequential frames would have large overlapping areas, leading to information redundancy and computational waste in subsequent processing. Therefore, image sequences are extracted from the video stream by frame sampling. In reconstructing production elements, a single capture of the minimum spatial unit of a production element cannot fully recover accurate texture and geometric information; at least two captures from different angles are required for precise reconstruction. The constraints for the horizontal and vertical overlap are given by Eq. (2).

$$\begin{aligned} \left\{ \begin{aligned} c_s&= \left( 1 - \frac{v/n}{2h \tan \left( \frac{F}{2}\right) }\right) \times 100\%, \\ c_h&= 1 - \frac{H}{W \times HC} \end{aligned} \right. \end{aligned}$$
(2)

Here n denotes the camera frame rate (frames per minute), v the speed of the mobile vehicle, s the distance traveled by the vehicle, h the vertical height from the camera to the ground, D the distance moved per frame (D = v/n), W the ground width covered per frame, \(c_s\) the overlap ratio between adjacent horizontal frames, and \(c_h\) the overlap ratio between vertical frames. The number of video frames to extract can then be chosen according to the desired overlap ratios.
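As a minimal sketch of how these constraints translate into a frame-sampling choice (assuming the reconstructed form of Eq. (2) with D = v/n; all names are illustrative):

```python
import math

def horizontal_overlap(n_fpm: float, v: float, h: float, fov_deg: float) -> float:
    """Overlap ratio c_s between adjacent horizontal frames, per Eq. (2).

    n_fpm:   camera frame rate n, in frames per minute
    v:       vehicle speed, in the same length unit as h, per minute
    h:       vertical height from the action camera to the ground
    fov_deg: camera field-of-view angle F, in degrees
    """
    d_per_frame = v / n_fpm                                  # D = v / n
    footprint = 2 * h * math.tan(math.radians(fov_deg) / 2)  # ground width per frame
    return 1.0 - d_per_frame / footprint

def sampling_interval(n_fpm, v, h, fov_deg, target=0.5, max_k=1000):
    """Largest frame sampling interval k whose effective rate n/k still
    keeps the horizontal overlap at or above the target ratio."""
    k = 1
    while k < max_k and horizontal_overlap(n_fpm / (k + 1), v, h, fov_deg) >= target:
        k += 1
    return k
```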

Colmap for structure-from-motion

The decimated video stream is processed into an image sequence with overlapping regions, and a multi-view stereo pipeline, COLMAP30,31, constructs an initial point cloud of the station's production elements. COLMAP is a Structure-from-Motion (SfM) pipeline designed for reconstructing ordered and unordered image collections. Data preprocessing in COLMAP begins with feature extraction from the input images, followed by feature matching and geometric verification. This initial phase yields validated geometric image pairs, internal correspondence points, and geometric relationships. This information then undergoes incremental reconstruction, including selection of initial image pairs, image registration, triangulation, and bundle adjustment, ultimately producing camera pose estimates, registered image data, and scene point cloud information. Finally, LLFF32 integrates the image information, camera poses, and parameters. These procedures provide initial data for subsequent flexible control at different resolutions.
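For reference, the standard COLMAP command-line stages described above can be driven from Python roughly as follows; paths are placeholders, and this sketches the usual SfM invocation rather than the authors' exact configuration:

```python
import subprocess

def run_sfm(image_dir: str, workspace: str) -> None:
    """Run COLMAP's standard SfM stages: feature extraction, matching, mapping."""
    db = f"{workspace}/database.db"
    # 1. Detect and describe local features in every input image.
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", db, "--image_path", image_dir], check=True)
    # 2. Match features across image pairs with geometric verification.
    subprocess.run(["colmap", "exhaustive_matcher",
                    "--database_path", db], check=True)
    # 3. Incremental reconstruction: registration, triangulation, bundle adjustment.
    subprocess.run(["colmap", "mapper",
                    "--database_path", db, "--image_path", image_dir,
                    "--output_path", f"{workspace}/sparse"], check=True)
```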

Resolution adjustment for flexible control

To address the need for precision control in static exterior models of production elements, we enhance general-precision models through upsampling to achieve high-resolution models with increased detail; conversely, downsampling is used to create lightweight models. Upsampling is performed before running Colmap, whereas downsampling occurs afterward: upsampling enriches the original images with additional feature points, which benefits the initial point cloud construction in SfM, whereas downsampling would cause a significant loss of feature points and harm the initial point cloud construction.

Resolution upsampling method

Resolution upsampling based on ESRGAN33: common approaches to increasing image resolution include bilinear interpolation, bicubic interpolation, and super-resolution reconstruction. This paper employs Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN)33 to upscale a basic model to a refined model; ESRGAN excels at sharpness and edge detail while suppressing artifacts.
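As a point of reference, the classical interpolation baselines mentioned above can be reproduced with OpenCV; in a production pipeline a trained ESRGAN generator would replace the `cv2.resize` call (this sketch shows only the baseline, not the ESRGAN network itself, and the paths are illustrative):

```python
import cv2

def upsample_bicubic(path_in: str, path_out: str, scale: int = 4) -> None:
    """Bicubic upsampling baseline; an ESRGAN generator replaces this step
    when sharper edges and fewer interpolation artifacts are required."""
    img = cv2.imread(path_in)
    h, w = img.shape[:2]
    up = cv2.resize(img, (w * scale, h * scale), interpolation=cv2.INTER_CUBIC)
    cv2.imwrite(path_out, up)
```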

Resolution downsampling method

Resolution downsampling based on Gaussian filtering: the input image is first subjected to Gaussian filtering, which removes high-frequency components (detailed parts of the image) while preserving low-frequency components (smooth areas). The Gaussian filter replaces each pixel value with the weighted average of the pixels in its neighborhood, with weights decreasing monotonically with distance from the center; its purpose is to remove high-frequency noise in preparation for downsampling. After filtering, the image is downsampled by discarding even rows and columns, halving both width and height. This process is repeated using the cv2.pyrDown function from the OpenCV library to construct progressively smaller image layers, each half the size of the preceding one.
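A minimal sketch of this pyramid construction, using the cv2.pyrDown routine named above (the file path and level count are illustrative):

```python
import cv2

def gaussian_pyramid(path_in: str, levels: int = 4):
    """Build a Gaussian pyramid: each level Gaussian-filters the previous
    one and discards every other row and column, halving width and height."""
    img = cv2.imread(path_in)
    pyramid = [img]
    for _ in range(levels):
        img = cv2.pyrDown(img)  # blur + 2x downsample in a single call
        pyramid.append(img)
    return pyramid

# Example: 4 pyrDown steps give a 16x reduction per dimension (the 16d setting)
```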

High-fidelity rendering

The preceding steps generate sparse point clouds from image sequences with overlapping regions via the SfM process. However, these point clouds are too sparse to directly and accurately represent the appearance and color of station production elements, leading to significant discrepancies with reality. To address this, high-fidelity rendering techniques such as Neural Radiance Fields (NeRF) and its derivatives are commonly employed; in this study, we use the real-time neural rendering approach 3D Gaussian Splatting to reconstruct high-fidelity representations of station production elements.

3D Gaussian Splatting22 has had a profound impact due to its efficient generation speed and high-fidelity output. The process begins by initializing the SfM point cloud as multiple 3D Gaussians. Using the camera's extrinsic parameters, the Gaussians are then projected onto the image plane through the splatting algorithm, followed by differentiable rasterization to render the images. The rendered results are compared with real images to compute loss values, which drive backpropagation.
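The training objective in the original 3DGS paper22 combines an L1 term with a D-SSIM term, L = (1 − λ)·L1 + λ·(1 − SSIM) with λ = 0.2. A simplified PyTorch sketch of this loss (using a global, non-windowed SSIM for brevity rather than the windowed SSIM of the paper) is:

```python
import torch

def ssim_global(x: torch.Tensor, y: torch.Tensor,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    # Simplified global (non-windowed) SSIM over image tensors in [0, 1].
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def gs_loss(render: torch.Tensor, gt: torch.Tensor, lam: float = 0.2) -> torch.Tensor:
    # (1 - lambda) * L1 + lambda * (1 - SSIM), lambda = 0.2 as in the 3DGS paper.
    l1 = (render - gt).abs().mean()
    return (1 - lam) * l1 + lam * (1 - ssim_global(render, gt))
```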

Experiment

Evaluation metrics

The objective of constructing a static model for passenger stations is to simultaneously reduce manual labor, shorten the modeling time, and meet the requirements for establishing high-fidelity models. To alleviate manual labor and decrease the duration of model construction, this study considers automation as the primary method, positing that less time consumed signifies greater efficiency.

  • Time consumption calculations. The time aspects are divided into several parts: duration of mobile vehicle data collection, duration of precision control generation, duration of initial point cloud construction through SfM, and duration of rendering. Different duration metrics are selected for various experimental conditions.

    • The duration of the initial point cloud construction through SfM comprises Extract Time (1), Feature Extraction Time (2), Feature Matching Time (3), and Initial Point Cloud Reconstruction Time (4).

    • The duration for rendering is composed of Training Time (5) and Generating Time (6).

  • High-fidelity indices. This paper uses three widely recognized benchmarks for model fidelity: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). Higher PSNR and SSIM values indicate better quality, while a lower LPIPS score signifies closer perceptual likeness (a computation sketch follows this list).
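These three metrics can be computed with standard libraries; a minimal sketch (using scikit-image for PSNR/SSIM and the lpips package, with illustrative file paths) is:

```python
import cv2
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fidelity_metrics(render_path: str, gt_path: str):
    """Return (PSNR, SSIM, LPIPS); higher PSNR/SSIM and lower LPIPS are better."""
    render = cv2.cvtColor(cv2.imread(render_path), cv2.COLOR_BGR2RGB)
    gt = cv2.cvtColor(cv2.imread(gt_path), cv2.COLOR_BGR2RGB)
    psnr = peak_signal_noise_ratio(gt, render, data_range=255)
    ssim = structural_similarity(gt, render, channel_axis=2, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1]
    to_t = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips.LPIPS(net="alex")(to_t(render), to_t(gt)).item()
    return psnr, ssim, lp
```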

Experimental setup

The proposed model was developed using the PyTorch framework with CUDA 11.8 on a Linux Mint 21.2 system, accelerated by four NVIDIA GeForce RTX 3080 Ti GPUs, each with 12 GB of memory. The physical site for the experiment was Qinghe Station on the Beijing-Zhangjiakou High-Speed Railway. The modeled passenger service production elements included daily operational elements (safety instruction signs, entrance guidance signs, floor schematic guidance signs, and real-name verification gates) and emergency management elements (fire hydrants, portable fire extinguishers, fire alarms, and box-type fire extinguishers).

Ablation studies

Sparse sampling analysis

The initial image sequences collected by the mobile vehicle form the basis for constructing the initial point cloud in SfM and are fundamental to rendering. Following the sparse sampling approach, multiple video streams were subsampled, yielding frame sequences in which the horizontal and vertical dimensions were each sampled more than twice, in order to verify the effect of different subsampling settings on time efficiency and fidelity. This section takes a box-type fire extinguisher, an emergency management production element, as the object of study and analyzes the sampling rate through multiple experiments. These experiments involved seven collection passes at varying heights, with an initial video stream totaling 32 seconds and comprising 1007 frames; the recorded results are averages over repeated trials. Because mobile vehicle data collection is similar across settings, this section focuses on the duration of initial point cloud construction through SfM and the rendering duration.

Table 2 SfM time of different initial sequence sampling quantities.

The durations of the various SfM stages are listed in Table 2, in minutes. The results indicate that as the sampling frame interval increases from 2 to 30 frames, the frame extraction time decreases, fewer frames are obtained, and the SfM processing time declines accordingly.

The rendering durations for different initial image sequence sampling counts are shown in Table 3, also in minutes. The sampling frame interval ranged from 2 to 30 frames. The total processing time, a comprehensive indicator of processing efficiency, decreased from 24.533 minutes to 16.000 minutes, demonstrating a general trend of decreasing processing duration with increasing frame interval.

Table 3 Total rendering processing time of different initial sequence sampling quantities.

The fidelity achieved with different initial image sequence sampling counts is presented in Table 4. These findings suggest that reducing the sampling interval increases the number of preliminary SfM points, potentially yielding higher-quality 3D reconstruction. While smaller sampling intervals have an advantage in preliminary point cloud quantity, there exists a turning point beyond which PSNR and SSIM decrease significantly.

Table 4 High-fidelity comparison of different initial sequence sampling quantities.

As shown in Fig. 3, the results compare the ground truth (GT) with the final reconstructions at various sampling frame intervals. At a 2-frame sampling interval, the text is clearly discernible, with both the Chinese and English sections (“FIRE EXTINGUISHER BOX”) easily recognizable. As the sampling frame interval increases to 30, however, the rendered results show a significant loss of text information, posing a substantial challenge to text recognition. Details become less distinct, reflections of light are entirely absent, and there is a noticeable presence of “dirty floating objects,” all of which degrade the final outcome and indicate lower fidelity at large frame intervals.

Fig. 3
figure 3

Comparison of rendering results with different initial image sequence sampling quantities.

Flexible control analysis

To assess the usability of flexible control, this experiment adjusted the resolution of safety instruction signage, a daily operational production element. The initial video stream had a total duration of 107 seconds, comprising 3208 frames, and the initial sampling frames are designated as the baseline. The motion camera used for capture has a resolution of 5312×2988 pixels, which is already fine-grained. For the experiment, the initial images were downsampled by factors of 4, 8, and 16, and the 8d result was then upsampled to produce 8u for validation. As illustrated in Fig. 1, downsampling occurs after SfM while upsampling occurs before SfM, thus maintaining consistency with the baseline values. After multiple trials, the results were averaged. Processing durations for the various SfM stages are recorded in Table 5, in minutes; rendering durations at different resolutions are presented in Table 6, also in minutes; and Table 7 compares fidelity at the different resolutions.

Table 5 SfM processing time at different resolutions.
Table 6 Total rendering processing time at different resolutions.
Table 7 High-fidelity comparison at different resolutions.

Since downsampling occurs after Structure-from-Motion (SfM), the SfM duration and initial point cloud count for 4× (4d), 8× (8d), and 16× (16d) downsampling remain consistent with the baseline, while rendering time decreases as the downsampling factor increases. The SfM processing time after 8× upsampling (8u) is less than that of the baseline, with a larger initial point cloud, and its rendering time is slightly below the baseline. In terms of fidelity, 8× downsampling (8d) achieves the best PSNR, SSIM, and LPIPS values. At 16× downsampling (16d), however, the image size is reduced to 16×7, a resolution at which LPIPS can no longer be computed.

Comparisons

To more comprehensively evaluate the performance of the proposed model, it was compared with state-of-the-art rendering methods under identical environmental configurations. Specifically, the model was benchmarked against NeRF18, NeRF++19, Mip20, Mip36021, and 3DGS22. The comparison metrics included rendering duration, PSNR, SSIM, and LPIPS. Experiments were conducted on passenger station production elements in daily operation (safety instruction signs [dth], entrance guidance signs [ftq], floor schematic guide signs [jzk], and real-name verification gates [zj]) and emergency management (fire hydrants [xfs], box-type fire extinguishers [mhq], fire alarms [hz], and portable fire extinguishers [stm]).

Daily operation elements

The model focused on the daily operational production elements of passenger stations: safety instruction signs (dth), entrance guide signs (ftq), floor schematic guide signs (jzk), and real-name verification self-service gates (zj). Specific results are presented in Table 8, with the best result in each category highlighted in bold. Additionally, a qualitative analysis of the static daily operation models was conducted, with rendering results from different methods shown in Fig. 4.

Table 8 Comparison of high fidelity of daily operation by different methods.
Fig. 4
figure 4

Comparison of rendering results with daily operation by different methods.

The bold entries in Table 8 represent the best results. For daily operational production element modeling, the time required by our method is significantly reduced compared to NeRF18, Mip20, Mip36021, and NeRF++19, although the total duration is slightly higher than that of 3DGS22. The LPIPS of dth decreased by 72.00% compared to NeRF, the PSNR of jzk increased by 35.29% compared to Mip, and the SSIM of ftq improved by 55.31% compared to NeRF. Although the PSNR of ftq is slightly lower than that of Mip360, its LPIPS decreased by 51.05% and its SSIM increased by 26.62% compared to Mip. Overall, our method demonstrates significant improvements in both total processing time and fidelity metrics compared to the other approaches, indicating an advantage in overall static model reconstruction quality.

Emergency management operational elements

Similarly, this paper compares the modeling of emergency management production elements, including fire hydrants (xfs), box-type fire extinguishers (mhq), fire alarms (hz), and portable fire extinguishers (stm). The proposed model was benchmarked against NeRF18, NeRF++19, Mip20, Mip36021, and 3DGS22 using metrics including total training time (Total Time), PSNR, SSIM, and LPIPS. The specific results are shown in Table 9.

Table 9 Comparison of rendering results of emergency management by different methods.
Fig. 5
figure 5

Comparison of rendering results of emergency management by different methods.

The bold entries in Table 9 represent the best results. For emergency management operational elements, the processing time of our method is significantly lower than that of NeRF18, Mip20, Mip36021, and NeRF++19, and the total duration improves relative to 3DGS22. Specifically, xfs exhibits a 52.49% decrease in LPIPS and increases of 42.38% in PSNR and 16.16% in SSIM relative to Mip360. Similarly, mhq shows a 37.85% reduction in LPIPS and improvements of 29.08% in PSNR and 4.85% in SSIM compared to NeRF++. For stm, our method achieves an LPIPS of 0.3130356, 20.66% lower than Mip; although its PSNR is lower than that of Mip360 and NeRF++, its image similarity is higher than both, demonstrating strong performance in maintaining structural fidelity.

As illustrated in Fig. 5, the distant background in the hz result is not rendered as well as in Mip360, which accounts for the lower metrics. However, the modeling quality of the production element itself is visually indistinguishable from Mip360's and does not affect the reconstruction of this element. The other results in the figure indicate that our method significantly outperforms the others in perceptual quality, execution efficiency, visual similarity, and structural fidelity.

Conclusion

To address the labor-intensive and time-consuming nature of digital twin modeling, this paper proposes a novel static modeling method for stations based on mobile vehicles: MSCRAGS (Mobile vehicle-Sparse Sampling-Colmap-Resolution adjustment-Gaussian Splatting). The method meets the need for flexible control of static models by using sparse sampling to collect multi-height, multi-angle appearance color data of passenger service elements. These data are input into a structure-from-motion multi-view stereo pipeline to generate sparse point cloud information on the geometric dimensions and color appearance of the elements, and 3D Gaussian Splatting is then used to render the sparse point clouds into high-fidelity reconstructions. To meet fidelity requirements, the results undergo resolution adjustment to create static models of station service elements at different resolutions. Experiments conducted at Qinghe Station demonstrate that, compared with other state-of-the-art methods, this approach significantly reduces modeling time and improves accuracy, as verified by comparisons of modeling time and fidelity metrics. The proposed method targets indoor modeling of railway passenger stations; it is not applicable to modeling railway lines spanning thousands of kilometers or to dynamic, complex scenes with large moving crowds. Based on these findings and limitations, future research will explore new technologies for static modeling of complex railway scenes with low computational resource requirements.