Introduction

As a new app-based door-to-door transport mode, ride-hailing has exploded in popularity around the world. In China, for instance, the numbers of certified platforms, drivers, vehicles, and users had reached 337, 6.57 million, 2.79 million, and 528 million, respectively, by December 2023, and 894 million trips were completed in December alone. However, this rapid growth conceals many service issues. For instance, users in Shanghai reported multifaceted grievances in the first quarter of 2023: (a) procedural failures (e.g., drivers abruptly terminating trips before destinations), (b) ethical breaches (e.g., big data-enabled price discrimination exploiting frequent users), (c) safety compromises (e.g., dangerous driving patterns), and (d) economic exploitation (e.g., unjustified detours and surge pricing). There is ample scope to improve ride-hailing services, but strategies should be informed by an in-depth knowledge of users’ needs from the service quality perspective. Therefore, how to measure, evaluate, and ultimately improve ride-hailing service quality so that the industry sustains its competitive edge has become a development priority.

Berry et al. (1990) stated that “customers are the sole judges of service quality”. Service quality hinges on customers’ perceptions of each specific attribute characterizing the service (De Oña et al., 2013). How to develop a scale (measure, instrument) embracing appropriate attributes (attribute-specific items, indicators) to measure service quality from the customer’s perspective has been extensively explored. It is generally recognized that service quality is multidimensional (Parasuraman et al., 1988). SERVQUAL (Parasuraman et al., 1988), SERVPERF (Cronin and Taylor, 1992), and E-S-QUAL (Parasuraman et al., 2005) are the three multidimensional generic measures most commonly used and replicated in the literature. However, doubts have long been raised about applying these scales and their simple adaptations across a broad spectrum of services, as they cannot cover some unique features of a particular service. For example, these generic scales lack constructs (dimensions, factors) or items to capture ethical breaches, such as big data-enabled price discrimination, which is unique to ride-hailing services and also a common grievance among users. As a consequence, research emphasis has shifted from adapting generic measures to developing customized measures for specific industry settings. Customized measures have attracted great interest from practitioners and researchers in traditional transportation industries such as air transport (e.g., Bezerra and Gomes, 2016), railway transport (e.g., Nathanail, 2008), and especially public transport (bus and urban rail) (e.g., Wen et al., 2005; Lai and Chen, 2011; De Oña et al., 2013; Yaya et al., 2015; Soltanpour et al., 2018; Zhang et al., 2019). By contrast, customized measures for the newly emerging paratransit industry of ride-hailing have received very limited research attention, and the existing ones are confined to specific regions and cannot be applied to other geographical contexts.
Therefore, this study, as the first step of an extensive research project aiming at improving ride-hailing services and attracting more users, places its emphasis on scale development and validation. Its objective and contribution are twofold. The first is to customize a multidimensional service quality measure for the less researched ride-hailing industry. The second, and more important, is to propose a generic approach (for the development and validation of a multidimensional scale) that can be readily and effectively transplanted to any service setting (any industry or region). To accomplish these objectives, this study employs a mixed-methods design combining qualitative and quantitative analysis. First, grounded theory is applied to derive service quality dimensions and items directly from user narratives through focus group discussions. These dimensions and items then undergo rigorous statistical validation via exploratory and confirmatory factor analyses. Finally, multi-group invariance testing ensures the measure’s robustness across diverse user cohorts.

This study proceeds as follows. Section “Literature review” presents the literature review. It is followed by a description of our research methods and data in Section “Methods and data”. Sections “Results” and “Discussion” present the results and a discussion, respectively. The paper ends with conclusions and a discussion of future work in Section “Conclusions and future work”.

Literature review

A comprehensive review of the existing literature on ride-hailing service quality reveals a notable focus on exploring relationships with constructs like satisfaction and loyalty via structural equation modeling (SEM), while neglecting the systematic development and validation of measurement scales (Nguyen-Phuoc et al., 2020; Su et al., 2021; Akram et al., 2024; Ricardianto et al., 2024; Katili et al., 2024). Ride-hailing is one form of paratransit, which also includes jeepneys, jitneys, samlors, and taxis, among others. Recent studies on customized service quality measures for paratransit were therefore collected, reviewed, and summarized in Table 1.

Table 1 Literature involving customized service quality measures for paratransit modes.

Contextual variability in service quality dimensions and items

As shown in Table 1, although all these measures are for paratransit and most have considered the multidimensional nature of service quality, dimensions and items exhibit pronounced variability across service types (jitney, jeepney, taxi, ride-hailing, etc.). Even within the same paratransit service type (e.g., traditional taxis), measures differ significantly across regions. These differences reflect both universal priorities and unique context-specific user needs. For traditional taxis, items concerning drivers' attire and etiquette, facility cleanliness and conditions, waiting time, and journey time appear consistently across regions, underscoring their foundational role in service quality. However, exclusive items have also been derived for specific regional contexts such as Doha (Shaaban and Kim, 2016), Santander (Alonso et al., 2018), Hong Kong (Wong and Szeto, 2018), and Melbourne (Rose and Hensher, 2018), reflecting unique context-specific user needs. These disparities align with the assertion of De Oña and De Oña (2015) that the service aspects appreciated by users are highly dependent on their geographical area. Therefore, it is crucial to develop service quality measures that are pertinent to the service type and regional context.

Ride-hailing: the hybrid challenge of physical and digital touchpoints

Ride-hailing, as a fusion of traditional taxi services and mobile technology, introduces unique measurement requirements (Shah, 2020). Studies commonly replicate taxi-related items (e.g., driving security, waiting time). However, dimensions and items concerning app use have been considered only by Nguyen-Phuoc et al. (2020), Shah (2020), Nguyen-Phuoc et al. (2021), Li et al. (2022), Kumar et al. (2022) and Boar et al. (2023), and not by Su et al. (2021), Shah and Hisashi (2022), Vega-Gonzalo et al. (2023) or Wang et al. (2023). This inconsistency highlights a critical gap in capturing the full spectrum of ride-hailing service quality, where both offline (driver behavior, vehicle condition) and online (app functionality, booking reliability) interactions are essential.

Neglect of measurement invariance and subgroup differences

Past practices give limited consideration to measure validity. With only a few exceptions (e.g., Rose and Hensher, 2018), most studies have tested reliability and validity to some extent, usually by exploratory factor analysis (EFA), confirmatory factor analysis (CFA), or both. However, researchers seldom consider measurement invariance, which assesses whether the factor structure and parameter estimates of a measurement model are statistically equivalent across different user groups. As the service aspects appreciated by users depend heavily on their sociodemographic and travel characteristics (e.g., travel reason) (De Oña and De Oña, 2015), failing to test invariance limits the generalizability of findings across different cohorts. Therefore, it is imperative to test measurement invariance across different cohorts of gender, income, trip purpose, etc.

Overreliance on expert-driven design and second-hand knowledge

Methodologically, most studies (Sumaedi et al., 2012; Rose and Hensher, 2018; Wong and Szeto, 2018; Nguyen-Phuoc et al., 2020; Shah, 2020; Nguyen-Phuoc et al., 2021; Su et al., 2021; Kumar et al., 2022; Halakoo et al., 2022; Li et al., 2022; Vega-Gonzalo et al., 2023; Wang et al., 2023; Boar et al., 2023; Ricardianto et al., 2024) followed researchers’ or experts’ own judgments in selecting and adapting dimensions and items from the existing literature and materials to construct a scale. Even studies that incorporated the judgments of a small number of users through a pilot survey (e.g., Shah, 2020; Askari et al., 2021; Kumar et al., 2022) remained anchored in the existing literature and materials. Participants’ perspectives on the social phenomenon of interest should unfold as participants view it, not as researchers view it (Marshall and Roseman, 1989). The expert-driven practice goes against the philosophy that “customers are the sole judges of service quality”, because the dimensions and items chosen are those considered important by researchers or experts rather than by users or customers. More importantly, the existing literature and materials are second-hand knowledge that cannot authentically and comprehensively represent the service aspects appreciated by users in a specific context. Scales grounded in expert-driven design and second-hand knowledge may introduce conceptual risks: misinterpreting customer perceptions, overlooking context-specific needs, and weakening the theoretical validity of measurement tools. Such limitations underscore the need for methodologies that center user-generated insights to ensure scales authentically reflect the service aspects that users themselves deem critical.

The current study seeks to fill these gaps in two ways. First, it holds focus groups to collect first-hand knowledge of users’ experiences and perceptions of ride-hailing, from which measure dimensions and items are then extracted via grounded theory coding techniques. Second, when validating the measure, it tests measurement invariance across a wide range of variables via multi-group confirmatory factor analysis (MGCFA). It takes Suzhou, a prefecture-level city of Jiangsu Province in eastern China, as an example. Our selection of Suzhou serves dual purposes: 1) validating the operational feasibility of the generic approach through concrete implementation, and 2) demonstrating its effectiveness in generating customized measurement tools. The city exhibits demographic and economic characteristics that differ from those of other cities (e.g., 65.2% of its population being immigrants and a GDP ranking 6th nationally). This may lead the service quality indicators prioritized by ride-hailing users in Suzhou to diverge from those in other cities, necessitating a tailored measure for Suzhou rather than the generic measures. To our knowledge, this is the first study to customize a service quality measure for the specific context of ride-hailing in Suzhou.

Methods and data

Dimensions and items generation

Intangibility, production-consumption inseparability, and heterogeneity make measuring service quality very complex (Yaya et al., 2015). For such a complex issue, grounded theory, which allows theory to emerge naturally from data via systematic coding procedures, is an effective inductive tool. It derives dimensions and items directly from user narratives rather than drawing and retrofitting them from the existing literature and materials, and can therefore accurately capture user-articulated pain points like ethical breaches (e.g., big data-enabled price discrimination exploiting frequent users). This aligns well with the core need, when developing scales, to identify unknown service quality dimensions and items from unstructured consumer experiences. Besides, the constant comparison method transforms fragmented user feedback into a hierarchical conceptual system (i.e., main theme → themes (dimensions) → categories (indicators)), directly supporting the design of a hierarchical indicator framework for scales. In contrast, alternative methodologies exhibit critical limitations. Ethnography requires long-term immersion to observe the holistic culture of ride-hailing systems (e.g., driver communities, platform rules) and produces thick descriptions rather than operationalizable indicators, which are difficult to translate into measurable scale items. Phenomenology focuses on revealing the essence of subjective experiences, whereas scales require measurable variables that are stable across groups; overemphasis on individual perceptions may compromise scale generalizability across different subgroups. The method chosen here to generate measure dimensions and items follows the original grounded theory proposed by Glaser and Strauss (1967). It emphasizes the bottom-up extraction of themes, without the predetermined frameworks favored by Strauss and Corbin (1990), and follows the data collection and data analysis strategies described below.

Data collection

In-depth interviews and focus groups are the most commonly used qualitative data collection methods (Mars et al., 2016). Focus groups allow participants to respond to each other’s comments, helping to generate new concepts. Considering this merit and data collection efficiency, multiple rounds of focus groups were held to collect first-hand knowledge of users’ experiences and perceptions based on their day-to-day ride-hailing practices. A recruitment notice was posted to residents in Suzhou via the WeChat app in November 2022. People who saw the notice and were interested in participating could contact the research team and report their gender, age, ride-hailing adoption, and usage frequency. Those who had not used ride-hailing in the previous three months were excluded to make sure all recruited participants could vividly remember their ride-hailing experiences and perceptions in Suzhou. Following Morgan’s (1992) suggestion of 3–5 focus groups per study and 6–10 participants per group, stratified random sampling was applied to recruit 32 users for each round of four focus group discussions (Table 2). We created age-homogeneous groups to make each relatively similar in composition, while the gender and usage frequency makeup within each group was kept heterogeneous to facilitate diverse voices and views. Each recruited participant signed an informed consent form and was rewarded with a ¥150 gift card for attendance. Four topics were successively introduced in each group to prompt the discussion: (a) instances of and reasons for satisfaction/dissatisfaction when using ride-hailing services in Suzhou; (b) descriptions of an ideal ride-hailing service; (c) factors important in evaluating ride-hailing service quality; (d) performance expectations concerning ride-hailing services. Focus group discussions were held on December 10, 2022 (Saturday); each lasted nearly two hours and was digitally recorded (audio) and transcribed verbatim.

Table 2 Participants’ characteristics.

Data analysis

Kurniawan et al. (2018) pointed out that “respondents’ verbal expressions which signify events, actions, reactions, beliefs, values, attitudes, aspirations, deliberations, concerns, experiences and feelings are instances of code-worthy data”. These expressions in the original transcripts were analyzed through substantive and theoretical coding procedures following the guidelines provided by Glaser and Strauss (1967) to uncover key concepts, categories, themes, and their interrelationships. During the substantive coding stage, the code-worthy data were coded with straightforward terms, which were then further refined into concepts and categories through constant comparison. Constant comparison is a defining characteristic of grounded theory: it identifies codes in the data and constantly compares them to previously identified codes, thereby revealing patterns in the data and allowing new concepts and categories to emerge where statements do not fit any of the currently identified ones (Glaser, 1992). Once the categories had thus been developed, theoretical coding began. Again using constant comparison, relations were sought between categories, allowing them to be merged and combined to form superordinate themes that would eventually develop into the main theme. Ultimately, after 7 rounds of focus group discussions, saturation was reached: new data only supported previously identified concepts and did not introduce any new ones, which marked the end of the coding procedures. To ensure coding agreement, the whole process was performed by two different teams. Intercoder reliability was assessed using Cohen’s Kappa, with discrepancies discussed and adjusted through iterative consensus meetings. Examples of code-worthy sentences and their analysis results are shown in Table S2 in the supplementary information file. The analysis ultimately yielded 103 concepts that were sorted into 12 categories and further refined into 5 themes.
Therefore, a 5-dimension 12-item scale was preliminarily developed (Fig. 1). It is essentially a hypothesis (this 5-dimension 12-item scale can effectively measure the ride-hailing service quality in Suzhou) that needs to be empirically tested with large samples. The next step is to perform a questionnaire survey and test this hypothesis based on collected samples.
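
The intercoder reliability check mentioned above can be illustrated with a minimal sketch of Cohen's Kappa for two coding teams. The code labels and segments below are hypothetical and are not taken from the study's transcripts; the function only assumes that both teams assigned exactly one code to each of the same segments.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who each assign one code per segment."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of segments")
    n = len(labels_a)
    # Observed agreement: share of segments given the same code by both coders.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal proportions, summed over codes.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical code throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical codes assigned by the two teams to six transcript segments.
team_1 = ["detour", "detour", "price", "safety", "price", "safety"]
team_2 = ["detour", "price", "price", "safety", "price", "safety"]
kappa = cohens_kappa(team_1, team_2)
```

In this toy example the teams agree on five of six segments, and the chance-corrected agreement is lower than the raw 5/6 because some agreement is expected by chance alone.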

Fig. 1: Grounded theory results.

It shows the specific 5 dimensions (Reliability, Efficiency, ……) and 12 items (Driving security, Deviation, ……) of the preliminarily developed scale.

Questionnaire design and data collection

The questionnaire was designed on a professional online survey platform in China named “Wenjuanxing”. At the very beginning, a screening question was asked to exclude respondents who had not used ride-hailing in Suzhou in the previous three months. The 12 items derived in Section “Dimensions and items generation” were recast into scale statements. A five-point Likert scale ranging from “strongly disagree” (1) to “strongly agree” (5) was adopted to capture users’ perceptions of these attributes. In addition, the survey contained questions about a range of relevant sociodemographic and travel characteristics including gender, age, occupation, monthly household income, private car ownership, ride-hailing usage frequency, ride-hailing usage scenario (occasion/reason for travel, trip purpose), choice of alternative mode of transportation if ride-hailing services were unavailable (alternative travel mode, mode substituted), ride-hailing trip timing (time of day), as well as questions not pertinent to the present study. Before the full-scale survey, a paper-based pilot survey was carried out with 30 employees working in CCDI (Suzhou) Exploration & Design Consultant CO., Ltd., and the questionnaire was corrected accordingly.

The survey link was shared with residents in Suzhou through the WeChat app from January 2, 2023, to February 3, 2023. WeChat red packets (a form of electronic money) were awarded by lottery as a gift for participation in the survey. Questionnaires with extremely short completion times, with inconsistent, illogical, or incomplete answers, or with straight-lining answer patterns were removed. For questions offering the option “others”, some participants chose “others” but declined to specify their exact answers. These questionnaires were treated as incomplete and were also removed, leaving a total of 1464 valid questionnaires that constituted the analytical sample.
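
As an illustration only, the screening rules described above could be expressed as a simple filter. The record layout (`seconds`, `items`) and the 60-second threshold are hypothetical assumptions made for this sketch; the study does not report the exact cut-offs it applied, and consistency and logic checks are omitted here because they depend on the specific questionnaire.

```python
def is_valid_response(resp, min_seconds=60, n_items=12):
    """Return True if a questionnaire passes the basic screening rules.

    `resp` is a hypothetical dict with keys 'seconds' (completion time) and
    'items' (list of Likert answers, None for a skipped item). The 60-second
    threshold is an illustrative assumption, not a value from the study.
    """
    items = resp["items"]
    if resp["seconds"] < min_seconds:
        return False  # completed implausibly fast
    if len(items) != n_items or any(v is None for v in items):
        return False  # incomplete answers
    if any(v not in (1, 2, 3, 4, 5) for v in items):
        return False  # outside the five-point Likert range
    if len(set(items)) == 1:
        return False  # straight-lining: same option chosen for every item
    return True

# Hypothetical responses: one valid, one straight-lined, one too fast.
responses = [
    {"seconds": 240, "items": [4, 3, 5, 2, 4, 4, 3, 5, 4, 2, 3, 4]},
    {"seconds": 200, "items": [3] * 12},
    {"seconds": 20, "items": [4, 3, 5, 2, 4, 4, 3, 5, 4, 2, 3, 4]},
]
valid = [r for r in responses if is_valid_response(r)]
```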

Scale purification and assessment

EFA can aid in unveiling the real factor structure represented by a series of measure items. Therefore, an EFA with varimax rotation (orthogonal rotation) was conducted in SPSS Statistics 22 to suggest a more likely factor structure for the 12 items. Before the analysis, (a) univariate and multivariate normality, and multicollinearity of these 12 items were checked via skew, kurtosis, Mardia’s multivariate kurtosis, and VIF (variance inflation factor); (b) their internal consistency was tested via Cronbach’s alpha; (c) their appropriateness for EFA was assessed by KMO and Bartlett’s test.
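
For readers who want to reproduce the internal consistency check outside SPSS, the following is a minimal sketch of Cronbach's alpha using only the Python standard library. The small response matrix is hypothetical; the formula is the usual one based on item variances and the variance of the summed score.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha; `rows` holds one list of k item scores per respondent."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("summed scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical answers of four respondents to three Likert items.
answers = [[1, 1, 2], [2, 3, 3], [4, 4, 5], [5, 4, 4]]
alpha = cronbach_alpha(answers)
```

Values above 0.7 are conventionally read as acceptable internal consistency, which is the threshold the study applies.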

Once the actual factor structure was known, we specified a CFA in Amos 24 to evaluate this structure. The maximum likelihood method with a bootstrap procedure (5000 resamples) was selected, as the bootstrap procedure addresses violations of normality (Kline, 2015). The measurement model must have acceptable model fit, reliability and validity, or else it should be adjusted. Model fit was assessed with the indexes most used in the literature, namely, χ² (chi-square), df (degrees of freedom), CFI (comparative fit index), RMSEA (root mean square error of approximation), TLI (Tucker–Lewis index) and GFI (goodness of fit index) (Jackson et al., 2009). The cut-offs are CFI, TLI, and GFI > 0.9 and RMSEA < 0.08 (Hu and Bentler, 1999). Reliability and validity were checked via indicator reliability, internal consistency reliability, convergent validity, and discriminant validity (Hair et al., 2009; Urbach and Ahleman, 2010). The criteria are shown in Table 3. Poor model fit, reliability, and validity may be associated with items that have low factor loadings and large modification index values. The methodological proposal of Chen and Hwang (2006) was followed for model adjustment. First, items with factor loadings smaller than 0.45 are deleted. Then, items with large modification index values are deleted. Model fit is rechecked each time an item is deleted. In addition, offending estimates, such as negative or nonsignificant variances and standardized factor loadings above 1.0 in absolute value, were checked (Marsh et al., 1998; Hair et al., 2009; Kolenikov and Bollen, 2012).
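
The convergent and discriminant validity checks can be sketched from standardized factor loadings. The sketch below assumes the commonly used formulas for average variance extracted (AVE) and composite reliability (CR) and the Fornell–Larcker comparison of inter-factor correlations with the square root of AVE; the loadings and the correlation are hypothetical, and Amos output would normally supply the real values.

```python
from math import sqrt

def ave(loadings):
    """Average variance extracted: mean of squared standardized loadings."""
    return sum(l * l for l in loadings) / len(loadings)

def composite_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    s = sum(loadings)
    error = sum(1 - l * l for l in loadings)  # error variance of each standardized item
    return s * s / (s * s + error)

def fornell_larcker_ok(ave_by_factor, correlations):
    """True if every inter-factor correlation is below both factors' sqrt(AVE)."""
    return all(
        abs(r) < min(sqrt(ave_by_factor[a]), sqrt(ave_by_factor[b]))
        for (a, b), r in correlations.items()
    )

# Hypothetical standardized loadings for two factors and their correlation.
loadings = {"Service": [0.8, 0.7, 0.6], "Integrity": [0.9, 0.8, 0.85]}
ave_values = {name: ave(vals) for name, vals in loadings.items()}
cr_values = {name: composite_reliability(vals) for name, vals in loadings.items()}
discriminant = fornell_larcker_ok(ave_values, {("Service", "Integrity"): 0.45})
```

In this toy case the first factor would narrowly miss the AVE > 0.5 threshold while still passing CR > 0.7, which shows why the two criteria are reported separately.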

Table 3 Reliability and validity criteria.

Measurement invariance

To further probe measure validity, invariance tests were performed across different subgroups of gender, age, occupation, monthly household income, private car ownership, usage frequency, usage scenario, alternative travel mode, and trip timing by means of MGCFA in Amos 24.

MGCFA involves a sequence of hierarchical steps that starts with identifying a baseline model (Byrne, 2012). The measurement model obtained through scale purification and assessment was estimated separately in each subgroup to examine its suitability (model fit, factor loadings) as the baseline model. After completing this preliminary task, the configural invariance test (test for the equivalence of factor structure) was conducted by examining the CFI, RMSEA, TLI, and GFI of the unconstrained model (Table 4). According to Byrne (2012), factor loadings, factor covariances, and factor variances are the key and most commonly tested parameters in determining measurement invariance, while error variance equivalence is now widely accepted as an excessively rigorous test. Therefore, three further invariance tests (factorial invariance, factor covariance invariance, and factor variance invariance, i.e., tests for the equivalence of parameter estimates) were performed by comparing a series of nested models (2-1, 3-2, 4-2 in Table 4). Whether the difference between two nested models is significant can be tested via a chi-square difference test: the changes in χ² and df between the nested models are calculated, and the corresponding p-value is obtained from Δχ² and Δdf (Long, 1983). A p-value > 0.05 indicates that the corresponding null hypothesis in Table 4 should be accepted. If it is rejected because the p-value < 0.05, a practical criterion of |ΔCFI| < 0.01 (ΔCFI is the change in CFI between two nested models, a robust statistic for testing between-group invariance) indicates that the difference is largely unsubstantial and thus the hypothesis should still be accepted (Cheung and Rensvold, 2002). A change in CFI of ≥ 0.01 in absolute value, supplemented by a change in RMSEA of ≥ 0.015 in absolute value, would indicate nonequivalence (F. F. Chen, 2007).
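
The chi-square difference test for nested models can be sketched as follows. To stay dependency-free, the sketch computes the chi-square upper-tail probability with a standard series and continued-fraction evaluation of the regularized incomplete gamma function; in practice a statistics library would normally be used instead. The fit statistics in the example are hypothetical and are not values from Table 9.

```python
import math

def _log_prefactor(a, x):
    # log of x**a * exp(-x) / Gamma(a)
    return -x + a * math.log(x) - math.lgamma(a)

def _lower_series(a, x):
    """Regularized lower incomplete gamma P(a, x) by series (used for x < a + 1)."""
    term = 1.0 / a
    total = term
    ap = a
    for _ in range(10000):
        ap += 1.0
        term *= x / ap
        total += term
        if abs(term) < abs(total) * 1e-14:
            break
    return total * math.exp(_log_prefactor(a, x))

def _upper_cf(a, x):
    """Regularized upper incomplete gamma Q(a, x) by continued fraction (x >= a + 1)."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(_log_prefactor(a, x))

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if df <= 0:
        raise ValueError("df must be positive")
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    if half_x < a + 1.0:
        return 1.0 - _lower_series(a, half_x)
    return _upper_cf(a, half_x)

def chi_square_difference(chi2_constrained, df_constrained, chi2_free, df_free):
    """Return (delta chi-square, delta df, p-value) for two nested models."""
    d_chi2 = chi2_constrained - chi2_free
    d_df = df_constrained - df_free
    if d_df <= 0:
        raise ValueError("the constrained model must have more degrees of freedom")
    return d_chi2, d_df, chi2_sf(max(d_chi2, 0.0), d_df)

# Hypothetical fit statistics for a constrained model and its less constrained parent.
d_chi2, d_df, p_value = chi_square_difference(340.2, 111, 325.0, 102)
```

A p-value above 0.05 would support the invariance hypothesis directly; otherwise the ΔCFI and ΔRMSEA criteria described in the text are consulted.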

Table 4 Measurement invariance tests.

Results

Results of preliminary statistical analysis

Ride-hailing adoption by sociodemographic and travel characteristics is presented in Fig. 2. Respondents were mostly female (54.7%) and aged below 40 (69.7%). More than half were private company workers (51.3%). Monthly household income was skewed towards the low-to-medium segments (64.2% below ¥20,000). The car ownership rate was high among respondents, as 64.4% reported having one. Almost two-thirds (66.3%) reported having used ride-hailing at least three times a week. Usage scenarios were diverse, with commute and business trips significantly more common than the others. In the absence of ride-hailing, most respondents reported that they would choose traditional taxis (41.8%) or public transport (41.5%) as alternative modes, both significantly higher than private cars (16.7%). Most trips were made during 6:30–19:00, while trips during late evening and night periods were substantially fewer. Non-parametric tests of difference were applied to check whether different sociodemographic cohorts displayed significantly different levels of “usage frequency”. As shown in Table 5, users who are female, aged over 55, students or private company workers, from households earning less than ¥10,000 a month, or owners of a private car used ride-hailing less frequently. This trend may stem from multiple barriers: safety concerns (e.g., fear of harassment or assault, particularly among women due to media-reported incidents), limited tech literacy (e.g., older users struggling with app use), cost sensitivity (e.g., students and low-income groups prioritizing affordability over convenience), and unpredictable schedules (e.g., workers relying on cheaper, reliable fixed-route alternatives). Ride-hailing operators (Transportation Network Companies or TNCs such as Didi, Uber) and officials can tailor appropriate tactics to increase the usage frequency of these groups if they want to further augment company revenue, reduce private car ownership and use, reduce parking demand, etc.

Fig. 2: Ride-hailing adoption by sociodemographic and travel characteristics.

It presents the sociodemographic (Gender, Age, ……) and travel (Usage frequency, Usage scenario, ……) characteristics of questionnaire samples.

Table 5 Results of non-parametric tests.

Ride-hailing users in Suzhou appeared relatively satisfied with the service, but it still had a considerable margin for improvement, as 3 out of 12 attributes (i.e., deviation, price markup and discrimination, detour) did not exceed the passing score of 3.00 (Fig. 3). “Deviation” may stem from algorithmic limitations (e.g., outdated traffic data or failure to account for real-time variables like weather) or opaque communication, which users perceive as intentional manipulation. “Price markup and discrimination” may arise from unregulated surge pricing and algorithmic profiling (e.g., charging frequent users higher rates). For “detour”, driver incentives to inflate fares and poor real-time oversight may be key drivers. To improve the overall quality of ride-hailing service, these three weaknesses should be addressed first.

Fig. 3: Attribute ratings.

It displays the ratings of 12 different attributes by ride-hailing users.

Results of exploratory factor analysis and confirmatory factor analysis

Skew (−0.903, 0.540) below 1 and kurtosis (−1.331, 0.736) below 2 in absolute value demonstrated that all items were approximately univariate normal. Mardia’s multivariate kurtosis = 52.821 (critical ratio = 55.128 > 5.0) indicated a deviation from multivariate normality, supporting the use of the bootstrap. The VIFs of all items (1.436, 2.849) were below 5, confirming the absence of multicollinearity. Cronbach’s alpha = 0.755 > 0.7 confirmed internal consistency. With KMO = 0.851 > 0.7 and a significance level of 0.000 < 0.01 for Bartlett’s test, the data were adequate for EFA. The EFA results are summarized in Table 6. The EFA revealed 3 factors (instead of the 5 indicated by grounded theory) with eigenvalues ≥ 1, which were labeled “Service”, “Integrity”, and “Efficiency” and together explained 68.600% of the variance.

Table 6 EFA results.

According to the EFA results, a 3-dimension 12-item measurement model was built in Amos 24 (Fig. 4). It achieved an adequate model fit (χ² = 273.627, df = 51, CFI = 0.972 > 0.9, RMSEA = 0.055 < 0.08, TLI = 0.964 > 0.9, GFI = 0.970 > 0.9). The results of reliability and validity are summarized in Table 7. With a minimum of 0.569 and a maximum of 0.914, all factor loadings exceeded 0.5, and all p-values were below 0.05. Indicator reliability was therefore met. AVE values were all above 0.5, and Cronbach’s alpha (CA) and composite reliability (CR) values were consistently above 0.7, meeting the ideal requirements of convergent validity and internal consistency reliability. Discriminant validity was supported, as all correlations between constructs were smaller than the corresponding square root of AVE. Besides, all variances were positive and significant, and all standardized factor loadings were below 1.0, indicating no offending estimates. Therefore, it is reasonable to take this 3-dimension 12-item measure as the baseline model for the following analysis.
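
As a quick consistency check on the reported fit, RMSEA can be recomputed from the chi-square, the degrees of freedom, and the sample size. The sketch below assumes the common single-group formula RMSEA = sqrt(max(χ² − df, 0) / (df · (N − 1))); software packages may differ slightly in how they treat the sample-size term.

```python
from math import sqrt

def rmsea(chi2, df, n):
    """Single-group RMSEA from chi-square, degrees of freedom, and sample size."""
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n greater than 1")
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Values reported in the text: chi-square = 273.627, df = 51, N = 1464.
value = rmsea(273.627, 51, 1464)
```

Under this assumed formula, the result rounds to 0.055, in line with the RMSEA reported for the 3-dimension 12-item model.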

Fig. 4: 3-dimension 12-item measurement model.

It depicts the 3-dimension 12-item measurement model built in Amos.

Table 7 Reliability and validity results.

Results of multi-group confirmatory factor analysis

The 3-dimension 12-item model was estimated separately in each subgroup. All model fit indexes complied with the cut-off values (Table 8), and all factor loadings were above 0.5 except for the “deviation” item in several subgroups (i.e., students, ¥20,001–50,000, not more than twice a week, business trips, escorting children to and from school, 16:30–19:00, 19:00–6:30), where they were still above 0.45, confirming the suitability of this 3-dimension 12-item model as the baseline model.

Table 8 Model fit results for each subgroup.

The results of measurement invariance tests are presented in Table 9. The columns on the left show that all models, especially the unconstrained models, achieved adequate model fit. Therefore, configural invariance was verified. The comparison results of nested models are presented in the columns on the right. For nested models with p-value > 0.05, the corresponding invariance hypotheses were confirmed. For nested models with p-value < 0.05, all ΔCFI values were below 0.01 in absolute value (except for the nested model 4-2 of usage frequency), indicating that the corresponding invariance hypotheses should be accepted in practice. Although the absolute ΔCFI for the nested model 4-2 of usage frequency was 0.011 > 0.01, the absolute change in RMSEA was 0.003 < 0.015, indicating that the difference was unsubstantial and the null hypothesis thus should not be rejected. To sum up, the factor structure, factor loadings, factor covariances, and factor variances of the 3-dimension 12-item model were invariant across different subgroups of gender, age, occupation, monthly household income, private car ownership, usage frequency, usage scenario, alternative travel mode and trip timing, which further confirmed the validity of this measure.
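
The decision sequence applied to each nested comparison can be summarized in a short sketch. The thresholds (0.05, 0.01, 0.015) are those stated in the text; the function name, the return labels, and the p-value used in the example are hypothetical, since the exact p-value for the usage frequency comparison is not quoted here.

```python
def invariance_decision(p_value, delta_cfi, delta_rmsea,
                        alpha=0.05, cfi_cut=0.01, rmsea_cut=0.015):
    """Classify a nested-model comparison following the sequence described in the text."""
    if p_value > alpha:
        return "invariant"               # chi-square difference not significant
    if abs(delta_cfi) < cfi_cut:
        return "invariant_in_practice"   # significant, but CFI change is negligible
    if abs(delta_rmsea) < rmsea_cut:
        return "invariant_in_practice"   # CFI change not corroborated by RMSEA change
    return "non_invariant"               # both CFI and RMSEA changes exceed cut-offs

# Nested model 4-2 of usage frequency: |dCFI| = 0.011 and |dRMSEA| = 0.003 as reported;
# the p-value of 0.01 is a hypothetical stand-in for "p < 0.05".
decision = invariance_decision(p_value=0.01, delta_cfi=0.011, delta_rmsea=0.003)
```

Written this way, the usage frequency case falls into the branch where the CFI change slightly exceeds its cut-off but the RMSEA change does not, which is the reasoning given for retaining the invariance hypothesis.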

Table 9 Results of measurement invariance tests.

Discussion

Theoretical Implications

The resulting 3-dimension 12-item structure confirms the necessity of developing an industry-specific scale customized to a specific geographical context and verifies the feasibility and validity of the proposed generic approach in customizing a multidimensional scale. As expected, this customized measure is quite different from generic measures (SERVQUAL, SERVPERF, E-S-QUAL), from customized measures for the ride-hailing industry in other geographical contexts (e.g., Nguyen-Phuoc et al., 2021; Kumar et al., 2022; Shah and Hisashi, 2022; Vega-Gonzalo et al., 2023; Wang et al., 2023), and from those for other industries (e.g., Bezerra and Gomes, 2016; Soltanpour et al., 2018; Tiglao et al., 2020). These differences in turn confirm the necessity of developing customized scales. Despite these differences, essential items mentioned in most of the transport service literature, such as crew attire and etiquette and facility cleanliness and conditions, are nonetheless replicated here. All other items (except for “price markup and discrimination”) have identical or similar counterparts in the published ride-hailing, taxi, or other paratransit literature. This striking similarity to previous literature offers strong evidence that the grounded theory approach is effective in obtaining scientifically sound measure items from first-hand knowledge of customers’ thoughts. The joint use of EFA, CFA, and MGCFA proved feasible and valid for further refining, assessing, and testing the scale, as it ultimately yielded a meaningful and valid multidimensional scale represented by Service, Integrity, and Efficiency. “Price markup and discrimination” may be included because this phenomenon is so much more severe in Suzhou than elsewhere that users constantly brought it up in focus group discussions, which is also reflected in its score being the lowest among the 12 items (2.53).
By introducing this item as a critical determinant of perceived quality, “Integrity” emerges as a standalone dimension that fundamentally challenges and extends classical service quality frameworks such as SERVQUAL and SERVPERF. Unlike traditional dimensions (e.g., tangibles, reliability), “Integrity” incorporates algorithmic fairness into the theoretical system, explicitly addressing the socio-technical inequities inherent in digitally mediated services.

Practical implications

Ride-hailing adoption is more prevalent among women and employees of private companies, while men and employees of government and public institutions tend to use it more frequently. Consistent with much previous research (e.g., Clewlow and Mishra, 2017; Lavieri and Bhat, 2019), ride-hailing adopters and frequent users in Suzhou tend to be younger. This may be because younger individuals have more exposure to new technologies, products, and services through their more extensive social networks. Although ride-hailing is often cited as a possible mobility solution for the aging population, older adults’ low adoption and usage frequency suggest that significant hurdles remain. The majority of ride-hailing users in Suzhou are in the low-to-medium income segments (<¥20,000), while usage frequency is lowest in the low-income segment (<¥10,000). Most ride-hailing users own private cars, but these users have a lower usage frequency. These two results suggest that, for the low-income segment and for users who own private cars, ride-hailing may serve more as a convenience for one-off trips than as an accessibility facilitator for routine trips. The former are probably constrained by low income and consumption power, while the latter may rely more on private cars for daily travel. Lavieri and Bhat (2019) reported similar results. To address the uneven ride-hailing adoption, TNCs and urban planners should implement safety-focused features such as real-time SOS alerts and certified “Women-Safe Driver” programs, enhance affordability through subscription plans or income-tiered discounts, and simplify access via lite app versions or SMS-based booking for tech-averse users. Additionally, targeted community engagement, such as workshops demonstrating time-saving benefits for students or partnerships with safety advocates, could rebuild trust and drive adoption.
By systematically addressing safety, cost, and accessibility barriers, TNCs can expand their reach to these underserved segments while promoting inclusive mobility. To address the low scores for deviation, platforms should integrate real-time data (e.g., traffic sensors, weather APIs) and offer in-app explanations for discrepancies (e.g., “30% delay due to congestion”), coupled with automatic refunds for fare deviations exceeding 10%. Solutions for the low score on price markup and discrimination include government-mandated surge caps (at most 1.5 times the base fare), third-party audits to detect biased pricing patterns, and user controls such as “wait-for-lower-fare” options. The detour issue can be mitigated by GPS-based route compliance checks (penalizing deviations of more than 5% without consent) and in-app alerts that notify passengers of route changes, alongside bonus programs for drivers who adhere to recommended paths.
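As a purely illustrative sketch, the three numerical thresholds proposed above (a 10% fare-deviation trigger for refunds, a 1.5x surge cap, and a 5% route-deviation limit) could be expressed as simple rule checks. The function names, the choice to refund the entire overcharge, and the use of trip distance to measure route deviation are assumptions made for this example; they are not taken from any platform's actual implementation.

```python
# Thresholds quoted in the text; everything else below is hypothetical.
FARE_DEVIATION_LIMIT = 0.10   # refund when the fare exceeds the estimate by more than 10%
SURGE_CAP = 1.5               # surge multiplier may not exceed 1.5x the base fare
ROUTE_DEVIATION_LIMIT = 0.05  # flag routes more than 5% longer than planned


def capped_fare(base_fare, surge_multiplier):
    """Apply the surge multiplier, but never more than the mandated cap."""
    return base_fare * min(surge_multiplier, SURGE_CAP)


def refund_due(estimated_fare, actual_fare):
    """Return the refund owed to the passenger.

    Assumption: the whole overcharge is refunded once it exceeds the limit;
    the text does not specify the refund amount.
    """
    if estimated_fare <= 0:
        raise ValueError("estimated_fare must be positive")
    overcharge = actual_fare - estimated_fare
    if overcharge / estimated_fare > FARE_DEVIATION_LIMIT:
        return overcharge
    return 0.0


def route_violation(planned_km, driven_km, passenger_consented):
    """Flag a detour: driven distance exceeds the plan by more than 5% without consent.

    Assumption: deviation is measured as relative extra distance.
    """
    if planned_km <= 0:
        raise ValueError("planned_km must be positive")
    extra = (driven_km - planned_km) / planned_km
    return extra > ROUTE_DEVIATION_LIMIT and not passenger_consented
```

For instance, under these assumptions a trip estimated at ¥20 and charged at ¥23 would trigger a ¥3 refund, whereas a charge of ¥21 would not.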

This 3-dimension 12-item scale offers an effective tool for evaluating ride-hailing service quality in Suzhou. Using this scale, operators can compute the overall service quality score of Suzhou’s ride-hailing services as a weighted average of the item scores. Comparing scores from different time periods shows whether service quality has improved and how effective individual improvement initiatives have been. Comparing the total score with those of other cities reveals the strengths and weaknesses of Suzhou’s ride-hailing service quality at a national level. Moreover, Importance-Performance Analysis (IPA), which combines indicator weights with scores, can be used to determine the priority of each indicator for improvement. This gives operators a sound basis for targeted service optimization.
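The two calculations mentioned above can be sketched as follows. This is a minimal illustration, assuming that item weights serve as the importance measure and that the means of the weights and scores serve as the IPA crosshairs; other crosshair choices are common. The item names, scores, and weights are hypothetical placeholders, not results from this study.

```python
from statistics import mean


def weighted_score(scores, weights):
    """Overall service quality as a weighted average (weights need not sum to 1)."""
    total_weight = sum(weights[item] for item in scores)
    return sum(scores[item] * weights[item] for item in scores) / total_weight


def ipa_quadrants(scores, weights):
    """Assign each item to an IPA quadrant.

    Importance is taken to be the item weight and performance the item score;
    the quadrant boundaries are the mean weight and the mean score (an assumption).
    """
    perf_cut = mean(scores.values())
    imp_cut = mean(weights.values())
    quadrants = {}
    for item in scores:
        high_importance = weights[item] >= imp_cut
        high_performance = scores[item] >= perf_cut
        if high_importance and not high_performance:
            quadrants[item] = "concentrate here"
        elif high_importance and high_performance:
            quadrants[item] = "keep up the good work"
        elif not high_importance and not high_performance:
            quadrants[item] = "low priority"
        else:
            quadrants[item] = "possible overkill"
    return quadrants


# Hypothetical item scores (5-point scale) and weights, for illustration only.
example_scores = {"item_a": 4.0, "item_b": 3.0, "item_c": 2.5, "item_d": 3.5}
example_weights = {"item_a": 0.4, "item_b": 0.1, "item_c": 0.3, "item_d": 0.2}

overall = weighted_score(example_scores, example_weights)
priorities = ipa_quadrants(example_scores, example_weights)
```

Items falling in the "concentrate here" quadrant (high importance, low performance) would be the first candidates for improvement.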

Conclusions and future work

On the theoretical front, this study contributes a generic approach for developing and validating a multidimensional scale that is applicable to any service setting (not limited to ride-hailing in Suzhou). We summarize the approach as follows: (1) collect first-hand knowledge of users’ experiences and perceptions of the service studied through focus group discussions, and extract measure dimensions and items using grounded theory coding techniques; (2) purify and assess the scale against the criteria of model fit, reliability, and validity using EFA and CFA; (3) test measurement invariance across different user cohorts using MGCFA. The first use of grounded theory in scale development offers new insights into the approach. First-hand knowledge gives the derived measure conceptual soundness and ensures that it genuinely reflects customer perceptions. Measurement invariance tests help to further confirm the validity of the measure. On the practical side, this study identifies a 3-dimension 12-item measure for the less researched topic of ride-hailing service quality in the Suzhou context. Additionally, the Suzhou data reveal uneven ride-hailing adoption and usage frequency across sociodemographic cohorts. Together, these findings advance the knowledge of transportation officials, urban planners, and TNC operators about ride-hailing services in Suzhou, which will help improve the services and attract more users.

The current study has some limitations that suggest directions for further research. First, the reliance on self-reported data may introduce biases such as social desirability and memory inaccuracies. Future studies should incorporate objective measures, such as real-time usage data or behavioral analytics, to complement self-reported findings. Second, the influence of culture on perceived service quality has not been adequately addressed. Cross-cultural studies should be conducted to explore how cultural differences shape user expectations and evaluations of ride-hailing services. Finally, the study does not explore dynamic aspects of service quality, such as real-time feedback mechanisms or longitudinal changes in user expectations. Future research should investigate how service quality perceptions evolve over time and how platforms can adapt to these changes.