Introduction

Large language models, equipped with powerful natural language processing capabilities, have demonstrated impressive applications across diverse fields, highlighting their potential as valuable assistive tools (Oermann and Kondziolka, 2023). By transforming text descriptions into visual outputs, these models represent a significant advancement in AI’s capacity to interpret and visualize abstract concepts (Driessen et al. 2024, Jang et al. 2024, Riemer and Peter 2024, Vemprala et al. 2024). Since their inception, large language models have been applied extensively in areas such as education, healthcare, and software development (Xue et al. 2023, Hu et al. 2024, Vemprala et al. 2024). Research shows that large language models can generate illustrative images from text descriptions, aiding human creativity in art and design (Lu et al. 2023). For example, DALL-E, built on a transformer architecture, generates highly detailed images, showcasing AI’s creative potential in design (Ali et al. 2024), while Midjourney enables users to explore imaginative visual scenes (Javan and Mostaghni, 2024). Building on these developments, ChatGPT-4o incorporates multimodal functionality, enabling it to generate images of futuristic urban landscapes from specific text prompts, showcasing unique potential in urban planning and design (Fu, 2024).

ChatGPT-4o’s image generation capability relies on an extensive database and robust computational capacity, enabling it to produce complex future city images from textual instructions. ChatGPT-4o processes large volumes of textual and visual data, with its database covering a broad range of urban design elements (Peng et al. 2023, Caprotti et al. 2024). The high quality of this data directly impacts the model’s ability to generate detailed, accurate images (Driessen et al. 2024). Additionally, supported by deep learning algorithms, ChatGPT-4o’s generation process includes multi-layered language comprehension, image analysis, and cross-modal integration (Wang et al. 2024). Leveraging efficient deep learning algorithms and computational power, ChatGPT-4o rapidly processes and integrates detailed data, converting textual instructions into concrete visualizations of future cities (Cugurullo et al. 2024). This functionality not only enhances creativity in urban planning but also provides visual representations of future cities, supporting human aesthetic evaluation and design feedback.

Although artificial intelligence has made significant strides in technical fields like data analysis and predictive modelling, its potential as a creative tool in urban planning remains underexplored. Current research primarily focuses on AI’s strengths in data processing and efficiency optimization (Oermann and Kondziolka, 2023, Osco et al. 2023, Hu et al. 2024), with limited exploration of its role in creating innovative designs and visualizing future urban landscapes. Specifically, structured methods for evaluating AI-generated city images from a human perspective are lacking. This gap hinders a deeper understanding of AI’s impact on creative design and underscores the need to develop systematic frameworks for analysing and providing feedback on the aesthetic and functional qualities of AI-generated urban designs.

Literature review

Image generation of ChatGPT

Since its inception, ChatGPT has evolved and improved continuously; however, ongoing research remains essential to address its limitations and to ensure effective application across diverse fields. Initially recognized for its text generation capabilities, ChatGPT lacked image generation functionality at launch (Floridi and Chiriatti, 2020). The latest version, ChatGPT-4o (GPT-4 Omni), introduced image generation, marking a significant advancement in large language models (LLMs) and demonstrating enhanced capabilities across language, vision, audio, and multimodal tasks (Zhu et al. 2024). These models have revolutionized AI-generated art and image creation, sparking public interest and discussions regarding their impact on sectors such as the arts (Oermann and Kondziolka, 2023). Despite these advancements, ChatGPT-4o still encounters challenges in processing complex and ambiguous inputs, particularly within its audio and visual functionalities, underscoring the need for richer feedback to drive continued improvements (Hu et al. 2024).

ChatGPT-4o’s image generation capabilities remain in their early stages, yet they hold vast potential for future development. Beyond technical advancements, models like ChatGPT have generated important discussions about their social impacts, particularly on creativity, originality, and productivity. Researchers note that generative AI promotes creativity by providing new perspectives and facilitating idea generation, serving as a catalyst for concepts users might not develop independently (Jang and Kim, 2024). This aligns with the concept of “parallel art”, in which human-AI collaboration produces unique, co-created works (Guo et al. 2023). Additionally, models like ChatGPT have significantly enhanced productivity by streamlining workflows, lowering cognitive load, and enabling users to focus on higher-level tasks (Kim et al. 2024). In this framework, the creative process becomes a collaborative endeavour, completed through human-AI interaction.

Artificial intelligence and urban planning

Artificial intelligence (AI) is increasingly integrated into urban planning, with transformative potential at various stages of the planning process. The introduction of AI-assisted, AI-augmented, AI-automated, and eventually AI-autonomous planning workflows raises questions about the potential impacts and measures required to effectively incorporate AI into urban and regional planning (Peng et al. 2023). For example, AI promotes sustainable urbanization by optimizing resource use and enhancing quality of life through data analysis and predictive modelling (Al-Raeei, 2024). Additionally, Bibri et al. (2024) integrated AI through the GPT-4 large language model and retrieval-augmented generation, facilitating the automatic generation of intuitive cluster descriptions and names. This integration marks the first application of natural language processing in academic studies of geographic demographics.

With the advancement of large language models, generative AIs like ChatGPT now possess powerful natural language processing capabilities and an extensive knowledge base in urban planning, enabling them to create city design outputs based on user prompts (Ali et al. 2024). ChatGPT’s database incorporates extensive expertise in architectural design, regional planning, and sustainable urban development, systematically supporting the generation of content with urban planning depth (Fu, 2024). Recent studies have begun exploring ChatGPT’s applications in urban design assistance, demonstrating its effectiveness in conceptualizing plans and inspiring design ideas (Yu et al. 2024). For instance, ChatGPT has been used to assist in urban design evaluation, offering novel design directions and testing for environmental sustainability (Fu et al. 2024). However, most research to date focuses on ChatGPT’s role in supporting professionals, with limited exploration of its ability to utilize its database and algorithms to generate coherent future city designs in response to prompts from general users. This gap highlights the need for systematic research into whether ChatGPT-generated urban design images accurately reflect its knowledge base breadth and algorithmic responsiveness in meeting non-expert user demands.

Importance-performance analysis

Importance-Performance Analysis (IPA) is a visual decision-making tool using a two-dimensional grid to compare the importance and performance of various attributes, prioritizing specific indicators for improvement (Aicher et al. 2023). In the tourism industry, IPA plots visitors’ pre-trip expectations, post-trip satisfaction, and the importance of each attribute on a grid to guide tour design decisions (Duke and Persia, 1996). In higher education, IPA enhances teaching quality by visually representing which teaching attributes are most important to students and how well instructors perform on these attributes, thus guiding course design and improvement (Cladera, 2021). In public transportation, IPA assesses customer satisfaction by identifying gaps between the importance and performance of service attributes (Esmailpour et al. 2020). These examples illustrate that IPA applies across various fields, systematically identifying and addressing specific indicators to improve overall quality and user satisfaction.

The traditional IPA method plots average importance and performance results of attributes on a chart (Fig. 1a), classifying them into four quadrants: Quadrant 1: “Concentrate Here”, Quadrant 2: “Keep Up the Good Work”, Quadrant 3: “Low Priority”, and Quadrant 4: “Possible Overkill” (Martilla and James, 1977). IPA typically uses an X-Y coordinate graph centred on a scale to display results, with quadrant interpretations in Table 1. The X-axis represents “Performance” (PE), with better performance further right. The Y-axis represents “Importance” (IM), with higher importance higher up the axis (Rašovská et al. 2021). The coordinate plane is divided into four quadrants by horizontal and vertical lines, explaining the relationship between importance and performance. Once attributes are mapped to their quadrants, managers can adjust strategies to balance importance and performance (Boley et al. 2017, Cao et al. 2024).

Fig. 1: Two versions of the IPA rendering.

a Traditional IPA Quadrant Classification (Martilla and James, 1977); b Revised IPA Graphical Representation (Abalo et al. 2007).

Table 1 Interpretation of each quadrant of the crosshair coordinate axis.

While centring the crosshairs on the median of the scale may seem the most transparent way to position the quadrants (Oh, 2001), most attributes usually fall into the “keep up the good work” quadrant because respondents tend to give high ratings for both performance and importance (Phadermrod et al. 2019). This clustering diminishes the value of discussing relative strengths and weaknesses of attributes (Boley et al. 2017). To address clustering and ensure a more dispersed distribution of attributes across the quadrants, we adopted a data-centred approach by positioning the crosshairs at the mean values of the measured importance and performance items (Bekar et al. 2023). This method effectively resolves data clustering, ensuring attributes are more evenly distributed among the quadrants (Bi et al. 2019, Cao et al. 2024).
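The effect of crosshair placement can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function name classify_ipa, the four attribute labels, their ratings, and the 5-point scale with a midpoint of 3.0 are all hypothetical, and the treatment of ties (a value equal to the cut-off counts as high) is an assumption. Quadrant labels follow Martilla and James (1977).

```python
from statistics import mean

def classify_ipa(attributes, crosshair=None):
    """Assign each attribute to an IPA quadrant.

    attributes: dict mapping name -> (performance, importance).
    crosshair:  (performance_cut, importance_cut); if None, the
                data-centred (mean) crosshair is used.
    Ties are treated as "high" (an assumption of this sketch).
    """
    if crosshair is None:
        pe_cut = mean(pe for pe, _ in attributes.values())
        im_cut = mean(im for _, im in attributes.values())
    else:
        pe_cut, im_cut = crosshair

    labels = {}
    for name, (pe, im) in attributes.items():
        high_pe = pe >= pe_cut
        high_im = im >= im_cut
        if high_im and not high_pe:
            labels[name] = "Concentrate Here"
        elif high_im and high_pe:
            labels[name] = "Keep Up the Good Work"
        elif not high_im and not high_pe:
            labels[name] = "Low Priority"
        else:
            labels[name] = "Possible Overkill"
    return labels

# Hypothetical mean ratings (performance, importance) on an assumed 5-point scale.
ratings = {"A": (3.6, 4.4), "B": (4.3, 4.5), "C": (3.5, 3.7), "D": (4.2, 3.8)}

scale_centred = classify_ipa(ratings, crosshair=(3.0, 3.0))  # scale midpoint
data_centred = classify_ipa(ratings)                         # mean of the data
```

With these illustrative values, the scale-centred crosshair places all four attributes in "Keep Up the Good Work", whereas the data-centred crosshair spreads them across the four quadrants, which is the clustering problem described above.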

To enhance the interpretive power of IPA results, a 45-degree upward diagonal line can differentiate areas where performance exceeds importance (PE > IM) from areas where performance falls below importance (PE < IM) (Cladera, 2021, Fan, 2022). This 45-degree diagonal line, known as the Iso-Diagonal Line (Fig. 1b), indicates that all points on this line have equal improvement priority (IM = PE). In the Expectation-Confirmation Paradigm (Oliver, 1980, Miao et al. 2022) and User Experience Design, this implies that participant satisfaction with an attribute is based on the difference between their expectations and performance evaluation of that attribute. Using this line allows gap analysis. If an attribute is above this line (IM > PE), it indicates performance evaluation is lower than expectations, leading to negative disconfirmation, suggesting participants may be dissatisfied. Conversely, if an attribute falls below this line (PE > IM), it indicates performance evaluation exceeds expectations, leading to positive disconfirmation, suggesting participants are likely to be satisfied (Nunkoo et al. 2020).
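The gap analysis implied by the Iso-Diagonal Line can be sketched in a few lines of Python. This is an illustrative sketch only: the function name iso_diagonal_gap and the example ratings are hypothetical, and it assumes importance and performance are measured on the same scale so that the line y = x is meaningful.

```python
def iso_diagonal_gap(performance, importance):
    """Return the gap (PE - IM) and its reading relative to the iso-diagonal.

    A negative gap places the attribute above the line (IM > PE);
    a positive gap places it below the line (PE > IM).
    """
    gap = performance - importance
    if gap < 0:
        reading = "negative disconfirmation"
    elif gap > 0:
        reading = "positive disconfirmation"
    else:
        reading = "on the iso-diagonal"
    return gap, reading

# Hypothetical (performance, importance) ratings for three attributes.
below_expectation = iso_diagonal_gap(3.5, 4.0)
above_expectation = iso_diagonal_gap(4.5, 4.0)
on_line = iso_diagonal_gap(4.0, 4.0)
```

The sign of the gap carries the interpretation; its magnitude can additionally be used to rank attributes by how far they fall from the line.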

The present study

This study is grounded in User Experience Design (UXD), which emphasizes active user involvement to better understand user needs and tasks, thereby enhancing the product’s overall usability and practicality (Mao et al. 2005). This approach, known as User-Centred Design (UCD), is widely acknowledged as an industry best practice (Bullinger et al. 2010). A core principle of UXD is the inclusion of all stakeholders in the design process, a concept grounded in systems theory and participatory design (Chan et al. 2020). Additionally, the concept of service design, closely related to UXD, emphasizes co-creation and a human-centred approach. Incorporating interactive feedback mechanisms can enhance user engagement and foster value co-creation, making the design process more engaging and emotionally appealing to users (Martín-Peña et al. 2024).

ChatGPT-4o is recognized for its efficiency in handling multimodal tasks, including image generation, editing, and image-based dialogue. With simple text prompts, users can perform complex image operations. Wu et al. (2023a) developed a multimodal system that generates images from user text prompts, offering a more natural mode of human-computer interaction. This system enables users to communicate with the model using natural language without needing specialized image processing skills, highlighting its practical potential and frequent citation in studies (Ray, 2023, Liu et al. 2024, Wang et al. 2024). Since its inception, ChatGPT-4o’s database has continuously evolved through global user interactions; however, limited research has examined its autonomous imaginative capability as enabled by its large model algorithm. Existing studies suggest that LLMs serve as intuitive tools for general users, regardless of their technical expertise (Jang and Kim, 2024). Based on this, the present study grants ChatGPT-4o full autonomy in image generation, offering only thematic instructions with no additional creative intervention.

User Experience Design (UXD) and ChatGPT-4o intersect in innovative ways, enhancing iterative applications for designing and analysing human-computer interactions. Through prompt engineering and an advanced function library, ChatGPT-4o adapts to diverse robotic tasks and simulators, enabling users to interact with robots through natural language instructions and thereby enhancing overall user experience (Vemprala et al. 2024). Moreover, ChatGPT-4o’s core technologies (large-scale language models, in-context learning, and reinforcement learning from human feedback) enable it to excel in language comprehension and generation tasks (Wu et al. 2023b). Within Cyber-Physical-Social Systems (CPSS), ChatGPT-4o employs a data-driven analytical approach, treating complex systems as a black box and focusing on the input-output relationships. This approach aids in understanding and enhancing user experience by analysing large datasets and identifying patterns without needing to examine the system’s internal complexities (Xue et al. 2023). This foundation underpins the present study, where subjective user feedback data effectively informs ChatGPT’s ongoing improvements.

Methods

This study uses a mixed-methods research design (Ivankova et al. 2006), integrating qualitative focus group discussions with quantitative public surveys to explore the application of ChatGPT-4o in future city planning. The qualitative phase involved expert focus groups identifying key indicators for evaluating urban design images generated by ChatGPT. Subsequently, a public survey assessed residents’ perceptions of these indicators using Importance-Performance Analysis (IPA). This methodology provides a comprehensive framework for understanding the strengths and weaknesses of AI-generated urban designs, highlighting areas for improvement. The complete framework of the study is shown in Fig. 2.

Fig. 2: Research framework.

All research activities were conducted systematically and in accordance with ethical standards. This included targeted recruitment of potential participants, rigorous screening to select suitable individuals, and subsequent data collection. This process ensured strict compliance with ethical review requirements and maintained the quality and validity of the data.

Participants

The focus group respondents were experts from four key universities in China, specializing in urban planning and art & design. To ensure the sample’s representativeness and relevance, we used purposive sampling techniques. Four selection criteria were set: (1) at least five years of professional experience in their field; (2) involvement as a principal investigator or participant in provincial or higher-level projects within the last five years; (3) publication of at least three high-quality papers in international journals within the past five years; (4) willingness to participate in online discussions for the focus group. Invitation emails were sent to 26 experts using publicly available information from university websites. The emails included a study overview and an informed consent form. Ultimately, nine experts responded positively and signed the consent forms. The experts’ basic information is presented in Table 2.

Table 2 Basic information on focus group interviewees.

Survey participants were selected from four cities in different provinces of China: Wuhan, Jinan, Guangzhou, and Chengdu. Through the alumni platform of the authors’ affiliated universities, we contacted community leaders willing to help reach various neighbourhood WeChat groups in each city. Random sampling was conducted within these groups. A total of 640 questionnaires were distributed, and 427 valid responses were collected. The demographic statistics of the survey respondents are shown in Table 3. The gender distribution was relatively balanced, with 202 male and 225 female participants. The sample spanned multiple age groups and a wide range of educational backgrounds. The largest educational group, with 112 participants, had an undergraduate level of education.

Table 3 Demographics of questionnaire participants.

Stimuli

To explore ChatGPT-4o’s ability to autonomously envision future urban landscapes, we designed an AI-driven image generation process that allowed the model to create futuristic images of Beijing without any predefined constraints. This approach enabled a more precise analysis of how ChatGPT-4o utilizes its internal dataset and computational algorithms to interpret the future of an existing city. As the capital of China, Beijing features a highly recognizable urban landscape, ensuring that individuals have a foundational impression of the city. This makes it an ideal test case for assessing AI-generated future urban designs.

Given that ChatGPT-4o currently restricts image generation to one image per request, we employed a multi-stage independent generation method. Each image was generated separately to ensure that prior outputs did not influence subsequent results. To eliminate potential memory effects, we explicitly instructed ChatGPT-4o to disregard previous interactions and treat each request as an independent task. Multiple pre-trials were conducted to refine and optimize the prompt, ensuring that it effectively guided ChatGPT-4o to generate images aligned with our research objectives. Additionally, we referenced validated methodologies from previous studies to ensure the prompt’s effectiveness (Vemprala et al. 2024).

To mitigate any underlying model biases that may be influenced by session-based training updates, the following standardized prompt was issued across eight independent ChatGPT-4o accounts:

“Please generate an image of Beijing in the future using your internal knowledge, dataset, and algorithms. Do not reference any prior conversations or memory. Create a unique vision of the city’s future as imagined by your model.”

The ChatGPT-4o account holders acted as evaluators of the generated images. After each image was generated, the account holders reviewed the output for any homogenization patterns or significant errors, such as inconsistencies with fundamental urban planning principles (e.g., road collisions, image distortions, or incoherence). Evaluation was conducted following predefined exclusion criteria based on established urban planning principles (Lowe, 2018, Haghani et al. 2023, Oktay, 2023). If such errors were detected, the image was discarded, and a new image was generated. To encourage diversity without introducing human bias, evaluators were instructed to use minimal refinement prompts:

“Please generate another image of Beijing in the future, depicting a different perspective within the city. Ensure that this vision presents a distinct viewpoint while still being an autonomous creation based on your internal knowledge and dataset.”

The evaluators continued generating images until all 10 images in each set were reviewed. This refined prompt aimed to capture diverse urban depictions of future Beijing while still allowing ChatGPT-4o to autonomously create imaginative urban designs. Ultimately, the AI-generated dataset comprised 80 images across 8 sets.

To ensure randomness and representativeness in the experiment, one set of images was randomly selected from the eight generated sets for analysis. To eliminate potential order effects, the 10 images were presented in a randomized sequence within the questionnaire. This procedure ensured the randomness of image selection and the scientific validity of the results. The final sampled image set is shown in Fig. 3.
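The selection and ordering steps described above can be sketched as follows. This is a minimal illustration rather than the authors' actual procedure: the function name select_and_order and the seed value are hypothetical, and the sketch assumes a single random draw of one set followed by one shuffle of its ten images, since the text does not state whether the order was re-randomized for each respondent.

```python
import random

def select_and_order(n_sets=8, images_per_set=10, seed=None):
    """Pick one image set at random and return a shuffled presentation order.

    Returns (set_index, order), both 0-based. Passing a seed makes the
    draw reproducible; the seed is an assumption of this sketch.
    """
    rng = random.Random(seed)
    set_index = rng.randrange(n_sets)       # uniform choice among the sets
    order = list(range(images_per_set))     # image positions within the set
    rng.shuffle(order)                      # in-place random permutation
    return set_index, order

chosen_set, presentation_order = select_and_order(seed=2024)
```

Recording the seed alongside the chosen set would allow the same selection and order to be regenerated later.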

Fig. 3: Sample images generated by ChatGPT-4o.

Instrument

In the initial phase of this study, we used focus groups to identify specific criteria for evaluating future urban planning images. A focus group is a qualitative research method that collects participants’ views and feedback through group discussions (Morgan, 1996). This technique is well-suited for exploring emerging fields, allowing an in-depth understanding of participants’ genuine thoughts and feelings (Rabiee, 2004). Due to geographical constraints and scheduling considerations, our focus group discussions were conducted online. Each participant signed an informed consent form before the discussion, indicating their understanding and agreement to participate. The first author served as the moderator, guiding the two-hour discussions to ensure each participant could freely express their thoughts. The moderator used a semi-structured interview guide to maintain an organized flow and deeply explore topics. Special attention was given to question wording to ensure they were precise and inclusive, encouraging participants to freely express diverse viewpoints (Nyumba et al. 2018).

In the second phase, we developed an IPA questionnaire based on the focus group data analysis results (Boley et al. 2017) for data collection. The questionnaire design adhered to informed consent principles, ensuring participants were fully aware of the study’s content and purpose before completing it. The questionnaire was divided into three main sections, with a total of 19 items, to comprehensively collect participants’ basic information and their subjective evaluations of each criterion. Specifically, the first part of the questionnaire showed ten randomly arranged ChatGPT-generated images. The second part, consisting of three items, collected basic participant information: gender, age, and educational background. The third part, consisting of 16 items, measured the indicators deemed feasible by experts and was presented in a randomized order.

A pilot study was conducted with 54 residents. The reliability analysis showed a Cronbach’s α coefficient of 0.835, indicating good reliability. Additionally, the KMO value was 0.855, and Bartlett’s test of sphericity was significant (p < 0.001), demonstrating good data validity (Aharonovich et al. 2017). Based on the pilot results, we discussed ambiguous statements with professors in the field and made minor revisions to the questionnaire to ensure more reliable statistical results in subsequent data collection.
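For readers unfamiliar with the reliability coefficient reported above, Cronbach’s α can be computed as α = k/(k − 1) × (1 − Σ item variances / variance of total scores), where k is the number of items. The following Python sketch illustrates the formula only; the study itself used SPSS, and the function name cronbach_alpha and the small response matrix are hypothetical.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Compute Cronbach's alpha.

    responses: list of rows, one per respondent, each a list of k item scores.
    Uses sample variances (n - 1 denominator).
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")

    # Variance of each item across respondents.
    item_vars = [
        variance([float(row[i]) for row in responses]) for i in range(n_items)
    ]
    # Variance of each respondent's total score.
    total_var = variance([float(sum(row)) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: three respondents, two items.
alpha_example = cronbach_alpha([[1, 2], [2, 3], [3, 5]])
```

For this toy matrix the item variances are 1 and 7/3 and the variance of the totals is 19/3, so α = 2 × (1 − 10/19) = 18/19, or about 0.947.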

Data collection procedure

To ensure effective and reliable data collection, we designed two consecutive focus group meetings and implemented rigorous steps for smooth conduct. In the first meeting, after all respondents entered the online meeting room, the moderator introduced the participants and stated the discussion topic. Subsequently, each respondent, in random order, shared their views on future urban planning designs and was asked to identify the specific criteria they understood. This phase lasted 80 min. During the 20-min break, the moderator summarized the criteria collected during the first phase.

In the second meeting, respondents re-entered the online meeting room. The moderator asked the art and design experts to evaluate whether these criteria could be identified in the images based on their expertise and to explain their reasoning. Following this, urban planning experts discussed the screening results and the reasons provided, ultimately reaching a consensus. The moderator meticulously documented the entire process. After the meeting, the recorded content was transcribed and sent to the respondents via email for proofreading and confirmation, ensuring data accuracy and completeness.

After completing the qualitative data analysis and designing the questionnaire, we began collecting data. First, we contacted the heads of the residents’ committees in four targeted communities. Through communication, we gained their support, ensuring they understood the research purpose and process. Based on transparency and mutual trust, we agreed to pay each committee head 50 CNY for assisting with the study. This compensation acknowledged their time and effort and facilitated the smooth progress of the research.

With the community heads’ assistance, we provided potential participants with detailed information about the research purpose, significance, and their role, ensuring each participant signed an informed consent form after fully understanding the study. This process ensured respect for participants’ rights and adherence to informed consent principles, upholding ethical standards. To increase participation rates and respect participants’ time and contributions, we offered each participant 3 CNY for completing the questionnaire. This compensation mechanism improved the response rate, ensuring the quality and representativeness of the collected data.

Data analysis

This study employed template analysis for the focus group interview data (Brooks et al. 2015, Cao et al. 2025). During discussions in both meetings, each expert provided specific criteria after sharing their views. In the second meeting, all participants discussed these criteria in depth and reached a consensus. During the focus group discussions, the moderator and experts processed the information provided by the respondents in real-time. After the discussions, the moderator and experts immediately summarized and presented preliminary results (i.e., the template). This immediate feedback helped validate and confirm participants’ viewpoints, ensuring the accuracy and completeness of the data (Cohen et al. 2006). In the post-transcription analysis phase, each author meticulously reviewed the original meeting content and compared the qualitative data with the preliminary template (King, 2012). This comparative analysis found no new criteria beyond the preliminary template, suggesting data saturation and determining the final number of criteria. The entire analysis process was meticulously documented and presented in a written report to ensure research transparency and result reliability.

To accurately explore residents’ attitudes, we used statistical methods for data analysis. In the pilot survey phase, we used SPSS to conduct reliability and validity tests on the questionnaire content. Subsequently, we used Importance-Performance Analysis (IPA) to investigate the data for each criterion. We adopted a hybrid crosshair placement method, combining mean-centred crosshairs and a 45° diagonal line approach. This involved superimposing a 45° upward diagonal line (y = x) on the traditional median-centred axes to distinguish areas where performance exceeds importance (PE > IM) from areas where performance falls below importance (PE < IM) (Deng and Pierskalla, 2018). Although the discussion results were analysed based on both the data-centred regions and the satisfaction intervals defined by the Iso-Diagonal Line, all three auxiliary lines (median-centred, mean-centred, and Iso-Diagonal Line) were visible on each IPA chart. This visibility demonstrated decisions based on the placement of the crosshairs. Compared to a single method, this hybrid approach provided richer and more nuanced findings. Through the aforementioned data analysis methods, we comprehensively examined the residents’ subjective evaluations of ChatGPT-generated images.

Results

Non-identifiable indicators

During the focus group discussion, although the five indicators—social harmony, economic feasibility, residential comfort, cultural representation, and functionality—were considered important factors in future urban planning and design, experts in the arts recommended their exclusion. After hearing the rationale, urban planning experts also agreed. The excluded indicators and their reasons are presented in Table 4.

Table 4 Indicators considered non-identifiable with reasons.

Identifiable Indicators

Based on focus group discussions and comprehensive data analysis, no new feasible indicators were identified (Fig. 4). The following eight indicators—creativity, traffic rationality, design coherence, environmental greening, public space utilization, technological sense, visual quality, and cultural representation—were recognized as identifiable in the stimulus images and received unanimous support from all experts.

Fig. 4: Indicators considered identifiable.

Creativity is regarded as a key element in future city design, embodying imagination. In this study, creativity specifically refers to the originality and innovation of AI-generated urban concepts, rather than variations among individual images. Creativity embodies novelty and serves as the driving force that distinguishes future cities from current designs, propelling urban development. The degree to which a future city image demonstrates unique, innovative features is essential in assessing its creativity. “A creatively designed city image should spark curiosity and imagination in viewers, allowing them to see the limitless possibilities of future cities” (IP 3). Participants unanimously supported this perspective, agreeing that innovation and uniqueness are core evaluation criteria. Creativity is evident not only in grand architectural structures and urban layouts but also in innovative applications of public art, green spaces, and infrastructure.

Rational transportation system design is fundamental to the functioning of future cities. “A well-designed transportation system can effectively alleviate traffic congestion and improve travel efficiency. Showcasing the transportation system through images allows one to intuitively see the rationality and convenience of urban traffic planning” (IP 7). Transportation system design directly impacts citizens’ quality of life, as effective planning reduces commute times and enhances travel comfort and efficiency. “An efficient transportation system is not only the lifeblood of city operations but also crucial to the quality of life for every citizen. Optimizing traffic design can significantly improve the travel experience of citizens” (IP 4). As future cities confront the challenges of high population density and urban expansion, effective transportation planning is essential to support sustainable development and ensure traffic safety.

The importance of design coherence lies in its ability to foster a harmonious urban environment through consistent visual language and design elements. Experts agreed that evaluating design coherence depends on the degree to which urban planning elements in the images are consistent and coordinated, forming a cohesive whole. Design coherence enhances not only the aesthetic appeal of a city but also the systematic and coordinated nature of the planning process. “Design unity can convey a sense of harmony, making people feel the integrity and coherence of the city” (IP 4). This cohesive design style is reflected in architecture, street layouts, and public space planning, which not only elevates the city’s visual appeal but also strengthens residents’ sense of belonging and identity.

Environmental greening focuses on whether urban design emphasizes environmental protection and sustainable development, with the rational distribution of greenery as its core element. Experts believe that future cities must prioritize environmental protection and sustainable development. “Environmental greening is not only a component of urban aesthetics but also a necessary condition for achieving sustainable development” (IP 2). Another expert explained, “Through images showcasing green layouts, one can intuitively see the city’s efforts in environmental protection and green space distribution” (IP 6). Greening enhances the city’s aesthetic appeal and positively impacts the psychological and physical health of urban residents. A sustainable environment can significantly improve the city’s climate, creating a more liveable environment.

Experts discussed the multifaceted role of rational public space utilization in urban life. The design of public spaces like parks and squares is assessed based on whether they provide good recreational venues for citizens. Experts believe rational use of public spaces can enhance residents’ quality of life. “By showcasing public space designs in images, one can observe the planning and utilization of public resources in the city” (IP 2). Experts also pointed out that public spaces in future cities should have diverse functions to meet the needs of different groups.

Technological integration involves both individual technological elements and the overall intelligent layout and system integration. Experts agreed that the presence and advancement of technological elements in images (such as smart transportation and intelligent buildings) directly reflect the city’s technological level and potential for future development. “Technological integration is a hallmark of future cities and a key measure of a city’s ability to sustain development in the future” (IP 1). Experts also discussed how technological integration can be communicated through images. Advanced building materials, innovative architectural structures, and efficient energy management systems effectively showcase technological integration and can be visually represented in images, enhancing viewers’ perception of the high-tech level of future cities.

Experts agreed that when geographic location is specified in AI-generated images, cultural representation should be considered an independent evaluation criterion, especially as AI continues to integrate cultural elements into image generation. Initially, some experts questioned whether AI-generated images could effectively convey cultural elements, arguing that cultural identity is typically expressed through historical narratives, traditions, and local contexts, features that might be difficult to capture in static images. However, others pointed out that architectural styles, iconic landmarks, and urban aesthetics serve as powerful visual representations of a city’s cultural identity. An urban design expert stated, “Once a specific city is defined, visual elements such as architectural forms and spatial layouts can strongly convey the essence of its culture” (IP 5). In agreement, an art scholar added, “Culture is not static; it evolves over time. Even when depicting future cityscapes, AI-generated images should reflect cultural continuity” (IP 7). Through discussion, the experts reached a consensus that although AI-generated images may not capture all dimensions of cultural characteristics, cultural representation remains a recognizable and assessable visual attribute in urban imagery.

The visual quality of urban planning designs in the images was intensely debated among the experts. Initially, opinions diverged regarding visual quality and design unity, with some experts arguing that visual aesthetics are inherently linked to design unity. However, art design experts offered a different perspective, suggesting that diversity and richness in design could equally enhance visual aesthetics. “Visual quality is not solely about design unity; diversity and richness can also bring a unique beauty”, noted one art expert. “Therefore, we should consider visual quality as a separate criterion” (IP 8). Eventually, the urban planning experts agreed to treat visual quality as an independent indicator. Visual quality directly affects people’s first impressions and overall perception of a city. “High-quality visual design can enhance the city’s beauty, making people feel visually pleased and satisfied” (IP 8). Visual quality involves aesthetic considerations of architecture and landscape design and the comprehensive use of colours, materials, and lighting effects, making it an especially important criterion.

Results of importance-performance analysis

Table 5 provides descriptive statistics offering a comprehensive overview of the performance (PE) and importance (IM) scores for the eight indicators (A1 to A8). The table includes the mean scores for PE and IM, their respective rankings, the mean deviation between PE and IM, and the corresponding t-values and p-values for each attribute. The performance of A6 (Technological Sense) was rated the best (PE = 3.67), while A1 (Creativity) was considered the most important (IM = 3.79).

Table 5 Descriptive Statistics.
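
The per-indicator gap and t-value reported in Table 5 can be illustrated with a minimal sketch. It assumes a paired-samples design (each respondent rates both the performance and the importance of an indicator) and defines the gap as PE minus IM; the text does not state either choice explicitly. The function name `paired_t` and the ratings below are hypothetical and are not the study's data.

```python
import math
from statistics import mean, stdev


def paired_t(performance, importance):
    """Return (mean gap, t statistic, degrees of freedom) for PE - IM.

    Assumes both lists hold ratings from the same respondents, in the
    same order (a paired design; this is an assumption, not stated in
    the source).
    """
    if len(performance) != len(importance):
        raise ValueError("each respondent needs both a PE and an IM rating")
    diffs = [pe - im for pe, im in zip(performance, importance)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two respondents are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    gap = mean(diffs)
    t_stat = gap / (sd / math.sqrt(n))
    return gap, t_stat, n - 1


# Hypothetical 5-point ratings from five respondents for one indicator.
pe_ratings = [4, 3, 5, 4, 3]
im_ratings = [5, 4, 4, 5, 5]
gap, t_stat, df = paired_t(pe_ratings, im_ratings)
print(f"gap = {gap:.2f}, t = {t_stat:.3f}, df = {df}")
```

The p-value follows from the t distribution with `df` degrees of freedom. The Python standard library does not provide that distribution, so in practice a statistics package would be used for this step, for example `scipy.stats.ttest_rel`, which returns both the statistic and the p-value for paired samples.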

Based on the descriptive statistical results, we constructed an IPA matrix. As shown in Fig. 5, A1 (Creativity) is in Quadrant 1, while A6 (Technological Sense) and A7 (Visual Quality) are in Quadrant 2. A2 (Traffic Rationality), A4 (Environmental Greening), A5 (Public Space Utilization), and A8 (Cultural Representation) are in Quadrant 3, while A3 (Design Coherence) is in Quadrant 4. According to the Iso-Diagonal Line, A1 (Creativity) is perceived by residents as more important than its performance (IM > PE), indicating a need for improvement in this area. This suggests that residents are dissatisfied with A1. The other indicators are below this line (IM < PE), indicating their performance meets or exceeds their importance.

Fig. 5

Presentation of IPA results for 8 indicators.
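
The quadrant assignment and the Iso-Diagonal comparison described above can be sketched as follows. The quadrant numbering is inferred from the text (Quadrant 1: high importance, low performance; Quadrant 2: high on both; Quadrant 3: low on both; Quadrant 4: low importance, high performance). Placing the crosshair at the grand means of PE and IM is an assumption, since the text does not state how the quadrant boundaries were set. The indicator labels and scores are hypothetical and are not the values in Table 5.

```python
from statistics import mean


def ipa_quadrant(pe, im, pe_cut, im_cut):
    """Return the IPA quadrant (1-4) for one indicator.

    Numbering follows the convention inferred from the text, not a
    universal standard.
    """
    high_pe = pe >= pe_cut
    high_im = im >= im_cut
    if high_im and not high_pe:
        return 1  # important but underperforming
    if high_im and high_pe:
        return 2  # important and performing well
    if not high_im and not high_pe:
        return 3  # low importance, low performance
    return 4      # low importance, high performance


def classify(scores):
    """Map each indicator to its quadrant and its side of the diagonal.

    `scores` maps an indicator label to a (PE mean, IM mean) pair.
    The crosshair is placed at the grand means (an assumption).
    """
    pe_cut = mean(pe for pe, _ in scores.values())
    im_cut = mean(im for _, im in scores.values())
    return {
        label: {
            "quadrant": ipa_quadrant(pe, im, pe_cut, im_cut),
            # Above the iso-diagonal means importance exceeds performance.
            "above_diagonal": im > pe,
        }
        for label, (pe, im) in scores.items()
    }


# Hypothetical (PE, IM) means for four illustrative indicators.
example_scores = {
    "X1": (3.2, 3.9),
    "X2": (3.8, 3.7),
    "X3": (3.1, 3.0),
    "X4": (3.7, 3.2),
}
for label, result in classify(example_scores).items():
    print(label, result)
```

In this illustration only X1 lies above the diagonal (IM > PE), mirroring the pattern reported for A1, while the other three fall below it. A different crosshair choice, such as the scale midpoint, could move indicators between quadrants without changing their position relative to the diagonal.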

Discussion

In the focus group phase, we identified eight indicators that can be recognized in images and four that are challenging to identify. While social harmony, economic feasibility, residential comfort, and functionality were considered difficult to capture in images, these indicators nonetheless reflect experts’ visions and expectations for future urban planning design (Winkler, 2012, Li et al. 2020). Although hard to depict through static images, these indicators remain significant for overall future urban planning, highlighting the complexity and multifaceted nature of creating sustainable, liveable, and culturally rich urban environments (Mao et al. 2020). This discussion underscores the need for continued innovation in urban evaluation techniques to adequately capture the dimensions of future urban planning. Such advancements would ensure a comprehensive understanding and representation of elements that contribute to effective and forward-thinking urban design.

Among the eight indicators identifiable in images, experts emphasized that creativity is key to future urban design, while traffic rationality and environmental greening are also highly valued. Galdini and De Nardis (2023) highlighted the role of creative and innovative design in fostering vibrant, sustainable urban environments. Our findings support the view that creativity differentiates future cityscapes from existing ones and signifies urban development and progress (Lee and Chung, 2024). Furthermore, urban residents often express dissatisfaction with existing traffic conditions and green spaces (Benoliel et al. 2021). Car-centric urban planning has left little room for alternative transportation methods like walking and cycling, hindering sustainable urban living. Additionally, the lack of accessible green spaces has led to 76% of residents feeling dissatisfied with urban greenery availability (Psara et al. 2023). These discussions remind planners that addressing these aspects can enhance residents’ quality of life and contribute to a more sustainable urban environment.

The importance of public space utilization and technological integration in future urban planning cannot be overstated, as they enhance inclusivity and efficiency. Contemporary urban landscapes face challenges like the fragmentation and commercialization of public spaces, where these areas are often controlled and privatized. This leads to tensions between different usage practices and movements advocating alternative models of public space utilization (Mela, 2014). In the future, rational use of public spaces can foster more social interactions among citizens, creating friendly environments and enhancing neighbourhood inclusivity to meet diverse community needs (Lau et al. 2021). Additionally, technological advancements are transformative for urban planning, enhancing city services, productivity, and cost-effectiveness (Kumar et al. 2024). However, our study focuses more on the public’s imagination regarding new and unknown technologies, a relatively underexplored discussion in current urban planning research. Anticipating future technologies and their social impacts is crucial for technology assessment and responsible research and innovation. Engaging stakeholders and the public in this process is valuable for understanding and addressing their concerns and expectations (Decker et al. 2017).

Design coherence and visual quality are key aesthetic concerns in urban planning. Our findings on design coherence align with Caliskan and Mashhoodi (2017), who advocate for the visual organization and legibility of urban spaces. Our study extends this view, suggesting that a cohesive design language enhances a city’s visual appeal and navigability. However, experts in art and design pointed out that visual quality is not solely related to uniformity; factors like diversity, uniqueness, and typicality can also lead to varying degrees of aesthetic appreciation (Blijlevens et al. 2017). A cohesive design language ensures that the various city components work together seamlessly, promoting order and ease of movement. Strategically incorporating diverse and unique elements can prevent visual monotony and enhance the overall aesthetic richness of the urban environment (Salama, 2017).

Based on the IPA results from the public survey, residents provided varying perspectives on different evaluation indicators of the ChatGPT-generated images. A1 (Creativity) was deemed most important as it drives the creation of well-planned communities and inclusive public spaces, shaping interactions among various stakeholders such as citizens, businesses, and the state, fundamentally influencing the planning process (Vidyarthi, 2022). In the quadrant division, A1 was the only indicator in Quadrant 1, indicating that it requires significant investment for improvement. Its position above the Iso-Diagonal Line suggests that ChatGPT’s creativity does not currently meet the satisfaction levels of most respondents, highlighting a gap in the AI’s innovative capabilities. A6 (Technological Sense) received high performance recognition from participants, likely because ChatGPT has absorbed vast amounts of open information through interactions with global users. This extensive knowledge base within the language model allows it to make intelligent predictions about the future (Guo et al. 2023). Both A6 and A7 (Visual Quality) were recognized as crucial and core aspects of ChatGPT’s future urban planning creations, suggesting that these indicators should be maintained and further enhanced.

Surprisingly, A2 (Traffic Rationality), A4 (Environmental Greening), A5 (Public Space Utilization) and A8 (Cultural Representation) were in Quadrant 3. These indicators were generally not deemed important by participants and did not perform well, indicating that they are not core areas for ChatGPT to focus on in future urban planning designs. This could be because participants expect ChatGPT-4o, as the latest version of an intelligent system, to emphasize other, more imaginative indicators. This aspect may require further exploration in future studies. Notably, A3 (Design Coherence) was in Quadrant 4. Although participants did not consider this aspect particularly important, ChatGPT performed well in this area, possibly achieving unexpected visual coherence. Additionally, all indicators from A2 to A8 were below the Iso-Diagonal Line. This demonstrates that even underappreciated design elements can significantly enhance the overall visual quality of urban design images and resident satisfaction through innovative applications and diverse presentations. The performance of these indicators suggests that while ChatGPT has shown potential, there is room for improvement. The insights gained from this analysis provide a concrete reference for further developing and refining ChatGPT’s capabilities in urban planning.

Conclusion

This study explored the application of ChatGPT-4o in future urban planning by identifying and evaluating key indicators through focus groups and public surveys. Eight indicators (creativity, traffic rationality, design coherence, environmental greening, public space utilization, technological sense, visual quality, and cultural representation) were analysed to assess their performance and importance as perceived by residents. The IPA results highlighted creativity as the most important indicator needing improvement, while technological sense was highly appreciated. Despite some indicators being less prioritized, the potential for enhancing the overall visual quality of urban design images through innovative approaches was evident.

This study serves as a foundational step toward harnessing AI’s potential to address the complex and multifaceted challenges of future city development. It makes a significant contribution to urban planning and AI by demonstrating how advanced AI models, such as ChatGPT-4o, can generate and evaluate futuristic city designs. The study explores AI’s creative potential in urban visualization and highlights its ability to interact with human-centred evaluation frameworks. Using a robust mixed-methods approach that combines qualitative insights from expert focus groups with quantitative data from public surveys, this research offers a nuanced understanding of residents’ preferences, expectations, and priorities in urban design. Additionally, the methodology introduced in this study is scalable and replicable, enabling its application to other AI tools and contexts and advancing the discourse on sustainable, human-centred city planning.

Although we tried to minimize bias and maximize applicability when designing this study, there are still some limitations. First, relying on static images restricts the ability to capture dynamic interactions and evolving relationships, such as those associated with social harmony and functionality. These indicators often demand longitudinal or interactive assessments, which static representations cannot adequately capture. Additionally, although the focus group comprised experts from diverse fields, their shared cultural and regional background in China may have influenced their perspectives on selecting evaluation indicators. Furthermore, this study exclusively focused on ChatGPT-4o, an advanced language model, as the basis for generating and evaluating urban design images. While this choice allowed a detailed examination of a single model’s capabilities, it represents just one approach to AI-driven urban visualization. This study standardized the AI-generated image process, but some human intervention was unavoidable. Despite minimizing researcher influence through consistent prompts and limited iterations, subjective decisions—such as selecting final images and identifying generation errors (e.g., distortions, unrealistic layouts)—still required human judgement.

As an initial exploration, this study offers valuable directions for future in-depth research. Future studies could investigate integrating dynamic simulations or virtual reality environments to better capture the complexities of urban planning. To address potential biases, future research should involve experts from more diverse cultural and regional backgrounds. This approach would enhance indicator selection and ensure a more globally representative framework for evaluating AI-generated urban designs. Additionally, future research could compare the outputs and performance of various generative AI models, such as Midjourney and DALL-E, to highlight differences in design coherence, cultural adaptability, and creative depth. The limitation concerning human intervention suggests that future studies could explore more automated or objective methods for evaluating AI-generated content, such as using computational metrics or larger-scale crowd evaluations.