Introduction

In recent years, QR code technology has been rapidly integrated into numerous industries, serving as a powerful tool for data collection and consumer behavior analysis, and the tobacco industry is no exception. QR codes are widely used for anti-counterfeiting, traceability, and consumer engagement. In the retail sector, for instance, they are used to verify product authenticity and trace origins, particularly for food and beverages, pharmaceuticals, and luxury goods. Beyond these functional uses, QR codes have become a gateway for brands to interact with consumers. After purchasing cigarettes, consumers can scan the QR code on the outer packaging to access brand-specific platforms that offer interactive experiences, loyalty programs, and personalized content. Well-designed user interfaces (UI) further enhance consumer engagement and strengthen brand perception. Despite these advancements, the potential of QR code data to drive actionable business insights, particularly when combined with spatial and temporal analysis, remains underexplored.

At the same time, the widespread adoption of Point of Interest (POI) data has significantly enhanced the depth and precision of data analysis. POI data provides geographical context for consumer scanning behavior, enabling businesses to better understand activity patterns and spatial preferences. By combining QR code scanning data with POI data, companies can gain a more comprehensive understanding of consumer behavior across different temporal and spatial contexts. On the one hand, the results can help tobacco companies adjust the quantity and timing of resource allocation and clarify product positioning and strategy; on the other hand, they can help identify smoking behaviors, especially smoking in places where it is prohibited. This study addresses the gap in existing work by exploring how QR code scanning behavior, when integrated with POI data, can provide actionable insights into consumer preferences and behavior patterns.

We used 134,292 QR code scan records from a tobacco company in 2023, combining scan data from 25 different product brands with POI data to analyze consumer preferences across both temporal and spatial dimensions. Our analysis reveals significant differences in the temporal and spatial distribution of consumer behavior across product brands. For example, some brands exhibited a concentration of scan activity during midday, while others showed higher activity during nighttime. Some brands are predominantly scanned in residential and institutional areas, while others are more frequently associated with food and beverage establishments. To systematically categorize these multi-dimensional features, we use principal component analysis (PCA) to extract the most representative temporal feature variables and combine them with K-Means clustering to classify product brands into distinct categories. This temporal and spatial analysis framework uncovers fundamental patterns of consumer behavior. On the one hand, the analyses reveal the consumption characteristics of tobacco consumers and can provide actionable insights for businesses to optimize marketing strategies and resource allocation. On the other hand, the patterns are relevant to tobacco behavior identification: they can help stakeholders better understand the characteristics of tobacco-related activities in different time periods and locations.

In Sect. “Literature review”, we review and discuss relevant literature. Section “Data” details the data processing methodology. Section “Result analysis” presents the results of our analysis, focusing on temporal preferences in scanning behavior, POI location preferences, and the spatial clustering of scan hotspots. Section “Principal component analysis and cluster analysis” discusses the application of principal component analysis and K-Means clustering to classify product brands. Finally, Sect. “Conclusion and discussion” summarizes the findings and discusses their implications for policymakers and the tobacco company, along with potential directions for future research.

Literature review

Research on QR code

QR codes were originally designed for tracking automotive parts, but they have quickly been adopted in many other sectors that traditionally used barcodes, such as manufacturing, retail, healthcare, and transportation. In addition, QR codes have found valuable applications in novel fields, including mobile marketing, online advertising, electronic ticketing/coupons, electronic payments, identity verification, academia [20], information security, and OMR form tamper detection [18].

Despite the extensive application of QR codes across diverse industries, research on their use in the tobacco sector remains relatively scarce. Ref. [17] used data from the Population Assessment for Tobacco and Health (PATH) to investigate adolescent engagement in online marketing, revealing that only a minor fraction (2.9%) of adolescents had encountered online marketing activities involving tobacco product QR codes. Ref. [21] analyzed websites linked to QR codes on cigarette packaging collected from China in 2017, finding that the majority of these websites (78.3%) were brand-specific and lacked adequate restrictions or health warnings (only 13% of websites implemented age restrictions). Given the current paucity of research on tobacco QR codes, our objective is to perform comprehensive data analysis on tobacco QR codes, integrating them with POI data to provide a more nuanced understanding of consumer preferences and consumption patterns across both temporal and spatial dimensions.

Related research based on POI (Points of Interest) data

Points of Interest (POI) are defined as specific geographic locations or places characterized by distinct functional attributes, such as retail stores, restaurants, parks, and other notable sites. POI data has been widely applied in multiple research fields, such as tourism recommendation and itinerary planning [14], urban land use analysis [13], mental health assessment [12], and urban functional zone identification [24]. Inspired by this literature, we combine QR code scan data from the tobacco industry with POI data to analyze consumer preferences in space and time. By identifying patterns of scanning behavior at different POI types, we aim to explore consumers’ preferences for specific geographic locations, providing precise guidance for marketing strategies in the tobacco industry and expanding the application value of POI data in the commercial field. Combining POI data with other data sources for cross-domain research can further explore its potential and drive academic and practical innovations.

Empirical research on tobacco

With the development of the global tobacco industry, the availability of tobacco-related data has improved, prompting scholars to conduct a series of empirical studies on tobacco sales and tobacco consumer behaviors.

Empirical research on tobacco sale

Policy changes are a crucial factor influencing tobacco and e-cigarette consumption. Ref. [11] explored the impact of different taxation policies on cigarette purchases, highlighting that taxes on premium cigarettes and bans on cigarette packaging sales have the most significant effect on demand. This provides valuable insights for the design of tax policies. Anti-marketing policies also have a considerable influence on tobacco product sales. Refs. [22, 23] examined the effects of taxes, smoking bans, and anti-smoking advertisements on cigarette consumption, finding that tax policies and anti-smoking ads are effective tools for reducing cigarette sales, while smoke-free policies had a less pronounced effect. Ref. [10] studied the impact of the menthol ban on cigarette sales, discovering that policy changes can lead consumers to shift towards non-menthol cigarettes, and even engage in cross-state purchases.

Empirical research on tobacco consumer behavior

Many scholars have studied cigarette consumer behavior from the perspective of addiction [2, 3]. As taxes increase, smokers tend to adjust the quantity or type of cigarettes they consume, such as switching to cigarettes with higher tar and nicotine content [8, 9], or adjusting the nicotine intake per cigarette [1]. With the rise of the e-cigarette market, research on its regulation has also been growing. Ref. [4] used a dynamic structural model and found that e-cigarette consumption increases smoking volume. Ref. [6] compared different addiction models and found that e-cigarette consumption exhibits characteristics of rational addiction, with the long-term price elasticity of e-cigarette demand being greater than the short-term elasticity.

Many scholars have also studied the dangers of smoking. Ref. [7] was the first to use the concept of Population Attributable Risk (PAR) to measure smoking-related harm. However, several scholars have questioned this estimation method, arguing that it overlooks the effects of environmental pollution, occupational exposure, and lifestyle factors [19], leading to an overestimation of PAR. E-cigarette use also has health implications, with studies showing that adolescents and young adults using e-cigarettes face health issues [5] and are more prone to nicotine addiction [16]. In addition to its harm to consumers, the strict regulation of cigarettes often leads to gray market behaviors, such as smuggling [15].

Data

Scan code data

After consumers purchase cigarettes, they can scan the QR code on the cigarette box, and each scan is recorded. To make the scan data as complete and widely representative as possible, tobacco companies have invested substantial resources in promoting and encouraging the scanning of QR codes on cigarette packs. We analyze a large volume of data to reduce sample selection bias, and we treat the act of scanning as an indicator linked to consumption. This approach has limitations, which we discuss in detail in the discussion section. It is worth noting that our data records when and where consumers scan the QR codes on cigarette packs, which acts as a proxy for cigarette engagement or exposure rather than a direct measure of smoking behavior.

The data for our study were provided by a tobacco company in China, whose data collection process adheres to strict industry-specific and enterprise-level ethical review processes and aligns with national data privacy and commercial regulatory guidelines. The cigarette package QR codes (the data collection entry point) underwent mandatory formal review and approval through the tobacco industry’s national regulatory framework prior to public release. In addition, our partner enterprise has established an internal data security management procedure that standardizes ethical requirements across the entire data lifecycle, from collection and transmission to storage, use, and disposal. The research data were collected with the consent of consumers, consumer IDs are desensitized and anonymized, and the data do not involve consumers’ private information.

Specifically, we obtained consumer scan data for the Xihu District of Hangzhou, China, from a tobacco enterprise, covering January 1, 2023, to December 31, 2023. The data are granular: each entry represents an individual consumer scanning the QR code on a cigarette package, and each record includes user ID, scan time, scan location, latitude and longitude, product brand, packaging format, region, verification result, scanned product price, and more.

To reflect the temporal and geographical characteristics of consumer scanning behavior more accurately, we retained only scan records attributable to individual consumers and excluded those from retailers or other non-consumer groups. First, we selected scan records with the packaging property marked as “pack” and a verification result of “first-time successful verification.” Second, because the QR code is on the outer packaging, anyone can scan it, including cigarette sellers, who hold large stocks of cigarettes and may therefore generate large numbers of scan records that distort the data. We therefore calculated the number of scans per individual and, to exclude non-consumer groups (especially cigarette sellers), used only data within the range of the mean plus 0.5 standard deviations. Table 1 presents the structure of the scan data before and after selection. We then counted the number of scans in different time periods in the processed data, drew Fig. 1 in R version 4.3.2, and conducted the principal component analysis and K-Means clustering analysis.
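The two filtering steps described above can be sketched as follows. This is a minimal illustration rather than the code used in the study (the analysis was run in R): the field names `user_id`, `packaging`, and `verification`, the value `first_success`, and the toy records are hypothetical, and the sketch assumes that the cut-off is an upper bound computed with the population standard deviation.

```python
from collections import Counter
from statistics import mean, pstdev


def filter_consumer_scans(records, k=0.5):
    """Return (kept_records, threshold) after the two filtering steps."""
    # Step 1: keep pack-level scans that verified successfully on the first scan.
    valid = [
        r for r in records
        if r["packaging"] == "pack" and r["verification"] == "first_success"
    ]
    # Step 2: count scans per user and drop users above mean + k * SD.
    # Assumption: population SD and an upper bound only.
    counts = Counter(r["user_id"] for r in valid)
    values = list(counts.values())
    threshold = mean(values) + k * pstdev(values)
    kept_users = {user for user, n in counts.items() if n <= threshold}
    return [r for r in valid if r["user_id"] in kept_users], threshold


def make(user_id, n, packaging="pack", verification="first_success"):
    """Build n identical toy records for one (hypothetical) user."""
    return [
        {"user_id": user_id, "packaging": packaging, "verification": verification}
        for _ in range(n)
    ]


# Hypothetical toy data: three ordinary users and one heavy scanner.
records = make("u1", 2) + make("u2", 3) + make("u3", 1) + make("seller", 40)
records += make("u1", 1, packaging="carton")  # removed in step 1
kept, threshold = filter_consumer_scans(records)
print(len(kept), round(threshold, 2))
```

In this toy example the heavy scanner lies far above the threshold and is dropped, while the ordinary users are kept. A small multiplier such as 0.5 makes the filter strict, which trades some genuine heavy consumers for a cleaner consumer sample.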

Table 1 Descriptive statistics.

POI data

For the POI data, we selected Amap (Gao De), the largest open map platform in China, and used its developer API to convert the latitude and longitude coordinates in the raw scan data into geographic locations through reverse geocoding. For each coordinate, we queried all POIs within a 1-kilometer radius. Since multiple POIs may exist in proximity, we selected the POI nearest to the scan location and extracted its geographic information, including its latitude and longitude as well as its place category. We counted the number of POIs of each type and combined them with the map to draw Figs. 2, 3, 4 and 5 in R version 4.3.2.
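The nearest-POI selection step can be illustrated with a short sketch. It does not call the Amap API; it assumes the candidate POIs for a scan have already been retrieved, and the candidate names, categories, and coordinates are hypothetical. Distances use the haversine great-circle formula, which is one reasonable choice and not necessarily the distance measure used in the study.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_poi(scan_lat, scan_lon, candidates, radius_m=1000):
    """Return (poi, distance_m) for the closest candidate within radius_m,
    or (None, None) if no candidate lies inside the radius."""
    best, best_d = None, None
    for poi in candidates:
        d = haversine_m(scan_lat, scan_lon, poi["lat"], poi["lon"])
        if d <= radius_m and (best_d is None or d < best_d):
            best, best_d = poi, d
    return best, best_d


# Hypothetical candidates around a hypothetical scan location.
candidates = [
    {"name": "cafe", "category": "catering", "lat": 30.2595, "lon": 120.1305},
    {"name": "apartments", "category": "residential", "lat": 30.2650, "lon": 120.1300},
    {"name": "far_mall", "category": "shopping", "lat": 30.3000, "lon": 120.1300},
]
poi, dist_m = nearest_poi(30.2590, 120.1300, candidates)
print(poi["name"], poi["category"], round(dist_m))
```

Assigning each scan to a single nearest POI keeps the category counts simple, at the cost of ignoring other nearby venues that may be equally plausible.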

The POI classification is based on Amap’s POI categorization method. Considering the actual conditions of the study area, we consolidated some POI categories. The specific adjustments and their rationale are given in Appendix A in the Supplemental material, and a detailed explanation of the categories is provided in Table A1 (see Appendix A in the Supplemental material).

Result analysis

Preference analysis on scanned time

By analyzing the peak and trough periods of scanning for different brands throughout the day, we can more accurately identify the temporal characteristics and patterns of consumer behavior, thereby supporting stakeholders in making relevant decisions.

Given the wide variety of brands, we selected three cigarette brands (A, B, and C) for illustration. Brands A, B, and C correspond to products 19, 12, and 3, respectively, of the 25 products in the data. Statistical analysis of the scanning data revealed significant differences in scanning time patterns across brands. For instance, as shown in Fig. 1, Product A has three peaks at 13:00, 15:00, and 23:00, while Products B and C have two peaks at 14:00 and 21:00. Assuming that peak scanning periods indicate high-frequency periods of tobacco consumption, we can observe some regular scanning patterns in the figure. For example, the window from 12:00 to 14:00 is a peak period for all brands, suggesting that this is a time when people are more likely to encounter and discuss tobacco-related information in public places, regardless of brand characteristics. Stakeholders can therefore design universal activities for these periods, such as strengthening the communication of general tobacco information. For the other peak periods, which are unique to each brand, stakeholders can design brand-specific and differentiated activities to enhance the communication of brand-specific information, thereby improving the targeting and effectiveness of the activities.
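The hourly counting and peak identification discussed above can be sketched as follows. This is an illustrative simplification: the timestamps are hypothetical, and a peak is defined here as an hour whose count strictly exceeds both neighbouring hours (with the day treated as circular so that 23:00 can be a peak), which may differ from how the peaks in Fig. 1 were read.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Count scans per hour of day (0-23) from ISO-format timestamp strings."""
    counts = Counter(datetime.fromisoformat(t).hour for t in timestamps)
    return [counts.get(h, 0) for h in range(24)]


def local_peaks(counts):
    """Hours whose count is strictly greater than both neighbouring hours.
    The day is treated as circular: hour 23 is adjacent to hour 0."""
    n = len(counts)
    return [
        h for h in range(n)
        if counts[h] > counts[(h - 1) % n] and counts[h] > counts[(h + 1) % n]
    ]


# Hypothetical scan timestamps for a single brand.
timestamps = [
    "2023-05-01T12:40:00",
    "2023-05-01T13:05:00", "2023-05-02T13:30:00", "2023-05-03T13:55:00",
    "2023-05-01T14:10:00",
    "2023-05-01T21:15:00", "2023-05-04T21:45:00",
]
counts = hourly_counts(timestamps)
peaks = local_peaks(counts)
print(peaks)
```

With real data, smoothing the hourly series or requiring a minimum count before calling an hour a peak would reduce spurious peaks caused by small fluctuations.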

Fig. 1

Time preference analysis of tobacco QR scan. Note: This figure is generated in R 4.3.2 software, https://www.r-project.org/.

Preference analysis on scanned POI venue

In consumer behavior research, POI (Point of Interest) preference analysis holds significant practical value. The integration of QR code scanning data with POI enables POI preference analysis of consumers’ scanning behaviors, revealing consumers’ scanning activities at specific locations. This helps identify high-frequency areas of tobacco use.

For example, as shown in Fig. 2, institutions and residential areas rank among the top three regions in terms of scanning frequency for all three brands. This indicates a regional commonality in consumers’ tobacco-related behaviors—consumers exhibit high-frequency and location-concentrated scanning, usage, or consumption of tobacco in institutions and residential areas. It is recommended that stakeholders conduct universal tobacco information communication targeting these regions.

In addition, consumers show a preference for scanning activities of Brands B and C in catering areas, suggesting that consumers are more inclined to consume and discuss these two brands in food and beverage-related venues. It is recommended that stakeholders implement brand-specific information communication in catering areas. If combined with the time patterns mentioned in Sect. “Preference analysis on scanned time”, conducting relevant information communication in catering areas during lunch or dinner hours may yield better results.

Fig. 2

POI preferences analysis of tobacco QR code scan. Note: This figure is generated in R 4.3.2 software, https://www.r-project.org/.

Preference analysis on scanned POI space

Heat map of POI on road map

Figure 3 presents heat maps generated from the POI data corresponding to matched scanning locations, reflecting the distribution of POI on a road map. Across the three different brands, the heat maps consistently show that POI is primarily concentrated in the northeastern city center. In contrast, the southern region displays relatively sparse POI distribution, with these POI tending to cluster near major roadways. This pattern indicates that the POI distribution in the southern area may be influenced by the transportation network, with POI aligning along major thoroughfares. Additionally, there are notable differences in POI distribution across the three brands. These differences may be related to the type, function, or service scope of the POI. For instance, certain brands may exhibit a stronger preference for commercial districts, while others might be more concentrated in residential areas or specialized functional zones. To more accurately represent the density of POI, we propose the use of differentiated markers on the map. For example, POI with higher numbers could be depicted using darker colors, while those with lower numbers could be represented with lighter shades. This approach would enhance the clarity and precision of the visual representation, facilitating a deeper understanding of POI distribution patterns.

Fig. 3

POI heat map of tobacco QR code scan on the road map. Note: This figure is generated in R 4.3.2 software, https://www.r-project.org/.

2D heat map of POI counts

To reflect the scale of POI counts more accurately, we use different markers on the map to indicate varying levels of POI density. Figure 4 illustrates the distribution of POI across different locations in the Xihu District (XH), with panels (a), (b), and (c) representing three different mapping scales. In Fig. 4(a), POI is primarily concentrated in the central and northern parts of the map, with a prominent deep blue area in the northern-central region indicating the highest density of POI. In contrast, the southern region has relatively fewer POI, which correlates with geographical and population distribution patterns. Figure 4(b) shows a more uniform POI distribution than (a), but the overall pattern remains similar: the northern region has slightly more POI than the southern region, and the high-density areas are consistent with those in (a). Figure 4(c) depicts a more dispersed POI distribution. These distribution maps enable decision-makers to better understand POI distributions within specific regions.

Fig. 4

2D POI heat map of tobacco QR code scan. Note: This figure is generated in R 4.3.2 software, https://www.r-project.org/. The coordinates of the graph are north at the top, south at the bottom, west on the left, and east on the right.

3D visualization map of POI counts

Figure 5 presents a 3D visualization of the POI (Point of Interest) distribution, offering a spatial perspective to observe POI density and patterns within specific areas. The 3D representation provides an intuitive way to assess the density and shape of POI distributions, reflecting urban layout characteristics, such as the dense concentration in city centers and the more dispersed distribution in peripheral areas. To better highlight the internal density variations across different brands—given the differences in scanning volumes—we applied scaling adjustments to the POI quantities: (a) scaled down to one-fifth; (b) scaled down to one-half; (c) scaled up by ten times. The 3D visualization map corroborates the findings from the previous section, now presented in a more visually impactful manner. All visualizations consistently indicate a high-density distribution in the central-northern area, which likely represents the urban core or a major commercial hub. For policymakers, 3D visualization provides actionable insights into policy changes by clearly identifying high-density areas. By combining the 3D distribution data with on-the-ground observations, stakeholders can make more informed and effective decisions.

Fig. 5

3D Visualization Map of Tobacco QR Code. Note: This figure is generated in R 4.3.2 software, https://www.r-project.org/. The coordinates of the graph are north at the top, south at the bottom, west on the left, and east on the right.

Principal component analysis and cluster analysis

Principal component analysis

Because human routines are structured, scanning activity concentrates in a small number of time intervals. Based on local living and working patterns and the local time zone, we divided the day into six periods: Night (0:00–7:00), Morning (7:00–12:00), Noon (12:00–15:00), Afternoon (15:00–18:00), Dinner (18:00–21:00), and Before sleep (21:00–24:00). Because different time divisions may affect the results, we conducted sensitivity analyses in which the division points were shifted one hour earlier and one hour later; the results show that the choice of division does not affect the final result. Consumer behavior for each cigarette brand varies across these intervals. To identify the most significant features of these periods, we applied Principal Component Analysis (PCA). Before performing PCA, the data were standardized to eliminate scale differences among variables. We then calculated the covariance matrix of the standardized data and performed eigendecomposition to obtain eigenvalues and eigenvectors, where the eigenvalues indicate the variance explained by each component and the eigenvectors define their directions. Table 2 summarizes the eigenvalues, variance percentages, and cumulative variance percentages.
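The six-period division and the one-hour sensitivity shift can be sketched as follows. This is an illustration under stated assumptions: the `shift` parameter moves all five interior division points together while keeping 0:00 and 24:00 fixed, and each brand is summarized by the share of its scans in each period; the source does not specify whether shares or raw counts were used, and the scan hours below are hypothetical.

```python
from collections import Counter

NAMES = ["Night", "Morning", "Noon", "Afternoon", "Dinner", "Before sleep"]
BOUNDARIES = [7, 12, 15, 18, 21]  # upper bounds (hours) of the first five periods


def period_of(hour, shift=0):
    """Map an hour of day (0-23) to one of the six periods.
    `shift` moves every interior division point by that many hours
    (e.g. +1 or -1 for a sensitivity check)."""
    if not 0 <= hour < 24:
        raise ValueError("hour must be in 0..23")
    for name, upper in zip(NAMES, BOUNDARIES):
        if hour < upper + shift:
            return name
    return NAMES[-1]


def period_shares(hours, shift=0):
    """Share of scans falling in each period, in the order of NAMES."""
    counts = Counter(period_of(h, shift) for h in hours)
    total = len(hours)
    return [counts[name] / total for name in NAMES]


# Hypothetical scan hours for one brand.
hours = [6, 7, 12, 13, 13, 14, 16, 19, 21, 23]
baseline = period_shares(hours)
shifted_later = period_shares(hours, shift=1)
print(baseline)
print(shifted_later)
```

Stacking one such six-element vector per brand gives a brand-by-period matrix, which is the kind of input that standardization and PCA operate on.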

Table 2 Principal component descriptive statistics.

To determine the number of principal components, we employed various methods, all of which consistently supported the selection of two principal components. After selecting the principal components, we analyzed the variables using tools such as the variable factor map (see Figure B2 in Appendix B) and contribution plots (see Figure B3 in Appendix B).

We obtained the principal component score matrix by multiplying the factor loading matrix with the standardized variable matrix. Table B1 (in Appendix B) presents the scores of the two principal components for the 25 product brands. Based on the data in Table B1, we plotted the principal component score chart.
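The PCA steps described in this subsection can be summarized in matrix form. The notation is introduced here for illustration and does not appear in the original analysis: \(X\) is the \(n \times p\) brand-by-period matrix (here \(n = 25\) brands and \(p = 6\) periods), \(\bar{x}_j\) and \(s_j\) are the sample mean and standard deviation of period \(j\), and \(V_2\) collects the eigenvectors of the two largest eigenvalues.

```latex
\[
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad
C = \frac{1}{n-1}\, Z^{\top} Z, \qquad
C\, v_k = \lambda_k v_k, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p ,
\]
\[
\text{share of variance of component } k = \frac{\lambda_k}{\sum_{m=1}^{p} \lambda_m}, \qquad
T = Z\, V_2, \quad V_2 = [\, v_1 \;\; v_2 \,].
\]
```

Because the variables are standardized, \(C\) is the correlation matrix and the eigenvalues sum to \(p\). If the loadings are instead defined as \(v_k \sqrt{\lambda_k}\), as is common in factor-analysis software, the scores differ from \(T\) only by a per-component scale factor.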

From Figure B4 (in Appendix B), it is evident that most brands cluster on the right side of the plot, whereas brands 1, 4, and 7 are distributed in other regions. This suggests that these brands may differ significantly from the others in certain characteristics. However, the products cannot be classified visually from the plot alone, so a formal classification method is necessary. We therefore applied the K-Means clustering algorithm to classify the product brands.

K-Means cluster analysis

We used the K-Means clustering method for analysis. The specific methods and processes are detailed in Appendix C. Silhouette scores indicated that the optimal number of clusters \(k\) was 5, and we clustered the brands accordingly. The characteristic variable values for the final clusters are summarized in Table 3.
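Choosing \(k\) by silhouette score can be sketched in a few lines. This is a toy illustration rather than the procedure in Appendix C: the points stand in for two-dimensional principal component scores and are hypothetical, the initialization is a deterministic farthest-first rule chosen here for reproducibility, and singleton clusters are given a silhouette of 0 by convention. The toy data form three groups, so the sketch does not reproduce the \(k = 5\) reported for the real data.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centers(points, k):
    """Farthest-first initialization (an assumption, chosen for determinism)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two non-empty clusters."""
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # singleton cluster convention
            continue
        a = sum(same) / len(same)
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D scores forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11)]
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)
```

In practice, K-Means is usually run with several random initializations per \(k\), and the silhouette curve is inspected alongside other criteria before the number of clusters is fixed.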

Table 3 K-Means value of the characteristic variable of the clustering result.

Table 4 presents the specific groupings of different clusters and the characteristics associated with them. Overall, cigarettes from different groups exhibit distinct characteristics at various periods, reflecting changes in demand or value for these brands over time. Through cluster analysis, we can identify similarities in the characteristics of consumers who purchase products with different brands. These findings can be used to develop evidence-based tobacco control policies, evaluate the effectiveness of existing measures, and ultimately contribute to reducing tobacco-related harm.

Table 4 Clustering results and characteristics.

Conclusion and discussion

By integrating QR code scanning data from a commercial partner in 2023 with POI data, this study explores consumer behavior patterns across temporal and spatial dimensions. Using methods such as principal component analysis (PCA) and K-Means clustering, we classify products of different brands and uncover significant differences in consumer preferences regarding time preferences, scanning location distribution, and geographical spatial distribution. These findings not only help tobacco companies optimize resource allocation and improve operational efficiency, but also offer policymakers actionable insights for policy adjustments.

First, regarding the time and location preferences of QR code scanning behaviors, we found that consumers exhibit distinct scanning patterns, with some cross-brand commonalities as well as characteristics specific to individual brands. For example, all brands show their highest daily scanning activity around lunchtime, while the second peak time varies across brands. Based on these observations, stakeholders can arrange universal tobacco information communication during lunchtime and brand-specific communication during the unique peak times of each brand. Second, by integrating QR code scanning data with POI data, we found that institutional and residential areas consistently rank among the top three regions with the highest scanning frequency for all brands. Meanwhile, each brand also has its own high-frequency scanning areas, with some concentrated in catering districts. Based on these findings, stakeholders can conduct universal tobacco information communication in institutional and residential areas, and brand-differentiated communication in the other brand-specific regions. Furthermore, heat map analysis indicates that consumer activities are typically concentrated in urban commercial centers and transportation hubs, while suburban areas and regions with low foot traffic see much less activity.

In the past, tobacco research has relied mainly on public data [7, 15, 19], with a small number of studies using questionnaires or self-collected data [17, 21]. Almost no studies have used real behavioral data from tobacco consumers. Such research is needed because it can reveal patterns that other data sources miss. For instance, we found that many consumers scan codes, and possibly smoke, near no-smoking areas; such fine-grained behaviors cannot be fully reflected in the data of previous studies. The approach is also not limited to the tobacco industry, and other industries can conduct research using similar methods. Our analysis combines the temporal and geographical attributes of tobacco QR code scanning behaviors with venue characteristics and mapping data to conduct spatiotemporal feature analysis of scanning behaviors, summarizing the behavioral patterns of different brands. We recommend integrating these analytical findings into tobacco information communication strategies, as doing so can substantially improve the effectiveness of implementation. These practices can more accurately enhance consumers’ understanding of specific tobacco information, promote communication between stakeholders and consumers, stimulate tobacco-related conversations among consumers, and leverage the social attributes of tobacco to facilitate widespread dissemination of tobacco information within consumer groups.

In addition, spatiotemporal feature analysis of scanning data can also provide significant reference value for stakeholders’ internal management activities. For instance, tobacco enterprises can use this information to better arrange upstream procurement, internal inventory levels, downstream distribution, and other operational activities, thereby promoting better alignment between supply and demand.

Finally, to improve management efficiency, we classified and grouped multiple products: we used PCA to extract key temporal features and applied K-Means clustering to categorize products into different groups. The research can facilitate in-depth understanding of product labeling and product differentiation, and lay the groundwork for more scientific and rational formulation of product portfolio strategies and guidance for new product development and positioning in the future.

This study has several limitations. First, although tobacco QR code scanning data comes from consumers, not all consumers who purchase tobacco scan the code. Scanning data therefore cannot fully represent purchasing or consumption behaviors, and policy implications should be triangulated with enforcement or health data. Future work could combine scanning data with terminal consumption data to conduct spatiotemporal feature analysis of consumers’ purchasing and usage behaviors. Second, this study selected only one region as the analysis sample, so the conclusions may not be applicable to other regions or brands. We recommend that future research analyze sample data from more regions and brands to summarize patterns and guide relevant decision-making. Nevertheless, these limitations do not overshadow the contributions of our research, which provides initial insights for industry stakeholders’ digital decision-making based on consumer behavior and lays a foundation for more targeted and evidence-based decision-making.