Introduction

With the proliferation of information technologies, institutions of higher education have increasingly adopted sports data information management systems. A college sports data information management system encompasses the collection, analysis, and administration of sports-related data within academic institutions, facilitated by computer technology, network technology, and other information technology tools1. The system can collect and store all data related to college sports, encompassing athlete and coach profiles, competition results, and training conditions2. Web log mining technology can be combined with the Apriori association rule algorithm: web log mining first collects user behavior data, and the Apriori algorithm then performs frequent itemset mining and association rule mining, thereby elucidating relationships between different pieces of information and enhancing the retrieval accuracy and efficiency of the system3.

In the research on information management systems, Svacina et al. (2020) combined university laboratory safety management systems with natural ecosystems to construct a laboratory ecological safety management system covering ecological subjects, ecological objects, and ecological environments. They used fault tree analysis methods to identify risk factors in the system. Additionally, they constructed a fault tree model for ecological safety accidents in university laboratories4. Rak and Żyła (2022) first introduced key technologies and related concepts, analyzed the system’s requirements from the perspectives of system functionality and performance, and defined the main functional modules of the system. Then, they designed the system’s technical architecture and provided corresponding database designs5. In the study of the Apriori association rule algorithm, Su and Wu (2021) proposed the enhanced frequent pattern-Apriori algorithm based on closed item sets. Experimental results showed that compared to the classical Apriori algorithm, the improved algorithm achieved an average performance improvement of 290% on smaller data scales with diverse increments and support levels, without reducing the number of correctly mined frequent item sets6. Wang et al. (2023) modeled project features, user behavior characteristics, and content features using neural network algorithms. Additionally, based on the neural network algorithm, they constructed a relevant entrepreneurial project recommendation and resource optimization model and conducted evaluation and analysis7.

Although studies have explored the application of association rule algorithms in various fields, in-depth research in domains such as sports data information management remains insufficient. As a result, existing methods may not be fully suited to the characteristics of sports data, such as timeliness and diversity. Additionally, the traditional Apriori algorithm may deliver unclear performance gains and run inefficiently, especially when dealing with large-scale data or significant data increments. Furthermore, many studies have not fully considered the scalability of the algorithm, making it difficult to cope with rapidly changing data environments.

This study extends conventional research methodologies by integrating log mining technology into web application development technology, thereby enhancing system performance and sports data processing capabilities while gaining insights into students’ browsing habits and preferences. Additionally, an optimized Apriori algorithm is used to identify correlations within sports data information, improving retrieval accuracy and efficiency and facilitating administrators’ access to detailed system information. Finally, the reliability and efficacy of the optimized system are empirically validated through experimentation. The contribution of this study lies in the first-time integration of the Apriori association rule algorithm with web application development technology to optimize the sports data information management system. In particular, by combining log mining techniques with the genetic algorithm (GA), this study achieves effective optimization of the Apriori algorithm. This integration offers a new solution for sports data management, and the optimization improves the algorithm’s execution efficiency and its capability to handle large-scale data.

Optimization of sports data information management system based on web log mining and Apriori algorithm

Application of web log mining technology in data informatization

When a user interacts with the server, various logs are generated, capturing essential information such as the user’s IP address and the nature of the request executed. These logs are critical in analyzing user behavior and optimizing websites8,9. Central to this process is web log mining, whose first task is the preprocessing of log documents, encompassing data cleaning, integration, transformation, and reduction10. This preparatory phase yields a refined, simplified, and more precise dataset, conducive to further analysis and mining. The primary task of web log mining is then to analyze the logs to obtain users’ page access patterns, such as access time, frequency, and source11. This information enables a comprehensive understanding of user behavior, including preferred pages and functions, and may even unveil latent needs. The analysis can be used to optimize the website, meet user requirements, and improve the user experience.
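The log preprocessing and access-pattern analysis described above can be sketched as follows. This is a minimal illustration only: it assumes the server writes logs in the Common Log Format, which the study does not state, and the sample lines, the `parse_line` and `clean` helpers, and the list of static-file suffixes are all hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed Common Log Format; the actual log layout is not given in the study.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_line(line):
    """Return a dict for a well-formed log line, or None for a malformed one."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

def clean(records):
    """Data cleaning: keep successful page requests, drop static resources."""
    static_suffixes = (".css", ".js", ".png", ".jpg", ".ico")
    return [
        r for r in records
        if r is not None
        and 200 <= r["status"] < 300
        and not r["path"].endswith(static_suffixes)
    ]

# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [05/Mar/2024:09:00:01 +0000] "GET /athletes HTTP/1.1" 200 512',
    '10.0.0.1 - - [05/Mar/2024:09:00:02 +0000] "GET /static/site.css HTTP/1.1" 200 90',
    '10.0.0.2 - - [05/Mar/2024:09:01:10 +0000] "GET /results HTTP/1.1" 200 734',
    '10.0.0.1 - - [05/Mar/2024:09:02:30 +0000] "GET /results HTTP/1.1" 200 734',
    '10.0.0.3 - - [05/Mar/2024:09:03:00 +0000] "GET /missing HTTP/1.1" 404 0',
    'not a log line',
]

records = clean(parse_line(line) for line in SAMPLE_LOG)

# Access frequency per page and number of distinct source addresses per page.
page_counts = Counter(r["path"] for r in records)
visitors_per_page = {
    path: len({r["ip"] for r in records if r["path"] == path})
    for path in page_counts
}
print(page_counts.most_common())
print(visitors_per_page)
```

Malformed lines, failed requests, and static resources are discarded before counting, so the remaining records reflect page views only.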

Web log mining technology finds crucial applications in security analysis and performance optimization12. By analyzing the log files, anomalous login attempts and potential security breaches can be detected, safeguarding website integrity. Concurrently, log mining facilitates the identification of performance bottlenecks within websites, enabling timely configuration adjustments to enhance access speed and responsiveness13,14,15. This technology plays a pivotal role in understanding users’ needs, optimizing website experience, ensuring security, and improving performance. By scrutinizing log documents, insights into user behavior can be gleaned, leading to website enhancements and improved user satisfaction16,17,18.

Before data mining, the original data is often disorganized and unsuitable for direct analysis. Data mining is divided into three stages: data preparation, data preprocessing, and data mining with result display. Data preparation should be result-oriented, focus on the intrinsic meaning of the data, and improve algorithm efficiency19. This requires selecting appropriate data according to the research purpose and excluding anomalous data to avoid misleading outcomes and reduce processing time. Algorithmic efficiency is further bolstered through strategies such as distributed processing or increased concurrency, which shorten processing cycles without compromising accuracy. Actual data often contain problems such as attribute mismatches, null values, out-of-range values, or type mismatches, which affect data integrity and accuracy20. Hence, data preprocessing is a necessary step. Once the first two stages are completed, refined and valuable data is obtained, and the analysis objectives can be defined based on the data or the identified issues, guiding the pursuit of the desired results and knowledge.

The application of the Apriori association rule algorithm and GA in sports data management systems

Association rules play a pivotal role in uncovering implicit connections among multiple entities, akin to the butterfly effect, thereby elucidating correlations between various elements21,22,23. Through association rule analysis, insights such as “the presence of certain attributes coincides with the presence of others” or “the occurrence of specific events triggers subsequent events” can be derived. The Apriori algorithm employs an iterative approach to achieve this goal. It begins by exploring all candidate 1-item sets and obtains the frequent 1-item sets through pruning. Subsequently, it iteratively connects and prunes to uncover frequent 2-item sets, and so forth, until no larger item sets can be formed. Finally, the frequent k-item sets derived from the Apriori algorithm constitute the conclusion sets of the analysis. Table 1 presents the sports scores of four college students.

Table 1 Sports scores of 4 college students.
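The iterative connect-and-prune procedure of the classical Apriori algorithm can be sketched as follows. This is a minimal textbook-style illustration, not the implementation used in this study; the function name `apriori`, the absolute support count `min_sup`, and the sample transactions (which are hypothetical and not the data of Table 1) are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support_count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the database per candidate level.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Candidate 1-item sets, pruned by minimum support.
    items = {item for t in transactions for item in t}
    current = {c: n for c, n in count(frozenset([i]) for i in items).items()
               if n >= min_sup}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Connection step: unions of frequent (k-1)-item sets that have k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items() if n >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical graded results, one transaction per student.
transactions = [
    {"run:A", "jump:A", "swim:B"},
    {"run:A", "jump:A"},
    {"run:A", "swim:B"},
    {"run:B", "jump:A", "swim:B"},
]
result = apriori(transactions, min_sup=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The loop stops as soon as a level yields no frequent item set, which corresponds to the point where larger item sets cannot be formed.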

Mining association rules constitutes a crucial data mining task aimed at extracting valuable information from extensive datasets24,25,26,27. It encompasses two primary subtasks. First, the frequent item sets are filtered from the cleaned dataset; pruning low-frequency items and their relationships at this stage reduces memory consumption and algorithmic complexity28,29,30. Second, these frequent item sets are analyzed to unearth potential relationships among the data, thus revealing association rules. Through statistical, machine learning, and artificial intelligence techniques, the data are analyzed, and the laws and trends behind them are mined. This helps to understand the data in depth and to find practical applications for it. Association rule mining is widely used in e-commerce, medical care, social network analysis, and other domains, although it faces challenges such as managing large datasets, algorithmic efficiency, data quality, and privacy preservation.
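The second subtask, deriving rules from frequent item sets, is commonly done with a confidence threshold, where the confidence of a rule X → Y is sup(X ∪ Y) / sup(X). The sketch below illustrates this standard measure; the study does not state which rule-quality measure it uses, so `min_conf`, the helper names, and the sample transactions are assumptions for illustration.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def generate_rules(frequent_itemsets, transactions, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf.

    Assumes every item set passed in occurs in at least one transaction.
    """
    rules = []
    for itemset in frequent_itemsets:
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        whole = support_count(itemset, transactions)
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                confidence = whole / support_count(antecedent, transactions)
                if confidence >= min_conf:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules

# Hypothetical transactions, one per student.
transactions = [
    frozenset({"attended_training", "passed_test"}),
    frozenset({"attended_training", "passed_test"}),
    frozenset({"attended_training"}),
    frozenset({"attended_training", "passed_test"}),
]
frequent = [frozenset({"attended_training", "passed_test"})]
rules = generate_rules(frequent, transactions, min_conf=0.8)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```

Here only the rule from `passed_test` to `attended_training` survives, because the reverse rule has a confidence of 0.75, below the assumed threshold.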

In the sports data management system, the adoption of GA demonstrates notable efficacy in optimizing the computational overhead, especially in the key task of generating frequent item sets. GA, mimicking natural selection and genetic mechanisms, has been successfully deployed to tackle various complex problems, including the generation of frequent item sets in data mining. The sports data management system involves a lot of data, ranging from athlete performance metrics to competition results and team statistics. In such a system, the generation of frequent item sets emerges as a fundamental and imperative task, entailing the identification of recurrent data patterns within vast datasets. While traditional methods like the Apriori algorithm prove effective in certain scenarios, they often falter when confronted with colossal and intricate datasets, leading to computational challenges.
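One way a GA can search for frequent item sets is sketched below. The study does not specify its chromosome encoding, fitness function, or operators, so everything here is an assumption chosen for illustration: a bit-string encoding over the item universe, a fitness of support times item-set size (zero below the threshold), tournament selection, one-point crossover, bit-flip mutation, and elitism. All function names and parameter values are hypothetical.

```python
import random

def support(itemset, transactions):
    """Relative support of `itemset` in the transaction list."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def decode(chromosome, items):
    """Map a bit tuple to the item set it encodes."""
    return frozenset(i for i, bit in zip(items, chromosome) if bit)

def fitness(chromosome, items, transactions, min_sup):
    """Reward larger item sets, but only when they meet the support threshold."""
    itemset = decode(chromosome, items)
    if not itemset:
        return 0.0
    sup = support(itemset, transactions)
    return sup * len(itemset) if sup >= min_sup else 0.0

def evolve(transactions, min_sup, pop_size=20, generations=30,
           mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    items = sorted({i for t in transactions for i in t})

    def score(chromosome):
        return fitness(chromosome, items, transactions, min_sup)

    population = [tuple(rng.randint(0, 1) for _ in items) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = population[:2]                    # elitism: keep the two best
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            cut = rng.randrange(1, len(items))       # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = tuple(b ^ 1 if rng.random() < mutation_rate else b
                          for b in child)
            next_gen.append(child)
        population = next_gen
    best = max(population, key=score)
    return decode(best, items), score(best)

# Hypothetical transactions for illustration.
transactions = [
    frozenset({"run:A", "jump:A", "swim:B"}),
    frozenset({"run:A", "jump:A"}),
    frozenset({"run:A", "swim:B"}),
    frozenset({"run:B", "jump:A", "swim:B"}),
]
best_set, best_score = evolve(transactions, min_sup=0.5)
print(sorted(best_set), best_score)
```

Unlike the level-wise Apriori search, this sketch evaluates only the item sets that appear in the population, which is where the reduction in search cost would come from; it returns a single high-fitness item set rather than the complete collection.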

Optimization of sports data information management system

In the traditional algorithm, despite the connection and pruning steps used for filtering, it remains necessary to traverse every transaction in the database and count every candidate item set it contains when determining frequent item sets31. Given the typically substantial scale of databases, frequent database scans consume considerable memory and processing time. In addition, the algorithm produces numerous invalid candidate sets during loop traversal, because it fails to exclude elements that should not participate in the combination. Optimization measures primarily focus on three aspects. Firstly, database compression is employed32. By deleting or marking entries during database traversal, redundant inspections are avoided and unnecessary operations are minimized. Secondly, dynamic reduction of candidate item sets is implemented, whereby candidate item sets meeting the minimum support threshold are directly incorporated into the frequent item sets. Finally, item sets are pre-filtered, and unqualified item sets are deleted before the connection step to reduce invalid calculations. The introduction of GA can further enhance the efficiency of this process. By simulating natural selection and genetic mechanisms, GA adopts a global search strategy when searching for the optimal solution. When dealing with frequent item sets, GA iteratively improves the solution through crossover and mutation operations, thereby reducing computational and search costs while enhancing mining accuracy. The existing sports data information management system is optimized accordingly, and the optimized system structure is displayed in Fig. 1.

Figure 1 Optimized sports data information management system.

The purpose of database compression is to reduce duplicate checks and unnecessary operations when traversing the database, which can be achieved by deleting or marking entries. The compressed database is defined in Eq. (1):

$$T' = \{\, t' \mid t' = \text{compress}(t),\ \forall t \in T \,\}$$
(1)

\(T'\) and \(T\) represent the compressed and original databases, respectively; \(t\) refers to a transaction; \(\text{compress}\) denotes the compression function; \(t'\) means the compressed transaction. When generating frequent item sets, the performance of the algorithm can be optimized by dynamically resizing the candidate set. Frequent item sets are added as follows:

$$F_k = F_{k-1} \cup \{\, c \mid c \in C_k,\ sup(c) \ge min\_sup \,\}$$
(2)

\(F_k\) means the set of frequent item sets after the \(k\)-th iteration; \(C_k\) indicates the set of candidate \(k\)-item sets and \(c\) a single candidate in it; \(sup\) represents the support function, and \(min\_sup\) is the minimum support threshold. Unqualified item sets are deleted through a pre-filtering operation before the connection step, to reduce invalid calculations. The filter function is defined as:

$$C_k' = \text{filter}(C_k) = \{\, c \mid c \in C_k,\ \text{some\_condition}(c) \,\}$$
(3)

\(C_k'\) indicates the filtered candidate set; \(\text{filter}\) stands for the filter function; \(\text{some\_condition}\) refers to the filter condition. In connection with the operation process of web log mining, the designed system function structure is divided into five parts. The specific content is given in Fig. 2.
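Equations (1) to (3) can be read together as one level-wise loop, sketched below. The concrete forms of `compress` and `some_condition` are not given in the text, so the choices here are assumptions: `compress` drops items that no longer belong to any frequent item set (and transactions that become too short are discarded), and `some_condition` is the classical requirement that all (k-1)-subsets of a candidate are frequent. The function `optimized_apriori` and the sample data are hypothetical.

```python
from itertools import combinations

def compress(transaction, frequent_items):
    """Eq. (1), assumed form: drop items that cannot occur in a frequent set."""
    return frozenset(transaction) & frequent_items

def some_condition(candidate, prev_level):
    """Eq. (3), assumed condition: every (k-1)-subset is already frequent."""
    k = len(candidate)
    return all(frozenset(s) in prev_level for s in combinations(candidate, k - 1))

def optimized_apriori(transactions, min_sup):
    db = [frozenset(t) for t in transactions]
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {c for c, n in counts.items() if n >= min_sup}
    frequent = set(level)                       # F_1
    k = 2
    while level:
        # Eq. (1): T' = {compress(t) | t in T}; transactions with fewer than
        # k remaining items cannot contain a k-item set and are removed.
        keep = frozenset().union(*level)
        db = [c for c in (compress(t, keep) for t in db) if len(c) >= k]
        # Connection step, then Eq. (3): C'_k = filter(C_k).
        joined = {a | b for a in level for b in level if len(a | b) == k}
        filtered = {c for c in joined if some_condition(c, level)}
        # Eq. (2): F_k = F_{k-1} united with candidates meeting min_sup.
        level = {c for c in filtered
                 if sum(1 for t in db if c <= t) >= min_sup}
        frequent |= level
        k += 1
    return frequent

# Hypothetical transactions for illustration.
transactions = [
    {"run:A", "jump:A", "swim:B"},
    {"run:A", "jump:A"},
    {"run:A", "swim:B"},
    {"run:B", "jump:A", "swim:B"},
]
frequent_sets = optimized_apriori(transactions, min_sup=2)
print(sorted(sorted(s) for s in frequent_sets))
```

Because the compressed database shrinks at every level, later support counts scan fewer and shorter transactions, while the support of every remaining candidate is unchanged.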

Figure 2 System functional structure diagram.

Figure 2 illustrates the five modules. The user management module distinguishes between ordinary users and administrators in the system and manages data access and operations according to their respective permissions. The data import module imports log data from multiple websites through task management functionality, ensuring the initial integration and availability of data. After import, the data preprocessing module cleans and organizes the data as necessary, including user identification, session identification, and transaction tracking, preparing the data for mining and analysis. The core association rule mining module applies association rule algorithms to discover valuable information and patterns in the preprocessed data, which is a critical step in system implementation. Finally, the statistical analysis module statistically analyzes key performance indicators such as daily website traffic and user page visits to assess the operational status of the website and user behavior trends. The collaborative work of these modules enables the entire system to efficiently process and analyze large amounts of sports data, thereby enhancing the quality and efficiency of information management.

Data preprocessing is a pivotal step preceding data mining, given that data from diverse sources often contain inaccuracies, inconsistencies, or missing values, termed “dirty data.” Using such data directly for mining risks yielding inaccurate and misleading outcomes. Hence, it is necessary to convert these dirty data into usable transactional data that meet the requirements of mining algorithms and to remove data with no reference value. Preprocessing procedures encompass data cleaning and transformation, with methods tailored to specific application contexts and data formats. In association rule mining especially, transferring the original data into a relational database is a critical juncture. Association rule mining requires a transactional data format, in which item occurrences are recorded per transaction, whereas relational databases usually store data in tabular form. During data transfer, data cleaning is essential to reduce noise, redundancy, and missing values, thereby limiting the influence of dirty data. The cleaned data is then converted into transactional data and stored in the relational database, while preserving data integrity and consistency. To sum up, preprocessing is the basis of effective data mining, notably in association rule mining, where correct data preprocessing and transformation are paramount to extracting valuable insights from the data.
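The conversion from cleaned log records to transactional data can be sketched as follows. The study does not describe its identification rules, so this example makes two common assumptions: a user is identified by IP address, and a new session starts after 30 minutes of inactivity. The function `to_transactions`, the timeout value, and the sample records are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed heuristic, not from the study

def to_transactions(records, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, page) records into one page set per session."""
    transactions = []
    last_seen = {}      # user -> timestamp of that user's previous request
    open_session = {}   # user -> set of pages in the user's current session
    for user, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        # Session identification: a long gap closes the current session.
        if user in last_seen and ts - last_seen[user] > timeout:
            transactions.append(open_session.pop(user))
        open_session.setdefault(user, set()).add(page)
        last_seen[user] = ts
    transactions.extend(open_session.values())
    return transactions

# Hypothetical cleaned records: (user identified by IP, timestamp, page).
records = [
    ("10.0.0.1", datetime(2024, 3, 5, 9, 0), "/athletes"),
    ("10.0.0.1", datetime(2024, 3, 5, 9, 5), "/results"),
    ("10.0.0.1", datetime(2024, 3, 5, 10, 0), "/training"),
    ("10.0.0.2", datetime(2024, 3, 5, 9, 1), "/results"),
    ("10.0.0.2", datetime(2024, 3, 5, 9, 10), "/results"),
]
sessions = to_transactions(records)
for pages in sessions:
    print(sorted(pages))
```

Each resulting page set is one transaction in the sense required by association rule mining, and repeated visits to the same page within a session collapse into a single item.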

System performance analysis under the optimization of web log mining technology and Apriori algorithm

Comparison and analysis of performance before and after optimization of the Apriori association rule algorithm

The dataset used in this study is sourced from Kaggle, covering multiple domains including education and sports. This allows researchers to access data related to students’ athletic performance, health tests, or other sports-related information, which is essential for researching the sports data information management system. Datasets on Kaggle are typically provided by community members or collaborating organizations, which supports data quality and diversity. Researchers can access sports data from diverse regions and backgrounds, which helps increase the breadth and depth of research. The datasets can be downloaded through the official website (https://www.kaggle.com/). The experimental environment is as follows: the operating system is Windows 10, the memory is 4 GB, the CPU runs at 2.0 GHz, and the disk storage is 500 GB. The algorithms before and after optimization are employed to mine datasets of varying sizes, and the time taken to generate frequent item sets and conduct association rule mining is recorded. Each algorithm is run ten times, the average runtime is computed, and the experimental findings are shown in Fig. 3.

Figure 3 Performance comparison before and after algorithm improvement.

Figure 3 shows that before optimization, the running time of the Apriori algorithm increased exponentially with dataset size, leading to suboptimal performance in processing large-scale data and failing to meet practical requirements. The optimized algorithm markedly enhances execution efficiency. For a data volume of 100, the traditional algorithm requires 0.73 s to generate frequent item sets, whereas the optimized algorithm takes 0.15 s. When the data volume is 2,000, the traditional algorithm needs 15.76 s, while the optimized algorithm takes only 3.58 s. As data volume increases, the advantages of the optimized algorithm become increasingly apparent. Notably, the optimized algorithm demonstrates a minimum enhancement of 10–15% in execution efficiency. This underscores the effectiveness of the optimized Apriori algorithm in processing large-scale data.
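The two runtime pairs quoted above can be converted into speedup factors and relative time reductions with a short calculation. The sketch uses only the four reported values; the helper names are hypothetical, and since the text does not state how the 10–15% minimum enhancement was derived, the calculation is not an attempt to reproduce that figure.

```python
# Runtimes in seconds as quoted in the text for Fig. 3:
# data volume -> (traditional Apriori, optimized Apriori).
reported = {100: (0.73, 0.15), 2000: (15.76, 3.58)}

def speedup(before, after):
    """How many times faster the optimized run is."""
    return before / after

def time_reduction(before, after):
    """Fraction of the original runtime that is saved."""
    return (before - after) / before

for size, (before, after) in reported.items():
    print(f"n={size}: speedup {speedup(before, after):.2f}x, "
          f"time reduction {time_reduction(before, after):.1%}")
```

Expressing the results as ratios makes runs on different data volumes directly comparable, which absolute differences in seconds do not.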

Horizontal comparison of experimental results and analysis of sports data information management system

For the experiments, data from “21 Places to Find Free Datasets for Data Science Projects” is utilized. This resource, provided by the University of Washington, comprises free datasets for data science projects and is accessible via the university’s official website (https://careers.uw.edu/blog/2021/10/05/21-places-to-find-free-datasets-for-data-science-projects-shared-article-from-dataquest/). The sample set is divided into five subsets, each with 2000 students’ sports data. The algorithms compared are the Binary Search Algorithm (BSA), Linear Search Algorithm (LSA), Binary Search Tree Algorithm (BSTA), Hash Table Algorithm (HTA), and Breadth-First Search Algorithm (BFSA). The retrieval accuracy and retrieval time of the system are shown in Fig. 4:

Figure 4 The results of system retrieval accuracy and retrieval time. (a) Retrieval accuracy; (b) retrieval time.

According to the experimental results in Fig. 4, compared with the traditional models, the proposed algorithm achieves the highest retrieval accuracy of 98.5%, signifying its ability to accurately locate user-required information. Furthermore, across repeated experimental tests, the average retrieval accuracy of this algorithm is 92.3%. This result underscores the algorithm’s efficacy and application value in information retrieval tasks, meeting users’ demands for accurate and comprehensive information. Regarding retrieval time, the proposed algorithm outperforms the other models, exhibiting the shortest retrieval time and the highest stability. Analysis of the experimental data reveals an average retrieval time of 1.51 s for this algorithm, approximately 23% faster than the alternative models. The optimized system is deployed on a small scale at University A, where users rate their experience after each use on a scale of 1 to 4, indicating satisfaction levels from low to high. During the usage period, a total of 976 users provided ratings. The results are outlined in Table 2:

Table 2 User satisfaction score.

According to the scoring results, 76.84% of users are satisfied with the new system. This substantial satisfaction rate underscores the considerable enhancements and advantages of the new system across multiple facets.

Conclusion

With the continuous development of science and technology, Web log mining technology and the Apriori association rule algorithm are integrated into the sports data information management system. This study leverages GA, the Apriori association rule algorithm, and Web application development technology to optimize and upgrade the management system. Firstly, a Web application development technology based on log mining technology is introduced. Secondly, the Apriori algorithm is refined through log mining integration to uncover the correlation between sports data and information, thus improving retrieval accuracy and reducing retrieval time. Finally, experimental verification is conducted, confirming the reliability and effectiveness of the optimized system. The experimental results demonstrate that the proposed algorithm enhances execution efficiency by at least 10–15%, underscoring its robust performance in handling vast information volumes. Compared with the traditional management system, the proposed algorithm remarkably improves information retrieval time and accuracy, with an average retrieval accuracy of 98.3% and a 23% reduction in retrieval time. This improvement is attributed to the algorithm’s use of association techniques to enhance information correlations, thus shortening the retrieval time while improving the accuracy. Therefore, the proposed algorithm offers distinct advantages in accelerating information retrieval and holds significant practical application value.

However, several shortcomings exist. Firstly, there is considerable scope for enhancing the efficiency of algorithm selection and optimization within the association rule analysis algorithm. Particularly, efforts to enhance the Apriori algorithm should focus on minimizing loop traversal in frequent itemset discovery and construction. Secondly, the sample size is relatively limited, with only 1,500 data points selected in this study. Given that real-world datasets typically exceed this quantity, future studies will expand the sample set to bolster credibility further.