Introduction

In the current era of data explosion, the widespread dissemination of online rumors has become an increasingly prominent problem. Data elements serve as critical resources in production and operational activities (Data Element White Paper, 2022). Their circulation releases the value of data resources, but it also accelerates the generation and diffusion of online rumors. The global data and analytics market was valued at USD 100.8 billion in 2022 and is expected to reach USD 188.78 billion by 2027, with a compound annual growth rate (CAGR) of 13.36% (GlobalData, 2023). Social media, as a vital carrier for data element circulation, enables data flow through user activities such as content creation, browsing, sharing, liking, and commenting. Data subjects receive corresponding revenue shares and subsidies based on platform agreements. The propagation of rumor data therefore involves multi-party exchanges of data rights and interests, which makes data source traceability and rights compliance necessary. Constructing models and technologies for the provenance of data with rights and interests in rumor dissemination (tracing rights transitions among data subjects, identifying propagation subjects and propagation paths, and enabling compensation claims for infringement-related profits caused by violations from different rights subjects) is thus crucial for safeguarding the reliability and sustainability of the data ecosystem.

Existing research on rumor traceability focuses on identifying the rumor source nodes. Early studies concentrated on modeling the importance of network nodes to infer potential rumor sources (Shah and Zaman, 2016; Zhang et al., 2017; Jain et al., 2020). As research on rumor propagation mechanisms matures, reverse propagation modeling has emerged as a focal area (Jiang et al., 2022; Qiu et al., 2022). While current studies provide valuable methods and models for identifying rumor origins, they still need to adapt to evolving data rights (e.g., data holding rights, data revenue rights) and liability attribution in social media environments. In existing data provenance expression models, frameworks like OPM and PROV typically describe historical data states but lack unified mechanisms for provenance information publication and access (Shen and Zhang, 2011). The PROV-O model employs RDF data types for ontological descriptions of provenance data, but has limited expressiveness in representing complex data processing workflows (Ni, 2017). The ProVOC model supports flexible extensions but suffers from narrow applicability, requiring further validation of its generalizability and management efficacy (Wang et al., 2019). Current research lacks a data provenance model tailored to data element circulation that supports rights detection, and therefore fails to meet the growing demand for secure and standardized environments in data markets. To address this, this paper proposes a research methodology for the provenance of data with rights and interests in data element circulation during online rumor propagation. By standardizing the circulation processes of rumor data elements and refining the data rights confirmation framework, this paper enhances classical provenance models like PROV-O and ProVOC.
Through defining data and model concepts, sharing the provenance of data with rights and interests, and their relationships, it establishes the PROV-OCC (Ontology-Circulation Confirmation) model for the provenance of data with rights and interests in rumor data circulation. Data provenance models require implementation and validation via provenance technologies. Current mainstream techniques include annotation methods (Ming et al., 2012), reverse query methods (Woodruff and Stonebraker, 1997), inverse function-based approaches (Fan and Poulovassilis, 2003), and bit-vector storage localization (Gadang et al., 2008). Among these, annotation methods align with data element circulation processes: they tag data with processing-related information to trace historical data states, and the records propagate alongside the data (Ming et al., 2012). This paper integrates the classical W7 annotation model with the PROV-OCC model through semantic fusion. Leveraging knowledge graph representations of typical rumor cases, it validates the effectiveness of the provenance of data with rights and interests via OWL2 instance encoding and ontological reasoning.

The structure of this paper is as follows: First, this paper introduces existing research on online rumor data provenance methods, data provenance models, and technologies, proposing research questions and objectives. Next, it introduces the research methodology, which centers on ontology-based modeling for constructing a model for provenance of data with rights and interests. Then, the PROV-OCC model for the provenance of data with rights and interests is designed, semantically integrated with the W7 annotation method, instantiated through OWL2 ontology encoding, and validated using ontological reasoning. Finally, this paper discusses its research significance, innovations, limitations, and future directions, concluding with a summary of the contributions.

Related work

Online rumor data provenance method

Research on online rumor data provenance methods focuses on identifying rumor source nodes. Early studies concentrated on modeling the importance of network nodes to infer potential rumor sources. The centrality of rumor network nodes has been shown to enable effective rumor source detection under generic stochastic or non-exponential propagation time models (Shah and Zaman, 2016); a polynomial-time greedy algorithm was proposed to identify rumor source nodes within the minimal distinguishing node subset in social networks (Zhang et al., 2017); and a value-increasing game model was introduced that quantifies the marginal contributions of social network nodes to identify more influential rumor propagation sources (Jain et al., 2020). As research on rumor propagation mechanisms matures, reverse propagation modeling has become a focal area: a graph network-based user information propagation logic model was designed to predict rumor sources (Jiang et al., 2022); reverse propagation processes were modeled using timestamp-updated observer sets to localize rumor origins (Qiu et al., 2022); a reverse sampling optimization strategy was developed to iteratively estimate initial source features and diffusion models for rumor source prediction (Qu et al., 2023); an infection potential-based rumor propagation network reconstruction method was proposed to trace candidate sources (Li et al., 2023); and mathematical descriptions of dynamic social network changes during rumor propagation were employed to trace rumor origins (Yutao et al., 2023).
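To make the idea of centrality-based source inference concrete, the following is a minimal sketch under stated assumptions. It does not reproduce the rumor centrality estimator of Shah and Zaman (2016) or any other cited method; it substitutes a simpler distance-centrality heuristic, which picks the infected node with the smallest total hop distance to all other infected nodes. The toy graph and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, start):
    """Return hop distances from start to every node reachable in graph."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def estimate_source(graph, infected):
    """Distance-centrality heuristic (an assumption, not a cited method):
    return the infected node with the smallest total distance to all
    infected nodes, measured inside the infected subgraph."""
    infected = set(infected)
    # Keep only edges between infected nodes.
    subgraph = {n: [m for m in graph[n] if m in infected] for n in infected}
    best_node, best_score = None, float("inf")
    for candidate in sorted(subgraph):
        dist = bfs_distances(subgraph, candidate)
        if len(dist) < len(infected):
            continue  # candidate cannot reach every infected node
        score = sum(dist.values())
        if score < best_score:
            best_node, best_score = candidate, score
    return best_node


# Hypothetical undirected toy network (adjacency lists).
toy_graph = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b"],
    "d": ["b", "e"],
    "e": ["d"],
}
toy_infected = ["a", "b", "c", "d"]
print(estimate_source(toy_graph, toy_infected))
```

A heuristic of this kind identifies only a likely origin node; it says nothing about which rights changed hands along the way, which is the gap discussed next.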

Existing studies predominantly focus on determining rumor sources through network node centrality and reverse propagation modeling, with limited exploration of the characteristics of rumor data circulation or evolving data rights and liability attribution in social media environments. In social media, rumor data circulation involves not only information transmission between nodes but also processes such as creation, sharing, modification, and re-creation, accompanied by dynamic shifts in data holding rights, collection rights, revenue rights, and associated responsibilities.

Data provenance model

Data provenance is described as recording the evolutionary information and processing history of the original data throughout the full lifecycle of generation, propagation, and extinction (Dai et al., 2010). It records the data rights and the processing activities the data has undergone, such as storage, processing, usage, archiving, and deletion, and represents information about the data’s origin and creation process (Glavic et al., 2014).

The OPM model is a general-purpose data provenance model that enables the interoperability of data provenance information between different systems from a technical perspective. It defines three core concepts (entity, process, and agent) and is usually represented by a directed acyclic graph (Kwasnikowska et al., 2015). It has been applied in the source-aware system PADS for recording publications (Mahmood et al., 2013), in semantic-based data provenance, and in related domains (Ding et al., 2011). The Provenir model, constructed around three primary classes (Data, Process, Agent), represents workflow provenance information (Shen and Zhang, 2011), with applications spanning marine and biomedical domains. The PROV model, a W3C-released standard for data provenance, defines core concepts including Entity, Activity, and Agent, along with multiple relationship types. It provides detailed specifications for users to acquire, utilize, and verify provenance data while enabling multi-format information exchange and representation through OWL (Li and Wu, 2015; Group, W. C. P. W., 2013). PROV-O is a W3C-standardized lightweight ontology that structures provenance data (entities, activities, agents, and their relationships) using the OWL2 language, enabling semantic modeling and interoperability of cross-domain provenance information. It ontologically defines the core “Entity-Activity-Agent” triple and derived relationships (e.g., derivation, association, attribution), while ensuring reasoning feasibility through OWL-RL compatibility (Lebo et al., 2013). As the most universal and comprehensive data provenance model currently available, PROV has inspired scholarly investigations into its generative mechanisms and design optimizations (Wei and Deng, 2016), and has been applied extensively in geospatial tracing (Guillem et al., 2015), the structural evolution of metadata schemas (Chunqiu and Shigeo, 2018), and web application contexts (Ni and Meng, 2014).
Building upon PROV, the ProVOC model emerges as a flexible, lightweight provenance description framework. It comprises three primary components—data, activities, and executing entities (Computer Network Information Center et al., 2017), employing common descriptive approaches like directed acyclic graphs, classes, and associative relationships (Zhang and Lu, 2022). This model has gained substantial traction in geospatial data (Guo and Hu, 2020), infectious disease data (Zhu et al., 2021) and scientific research data (Chen et al., 2019), etc.
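The “Entity-Activity-Agent” structure described above can be illustrated with a small sketch. The relation names below (wasDerivedFrom, wasGeneratedBy, used, wasAssociatedWith, wasAttributedTo) are properties defined by PROV-O; everything else, including the identifiers and the helper functions, is a hypothetical illustration in plain Python rather than an RDF or OWL encoding.

```python
# Provenance statements as (subject, relation, object) triples.
# The identifiers post_v1, post_v2, edit_1, user_a and user_b are hypothetical.
triples = [
    ("post_v1", "wasAttributedTo", "user_a"),    # Entity -> Agent
    ("edit_1", "used", "post_v1"),               # Activity -> Entity
    ("edit_1", "wasAssociatedWith", "user_b"),   # Activity -> Agent
    ("post_v2", "wasGeneratedBy", "edit_1"),     # Entity -> Activity
    ("post_v2", "wasDerivedFrom", "post_v1"),    # Entity -> Entity
    ("post_v2", "wasAttributedTo", "user_b"),
]


def objects(statements, subject, relation):
    """Return every object linked to subject by relation."""
    return [o for s, r, o in statements if s == subject and r == relation]


def derivation_chain(statements, entity):
    """Follow wasDerivedFrom links back to the earliest known versions."""
    chain, seen, frontier = [entity], {entity}, [entity]
    while frontier:
        current = frontier.pop()
        for parent in objects(statements, current, "wasDerivedFrom"):
            if parent not in seen:
                seen.add(parent)
                chain.append(parent)
                frontier.append(parent)
    return chain


print(derivation_chain(triples, "post_v2"))
print(objects(triples, "post_v1", "wasAttributedTo"))
```

The sketch records who produced each version and how versions derive from one another, but it has no vocabulary for which rights each agent holds, which is the limitation discussed in the following paragraph.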

Although the existing data provenance models have been applied to various scenarios and domains such as scientific research, infectious disease, engineering and information systems (Wang et al., 2022), they still need to adapt to the complexity and dynamic nature of data rights transitions in data element circulation within social media. The OPM and PROV models lack a unified mechanism for the publication and access of provenance information; the PROV-O model supports ontological descriptions but has limited expressiveness for complex data processing workflows; the ProVOC model supports flexible expansion but has a narrow application scope, with its generality and management effectiveness yet to be further validated and evaluated.

Data provenance technology

Data provenance technology is used to verify the data source, infer the evolution process, and ensure data compliance and reliability (Wang et al., 2019). Different data provenance technologies suit different data types and application scenarios. The annotation method (Chiticariu et al., 2005) traces the historical state of data by recording information about its processing, using annotations to capture key elements of the original data such as background, author, time, and source. By reviewing the annotations of the target data, the data source can be traced effectively. The annotation method is simple, effective, and widely used, but it is inefficient in large-scale systems, especially when detailed provenance information must be provided for fine-grained data (Ming et al., 2012). The W7 model, a typical example of the annotation method, describes the annotation mode and model for data provenance tracking (Yazi, 2007) and uses semantic methods to represent and organize provenance information proactively in order to establish a data provenance model (Ram and Liu, 2009). The annotation method is applied in manufacturing, medical, biological, and other fields. The reverse query method, also known as the inverse function method, is more complex than the annotation method but requires less storage space (Ming et al., 2012). It uses a reverse query or a constructed inverse function to parse the query in reverse, or works backward through the transformation process, to deduce from the result how the original data was processed (Woodruff and Stonebraker, 1997). The key to the reverse query method is the construction of the inverse function, which directly affects query quality and algorithm performance.
In addition, there are also provenance technologies such as bit-vector-based provenance tracking (Gadang et al., 2008), bidirectional pointer tracking (Wang, 2009), graph theory and special query language tracking (Karvounarakis, 2010), etc.
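The core idea of the annotation method, that provenance records travel together with the data through each processing step, can be sketched as follows. This is an illustrative sketch only; the class name, the helper function, and the sample text are hypothetical and are not taken from any of the cited systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    """A data value bundled with the annotations accumulated so far."""
    value: str
    notes: tuple = ()


def apply_step(item, step_name, actor, func):
    """Transform the value and append a note describing the step,
    so the provenance record is propagated alongside the data."""
    note = f"{step_name} by {actor}"
    return Annotated(func(item.value), item.notes + (note,))


# Hypothetical raw post and two platform-side processing steps.
raw = Annotated("  Breaking NEWS  ", ("created by user_a",))
cleaned = apply_step(raw, "strip", "platform", str.strip)
lowered = apply_step(cleaned, "lowercase", "platform", str.lower)

print(lowered.value)
for note in lowered.notes:
    print(note)
```

Reading the notes of the final value recovers the full processing history without re-running or inverting any step, which is the trade-off against the reverse query method: more storage per item in exchange for simpler tracing.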

Existing data provenance technologies are primarily applied to structured data domains, lacking accurate modeling and computational support for unstructured text data and the evolving processes of data rights in social media.

In summary, existing research has provided theoretical and methodological support for tracing data and rights in rumor circulation across three dimensions: rumor data provenance methods, provenance model construction, and technical implementations. However, current rumor data provenance methods predominantly focus on identifying critical propagation nodes and reverse-path modeling, neglecting the dynamic behavioral evolutions—such as creation, sharing, modification, and re-creation—in social media scenarios, which trigger shifts in data rights and liability attribution. While mainstream provenance models like OPM, PROV, and ProVOC have been applied in vertical domains, their designs fail to adapt to the characteristics of social media data circulation, including broad data circulation, multiple subjects' participation, frequent data processing and recreation, and dynamic data rights changes. These models lack semantic representations for key dimensions such as data rights attribution, data propagation paths, and data infringement liabilities. Data provenance technologies, largely tailored to structured data, remain insufficient in supporting rights tracking, logical inference, and infringement subject identification, struggling to meet the dynamic and semantically interconnected demands of rumor data circulation in social media. Therefore, this paper focuses on data and rights provenance in the circulation of online rumors within social media scenarios, constructing a data provenance model that integrates rights characteristics, semantic modeling, and logical reasoning capabilities. This paper formalizes the circulation process of online rumor data elements and the corresponding data rights confirmation framework. Leveraging the ontological descriptive strengths of the PROV-O and ProVOC models, the PROV-OCC (Ontology-Circulation Confirmation) provenance model of data with rights and interests is established for online rumor data circulation. 
By incorporating the event annotation capabilities of the W7 provenance technology model, OWL2-based instance encoding and visualization of propagation paths are implemented using classic rumor cases. Ontological reasoning is applied to validate the provenance efficacy of the PROV-OCC model, enabling the identification and tracing of infringement subjects and propagation paths in rumor circulation.

Research questions and objectives

Existing research on the provenance of online rumor data predominantly focuses on identifying source nodes and modeling propagation paths, with insufficient investigation into the transitions of data rights among multiple rights subjects and the attribution of infringement liability during data element circulation in social media scenarios. Additionally, mainstream data provenance models such as OPM, PROV, and ProVOC fail to effectively adapt to the dynamic changes in data rights characteristics of the circulation of rumor data elements in social media scenarios, lacking unified and generalized methods for the provenance of data with rights and interests and their semantic representation.

Therefore, the research objective of this paper is to construct a data provenance model applicable to detecting data rights changes, tracing rumor propagation paths, and identifying infringement subjects during the circulation of online rumor data elements within social media scenarios. Specifically, this paper will explore and address the following research questions:

RQ1: What are the fundamental semantic units that characterize the provenance of data with rights and interests in the circulation of online rumor data within social media scenarios?

RQ1-1: In what way can the circulation process of online rumor data on social media platforms be abstractly and universally represented?

RQ1-2: What are the standardized semantic elements of data rights transitions in social media rumor data, as defined by the existing “five rights separation” framework for ternary data subjects?

RQ2: What constitutes an ontology-based provenance model of data with rights and interests that effectively captures the dynamics of data rights during the circulation of online rumor data in social media scenarios?

RQ2-1: What constitutes an ontology-based provenance model for data with rights and interests, grounded in the circulation processes of online rumor data and data rights confirmation frameworks in social media, explicitly representing data subjects, circulation processes, data rights, and their associative relationships?

RQ2-2: In what ways can the integration of the annotation-based model and the proposed provenance model of data with rights and interests semantically describe and represent the dynamic changes in data rights during the circulation of rumor data in social media?

RQ3: Based on the constructed provenance model of data with rights and interests, how can propagation paths be accurately traced and infringement subjects identified in rumor data circulation?

In response to the above research questions and analysis of existing data provenance models and technological developments, this paper proposes the following verifiable research hypotheses:

H1: The circulation process of online rumor data in social media scenarios involves multiple types of data subjects, and the participation behaviors of each type of subject will trigger corresponding changes in data rights.

H2: The provenance model of data with rights and interests designed in this paper can effectively record and trace data rights transitions in social media rumor data, enabling the inference and identification of infringement subjects and propagation paths.

Methods

To effectively address the research questions and verify the research hypotheses, this paper proposes a construction method for the provenance of data with rights and interests based on ontology modeling in the context of online rumor data element circulation in social media scenarios. The method aims to extend mainstream data provenance models to conduct ontological modeling of the data element circulation process, semantic units, and rights subjects involved in rumor dissemination on social media. It supports the detection of dynamic transitions of data rights, propagation paths, and infringing entities through ontology reasoning. This approach will fill the gap in existing research regarding data rights tracing. The research steps are as follows:

Abstraction of the rumor data flow process and standardization of the data flow rights confirmation system

Extract the concepts of data element circulation processes comprehensively from leading, authoritative, and reusable data lifecycle models. Through manual analysis of term frequency, semantic similarity, and applicability in domain practice, identify high-frequency and logically coherent categories as the core processes of data element circulation. This addresses RQ1-1. On this basis, define the connotations of these core processes by incorporating the actual data element circulation processes in mainstream social media scenarios. Finally, in line with the “five rights separation” framework based on ternary data subjects, a key outcome of prior research, specify and codify the scope of data rights and responsibilities for distinct data subjects throughout each circulation process.

Design of ontology-based data with rights and interests provenance model based on data flow rights confirmation

The data circulating on social media platforms possesses rights and interests attributes, which are abstractly represented as data elements. Design “Rights-and-Interests-Attributed Data Element” (to address RQ1-2 and RQ1) and integrate the advantages of the PROV-O and ProVOC data provenance models. Based on the ternary structure of “Rights-and-Interests-Attributed Data Element–Activity–Subject,” this paper constructs an ontology-based rights provenance model that fits the social media scenario and clearly defines the semantic relationships of the concepts and properties in the model during the online rumor data element circulation process (to address RQ2-1).
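As a rough illustration of what a rights-and-interests-attributed data element might look like in code, the sketch below attaches a rights table and a transition history to a data element. The right names “holding” and “revenue” follow rights mentioned in this paper; the subjects, the activity label, and the choice of which right moves at which stage are assumptions made purely for illustration and are not the paper's rights confirmation rules.

```python
from dataclasses import dataclass, field


@dataclass
class DataElement:
    """A data element with rights attributes (hypothetical structure)."""
    id: str
    rights: dict = field(default_factory=dict)   # right name -> current subject
    history: list = field(default_factory=list)  # recorded rights transitions


def transfer(element, right, new_holder, activity):
    """Move one right to a new subject and record the transition."""
    previous = element.rights.get(right)
    element.history.append((activity, right, previous, new_holder))
    element.rights[right] = new_holder


def holders_over_time(element, right):
    """List every subject that has held the given right, oldest first."""
    holders = []
    for _activity, name, previous, new_holder in element.history:
        if name == right:
            if not holders and previous is not None:
                holders.append(previous)
            holders.append(new_holder)
    return holders


# Hypothetical example: an assumed transfer during data collection.
video = DataElement("video_1", {"holding": "creator", "revenue": "creator"})
transfer(video, "holding", "platform", "data collection")

print(video.rights)
print(holders_over_time(video, "holding"))
```

Keeping the activity, the right, and both subjects in each history entry mirrors the ternary element, activity, and subject structure, so that a later query can ask which activity caused which rights change.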

Semantic representation and ontology encoding of the rights provenance model based on the annotation method

This paper uses the classic W7 model for semantic representation of the provenance model of data with rights and interests, tracking events within the provenance model through seven interrelated elements: “What,” “When,” “Where,” “How,” “Who,” “Which,” and “Why,” which helps address RQ2-2 and RQ2. A representative case of abnormal circulation of rights data, an online rumor that spread on the DouYin platform (a mainstream Chinese social media platform) in 2023 titled “The rumor that a founding general was slandered as a traitor,” is selected for ontology instantiation in this provenance framework. The provenance process is encoded using the ontology language OWL2, detailing the core classes and properties, including non-hierarchical semantic relations, within the ontology-based provenance model to ensure semantic precision and consistency. This supports H1.
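The seven W7 elements can be pictured as a single annotation record attached to each circulation event. The sketch below is a hedged illustration in plain Python, not the OWL2 encoding used in this paper; the field values are placeholders and do not describe the actual DouYin case.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class W7Annotation:
    """One provenance event described by the seven W7 elements."""
    what: Optional[str] = None   # the event that occurred
    when: Optional[str] = None   # time of the event
    where: Optional[str] = None  # location or platform of the event
    how: Optional[str] = None    # action that led to the event
    who: Optional[str] = None    # subject involved in the event
    which: Optional[str] = None  # instrument or tool used
    why: Optional[str] = None    # reason for the event


def missing_elements(annotation):
    """Return the W7 elements that have not been filled in yet."""
    return [name for name, value in asdict(annotation).items() if value is None]


# Placeholder values only; none of them come from the real case.
event = W7Annotation(
    what="video reposted",
    when="2023-01-01T00:00:00",
    where="social media platform",
    how="repost",
    who="user_b",
)

print(missing_elements(event))
```

A completeness check of this kind is one simple way to see which provenance questions an annotated event can and cannot yet answer.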

Verification of the provenance model of data with rights and interests

This paper designs competency questions and creates corresponding SWRL inference rules. The Pellet reasoner within the Protégé environment is used to perform ontology reasoning on the integrated and encoded provenance model, to verify whether the model can accurately record and track changes in data subject rights, locate infringers, and trace the infringement propagation path of rumor data. Typical online rumor cases are selected to instantiate the ontology-based provenance model and to build the corresponding case knowledge graphs, competency questions, and SWRL inference rules. The Pellet reasoner is used to discover implicit logical relationships between classes and instances, infer and confirm the infringers and victims of data rights, and trace the main infringement chain in the rumor case, thereby evaluating the effectiveness and accuracy of the model. This addresses RQ3 and supports H2.
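The kind of inference that such rules support can be approximated in plain Python. The sketch below is a stand-in for illustration only: it does not use SWRL, Pellet, or Protégé, and the single rule it applies (a subject infringes when it performs an activity on a data element without having been granted the right that the activity requires) is an assumed example rather than one of this paper's rules. All identifiers and right names are hypothetical.

```python
# Each activity consumes one data element and produces another.
# "needs" names the right the actor must hold on the input (None if no input).
activities = [
    {"id": "publish_1", "actor": "user_a", "input": None,
     "output": "post_1", "needs": None},
    {"id": "edit_1", "actor": "user_b", "input": "post_1",
     "output": "post_2", "needs": "processing"},
    {"id": "share_1", "actor": "user_c", "input": "post_2",
     "output": "post_3", "needs": "usage"},
]

# Rights explicitly granted: (subject, data element, right).
granted = {("user_c", "post_2", "usage")}


def find_infringers(acts, grants):
    """Assumed rule: an actor infringes if the activity needs a right on
    its input that the actor was never granted."""
    return sorted({
        a["actor"] for a in acts
        if a["needs"] is not None
        and (a["actor"], a["input"], a["needs"]) not in grants
    })


def trace_path(acts, element):
    """Walk back from a data element to its origin through the activities."""
    produced_by = {a["output"]: a for a in acts}
    path, seen = [element], {element}
    while element in produced_by and produced_by[element]["input"] is not None:
        element = produced_by[element]["input"]
        if element in seen:
            break  # guard against cyclic records
        seen.add(element)
        path.append(element)
    return path


print(find_infringers(activities, granted))
print(trace_path(activities, "post_3"))
```

In an ontology setting the same two questions would be posed as competency questions, with the rule expressed in SWRL and the answers derived by the reasoner instead of by hand-written functions.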

Model design

Abstraction of the online rumor data element circulation process and specification of the data rights confirmation framework

The spread of online rumors is not only an information diffusion process but also the process of online rumor data circulating on social media platforms as data elements. This process is accompanied by the platform's revenue sharing and involves the transfer of multiple subjects' data rights. Therefore, it is necessary to clarify the generic process of data element circulation and the corresponding data rights confirmation framework.

Definition of the circulation process of online rumor data elements

Process concepts based on the data lifecycle model reuse

Data element circulation is generally a process in which data providers and demanders pass data according to certain rules, and it is regarded as a subset of the data lifecycle that focuses on data transfer and exchange. The main data lifecycle models include ANDS (Burton and Treloar, 2009), BLM (Management, T. B. o. L., 2022), CSA (Atayero and Feyisetan, 2011), DataONE (Michener et al., 2012), DCC (Goth, 2012), DDI (Emaldi et al., 2015), DigitalNZ (CEOS, 2012), Ecoinformatics (Rüegg et al., 2014), Generic Science (CEOS, 2012), Geospatial (FGDC, 2010), and UK Data Archive (Emaldi et al., 2015), as well as the models proposed by Siddiqa et al. (2016) and by Qin Shun and Xing Wenming (Qin and Xing, 2021). From these 13 reusable data lifecycle models (Sinaeepourfard, 2017), a total of 92 concepts related to data element circulation were extracted and summarized into 10 categories by manual clustering. During clustering, the 92 concepts were first sorted comprehensively, and the logical basis of the category classification was determined by manual analysis of their term frequency, semantic similarity, and applicability in industry practice. Finally, the five categories with the highest term frequency and the strongest practical logic were selected as the core links of data element circulation, namely data collection, data processing, data storage, data transactions, and data usage, which represent the main circulation paths of data elements, as shown in Table 1.

Table 1 The term frequency of data lifecycle concept clustering and its typical link sub-item statistics.

The concept of the data element circulation process in social media scenarios

In the social media scenario, the definition of data element circulation needs to be combined with the characteristics of rumor spreading to ensure its effectiveness. The White Paper on Data Elements (2023) released by the China Academy of Information and Communications Technology (CAICT) notes that social media platforms are representative carriers of data element circulation and key scenarios for changes in data rights and interests. When users access a social media platform, data providers and consumers sign user agreements to enhance their experience. Data providers are divided into individuals, enterprises, public institutions, or government departments according to the type of subject, and can publish content and share knowledge, experience, and opinions on social media platforms independently or by signing contracts. The social media platform first reviews the content published by the data provider to ensure the quality and credibility of the data, and then saves and uploads the raw data to the platform cloud to complete the collection and storage of the raw data. To facilitate the subsequent use of data, the platform also converts the data format, adds platform watermarks, implants advertisements, and so on, and labels and classifies the data according to its content.
After this standardized processing, the data is published on the social media platform and pushed to the target audience (Shuang, 2021), where it accumulates engagement from platform users in the form of effective views, reposts, likes, and comments, thereby creating data value. Data providers then receive the corresponding revenue shares and subsidies under their prior contract or agreement with the platform. Examples include Bilibili’s “Creation Incentive Plan” (Shanghai Kuanyu Digital Technology Co., L, 2018), Weibo’s “Creator Advertising Sharing Plan” (Weibo, 2023), YouTube’s “Partner Program” (Google, 2024), and X (Twitter)’s “Creator Monetization Standards” (Twitter, 2024), in which the number of views, likes, reposts, and other indicators of the published content serve as the basis for the creator’s share. At the same time, when the data demand side browses content of interest on the social media platform, it can contact the relevant parties through the platform, use the integrated information on data supply and demand provided by the platform, apply for part of the data usage rights and sign a contract, and carry out secondary processing and creation of the data according to the provisions of the contract.

The above process is generalized as follows. When data providers and data demanders join the platform, they sign an agreement that protects the privacy and rights of data subjects and regulates the use of data. The platform reviews the data provided by the data providers, completes the collection of the raw data, and makes the first backup after the review is passed. The platform then standardizes the raw data internally and regularly backs up the processed data to prevent data loss. The platform integrates and publishes the standardized data and, at the same time, matches the information of the data supply and demand sides, playing its intermediary role to meet the needs of the data demand side. To obtain the intended data or part of its usage rights, the data demand side engages in the transfer of rights and interests in the data published by the platform, which includes effectively browsing, forwarding, liking, and commenting on the data in social media, as well as obtaining partial rights to use the data. After signing a contract with the data demander, the platform delivers the data to the demander and settles funds with the data providers according to the contract to complete the data transaction. In addition, the data providers can also perform secondary processing on the data according to the contract. The circulation process is shown in Fig. 1.

Fig. 1: The pervasive circulation process diagram of data elements in social media scenarios.

Fig. 1 describes the complete circulation process of data among the data provider, social media platform, and data demander, covering five generalized circulation stages: data collection, data storage, data transactions, data processing, and data usage.
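The five generalized stages can also be written down as a small enumeration against which concrete platform actions are classified. The sketch below is illustrative only; the action names are hypothetical labels for the steps described in the generalized process, and the mapping of each action to a stage is an assumption.

```python
from enum import Enum


class Stage(Enum):
    """The five generalized circulation stages named in this paper."""
    COLLECTION = "data collection"
    STORAGE = "data storage"
    PROCESSING = "data processing"
    TRANSACTIONS = "data transactions"
    USAGE = "data usage"


# Hypothetical action labels mapped to stages (assumed mapping).
ACTION_STAGE = {
    "review_and_ingest": Stage.COLLECTION,
    "backup": Stage.STORAGE,
    "standardize": Stage.PROCESSING,
    "sign_contract_and_deliver": Stage.TRANSACTIONS,
    "secondary_creation": Stage.USAGE,
}


def stages_of(actions):
    """Translate a sequence of platform actions into circulation stages."""
    return [ACTION_STAGE[action] for action in actions]


# One pass through the generalized process, with a backup after
# ingestion and another after standardization.
flow = [
    "review_and_ingest",
    "backup",
    "standardize",
    "backup",
    "sign_contract_and_deliver",
    "secondary_creation",
]

for action, stage in zip(flow, stages_of(flow)):
    print(f"{action}: {stage.value}")
```

Because storage recurs after both collection and processing, the stages are better read as categories of activity than as a strict linear pipeline.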

The connotation of the data element circulation process for online rumor data

The circulation process of data elements involves multiple core links from generation to application, which is the key path of data element circulation in social, economic and technological environments. Based on the data lifecycle model, the above five core links are extracted and clustered, namely data collection, data storage, data processing, data transactions and data usage. Based on this, a generalized description of the data element circulation process in the context of social media is provided. To further clarify the specific meanings of the five stages in the circulation of data elements (including online rumor data), each will be defined and explained in detail below to address RQ1-1.

Data collection refers to the acquisition of raw data elements (including online rumor data) from multiple channels. Data sources must be identified and data must be collected by legal means. The collected data requires compliance testing and quality control to ensure its availability, accuracy, completeness, and consistency. The collection process needs to record the source of the data, the collection time, the collection method, and other information, and also needs to take the authenticity of the data into account, to support subsequent use, traceability, and audit.

Data storage refers to the structured or semi-structured storage and management of collected data (including online rumor data) and of the related contracts generated during data transactions. Data indexes are created for easy retrieval and reprocessing, and data and agreements are backed up regularly to ensure effective recovery and to prevent disputes and conflicts in the event of data loss or corruption.

Data processing refers to the standardized processing of raw data (including online rumor data), such as data cleaning, data conversion, data aggregation, etc., to meet specific needs or applications. Relevant data protection and data privacy policies must be observed during data processing, including the protection of sensitive information and obtaining data usage permits, to ensure compliance with the flow of data elements.

Data transactions refer to the transfer of data elements (including online rumor data) from one entity to another entity through contracts, agreements and other forms for the purpose of value exchange between different entities, including transaction preparation, transaction matching, transaction contract signing, delivery and settlement, etc. Before a data transaction, both parties to the transaction need to clarify the content, structure, rights and value of the data elements, on the basis of which a contract or agreement is drawn up to determine the mode of delivery, conditions of use, change of rights, price and time limit. Consideration also needs to be given to the legal and ethical liability that rumor data may entail.

Data usage refers to the use of acquired data (including online rumor data) as a production factor in business, scientific research, public services, livelihood and other fields, so as to realize the value of data. Data usage includes reading data, writing data, analyzing data, visualizing data, mining data, secondary processing data, etc.

In the context of online rumors circulating through social media, the circulation process of data elements involves changes in data rights. Implementing Data Rights Confirmation enhances compliance and privacy protection during data element circulation, further improving data traceability and control.

Specification of the data rights confirmation system for the circulation of online rumor data elements

This paper applies the results of our previous research, namely the “five rights separation” system based on ternary data subjects, to carry out data rights confirmation for circulating data (including online rumor data) (Zhao et al., 2025). The core concept of this system is to divide the data rights confirmation system into five dimensions: data collection rights, data management rights, data holding rights, data revenue rights, and data usage rights. This division corresponds to the intertwined nature of rights across the five processes of data collection, data processing, data storage, data transactions, and data usage within the full process of data element circulation. At the same time, the rights confirmation subjects are categorized into three types: public data, corporate data, and personal data. Through the establishment of standardized classifications of data rights under the “five rights separation” framework, this model clarifies the scope of rights and responsibilities for different data subjects in each dimension, thus enabling the one-by-one confirmation of rights. Each of the three types of data subjects holds all or part of the “five rights” with respect to their corresponding categories of data elements. As data elements (including online rumor data) circulate, whether in a sequential, non-sequential, or leapfrogging manner through the five stages of circulation, the ownership of data rights by the relevant data subjects also undergoes corresponding changes. The “five rights” framework identifies the primary processes involved in each type of data right, as well as the potential stages of data element circulation that may be implicated. Based on this structure, the framework further defines the ownership content of each of the three data categories across the five dimensions of data rights. The detailed structure of the data rights confirmation system is presented in Table 2.

Table 2 Data (including online rumor data) rights confirmation system based on “five rights separation” of tripartite data subjects.
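The correspondence between the five rights and the five circulation stages described above can be sketched in a few lines of Python. This is only an illustrative sketch: the one-to-one pairing in `PRIMARY_STAGE` is an assumption read off the parallel ordering of the two lists in the text, and the names `Stage`, `Right`, and `rights_touched` are hypothetical helpers, not part of the authors' system.

```python
from enum import Enum


class Stage(Enum):
    COLLECTION = "data collection"
    PROCESSING = "data processing"
    STORAGE = "data storage"
    TRANSACTIONS = "data transactions"
    USAGE = "data usage"


class Right(Enum):
    COLLECTION = "data collection rights"
    MANAGEMENT = "data management rights"
    HOLDING = "data holding rights"
    REVENUE = "data revenue rights"
    USAGE = "data usage rights"


# Assumption: each right is paired with one primary stage, following the
# order in which the text lists the rights and the processes.
PRIMARY_STAGE = {
    Right.COLLECTION: Stage.COLLECTION,
    Right.MANAGEMENT: Stage.PROCESSING,
    Right.HOLDING: Stage.STORAGE,
    Right.REVENUE: Stage.TRANSACTIONS,
    Right.USAGE: Stage.USAGE,
}


def rights_touched(path):
    """Return the rights whose primary stage appears in a circulation path.

    The path may be sequential, non-sequential, or leapfrogging; only the
    set of stages visited matters here.
    """
    visited = set(path)
    return [right for right, stage in PRIMARY_STAGE.items() if stage in visited]


# A leapfrogging path that jumps from collection straight to usage.
leapfrog = rights_touched([Stage.COLLECTION, Stage.USAGE])
```

Under these assumptions, a path that skips storage, processing, and transactions only implicates the collection and usage rights; the full system in Table 2 additionally distinguishes public, corporate, and personal data subjects.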

Design of an ontology-based provenance model of data with rights and interests based on data element circulation

Semantic unit definition

In the circulation of data elements (including online rumors), social media platforms need to introduce data types with rights and interests attributes to safeguard the data rights and interests of data subjects. Based on these attributes, data can be classified into public data, corporate data, and personal data. Public data (Qian and Hu, 2014) are collected, opened, and shared by government departments or public institutions; corporate data (Wang et al., 2022) are generated in the production and operation of enterprises or obtained through collection and processing; personal data (Wang et al., 2022) are divided into data created by individuals and privacy-related information. Through posting, forwarding, or receiving, these data blocks generate value or revenue for users and platforms (Ghani et al., 2019). For example, DouYin, Today’s Headlines, YouTube, Facebook, and Twitter incentivize quality content creation through revenue sharing. Data blocks have rights and interests attributes (Duan et al., 2023; Wei et al., 2019), such as being bought and sold, transferred, and exchanged, which enable data producers and publishers, platforms and regulators, receivers and forwarders, and users to share and benefit from the value of the data (Tang et al., 2012).

In this paper, the data with rights and interests attributes described above are abstracted as “Rights-and-Interests-Attributed Data Element.” These elements consist of object classes, properties, and representations. The object class of the Rights-and-Interests-Attributed Data Element is an abstract concept represented as a data block. Each data block possesses its own characteristics, such as properties, relationships, and tradability. Properties are used to describe the features of Rights-and-Interests-Attributed Data Element, including variable properties such as time, space, and data rights, as well as constant properties such as names and numbers. Relationships are used to reflect the connections between Rights-and-Interests-Attributed Data Element and entities or activities, including rights, generation, and use. Tradability is used to measure the capacity of data elements with rights and interests to circulate and exchange in the data element market, including quality, value, degree, etc. Representation refers to the value domain of the characteristic, describing how it is expressed, and each characteristic has only one representation (Yuan and Chen, 2008).

The concept of the Rights-and-Interests-Attributed Data Element is shown in Fig. 2. Each data block has characteristics such as a number and a time; the number can be represented by “letters and numbers” and the time by “DATE”, and object classes and characteristics correspond one to one. A Rights-and-Interests-Attributed Data Element concept is composed of an object with rights and interests, i.e., a data block, and its unique characteristics; multiple such concepts, together with their representations, constitute a Rights-and-Interests-Attributed Data Element. This addresses RQ1-2 and RQ1.

Fig. 2: Conceptual diagram of rights-and-interests-attributed data element.

The structure of the Rights-and-Interests-Attributed Data Element is illustrated, showing its object classes, properties, and representations.
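The object class, characteristic, and representation structure can be illustrated with a small Python sketch. It assumes that a representation can be approximated by a Python type acting as the value domain; the class names `Characteristic` and `DataBlock`, the number "V001", and the date are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Characteristic:
    name: str               # e.g. "number" or "time"
    representation: type    # the single value domain of this characteristic
    value: object
    variable: bool = False  # True for time, space, data rights; False for names, numbers

    def __post_init__(self):
        # A value must fall inside the value domain given by the representation.
        if not isinstance(self.value, self.representation):
            raise TypeError(
                f"{self.name} must be represented as {self.representation.__name__}"
            )


@dataclass
class DataBlock:
    """Object class: a data block together with its characteristics."""
    characteristics: dict = field(default_factory=dict)

    def add(self, c: Characteristic) -> None:
        # Each characteristic has only one representation.
        if c.name in self.characteristics:
            raise ValueError(f"{c.name} already has a representation")
        self.characteristics[c.name] = c


block = DataBlock()
block.add(Characteristic("number", str, "V001"))                           # "letters and numbers"
block.add(Characteristic("time", date, date(2022, 10, 1), variable=True))  # "DATE"
```

The two checks mirror the constraints stated in the text: a characteristic's value must match its representation, and a second representation for the same characteristic is rejected.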

Provenance model design

Based on the definition of the Rights-and-Interests-Attributed Data Element, this paper combines the three parent classes of PROV-O (entity, activity, and agent) and the parameter concepts of the ProVOC data provenance model with the online rumor data circulation process and the data rights confirmation framework, and puts forward three parent classes (Rights-and-Interests-Attributed Data Element, activity, and entity) together with their conceptual properties. The resulting PROV-OCC (Ontology-Circulation Confirmation) model is designed to record changes in data rights and to trace the infringing subjects and propagation paths of rumor data. The three parent classes, shown in Table 3, form the basis of the data provenance model and are used to create provenance descriptions for their subclasses. Rights-and-Interests-Attributed Data Elements are the data elements in circulation, represented by data blocks in social media scenarios. Activities are the processes (events) that accompany the circulation of Rights-and-Interests-Attributed Data Elements, including major processes and sub-processes such as data collection and data storage. Entities include service platforms and service objects, which are divided into individuals, enterprises, and public institutions. During data element circulation, properties describe the basic characteristics and status of entities, activities, and Rights-and-Interests-Attributed Data Elements, and are divided into variable properties (provenance-related properties) and invariant properties. When data elements circulate on a single social media platform, activities are used as invariant properties to record the changes of a Rights-and-Interests-Attributed Data Element. When data elements circulate across multiple social media platforms, activities are used as variable properties to record the changes of a Rights-and-Interests-Attributed Data Element in the same activity on each platform. This supports the provenance of data with rights and interests across social media platforms.

Table 3 PROV-OCC model concept.

In the PROV-OCC model, there are 32 object properties between classes, and their names, explanations, and attribute facets are shown in Table 4.

Table 4 Classification and explanation of object properties in the PROV-OCC model.

In the PROV-OCC model, the core structure of the model formed by the three parent classes of entity, activity and Rights-and-Interests-Attributed Data Element and their corresponding subclasses connected with each other through object properties is shown in Fig. 3. Authorized object properties exist among the subclasses of entities, such as the authorization granted by data providers to social media platforms. A derivative object property exists between two Rights-and-Interests-Attributed Data Elements, such as when one Rights-and-Interests-Attributed Data Element can derive new Rights-and-Interests-Attributed Data Elements. The object properties among entities, Rights-and-Interests-Attributed Data Elements, and activities are expanded from the following aspects: Entities and activities are interrelated, and activities are also associated with Rights-and-Interests-Attributed Data Elements. Entities hold data rights over Rights-and-Interests-Attributed Data Elements, which can be generated, processed, and otherwise manipulated by entities. Activities affect Rights-and-Interests-Attributed Data Elements through actions such as generation, deletion, and modification. Meanwhile, entities, Rights-and-Interests-Attributed Data Elements, and activities all possess data properties. This addresses RQ2-1.

Fig. 3: PROV-OCC model core structure.

In the PROV-OCC model, entities, activities, and Rights-and-Interests-Attributed Data Elements are interconnected through object properties, forming various relationships such as authorization, derivation, generation, and influence.

In this paper, Protégé 5.5 is used to visualize the PROV-OCC model and show the hierarchy and relationship of the parent classes and subclasses in the ontology. Limited by space, this article only shows three layers of ontology, as shown in Fig. 4.

Fig. 4: PROV-OCC model ontology.

The hierarchical structure and “is-a” relationships of the three-tier ontology of the PROV-OCC model were visualized using Protégé.

Semantic representation and ontology-based encoding and instantiation of the data provenance model for Rights-and-Interests-Attributed Data Element based on an annotation-based method

Semantic representation of fusion of W7 and PROV-OCC model

In this paper, the W7 model is used for the semantic analysis (Jia and Kou, 2016) of the PROV-OCC data provenance model. The W7 model is a provenance technology based on full-lifecycle labeling of data provenance. It is composed of seven interrelated elements: “What”, “When”, “Where”, “How”, “Who”, “Which”, and “Why”, each of which can be used to track events affecting data (Ram and Liu, 2009). In the PROV-OCC model constructed in this paper, “What”, the basic component of the W7 model, corresponds to the activity, i.e., the processes (events) that accompany the flow of Rights-and-Interests-Attributed Data Elements in data element circulation, including data collection, data storage, data transactions, data processing, data usage, and their sub-processes. “How” refers to the object properties that connect the parent classes Rights-and-Interests-Attributed Data Element, activity, and entity, such as having, revising, and handling. “Who” refers to the agent that caused the event to occur and that plays a certain role in, and makes a certain contribution to, the event; it corresponds to the service objects and service platforms in the social media scenario, specifically individuals, enterprises, and public institutions. “When” refers to the time point or time period at which each activity occurs, such as the start time and end time of data collection. “Where” refers to the space in which the activity occurs, that is, the place where the Rights-and-Interests-Attributed Data Element changes, such as the IP address of the service object or the network space of the service platform. “Which” in the original W7 model expresses the program or tool used in the event; in the PROV-OCC model, it instead expresses the data rights of an entity that may change with the event, including data usage rights, data collection rights, data revenue rights, data management rights, and data holding rights. “Why” refers to the reason for the provenance activity, that is, to determine the compliance of the Rights-and-Interests-Attributed Data Element, the authenticity and accuracy of its content, and whether it involves infringement or other violations. The semantic representation is shown in Table 5. This addresses RQ2-2 and RQ2.

Table 5 Semantic representation of W7 and PROV-OCC model fusion.
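As a rough illustration of the fused semantics, one provenance event can be held in a record with seven fields, one per W7 element. The sketch below is an assumption about how such a record might look, not the authors' data structure; `W7Record` and `missing_elements` are hypothetical names, and the sample values are borrowed from the DouYin case discussed later in the paper.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class W7Record:
    what: str     # activity (circulation process)
    how: tuple    # object properties linking elements, activities, entities
    who: tuple    # entities: service platform or service objects
    when: tuple   # (start, end) of the activity
    where: str    # space in which the activity occurs
    which: tuple  # data rights held or changed (PROV-OCC reading of "Which")
    why: str      # reason for the provenance activity


def missing_elements(record: W7Record) -> list:
    """Names of W7 elements left empty, i.e. gaps in the provenance label."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = W7Record(
    what="Data transactions",
    how=("hasPublishedBy", "hasRecommendedThrough"),
    who=("DouYin",),
    when=("2022-10-01T00:00:00", "2023-07-30T00:00:00"),
    where="https://www.douyin.com",
    which=("data management rights", "partial data revenue rights"),
    why="trace the rumor source and the infringed party",
)
```

A completeness check such as `missing_elements` is one simple way to confirm that every activity carries all seven labels before it is encoded in the ontology.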

Instantiated ontology model coding

Provenance example

To verify the effectiveness of integrating the W7 and PROV-OCC models, this paper selects a well-known 2023 online rumor case involving the abnormal circulation of data with rights and interests on DouYin, one of China’s mainstream social media platforms: the “Major General Found as a Renegade” case (Hangzhou Internet Court, 2023). The facts of the case are as follows. In October 2022, Guo Moumou (an account with 28,000 followers) found an article while browsing the Internet, which was believed to have attracted widespread attention. Guo Moumou then produced and released a video titled “Top Ten Worst Traitors in the History of the Communist Party of China” through his DouYin self-media account, in which he misappropriated the portrait of Major General He Kexi of China as the portrait of a negative historical figure. From October 2022 to July 2023, to attract more traffic, the DouYin short-video bloggers He Moumou (3.56 million followers) and Fu Moumou (about 5000 followers) re-created the video and released it through their own self-media accounts. These fake videos were watched and forwarded in large numbers on the Internet, receiving more than 7500 likes and causing a serious negative social impact. In early August 2023, Ms. Shi, a family member of General He Kexi, received a phone call from the Party History Research Office of the Ningbo Municipal Party Committee and learned that some self-media accounts had used her grandfather’s photo on the DouYin platform and were suspected of slandering and insulting heroes. The family members reported the matter to the Xihu District Procuratorate and contacted the DouYin platform for investigation and evidence collection. On December 21, 2023, the Hangzhou Internet Court ordered the defendants Guo Moumou, He Moumou, and Fu Moumou to apologize publicly in nationally influential media and to pay 100,000 CNY, 40,000 CNY, and 10,000 CNY, respectively, in public welfare damages. The infringing videos and accounts were deleted by DouYin.

Instantiation description of the case

The graph instantiation of the W7 and PROV-OCC fusion model shows the circulation status of the Rights-and-Interests-Attributed Data Element, i.e., the propagation path of the online rumor data, the data infringement content and the provenance status record of the infringement subject, as shown in Fig. 5.

Fig. 5: Instantiated graph of data with rights and interests from the fusion of W7 and PROV-OCC models.

The figure illustrates the flow of data rights, instantiated as a knowledge graph, for the online rumor case “A Founding Major General Falsely Accused as a Traitor”. It depicts the complete lifecycle of the data, from generation to dissemination, further processing, and eventual transaction, highlighting the flow of data rights among the entities involved and the relationships used in tracing infringements.

The service object “A” (Guo Moumou) created and generated a raw video called “Top Ten Worst Traitors in the History of the Communist Party of China” on the service platform “DouYin”, namely the Rights-and-Interests-Attributed Data Element “V”. “A” has the data rights (holding rights, usage rights, collection rights, revenue rights and management rights) of “V”. The video involves the portrait photo “Vimage” of Major General He Kexi of China; the data rights of this photo belong to “G” (Ms. Shi, etc.), a close relative of He Kexi. During the data collection activity, “A” misappropriated “Vimage” as an avatar for a negative historical character in the video. In the data processing activity, “A” processed “Vimage” to create an original video, namely the Rights-and-Interests-Attributed Data Element “V”, which is derived from “Vimage”. In the data storage activity, “A” uploaded and stored “V” on “DouYin”. In the data transactions activity, “DouYin” published “V” and recommended it to service objects “B” (He Moumou) and “C” (Fu Moumou). According to DouYin’s creator incentive plan and income rules, “A” shares profits from the video. At this time, “G” held part of the revenue rights of “V”, and “DouYin” held the management right and partial revenue rights of “V”. “B” and “C” browsed “V” and, under the agreement between DouYin and the creator, obtained authorization from “A” for “V”. At this point, “B” and “C” held the data collection rights and data usage rights of “V”. In the data usage activities, “B” and “C” reprocessed “V” and created new videos “V-1” and “V-2” respectively. At this time, “B” and “C” have the data rights of “V-1” and “V-2” respectively. In data transactions activities, “B” and “C” publish “V-1” and “V-2” on “DouYin” and share profits from it. At this time, “G” and “A” have part of the revenue rights of “V-1” and “V-2”, “DouYin” has partial revenue rights and management rights of “V-1” and “V-2”. 
“DouYin” continues to recommend “V”, “V-1” and “V-2” to a large number of service objects for viewing and reprinting, giving these videos widespread visibility. At this time, the service objects that reprint the videos have the usage rights and part of the revenue rights of “V”, “V-1” and “V-2”; the service objects that browse them have part of the revenue rights of “V”, “V-1” and “V-2”; and “DouYin” has the management rights and partial revenue rights of “V”, “V-1” and “V-2”.

After receiving the report and confirming that “V”, “V-1” and “V-2” are rumor data, “DouYin” needs to identify the source of the rumor information and the victims of the data infringement. Therefore, it is necessary to trace the provenance of the data with rights and interests “V”, “V-1” and “V-2”. This supports H1.
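The rights assignments narrated in this case can be restated as a small lookup structure, which makes it easier to check who holds which right over which element. The following Python sketch is illustrative only: the helper names `grant`, `holders`, and `lineage` are hypothetical, the rights of later browsing and reprinting users are omitted, and the entries simply transcribe the instantiation description rather than any output of the PROV-OCC ontology.

```python
FULL = ("holding", "usage", "collection", "revenue", "management")

rights = set()  # (entity, element, kind of right)


def grant(entity, element, *kinds):
    for kind in kinds:
        rights.add((entity, element, kind))


grant("G", "Vimage", *FULL)  # the portrait belongs to the relatives of He Kexi
grant("A", "V", *FULL)       # "A" created the original video "V"
grant("G", "V", "partial revenue")
grant("DouYin", "V", "management", "partial revenue")
for creator, video in (("B", "V-1"), ("C", "V-2")):
    grant(creator, "V", "collection", "usage")  # authorization obtained from "A"
    grant(creator, video, *FULL)                # secondary creation
    grant("G", video, "partial revenue")
    grant("A", video, "partial revenue")
    grant("DouYin", video, "management", "partial revenue")

# Derivation links stated in the case: "V" from "Vimage", "V-1"/"V-2" from "V".
derived_from = {"V": "Vimage", "V-1": "V", "V-2": "V"}


def holders(element, kind):
    """Entities holding a given kind of right over an element."""
    return sorted(e for (e, el, k) in rights if el == element and k == kind)


def lineage(element):
    """Walk the derivation chain back to the original data block."""
    chain = [element]
    while chain[-1] in derived_from:
        chain.append(derived_from[chain[-1]])
    return chain
```

For example, `lineage("V-1")` walks back through “V” to “Vimage”, which is the chain a provenance query has to recover in order to connect the secondary videos to the misappropriated portrait.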

Provenance process coding of the OWL2 language

This paper employs the OWL2 language to encode the elements and relationships of the PROV-OCC model for rights-related data provenance, which is based on W7 technology. This encoding facilitates efficient querying and reasoning for data provenance. Here are the main elements of the encoding:

“Who”: Describes the entity of the PROV-OCC, that is, the service platform “DouYin” and the service object “Individual”, coded as follows:

<!-- http://www.PROV-OCC.com#DouYin -->
<owl:Class rdf:about="http://www.PROV-OCC.com#DouYin">
    <rdfs:subClassOf rdf:resource="http://www.PROV-OCC.com#Service_Platform:_Enterprise"/>
</owl:Class>

“How”: Describes the relationships among entities, Rights-and-Interests-Attributed Data Elements, and activities in the PROV-OCC model, such as: hasGeneratedBy, hasPublishedBy, hasProfitsSharedBy, hasBrowsedBy, hasRecommendedThrough, hasAddedBy, hasReprintedBy. The coding is as follows:

<!-- http://www.PROV-OCC.com#hasRevenueRight -->
<owl:ObjectProperty rdf:about="http://www.PROV-OCC.com#hasRevenueRight">
    <rdfs:subPropertyOf rdf:resource="http://www.w3.org/2002/07/owl#topObjectProperty"/>
    <rdfs:domain rdf:resource="http://www.PROV-OCC.com#Entity"/>
    <rdfs:range rdf:resource="http://www.PROV-OCC.com#Rights-and-Interests-Attributed_Data_Element"/>
</owl:ObjectProperty>

“What”: Describes the activities in the PROV-OCC model, i.e., the processes of data element circulation. For example, the “Data usage” activity is associated with service objects “B” and “C”, who reprocessed “V” and respectively generated new Rights-and-Interests-Attributed Data Elements, “V-1” and “V-2”. The coding is as follows:

<!-- http://www.PROV-OCC.com#Data_Usage:_V-1_has_Generated_by_B,_V-2_has_Generated_by_C -->
<owl:NamedIndividual rdf:about="http://www.PROV-OCC.com#Data_Usage:_V-1_has_Generated_by_B,_V-2_has_Generated_by_C">
    <rdf:type rdf:resource="http://www.PROV-OCC.com#Data_usage"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#B"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#C"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#V"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#V-1"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#V-2"/>
</owl:NamedIndividual>

“When”: Describes the time period when the activities in the PROV-OCC model occur and the time when the Rights-and-Interests-Attributed Data Element changes (e. g., generation, release, secondary processing). For example, the time span of the “Data transactions” activity is from October 2022 to July 2023. The coding is as follows:

<!-- http://www.PROV-OCC.com#Data_Transactions:_V,_V-1_and_V-2_have_been__published/recommended_by_DouYin -->
<owl:NamedIndividual rdf:about="http://www.PROV-OCC.com#Data_Transactions:_V,_V-1_and_V-2_have_been__published/recommended_by_DouYin">
    <www:hasActivityEndTime rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2023-07-30T00:00:00</www:hasActivityEndTime>
    <www:hasActivityStartTime rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2022-10-01T00:00:00</www:hasActivityStartTime>
</owl:NamedIndividual>

“Where”: Describes the location where activities occur and where the Rights-and-Interests-Attributed Data Elements are stored in the PROV-OCC model. For example, in this scenario, the activities take place on the service platform “DouYin”; during the “Data storage” activity, the Rights-and-Interests-Attributed Data Element “V” is stored on the “DouYin” platform. The coding is as follows:

<!-- http://www.PROV-OCC.com#Data_Storage:_V,_V-1_and_V-2_have_been_stored_in_DouYin -->
<owl:NamedIndividual rdf:about="http://www.PROV-OCC.com#Data_Storage:_V,_V-1_and_V-2_have_been_stored_in_DouYin">
    <rdf:type rdf:resource="http://www.PROV-OCC.com#Data_storage"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#DouYin"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#V"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#V-1"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#V-2"/>
    <www:hasActivitySpace rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://www.douyin.com</www:hasActivitySpace>
</owl:NamedIndividual>

“Which”: Describes the data rights of entities in each activity of the PROV-OCC model. The entities related to data rights in this scenario are mainly the service objects “A”, “B”, “C”, and “G” and the service platform “DouYin”. For example, the service object “A” has all the data rights of “V” when the “Data processing” activity generates the Rights-and-Interests-Attributed Data Element “V”. The coding is as follows:

<!-- http://www.PROV-OCC.com#A -->
<owl:NamedIndividual rdf:about="http://www.PROV-OCC.com#A">
    <rdf:type rdf:resource="http://www.PROV-OCC.com#Entity"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#Data_Processing:_V_processed_Vimage,V_has_generated_by_A"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#Data_Storage:_V,_V-1_and_V-2_have_been_stored_in_DouYin"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#Data_Transactions:_V,_V-1_and_V-2_have_been__published/recommended_by_DouYin"/>
    <www:hasAssociatedWith rdf:resource="http://www.PROV-OCC.com#Data_collection:__A_has_using_Vimage"/>
    <www:hasCollectionRight rdf:resource="http://www.PROV-OCC.com#V"/>
    <www:hasHoldingRight rdf:resource="http://www.PROV-OCC.com#V"/>
    <www:hasManagementRight rdf:resource="http://www.PROV-OCC.com#V"/>
    <www:hasPartialRevenueRight rdf:resource="http://www.PROV-OCC.com#V-1"/>
    <www:hasPartialRevenueRight rdf:resource="http://www.PROV-OCC.com#V-2"/>
    <www:hasRevenueRight rdf:resource="http://www.PROV-OCC.com#V"/>
    <www:hasUsageRight rdf:resource="http://www.PROV-OCC.com#V"/>
</owl:NamedIndividual>

“Why”: Describes the reason for the provenance activity in the PROV-OCC model. In this scenario, “DouYin” traces and identifies the source of the rumor information and the victim of the data rights infringement and provides the results to the judicial authorities as evidence.
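Once the ontology is serialized as RDF/XML, the “Which” assertions can be read back with ordinary XML tooling. The Python sketch below uses only the standard library and a shortened copy of the individual “A”. It assumes that the `www:` prefix is bound to `http://www.PROV-OCC.com#`, which the listings do not show, and `rights_of` is a hypothetical helper rather than part of any ontology API.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
OCC = "http://www.PROV-OCC.com#"  # assumed binding of the www: prefix

# Shortened, well-formed copy of the individual "A" from the listing above.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:owl="{OWL}" xmlns:www="{OCC}">
  <owl:NamedIndividual rdf:about="{OCC}A">
    <rdf:type rdf:resource="{OCC}Entity"/>
    <www:hasHoldingRight rdf:resource="{OCC}V"/>
    <www:hasUsageRight rdf:resource="{OCC}V"/>
    <www:hasPartialRevenueRight rdf:resource="{OCC}V-1"/>
    <www:hasPartialRevenueRight rdf:resource="{OCC}V-2"/>
  </owl:NamedIndividual>
</rdf:RDF>"""


def rights_of(xml_text, individual):
    """List (right property, data element) pairs asserted for one individual."""
    root = ET.fromstring(xml_text)
    found = []
    for ind in root.iter(f"{{{OWL}}}NamedIndividual"):
        if ind.get(f"{{{RDF}}}about") != OCC + individual:
            continue
        for child in ind:
            # ElementTree tags look like "{namespace}localname".
            namespace, _, local = child.tag[1:].partition("}")
            if namespace == OCC and local.endswith("Right"):
                target = child.get(f"{{{RDF}}}resource")
                found.append((local, target[len(OCC):]))
    return found


a_rights = rights_of(DOC, "A")
```

This kind of direct parsing is only a convenience for inspection; querying and reasoning over the full ontology would normally go through an OWL-aware tool such as Protégé, as the paper does.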

Model validation

This paper employs SWRL (Semantic Web Rule Language) inference rules and the Pellet reasoner to conduct reasoning and provenance verification on the ontology resulting from the integration and encoding of the W7 and PROV-OCC models. SWRL is an ontology-based rule language that can define Horn-like rules, thereby enhancing the reasoning and expressive capabilities of ontologies. In this paper, SWRL rules are used to address conflicting relationships between classes, and between classes and instances, after the PROV-OCC model has been ontologized. The ontology and SWRL rules are loaded into the Pellet reasoner within Protégé to explore the implicit logical relationships between classes and instances, verifying the capability and reliability of the PROV-OCC model in tracing the provenance of data rights.

This paper designs three types of capability questions and creates corresponding inference rules to verify that the model can effectively trace the infringers and the propagation paths of rumors’ data rights. The capability questions are: (1) Can the infringer of the data rights and the type of data rights infringed upon be traced? (2) Can the victim of the data rights infringement and the type of data rights infringed upon be traced? (3) Can the path of the data rights infringement be traced? Based on the above capability questions, this paper constructs SWRL inference rules for tracing the provenance of data rights in the context of social media and imports them into the Protégé rule base, as shown in Fig. 6 and Table 6. In the ontology, “x” and “v” represent instances of entities and Rights-and-Interests-Attributed Data Element, respectively. Object properties used include hasRevenueRight, hasCollectionRight, hasManagementRight, hasUsageRight, hasHoldingRight, hasPartialRevenueRight, and hasAssociatedWith.

Fig. 6: SWRL rules after importing Protégé.

By importing the PROV-OCC model into the Protégé software, a visual representation of selected SWRL rules supporting provenance of data with rights and interests is provided.

Table 6 Part of the SWRL rules for provenance of data with rights and interests.
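To give a feel for how Horn-like rules answer the first two capability questions, the following Python sketch forward-chains two hypothetical rules over a handful of facts from the case. These rules are not the SWRL rules of Table 6: the property names `derivedFrom` and `infringes` are assumptions made for illustration (`hasGeneratedBy` and `hasHoldingRight` appear in the model), and the Python loop merely stands in for the Pellet reasoner.

```python
# Facts as (property, subject, object) triples taken from the case description.
FACTS = {
    ("hasGeneratedBy", "V", "A"),
    ("hasGeneratedBy", "V-1", "B"),
    ("hasGeneratedBy", "V-2", "C"),
    ("derivedFrom", "V", "Vimage"),
    ("derivedFrom", "V-1", "V"),
    ("derivedFrom", "V-2", "V"),
    ("hasHoldingRight", "G", "Vimage"),
}


def infer(facts):
    """Apply two hypothetical Horn-like rules until no new facts appear.

    Rule 1: derivedFrom(v, w) and derivedFrom(w, u) -> derivedFrom(v, u)
    Rule 2: derivedFrom(v, w) and hasHoldingRight(g, w)
            and hasGeneratedBy(v, x) and x != g -> infringes(x, g)
    """
    known = set(facts)
    while True:
        new = set()
        for (p, v, w) in known:
            if p != "derivedFrom":
                continue
            for (q, a, b) in known:
                if q == "derivedFrom" and a == w:
                    new.add(("derivedFrom", v, b))
                if q == "hasHoldingRight" and b == w:
                    for (r, v2, x) in known:
                        if r == "hasGeneratedBy" and v2 == v and x != a:
                            new.add(("infringes", x, a))
        if new <= known:
            return known
        known |= new


closure = infer(FACTS)
infringers = {x for (p, x, g) in closure if p == "infringes"}
victims = {g for (p, x, g) in closure if p == "infringes"}
```

Under these assumed rules the sketch identifies “A”, “B”, and “C” as infringers and “G” as the victim, which matches the roles reported for the case; identifying the specific type of right infringed would require further rules over the rights properties.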

This paper imports the entities, activities, and Rights-and-Interests-Attributed Data Elements involved in the provenance case into the instance layer of the PROV-OCC ontology and, after mapping them to the ontology, adds object properties and data properties according to the actual situation. On this basis, using the designed SWRL reasoning rules, the Pellet reasoner is used for knowledge reasoning and implicit knowledge mining. The reasoning results show that the service object “A” is the original creator (original rumor data source object) of the Rights-and-Interests-Attributed Data Element “V”, that “B” and “C” are the secondary creators (secondary rumor data source objects) of the Rights-and-Interests-Attributed Data Element, and that “G” is the object whose data rights are infringed (i.e., the object whose data rights are violated in the rumor data dissemination), as shown in Figs. 7–10. The reasoning results accurately trace both the data source objects of the infringement and the victim of the data infringement, verifying the consistency and completeness of the ontology and further demonstrating the capability and reliability of the PROV-OCC model in tracing data rights provenance.

Fig. 7: Service object “A” is the inference result of the originator of Rights-and-Interests-Attributed Data Element “V”.

Using the Pellet reasoner in the Protégé software, it is inferred that the service object “A” is the original rumor data source of the Rights-and-Interests-Attributed Data Element “V”.

Fig. 8: Service object “B” is the inference result of the secondary creator of Rights-and-Interests-Attributed Data Element “V”.

Using the Pellet reasoner in the Protégé software, it is inferred that the service object “B” is the secondary creation rumor data source of the Rights-and-Interests-Attributed Data Element “V”.

Fig. 9: Service object “C” is the inference result of the secondary creator of the Rights-and-Interests-Attributed Data Element “V”.

Using the Pellet reasoner in the Protégé software, it is inferred that the service object “C” is the secondary creation rumor data source of the Rights-and-Interests-Attributed Data Element “V”.

Fig. 10: Service object “G” is the inference result of the victim of the data rights infringement.

Using the Pellet reasoner in the Protégé software, it is inferred that the service object “G” is the object whose data rights were infringed upon during the propagation of the rumor data.

The reasoning experiment shows that the main propagation path in this rumor instance lies on the service platform “DouYin”: service object “A” → service objects “B” and “C” → other browsing and forwarding service objects. The subjects of data rights infringement include both infringers and victims, and the victim is the service object “G”. The infringer “A” infringes on “G”’s partial revenue rights to the Rights-and-Interests-Attributed Data Element “V”. The infringers “B” and “C” infringe on the partial revenue rights of “G” and “A” to the Rights-and-Interests-Attributed Data Elements “V-1” and “V-2”, respectively. The other browsing and forwarding service objects infringe on the partial revenue rights of “G” and “A” to “V”, “V-1”, and “V-2”, on “B”’s partial revenue rights to “V-1”, and on “C”’s partial revenue rights to “V-2”. This answers RQ3 and supports H2.
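The enumeration above follows a regular pattern, which the following Python sketch makes explicit. It rests on one reading of the case and is not the paper's reasoning procedure: the rights holders of an element are assumed to be the victim “G” plus every creator along the element's derivation chain, a creator is assumed to infringe every holder other than itself, and the forwarding service objects are assumed to infringe every holder. The assignment of “V-1” to “B” and “V-2” to “C”, and all names in the code, are illustrative.

```python
from collections import defaultdict

# Assumed case data: who created each element and what it was derived from.
CREATOR = {"V": "A", "V-1": "B", "V-2": "C"}
DERIVED_FROM = {"V-1": "V", "V-2": "V"}
VICTIM = "G"
FORWARDERS = "other forwarders"


def rights_holders(element):
    """Return the victim plus every creator on the element's derivation chain."""
    chain = []
    while element is not None:
        chain.append(CREATOR[element])
        element = DERIVED_FROM.get(element)
    # Reverse so that the original creator comes before secondary creators.
    return [VICTIM] + chain[::-1]


def infringements():
    """Map each infringer to the set of (right holder, element) pairs it infringes."""
    result = defaultdict(set)
    for element, creator in CREATOR.items():
        for holder in rights_holders(element):
            if holder != creator:
                result[creator].add((holder, element))
            # Forwarders infringe every holder of every element they spread.
            result[FORWARDERS].add((holder, element))
    return dict(result)


if __name__ == "__main__":
    for infringer, pairs in infringements().items():
        print(infringer, sorted(pairs))
```

Under these assumptions the sketch yields the same infringer and right-holder pairs as the text, which suggests that the pairs could be derived mechanically from creation and derivation records once those records are available.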

Discussion

Theoretical and practical implications

The theoretical significance of this paper is reflected in the following aspects. First, building on existing mainstream data provenance models, this paper integrates a data rights confirmation system and proposes an ontological provenance model, PROV-OCC, which can describe data rights transformation scenarios in the process of online rumor dissemination. This enriches and expands the theories and methodologies of data provenance. Second, the model is designed by integrating three dimensions: entity, activity, and “Rights-and-Interests-Attributed Data Element”. The “Rights-and-Interests-Attributed Data Element” is taken as the core element of data element circulation and is combined with the rights-and-interests attributes and circulation properties of data elements to construct a comprehensive provenance model for data with rights and interests in online rumor data circulation. Third, by combining the PROV-OCC provenance model with the W7 provenance technology, this paper establishes a technical framework and semantic representation model for data provenance that can trace rumor propagation paths and data infringement subjects. This provides theoretical support for the comprehensive provenance of rumor propagation subjects and their data infringement liabilities, and offers a new perspective on the theories and methodologies of data provenance.

This paper also has practical significance. Targeting the social media scenario, it proposes a method and system architecture for managing the circulation of data elements (including online rumor data) based on the rights-and-interests data provenance model PROV-OCC. This addresses the challenge of tracing data infringement subjects and their associated liabilities in online rumor data, enhances the credibility and security of circulating data, clarifies the rights subjects, provides a basis for legal accountability, and helps rebuild public trust in platforms. It provides functions such as data rights and interests protection and provenance tracing for the rights stakeholders involved in the circulation process, safeguards the rights and interests of data subjects, and promotes fairness, transparency, trustworthiness, and efficiency in data element circulation. The model can describe rights-and-interests information and its changes during rumor data circulation, and it supports the recovery of, and compensation for, infringement profits based on the infringing behaviors and contents of different data subjects. At the same time, it helps improve regulatory mechanisms and platform incentive policies, reduces the likelihood of similar rumors recurring, and restores public trust and confidence in the ability of both the government and platforms to govern online rumors.

Research on innovation points

This paper introduces the concept of Rights-and-Interests-Attributed Data Element to abstractly represent data with rights and interests attributes circulating on social media platforms, aiming to safeguard the rights and interests of data subjects

Giovanardi proposed a theoretical framework based on the Internet of Things (IoT) to address issues such as the loss of traceable data in the lifecycle information management of building facades (Giovanardi et al., 2023). Zhang developed a traceable ring signature scheme (SM2-TRS) based on the SM2 digital signature algorithm, which ensures data integrity, non-repudiation, anonymity, and traceability (Zhang et al., 2024). Riskin assessed data reliability by quantifying data accuracy, completeness, and traceability (Riskin et al., 2025). Previous research on traced data in data provenance has primarily focused on recording and tracking changes in data content, with limited attention to the rights and interests attributes of data. Currently, major social media platforms encourage high-quality content creation through revenue-sharing mechanisms, enabling all subjects participating in data circulation to share data and benefit from its value. The proposed concept of Rights-and-Interests-Attributed Data Element abstracts the data with rights and interests attributes involved in the circulation of online rumor data in social media scenarios, and represents them as object classes, properties, and representations in the form of data element patterns. This provides a fundamental semantic unit for the provenance of data with rights and interests in rumor data.

This paper designs an ontological provenance model, PROV-OCC, to support the recording of data rights, the data infringement subjects and the tracing of infringement paths in the circulation of online rumor data on social media scenarios

Prudhomme enhanced the PROV-O model by semantically aligning the PROV-O ontology with the Basic Formal Ontology (BFO), improving cross-domain data interoperability and strengthening data traceability through strict logical consistency and reasoning mechanisms (Prudhomme et al., 2025). Yazici adopted a W3C-PROV-O-based method for data graph extraction and visualization, integrating and validating it through prototype tools (Yazici and Aktas, 2022). Souza improved the PROV-O model with the DOSN-PROV method, enabling effective tracing of information origins, propagation paths, and data trajectories in social networks (Souza et al., 2021). Zhang developed the ProVOC model, which enables effective tracing and cracking of user data flows and privacy management in intelligent library services. Existing ontological data provenance models are limited in handling complex data and lack the flexibility to extend to the provenance of data with rights and interests in data element circulation; their generalizability and management effectiveness require further verification. This paper reuses and extracts the core stages of major data lifecycle models, integrates them with the actual data circulation processes of social media platforms, and aligns with previous work on the “five rights separation” framework based on ternary data subjects to standardize the rights confirmation system for data element circulation. It defines the “Rights-and-Interests-Attributed Data Element” and integrates the advantages of the PROV-O and ProVOC data provenance models to design PROV-OCC, an ontological rights-and-interests data provenance model for rumor data based on a “Rights-and-Interests-Attributed Data Element–Activity–Subject” ternary structure. The model clarifies the semantic relationships among concepts and properties during rumor data circulation, supporting the recording of data rights, the identification of data infringement subjects, and the tracing of infringement paths.

This paper develops a semantic representation and an instantiated ontology reasoning method for the rights-and-interests data provenance model based on the annotation approach

Petar addressed the semantic representation of traceable data from perspectives such as semantic technologies, data interoperability, construction of trusted knowledge bases, and reputation assessment of supply chain participants (Petar et al., 2024). Ivica proposed a provenance verification framework for large language models (LLMs) through model simulation and comparative analysis (Ivica et al., 2025). Guanglin transformed complex relationships in knowledge graphs into interpretable reasoning paths, thereby clearly revealing the logical chain of data provenance (Guanglin et al., 2022). Zhiliang developed an OWL-based ontology model by embedding SWRL rules directly into the ontology to enable reasoning and verification of complex information (Zhiliang et al., 2024). Zhao utilized the SWRL rule language and the Pellet reasoner to perform implicit knowledge mining and reasoning within a rumor domain ontology, classifying instances into rumor categories and validating the semantic parsing capability and consistency of the ontology (Zhao et al., 2024). However, existing studies rarely address the semantic representation of provenance data that is directly applicable to ontology reasoning, or the visualized reasoning of traceability. This paper semantically integrates the proposed PROV-OCC model with the annotation-based W7 data provenance model, providing a technical framework and semantic representation model for rights-and-interests data provenance that encompasses activities, relationships, entities, time, space, data rights, and provenance causes. On this basis, typical rumor instances in social media scenarios are transformed into knowledge graphs to demonstrate the application process of the rights-and-interests data provenance model. Competency questions are designed, corresponding SWRL reasoning rules are created, and the Pellet reasoner is employed for ontology reasoning and model validation. The results demonstrate that the model can record data rights and trace data infringement subjects and infringement paths during rumor data circulation. It thereby supports the recovery of, and compensation for, infringement profits, enhancing the reliability and scalability of provenance.

Research limitations and future directions

The annotation-based W7 model and ontology reasoning method employed in this paper enable fine-grained analysis of rights-and-interests data provenance. However, the inherent computational complexity of ontology reasoning may lead to high resource consumption during large-scale real-time data tracking. Future research will consider incorporating technologies such as knowledge graph link search and edge computing and will explore more lightweight and efficient ontology reasoning methods, with the aim of enhancing the practicality and scalability of the model in large-scale data scenarios.

The traceability capability of the proposed provenance model for data with rights and interests relies on the integrity and authenticity of records during the circulation of data elements. Issues such as missing, tampered, or falsified records can directly affect the accuracy and credibility of the provenance results. Future research will incorporate technologies such as blockchain to optimize the technical architecture of the provenance model, thereby improving the stability of the provenance process as well as enhancing trust and consensus of provenance.

This paper focuses on tracing the infringement subjects and propagation paths of rumor data but lacks quantitative assessment methods for the distribution of profits and the loss of data rights and interests caused by rumor data infringements. Future research will focus on developing evaluation methods for assessing the economic and social impacts of rumor propagation on different data subjects, aiming to construct a rights and responsibilities evaluation model that supports precise compensation and accountability allocation.

Conclusion

Against the backdrop of online rumor propagation in social media, this paper extracted 92 circulation processes from 13 reusable data lifecycle models and summarized them into five universal stages of data element circulation: data collection, data storage, data processing, data transactions, and data usage. It further standardized the “Five Rights Separation” framework based on ternary data subjects, aiming to balance the interests and needs of multiple data holders. Building on this foundation, the concept of “Rights-and-Interests-Attributed Data Element” was introduced, and the ontology-based PROV-OCC model for rights-and-interests data provenance was designed to suit the context of online rumor data circulation in social media. This model consists of three core components: entity, activity, and rights-and-interests-attributed data element. It defines 32 object properties to describe the dynamic changes in data rights relationships among various subjects and circulation stages. The ontological PROV-OCC model was integrated with the event annotation capabilities of the W7 provenance model, enabling semantic representation and description of data and rights transitions throughout the rumor circulation process. A typical social media rumor case was instantiated as a knowledge graph and encoded using an ontology. Competency questions were designed, and SWRL inference rules combined with the Pellet reasoner were applied to conduct ontology reasoning. This process validated the model’s accuracy and feasibility in tracking data rights changes and in identifying infringement subjects and propagation paths. The results demonstrate that the proposed rights-and-interests data provenance model effectively addresses key challenges of rights-and-interests data provenance in the dynamic environment of social media, and that the research hypotheses have been verified. This paper provides both a theoretical foundation and semantic model support for the practical application of rumor data provenance and for the regulation and governance of data element circulation. It promotes the legal, compliant, and transparent flow and utilization of data resources, thereby contributing to the advancement of the digital economy.