Introduction

Natural language processing (NLP) techniques are essential for text processing, with advances in deep learning enabling robust results in tasks such as word segmentation, entity recognition, and text classification1,2,3. Pre-training models have revolutionized NLP, originating from the introduction of word embeddings by Bengio et al.4 in 2003. Word2Vec5 and GloVe6 improved text processing but struggled with polysemy because their word vectors are static. Modern pre-trained models, such as GPT7 (generative) and BERT8 (understanding-oriented), transfer the entire model architecture to downstream tasks. GPT, built on a unidirectional Transformer9, excels at capturing long-range dependencies and pioneered the paradigm of unsupervised pre-training followed by supervised fine-tuning10. BERT, a deep bidirectional Transformer, leverages Masked Language Modeling (MLM) for contextual understanding and Next Sentence Prediction (NSP) for inter-sentence relationships, significantly advancing the field. A significant body of research has since focused on improving BERT's performance in various applications by refining pre-training tasks, incorporating external knowledge, and optimizing Transformer architectures11.

Studies on optimizing the performance of the BERT model can be divided into the following aspects. Cross-linguistic knowledge fusion: Hu et al.12 used multilingual attention for sentiment analysis; Chen et al.13 fine-tuned on augmented multilingual data for low-resource languages; Caciularu et al.14 introduced CDLM for multi-document language modeling. Multimodal information fusion: research in this line emphasizes visually rich text processing; Xu et al.15 proposed LayoutLM for joint layout and content understanding; Luo et al.16 combined visual and language pre-training models; Li et al. introduced VTR-PTM with dual-stream input for image captioning; Li et al.17 proposed P-PCFL for precise vision-language connections. Long text processing: Dai et al.18 introduced Transformer-XL with segment-level recurrence; Yang et al.19 proposed XLNet using an autoregressive model and dual-stream attention; Song et al.20 developed MPNet by integrating MLM and PLM with position information; Ding et al.21 presented CogLTX to address limitations in long-range attention. Model distillation and compression: Jiao et al.22 introduced a Transformer-based distillation method; Xu et al.23 presented Theseus for BERT compression through modular replacement and emulation; He et al.24 studied KDEP, a feature-based knowledge distillation method.

The integration of specialized domain knowledge has proven crucial for enhancing the generalization ability and controllability of BERT models. This has led to significant research efforts in developing domain-specific pre-trained models, primarily focusing on biomedical and general academic texts. In the biomedical domain, Lee et al.25 demonstrated the superior performance of BioBERT over standard BERT in tasks such as named entity recognition, relation extraction, and question answering. Jeong et al.26 further applied BioBERT to biomedical question answering, achieving notable success with a sequential transfer learning approach. BioBERT's effectiveness in extracting medical text features was also highlighted by Xiao et al.27. To address fine-grained semantic relationships in this domain, Liu et al.28 proposed SAPBERT for accurate biomedical entity linking. The urgency of the COVID-19 pandemic spurred the development of CovidBERT by Hebbar and Xie29 for extracting relevant treatment information, while Rasmy et al.30 introduced Med-BERT, integrating BERT with structured domain knowledge for improved disease prediction.

Similarly, the academic domain has witnessed the development of specialized BERT models. Beltagy et al.31 released SciBERT, which outperformed BERT on scientific text, addressing the scarcity of large-scale, high-quality scientific data. Park and Caragea32 corroborated SciBERT’s advantages in intermediate task transfer, scientific keyword recognition, and classification. Van Dongen et al.33 introduced SChuBERT for citation prediction, leveraging a comprehensive academic corpus with citation links. Liu et al.34 trained OAG-BERT by incorporating diverse academic entities, demonstrating its effectiveness. Shen et al.35 developed SsciBERT specifically for English social science abstracts, achieving state-of-the-art results in humanities and social sciences. In the context of Chinese academic literature, Li et al.36 created a dataset and corresponding models based on CNKI abstracts and titles.

These studies underscore a growing trend in BERT research towards domain-specific enhancements. However, a notable gap exists in pre-trained models tailored for full-text mining within the humanities and social sciences. While large language models like GPT have broadly advanced NLP, smaller BERT-based models continue to excel in specific domains and exhibit stronger domain integration in multimodal and cross-lingual applications. Although academic domain-specific models are prevalent in the natural sciences, their development for the English humanities and social sciences is recent and has focused primarily on abstracts. The creation of a Chinese humanities and social sciences pre-trained model built on full-text academic resources remains an underexplored area.

Methods

In deep learning, studies that pre-train on domain data (e.g., SciBERT and BioBERT) typically comprise two parts: pre-training experiments and performance validation experiments conducted on standard datasets. In addition to inheriting this basic experimental paradigm, this study made improvements in light of the characteristics of general research in the Chinese humanities and social sciences, so that the resulting model better reflects the field. The experimental steps are designed as follows.

(1) Data acquisition and pre-processing. A Python crawler was used to collect the abstracts of papers indexed in the Chinese Social Sciences Citation Index (CSSCI) database from 1998 to 2020 and the full texts of all papers included in the China Social Science Excellence database from 1995 to 2021. Abnormal characters and spaces were removed; each full-text paragraph and each abstract was treated as a basic unit of the pre-training data; abstracts and full texts were merged into a single file, one unit per line; and the data were split into training and validation sets at a ratio of 99:1.
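A minimal pre-processing sketch under the stated design (one abstract or paragraph per line, a 99:1 random split); the cleaning rules and file names are illustrative assumptions, not the authors' exact pipeline:

```python
import random
import re

def clean(text: str) -> str:
    # Drop control characters, then collapse runs of whitespace.
    text = re.sub(r"[\x00-\x1f\x7f]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(units, train_path="train.txt", valid_path="valid.txt",
                 valid_ratio=0.01):
    """Write one cleaned abstract/paragraph per line, split 99:1 at random."""
    random.seed(42)
    with open(train_path, "w", encoding="utf-8") as tr, \
         open(valid_path, "w", encoding="utf-8") as va:
        for unit in units:              # each unit: an abstract or a paragraph
            line = clean(unit)
            if line:
                (va if random.random() < valid_ratio else tr).write(line + "\n")
```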

(2) Model pre-training. Using the constructed corpus of Chinese humanities and social science abstracts and full texts, this study trained a pre-trained model specifically for the Chinese humanities and social sciences. Training was based on BERT with a general Chinese vocabulary. The model was pre-trained on a self-built GPU server and named HsscBERT (https://github.com/S-T-Full-Text-Knowledge-Mining/HsscBERT).

(3) Pre-trained model evaluation. Perplexity was used as an initial metric of the overall performance of the pre-trained models. A standardized performance evaluation is also required to ensure the usability and effectiveness of the model. We therefore designed validation experiments based on the characteristics of Chinese humanities and social science data: using the validation datasets, we conducted text classification, structural function recognition, and named entity recognition experiments, comparing the benchmark models with the new models proposed in this research. Figure 1 presents the basic flow of the experiments.

Fig. 1

Framework of pre-training and validation experiments.

Pre-training experiment

Pretraining model

Currently, mainstream pre-training models are trending towards larger sizes, more parameters, and increasingly complex structures. In this study, we chose BERT as the baseline model to conduct pre-training using the Masked Language Model (MLM) task.

BERT represents a pioneering advancement in the field of natural language understanding in the deep learning era. What distinguishes BERT from previous pre-training models is its ability to train deep bidirectional representations that consider both forward and backward contexts across all layers. In contrast, earlier language models were typically unidirectional. The pre-training phase of BERT in the context of Chinese humanities and social sciences consists of two main tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

The MLM task is a denoising auto-encoding training approach: the BERT model randomly masks 15% of the input characters from Chinese humanities and social science texts and updates the parameters of the bidirectional Transformer by predicting the masked portions. The NSP task requires the model to process sentence pairs and determine whether the two sentences form a coherent sequence, thereby learning inter-sentence relationships. By combining the loss functions of the two tasks into a joint loss, BERT updates the encoder's output layer and the connected classifier heads simultaneously.
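This joint loss can be illustrated with the Hugging Face transformers API, whose BertForPreTraining head sums the MLM and NSP cross-entropy losses when both kinds of labels are supplied; the inputs below are toy placeholders, not the actual training pipeline:

```python
import torch
from transformers import BertForPreTraining, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForPreTraining.from_pretrained("bert-base-chinese")

# A sentence pair for NSP; token_type_ids mark the two segments.
enc = tokenizer("本文提出一种预训练模型。", "实验验证了其有效性。", return_tensors="pt")

mlm_labels = torch.full_like(enc["input_ids"], -100)  # -100 = position is ignored
mlm_labels[0, 2] = enc["input_ids"][0, 2].item()      # target for the masked slot
enc["input_ids"][0, 2] = tokenizer.mask_token_id      # mask one token by hand

out = model(**enc, labels=mlm_labels,
            next_sentence_label=torch.tensor([0]))    # 0 = "B follows A"
print(out.loss)  # joint loss = MLM cross-entropy + NSP cross-entropy
```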

Pretraining data

The corpus consists of the abstracts in CSSCI (http://cssci.nju.edu.cn/login_u.html) and the full texts in China Social Science Excellence (http://ipub.exuezhe.com/index.html). We stored the crawled information in a MySQL database, from which we extracted all items with non-empty values, obtaining a total of 16,960,263 abstracts and paragraphs. Table 1 presents the descriptive statistics of the corpus. The results indicate that the data used in this experiment comprise more than 4 billion Chinese characters.

Table 1 Chinese humanities and social sciences dataset

To understand the distribution of disciplines in the Chinese humanities and social sciences corpus, Tables 2 and 3 show the discipline distributions of the abstracts and the academic full texts, respectively.

Table 2 The distribution of CSSCI abstracts by discipline
Table 3 The distribution of full texts from China Social Science Excellence by discipline

Experimental parameters for pre-training

Each line of the corpus is an abstract or a paragraph of an article. The number of characters in the vast majority of lines is less than 512, so the maximum sequence length for pre-training was set to 512 to improve training speed. The initial learning rate was set to 2e-5, and the warmup schedule provided by the transformers library rapidly increased the learning rate at the start of pre-training, after which it gradually decayed to zero. This schedule allowed the model to stabilize quickly in the initial training phase and then converge rapidly in subsequent training, and the decaying learning rate balanced convergence speed against training accuracy. Considering the server configuration and model limitations, the training batch size was set to 32 per graphics card, and, following common experimental practice, three or five epochs of training were conducted to achieve better results. The specific parameter values are presented in Table 4.

Table 4 Pre-training parameters for the Chinese humanities and social sciences models
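A sketch of continued MLM pre-training with these parameters, using the Hugging Face transformers and datasets libraries; the maximum length (512), learning rate (2e-5), per-device batch size (32), and epoch counts (3 or 5) follow the text above, while the warmup ratio and saving strategy are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# One abstract or paragraph per line (see the pre-processing sketch above).
ds = load_dataset("text", data_files={"train": "train.txt"})["train"]
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

args = TrainingArguments(output_dir="hsscbert",
                         num_train_epochs=3,             # 3 for _e3, 5 for _e5
                         per_device_train_batch_size=32,
                         learning_rate=2e-5,
                         warmup_ratio=0.1,               # warmup, then linear decay
                         save_strategy="epoch")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=ds).train()
```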

Model performance evaluation

In this study, two approaches are used to evaluate the performance of the pre-trained language models. The first is intrinsic: assessing how well the model has learned Chinese humanities and social sciences texts by calculating its perplexity on a test set. The second is extrinsic: applying the pre-trained model to specific Chinese humanities and social sciences tasks and comparing its performance against other models to assess its effectiveness.

Evaluation of perplexity

Perplexity offers a direct intrinsic measure: when disparities in perplexity are substantial, a lower score indicates a stronger fit of the pre-trained model to real Chinese humanities and social sciences sentences. In other words, a lower perplexity signifies a better model for this domain. Perplexity is computed by the following formula.

$$\mathrm{PP}\left(W\right)=P\left(w_{1}w_{2}\ldots w_{N}\right)^{-\frac{1}{N}}=\sqrt[N]{\frac{1}{P\left(w_{1}w_{2}\ldots w_{N}\right)}}$$
(1)
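In practice, perplexity can be obtained from a language model's average cross-entropy loss on the held-out set, since the exponential of the mean negative log-likelihood is equivalent to formula (1). A minimal sketch, assuming the loss is reported in nats (as under the "eval_loss" key returned by transformers' Trainer.evaluate); the loss value here is purely illustrative:

```python
import math

# Mean token-level cross-entropy on the validation set (illustrative value).
eval_loss = 2.05
perplexity = math.exp(eval_loss)  # PP(W) = exp(-(1/N) * sum_i log P(w_i | context))
print(f"perplexity = {perplexity:.2f}")
```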

During the experiments, several pre-training runs for the Chinese humanities and social sciences were conducted, and the resulting perplexities are presented in Table 5.

Table 5 Perplexity of the pre-trained models

As Table 5 shows, the two models pre-trained on the Chinese humanities and social sciences corpus achieve low perplexity. Given that the test set consists of abstracts and full academic texts from actual publications, its content can be assumed to be normal, understandable sentences; language models with lower perplexity may therefore perform better on specific text mining tasks. HsscBERT_e5 obtains a lower perplexity than HsscBERT_e3 but consumes considerably more training time, and, as shown below, HsscBERT_e5 performs better on specific natural language understanding tasks.

Evaluation on NLP tasks

Perplexity reflects, to some extent, the convergence of the pre-trained model and thus characterizes how well it captures the linguistic features of the pre-training data. However, while lower perplexity indicates a better fit to that data, it does not guarantee superior performance on specific natural language understanding tasks. Further experiments are therefore needed to validate the models trained on Chinese humanities and social sciences data. In this research, three natural language processing tasks were set for validation: (1) discipline classification, i.e., classifying CSSCI journal papers by discipline from their titles and abstracts; (2) abstract structural function recognition, i.e., identifying the function of abstract sentences from articles published in Data Analysis and Knowledge Discovery; and (3) named entity recognition on a Chinese literary dataset. The BERT-base-Chinese and Chinese-RoBERTa-wwm-ext models were selected as baselines for comparison. All three tasks are important foundations for research on academic texts in the humanities and social sciences, especially for in-depth analysis of the literature. First, this study constructed high-quality evaluation datasets and divided each into training and test sets at a ratio of 9:1. Second, the HsscBERT models and the selected BERT-style baselines were fine-tuned and evaluated. Finally, large language models were tested on the same test sets.
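As an illustration of the fine-tuning step, the sketch below shows a minimal sequence classification setup with the Hugging Face transformers and datasets libraries; the checkpoint name, toy data, and hyperparameters are placeholders rather than the exact experimental configuration:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "bert-base-chinese"  # or a released HsscBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Toy stand-in for an evaluation dataset; the real data are split 9:1.
data = Dataset.from_dict({
    "text": ["图书馆学科服务研究", "宏观经济增长的实证分析"] * 10,
    "label": [0, 1] * 10,
}).map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
       batched=True)
split = data.train_test_split(test_size=0.1, seed=42)  # 9:1, as in the paper

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune", num_train_epochs=3,
                           per_device_train_batch_size=16,
                           learning_rate=2e-5),
    train_dataset=split["train"],
    eval_dataset=split["test"],
)
trainer.train()
print(trainer.evaluate())
```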

Evaluation dataset

(1) Discipline classification dataset of CSSCI

The titles and abstracts of articles published in CSSCI-indexed journals from 1998 to the first half of 2021 were obtained and categorized according to the CSSCI discipline classification standard. Approximately 2,000 bibliographic records were randomly extracted from each discipline; for disciplines with fewer than 2,000 papers, all records were taken. In total, 41,559 valid documents were extracted and divided into training and test sets at a ratio of 9:1. The disciplines included in the dataset are shown in Table 6.

Table 6 Discipline classification labels and data distribution

(2) Structural function recognition dataset

Data Analysis and Knowledge Discovery is a core CSSCI journal in library and information science. Drawing on computer science, data science, and informetrics, among other fields, the journal extensively publishes research on the methods, theories, and technologies of data-driven knowledge discovery, intelligent management, and semantic computing. With characteristics of both the social and natural sciences, the journal is a product of big data management and application in the smart era. Because the abstracts of articles published in the journal have been strictly arranged in a structured form since 2014, studying their structural function recognition does not require extensive manual tagging, which avoids tagging errors and reduces workload and time consumption.

For this dataset, the abstracts of the literature published in Data Analysis and Knowledge Discovery from 2014 to October 2021 (including articles first published online) were obtained. The abstracts were organized according to their labeled structured subheadings ("Purpose," "Methods," "Results," "Limitations," and "Conclusions," among others), classified into the corresponding categories, and then normalized. The training and validation sets were divided at a ratio of 9:1, with a single abstract as the unit. Table 7 presents the details of the structural function label normalization.

Table 7 Uniform specifications of structural function labels
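For illustration, a hypothetical normalization mapping in the spirit of Table 7 might look as follows; the variant subheadings listed are assumptions, while the canonical labels follow the text above:

```python
# Collapse variant structured-abstract subheadings onto canonical labels.
# The variants here are illustrative, not the journal's exact inventory.
LABEL_MAP = {
    "目的": "Purpose", "研究目的": "Purpose", "Objective": "Purpose",
    "方法": "Methods", "研究方法": "Methods", "Method": "Methods",
    "结果": "Results", "研究结果": "Results",
    "局限": "Limitations", "研究局限": "Limitations",
    "结论": "Conclusions", "研究结论": "Conclusions", "Conclusion": "Conclusions",
}

def normalize(subheading: str) -> str:
    # Fall back to the raw subheading when no canonical label is known.
    return LABEL_MAP.get(subheading.strip(), subheading.strip())
```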

(3) Chinese literary entity recognition dataset

To examine the effect of pre-training the HsscBERT models, Chinese literary text data were selected for named entity recognition (NER) validation experiments37. The data come from a dataset for Chinese literary text entity recognition and relationship extraction posted on GitHub (https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset/tree/master). The dataset is based on 726 articles related to Chinese literature filtered and extracted from the web. It was constructed using entity and relationship annotation, heuristic tagging based on generic disambiguation rules, and CRF-based machine learning to assist annotation and ensure the precision of the entities and relationships. The dataset is annotated with the "BIO" scheme, which distinguishes entities from non-entities, and contains seven entity categories. Reflecting the characteristics of literary texts, it uses "Thing" to capture non-human objects, "Time" to capture story timelines, and "Metric" to capture lengths and similar information, allowing the label design to fit the text features (Table 8).

Table 8 Entity labels of dataset for Chinese literary entity recognition
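To make the "BIO" scheme concrete, the following sketch decodes a BIO tag sequence into entity spans; the example tags (Person, Location) are illustrative stand-ins for the seven categories in Table 8:

```python
def bio_to_spans(tags):
    """Return (entity_type, start, end) spans from a BIO tag sequence."""
    spans, etype, start = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
        # "I-" tags simply extend the span that is currently open
    return spans

tokens = ["鲁", "迅", "在", "北", "京"]
tags = ["B-Person", "I-Person", "O", "B-Location", "I-Location"]
for etype, s, e in bio_to_spans(tags):
    print(etype, "".join(tokens[s:e]))  # Person 鲁迅 / Location 北京
```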

Results

To verify the HsscBERT pre-trained models on various natural language processing tasks, we conducted comparative experiments between the proposed models and the following baselines: (1) BERT-base-Chinese, the basic Chinese pre-trained model officially released by Google. (2) Chinese-RoBERTa-wwm-ext, a Chinese RoBERTa model adopting whole-word masking (WWM). (3) ChatGPT, among the top-performing generative multilingual pre-trained models at present; we used GPT-3.5-turbo and GPT-4.0-turbo from the GPT series. (4) LLAMA3.1-8B, one of the best-performing open-source large language models; this study first performed LoRA fine-tuning and then used the fine-tuned model for downstream task validation. In line with the common NLP performance metrics system, the validation tasks were evaluated using Accuracy, Precision (P), Recall (R), F1-score, macro-average, and weighted-average. Each metric is calculated as follows (Table 9).

Table 9 Confusion matrix for model evaluation
$$\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$
(2)
$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(3)
$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(4)
$$\mathrm{F1}\mbox{-}\mathrm{score}=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
(5)

The macro-average represents the arithmetic mean of the statistical indicators over all categories, including the macro-precision, macro-recall, and macro-F1-score.

$$\mathrm{macro}\mbox{-}\mathrm{precision}=\frac{1}{n}\sum_{i=1}^{n}\mathrm{precision}_{i}$$
(6)
$$\mathrm{macro}\mbox{-}\mathrm{recall}=\frac{1}{n}\sum_{i=1}^{n}\mathrm{recall}_{i}$$
(7)
$$\mathrm{macro}\mbox{-}\mathrm{F1}\mbox{-}\mathrm{score}=\frac{2\times\mathrm{precision}_{\mathrm{macro}}\times\mathrm{recall}_{\mathrm{macro}}}{\mathrm{precision}_{\mathrm{macro}}+\mathrm{recall}_{\mathrm{macro}}}$$
(8)

The weighted-average uses the proportion of samples in each category relative to the total number of samples as the weight of that category and then computes the weighted mean. The indicators calculated in this way are the weighted precision, weighted recall, and weighted F1-score.

$$\mathrm{weighted}\mbox{-}\mathrm{precision}=\sum_{i=1}^{n}\mathrm{precision}_{i}\times f_{i}$$
(9)
$$\mathrm{weighted}\mbox{-}\mathrm{recall}=\sum_{i=1}^{n}\mathrm{recall}_{i}\times f_{i}$$
(10)
$$\mathrm{weighted}\mbox{-}\mathrm{F1}\mbox{-}\mathrm{score}=\frac{2\times\mathrm{precision}_{\mathrm{weighted}}\times\mathrm{recall}_{\mathrm{weighted}}}{\mathrm{precision}_{\mathrm{weighted}}+\mathrm{recall}_{\mathrm{weighted}}}$$
(11)
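These metrics can be reproduced with scikit-learn, as the toy check below shows. One nuance: scikit-learn's macro-F1 averages the per-class F1 scores, which differs slightly from Eq. (8), where F1 is computed from the macro-averaged precision and recall:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

# Toy label sequences for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print("accuracy:", accuracy_score(y_true, y_pred))
for avg in ("macro", "weighted"):   # Eqs. (6)-(8) and (9)-(11), respectively
    print(avg, "P/R/F1:",
          precision_score(y_true, y_pred, average=avg),
          recall_score(y_true, y_pred, average=avg),
          f1_score(y_true, y_pred, average=avg))
```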

Prompts for LLM

Table 10 presents the prompts used for the three downstream tasks. For the text classification tasks, namely structural function recognition and discipline classification, no additional examples were added to the prompts, which relied solely on the provided task descriptions. For the named entity recognition (NER) task, however, more complex prompts were constructed, including a specific example, because NER outputs are far less constrained than classification outputs, making it difficult to extract computable elements from generated text without structural control over the output. Even so, the model may still generate text outside the label set. To address this, during the final calculation any model output that does not belong to any label category is treated as an erroneous label, ensuring the rigor of the computed results.

Table 10 Input prompt examples for LLM
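A minimal sketch of the post-processing described above, with an assumed label set and error marker: any generated text that does not match a label in the given set is scored as an erroneous prediction.

```python
# Label set and error marker are illustrative, not the exact experimental setup.
LABELS = {"Purpose", "Methods", "Results", "Limitations", "Conclusions"}

def to_label(generated: str) -> str:
    # Trim whitespace and trailing punctuation before matching.
    cand = generated.strip().strip("。.：: ").strip()
    return cand if cand in LABELS else "__INVALID__"  # scored as a wrong label

print(to_label("Methods"))           # Methods
print(to_label("该句属于研究方法"))    # __INVALID__ -> counted as an error
```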

Experiment results

Discipline classification experiments

In this study, the baseline models and the two proposed pre-trained models for the Chinese humanities and social sciences were used to conduct automatic discipline classification experiments based on the titles and abstracts of CSSCI journal articles. Classification performance was measured by accuracy, macro-average, and weighted-average values. Table 11 presents the results of the classification experiments. The data indicate that HsscBERT_e3 achieved the best results on title-based classification, as well as the best accuracy and weighted-average on abstracts, while HsscBERT_e5 achieved the best macro-average on abstracts. Both HsscBERT_e3 and HsscBERT_e5 outperformed the baseline models, including the WWM-based Chinese-RoBERTa-wwm-ext, indicating that the models proposed in this study can better support the intelligent processing of humanities and social science texts. It is worth noting that HsscBERT_e3 is slightly better than HsscBERT_e5 on the title classification task; this is likely related to the characteristics of the training data, in which full-text data accounted for a relatively large proportion, so HsscBERT_e3 retained an edge on the shorter-text task. The GPT-3.5-turbo model performs weakly, especially on the macro-average, because the unpredictability of its generated content means it often outputs redundant text and labels outside the given category set, which severely affects its measured performance. In contrast, the performance of GPT-4.0-turbo improves greatly: although a gap remains compared with the fine-tuned models, it approaches the fine-tuned BERT models in the 0-shot setting, and its improvement is especially pronounced when the input content is richer. In addition, the LoRA fine-tuned LLAMA3.1-8B model outperforms GPT-4.0-turbo on this classification task, but it also suffers from irregular output formats, sometimes producing labels outside the given category set.

Structural function recognition experiments

To verify the performance of the pre-trained models, abstract sentence structural function recognition experiments were conducted using abstracts from publications in Data Analysis and Knowledge Discovery. The journal's abstracts are standardized and, after simple normalization, can be used directly as gold labels for structural function recognition. The experiments used precision, recall, and F1-score under macro-average and weighted-average as the basic metrics, and the results were compared with BERT-base-Chinese, Chinese-RoBERTa-wwm-ext, and the GPT models. The validation results are shown in Table 12. HsscBERT_e5 achieved the best value on every index, with HsscBERT_e3 slightly below it. In contrast, Chinese-RoBERTa-wwm-ext performed worst among the fine-tuned models on this task, with no index reaching 80%, while the baseline BERT-base-Chinese showed middling performance. GPT-3.5-turbo is weakest at recognizing abstract sentence structural functions, but compared with the discipline classification task it achieves higher macro-average scores here: it generates very little text outside the label set, suggesting that large language models produce more structured output when the label set is small. GPT-4.0-turbo improves markedly at the macro-average level, partly owing to its stronger underlying capability and partly because structural function recognition involves far fewer categories than discipline classification, which makes 0-shot category prediction easier. The LoRA fine-tuned LLAMA3.1-8B model performs even better on this task than on discipline classification, clearly surpassing the GPT-series models, and, because of the small number of categories, it never outputs a label outside the given set.

Table 11 Experiment results of classification models

Chinese literary entity recognition

As one of the tasks of NLP, named entity recognition underpins subsequent applications such as question answering systems, machine translation, and information retrieval. To verify the recognition performance of the HsscBERT pre-trained models, the Chinese literature dataset, which contains seven types of entities including "characters" and "places", was used for entity recognition validation experiments. As is typical for named entity recognition, precision, recall, and F1-score were used as the evaluation indexes. The experimental results are presented in Table 13. HsscBERT_e5 obtained the best performance in both recall and F1: its F1-score over all entity types is 73.38%, 1.39 percentage points higher than the baseline BERT-base-Chinese, whereas Chinese-RoBERTa-wwm-ext performs poorly, reaching only 50.56%. The performance of GPT-3.5-turbo on this task is also far below that of the fine-tuned models, with recall markedly lower than precision and an F1-score of only 18.20%, indicating that even highly capable general-purpose models remain difficult to apply directly to vertical-domain tasks. As in the previous two experiments, GPT-4.0-turbo performs much better than GPT-3.5-turbo, but its macro-average remains low; evidently, in the 0-shot scenario it is difficult for generative models such as GPT to complete domain tasks while keeping the output format standardized. The LLAMA3.1-8B model outperforms the GPT models on named entity recognition but still lags behind the BERT-family models. Taken together with the other two validation results, generative models still require further exploration on text comprehension tasks.

Table 12 Experiment results of structural function identification of abstracts (%)
Table 13 Experimental results of Chinese literary entity identification (%)

Discussion

The present study focuses on the development of pre-trained language models specifically for Chinese humanities and social science texts. A series of models was proposed, including HsscBERT_e3 and HsscBERT_e5, and their performance was evaluated across a range of tasks, including discipline classification, abstract structural function recognition, and named entity recognition. The models demonstrated superior performance on domain-specific tasks in comparison to baseline models, particularly with regard to the recognition of domain-related linguistic features and the improvement of semantic understanding. Furthermore, experimental findings indicated that five rounds of pre-training yielded superior model performance in comparison to three rounds, thereby underscoring the significance of continuous domain-specific training. Additionally, the research highlighted that small, fine-tuned models outperform large-scale models like GPT in certain specialised tasks, affirming the value of targeted model training over broad, generic pre-training. The process described in this study can be further generalized to other domains, such as literature, history and philosophy: training on large-scale domain texts enables a model to demonstrate superior performance on domain-specific tasks.

Notwithstanding the favourable outcomes, this study also identified certain limitations. From a cross-disciplinary and cross-language perspective, the performance of the model varies across disciplines due to differences in publication volume and language style. The more esoteric language of disciplines such as philosophy poses a significant challenge for training, leading to lower recognition accuracy in specific tasks. The increasing overlap of disciplinary content in modern research also complicates text categorisation, making it more difficult for models to achieve high classification accuracy. While the model has demonstrated proficiency in the Chinese humanities and social sciences, challenges persist in multilingual scenarios due to constraints in model parameters and training data. From the perspective of NLP tasks, the encoder-only architecture represented by BERT is better suited to text comprehension tasks and struggles with generative tasks (e.g., translation, summarization).

To date, NLP has experienced four research paradigms: fully supervised learning (non-neural network); fully supervised learning (neural network); "pre-train, fine-tune"; and "pre-train, prompt, predict"38. This study represents an exploration of the third paradigm in the context of the Chinese humanities and social sciences. The experimental results demonstrate that, despite large language models' demonstrated proficiency in general-domain text generation, the "pre-train, fine-tune" paradigm based on small models remains relevant in vertical domains. Future research should address the limitations of perplexity as a metric for model evaluation, exploring alternative metrics that could better capture nuanced performance differences, especially in low-perplexity scenarios. In addition, further exploration is required to enhance the performance of models across diverse disciplinary domains characterised by varied linguistic characteristics. The incorporation of a more diverse array of data sources, along with the identification and resolution of challenges pertaining to cross-disciplinary classification, will serve to enhance the model's versatility and generalizability. Additionally, refining the pre-training process to accommodate the specific needs of different linguistic styles, such as those found in the Chinese humanities and social sciences, could further optimize model performance. The exploration of advanced techniques, such as the incorporation of external knowledge bases, the integration of more domain-specific features, and the application of hybrid models combining supervised and unsupervised learning approaches, holds promise for future model development. As the field of NLP continues to evolve, it is essential to examine how new paradigms, such as the fourth paradigm of NLP, can be integrated into the pre-training process to improve performance on downstream tasks.