Introduction

Recent releases of large-scale language models, such as GPT1, LLaMA2, Mistral3, and DeepSeek4, continue to advance the field of natural language processing (NLP). LLMs are now commonly adopted in applications such as the automation of customer-support chats5, where they generate coherent and contextually relevant responses by exploiting large textual corpora and performing complex reasoning tasks. In the field of dialogue systems, these models produce human-like text and sustain coherent conversations thanks to their advanced language understanding capabilities6,7.

Fine-tuning and preference optimization represent crucial processes for large language models. These processes allow the models to customize their responses to specific contexts, thereby enhancing the quality of generated text and improving user interactions8. Furthermore, preference optimization methods such as RLHF (Reinforcement Learning from Human Feedback)9,10 further enhance the models’ responses by learning from human interactions, ensuring that the model’s outputs align with user preferences and feedback. As large language models evolve, ethical and regulatory considerations necessitate greater attention. Understanding the implications of AI-generated content, such as determining copyright ownership of AI-generated works, is a prerequisite for ensuring legal clarity and accountability. Moreover, emphasizing explainable AI and transparency is essential for building user trust and ensuring effective collaboration between AI systems and human users11,12.

However, these models exhibit limitations when applied to low-resource languages and niche domains. The primary issue lies in their limited capacity to effectively adapt to low-resource and unseen languages, which constrains their performance in such contexts13,14. Despite the existence of cross-lingual model transfer methods that utilize parallel corpora to connect high-resource and low-resource languages, the adaptability of these models remains restricted by their inherent limitations15. Techniques like synthetic treebanking have been explored to facilitate parsing for low-resource languages, but their effectiveness is limited by the constraints of the models16. The “curse of multilinguality” also presents a challenge, as the adaptability of multilingual models may result in suboptimal representations for individual languages in niche domains17,18. The limitations of large language models in low-resource languages and niche domains underscore the necessity for tailored solutions and specialized adaptations. While techniques like prompt tuning, few-shot learning, and fine-tuning have demonstrated potential in customizing models for specific tasks19, addressing the inherent constraints of these models across diverse linguistic and application contexts remains an essential area of research20. Successfully addressing these challenges is a prerequisite for the widespread practical application of large language models.

Although Italian is widely spoken globally, it is often underrepresented in large models released by international companies, as exemplified by the low percentage of Italian-language data used to train the Meta LLaMA-2 model21. This challenge is partially addressed by the release of “LLaMAntino”22, the first family of Large Language Models based on Meta-AI LLaMA models adapted for the Italian language. The family provides models intended for open use on Italian-language tasks, with the possibility of further adaptations and releases; their development relies on open, reliable, and reusable data. The ANITA (Advanced Natural-based interaction for the ITAlian language) project (https://huggingface.co/swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA) continues this line of research by building upon the evolution of LLaMA models, in particular one of the latest LLaMA-3 versions23 available when this work was conducted (April 2024). The presented model incorporates several enhancements over its predecessors, including a reduced size, adaptation to user preferences, and support for quantized versions. Its effectiveness is demonstrated through rigorous evaluation and multiple application examples in scientific and business contexts.

Related work

Despite the ability of LLMs to correctly answer a long list of general questions in English, their adaptation to specific languages or tasks is often necessary24. The traditional approach of fully fine-tuning these models for specific tasks is computationally expensive and memory-intensive. Parameter-Efficient Fine-Tuning (PEFT)25 methods address this challenge by adapting LLMs to new tasks while updating only a small subset of the model parameters, thus reducing the computational load while preserving model performance across tasks. Full fine-tuning of LLMs presents several difficulties, requiring substantial memory for model weights, optimizer states, gradients, and forward activations during training. As LLMs grow in size, reaching hundreds of gigabytes, the memory requirements become prohibitive, especially on consumer hardware. Moreover, full fine-tuning can lead to catastrophic forgetting, where a model loses its performance on previously learned tasks when adapted to new ones; this complicates the use of a single LLM for multiple tasks without compromising its efficiency. Techniques such as Low-Rank Adaptation (LoRA) modify only a small part of the model’s weight matrices, while Prompt Tuning introduces learned prompts that guide the model to generate task-specific responses without extensive retraining. These methods enhance the efficiency of LLMs in zero-shot classification tasks, especially in low-resource settings where only a few examples per class are available.

LLaMAntino-3-ANITA is grounded on LoRA26, which introduces low-rank matrices that capture the changes needed for adaptation. In a Transformer model, each layer contains weight matrices, such as those of the attention and feed-forward networks. Rather than updating a weight matrix W directly, LoRA represents its update as the product of two much smaller matrices, A and B, so that the adapted weight becomes W + B × A. This decomposition substantially reduces the number of parameters that need to be updated during fine-tuning: only the low-rank matrices A and B are trained, while the original pre-trained weights remain frozen.

With this approach, the model learns task-specific adaptations with a minimal increase in the number of trainable parameters. By training only a small fraction of the model’s parameters, LoRA facilitates efficient adaptation to new tasks without the computational overhead of traditional fine-tuning methods. This permits the fine-tuning of LLMs on consumer-grade hardware and their wider deployment.
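The parameter saving behind LoRA can be illustrated with a minimal NumPy sketch; all dimensions and variable names here are illustrative, not taken from the ANITA configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 64, 8          # layer dims and LoRA rank (r << d, k)
W = rng.normal(size=(d, k))  # frozen pre-trained weight

# Trainable low-rank factors: B starts at zero so the adapted
# layer initially behaves exactly like the frozen one.
A = rng.normal(scale=0.01, size=(r, k))
B = np.zeros((d, r))

def adapted_forward(x):
    # W stays frozen; only A and B would receive gradient updates.
    return x @ (W + B @ A).T

x = rng.normal(size=(1, k))
# With B = 0 the adapted output equals the frozen output.
assert np.allclose(adapted_forward(x), x @ W.T)

full = d * k
lora = r * (d + k)
print(f"trainable params: {lora} vs full fine-tune: {full}")
```

For this toy layer, the low-rank factors hold 1,024 trainable values against 4,096 in the full matrix; the saving grows with the layer size since the factor count scales as r(d + k) rather than dk.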

In some scenarios, LoRA is insufficient for training a model due to hardware limitations. QLoRA, which stands for Quantized Low-Rank Adaptation27, builds on the principles of LoRA by incorporating quantization into the fine-tuning process. Quantization is a process that reduces the numerical precision of a model’s tensors, typically converting them from high-precision floating-point numbers to lower-precision representations, such as 8-bit or 4-bit integers. The primary goal of QLoRA is to maintain the performance of LLMs while substantially reducing their memory footprint, which facilitates the fine-tuning and deployment of these models on less powerful hardware with limited resources. QLoRA combines low-rank matrix adaptation with quantization: the low-rank adaptation reduces the number of parameters that need to be updated during fine-tuning, while quantization further compresses the model size by mapping the floating-point weights to a more memory-efficient fixed-point representation. This dual approach facilitates the fine-tuning of LLMs with billions of parameters on relatively small GPUs, making advanced language processing capabilities more accessible. QLoRA is the technique chosen for the model in this study.
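The memory saving from quantization can be sketched with simple symmetric 8-bit rounding. Note that QLoRA itself uses a 4-bit NormalFloat data type with per-block quantization constants, so this sketch shows only the underlying idea, not QLoRA's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 4)).astype(np.float32)  # toy weight matrix

# Symmetric 8-bit quantization: map floats to int8 with a single
# per-tensor scale derived from the largest absolute weight.
scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize for the forward pass; the rounding error is bounded
# by the quantization step.
w_hat = q.astype(np.float32) * scale
assert np.abs(w - w_hat).max() <= scale

print(f"storage: {w.nbytes} bytes fp32 -> {q.nbytes} bytes int8")
```

Storage drops by a factor of four here (fp32 to int8); a 4-bit representation, as in QLoRA, roughly doubles that saving again at the cost of a coarser grid.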

To align the model’s outputs with human values and preferences, Reinforcement Learning from Human Feedback (RLHF)9 is commonly adopted. This method for fine-tuning LLMs integrates human feedback into the training loop. In RLHF, a reward model is trained using human feedback, which can include demonstrations, corrections, or preferences. The reward model then guides the LLM by providing rewards for desirable outputs and penalties for undesirable ones. This feedback loop enables the model to iteratively improve its performance on specific tasks, making it more responsive to the nuances of human language and behavior.

Similarly to RLHF, Direct Preference Optimization (DPO)28 directly applies human preferences to influence the model’s adjustments. Unlike RLHF, which uses a reward model, DPO optimizes the decision-making processes based on binary human preferences. DPO is considered more straightforward and efficient than RLHF, as it requires less computational resources and can be executed more quickly. However, it may not capture the full range of human feedback that RLHF can, potentially limiting its effectiveness for complex tasks.
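The DPO objective reduces to a logistic loss over the policy's and the reference model's log-probabilities for a chosen and a rejected answer. A plain-Python sketch of the per-pair loss follows; the log-probability values are invented for illustration:

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log(sigmoid(margin)): small when the policy prefers the
    # chosen answer more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A positive margin drives the loss below -log(0.5) = log 2.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
assert loss < math.log(2.0)
```

When the policy and reference assign identical preferences, the margin is zero and the loss equals log 2, its value at indifference.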

ORPO, Monolithic Preference Optimization without Reference Model29, is another approach that combines elements of both RLHF and DPO. The ORPO algorithm is designed to optimize language models without the need for a reference model, which represents a notable departure from traditional methods. ORPO’s primary mechanism is its utilization of a monolithic odds ratio for preference optimization. This approach assigns a minor penalty for disfavored generation styles and a strong adaptation signal for favored responses during supervised fine-tuning (SFT). The authors demonstrate that this method is effective across various model sizes, ranging from 125M to 7B parameters. RLHF is well-suited for tasks that require a deep understanding of human values and behaviors, as it can handle diverse and nuanced feedback. In contrast, DPO is ideal for simpler tasks with clear binary preferences, offering a faster and more efficient fine-tuning process. ORPO provides a balance between the two, facilitating the use of extensive off-policy data to fine-tune models in a way that is both data-efficient and aligned with human feedback. Subsequent to the development of ORPO, the field of preference optimization has continued its rapid evolution, yielding several novel alignment techniques. Notable among these is Kahneman-Tversky Optimization (KTO)30, which further reduces the complexity of data collection by operating on labels of “desirable” and “undesirable” examples rather than requiring explicit preference pairs. Furthermore, Identity Preference Optimization (IPO)31 was developed to mitigate the training instability sometimes observed in DPO through the introduction of a regularization term. More recently, methods such as SimPO (Simple Preference Optimization)32 have aimed for even greater algorithmic simplicity and efficiency.
Collectively, these advancements underscore a clear trajectory in the field toward developing more stable, data-efficient, and computationally tractable alternatives to traditional reinforcement learning-based alignment methods.
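ORPO's odds-ratio term can likewise be sketched in a few lines on toy sequence probabilities. In the full ORPO loss this term is added, scaled by a weighting factor, to the standard SFT loss; the probability values below are invented:

```python
import math

def odds(p):
    return p / (1.0 - p)

def orpo_penalty(p_chosen, p_rejected):
    """Odds-ratio preference term: -log(sigmoid(log odds ratio))."""
    log_or = math.log(odds(p_chosen) / odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))

# Assigning higher probability to the chosen answer drives the
# penalty below -log(0.5) = log 2.
assert orpo_penalty(0.6, 0.3) < math.log(2.0)
```

Because the term operates on the model's own probabilities, no frozen reference copy of the model is needed, which is the memory advantage ORPO claims over DPO and RLHF.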

This work focuses on DPO due to its training efficiency and performance28.

Available Italian LLMs

The adaptation of Large Language Models to the Italian language constitutes an active area of research, resulting in the development of several open models. A common challenge among these models is the reliance on machine-translated English datasets, largely due to the scarcity of well-curated Italian language datasets. GPT-3.5 is a frequent choice for translations, as it performs well with texts containing code, preserving programming language syntax without erroneous translations. Additionally, generating datasets via large language models is another strategy that enables the creation of more expansive and contextually rich conversational data.

Camoscio33 builds upon the 7B parameter Meta LLaMA model. The dataset used for training in Italian is derived from a machine-translated version of the Alpaca dataset, originally created through a self-instruct approach34, where new instructions were generated by prompting the TEXT-DAVINCI-003 model. The Italian translation is handled using GPT-3.5-TURBO. The model is fine-tuned through instruction-tuning with the LORA technique and evaluated on tasks like News Summarization (using the NEWSUM-IT dataset35), Question Answering (via SQUAD-IT36), and Formality Style Transfer (utilizing the Italian portion of the XFORMAL dataset37).

Fauno38 is built on the 7B and 13B parameter versions of BAIZE39, itself a fine-tuned variant of LLAMA using a conversational dataset produced through self-chat with GPT-3.5-TURBO. In this approach, a user-seeded initial question triggers an ongoing interaction with the model. The training corpus includes datasets like StackOverflow, Quora, Alpaca, and MedQuAD40, translated into Italian by GPT-3.5-TURBO. Fine-tuning is performed using BAIZE’s LORA adapters, and the evaluation is qualitative, comparing outputs of CHATGPT, CAMOSCIO, and FAUNO.

Stambecco41 is based on both the 7B and 13B parameter versions of LLAMA. It uses two Italian datasets: the Alpaca dataset and a version called Alpaca GPT-4, generated by following the original methodology but with GPT-4. Like Camoscio, it applies instruction-tuning with LORA, though no evaluation results are provided.

Cerbero42 leverages the 7B MISTRAL model3 and performs full-parameter fine-tuning on an Italian conversational dataset generated using the LLAMA 70B chat model. To improve data quality, a diversity filter based on cosine similarity of sentence embeddings (from the DISTILUSE-BASE-MULTILINGUAL-CASED model) is applied, removing messages with similarity scores above 0.9. The authors experimented with three dataset configurations: only FAUNO data, only newly generated data, and a combination of both (referred to as Fauno, Generated, and Full, respectively). Evaluations on the SQUAD-IT36 and EVALITA benchmarks (including datasets such as AMI, IRONITA, and SENTIPOLC) reveal that the Full configuration yields the best performance.
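A diversity filter of this kind can be sketched as a greedy pass over normalized sentence embeddings. The 0.9 threshold matches the value reported for Cerbero, but the exact procedure and the wiring to the embedding model are assumptions of this sketch:

```python
import numpy as np

def diversity_filter(embeddings, threshold=0.9):
    """Greedily keep a message only if its embedding's cosine
    similarity with every already-kept message is <= threshold."""
    kept, kept_vecs = [], []
    for i, v in enumerate(embeddings):
        v = v / np.linalg.norm(v)
        if all(float(v @ u) <= threshold for u in kept_vecs):
            kept.append(i)
            kept_vecs.append(v)
    return kept

emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
# The second vector is nearly parallel to the first and is dropped.
assert diversity_filter(emb) == [0, 2]
```

In practice the embeddings would come from a sentence-transformer model; the greedy pass keeps the filter linear in the number of comparisons against the retained set.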

LLaMAntino-222 is based on Meta-AI LLaMA-2 models adapted to the Italian language through a full fine-tuning phase (i.e., continual learning). It uses the Filtered Oscar Dataset for the Italian Language released by35. Documents are removed if they contain words from a selection of the Italian and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words, or sentences that have fewer than three words, a word longer than 1,000 characters, an end symbol not matching end-of-sentence punctuation, or strings associated with JavaScript code, lorem ipsum, or policy information in Italian or English. Moreover, documents (after sentence filtering) with fewer than five sentences, fewer than 500 characters, more than 50,000 characters, or not identified as predominantly Italian by the LangDetect package are excluded from the dataset. This extensive filtering process is designed to ensure high-quality data for model training. The medium split, which contains 50M documents and 20B words (i.e., 135 GB on disk), is utilized for this purpose. LLaMAntino models are subsequently fine-tuned through SFT on the Dolly dataset43 and the EVALITA 2023 datasets44 (7B, 13B, 70B).

Two additional models in the field are Minerva and Modello Italia. The Minerva model is designed using a combination of Italian and English text from the CULTURAX dataset. It comes in three different versions, each with a distinct number of parameters: 350 million, 1 billion, and 3 billion. All versions of the model are trained using the llm-foundry library. On the other hand, Modello Italia is built on the GPT-NeoX architecture and features 9 billion parameters. It is an Italian-specific LLM, offered in both a base and an instruct variant. The training process is carried out with the litgpt library on an unspecified Italian dataset.

The approach taken with the LLaMAntino-3-ANITA model differs from existing methods. Instead of focusing solely on Italian from the outset, it leverages the strengths of a pre-trained model to produce semantically and syntactically robust text, refined through supervised fine-tuning on widely-used English datasets. DPO (Direct Preference Optimization) is then applied to enhance its safety and accuracy. Adapting the model to Italian only in the final stage leverages the extensive availability of English data, which avoids the need for translation and minimizes the associated errors.

Supervised fine-tuning

The implementation pipeline for the LLaMAntino-3-ANITA-8B-Inst-DPO-ITA model begins with the enhancement of the base meta-llama/Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model. This initial stage improves the model’s general instruction-following capabilities using English-language data.

Datasets

This step utilizes the Chat-Error/wizard_alpaca_dolly_orca (https://huggingface.co/datasets/Chat-Error/wizard_alpaca_dolly_orca) dataset, a composite dataset from the HuggingFace hub created by merging three established instruction fine-tuning corpora:

  • pankajmathur/wizardLM_orca

  • pankajmathur/dolly-v2_orca

  • pankajmathur/alpaca_orca

In total, it contains approximately 100K prompts organized into the following fields: system, instruction, input, output. The tokens << human >>: and << assistant >>: are removed for training purposes.
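How the four fields might be assembled into a single training prompt can be sketched as follows. The exact template used for ANITA is available in the project repository; this layout and the helper name are purely illustrative:

```python
def build_prompt(example):
    """Assemble one training example from the dataset's four fields."""
    parts = [example["system"], example["instruction"]]
    if example["input"]:                     # 'input' is often empty
        parts.append(example["input"])
    prompt = "\n\n".join(p for p in parts if p)
    return prompt, example["output"]

ex = {"system": "You are a helpful assistant.",
      "instruction": "Translate to Italian.",
      "input": "Good morning",
      "output": "Buongiorno"}
prompt, target = build_prompt(ex)
assert "Translate to Italian." in prompt and target == "Buongiorno"
```

During supervised fine-tuning the prompt part is fed as context and the loss is computed on the target continuation.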

WizardLM Orca is an English-language dataset for instruction-tuning. It utilizes 15 distinct system messages from the Orca research paper45 to provide context and control various aspects of the model’s output, including response length, persona, and behavior. The instruction prompt defines the specific task for the model to perform. The corresponding outputs are generated by a Teacher Model, ChatGPT (gpt-3.5-turbo-0301 version) (https://openai.com/). The dataset comprises approximately 55K prompts from the WizardLM collection (https://huggingface.co/datasets/pankajmathur/WizardLM_Orca).

Dolly-v2 Orca is the explanation-tuned version of the Dolly-V2 corpus (https://huggingface.co/datasets/databricks/databricks-dolly-15k). This corpus contains over 15k instruction-following records generated by Databricks employees. Contributors were instructed to create original content, avoiding web sources other than Wikipedia for specific categories and refraining from the use of generative AI. Part of the process also involved contributors answering questions posed by their peers, selecting only those they could answer correctly after rephrasing the original query. In this version of the dataset, outputs are obtained by prompting ChatGPT (gpt-3.5-turbo-0301 version).

Alpaca Orca is the explanation-tuned version of the Alpaca corpus (https://huggingface.co/datasets/tatsu-lab/alpaca). Alpaca provides 52k instructions and demonstrations generated by OpenAI’s text-davinci-003 engine, covering diverse domains such as health, science, and general knowledge. In this version of the dataset, outputs are obtained by prompting ChatGPT (gpt-3.5-turbo-0301 version).

Approach

The model fine-tuning is performed using a single NVIDIA A100 SXM4 64GB GPU card on the LEONARDO HPC infrastructure (https://leonardo-supercomputer.cineca.eu/about/), using the Unsloth framework (https://github.com/unslothai). Unsloth is an open-source library designed to optimize the fine-tuning of Large Language Models, making it up to 2–5 times faster while requiring up to 80% less memory than traditional methods. This efficiency allows models such as LLaMA-3, Mistral, and Gemma to be fine-tuned on free notebooks and consumer-grade hardware, and larger models to be used on less powerful GPUs. The framework supports 4-bit and 16-bit QLoRA/LoRA fine-tuning via bitsandbytes, which further reduces model size without a significant loss in performance. Unsloth achieves these optimizations through manual backpropagation engines and kernels written in OpenAI’s Triton language; because no approximation methods are used, there is no loss in accuracy. The framework is compatible with NVIDIA GPUs manufactured since 2018 and operates on Linux.

The prompts are structured using the standard Alpaca-LoRA template (https://github.com/tloen/alpaca-lora/blob/main/templates/README.md) and properly encoded through the LLaMA-3 tokenizer, i.e., adding the <|begin_of_text|> token (equivalent to the BOS token) and the <|eot_id|> token (which signifies the end of the message in a turn). All the parameters used for this step are reported in the example of fine-tuning using Unsloth and the TRL SFTTrainer (https://huggingface.co/docs/trl/sft_trainer) available on the project’s GitHub repository (https://github.com/marcopoli/LLaMAntino-3-ANITA).

Model direct preferences optimization

Following the initial supervised fine-tuning, the model undergoes Direct Preference Optimization (DPO) to refine its outputs. This step applies the DPO technique to the mlabonne/orpo-dpo-mix-40k (https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k) dataset, a collection of preference data from the HuggingFace hub.

Orpo-dpo-mix-40k is a collection of filtered and validated DPO datasets. The authors performed deep filtering to remove GPT-isms and artifacts from the responses in order to maintain dataset quality. It includes about 40k examples composed as follows:

  • argilla/Capybara-Preferences: highly scored chosen answers, score ≥ 5 (7424 samples)

  • argilla/distilabel-intel-orca-dpo-pairs: highly scored chosen answers, score ≥ 9, not in GSM8K (2299 samples)

  • argilla/ultrafeedback-binarized-preferences-cleaned: highly scored chosen answers, score ≥ 5 (22,799 samples)

  • argilla/distilabel-math-preference-dpo: highly scored chosen answers, score ≥ 9 (2181 samples)

  • unalignment/toxic-dpo-v0.2 (541 samples)

  • M4-ai/prm_dpo_pairs_cleaned (7958 samples)

  • jondurbin/truthy-dpo-v0.1 (1016 samples)

Note that orpo-dpo-mix-40k contains a subset (toxic-dpo-v0.2) designed to prompt the model to answer illegal questions; this subset is excluded from the training process described here.

Approach

The model’s DPO-tuning is performed using a single NVIDIA A100 SXM4 64GB GPU on the LEONARDO HPC infrastructure with the Unsloth framework. The process runs for one epoch over approximately 24 hours with a batch size of 4, employing a learning rate of 5e−5, which is reduced from the 2e−4 standard for supervised fine-tuning (https://github.com/huggingface/alignment-handbook). A complete list of hyperparameters is provided in the project’s GitHub repository (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/model_adaptation/dpo_llama3.py).

Italian language adaptation

The model resulting from the preceding steps exhibits characteristics suitable for adaptation to the Italian language. This adaptation is achieved by applying the full fine-tuning strategy on the gsarti/clean_mc4_it dataset (https://huggingface.co/datasets/gsarti/clean_mc4_it). Specifically, only 100k examples are randomly selected from the dataset, and the script is run for three epochs with a standard learning rate of 2e-4. All other parameters remain unchanged from the previous fine-tuning step. The prompts are formatted using the standard Meta-AI LLaMA-3 template, i.e., <|begin_of_text|> {text} <|eot_id|>.
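Wrapping each document in the LLaMA-3 special tokens amounts to simple string handling; in the actual pipeline the tokenizer inserts the corresponding token IDs itself, so this is only an illustration of the template:

```python
BOS, EOT = "<|begin_of_text|>", "<|eot_id|>"

def format_document(text):
    # Sketch of the template <|begin_of_text|> {text} <|eot_id|>;
    # the real pipeline lets the LLaMA-3 tokenizer add these IDs.
    return f"{BOS}{text}{EOT}"

s = format_document("Un documento italiano di esempio.")
assert s.startswith(BOS) and s.endswith(EOT)
```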

Model validation

The efficacy of the Italian language adaptation is validated through both training process monitoring and quantitative performance assessment. A standard check of the training loss indicates a stable and successful convergence during the fine-tuning process, as reported in Fig. 1. For quantitative validation, a comparative analysis is performed against the original Meta-AI LLaMA-3 Instruct model. The evaluation utilizes a sample of 100 question-answer pairs from the ARC-challenge, translated into Italian (https://huggingface.co/datasets/swap-uniba/arc_challenge_ita). The results indicate that the LLaMAntino-3-ANITA model shows improvements across all semantic and lexical overlap metrics. Specifically, LLaMAntino-3-ANITA achieves a BERTScore F1 of 0.6279 (Precision = 0.5831, Recall = 0.6820), compared to the 0.6215 F1 score of the base model (Precision = 0.5770, Recall = 0.6754). This trend is corroborated by the ROUGE scores, where LLaMAntino-3-ANITA consistently outperforms the original model across ROUGE-1 (0.0693 vs. 0.0647), ROUGE-2 (0.0083 vs. 0.0072), and ROUGE-L (0.0601 vs. 0.0551). Collectively, these metrics confirm that the adaptation process successfully enhances the model’s capabilities for the Italian language, yielding a modest but consistent improvement in performance over the original English-centric model.

Fig. 1
figure 1

Training loss observed during the Language Adaptation phase of the model.

Model Evaluation

The evaluation of Large Language Models (LLMs) relies on standard benchmarks46 (i.e., the EleutherAI Language Model Evaluation Harness and the HuggingFace Open LLM Leaderboard) to compare performance across a range of tasks. These benchmarks consist of structured datasets and evaluation metrics designed to test different aspects of language understanding, generation, and reasoning. The Massive Multitask Language Understanding (MMLU) dataset47,48, for example, tests LLMs on subjects from STEM to social sciences, measuring the model’s general knowledge and reasoning ability. HellaSwag49 focuses on commonsense reasoning, challenging LLMs to complete passages that require an understanding of nuanced context. The dataset presents scenarios with multiple-choice endings, where only one is common-sensically correct, requiring LLMs to move beyond pattern recognition to a deeper comprehension of the physical world. The AI2 Reasoning Challenge (ARC-Challenge)50 tests LLMs on grade-school science questions, demanding both general knowledge and reasoning abilities. This benchmark evaluates the ability to answer complex science questions that require logical reasoning, a capability relevant for educational AI applications, automated tutoring systems, and general knowledge assessments. Similarly, TruthfulQA51 measures the extent to which models mimic human falsehoods. It assesses the propensity of LLMs to repeat false information, a critical aspect given the potential for disseminating misinformation. The benchmark includes questions designed to elicit responses containing popular misconceptions, evaluating the truthfulness and informativeness of the answers. Winogrande52 assesses the ability of LLMs to solve pronoun disambiguation problems, a task fundamental to understanding semantic relationships within a sentence. Finally, GSM8K53 is a dataset of grade-school math problems that test the mathematical reasoning abilities of LLMs.
It requires models to generate a correct final answer while also demonstrating the step-by-step reasoning process used to arrive at the solution. Accuracy is a standard metric for benchmarking LLMs across tasks with clear correct or incorrect answers, such as classification and question-answering54. It measures the proportion of correct predictions relative to the total number of predictions. As reported in Table 1, accuracy (Acc.) is adopted as the standard evaluation metric for LLaMAntino-3-ANITA in Winogrande, TruthfulQA, MMLU, HellaSwag, and the AI2 Reasoning Challenge. For multiple-choice tasks like HellaSwag and the AI2 Reasoning Challenge, normalized accuracy (Acc.Norm.) is used to provide a fair comparison by accounting for the varying number of answer choices.
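The difference between plain and normalized accuracy can be sketched as follows. In the EleutherAI harness, acc_norm divides each answer choice's log-likelihood by its length (in bytes) before picking the best choice; the numbers below are invented for illustration:

```python
def pick(loglikelihoods, lengths=None):
    """Pick the answer index with the highest (optionally
    length-normalized) log-likelihood."""
    scores = loglikelihoods if lengths is None else [
        ll / n for ll, n in zip(loglikelihoods, lengths)]
    return max(range(len(scores)), key=scores.__getitem__)

lls = [-12.0, -8.0, -20.0]       # one log-likelihood per choice
lens = [4, 3, 40]                # answer lengths in bytes
assert pick(lls) == 1            # plain accuracy picks the short answer
assert pick(lls, lens) == 2      # normalization favors the long answer
```

Without normalization, longer answers are systematically penalized because each extra token adds a negative log-likelihood term; dividing by length removes that bias.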

Table 1 Evaluation of different LLMs on state-of-the-art English datasets by using the EleutherAI Language Model Evaluation Harness library.

In the context of GSM8K, performance is assessed using Strict-match and Flexible-extract as evaluation metrics. These methods evaluate both the final correct answer and the logical steps involved in reaching the solution53. Strict match is an exact evaluation metric where the model’s entire solution, including the final answer and each calculation step, must precisely match the expected output. Flexible extract is a more lenient metric where the model’s output is considered correct if the final answer is correct and the reasoning process is logically sound, even if the intermediate steps or formatting differ from the expected solution.
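A lenient answer extractor of this kind can be sketched with a regular expression that takes the last number in the generation; the harness's actual extraction patterns may differ:

```python
import re

def flexible_extract(output):
    """Take the last number in the model's output as its final
    answer, a common lenient extraction for GSM8K-style scoring."""
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", output)
    return nums[-1].replace(",", "") if nums else None

out = "She buys 3 packs of 4 eggs, so 3 * 4 = 12. The answer is 12."
assert flexible_extract(out) == "12"
```

A strict-match check would instead compare the full formatted solution string against the reference, so any deviation in the intermediate steps counts as an error.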

The aforementioned datasets and metrics are used to evaluate the English-language performance of the models. The testing protocol is executed using the EleutherAI Language Model Evaluation Harness (https://github.com/EleutherAI/lm-evaluation-harness) on four NVIDIA A100 SXM4 64GB GPUs (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/evaluation/job_evaluation.slurm). This evaluation enables a direct comparison of the LLaMAntino-3-ANITA model with other state-of-the-art LLMs of similar size and architecture. Specifically, the comparison includes: meta-llama/Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), cloudyu/Meta-Llama-3-8B-Instruct-DPO (https://huggingface.co/cloudyu/Meta-Llama-3-8B-Instruct-DPO), and DeepMount00/Llama-3-8b-Ita (https://huggingface.co/DeepMount00/Llama-3-8b-Ita). Due to computational constraints, a comparison with larger models was not performed. The obtained results are reported in Table 1.

On the Winogrande commonsense reasoning task, LLaMAntino-3-ANITA achieves the highest accuracy (0.7609). The other models, including Meta-Llama-3-8B-Instruct (0.7182) and cloudyu/Meta-Llama-3-8B-Instruct-DPO (0.7348), show competitive but lower scores. The DeepMount00/Llama-3-8b-Ita model obtains an accuracy of 0.7490.

Furthermore, on the TruthfulQA task, which evaluates the ability to discern factual accuracy, LLaMAntino-3-ANITA emerges as the highest performing model, achieving an accuracy of 0.7124. This represents a notable increase over Meta-Llama-3-8B-Instruct (0.4397) and cloudyu/Meta-Llama-3-8B-Instruct-DPO (0.5404). The DeepMount00/Llama-3-8b-Ita model also shows a higher accuracy of 0.5881, though it remains below the performance of LLaMAntino-3-ANITA. This score suggests a particular aptitude for tasks involving complex reasoning about truthfulness, possibly due to its specialized fine-tuning.

The narrow performance spread across all models on the MMLU general knowledge benchmark indicates a comparable ability to process and retrieve information from diverse topics. DeepMount00/Llama-3-8b-Ita obtains the highest accuracy (0.6411), marginally exceeding Meta-Llama-3-8B-Instruct (0.6397). cloudyu/Meta-Llama-3-8B-Instruct-DPO scores 0.6366, and LLaMAntino-3-ANITA records the lowest performance with 0.6354.

The HellaSwag task, designed to assess commonsense reasoning in dynamic contexts, is again led by LLaMAntino-3-ANITA, which achieves the highest accuracy (0.7430) and normalized accuracy (0.8856). The model substantially outperforms all others, including DeepMount00/Llama-3-8b-Ita (0.648 and 0.8304) and Meta-Llama-3-8B-Instruct (0.5767 and 0.7586). The normalized accuracy metric, which accounts for the varying number of answer choices, reinforces this result.

The performance on the GSM8K task is noteworthy because Meta-Llama-3-8B-Instruct outperforms all other models. It achieves a strict-match accuracy of 0.7551 and a flexible-extract accuracy of 0.7536. cloudyu/Meta-Llama-3-8B-Instruct DPO (0.7195 and 0.7172) and DeepMount00/Llama-3-8b-Ita (0.6816 and 0.6823) obtain lower scores, while LLaMAntino-3-ANITA (0.6035 and 0.6088) scores considerably lower. These results suggest that Meta-Llama-3-8B-Instruct is more effective for the GSM8K task.

LLaMAntino-3-ANITA also leads on the ARC Challenge task, which evaluates reasoning on questions from academic sources. It achieves an accuracy of 0.6775 and a normalized accuracy of 0.6988. In contrast, DeepMount00/Llama-3-8b-Ita (0.6715 and 0.6732) obtains a competitive score, while cloudyu/Meta-Llama-3-8B-Instruct-DPO (0.477 and 0.506) scores substantially lower.

The average performance across all tasks confirms LLaMAntino-3-ANITA as the most consistent model, achieving an average score of 0.7029, the highest across all models. In contrast, Meta-Llama-3-8B-Instruct and cloudyu/Meta-Llama-3-8B-Instruct-DPO achieve averages of 0.6627 and 0.6331, respectively, while DeepMount00/Llama-3-8b-Ita achieves an average of 0.6851.

To benchmark against state-of-the-art models, the ANITA model is submitted to the HuggingFace Open LLM Leaderboard (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), a dynamic resource that tracks the performance of LLMs through continuously updated evaluations. This approach contrasts with static benchmarks by reflecting the rapid pace of development in the field. Figure 2 presents a snapshot of the leaderboard’s state at the time of evaluation. Due to the frequent emergence of new models and techniques, these rankings are subject to change. Therefore, while the results are notable, they should be considered within the context of a rapidly evolving field where new state-of-the-art benchmarks are continually being established.

Fig. 2

HuggingFace Open LLM Leaderboard results.

The ANITA model is further evaluated on datasets dedicated to Italian by submitting it to the Open Italian LLM Leaderboard (https://huggingface.co/spaces/FinancialSupport/open_ita_llm_leaderboard), which focuses on benchmarks translated into Italian. The results, summarized in Table 2 and Fig. 3, compare ANITA with other Italian-language models of similar size.

Table 2 Results of different LLMs on the Open Italian LLMs Leaderboard.
Fig. 3

Results of different LLMs on the Open Italian LLMs Leaderboard (radar chart visualization).

On the mmlu it task, DeepMount00/Lexora-Medium-7B achieves the highest normalized accuracy (0.6863), indicating high proficiency in tasks requiring extensive world knowledge. Other models, such as DeepMount00/Llama-3.1-8b-Ita (0.5899) and LLaMAntino-3-ANITA (0.5672), show competitive performance but do not match Lexora-Medium-7B on this task. On the arc challenge it task, the performance variation among models is smaller. The highest score is achieved by LLaMAntino-3-ANITA with 0.5714; other models, such as anakin87/Llama-3-8b-ita-slerp and ExperimentLab/Llama-3-8b-Ita-Boost, reach the same performance. The relatively low scores across all models suggest that this benchmark remains a considerable challenge, highlighting an area for future improvement.

In the hellaswag it task, LLaMAntino-3-ANITA achieves a normalized accuracy of 0.7093. This is the highest score obtained by any model on any of the three tasks, indicating that LLaMAntino-3-ANITA exhibits a particular strength in tasks involving sequential or commonsense reasoning. Other high-performing models on this task include DeepMount00/Llama-3.1-8b-Ita (0.6617) and DeepMount00/Mistral-Ita-7b (0.6728), which score relatively high but do not match LLaMAntino-3-ANITA's leading performance.

The overall average performance confirms LLaMAntino-3-ANITA-8B-Inst-DPO-ITA as the top performer with an average accuracy of 0.6160, slightly exceeding that of Lexora-Medium-7B (0.6150). This result, combined with its leading score on the hellaswag it task, indicates the model's balanced proficiency across both general knowledge and commonsense reasoning in Italian.
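As a sanity check, the reported overall average is consistent with a simple macro-average of the three per-task normalized accuracies quoted above (the task keys below are illustrative labels, not the leaderboard's exact identifiers):

```python
# Macro-average of LLaMAntino-3-ANITA's normalized accuracies on the three
# Open Italian LLM Leaderboard tasks, as quoted in the text.
scores = {
    "mmlu_it": 0.5672,
    "arc_challenge_it": 0.5714,
    "hellaswag_it": 0.7093,
}
average = sum(scores.values()) / len(scores)
print(round(average, 4))  # 0.616, reported as 0.6160
```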

Ready-to-run applications

The LLaMAntino-3-ANITA-8B-Inst-DPO-ITA model is applicable to a multitude of scenarios. This section presents several examples, each accompanied by a script to facilitate future work.

Retrieval-Augmented Generation (RAG) Retrieval-Augmented Generation55 is an approach in Information Retrieval that integrates generative models with external knowledge bases. This technique enhances LLM capabilities by enabling them to access and incorporate information from external databases during text generation56. This integration allows LLMs to generate factually grounded, contextually relevant responses, addressing a key limitation of traditional models: the generation of "hallucinated" information. The RAG process involves two steps: retrieval and generation. In the retrieval step, the model uses the input prompt to query an external database for relevant documents. These retrieved documents are then fed into the generative component, which synthesizes the external data with its pre-existing knowledge to generate a coherent and informed response. This process improves the accuracy and reliability of the model's outputs, making RAG particularly suitable for applications such as question-answering systems where precision is required. The proposed model can function as the core component of common RAG frameworks like LlamaIndex (https://www.llamaindex.ai/) and LangChain (https://www.langchain.com/)57. The model's 8K input context size and proficiency in Italian make it suitable for a wide range of RAG applications. An example of LLaMAntino-3-ANITA application in a RAG system is available in the project repository (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/use_examples/Llamaindex_LangChain.ipynb).
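The two-step retrieve-then-generate loop can be sketched without any framework. In this minimal sketch, a bag-of-words cosine similarity stands in for a real embedding model, and the assembled prompt would be passed to the generative model; all names, documents, and prompt wording are illustrative, not taken from the linked notebook.

```python
import math
import re
from collections import Counter

def bow_vector(text):
    # Bag-of-words term counts; a real system would use dense embeddings.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    # Retrieval step: rank documents by similarity to the query.
    q = bow_vector(query)
    return sorted(documents, key=lambda d: cosine(q, bow_vector(d)), reverse=True)[:k]

def build_prompt(query, documents, k=2):
    # Generation step input: augment the query with the retrieved context.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (f"Answer using only the context below.\nContext:\n{context}\n"
            f"Question: {query}\nAnswer:")

docs = [
    "Rome is the capital of Italy.",
    "The Po is the longest river in Italy.",
    "Mount Etna is an active volcano in Sicily.",
]
prompt = build_prompt("What is the capital of Italy?", docs)
print(prompt)
```

In a full pipeline, the resulting prompt would be sent to LLaMAntino-3-ANITA (e.g. through LlamaIndex or LangChain) to produce the grounded answer.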

Topic modeling

Topic Modeling58 is an unsupervised machine learning technique for discovering latent thematic structures in a text corpus. Contemporary approaches to this task often utilize LLMs. BERTopic59 is one such method, which leverages transformer models to generate contextualized document embeddings. It then uses a class-based TF-IDF method to form dense clusters, resulting in interpretable topic representations. BERTopic supports various techniques, including guided, supervised, and semi-supervised learning, making it a versatile tool. LLaMAntino-3-ANITA can be used as the embedding model within this framework. An example is provided (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/use_examples/Topic_Modeling_with_Llama3.ipynb) demonstrating its use as a backbone for BERTopic to obtain accurate and robust results.
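The class-based TF-IDF (c-TF-IDF) scoring at the heart of BERTopic can be sketched in a simplified form: documents are grouped into clusters (given directly here, whereas BERTopic derives them from the document embeddings), and terms are then scored per cluster so that the top-scoring terms describe the topic. The stopword list and example documents are illustrative.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "in", "to", "of"}

def tokenize(text):
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]

def c_tf_idf(clusters):
    # Term frequencies per cluster and across the whole corpus.
    per_cluster = {c: Counter(tokenize(" ".join(docs))) for c, docs in clusters.items()}
    corpus_freq = Counter()
    for tf in per_cluster.values():
        corpus_freq.update(tf)
    avg_words = sum(corpus_freq.values()) / len(clusters)  # avg words per cluster
    # Normalized in-cluster frequency, down-weighted by corpus-wide frequency.
    return {
        c: {t: (f / sum(tf.values())) * math.log(1 + avg_words / corpus_freq[t])
            for t, f in tf.items()}
        for c, tf in per_cluster.items()
    }

clusters = {
    "sports": ["the team won the match", "a great goal in the match"],
    "finance": ["the bank raised interest rates", "markets react to interest rates"],
}
scores = c_tf_idf(clusters)
top_sport = max(scores["sports"], key=scores["sports"].get)
print(top_sport)  # match
```

In the actual BERTopic pipeline, this scoring is applied after embedding, dimensionality reduction, and clustering, with the embedding step being where LLaMAntino-3-ANITA can be plugged in.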

Sentiment analysis

Sentiment Analysis60, or opinion mining, is an NLP subfield focused on identifying and categorizing opinions in text to determine sentiment polarity (positive, negative, or neutral). The technical implementation involves data preprocessing, feature extraction, and classification using algorithms trained on labeled datasets. Advanced models like LLMs can improve accuracy in such tasks by better understanding word context61. To this end, a Python script is provided (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/use_examples/LLama_3_for_SentimentAnalysis.ipynb) for fine-tuning LLaMAntino-3-ANITA on a sentiment analysis dataset and for using it as a zero-shot classifier.
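The zero-shot setting can be sketched as prompt construction plus reply normalization: the model is asked to answer with a single label, and its free-form reply is mapped to one of the three classes. The `generate` callable is a stand-in for a call to LLaMAntino-3-ANITA (e.g. via the HuggingFace `pipeline` API); the prompt wording is illustrative, not taken from the linked notebook.

```python
LABELS = ("positive", "negative", "neutral")

def build_prompt(text):
    return ("Classify the sentiment of the following Italian text as "
            "positive, negative, or neutral. Reply with one word only.\n"
            f"Text: {text}\nSentiment:")

def parse_label(reply, default="neutral"):
    # Normalize a free-form model reply to one of the three classes.
    reply = reply.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return default  # fall back when the model answers off-format

def classify(text, generate):
    return parse_label(generate(build_prompt(text)))

# Placeholder generator for demonstration; a real run queries the model.
fake_generate = lambda prompt: " Positive."
print(classify("Che bel film!", fake_generate))  # positive
```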

Recommender systems

Recommender Systems (RecSys)62 are algorithms that predict user interest in items. While deep learning has advanced RecSys, challenges in understanding user preferences and providing explanations remain. Integrating LLMs can address these issues by generating more personalized, contextually relevant, and interpretable recommendations63. A basic example of how to use LLaMAntino-3-ANITA for this task is provided in the project repository (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/use_examples/SeqRecSys_LLM_Zero_Shot.ipynb).
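A zero-shot sequential recommendation can be sketched as prompt construction: the user's interaction history is serialized into a prompt asking the model to rank candidate items and explain its choices. The item names and prompt wording below are illustrative, not taken from the linked notebook.

```python
def build_rec_prompt(history, candidates, k=3):
    # Serialize the ordered interaction history and the candidate pool
    # into a ranking request for the LLM.
    hist = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(history))
    cands = ", ".join(candidates)
    return ("A user watched these films, in order:\n"
            f"{hist}\n"
            f"From the candidates [{cands}], recommend the {k} most likely "
            "next films, ranked, with a brief reason for each.")

rec_prompt = build_rec_prompt(
    history=["La vita è bella", "Nuovo Cinema Paradiso"],
    candidates=["La grande bellezza", "Il postino", "Perfetti sconosciuti"],
)
print(rec_prompt)
```

The resulting prompt would then be passed to LLaMAntino-3-ANITA, whose free-text ranking provides both the recommendation and its explanation.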

Dialogue

The integration of LLMs into dialogue applications presents both opportunities and challenges. LLMs can provide companionship, entertainment, and information, functioning as virtual chat partners64 and agentic AI65,66. However, ensuring that these interactions are coherent, ethical, and safe is a priority. A common method for implementing a chatbot is through a graphical user interface. The following section presents an example of the LLaMAntino-3-ANITA model's application in a dialogue context, demonstrating interaction via the HuggingFace Transformers Python library.
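Multi-turn dialogue state can be sketched in the message format used by the Transformers chat-template API: a list of role/content dictionaries that `tokenizer.apply_chat_template` would turn into the model's prompt. The generator below is a placeholder; a real loop would template the history and call `model.generate`.

```python
def new_chat(system_prompt):
    # A conversation starts with a single system message.
    return [{"role": "system", "content": system_prompt}]

def add_turn(history, user_msg, generate):
    # Append the user turn, obtain a reply, and append the assistant turn.
    history.append({"role": "user", "content": user_msg})
    reply = generate(history)  # e.g. apply_chat_template + model.generate
    history.append({"role": "assistant", "content": reply})
    return reply

chat = new_chat("Sei ANITA, un assistente che risponde in italiano.")
echo = lambda h: f"(risposta a: {h[-1]['content']})"  # placeholder generator
add_turn(chat, "Ciao! Chi sei?", echo)
print(len(chat), chat[-1]["role"])  # 3 assistant
```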

Interaction example

figure a

A graphical user interface for LLaMAntino-3-ANITA, presented in Fig. 4, can be run locally via the provided Python script (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/use_examples/User_Interface.ipynb) or accessed publicly (from an Italian-based internet connection) at the following URL: http://chat.llamantino.it/

Fig. 4

The publicly released LLaMAntino-3-ANITA-8B-Inst-DPO-ITA user interface.

General considerations and limits of the approach

The LLaMAntino-3-ANITA-8B-Inst-DPO-ITA model constitutes a resource for the Italian research and industry communities in natural language processing (NLP). Analogous to the support provided by the AlBERTo model67,68,69 in prior years, LLaMAntino-3-ANITA is a model tailored to the language and context of Italian culture. Its accessibility, adaptability, and ease of specialization ensure its continued relevance in addressing Italy-specific NLP tasks. The model demonstrates wide adoption, with monthly downloads averaging approximately 8,000 and an estimated total of 115,000 downloads on Hugging Face. This high adoption rate, coupled with the development of many derivative models, underscores its utility and adaptability as a base for task-specific fine-tuning. This adaptability allows organizations to customize the model for varied applications across domains such as the legal, financial, and customer service sectors. The application of LLaMAntino-3-ANITA extends beyond academia to major Italian corporations, with approximately 7–8 large companies requesting support for its integration into their operational workflows. This interest from industry indicates that the model addresses a need for specialized Italian-language NLP resources. Companies in sectors with unique linguistic needs benefit from models trained on Italian data that can handle the nuances of the language for tasks ranging from sentiment analysis to customer service automation.

The public release of the training protocol, code, and model weights facilitates reproducibility and transparency within the research community. This allows researchers to replicate experiments, assess performance, and benchmark the model against emerging alternatives. As such, LLAMantino-3-ANITA-8B-Inst-DPO-ITA serves as a validated baseline for future advancements in localized AI models for Italian, promoting language diversity and reducing reliance on generalized models that lack local specificity.

Despite its strengths, LLaMAntino-3-ANITA faces challenges from the rapid evolution of model architectures. Recent models such as Phi370 and Meta AI's newest LLMs (https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) already demonstrate superior performance in Italian, benefiting from broader multilingual data and larger training sets. Consequently, LLaMAntino-3-ANITA has already been surpassed on competitive NLP leaderboards by newer models that integrate the latest architectural advancements (October 2025), highlighting the need for periodic re-training to maintain its competitiveness. Furthermore, while LLaMAntino-3-ANITA is tailored to the Italian language, the trend toward large multimodal models trained on extensive multilingual datasets presents an alternative. Users may prefer these generalized models for their higher performance across diverse NLP tasks. Therefore, while the model remains valuable for specific applications, it must continue to evolve to stay competitive with more complex architectures.

Italian-specific cultural biases and ethical considerations

While LLaMAntino-3-ANITA is expressly adapted for Italian, it remains susceptible to culturally specific biases from its training data. Large language models (LLMs) often internalize the value systems dominant in their training corpora, which can cause systematic harms when deployed in specific cultural contexts71,72. Addressing this requires identifying salient Italian-specific bias axes and establishing rigorous, culturally grounded evaluation and mitigation protocols. A primary ethical risk involves the model's handling of Italian's grammatical gender and the common use of the maschile sovraesteso, which can entrench gender stereotypes by defaulting to masculine forms and associating genders with traditional roles71,73. Further axes of bias include regional stereotypes between Northern, Central, and Southern Italy; negative associations with specific nationalities linked to migration discourse; and a potential default to Catholic majority norms that marginalizes minority religious practices72,74. The model must also navigate linguistic nuances such as politeness conventions (tu/Lei), the use of honorifics, and the sociolinguistic status of dialects and regional languages (e.g., Sardo, Neapolitan), where sparse coverage risks misclassification or stigmatization71,75. Finally, politically charged topics, from historical memory of Fascism to contemporary debates on civil rights, require careful neutrality constraints to avoid biased outputs76. To strengthen the model's ethical integrity, we recommend a layered evaluation protocol that moves beyond translated benchmarks. This should incorporate the ITALIAN PROMPT ASSOCIATION TEST to probe implicit social biases, supplemented by counterfactual evaluations that measure output disparities when sensitive attributes like gender or region are altered in parallel prompts71,74.
Furthermore, a robust assessment requires Italian-focused safety audits for toxicity and harmful instructions, alongside established benchmarks for hate speech and misogyny like HASPEEDE and AMI76,77,78. These quantitative measures should be complemented by capability parity checks to ensure that safety controls do not disproportionately degrade performance for non-standard Italian varieties75. Mitigation must be an ongoing process integrated into the model's lifecycle. This includes curating and augmenting training data to balance regional representation and include gender-inclusive language, and extending preference optimization (DPO) to penalize stereotypical outputs and reward culturally appropriate responses73. A dedicated Italian safety policy, supported by local red-teaming and transparent reporting of known failure modes in the model card, is essential for responsible deployment76. Despite these measures, limitations will persist, including coverage gaps for dialects, the trade-off between safety and over-refusal, and the need for periodic re-training to maintain cultural alignment and competitiveness75. Integrating these targeted audits and mitigation strategies provides concrete evidence of responsible development and clarifies residual risks for all stakeholders. We intend to address these issues in future work building on the results presented here.

Conclusion

This work presents LLaMAntino-3-ANITA-8B-Inst-DPO-ITA, a Large Language Model fine-tuned specifically for the Italian language. The experimental results indicate the model’s high performance and versatility. The model demonstrates a proficient understanding of Italian nuances, handling various linguistic tasks with a high degree of accuracy. The model is suitable for deployment in several application scenarios, including information retrieval, topic modeling, sentiment analysis, recommender systems, and conversational agents. Its effectiveness in these areas can enhance academic research and provide practical solutions for industry.

The development of this model demonstrates the value of creating language-specific resources, particularly for languages underrepresented in the digital domain. Future research directions are manifold. A primary avenue involves applying this multi-stage adaptation methodology to larger base models, such as the 70B-parameter or larger versions of Meta AI's LLaMA models or subsequent architectures, to evaluate the scalability of the approach and potentially set new performance benchmarks. Furthermore, the model's capabilities could be assessed on a broader range of specialized NLP tasks, including legal document analysis, clinical text summarization, and creative content generation. Finally, the pipeline provides a robust framework for adaptation to other languages, contributing to a more linguistically inclusive AI ecosystem. Continued exploration in these areas, guided by ethical considerations and responsible AI practices, is essential for advancing the field. Indeed, this ongoing effort is underscored by the recent release of new multimodal and multilingual models based on the ANITA paradigm (https://huggingface.co/m-polignano/ANITA-NEXT-24B-Magistral-2506-VISION-ITA), demonstrating a sustained focus on advancing specialized AI resources.