Introduction

Recent releases of large-scale language models, such as GPT1, LLaMA2, Mistral3, and DeepSeek4, continue to advance the field of natural language processing (NLP). LLMs are now commonly adopted in applications such as the automation of customer-support chats5, where they generate coherent and contextually relevant responses by exploiting large textual corpora and performing complex reasoning tasks. In the field of dialogue systems, these models produce human-like text and sustain coherent conversations thanks to their advanced language understanding capabilities6,7.

Fine-tuning and preference optimization represent crucial processes for large language models. These processes allow the models to customize their responses to specific contexts, thereby enhancing the quality of generated text and improving user interactions8. Furthermore, preference optimization methods such as RLHF (Reinforcement Learning from Human Feedback)9,10 further enhance the models’ responses by learning from human interactions, ensuring that the model’s outputs align with user preferences and feedback. As large language models evolve, ethical and regulatory considerations necessitate greater attention. Understanding the implications of AI-generated content, such as determining copyright ownership of AI-generated works, is a prerequisite for ensuring legal clarity and accountability. Moreover, emphasizing explainable AI and transparency is essential for building user trust and ensuring effective collaboration between AI systems and human users11,12.

However, these models exhibit limitations when applied to low-resource languages and niche domains. The primary issue lies in their limited capacity to effectively adapt to low-resource and unseen languages, which constrains their performance in such contexts13,14. Despite the existence of cross-lingual model transfer methods that utilize parallel corpora to connect high-resource and low-resource languages, the adaptability of these models remains restricted by their inherent limitations15. Techniques like synthetic treebanking have been explored to facilitate parsing for low-resource languages, but their effectiveness is limited by the constraints of the models16. The “curse of multilinguality” also presents a challenge, as the adaptability of multilingual models may result in suboptimal representations for individual languages in niche domains17,18. The limitations of large language models in low-resource languages and niche domains underscore the necessity for tailored solutions and specialized adaptations. While techniques like prompt tuning, few-shot learning, and fine-tuning have demonstrated potential in customizing models for specific tasks19, addressing the inherent constraints of these models across diverse linguistic and application contexts remains an essential area of research20. Successfully addressing these challenges is a prerequisite for the widespread practical application of large language models.

Although Italian is widely spoken globally, it is often underrepresented in large models released by international companies, as exemplified by the low percentage of Italian-language data used to train the Meta LLaMA-2 model21. This challenge is partially addressed by the release of “LLaMAntino”22, the first family of Large Language Models based on Meta-AI LLaMA models adapted for the Italian language. The family provides models intended for open use on Italian-language tasks, with the possibility of further adaptations and releases; their development relies on open, reliable, and reusable data. The ANITA (Advanced Natural-based interaction for the ITAlian language) project (https://huggingface.co/swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA) continues this line of research by building upon the evolution of LLaMA models, in particular one of the latest LLaMA-3 versions23 available when this work was conducted (April 2024). The presented model incorporates several enhancements over its predecessors, including a reduced size, adaptation to user preferences, and support for quantized versions. Its effectiveness is demonstrated through rigorous evaluation and multiple application examples in scientific and business contexts.

Related work

Despite the ability of LLMs to correctly answer a long list of general questions in English, their adaptation to specific languages or tasks is often necessary24. The traditional approach of fully fine-tuning these models for specific tasks is computationally expensive and memory-intensive. Parameter-Efficient Fine-Tuning (PEFT)25 methods address this challenge by adapting LLMs to new tasks while updating only a small subset of the model parameters, thus reducing the computational load while preserving model performance across tasks. Full fine-tuning of LLMs presents several difficulties, requiring substantial memory for model weights, optimizer states, gradients, and forward activations during training. As LLMs grow in size, reaching hundreds of gigabytes, the memory requirements become prohibitive, especially on consumer hardware. Moreover, full fine-tuning can lead to catastrophic forgetting, where a model loses its performance on previously learned tasks when adapted to new ones; this complicates the use of a single LLM for multiple tasks without compromising its efficiency. Techniques such as Low-Rank Adaptation (LoRA) modify only a small part of the model’s weight matrices, while Prompt Tuning introduces learned prompts that guide the model to generate task-specific responses without extensive retraining. These methods enhance the efficiency of LLMs in zero-shot classification tasks, especially in low-resource settings where only a few examples per class are available.

LLaMAntino-3-ANITA is grounded on LoRA26, which introduces low-rank matrices that capture the changes needed for adaptation. In a Transformer model, each layer contains weight matrices, such as those of the attention and feed-forward networks. Rather than updating a weight matrix W directly, LoRA represents its update as the product of two much smaller matrices, A and B, so that the adapted weight becomes W + B × A. This decomposition substantially reduces the number of parameters that need to be updated during fine-tuning: only the low-rank matrices A and B are trained, while the original pre-trained weights remain frozen.

With this approach, the model learns task-specific adaptations with a minimal increase in the number of trainable parameters. By training only a small fraction of the model’s parameters, LoRA facilitates efficient adaptation to new tasks without the computational overhead of traditional fine-tuning methods. This permits the fine-tuning of LLMs on consumer-grade hardware and their wider deployment.
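The parameter saving behind LoRA can be illustrated with a minimal NumPy sketch; all dimensions and variable names here are illustrative, not taken from the ANITA configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 64, 8          # layer dims and LoRA rank (r << d, k)
W = rng.normal(size=(d, k))  # frozen pre-trained weight

# Trainable low-rank factors: B starts at zero so the adapted
# layer initially behaves exactly like the frozen one.
A = rng.normal(scale=0.01, size=(r, k))
B = np.zeros((d, r))

def adapted_forward(x):
    # W stays frozen; only A and B would receive gradient updates.
    return x @ (W + B @ A).T

x = rng.normal(size=(1, k))
# With B = 0 the adapted output equals the frozen output.
assert np.allclose(adapted_forward(x), x @ W.T)

full = d * k
lora = r * (d + k)
print(f"trainable params: {lora} vs full fine-tune: {full}")
```

For this toy layer, the low-rank factors hold 1,024 trainable values against 4,096 in the full matrix; the saving grows with the layer size since the factor count scales as r(d + k) rather than dk.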

In some scenarios, LoRA is insufficient for training a model due to hardware limitations. QLoRA, which stands for Quantized Low-Rank Adaptation27, builds on the principles of LoRA by incorporating quantization into the fine-tuning process. Quantization is a process that reduces the numerical precision of a model’s tensors, typically converting them from high-precision floating-point numbers to lower-precision representations, such as 8-bit or 4-bit integers. The primary goal of QLoRA is to maintain the performance of LLMs while substantially reducing their memory footprint, which facilitates the fine-tuning and deployment of these models on less powerful hardware with limited resources. QLoRA combines low-rank matrix adaptation with quantization: the low-rank adaptation reduces the number of parameters that need to be updated during fine-tuning, while quantization further compresses the model size by mapping the floating-point weights to a more memory-efficient fixed-point representation. This dual approach facilitates the fine-tuning of LLMs with billions of parameters on relatively small GPUs, making advanced language processing capabilities more accessible. QLoRA is the technique chosen for the model in this study.
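The memory saving from quantization can be sketched with simple symmetric 8-bit rounding. Note that QLoRA itself uses a 4-bit NormalFloat data type with per-block quantization constants, so this sketch shows only the underlying idea, not QLoRA's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 4)).astype(np.float32)  # toy weight matrix

# Symmetric 8-bit quantization: map floats to int8 with a single
# per-tensor scale derived from the largest absolute weight.
scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize for the forward pass; the rounding error is bounded
# by the quantization step.
w_hat = q.astype(np.float32) * scale
assert np.abs(w - w_hat).max() <= scale

print(f"storage: {w.nbytes} bytes fp32 -> {q.nbytes} bytes int8")
```

Storage drops by a factor of four here (fp32 to int8); a 4-bit representation, as in QLoRA, roughly doubles that saving again at the cost of a coarser grid.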

To align the model’s outputs with human values and preferences, Reinforcement Learning from Human Feedback (RLHF)9 is commonly adopted. This method for fine-tuning LLMs integrates human feedback into the training loop. In RLHF, a reward model is trained using human feedback, which can include demonstrations, corrections, or preferences. The reward model then guides the LLM by providing rewards for desirable outputs and penalties for undesirable ones. This feedback loop enables the model to iteratively improve its performance on specific tasks, making it more responsive to the nuances of human language and behavior.

Similarly to RLHF, Direct Preference Optimization (DPO)28 directly applies human preferences to influence the model’s adjustments. Unlike RLHF, which uses a reward model, DPO optimizes the decision-making processes based on binary human preferences. DPO is considered more straightforward and efficient than RLHF, as it requires less computational resources and can be executed more quickly. However, it may not capture the full range of human feedback that RLHF can, potentially limiting its effectiveness for complex tasks.
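The DPO objective reduces to a logistic loss over the policy's and the reference model's log-probabilities for a chosen and a rejected answer. A plain-Python sketch of the per-pair loss follows; the log-probability values are invented for illustration:

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log(sigmoid(margin)): small when the policy prefers the
    # chosen answer more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A positive margin drives the loss below -log(0.5) = log 2.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
assert loss < math.log(2.0)
```

When the policy and reference assign identical preferences, the margin is zero and the loss equals log 2, its value at indifference.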

ORPO, Monolithic Preference Optimization without Reference Model29, is another approach that combines elements of both RLHF and DPO. The ORPO algorithm is designed to optimize language models without the need for a reference model, which represents a notable departure from traditional methods. ORPO’s primary mechanism is its utilization of a monolithic odds ratio for preference optimization. This approach assigns a minor penalty for disfavored generation styles and a strong adaptation signal for favored responses during supervised fine-tuning (SFT). The authors demonstrate that this method is effective across various model sizes, ranging from 125M to 7B parameters. RLHF is well-suited for tasks that require a deep understanding of human values and behaviors, as it can handle diverse and nuanced feedback. In contrast, DPO is ideal for simpler tasks with clear binary preferences, offering a faster and more efficient fine-tuning process. ORPO provides a balance between the two, facilitating the use of extensive off-policy data to fine-tune models in a way that is both data-efficient and aligned with human feedback. Subsequent to the development of ORPO, the field of preference optimization has continued its rapid evolution, yielding several novel alignment techniques. Notable among these is Kahneman-Tversky Optimization (KTO)30, which further reduces the complexity of data collection by operating on labels of “desirable” and “undesirable” examples rather than requiring explicit preference pairs. Furthermore, Identity Preference Optimization (IPO)31 was developed to mitigate the training instability sometimes observed in DPO through the introduction of a regularization term. More recently, methods such as SimPO (Simple Preference Optimization)32 have aimed for even greater algorithmic simplicity and efficiency.
Collectively, these advancements underscore a clear trajectory in the field toward developing more stable, data-efficient, and computationally tractable alternatives to traditional reinforcement learning-based alignment methods.
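ORPO's odds-ratio term can likewise be sketched in a few lines on toy sequence probabilities. In the full ORPO loss this term is added, scaled by a weighting factor, to the standard SFT loss; the probability values below are invented:

```python
import math

def odds(p):
    return p / (1.0 - p)

def orpo_penalty(p_chosen, p_rejected):
    """Odds-ratio preference term: -log(sigmoid(log odds ratio))."""
    log_or = math.log(odds(p_chosen) / odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))

# Assigning higher probability to the chosen answer drives the
# penalty below -log(0.5) = log 2.
assert orpo_penalty(0.6, 0.3) < math.log(2.0)
```

Because the term operates on the model's own probabilities, no frozen reference copy of the model is needed, which is the memory advantage ORPO claims over DPO and RLHF.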

This work focuses on DPO due to its training efficiency and performance28.

Available Italian LLMs

The adaptation of Large Language Models to the Italian language constitutes an active area of research, resulting in the development of several open models. A common challenge among these models is the reliance on machine-translated English datasets, largely due to the scarcity of well-curated Italian language datasets. GPT-3.5 is a frequent choice for translations, as it performs well with texts containing code, preserving programming language syntax without erroneous translations. Additionally, generating datasets via large language models is another strategy that enables the creation of more expansive and contextually rich conversational data.

Camoscio33 builds upon the 7B parameter Meta LLaMA model. The dataset used for training in Italian is derived from a machine-translated version of the Alpaca dataset, originally created through a self-instruct approach34, where new instructions were generated by prompting the TEXT-DAVINCI-003 model. The Italian translation is handled using GPT-3.5-TURBO. The model is fine-tuned through instruction-tuning with the LORA technique and evaluated on tasks like News Summarization (using the NEWSUM-IT dataset35), Question Answering (via SQUAD-IT36), and Formality Style Transfer (utilizing the Italian portion of the XFORMAL dataset37).

Fauno38 is built on the 7B and 13B parameter versions of BAIZE39, itself a fine-tuned variant of LLAMA using a conversational dataset produced through self-chat with GPT-3.5-TURBO. In this approach, a user-seeded initial question triggers an ongoing interaction with the model. The training corpus includes datasets like StackOverflow, Quora, Alpaca, and MedQuAD40, translated into Italian by GPT-3.5-TURBO. Fine-tuning is performed using BAIZE’s LORA adapters, and the evaluation is qualitative, comparing outputs of CHATGPT, CAMOSCIO, and FAUNO.

Stambecco41 is based on both the 7B and 13B parameter versions of LLAMA. It uses two Italian datasets: the Alpaca dataset and a version called Alpaca GPT-4, generated by following the original methodology but with GPT-4. Like Camoscio, it applies instruction-tuning with LORA, though no evaluation results are provided.

Cerbero42 leverages the 7B MISTRAL model3 and performs full-parameter fine-tuning on an Italian conversational dataset generated using the LLAMA 70B chat model. To improve data quality, a diversity filter based on cosine similarity of sentence embeddings (from the DISTILUSE-BASE-MULTILINGUAL-CASED model) is applied, removing messages with similarity scores above 0.9. The authors experimented with three dataset configurations: only FAUNO data, only newly generated data, and a combination of both (referred to as Fauno, Generated, and Full, respectively). Evaluations on the SQUAD-IT36 and EVALITA benchmarks (including datasets such as AMI, IRONITA, and SENTIPOLC) reveal that the Full configuration yields the best performance.
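A diversity filter of this kind can be sketched as a greedy pass over normalized sentence embeddings. The 0.9 threshold matches the value reported for Cerbero, but the exact procedure and the wiring to the embedding model are assumptions of this sketch:

```python
import numpy as np

def diversity_filter(embeddings, threshold=0.9):
    """Greedily keep a message only if its embedding's cosine
    similarity with every already-kept message is <= threshold."""
    kept, kept_vecs = [], []
    for i, v in enumerate(embeddings):
        v = v / np.linalg.norm(v)
        if all(float(v @ u) <= threshold for u in kept_vecs):
            kept.append(i)
            kept_vecs.append(v)
    return kept

emb = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
# The second vector is nearly parallel to the first and is dropped.
assert diversity_filter(emb) == [0, 2]
```

In practice the embeddings would come from a sentence-transformer model; the greedy pass keeps the filter linear in the number of comparisons against the retained set.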

LLaMAntino-222 is based on Meta-AI LLaMA-2 models adapted to the Italian language through a full fine-tuning phase (i.e., continual learning). It uses the Filtered Oscar Dataset for the Italian Language released by35. Documents are removed if they contain words from a selection of the Italian and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words, or sentences that have fewer than three words, a word longer than 1,000 characters, an end symbol not matching end-of-sentence punctuation, or strings associated with JavaScript code, lorem ipsum, or policy information in Italian or English. Moreover, documents (after sentence filtering) with fewer than five sentences, fewer than 500 characters, more than 50,000 characters, or not identified as predominantly Italian by the LangDetect package are excluded from the dataset. This extensive filtering process is designed to ensure high-quality data for model training. The medium split, which contains 50M documents and 20B words (i.e., 135 GB on disk), is utilized for this purpose. LLaMAntino models are subsequently fine-tuned through SFT on the Dolly dataset43 and the EVALITA 2023 datasets44 (7B, 13B, 70B).

Two additional models in the field are Minerva and Modello Italia. The Minerva model is designed using a combination of Italian and English text from the CULTURAX dataset. It comes in three different versions, each with a distinct number of parameters: 350 million, 1 billion, and 3 billion. All versions of the model are trained using the llm-foundry library. On the other hand, Modello Italia is built on the GPT-NeoX architecture and features 9 billion parameters. It is an Italian-specific LLM, offered in both a base and an instruct variant. The training process is carried out with the litgpt library on an unspecified Italian dataset.

The approach taken with the LLaMAntino-3-ANITA model differs from existing methods. Instead of focusing solely on Italian from the outset, it leverages the strengths of a pre-trained model to produce semantically and syntactically robust text, refined through supervised fine-tuning on widely-used English datasets. DPO (Direct Preference Optimization) is then applied to enhance its safety and accuracy. Adapting the model to Italian only in the final stage leverages the extensive availability of English data, which avoids the need for translation and minimizes the associated errors.

Supervised fine-tuning

The implementation pipeline for the LLaMAntino-3-ANITA-8B-Inst-DPO-ITA model begins with the enhancement of the base meta-llama/Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model. This initial stage improves the model’s general instruction-following capabilities using English-language data.

Datasets

This step utilizes the Chat-Error/wizard_alpaca_dolly_orca (https://huggingface.co/datasets/Chat-Error/wizard_alpaca_dolly_orca) dataset, a composite dataset from the HuggingFace hub created by merging three established instruction fine-tuning corpora:

  • pankajmathur/wizardLM_orca

  • pankajmathur/dolly-v2_orca

  • pankajmathur/alpaca_orca

In total, it contains approximately 100K prompts organized into the following fields: system, instruction, input, output. The tokens << human >>: and << assistant >>: are removed for training purposes.
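How the four fields might be assembled into a single training prompt can be sketched as follows. The exact template used for ANITA is available in the project repository; this layout and the helper name are purely illustrative:

```python
def build_prompt(example):
    """Assemble one training example from the dataset's four fields."""
    parts = [example["system"], example["instruction"]]
    if example["input"]:                     # 'input' is often empty
        parts.append(example["input"])
    prompt = "\n\n".join(p for p in parts if p)
    return prompt, example["output"]

ex = {"system": "You are a helpful assistant.",
      "instruction": "Translate to Italian.",
      "input": "Good morning",
      "output": "Buongiorno"}
prompt, target = build_prompt(ex)
assert "Translate to Italian." in prompt and target == "Buongiorno"
```

During supervised fine-tuning the prompt part is fed as context and the loss is computed on the target continuation.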

WizardLM Orca is an English-language dataset for instruction-tuning. It utilizes 15 distinct system messages from the Orca research paper45 to provide context and control various aspects of the model’s output, including response length, persona, and behavior. The instruction prompt defines the specific task for the model to perform. The corresponding outputs are generated by a Teacher Model, ChatGPT (gpt-3.5-turbo-0301 version) (https://openai.com/). The dataset comprises approximately 55K prompts from the WizardLM collection (https://huggingface.co/datasets/pankajmathur/WizardLM_Orca).

Dolly-v2 Orca is the explanation-tuned version of the Dolly-V2 corpus (https://huggingface.co/datasets/databricks/databricks-dolly-15k). This corpus contains over 15k instruction-following records generated by Databricks employees. Contributors were instructed to create original content, avoiding web sources other than Wikipedia for specific categories and refraining from the use of generative AI. Part of the process also involved contributors answering questions posed by their peers, selecting only those they could answer correctly after rephrasing the original query. In this version of the dataset, outputs are obtained by prompting ChatGPT (gpt-3.5-turbo-0301 version).

Alpaca Orca is the explanation-tuned version of the Alpaca corpus (https://huggingface.co/datasets/tatsu-lab/alpaca). Alpaca provides 52k instructions and demonstrations generated by OpenAI’s text-davinci-003 engine, covering diverse domains such as health, science, and general knowledge. In this version of the dataset, outputs are obtained by prompting ChatGPT (gpt-3.5-turbo-0301 version).

Approach

The model fine-tuning is performed using a single NVIDIA A100 SXM4 64GB GPU card on the LEONARDO HPC infrastructure (https://leonardo-supercomputer.cineca.eu/about/), using the Unsloth framework (https://github.com/unslothai). Unsloth is an open-source library designed to optimize the fine-tuning of Large Language Models, making it up to 2–5 times faster while requiring up to 80% less memory than traditional methods. This efficiency allows models such as LLaMA-3, Mistral, and Gemma to be fine-tuned on free notebooks and consumer-grade hardware, and larger models to be used on less powerful GPUs. The framework supports 4-bit and 16-bit QLoRA/LoRA fine-tuning via bitsandbytes, which further reduces model size without a significant loss in performance. Unsloth achieves these optimizations through manual backpropagation engines and kernels written in OpenAI’s Triton language; because no approximation methods are used, there is no loss in accuracy. The framework is compatible with NVIDIA GPUs manufactured since 2018 and operates on Linux.

The prompts are structured using the standard Alpaca-LoRA template (https://github.com/tloen/alpaca-lora/blob/main/templates/README.md) and properly encoded through the LLaMA-3 tokenizer, i.e., adding the <|begin_of_text|> token (equivalent to the BOS token) and the <|eot_id|> token (which signifies the end of the message in a turn). All the parameters used for this step are reported in the example of fine-tuning using Unsloth and the TRL SFTTrainer (https://huggingface.co/docs/trl/sft_trainer) available on the project’s GitHub repository (https://github.com/marcopoli/LLaMAntino-3-ANITA).

Model direct preferences optimization

Following the initial supervised fine-tuning, the model undergoes Direct Preference Optimization (DPO) to refine its outputs. This step applies the DPO technique to the mlabonne/orpo-dpo-mix-40k (https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k) dataset, a collection of preference data from the HuggingFace hub.

Orpo-dpo-mix-40k is a collection of filtered and validated DPO datasets. The authors performed deep filtering to remove GPT-isms and artifacts from the responses in order to maintain dataset quality. It includes about 40k examples composed as follows:

  • argilla/Capybara-Preferences: highly scored chosen answers, score ≥ 5 (7424 samples)

  • argilla/distilabel-intel-orca-dpo-pairs: highly scored chosen answers, score ≥ 9, not in GSM8K (2299 samples)

  • argilla/ultrafeedback-binarized-preferences-cleaned: highly scored chosen answers, score ≥ 5 (22,799 samples)

  • argilla/distilabel-math-preference-dpo: highly scored chosen answers, score ≥ 9 (2181 samples)

  • unalignment/toxic-dpo-v0.2 (541 samples)

  • M4-ai/prm_dpo_pairs_cleaned (7958 samples)

  • jondurbin/truthy-dpo-v0.1 (1016 samples)

Note that orpo-dpo-mix-40k contains a subset (toxic-dpo-v0.2) designed to prompt the model to answer illegal questions; this subset is excluded from the training process described here.

Approach

The model’s DPO-tuning is performed using a single NVIDIA A100 SXM4 64GB GPU on the LEONARDO HPC infrastructure with the Unsloth framework. The process runs for one epoch over approximately 24 hours with a batch size of 4, employing a learning rate of 5e−5, which is reduced from the 2e−4 standard for supervised fine-tuning (https://github.com/huggingface/alignment-handbook). A complete list of hyperparameters is provided in the project’s GitHub repository (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/model_adaptation/dpo_llama3.py).

Italian language adaptation

The model resulting from the preceding steps exhibits characteristics suitable for adaptation to the Italian language. This adaptation is achieved by applying the full fine-tuning strategy on the gsarti/clean_mc4_it dataset (https://huggingface.co/datasets/gsarti/clean_mc4_it). Specifically, only 100k examples are randomly selected from the dataset, and the script is run for three epochs with a standard learning rate of 2e-4. All other parameters remain unchanged from the previous fine-tuning step. The prompts are formatted using the standard Meta-AI LLaMA-3 template, i.e., <|begin_of_text|> {text} <|eot_id|>.
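Wrapping each document in the LLaMA-3 special tokens amounts to simple string handling; in the actual pipeline the tokenizer inserts the corresponding token IDs itself, so this is only an illustration of the template:

```python
BOS, EOT = "<|begin_of_text|>", "<|eot_id|>"

def format_document(text):
    # Sketch of the template <|begin_of_text|> {text} <|eot_id|>;
    # the real pipeline lets the LLaMA-3 tokenizer add these IDs.
    return f"{BOS}{text}{EOT}"

s = format_document("Un documento italiano di esempio.")
assert s.startswith(BOS) and s.endswith(EOT)
```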

Model validation

The efficacy of the Italian language adaptation is validated through both training process monitoring and quantitative performance assessment. A standard check of the training loss indicates a stable and successful convergence during the fine-tuning process, as reported in Fig. 1. For quantitative validation, a comparative analysis is performed against the original Meta-AI LLaMA-3 Instruct model. The evaluation utilizes a sample of 100 question-answer pairs from the ARC-challenge, translated into Italian (https://huggingface.co/datasets/swap-uniba/arc_challenge_ita). The results indicate that the LLaMAntino-3-ANITA model shows improvements across all semantic and lexical overlap metrics. Specifically, LLaMAntino-3-ANITA achieves a BERTScore F1 of 0.6279 (Precision = 0.5831, Recall = 0.6820), compared to the 0.6215 F1 score of the base model (Precision = 0.5770, Recall = 0.6754). This trend is corroborated by the ROUGE scores, where LLaMAntino-3-ANITA consistently outperforms the original model across ROUGE-1 (0.0693 vs. 0.0647), ROUGE-2 (0.0083 vs. 0.0072), and ROUGE-L (0.0601 vs. 0.0551). Collectively, these metrics confirm that the adaptation process successfully enhances the model’s capabilities for the Italian language, yielding a modest but consistent improvement in performance over the original English-centric model.

Fig. 1
figure 1

Training loss observed during the Language Adaptation phase of the model.

Model Evaluation

The evaluation of Large Language Models (LLMs) relies on standard benchmarks46 (i.e., the EleutherAI Language Model Evaluation Harness and the HuggingFace Open LLM Leaderboard) to compare performance across a range of tasks. These benchmarks consist of structured datasets and evaluation metrics designed to test different aspects of language understanding, generation, and reasoning. The Massive Multitask Language Understanding (MMLU) dataset47,48, for example, tests LLMs on subjects from STEM to social sciences, measuring the model’s general knowledge and reasoning ability. HellaSwag49 focuses on commonsense reasoning, challenging LLMs to complete passages that require an understanding of nuanced context. The dataset presents scenarios with multiple-choice endings, where only one is common-sensically correct, requiring LLMs to move beyond pattern recognition to a deeper comprehension of the physical world. The AI2 Reasoning Challenge (ARC-Challenge)50 tests LLMs on grade-school science questions, demanding both general knowledge and reasoning abilities. This benchmark evaluates the ability to answer complex science questions that require logical reasoning, a capability relevant for educational AI applications, automated tutoring systems, and general knowledge assessments. Similarly, TruthfulQA51 measures the extent to which models mimic human falsehoods. It assesses the propensity of LLMs to repeat false information, a critical aspect given the potential for disseminating misinformation. The benchmark includes questions designed to elicit responses containing popular misconceptions, evaluating the truthfulness and informativeness of the answers. Winogrande52 assesses the ability of LLMs to solve pronoun disambiguation problems, a task fundamental to understanding semantic relationships within a sentence. Finally, GSM8K53 is a dataset of grade-school math problems that test the mathematical reasoning abilities of LLMs.
It requires models to generate a correct final answer while also demonstrating the step-by-step reasoning process used to arrive at the solution. Accuracy is a standard metric for benchmarking LLMs across tasks with clear correct or incorrect answers, such as classification and question-answering54. It measures the proportion of correct predictions relative to the total number of predictions. As reported in Table 1, accuracy (Acc.) is adopted as the standard evaluation metric for LLaMAntino-3-ANITA in Winogrande, TruthfulQA, MMLU, HellaSwag, and the AI2 Reasoning Challenge. For multiple-choice tasks like HellaSwag and the AI2 Reasoning Challenge, normalized accuracy (Acc.Norm.) is used to provide a fair comparison by accounting for the varying number of answer choices.
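The difference between plain and normalized accuracy can be sketched as follows. In the EleutherAI harness, acc_norm divides each answer choice's log-likelihood by its length (in bytes) before picking the best choice; the numbers below are invented for illustration:

```python
def pick(loglikelihoods, lengths=None):
    """Pick the answer index with the highest (optionally
    length-normalized) log-likelihood."""
    scores = loglikelihoods if lengths is None else [
        ll / n for ll, n in zip(loglikelihoods, lengths)]
    return max(range(len(scores)), key=scores.__getitem__)

lls = [-12.0, -8.0, -20.0]       # one log-likelihood per choice
lens = [4, 3, 40]                # answer lengths in bytes
assert pick(lls) == 1            # plain accuracy picks the short answer
assert pick(lls, lens) == 2      # normalization favors the long answer
```

Without normalization, longer answers are systematically penalized because each extra token adds a negative log-likelihood term; dividing by length removes that bias.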

Table 1 Evaluation of different LLMs on state-of-the-art English datasets by using the EleutherAI Language Model Evaluation Harness library.

In the context of GSM8K, performance is assessed using Strict-match and Flexible-extract as evaluation metrics. These methods evaluate both the final correct answer and the logical steps involved in reaching the solution53. Strict match is an exact evaluation metric where the model’s entire solution, including the final answer and each calculation step, must precisely match the expected output. Flexible extract is a more lenient metric where the model’s output is considered correct if the final answer is correct and the reasoning process is logically sound, even if the intermediate steps or formatting differ from the expected solution.
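A lenient answer extractor of this kind can be sketched with a regular expression that takes the last number in the generation; the harness's actual extraction patterns may differ:

```python
import re

def flexible_extract(output):
    """Take the last number in the model's output as its final
    answer, a common lenient extraction for GSM8K-style scoring."""
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", output)
    return nums[-1].replace(",", "") if nums else None

out = "She buys 3 packs of 4 eggs, so 3 * 4 = 12. The answer is 12."
assert flexible_extract(out) == "12"
```

A strict-match check would instead compare the full formatted solution string against the reference, so any deviation in the intermediate steps counts as an error.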

The aforementioned datasets and metrics are used to evaluate the English-language performance of the models. The testing protocol is executed using the EleutherAI Language Model Evaluation Harness (https://github.com/EleutherAI/lm-evaluation-harness) on four NVIDIA A100 SXM4 64GB GPUs (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/evaluation/job_evaluation.slurm). This evaluation enables a direct comparison of the LLaMAntino-3-ANITA model with other state-of-the-art LLMs of similar size and architecture. Specifically, the comparison includes: meta-llama/Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), cloudyu/Meta-Llama-3-8B-Instruct-DPO (https://huggingface.co/cloudyu/Meta-Llama-3-8B-Instruct-DPO), and DeepMount00/Llama-3-8b-Ita (https://huggingface.co/DeepMount00/Llama-3-8b-Ita). Due to computational constraints, a comparison with larger models was not performed. The obtained results are reported in Table 1.

On the Winogrande commonsense reasoning task, LLaMAntino-3-ANITA achieves the highest accuracy (0.7609). The other models, including Meta-Llama-3-8B-Instruct (0.7182) and cloudyu/Meta-Llama-3-8B-Instruct-DPO (0.7348), show competitive but lower scores. The DeepMount00/Llama-3-8b-Ita model obtains an accuracy of 0.7490.

Furthermore, on the TruthfulQA task, which evaluates the ability to discern factual accuracy, LLaMAntino-3-ANITA emerges as the highest performing model, achieving an accuracy of 0.7124. This represents a notable increase over Meta-Llama-3-8B-Instruct (0.4397) and cloudyu/Meta-Llama-3-8B-Instruct-DPO (0.5404). The DeepMount00/Llama-3-8b-Ita model also shows a higher accuracy of 0.5881, though it remains below the performance of LLaMAntino-3-ANITA. This score suggests a particular aptitude for tasks involving complex reasoning about truthfulness, possibly due to its specialized fine-tuning.

The narrow performance spread across all models on the MMLU general knowledge benchmark indicates a comparable ability to process and retrieve information from diverse topics. DeepMount00/Llama-3-8b-Ita obtains the highest accuracy (0.6411), marginally exceeding Meta-Llama-3-8B-Instruct (0.6397). cloudyu/Meta-Llama-3-8B-Instruct-DPO scores 0.6366, and LLaMAntino-3-ANITA records the lowest performance with 0.6354.

The HellaSwag task, designed to assess commonsense reasoning in dynamic contexts, is again led by LLaMAntino-3-ANITA, which achieves the highest accuracy (0.7430) and normalized accuracy (0.8856). The model substantially outperforms all others, including DeepMount00/Llama-3-8b-Ita (0.648 and 0.8304) and Meta-Llama-3-8B-Instruct (0.5767 and 0.7586). The normalized accuracy metric, which accounts for the varying number of answer choices, reinforces this result.

The performance on the GSM8K task is noteworthy because Meta-Llama-3-8B-Instruct outperforms all other models. It achieves a strict-match accuracy of 0.7551 and a flexible-extract accuracy of 0.7536. cloudyu/Meta-Llama-3-8B-Instruct DPO (0.7195 and 0.7172) and DeepMount00/Llama-3-8b-Ita (0.6816 and 0.6823) obtain lower scores, while LLaMAntino-3-ANITA (0.6035 and 0.6088) scores considerably lower. These results suggest that Meta-Llama-3-8B-Instruct is more effective for the GSM8K task.

LLaMAntino-3-ANITA also leads on the ARC Challenge task, which evaluates reasoning on questions from academic sources. It achieves an accuracy of 0.6775 and a normalized accuracy of 0.6988. In contrast, DeepMount00/Llama-3-8b-Ita (0.6715 and 0.6732) obtains a competitive score, while cloudyu/Meta-Llama-3-8B-Instruct-DPO (0.477 and 0.506) scores substantially lower.

The average performance across all tasks confirms LLaMAntino-3-ANITA as the most consistent model, achieving an average score of 0.7029, the highest across all models. In contrast, Meta-Llama-3-8B-Instruct and cloudyu/Meta-Llama-3-8B-Instruct-DPO achieve averages of 0.6627 and 0.6331, respectively, while DeepMount00/Llama-3-8b-Ita achieves an average of 0.6851.

To benchmark against state-of-the-art models, the ANITA model is submitted to the HuggingFace Open LLM Leaderboard (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), a dynamic resource that tracks the performance of LLMs through continuously updated evaluations. This approach contrasts with static benchmarks by reflecting the rapid pace of development in the field. Figure 2 presents a snapshot of the leaderboard’s state at the time of evaluation. Due to the frequent emergence of new models and techniques, these rankings are subject to change. Therefore, while the results are notable, they should be considered within the context of a rapidly evolving field where new state-of-the-art benchmarks are continually being established.

Fig. 2

HuggingFace Open LLM Leaderboard results.

The ANITA model is further evaluated on datasets dedicated to Italian by submitting it to the Open Italian LLM Leaderboard (https://huggingface.co/spaces/FinancialSupport/open_ita_llm_leaderboard), which focuses on benchmarks translated into Italian. The results, summarized in Table 2 and Fig. 3, compare ANITA with other Italian-language models of similar size.

Table 2 Results of different LLMs on the Open Italian LLMs Leaderboard.
Fig. 3

Results of different LLMs on the Open Italian LLMs Leaderboard (radar chart visualization).

On the mmlu it task, DeepMount00/Lexora-Medium-7B achieves the highest normalized accuracy (0.6863), indicating high proficiency in tasks requiring extensive world knowledge. Other models, such as DeepMount00/Llama-3.1-8b-Ita (0.5899) and LLaMAntino-3-ANITA (0.5672), show competitive performance but do not match Lexora-Medium-7B on this task. On the arc challenge it task, the performance variation among models is smaller. The highest score is achieved by LLaMAntino-3-ANITA with 0.5714; other models, such as anakin87/Llama-3-8b-ita-slerp and ExperimentLab/Llama-3-8b-Ita-Boost, reach the same performance. The relatively low scores across all models suggest that this benchmark remains a considerable challenge, highlighting an area for future improvement.

In the hellaswag it task, LLaMAntino-3-ANITA achieves a normalized accuracy of 0.7093. This is the highest score obtained by any model on any of the three tasks, indicating that LLaMAntino-3-ANITA exhibits a particular strength in tasks involving sequential or commonsense reasoning. Other high-performing models on this task include DeepMount00/Llama-3.1-8b-Ita (0.6617) and DeepMount00/Mistral-Ita-7b (0.6728), which score relatively high but do not match LLaMAntino-3-ANITA's leading performance.

The overall average performance confirms LLaMAntino-3-ANITA-8B-Inst-DPO-ITA as the top performer with an average accuracy of 0.6160, slightly exceeding that of Lexora-Medium-7B (0.6150). This result, combined with its leading score on the hellaswag it task, indicates the model's balanced proficiency across both general knowledge and commonsense reasoning in Italian.
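As a sanity check, the reported overall average is consistent with a simple macro-average of the three per-task normalized accuracies quoted above (the task keys below are illustrative labels, not the leaderboard's exact identifiers):

```python
# Macro-average of LLaMAntino-3-ANITA's normalized accuracies on the three
# Open Italian LLM Leaderboard tasks, as quoted in the text.
scores = {
    "mmlu_it": 0.5672,
    "arc_challenge_it": 0.5714,
    "hellaswag_it": 0.7093,
}
average = sum(scores.values()) / len(scores)
print(round(average, 4))  # 0.616, reported as 0.6160
```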

Ready-to-run applications

The LLaMAntino-3-ANITA-8B-Inst-DPO-ITA model is applicable to a multitude of scenarios. This section presents several examples, each accompanied by a script to facilitate future work.

Retrieval-Augmented Generation (RAG) Retrieval-Augmented Generation55 is an approach in Information Retrieval that integrates generative models with external knowledge bases. This technique enhances LLM capabilities by enabling them to access and incorporate information from external databases during text generation56. This integration allows LLMs to generate factually grounded, contextually relevant responses, addressing a key limitation of traditional models: the generation of "hallucinated" information. The RAG process involves two steps: retrieval and generation. In the retrieval step, the model uses the input prompt to query an external database for relevant documents. These retrieved documents are then fed into the generative component, which synthesizes the external data with its pre-existing knowledge to generate a coherent and informed response. This process improves the accuracy and reliability of the model's outputs, making RAG particularly suitable for applications such as question-answering systems where precision is required. The proposed model can function as the core component of common RAG frameworks like LlamaIndex (https://www.llamaindex.ai/) and LangChain (https://www.langchain.com/)57. The model's 8K input context size and proficiency in Italian make it suitable for a wide range of RAG applications. An example of LLaMAntino-3-ANITA application in a RAG system is available in the project repository (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/use_examples/Llamaindex_LangChain.ipynb).
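The two-step retrieve-then-generate loop can be sketched without any framework. In this minimal sketch, a bag-of-words cosine similarity stands in for a real embedding model, and the assembled prompt would be passed to the generative model; all names, documents, and prompt wording are illustrative, not taken from the linked notebook.

```python
import math
import re
from collections import Counter

def bow_vector(text):
    # Bag-of-words term counts; a real system would use dense embeddings.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    # Retrieval step: rank documents by similarity to the query.
    q = bow_vector(query)
    return sorted(documents, key=lambda d: cosine(q, bow_vector(d)), reverse=True)[:k]

def build_prompt(query, documents, k=2):
    # Generation step input: augment the query with the retrieved context.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (f"Answer using only the context below.\nContext:\n{context}\n"
            f"Question: {query}\nAnswer:")

docs = [
    "Rome is the capital of Italy.",
    "The Po is the longest river in Italy.",
    "Mount Etna is an active volcano in Sicily.",
]
prompt = build_prompt("What is the capital of Italy?", docs)
print(prompt)
```

In a full pipeline, the resulting prompt would be sent to LLaMAntino-3-ANITA (e.g. through LlamaIndex or LangChain) to produce the grounded answer.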

Topic modeling

Topic Modeling58 is an unsupervised machine learning technique for discovering latent thematic structures in a text corpus. Contemporary approaches to this task often utilize LLMs. BERTopic59 is one such method, which leverages transformer models to generate contextualized document embeddings. It then uses a class-based TF-IDF method to form dense clusters, resulting in interpretable topic representations. BERTopic supports various techniques, including guided, supervised, and semi-supervised learning, making it a versatile tool. LLaMAntino-3-ANITA can be used as the embedding model within this framework. An example is provided (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/use_examples/Topic_Modeling_with_Llama3.ipynb) demonstrating its use as a backbone for BERTopic to obtain accurate and robust results.
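The class-based TF-IDF (c-TF-IDF) scoring at the heart of BERTopic can be sketched in a simplified form: documents are grouped into clusters (given directly here, whereas BERTopic derives them from the document embeddings), and terms are then scored per cluster so that the top-scoring terms describe the topic. The stopword list and example documents are illustrative.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "in", "to", "of"}

def tokenize(text):
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]

def c_tf_idf(clusters):
    # Term frequencies per cluster and across the whole corpus.
    per_cluster = {c: Counter(tokenize(" ".join(docs))) for c, docs in clusters.items()}
    corpus_freq = Counter()
    for tf in per_cluster.values():
        corpus_freq.update(tf)
    avg_words = sum(corpus_freq.values()) / len(clusters)  # avg words per cluster
    # Normalized in-cluster frequency, down-weighted by corpus-wide frequency.
    return {
        c: {t: (f / sum(tf.values())) * math.log(1 + avg_words / corpus_freq[t])
            for t, f in tf.items()}
        for c, tf in per_cluster.items()
    }

clusters = {
    "sports": ["the team won the match", "a great goal in the match"],
    "finance": ["the bank raised interest rates", "markets react to interest rates"],
}
scores = c_tf_idf(clusters)
top_sport = max(scores["sports"], key=scores["sports"].get)
print(top_sport)  # match
```

In the actual BERTopic pipeline, this scoring is applied after embedding, dimensionality reduction, and clustering, with the embedding step being where LLaMAntino-3-ANITA can be plugged in.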

Sentiment analysis

Sentiment Analysis60, or opinion mining, is an NLP subfield focused on identifying and categorizing opinions in text to determine sentiment polarity (positive, negative, or neutral). The technical implementation involves data preprocessing, feature extraction, and classification using algorithms trained on labeled datasets. Advanced models like LLMs can improve accuracy in such tasks by better understanding word context61. To this end, a Python script is provided (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/use_examples/LLama_3_for_SentimentAnalysis.ipynb) for fine-tuning LLaMAntino-3-ANITA on a sentiment analysis dataset and for using it as a zero-shot classifier.
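The zero-shot setting can be sketched as prompt construction plus reply normalization: the model is asked to answer with a single label, and its free-form reply is mapped to one of the three classes. The `generate` callable is a stand-in for a call to LLaMAntino-3-ANITA (e.g. via the HuggingFace `pipeline` API); the prompt wording is illustrative, not taken from the linked notebook.

```python
LABELS = ("positive", "negative", "neutral")

def build_prompt(text):
    return ("Classify the sentiment of the following Italian text as "
            "positive, negative, or neutral. Reply with one word only.\n"
            f"Text: {text}\nSentiment:")

def parse_label(reply, default="neutral"):
    # Normalize a free-form model reply to one of the three classes.
    reply = reply.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return default  # fall back when the model answers off-format

def classify(text, generate):
    return parse_label(generate(build_prompt(text)))

# Placeholder generator for demonstration; a real run queries the model.
fake_generate = lambda prompt: " Positive."
print(classify("Che bel film!", fake_generate))  # positive
```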

Recommender systems

Recommender Systems (RecSys)62 are algorithms that predict user interest in items. While deep learning has advanced RecSys, challenges in understanding user preferences and providing explanations remain. Integrating LLMs can address these issues by generating more personalized, contextually relevant, and interpretable recommendations63. A basic example of how to use LLaMAntino-3-ANITA for this task is provided in the project repository (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/use_examples/SeqRecSys_LLM_Zero_Shot.ipynb).
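A zero-shot sequential recommendation can be sketched as prompt construction: the user's interaction history is serialized into a prompt asking the model to rank candidate items and explain its choices. The item names and prompt wording below are illustrative, not taken from the linked notebook.

```python
def build_rec_prompt(history, candidates, k=3):
    # Serialize the ordered interaction history and the candidate pool
    # into a ranking request for the LLM.
    hist = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(history))
    cands = ", ".join(candidates)
    return ("A user watched these films, in order:\n"
            f"{hist}\n"
            f"From the candidates [{cands}], recommend the {k} most likely "
            "next films, ranked, with a brief reason for each.")

rec_prompt = build_rec_prompt(
    history=["La vita è bella", "Nuovo Cinema Paradiso"],
    candidates=["La grande bellezza", "Il postino", "Perfetti sconosciuti"],
)
print(rec_prompt)
```

The resulting prompt would then be passed to LLaMAntino-3-ANITA, whose free-text ranking provides both the recommendation and its explanation.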

Dialogue

The integration of LLMs into dialogue applications presents both opportunities and challenges. LLMs can provide companionship, entertainment, and information, functioning as virtual chat partners64 and agentic AI65,66. However, ensuring that these interactions are coherent, ethical, and safe is a priority. A common method for implementing a chatbot is through a graphical user interface. The following section presents an example of the LLaMAntino-3-ANITA model's application in a dialogue context, demonstrating interaction via the HuggingFace Transformers Python library.
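Multi-turn dialogue state can be sketched in the message format used by the Transformers chat-template API: a list of role/content dictionaries that `tokenizer.apply_chat_template` would turn into the model's prompt. The generator below is a placeholder; a real loop would template the history and call `model.generate`.

```python
def new_chat(system_prompt):
    # A conversation starts with a single system message.
    return [{"role": "system", "content": system_prompt}]

def add_turn(history, user_msg, generate):
    # Append the user turn, obtain a reply, and append the assistant turn.
    history.append({"role": "user", "content": user_msg})
    reply = generate(history)  # e.g. apply_chat_template + model.generate
    history.append({"role": "assistant", "content": reply})
    return reply

chat = new_chat("Sei ANITA, un assistente che risponde in italiano.")
echo = lambda h: f"(risposta a: {h[-1]['content']})"  # placeholder generator
add_turn(chat, "Ciao! Chi sei?", echo)
print(len(chat), chat[-1]["role"])  # 3 assistant
```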

Interaction example

figure a

A graphical user interface for LLaMAntino-3-ANITA, presented in Fig. 4, can be run locally via the provided Python script (https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/use_examples/User_Interface.ipynb) or accessed publicly (from an Italian-based internet connection) at the following URL: http://chat.llamantino.it/

Fig. 4

The publicly released LLaMAntino-3-ANITA-8B-Inst-DPO-ITA user interface.

General considerations and limits of the approach

The LLaMAntino-3-ANITA-8B-Inst-DPO-ITA model constitutes a resource for the Italian research and industry communities in natural language processing (NLP). Analogous to the support provided by the AlBERTo model67,68,69 in prior years, LLaMAntino-3-ANITA is a model tailored to the language and context of Italian culture. Its accessibility, adaptability, and ease of specialization ensure its continued relevance in addressing Italy-specific NLP tasks. The model demonstrates wide adoption, with monthly downloads averaging approximately 8,000 and an estimated total of 115,000 downloads on Hugging Face. This high adoption rate, coupled with the development of many derivative models, underscores its utility and adaptability as a base for task-specific fine-tuning. This adaptability allows organizations to customize the model for varied applications across domains such as the legal, financial, and customer service sectors. The application of LLaMAntino-3-ANITA extends beyond academia to major Italian corporations, with approximately 7–8 large companies requesting support for its integration into their operational workflows. This interest from industry indicates that the model addresses a need for specialized Italian-language NLP resources. Companies in sectors with unique linguistic needs benefit from models trained on Italian data that can handle the nuances of the language for tasks ranging from sentiment analysis to customer service automation.

The public release of the training protocol, code, and model weights facilitates reproducibility and transparency within the research community. This allows researchers to replicate experiments, assess performance, and benchmark the model against emerging alternatives. As such, LLAMantino-3-ANITA-8B-Inst-DPO-ITA serves as a validated baseline for future advancements in localized AI models for Italian, promoting language diversity and reducing reliance on generalized models that lack local specificity.

Despite its strengths, LLaMAntino-3-ANITA faces challenges from the rapid evolution of model architectures. Recent models such as Phi370 and Meta AI's newest LLMs (https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) already demonstrate superior performance in Italian, benefiting from broader multilingual data and larger training sets. Consequently, LLaMAntino-3-ANITA has already been surpassed on competitive NLP leaderboards by newer models that integrate the latest architectural advancements (October 2025), highlighting the need for periodic re-training to maintain its competitiveness. Furthermore, while LLaMAntino-3-ANITA is tailored to the Italian language, the trend toward large multimodal models trained on extensive multilingual datasets presents an alternative. Users may prefer these generalized models for their higher performance across diverse NLP tasks. Therefore, while the model remains valuable for specific applications, it must continue to evolve to stay competitive with more complex architectures.

Italian-specific cultural biases and ethical considerations

While LLaMAntino-3-ANITA is expressly adapted for Italian, it remains susceptible to culturally specific biases from its training data. Large language models (LLMs) often internalize the value systems dominant in their training corpora, which can cause systematic harms when deployed in specific cultural contexts71,72. Addressing this requires identifying salient Italian-specific bias axes and establishing rigorous, culturally grounded evaluation and mitigation protocols. A primary ethical risk involves the model's handling of Italian's grammatical gender and the common use of the maschile sovraesteso, which can entrench gender stereotypes by defaulting to masculine forms and associating genders with traditional roles71,73. Further axes of bias include regional stereotypes between Northern, Central, and Southern Italy; negative associations with specific nationalities linked to migration discourse; and a potential default to Catholic majority norms that marginalizes minority religious practices72,74. The model must also navigate linguistic nuances such as politeness conventions (tu/Lei), the use of honorifics, and the sociolinguistic status of dialects and regional languages (e.g., Sardo, Neapolitan), where sparse coverage risks misclassification or stigmatization71,75. Finally, politically charged topics, from historical memory of Fascism to contemporary debates on civil rights, require careful neutrality constraints to avoid biased outputs76. To strengthen the model's ethical integrity, we recommend a layered evaluation protocol that moves beyond translated benchmarks. This should incorporate the ITALIAN PROMPT ASSOCIATION TEST to probe implicit social biases, supplemented by counterfactual evaluations that measure output disparities when sensitive attributes like gender or region are altered in parallel prompts71,74.
Furthermore, a robust assessment requires Italian-focused safety audits for toxicity and harmful instructions, alongside established benchmarks for hate speech and misogyny like HASPEEDE and AMI76,77,78. These quantitative measures should be complemented by capability parity checks to ensure that safety controls do not disproportionately degrade performance for non-standard Italian varieties75. Mitigation must be an ongoing process integrated into the model's lifecycle. This includes curating and augmenting training data to balance regional representation and include gender-inclusive language, and extending preference optimization (DPO) to penalize stereotypical outputs and reward culturally appropriate responses73. A dedicated Italian safety policy, supported by local red-teaming and transparent reporting of known failure modes in the model card, is essential for responsible deployment76. Despite these measures, limitations will persist, including coverage gaps for dialects, the trade-off between safety and over-refusal, and the need for periodic re-training to maintain cultural alignment and competitiveness75. Integrating these targeted audits and mitigation strategies provides concrete evidence of responsible development and clarifies residual risks for all stakeholders. We intend to address these issues in future work building on the results presented here.

Conclusion

This work presents LLaMAntino-3-ANITA-8B-Inst-DPO-ITA, a Large Language Model fine-tuned specifically for the Italian language. The experimental results indicate the model’s high performance and versatility. The model demonstrates a proficient understanding of Italian nuances, handling various linguistic tasks with a high degree of accuracy. The model is suitable for deployment in several application scenarios, including information retrieval, topic modeling, sentiment analysis, recommender systems, and conversational agents. Its effectiveness in these areas can enhance academic research and provide practical solutions for industry.

The development of this model demonstrates the value of creating language-specific resources, particularly for languages underrepresented in the digital domain. Future research directions are manifold. A primary avenue involves applying this multi-stage adaptation methodology to larger base models, such as the 70B-parameter or larger versions of Meta AI's LLaMA models or subsequent architectures, to evaluate the scalability of the approach and potentially set new performance benchmarks. Furthermore, the model's capabilities could be assessed on a broader range of specialized NLP tasks, including legal document analysis, clinical text summarization, and creative content generation. Finally, the pipeline provides a robust framework for adaptation to other languages, contributing to a more linguistically inclusive AI ecosystem. Continued exploration in these areas, guided by ethical considerations and responsible AI practices, is essential for advancing the field. Indeed, this ongoing effort is underscored by the recent release of new multimodal and multilingual models based on the ANITA paradigm (https://huggingface.co/m-polignano/ANITA-NEXT-24B-Magistral-2506-VISION-ITA), demonstrating a sustained focus on advancing specialized AI resources.