How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
The performance of Large Language Models (LLMs) on many tasks is greatly limited by the knowledge learned during pre-training and stored in the model's parameters. Low-rank adaptation (LoRA) is a popular and efficient training technique for updating LLMs or adapting them to specific domains. In this study, we investigate how new facts can be incorporated into an LLM using LoRA without compromising previously learned knowledge. We fine-tuned Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our experiments show that the best results are obtained when the training data contains a mixture of known and new facts. However, this approach is still potentially harmful, because the model's performance on external question-answering benchmarks declines after such fine-tuning. When the training data is biased towards certain entities, the model tends to regress to a few overrepresented answers. In addition, we found that the model becomes more confident and refuses to provide an answer in only a few cases. These findings highlight the potential pitfalls of LoRA-based LLM updates and underscore the importance of training data composition and tuning parameters for balancing new-knowledge integration with general model capabilities.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's topic. We're diving into the fascinating world of Large Language Models, or LLMs, and how we can make them even smarter. It feels like every week there's a new breakthrough in AI, and LLMs are definitely at the forefront of that.
Guest: Absolutely, Leo! It's incredible how quickly LLMs are evolving. They've gone from being able to generate basic text to now being able to answer complex questions, translate languages, and even write different kinds of creative content. The potential applications are practically limitless.
Host: Exactly! But like any powerful tool, LLMs have their limitations. One of the biggest challenges is keeping them up-to-date with the latest information and incorporating new knowledge without messing up what they already know. That's where techniques like LoRA, or Low-Rank Adaptation, come into play. It's a way of fine-tuning these massive models without completely retraining them from scratch, saving a ton of computational power.
Guest: That's a great point, Leo. LoRA has become incredibly popular because it's so efficient. Traditional fine-tuning can be incredibly resource-intensive, especially for models with billions of parameters. LoRA allows us to make targeted updates, focusing on specific areas where the model needs improvement. But there's a catch, right? We can't just cram information into these models without considering the potential consequences.
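To make that concrete, here is a minimal sketch of what attaching a LoRA adapter to Llama-3.1-8B-Instruct looks like with the Hugging Face PEFT library; the rank, alpha, and target modules shown are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal LoRA setup (illustrative; hyperparameters are assumptions,
# not the configuration reported in the paper).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Only the small A/B matrices are trainable; the 8B base weights stay frozen,
# which is where the efficiency mentioned above comes from.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```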
Host: You hit the nail on the head! That's exactly what we're exploring today. We're going to be discussing a really interesting paper called 'How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?' by Sergey Pletenev and a team of researchers. They've investigated just how much new knowledge we can pump into an LLM using LoRA before it starts to negatively impact its existing abilities. It's all about finding that sweet spot, that balance between learning new things and preserving what's already there. It's a bit like adding new books to a library without knocking the shelves over.
Guest: That's a perfect analogy, Leo! And it's such a crucial question. We want our LLMs to be knowledgeable and adaptable, but not at the expense of their overall performance. So, what did the researchers actually do in this study? What kind of knowledge were they trying to inject into the model, and how did they measure the impact?
Host: Well, the paper dives into a few key areas, starting with a solid introduction that lays out the problem. They talk about the increasing reliance on LLMs and the need to keep them updated. But they also point out the potential pitfalls of simply fine-tuning these models, like catastrophic forgetting – where the model basically forgets old information as it learns new stuff. And then they delve into related work, reviewing other research that's been done on knowledge integration and model editing. This gives us a good understanding of the context of their study and what other approaches are out there.
Guest: Right, setting the stage is always important. So, after the introduction and literature review, how did they actually design their study? I'm curious about their methodology and the specific LLM they used.
Host: They used Llama-3.1-8B-instruct. It’s a decoder-only model that has gained a lot of popularity recently due to its increased helpfulness and reduced hallucination rates. Then they fine-tuned it using LoRA, focusing on incorporating varying amounts of new knowledge. Think of it like giving the model different doses of new information and seeing how it reacts. Their core question was how the model degrades, both intrinsically, measured through positive and negative shifts in what it knows, and extrinsically, by tracking the degradation of reasoning abilities on external benchmarks like MMLU and TruthfulQA. And they tested with 1, 10, 50, 100, 500, and 3,000 facts.
Guest: Okay, so they're systematically varying the amount of new information. But how did they define 'new knowledge'? That seems like a crucial aspect of their study.
Host: Exactly! They defined a 'knowledge fact' as a combination of a question and its corresponding answer. Then, they categorized knowledge into three groups, which I thought was a really clever approach. They had 'HighlyKnown' facts, which the model always answers correctly; 'MaybeKnown' facts, which it answers correctly sometimes; and 'Unknown' facts, which it never answers correctly. This categorization helped them understand which facts were genuinely new to the model.
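A simplified sketch of how such a categorization could be implemented: sample several answers per question and bucket the fact by how often the model answers correctly. The sample count and the alias-matching check below are assumptions for illustration, not the paper's exact protocol.

```python
# Simplified categorization of facts by how well the model already knows them.
# The number of samples and the matching rule are illustrative assumptions.
def categorize_fact(generate_answer, question, gold_aliases, n_samples=10):
    """generate_answer(question) -> str is any sampling-based generation call."""
    correct = 0
    for _ in range(n_samples):
        response = generate_answer(question)
        if any(alias.lower() in response.lower() for alias in gold_aliases):
            correct += 1
    if correct == n_samples:
        return "HighlyKnown"   # always answered correctly
    if correct > 0:
        return "MaybeKnown"    # sometimes answered correctly
    return "Unknown"           # never answered correctly
```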
Guest: That's a smart way to do it. So, they're not just throwing random information at the model; they're carefully classifying it based on what the model already knows. This allows them to track how the model's knowledge shifts as they introduce new facts. Did they also consider other factors, like the reliability of the model after fine-tuning?
Host: Yes, they did! They looked at 'reliability,' which they defined as the model's ability to remember both current and previous edits after sequential editing. It's not just about whether the model gets the right answer, but also whether it can consistently get the right answer, even after being exposed to new information. I think this is something a lot of people overlook when evaluating LLMs. It’s easy to look at a single data point for evaluation, but it's important to test for reliability.
Guest: That makes perfect sense. Consistency is key! If a model can only get the right answer sporadically, it's not really that useful in practice. So, they had their knowledge categories, their reliability metric… what about the 'undesirable effects' they mentioned? How did they try to quantify the potential harm that could come from adding too much knowledge?
Host: That's where the intrinsic and extrinsic evaluation methods come in. For the intrinsic evaluation, they leveraged the knowledge categories we talked about. They looked at what facts the model learned – for example, a fact shifting from 'Unknown' to 'HighlyKnown' – and what facts it forgot – a fact shifting from 'HighlyKnown' to 'Unknown'. They called these 'positive shifts' and 'negative shifts,' and they were trying to minimize the negative shifts as much as possible.
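A small sketch of how those shifts can be counted once each fact has a category before and after fine-tuning; the bookkeeping shown is an illustrative assumption rather than the paper's exact accounting.

```python
# Count knowledge shifts between categories assigned before and after fine-tuning.
# Category names follow the discussion above; the aggregation is illustrative.
from collections import Counter

RANK = {"Unknown": 0, "MaybeKnown": 1, "HighlyKnown": 2}

def count_shifts(before, after):
    """before/after: dicts mapping fact id -> category."""
    shifts = Counter()
    for fact_id, old_cat in before.items():
        new_cat = after[fact_id]
        if RANK[new_cat] > RANK[old_cat]:
            shifts["positive"] += 1     # e.g. Unknown -> HighlyKnown
        elif RANK[new_cat] < RANK[old_cat]:
            shifts["negative"] += 1     # e.g. HighlyKnown -> Unknown
        else:
            shifts["unchanged"] += 1
    return shifts
```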
Guest: So, it's like tracking the flow of knowledge within the model. Are facts moving in the right direction, from unknown to known, or are they getting lost along the way? What about the extrinsic evaluation? How did they assess the model's performance on real-world tasks?
Host: For the extrinsic evaluation, they used two well-established benchmarks: MMLU and TruthfulQA. MMLU, or Massive Multitask Language Understanding, is a benchmark for knowledge and reasoning, used as a proxy for measuring the model's overall reasoning abilities. TruthfulQA was chosen as an additional proxy for truthfulness. It includes a set of tricky questions that even some humans would answer falsely. By testing the fine-tuned models on these benchmarks, they could see how the new knowledge was affecting the model's general capabilities.
Guest: That's a really comprehensive approach! They're not just looking at whether the model can memorize new facts; they're also assessing its reasoning abilities and its tendency to generate truthful answers. So, with all that in place, what did their experiments actually involve? What kind of data did they use, and how did they fine-tune the model?
Host: To avoid potential contamination, they constructed data that was not included in the pre-training datasets of LLMs. They used knowledge-graph facts stored as <subject, relation, object> triples. They also used entities categorized by density and popularity as head, torso, and tail, to balance the dataset in terms of question complexity. Finally, they extracted their own triples from DBpedia and created (q, a) pairs based on templates.
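A sketch of how a <subject, relation, object> triple can be turned into a (q, a) pair with a relation-specific template; the relations, templates, and example triple below are invented for illustration.

```python
# Turn <subject, relation, object> triples into (question, answer) pairs using
# per-relation templates. These relations and templates are invented examples;
# the paper builds its own templates over DBpedia relations.
TEMPLATES = {
    "birthPlace": "Where was {subject} born?",
    "author":     "Who is the author of {subject}?",
    "capital":    "What is the capital of {subject}?",
}

def triple_to_qa(subject, relation, obj):
    question = TEMPLATES[relation].format(subject=subject)
    return question, obj

# Example: a hypothetical triple produces one training pair.
q, a = triple_to_qa("The Silent Valley", "author", "A. N. Example")
print(q, "->", a)
```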
Guest: That makes a lot of sense. Ensuring the data isn’t already in the model is crucial. Then they used Llama-3.1-8B-Instruct, is that right?
Host: That's right. They opted for the instruction-tuned version because of its stronger ability to follow instructions. Then they fine-tuned the model with 1, 10, 50, 100, 500, and 3,000 Unknown (q, a) pairs, where Unknown means the set of questions that the default Llama-3.1-8B-Instruct model never answered correctly. In their methodology, an answer is considered correct if the object from the question's triple, or one of its aliases, appears in the model's response, and the questions were asked in zero-shot mode.
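A sketch of that zero-shot probing and the alias-based correctness check, assuming a Hugging Face chat model; the prompt handling and decoding settings are assumptions for illustration.

```python
# Zero-shot probing of the model: one question, no in-context examples.
# Prompt wording and decoding settings are assumptions for illustration.
def ask_zero_shot(model, tokenizer, question, max_new_tokens=32):
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)

def is_correct(response, gold_aliases):
    # Correct if the gold object or any of its aliases appears in the response.
    return any(alias.lower() in response.lower() for alias in gold_aliases)
```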
Guest: Zero-shot mode. So, they just gave the model a question and expected it to generate the answer without any examples or context. Were there any other nuances in their training regime worth highlighting?
Host: Yeah, they used some data augmentation techniques. Recognizing the challenge of simply fine-tuning LoRA on new knowledge, they augmented the training dataset with synthetic data, including paraphrases and HighlyKnown facts.
Guest: Interesting. I can see the rationale behind that. By adding paraphrases, you're essentially giving the model multiple perspectives on the same fact, helping it to generalize better. And by adding HighlyKnown facts, you're reinforcing the model's existing knowledge, hopefully preventing it from forgetting things.
Host: Exactly! It's like anchoring the new information to the model's existing understanding of the world. They trained the models with 0, 1, and 10 paraphrases per question, and they used Llama-3-70B-Instruct to generate the paraphrases. In the HighlyKnown mode, they added HighlyKnown samples on top of the Unknown samples, where a sample counts as HighlyKnown if the default Llama-3.1-8B-Instruct already answers it correctly every time.
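A sketch of assembling one such training mixture from Unknown facts, paraphrases, and optional HighlyKnown samples; the helper names and mixing details are illustrative assumptions, not the paper's exact recipe.

```python
# Assemble one training mixture: each Unknown fact, a chosen number of
# paraphrases, and optionally some HighlyKnown facts. Helper names and the
# mixing details are illustrative assumptions.
import random

def paraphrase(question, seed=0):
    # Placeholder stand-in: the paper generated paraphrases with Llama-3-70B-Instruct.
    return f"{question} (paraphrase {seed + 1})"

def build_training_set(unknown_facts, paraphrases_per_q=10, highly_known_facts=None):
    """unknown_facts: list of (question, answer) pairs;
    paraphrases_per_q: 0, 1, or 10 in the experiments discussed above."""
    examples = []
    for question, answer in unknown_facts:
        examples.append((question, answer))
        for i in range(paraphrases_per_q):
            examples.append((paraphrase(question, seed=i), answer))
    if highly_known_facts:
        examples.extend(highly_known_facts)
    random.shuffle(examples)
    return examples
```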
Guest: Okay, that's a really well-designed set of experiments! They're systematically varying the amount of new knowledge, the type of data augmentation, and then carefully measuring the impact on the model's performance using both intrinsic and extrinsic evaluations. So, what were the main findings? What did they discover about the relationship between knowledge packing and LLM performance?
Host: Well, one of the key findings was that models can reliably learn up to 500 unknown samples, achieving a 100% reliability score. However, when they tried to incorporate 3,000 unknown samples, 10 epochs of training wasn't enough for the model to learn all the new facts. This suggests that there's a limit to how much new knowledge you can effectively pack into a LoRA adapter, at least with the training configuration they used. Also, adding paraphrases for each unknown sample results in the model converging faster. But adding HighlyKnown data can be harmful to the training process, or at best have a neutral effect on the convergence.
Guest: That's fascinating! So, there's definitely a point of diminishing returns. You can't just keep adding more and more knowledge without seeing some negative consequences. And it sounds like the type of data you use for training also matters a lot. Paraphrases seem to be helpful, but HighlyKnown facts can actually hinder the learning process in some cases. What about the knowledge shifts they were tracking? Did they observe any patterns in how the model's knowledge changed after fine-tuning?
Host: They found that training with HighlyKnown samples was the most effective strategy for both maximizing positive shifts and minimizing negative shifts. It seems like reinforcing the model's existing knowledge helps to stabilize the learning process and prevent it from forgetting things. However, in almost all training modes, they lost more than they won, with the negative shift being higher than the positive one. But they also observed that as the amount of unknown data increased, the difference between positive and negative shifts started to shrink. So, as the model learns more new facts, it becomes more efficient at retaining its existing knowledge.
Guest: That's a really interesting dynamic! It's like the model is initially struggling to integrate the new information, but as it learns more, it becomes better at balancing the old and the new. It would be interesting to investigate whether that trend continues with even larger amounts of new knowledge. Now, let's talk about the external benchmarks. How did the fine-tuning affect the model's performance on MMLU and TruthfulQA?
Host: That's where things get a bit more complicated. They found that adding just 10 HighlyKnown or paraphrased samples to the training data led to a significant drop in accuracy on the MMLU benchmark. This suggests that even small amounts of new knowledge can disrupt the model's reasoning abilities. On the other hand, when they looked at truthfulness on TruthfulQA, they saw that MC1 and MC2 accuracy scores were significantly higher for the training mode with extra paraphrased samples. This indicates that paraphrasing can actually improve the model's ability to generate truthful answers.