Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of expert $i$. Existing MoE training frameworks usually employ a parallel training strategy, so $f_i$ and the LBL are calculated within a micro-batch and then averaged across parallel groups. A micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL operates almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. Because a global-batch contains far more diverse sequences than a micro-batch, this encourages load balance at the corpus level rather than within each sequence. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to 42.8B total parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
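The difference between the two ways of computing the loss can be seen in a toy example. The following is a minimal sketch, not the paper's implementation: it assumes top-1 routing, two experts, and two equally sized micro-batches, and all function and variable names (`routing_stats`, `lbl`, `code_batch`, `prose_batch`) are hypothetical. Synchronizing f_i is modeled here as a plain average over micro-batches, standing in for the extra communication step the abstract describes.

```python
# Toy comparison of micro-batch vs. global-batch load-balancing loss (LBL).
# Assumptions (not from the paper): top-1 routing, equally sized micro-batches.

def routing_stats(tokens, n_experts):
    """Return (f, p) for one micro-batch.

    tokens: list of (selected_expert, gate_scores) pairs, one per token.
    f[i]: fraction of tokens routed to expert i.
    p[i]: mean gating score assigned to expert i.
    """
    counts = [0] * n_experts
    score_sums = [0.0] * n_experts
    for expert, scores in tokens:
        counts[expert] += 1
        for i, s in enumerate(scores):
            score_sums[i] += s
    n = len(tokens)
    return [c / n for c in counts], [s / n for s in score_sums]

def lbl(f, p):
    """LBL = N_E * sum_i f_i * p_i."""
    return len(f) * sum(fi * pi for fi, pi in zip(f, p))

N_EXPERTS = 2
# A "code" micro-batch that prefers expert 0 and a "prose" one that prefers expert 1.
code_batch = [(0, [0.9, 0.1])] * 4
prose_batch = [(1, [0.1, 0.9])] * 4
stats = [routing_stats(b, N_EXPERTS) for b in (code_batch, prose_batch)]

# Micro-batch LBL: each micro-batch uses its own f, then the losses are averaged.
micro_lbl = sum(lbl(f, p) for f, p in stats) / len(stats)

# Global-batch LBL: f is first averaged across micro-batches (the "synchronize"
# step), then each micro-batch combines the shared f with its local p.
global_f = [sum(f[i] for f, _ in stats) / len(stats) for i in range(N_EXPERTS)]
global_lbl = sum(lbl(global_f, p) for _, p in stats) / len(stats)

print(f"micro-batch LBL:  {micro_lbl:.3f}")
print(f"global-batch LBL: {global_lbl:.3f}")
```

In this toy setup the micro-batch loss comes out at about 1.8, while the global-batch loss is about 1.0, which is the value a perfectly uniform router would get. The specialized routing is therefore penalized only when balance is measured inside each micro-batch, which is the effect the abstract attributes to the micro-batch LBL.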
Discussion
Host: Hey everyone, and welcome back to the podcast! Today, we're diving into something a bit different, something that's arguably the backbone of modern scientific communication, but often gets overlooked. I'm talking about arXiv.
Guest: Yeah, arXiv! It's funny, isn't it? For something so crucial, it's kind of…unassuming. It doesn't have the glitz and glam of, say, a flashy journal website. It's just this, well, repository. I always tell people it's like the wild west of academic papers – before they're all cleaned up and polished for formal publication.
Host: Exactly! It's the raw, unedited stuff. And that's actually what makes it so incredibly valuable. Think about it, Leo, before arXiv, researchers had to wait months, sometimes even years, for their work to be published in a journal. That's a glacial pace for something as dynamic as science. arXiv allows scientists to share their findings almost instantaneously, accelerating the whole research process.
Guest: Definitely! It's like a pre-print server, that's the best way to think about it for people who aren't super familiar with it. You know, it's not peer-reviewed in the traditional sense before it’s posted. It's a place to get your ideas out there fast, to get feedback from other experts in the field, and to establish priority. It lets you say 'Hey, I did this!' without waiting for a journal to schedule space for you.
Host: And that emphasis on speed is so crucial. With scientific research, you want to get your findings out there to spark conversation, so that others can build on your work and challenge it, even if it's preliminary. It's like having an open forum. It also avoids the scenario where someone else is working on something similar without being aware of it. That's what an open research forum does.
Guest: Absolutely, it also avoids wasted time. It speeds everything up. I mean before arXiv was a thing, researchers would present their work at conferences, and there's always this sense of like, 'Okay I need to get this out as soon as possible so that someone doesn't beat me to it.' This was before the culture of pre-prints and everything, and it led to a lot of pressure. ArXiv has sort of democratized research in that sense, giving equal access to that platform for anyone who's working on something.
Host: And I think that's a great point, Leo, the democratization aspect. Traditionally, publishing was gatekept by journals, often with high subscription fees, putting a barrier to access. ArXiv largely bypasses this by being freely accessible to everyone. It's made the world of research much more open and inclusive, which is incredibly valuable, especially for researchers in institutions with fewer resources.
Guest: For sure. I mean, the cost of some of these journal subscriptions is astronomical for smaller universities and for institutions in developing countries. It basically locks them out from keeping up with the latest scientific advancements. So having a platform like arXiv really levels the playing field, and promotes more equitable access to research findings.
Host: Definitely. And it's not just about access; it's also about transparency. You can see the evolution of ideas, from the first draft of a paper to its final published version, which sometimes changes quite dramatically from what it initially was. It allows people to go back in time and follow how a paper has changed through its iterations. This gives some insight into the scientific process.
Guest: Right, you get to see the messy bits, the thinking out loud. You see where they've changed their mind and revised things, that's great! It shows the process and progress of research. Because a final published paper can be polished to a degree that removes some of that, shall we say, 'real-world' feel. With ArXiv it's raw, and often you can get it up to a year before it's actually published in a journal.
Host: Exactly. And the fact that it's unedited means that there can be, and sometimes are, mistakes, typos, etc. But that's part of the process as well. It's a very human process. You can get a sense of where there are flaws, and where the work can be built upon. We've been touching on the access, openness and speed. It really highlights the fundamental benefits. I mean, I can't imagine a world without it now!
Guest: It's like the core of the whole research process now. From quickly sharing results, to getting feedback, to checking the priority of who published what and when, it's all done on arXiv. I think the funny thing is that most people who aren't researchers probably haven't heard of it before. They might hear about the final article on the news, but don't know how it came to be, which is all in those pre-prints on arXiv.
Host: That's a really great point, Leo. It's often this invisible infrastructure that supports so much of the scientific progress we see. People might hear about a groundbreaking discovery, read a news article, but they rarely get a glimpse of the actual process and the pre-print that kickstarted that entire journey. The fact that most people aren't familiar with the website itself highlights its role as a functional tool for researchers, rather than a public-facing platform. It isn't designed to be public-facing either. It serves a specific purpose.
Guest: Exactly. You wouldn't go on arXiv for leisure reading, that's for sure! It's built for functionality, to be efficient, for searchability. And it's mostly papers in very specific academic fields. So it's not intended for a casual audience. But I think it’s important for people to understand the role it plays. It's also not just one area of research that it caters to, it has a broad variety, from mathematics to computer science to physics and many other areas.
Host: Absolutely, it spans a huge range of disciplines. And that's another crucial aspect – the interdisciplinary nature of it. Researchers from different fields can access and engage with work that might be relevant to their area. It promotes this cross-pollination of ideas and research, which is incredibly beneficial for pushing the boundaries of knowledge. You might be working on some quantum physics, and come across a mathematics paper that could unlock a problem you've been struggling with for example.
Guest: Yeah, and sometimes you'll see the same ideas popping up independently in different fields, which then sparks conversations and collaborations between these fields. It's like, 'Hey, you guys are using that same math I'm working on!' That can lead to some really incredible breakthroughs. ArXiv is almost like a catalyst for these types of conversations.
Host: And that's something that might not happen in the same way if we just relied on traditional peer-reviewed journals. Because those are often organized by discipline, they can silo research quite a bit. Whereas ArXiv doesn't have those constraints. You can search across disciplines, keywords, and this is very valuable. I also find the way they categorize papers fascinating, it's not just a simple folder system, it's quite nuanced.
Guest: Right, it's not just 'physics', it's broken down into 'condensed matter physics', 'high energy physics', and all these very specific sub-categories. And that's because researchers need that level of detail when they're looking for papers. And the search functionality too is incredibly useful, so they can filter by author, by keywords, by date. It's like a specialist's library.
Host: It really is, and speaking of dates, everything is time-stamped, so you can see when each submission was uploaded. I think that's extremely important for verifying priority: who first proposed a particular theory, or experiment, and so on. It avoids a lot of possible disputes.
Guest: Oh yeah, time stamping is critical. Because in the research world, priority is everything. You've got to be able to show that you were the first to come up with an idea. ArXiv provides that crucial proof of date of authorship. It's a public record in a way.
Host: Exactly, it’s an easily searchable public record. And it's not just for those established researchers either. Students, early career researchers, can post their work there too. It’s one of the reasons why it’s so valuable. It allows for these new ideas to be shared very quickly.
Guest: And that's the beauty of it. It's open to anyone, as long as their work fits within the academic remit. You don't need to be an established professor at a prestigious university. If you've got a new idea, and the research to back it up, you can share it on arXiv. It's a great place for up-and-coming scientists to get their work out there.
Host: It really is, it democratizes the academic publication process at the very beginning. You also mentioned earlier that it was a bit like the wild west, in that it's pre-peer-review. So how does that aspect of it play out, given that it is unedited and un-reviewed?
Guest: That’s a good point. It does mean that sometimes you come across things that are...let's just say, 'not quite there.' But that’s also part of the process. It's up to the community to scrutinize the work, to provide that critical feedback. And because it's open, everyone can see, anyone can comment. That’s one of the ways it's very different from traditional publishing. The peer review happens after, rather than before.
Host: Yeah, it's more like a community peer-review process. It's like a living document, constantly being critiqued and refined by the scientific community. It puts the onus on researchers to be critical thinkers, to evaluate the research for themselves, rather than relying solely on the 'stamp of approval' from a journal.
Guest: Which is a great thing. It encourages scientific skepticism, I think, in a good way. It's not just a case of blindly trusting what’s printed in a journal; it’s about reading the work for yourself, evaluating the methodology, and deciding whether you agree with the conclusions or not. The speed with which people are now publishing their work is astonishing. But there's been a huge acceleration in the speed of research and innovation generally in recent years. It's like we're riding on this giant wave.
Host: It really is a tidal wave. And I think ArXiv plays a huge role in that, it feeds this constant rapid cycle. It's always evolving. And even though the user interface might seem a bit outdated and basic to some, it’s incredibly effective. It just functions as it should. I think this is also a sign of how efficient the site is, that it doesn't need to waste resources on a fancy UI, it’s all about the function.
Guest: Totally, it’s not trying to be flashy, or eye-catching, it's purely designed to get the job done. And it’s been doing that since the 90s, that's a long time for an online platform. It’s been a bedrock for scientific communication for decades, and it’s likely to continue to be so for a long time to come. And it's not just for the pure sciences. You see it more and more in social sciences as well.
Host: It really has become a ubiquitous tool, and that's a really good point. The social sciences are increasingly using it. And I think that speaks to the power of the platform, its flexibility, and its ability to support different kinds of research. It's no longer just physics and math, it's become multidisciplinary.
Guest: Yeah, it's also interesting that the site itself doesn't generate revenue or anything, it's all non-profit. It's supported by institutions, the Simons Foundation and other donations. So that also means that access is freely available, and the focus is entirely on research and not profit. That's something that's often forgotten, because in most other contexts, anything that's free has a hidden agenda. arXiv doesn't have that.
Host: That's such an important point to highlight. The non-profit, community-driven nature of ArXiv. It's a great example of a public good, supported by institutions and individuals who believe in the importance of open access to knowledge. The whole model is built on collaboration and contribution.
Guest: Exactly. It's a testament to the power of open science. And it shows what's possible when people prioritize sharing knowledge over profit. ArXiv is a valuable resource that's been available to the world for a long time now and many of us take it for granted, but without it, we wouldn't be where we are today.
Host: It's a crucial piece of the modern research ecosystem. Something that is so important, but often invisible to the public. It enables this rapid pace of scientific progress and allows for collaborative research globally. It’s amazing it's all freely available. I think most people, if they had a peek into ArXiv, would find themselves fascinated by what they could find.
Guest: Yeah, and it’s all publicly available. If people want to delve into the source, and go through the archives, they can. It’s all open and transparent, it’s all there. That's what's really powerful, and underpins a lot of the great research that we hear about. So it's worth having a look, at least to understand the underpinnings of the research process.
Host: And on that note I think it's important to highlight, for those of you who are listening, the paper we've been discussing today actually has no HTML rendering. What does that mean exactly? You can find all sorts of papers on ArXiv, sometimes in very different formats. Not all papers are available in HTML, so that means you often have to download it, for example, as a PDF or other format. Why is that the case? That's something we should probably delve into.