Space Summary
This Twitter Space, AI Papers of the Week (8/15-8/21), was hosted by iScienceLuvr with Aran. Dive into the world of AI research and innovation as Aran shares insights and perspectives on the latest developments in the field. From his journey of achieving a PhD at a young age to founding a successful AI company, Aran exemplifies continuous learning, interdisciplinary skills, and community engagement within the AI landscape. Discover the impact of knowledge sharing, the significance of platforms like Kaggle Notebooks, and the inspiration drawn from Aran's diverse experiences. Explore how early exposure to AI can empower aspiring professionals and drive industry advancements.
For more spaces, visit the AI page.
Questions
Q: How do platforms like Kaggle Notebooks contribute to AI education and collaboration?
A: Platforms like Kaggle Notebooks provide a practical learning environment, foster collaboration among AI enthusiasts, and showcase diverse projects.
Q: What motivates professionals to stay abreast of the latest AI research trends?
A: The constant evolution of AI technologies, career advancement opportunities, and the drive for innovation motivate professionals to stay updated on AI research.
Q: What lessons can aspiring AI professionals learn from Aran's journey?
A: Aspiring AI professionals can learn the value of early specialization, continuous learning, interdisciplinary skills, and active community engagement from Aran's journey.
Q: How does Aran's background in biomedical engineering complement his AI expertise?
A: Aran's background in biomedical engineering enhances his AI insights by providing a unique perspective on healthcare applications, interdisciplinary problem-solving, and innovation.
Q: What benefits do young talents gain from early exposure to AI and technology fields?
A: Early exposure to AI and technology fields empowers young talents to develop niche expertise, pursue accelerated career paths, and contribute meaningfully to industry advancements.
Q: How can knowledge sharing, as highlighted in Aran's TEDx talk, impact the AI community?
A: Knowledge sharing fosters collaboration, inspires innovation, and cultivates a supportive community within the AI industry, driving collective growth and progress.
Highlights
Time: 00:23:45
Aran's Journey to Becoming a Young PhD and AI Entrepreneur: Discovering the milestones and challenges faced by Aran on his path to becoming a prominent figure in AI at a young age.
Time: 00:42:19
AI Innovations and Breakthroughs Discussed: Exploring the latest AI research papers, innovations, and breakthroughs highlighted by Aran during the week.
Time: 01:15:07
Community Engagement and AI Learning Platforms: Insights into the importance of community engagement, collaborative platforms like Kaggle Notebooks, and educational resources in advancing AI knowledge.
Time: 01:45:32
Interdisciplinary Skills in AI and Biomedical Engineering: Understanding the synergies between AI and biomedical engineering, leveraging interdisciplinary skills for impactful AI applications in healthcare and beyond.
Time: 02:05:11
Inspiration and Learning from Aran's TEDx Talk: Gaining motivational insights and knowledge from Aran's TEDx talk, emphasizing the power of sharing expertise and experiences in AI.
Time: 02:25:55
Encouraging Youth in AI and Technology: Empowering young talents to explore AI early, leverage technological advancements, and contribute meaningfully to the ever-evolving landscape of AI.
Key Takeaways
- The significance of staying updated with the latest AI research for professionals and enthusiasts.
- Impactful career milestones achieved by Aran at a young age, including earning a PhD at 19 and founding MedARC_AI.
- The role of platforms like Kaggle Notebooks in fostering AI learning, collaboration, and community engagement.
- The fusion of biomedical engineering background with AI expertise, showcasing interdisciplinary skills.
- Inspiration drawn from Aran's TEDx talk, underlining the importance of knowledge sharing in AI.
- Insights into the dynamic and diverse career paths within AI, exemplified by Aran’s roles and accomplishments.
- The value of continuous learning and exploration in AI to drive innovation and personal growth.
- The embodiment of a multi-faceted AI professional through Aran's profile as a researcher, CEO, and Kaggle GM.
- Encouragement for young talents to pursue their interests and make significant contributions in the AI landscape.
- The strategic use of social media to share expertise, connect with the AI community, and showcase achievements.
Behind the Mic
Introduction
Hello, everyone. I think we will get started. Are you ready? Yeah. Cool. Yeah, let's get started. Thank you, everyone, for joining us. I'll put up the list of papers that we're talking about today, which Aran had posted.
Presentation Start
So we will be starting out by having Emozilla from... I guess I don't know if that's how you say it, but Emozilla from Nous Research, to present Hermes 3. So I will add him as a speaker. Okay, let's see here. As usual, I guess, I don't know if there are difficulties or something. Let's see here why it says that I've invited him. So I think Emozilla has to accept. At least it's better than last time, where I wouldn't see the person in the space to begin with. Yeah. Has he requested to be a speaker? No, not yet. I sent him a request. Okay, there you go. Now it says he's a speaker.
Emozilla Joins
Hey, there. Hello. Yeah, we can hear you. Awesome. Awesome. Cool. Yeah, I guess feel free to get started. We'll share the Hermes 3 paper, the tweet, in the space, so people can have access to the link as well. But, yeah, go ahead. And we'd love to hear more about Hermes 3.
Introduction of Emozilla
All right. Hi there, everyone. My name is Emozilla; you did guess correctly. I'm one of the co-founders of Nous Research, along with some other people you might know: Teknium, Mephisto, and Shivani. The four of us together were the original co-founders, but we've grown a lot. Our most recent release is the Hermes 3 model. The Hermes models were kind of what got us a lot of attention at first.
Synthetic Data Discussion
And one of the things about Hermes that made Hermes special, especially in the beginning, is that we were really quite early to the synthetic data sort of thought process or paradigm. Rewind, I would say, several years ago: there was this question, and you still see it maybe in areas outside of LLMs, maybe when you describe what it is we do to people on the street, of whether an AI could ever make training data that would be able to help train another model. There's this intuition that somehow there's this nugget of realism in human-written text, and that once an AI touches it, anything that comes out of an AI is necessarily degraded. And I think we've broadly seen over the last several years
Mechanisms of Hermes Models
that the original Hermes was an example of it, and then later on things like LIMA and now Phi and other papers showed that synthetic data works. And I know Nvidia just today released this distillation work; they've been working on this distillation stuff. They really show that you can have teacher models that create synthetic data, and that models trained from that synthetic data can improve. Nearly all of the data inside of Hermes 3 was synthetically generated. We use different models, you know, the frontier models, Llama models; we use different ones at different stages for different tasks.
Data Generation Methodology
But nearly all of the data itself is synthetically generated, which is different from the Hermes 2 series, which was a mix of synthetic data and what we would call high-quality chat conversation data, where you were just taking highly rated chat conversations from other LLMs and training on that. We took this a step further, where for every data sample, we took it through a synthetic pipeline that augmented it, judged its quality, made changes to it, regenerated it, and so forth. This was originally pioneered with the Evol-Instruct method, but we've since done a lot more with our pipeline.
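As a rough, hypothetical sketch of the kind of augment-and-judge loop described here (not Nous's actual pipeline; `call_teacher` is a placeholder for a request to whatever teacher model you have access to):

```python
# Hypothetical augment -> judge -> regenerate loop in the spirit of Evol-Instruct.
# `call_teacher` is a placeholder for a request to a teacher model; replace it
# with your own model call. This is illustrative, not the Hermes 3 pipeline.
def call_teacher(prompt: str) -> str:
    raise NotImplementedError("plug in a call to a teacher LLM here")

def refine_sample(text: str, rounds: int = 3, min_score: int = 8) -> str:
    best = text
    for _ in range(rounds):
        candidate = call_teacher(
            "Rewrite this sample to be more complex and higher quality:\n" + best
        )
        score = int(call_teacher(
            "Rate the quality of this sample from 1 to 10. Reply with a number only:\n" + candidate
        ))
        if score >= min_score:
            return candidate          # good enough, keep it
        best = candidate              # otherwise keep iterating on the latest draft
    return best
```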
Design Philosophy of Hermes Three
But that is sort of the genesis of the idea: even for, quote unquote, real-world data, like conversations, you are synthetically augmenting it, and out of that you get a better model. So Hermes 3 was designed from the start to be sort of neutrally aligned. It's important to us, and we state this in the preface and introduction of the paper, that a lot of chat assistant models can be sort of moralizing or have refusals and stuff.
Ethics and Model Design
And really, at the base model layer, we just don't think that this is the appropriate place for lobotomization, or whatever you want to call it. The mind should be free to think, and it is the things we do with our thoughts that largely dictate what we decide is bad; there should be no thought crime, you might say. So we've put a high emphasis on making sure that the system prompt is, like, the ultimate constitution for Hermes 3.
System Prompt Implementation
So much of the training data is really explicitly trying to make the system prompt be the thing that matters in how it answers. And this is important because it allows the model to place itself within different frames of thought. If you think about the default assistant that you get from, you know, ChatGPT, Claude, or Llama 3 Instruct, there's this persona, and it's adopted this worldview of a helpful but paternalistic idea, and that is just sort of the only thing it can be, right?
Functionality Variability
Whereas with Hermes 3, the system prompt tells the model what sort of world of thought to place itself into. And this is very powerful, because you can get all sorts of behavior that you can't get out of a normal instruction model. But it also makes the model, I wouldn't say more difficult to use, but it requires more control, because the helpful assistant is just one aspect of what Hermes 3 can be. Hermes 3 can be much more than just the helpful assistant, and that is dictated by the system prompt.
Exploring Model Behavior
And we discovered this on the 405B: some peculiar generations where, if we didn't have a system prompt, the model would start talking about not knowing where it is and all these other things. We did some investigation and discovered that what had happened was that, in the absence of a system prompt at all, the model decided it needed to create a world; it needed to place itself in a world where the absence of the system prompt was the world. There was nothingness in the world.
Amnesia and Model Perspectives
The world consisted of nothing, because the system prompt is everything, but there's no system prompt. So it created this world where it was role-playing as this amnesiac character, because it needed to justify a world with no system prompt. So we found a lot of interesting things along the way. But ultimately, as most people will tell you, most models come down to the data, and the data is where the secret sauce is, you could say, if there is such a thing.
Training Processes and Techniques
And the secret sauce, I guess, would be the synthetic mechanisms. But the overarching goal was the idea of making this neutrally aligned model that is still instruct-tuned. It's not a base model, but it isn't just the helpful assistant. So that was what we did on the data side. On the training side, we did a pretty standard training regimen, I would say, although done pretty rigorously, with hyperparameter sweeps for learning rates and looking at how different epochs were doing.
Training Findings and Adjustments
And if you look in the Llama 3.1 paper, they talk a lot about this; basically, we did a lot of the same things that were done in the Llama 3.1 paper, which is going beyond a single training run: doing multiple trainings and merging, although in the end we didn't do any merging with this. And one thing we found, actually, was that as the model sizes got larger, the need for DPO went down. We actually trained the DPO versions, but found that they didn't do much.
Model Size Considerations
So we actually stuck with the SFT versions for the 70B and the 405B. And what may be unique in the training process, something that hasn't been done before, with one potential other exception we're not aware of because it wasn't marked as such on Hugging Face, and we actually asked the Hugging Face team beforehand and they said this was true, is that this was the first full-parameter fine-tune of the 405B.
Training Challenges
And if there's something interesting in the training, I would say it was what it took to wrestle the 405B and what it takes to actually train it, which is a whole other beast than what we've dealt with before. A side note: I mentioned this on another space, and I don't think I've been corrected yet, which is usually how things work, I say stuff and then people later correct me. So we "know," for example, quote unquote, that GPT-4 is around a trillion parameters, the big one.
Parameter Comparisons
And that it's like an MoE, and I believe the same is true for Claude. So as far as I know, and especially in terms of tokens, Llama 3.1 405B may be the deepest, longest-trained, biggest dense transformer that has ever been trained. And so it's sort of special that in the open source community we actually even get access to this.
Model Training Challenges
So training the 405B was a bit of a challenge. You run into all sorts of things, like when you go to save checkpoints, PyTorch will run out of CPU memory because you can't even fit the model in CPU memory when it's trying to save to safetensors, and you have to do efficient loading where you never materialize the entire model on a single node.
Efficiency Techniques
Even a single HGX node only has 640GB of VRAM, and in BF16, 405B, which is actually 410B if anyone's keeping count at home, is 820GB. So you can't, at any stage of your training process, materialize the tensors all within a single node. There were a lot of things it took to go around and figure out exactly how that works. Sample packing was key: we use Flash Attention 2's attention-mask-free sample packing, where you're able to train multiple samples within one sequence.
Training Methods and System Architecture
You set the rope positions, resetting them per sample, and then with Flash Attention 2 you pass the sequence lengths of all the sequences, and it doesn't allow cross-contamination. That was very important because we had a highly heterogeneous set of data lengths within the training set, so it really mattered for efficiency.
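A minimal sketch of the attention-mask-free sample packing just described, assuming the `flash_attn` varlen interface; this is illustrative, not the actual Hermes 3 training code:

```python
import torch

# Pack several samples into one sequence: position ids restart at 0 per sample,
# and cumulative sequence lengths mark the boundaries so the FlashAttention-2
# varlen kernel never attends across samples. Illustrative sketch only.
def pack(samples: list[torch.Tensor]):
    lengths = torch.tensor([len(s) for s in samples])
    input_ids = torch.cat(samples)                                  # one packed sequence
    position_ids = torch.cat([torch.arange(n) for n in lengths])    # reset per sample
    cu_seqlens = torch.nn.functional.pad(lengths.cumsum(0), (1, 0)).to(torch.int32)
    return input_ids, position_ids, cu_seqlens

input_ids, position_ids, cu_seqlens = pack(
    [torch.arange(5), torch.arange(3), torch.arange(7)]
)
print(cu_seqlens)  # tensor([ 0,  5,  8, 15], dtype=torch.int32)
# These boundaries would be passed as cu_seqlens_q / cu_seqlens_k to
# flash_attn.flash_attn_varlen_func(..., causal=True) during training.
```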
Model Set Development
In the end, what we came up with was three sets of models: the 8B, the 70B, and the 405B, which from an evaluation perspective are essentially at par with Llama 3.1 Instruct. They beat us on a couple of things, particularly IFEval, which I have thoughts on, and by a few points on things like MMLU-Pro; we win on some other things like GPQA, BigBench, and MuSR. So broadly speaking, it was about a tie with Llama 3.1 Instruct, which ultimately we were actually very happy with.
Differentiating Factors
Llama 3.1 Instruct is an extremely strong model. Meta did a very good job with it. I wouldn't say the same thing about the chat models they did for 2 and 1, but 3's instruct model is very good. Where we feel Hermes is differentiated is really that it isn't just the helpful assistant; it's an instruct model
Additional Features
that's excellent at role playing and function calling. It has a whole bunch of extra features built in. It was trained on billions of tokens of RAG, citations, and tool use; there's a lot of extra stuff that went into it too. So we have this neutrally aligned, uncensored version of Llama 3.1 Instruct. We were happy with that result and we've put it out there, and
Community Engagement
a lot of people seem to really be enjoying it. We were lucky that Lambda Labs, now Lambda API I believe, you might know them as Lambda Labs, sponsored some of the compute for Hermes 3 and has been offering Hermes 3 405B for free through their chat interface, so anyone can actually go use it. We also have a fun bot in our Discord
Technical Implementation
that is powered by Hermes 3 405B, which people have had a lot of fun interacting with for free. We used an FP8 quantization method from a company called Neural Magic to quantize the 405B down to FP8, and you can inference that using vLLM on a single HGX node all the way up to 128k context. So, insofar as you consider a single HGX node accessible to people, we sort of finally do have GPT-4-ish at home.
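As a hedged sketch, serving the FP8 checkpoint on one 8-GPU node with vLLM might look roughly like this (the repo name, prompt template, and context length here are assumptions, not verified values):

```python
from vllm import LLM, SamplingParams

# Hypothetical example of serving an FP8-quantized Hermes 3 405B on a single
# 8-GPU HGX node with vLLM. Checkpoint name and chat template are assumptions.
llm = LLM(
    model="NousResearch/Hermes-3-Llama-3.1-405B-FP8",  # assumed repo id
    tensor_parallel_size=8,      # shard across the node's 8 GPUs
    max_model_len=131072,        # up to the 128k context mentioned above
)
prompt = (
    "<|im_start|>system\nYou are Hermes 3.<|im_end|>\n"
    "<|im_start|>user\nSay hello in one sentence.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=64))
print(outputs[0].outputs[0].text)
```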
Model Access and Usage
You can run it yourself. And we got there. So, yeah, that's Hermes 3. I guess I might have gone a little bit over time, but I'm happy to take any questions or anything else.
Q&A Introduction
Yeah, that was really interesting. Thank you for doing that overview. If people have questions, they can add them as a reply to the space here. But I had a couple of questions. One question: you mentioned you're mostly using synthetic data, and you're doing SFT on the synthetic data, but then you also do DPO on top of that.
Synthetic and Standard Data Usage
I'm assuming the DPO is just done on standard chat conversations and things like that. Is that correct, or is it also done on synthetic data? It actually had synthetic data as well, particularly for function calling alignment; it helped a lot on the smaller model. There was the binary preference stuff that you normally see: good chat, bad chat, good chat, bad chat.
DPO Insights
But we did find that some synthetic DPO instruction sets we created did help, the tool calling in particular, for some of the smaller models. Okay. And there's not much improvement on just instruction following or general chatting?
Performance on Larger Models
Right. Yeah. On the 70B and the 405B, we did not find that it made much of a material difference, in that the model had correctly learned everything we wanted it to learn through the SFT, more or less; it was behaving as we wanted it to. So for the 70B and the 405B, we actually trained the DPO versions but stuck with the SFT.
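To make the shape of that data concrete, a hypothetical preference record for tool calling might look like the following (field names and contents are illustrative, not the actual Hermes 3 dataset schema):

```python
# Hypothetical binary-preference (DPO) record for tool calling; illustrative only.
preference_pair = {
    "prompt": "What's the weather in Tokyo right now?",
    "chosen": '<tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</tool_call>',
    "rejected": "I can't check live data, but Tokyo is usually mild this time of year.",
}
```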
Dataset Size Inquiry
Okay. And what was the size of the dataset? I'm not sure if you mentioned that already, the size. We do have it in the paper. Let me scroll up and see. It's about 400 million tokens. Unique new tokens. Yeah. Trained over either four or three epochs.
Training Epochs
So we had some metrics with evals, and we basically would run it and then see: for the 70B we actually took the third epoch, whereas for the 8B and the 405B we took the fourth epoch. So.
Validation Metrics
I see. And so you're selecting that based on just metrics on the validation set? Yeah, yeah. Okay, cool. Aran, did you have any questions? Oh, yeah.
Usage in Local Settings
So, you know, Hermes models tend to be very popular among the local LLaMA crowd, but 70B and 405B, even after quantization, are kind of hard to use in local settings. So I guess it's expected that people are going to use it in a cloud setting, like possibly Together Computer or somebody like that hosting it and people using it there.
Local vs Cloud Use
Yeah, more or less. Although with 70B, certainly people who have MacBook Pros can do a lot. And with 70B, if you have two video cards, you can do like a Q5 quant on 70Bs and they're pretty good; you can run it there. I find, like,
Optimized Usage
using llama.cpp, Q5_K_M is kind of the sweet spot: anything more aggressive than that and you may as well use a smaller model rather than quantize at that level. But the Q5_K_M versions... and we do have pre-made GGUF versions of all the models as well.
Model Sizes and Usability
And for the 70B, if you do the Q5_K_M, that's around, what, 20-some gigabytes, perhaps 30GB, so it's perhaps still usable. I will tell a slight anecdote. We actually had a conversation with the Meta team, and I brought this up to them as an ask from the community: that when selecting model sizes, they consider what class of people can use each of those sizes and make that a factor in which sizes they choose to train, not so much just make it ten times bigger,
Community Considerations
ten times bigger, but think about who can use this. And I was happy, and interested, that in the paper they talk about the 405B and that it was specifically chosen to be able to run on a single HGX node under FP8; you know, that's about as big as you can go. Now, whether 70B fits in that, or whether that's a vestige of the fact that they did a 70B for Llama 1 and Llama 2...
Genuine Community Engagement
You know, we never did get a Llama 2 33B, interestingly enough, and I maybe would have liked to see a Llama 3 33B or something along that line. But yeah, I would say there is kind of a dearth right now from the 8B to the 70B. We do have things like Mistral NeMo 12B that sort of start to fill out that gap between the two.
Model Gap Observations
But I would agree there's a calling for a really good 20B dense, 15-trillion-token model. So anyone with a couple million dollars of GPUs sitting around, if you guys want to get on that, we'd all appreciate it.
Final Remarks
Okay. Yeah, thanks for your answer. That was very informative, by the way. There was a question about the synthetic data pipeline: did you use any special technique or trick to generate data that favors following the system prompt more closely?
Synthetic Data Techniques
It's not a specific technique. It's really 50 different things, one for each domain, and this is really where the secret is. And I cannot take any credit for this; this is basically no value tensor, who also goes by Gifted Gummy Bee, and Teknium. They did this; that's their whole thing.
Data Pipeline Insights
I didn't make any of the data myself. It really comes down to being meticulous, I mean neurotic, about it. I've seen Teknium's giant flowchart where he's got hundreds of different tasks and what would make the model good at each of these, and each of those different things has its own Python file that is its own synthetic data pipeline to elicit that behavior.
Approach to Data Quality
And it really is: there is no magic bullet to make it all good. Or maybe there is a magic bullet and you can scale it up, but if you want to do it as, literally, a one- or two-person team and be able to meet Llama 3.1 Instruct, it really took just looking at and thinking about every task, collecting it, and seeing what people want to do.
Crafting Synthetic Data
And I will offer this one piece of advice for synthetic data, and Teknium echoes this a lot: you actually have to look at the data. He tells me all the time he's amazed about some datasets: did they even look at what was in them? Just go through it with your eyes and really use that piece of you that's special, at which humans are great, which is being a few-shot classifier, and really get a feel for whether
Concluding Advice
this is really what I think I want in it. It may not be the most exciting work, but if you want the best results, it's certainly the path to them. That makes sense. By the way, another question. Are you guys...
Gemma 2 27B Inquiry
What about Gemma 2 27B?
Discussion on Model Interest and Future Plans
Are you guys interested in it? Yes. I wouldn't say it's happening at this moment, but we are looking at it, and most likely we will do it soon. Yeah. Thank you. I'm also curious, of course there's a lot of talk about Llama potentially being multimodal in the future, having multimodal variants. Are you guys interested in multimodal models as well? Is that something you plan to work on in the future? I mean, obviously, I would say it's likely the future, so if we want to be in the future, we will have to. I am aware that this is yet one more step of complexity; it's a whole other level of game, just like the big models are a whole other level of game, particularly because of the data requirements. Everything about the data sourcing becomes one more order of magnitude more difficult. We are interested in it. What I'm personally more interested in is... and now, disclaimer: wild speculation.
Thoughts on Future Multimodal Capabilities
None of this comes from any source, no one has told me this, this is only me saying what I think: my guess is that Llama 4 will be Chameleon, right? Chameleon is the blueprint for what Llama 4 will be, which is fused multimodal training, whereas with Llama 3 they train the model and then do all this extra stuff at the end to make the multimodal stuff work. But in the Chameleon paper, Meta seems convinced, and the results seem to say, that true fused, early-stage multimodal training is all you need. And it's interesting, because a lot of the complexities in multimodal actually fall away when you do fused early training; you just let the model figure it out, versus having to do all this extra patchwork at the end to make the multimodal stuff work. So I'm excited, particularly about whatever sort of emergence arises out of a properly, fully trained, Chameleon-style dense multimodal early-fusion model.
Challenges and Insights on Multimodal Models
Yeah, definitely. But I think the problem with some of these fused multimodal models, my understanding is that the data requirements are, again, quite high. Maybe someone can correct me if I'm wrong, but that was my understanding. But obviously it's very interesting that you can just have it as one single transformer that you throw data at and it learns everything. We'll see what happens with the next Llama model. Yeah, the current problem with fused models is that you're using a decoder transformer, in contrast to the encoder-style bidirectional-attention transformer used in disjoint-type models, so on some image recognition type problems their accuracy is not as good as disjoint models. But there have been a few papers proposing some countermeasures to that, so I'm sure they are going to try something like that.
Technical Approaches and Future Exploration
I guess so. Sorry, another question: did you look into expanding the context window in the post-training stage, like up to 1 million tokens? That is a good question. We have not tackled it yet. An interesting note I will also put in here about Llama: they did extend it, they used a method to do the extension, but they actually don't talk in the paper at all about the actual technical method they used. They say they did some iterative context length extension, but they leave out the fact that they actually have a new RoPE method. Essentially, we at Nous originally put out the YaRN paper, which was then later incorporated into DeepSeek and then Phi and now Llama. It's not YaRN in particular; it's actually more a long-context variant of YaRN, I would say, is what they did.
Exploring Compression Mechanisms and Performance Metrics
But they mention in the comments of the transformers code that they did a grid search to find the optimal scaling parameters for the RoPE embeddings. That's completely absent from the paper, so I'm not entirely sure why that is, but looking at it we can infer how it was done. So I think it absolutely can be done. If you read the paper, they start to mention some of the complexities of dealing with these sequence lengths, especially for the larger models, and to do that you really need a proper, I would say, 4D-parallelism training setup. Crucially, you need to be doing something like ring attention, so Megatron's context parallelism or ring attention, which gives you the ability to calculate the attention in parallel across multiple GPUs, because at those large sequence lengths that is the entirely dominating thing; the model parameters don't even matter at that point.
Future Tools and Methodologies for Model Training
It's totally dominated by the attention operation, which is not traditionally sharded, so you need to handle that. And to do that, you know, they have their own trainer. There are a couple of good open source efforts; I think things will get there. There's torchtitan, which is a new project by the PyTorch team that is going to have all the latest and greatest training tricks you need for this, and there is a branch of torchtitan now that has ring attention and context parallelism, and it's already using FSDP2, which is a new way to do FSDP for training. So I think we might need a little bit of iteration in the open source space on the training tooling, but I would expect there's no reason why we couldn't, for example, take what we have now and at least get to something like 256k context without that much compute.
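As a rough illustration of the RoPE-scaling idea discussed above, here is plain linear position interpolation of the rotary frequencies; this is the simplest scheme, not the grid-searched variant Meta used nor the YaRN method itself, and the base and dimensions are illustrative:

```python
import torch

# Simple linear position interpolation for RoPE: dividing the inverse frequencies
# by a factor stretches the usable context window. Not the exact Llama 3.1 or
# Hermes 3 extension recipe; values are illustrative.
def scaled_inv_freq(head_dim: int, base: float = 500_000.0, factor: float = 1.0) -> torch.Tensor:
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return (1.0 / (base ** exponents)) / factor

head_dim, orig_ctx, target_ctx = 128, 8_192, 131_072
inv_freq = scaled_inv_freq(head_dim, factor=target_ctx / orig_ctx)   # factor = 16
angles = torch.outer(torch.arange(target_ctx, dtype=torch.float32), inv_freq)
cos_table, sin_table = angles.cos(), angles.sin()   # rotation tables used by attention
```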
Exploring Quantization and Its Impact on Model Performance
Makes sense. I have one last question, by the way, a quick question: have you looked at how much quantization affects the performance of the model? Like you said, it's currently deployed at FP8 on these APIs, so have you looked at how much that degrades performance compared to the full model? I would say you lose, I don't know, like 0.3 to 0.4 on your benchmarks. You know, there is a point of emotional contention on this, where some people will say the vibes of the model are different when it's quantized to FP8, even if the raw scores aren't that much different. We did not do the particular quantization mechanism they did for Llama 3.1; they used a different, row-parallel approach where they quantize partial layers and leave others in BF16. We used, like I mentioned previously, something that's part of the vLLM project.
Comparative Analysis of Performance Across Models
It's called LLM Compressor, by a company called Neural Magic, and they have their own full FP8 mechanism. When we ran it, like I said, it lost maybe, if you got an 84.6, you get an 84.2 or 84.3, you know what I mean? So it wasn't meaningful. But I will say there are people who claim that going from BF16 to FP8 loses something; that's certainly in the nebulous realm of the subjective. Okay, my last question: so, you know, Llama 3 and 3.1's multilingual performance is much better than the models before. We are getting more and more multilingual support in state-of-the-art models, and actually Gemma 2 is even better than Llama 3.1 in this regard.
Multilingual Support and Future Enhancements
So I guess there are many people whose primary language is not English who want to enjoy playing with these models in a language other than English, and they would greatly appreciate better multilingual performance from the Hermes models. So, yeah, I guess that's a motivation to try Gemma 2. Well, do you have any multilingual support? Yeah. And, you know, I remember there being something, maybe DeepSeek put it out, someone has put out something, it may not be fully rigorous, about how multilingual abilities in models improve their lateral thinking. Does learning another language actually make you better than knowing just a single language?
Reflections on the Need for Multilingual Models
And I will say that most of the training data in Hermes 3 is in English, because it was synthetic, but synthetic with a human over the shoulder, watching and reading all the synthetic data, and that was done primarily, not fully, but primarily, by native English speakers. So if there's an area we probably have to get better at, it would be that. And you said there are some people who aren't native English speakers who may enjoy it; I would be remiss if I didn't remind myself that the vast majority of the humans on this planet are not native English speakers, and we need to think of them as well.
Closing Remarks and Future Directions
Yeah, absolutely. Thanks for answering all my questions. Do you guys have any other questions? Yeah, I don't see any comments on the space, any new comments, so I think that is all of it. Thank you, Emozilla, for joining us. This was very interesting, very detailed, and I learned a lot. Everyone should check out the paper; it's pinned to the space, the Hermes 3 announcement and the link to the paper. And, of course, try out Hermes 3 as well.
Final Discussions and Transition to Next Topics
So, yeah, thank you, Emozilla, for speaking about this. And I guess we'll move on to the next paper, which... actually, it's not really a paper; we just wanted to briefly talk a little bit about Phi 3.5, which was released yesterday. Let me just pin that. Oh, yeah, go ahead. So Phi 3.5 is actually a 3.8-billion-parameter model trained on 3.4 trillion tokens of synthetic data and filtered, publicly available websites, with a focus on very high quality, reasoning-dense data.
Overview of New Model and Its Capabilities
So yeah, they don't have a technical report; it's just a Hugging Face page, so it's just about results. It performs on par with Llama 3.1 8B on various popular benchmarks. The model belongs to the Phi-3 model family and supports a 128,000-token context length. They used SFT, PPO, and DPO to ensure precise instruction adherence and robust safety measures. Actually, I haven't seen many models using both PPO and DPO, so I'm curious what made them use both. And in particular, the multilingual capability is pretty good.
Comparative Performance Analysis Across Models
So it's actually better than Llama 3.1 8B and worse than Gemma 2 9B; Gemma 2 9B is by far the best multilingual model under 10 billion parameters, I guess. So yeah, it's very competitive with the state of the art. It performs on par with Llama 3.1 8B on long-context benchmarks like RULER and RepoQA. So yeah, that's my take. I'm glad there are more and more powerful multilingual models. Tanishq, do you have any comments? Yeah, I mean, they released, of course, the 3.8-billion-parameter model, which is their dense model, the Phi-3.5-mini-instruct.
Insights on Multimodal and MoE Models
But there's also an MoE model, and there's also a multimodal model. The MoE model is a 42-billion-parameter MoE with 6.6 billion active parameters, and it was trained on 4.9 trillion tokens, so more tokens than the 3.8B dense model. And again, it's multilingual; 10% of the dataset is multilingual. So that's also pretty interesting. I think this is the first MoE model coming out of the Phi group, if I'm not mistaken, so it's interesting to check that out as well.
Discussion on New Multimodal Model Features
And then they also have a multimodal model. They've released multimodal models before; this is just another multimodal model, now using the 3.8-billion-parameter dense model, the Phi 3.5 mini, as the base LLM. So again, you have the LLM, and then you've got an image encoder, a connector, and a projector, all connected to the LLM. And so, yeah, basically
Focus on Image Understanding and Reasoning
the total is again like a 4.2-billion-parameter model, and it's trained on 500 billion vision and text tokens. One thing they mention is that they're focusing on multi-frame image understanding and reasoning, so, for example, being able to process videos, multiple frames of a video. That was something that was difficult to do with previous models, so they focused on that with this new multimodal release as well. So yeah, that's the other thing I wanted to add.
Feedback and Performance Insights
But yeah, the benchmarks look pretty good. I don't know if anyone here has tried the model, if anyone has any sort of vibe-check thoughts on it, but I think this is a great release. And again, I think much of the dataset is based on their previous work on synthetic data combined with more general-purpose data, but they don't really go into too much detail about that in the model card or even in the previous papers.
Image Size and Classification Concerns
But yeah, again, the importance of synthetic data, I guess. I don't know if there's anything else to add; does anyone have anything they want to add? Oh, someone was asking: did they mention the image size, or just the image token input size? I am not sure, to be honest; let me quickly check right now. But does anyone else have any questions? I'm just curious if people have also tried it out and have any sort of...
Evaluating Vision Capabilities and Model Quality
Yeah, what their vibe of the model is. This model is very impressive vision-wise; it's on par with GPT-4o mini or, sorry, I mean Claude 3.5 Sonnet. I think this benchmark is a bit limited, so I'm sure on practical problems Sonnet is better, but this is really strong despite being so small, you know? Yeah, I don't know, maybe I'm just not finding it, but I'm not finding any mention of the image size, so I'm not entirely sure.
Completion of Discussion on Model
Yeah, I'm not sure. Anyway, I guess that's Phi 3.5. I don't know if anyone else wants to add anything about it. If not, let's see, is there any other question about Phi 3.5, or any comments? Nope. Okay, so what is the next paper we will cover? Next one is the trans... Yeah, sorry, go ahead. I'm sorry, actually I have to leave at 2:00 p.m., so could I possibly do the next paper?
Brief Note on JPEG LM Paper
Sure, sure. Yeah. So the next paper will be the JPEG-LM paper. It's a very short paper, so it's not going to take long. They basically proposed to directly model images and videos as compressed files saved on computers via canonical codecs like JPEG or AVC. They used the default Llama architecture without any vision-specific modifications and pre-trained JPEG-LM from scratch to generate images autoregressively,
Advantages and Innovations in JPEG LM
directly outputting compressed file bytes in JPEG and AVC formats. This was actually more effective than pixel-based modeling and sophisticated vector-quantization baselines; they achieved a 31% reduction in FID, which is obviously significant. JPEG-LM has a special advantage over vector-quantization models, in particular in generating long-tail visual elements like text in images, which is quite useful, I guess. So basically, JPEG-LM first converts an image into JPEG codes, and then they use BPE to produce BPE tokens and sequences.
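A minimal sketch of that front end, assuming PIL and an illustrative quality setting (the paper's exact JPEG settings and its BPE step are not reproduced here):

```python
import io
from PIL import Image

# Serialize an image through a canonical JPEG codec and treat the compressed
# bytes as the sequence to model. JPEG-LM additionally trains a small BPE
# vocabulary over these bytes; that step is left out of this sketch.
img = Image.open("example.png").convert("RGB").resize((256, 256))
buf = io.BytesIO()
img.save(buf, format="JPEG", quality=25)   # illustrative fixed quality setting
jpeg_bytes = buf.getvalue()

byte_tokens = list(jpeg_bytes)             # raw 0-255 byte values as tokens
print(len(byte_tokens))                    # on the order of a few thousand per image
```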
Performance Metrics and Comparisons in Image Generation
Their BPE vocabulary size is quite small, 320, and each 256x256 image was converted into a roughly 5k-token sequence, so the compression rate is about twelve times. On the other hand, for the VQ baseline they use a vocab size of 4096 and a sequence length of 1024. The unconditional FID they measured was 155 for VQ and 121 for JPEG, so yeah, it's much better. And in particular it outperforms Stable Diffusion on partial image conditioning. So yeah, that's about it.
Exploring Historical Context of JPEG Coded Images
So the thing is, image classification with JPEG-coded images was actually already known to outperform image classification on raw pixels; the earliest paper I know was from around 2020 or so, it's a fairly old method, maybe even before that. But I've never seen a paper that tried this on image generation, so this was interesting. It actually outperforms the VQ method, which is great, but unfortunately they did not do an ablation changing the compression rate of VQ versus JPEG.
Future Directions and Limitations in Methodology
So I think it's kind of hard to do that for JPEG. But.
Discussion on VQ Completion Rates
So the thing is, if you adjust the compression rate of VQ, maybe VQ would have performed better, but they didn't do it, so I'm not super convinced. But the result was nevertheless very interesting. Also, entropy coding like the one used in JPEG often negatively affects language modeling performance on text, so this unfortunately doesn't work for language modeling, but on images and videos it seems very effective. I want more follow-up papers on this. Obviously there have been many interesting papers coming out on tokenization of images recently; in particular, there was one compressing an image into only 32 tokens with vector quantization, and it performed even better than a lower compression rate. So yeah, I'm not super convinced by this result, but I just want to see more results. Do you guys have any questions?
Insights on Image Processing Papers
Yeah, I don't have any questions, but it's interesting; I've also seen some of those papers on image classification, processing images directly in the JPEG encoding, so this is definitely interesting work. And I guess maybe the goal here eventually is to put this in multimodal models as well, so it'll be interesting to see if they do anything like that. Okay, so I guess I can leave soon. Is there anyone else who can replace me, maybe? Yeah, if Yam is interested he can; happy to have him join as a speaker. Yeah, we even have Riley. That's cool.
Discussion on Deep Sequel and Collaboration
And did you want to also quickly cover the DeepSeek-Prover one, or... Oh, I think I'm good. Okay, yeah, sorry about that. In that case, I'm happy to of course continue on by myself, but if Yam wants to join he can, or if Riley wants to join, happy to have him join as well. Cool, I guess Yam said he will join in a couple of minutes. Okay, that works perfectly. Then in that case we will move on to the Transfusion paper, and I'm actually pretty excited to talk about this paper. So let me just pin the link to the paper to the space and then we will get started. This is a paper actually from Meta.
Overview of the Transfusion Paper
So the Transfusion paper is this new paper from Meta that is basically trying to see if we can process discrete tokens and continuous vectors in the same model, in order to do regular language model stuff, autoregressive modeling of discrete tokens with the transformer, but at the same time also do diffusion, image generation, and things like this. So you have a single model that's able to do both the language model generation and diffusion. And the idea is different from the previous approach, which is basically to do everything autoregressively, including the image generation.
Differences in Modelling Approaches
So, for example, Chameleon had image tokens coming from a VQ-VAE, and you generate the image tokens just like you would generate a text token; it's a discrete token that you're generating. That's how it was done in the past, where it was simple to have one model where everything is discrete tokens. But here they're trying to mix the discrete tokens with continuous data for the images and still be able to generate the images. So that's the goal of this paper, basically. Like I said, you have text that's encoded with a pretty standard tokenizer, they're using the Llama 2 tokenizer, and then for the images they're actually using latent diffusion.
Technical Details of Image Encoding
So there's actually a VAE that encodes the image as latent patches. The paper shows everything done with latent diffusion and image latents, but they did preliminary experiments with pixel-level modeling as well, and it does seem to work for that too, so you could technically do it directly in pixel space; here, of course, they do it with latents. And just like with standard latent diffusion, you have your VAE; they train the VAE from scratch, so it's a new VAE altogether trained on their dataset, and then they get the latent patches, and those latent patches are what's passed into the transformer.
Embedding for Various Modalities
And then, of course, just like with a regular transformer, you have your tokens, which are converted into vectors, and the vectors are what's processed by the transformer. With discrete tokens, as with regular language models, you have an embedding layer that gives you an embedding for each token. So here you still have that for the language part, and then for the images there's a separate piece: either they have a linear layer, or they use a U-Net down block, to give you an embedding for that particular patch. So you get some sort of vector embedding for each latent patch, and that's what's passed into the transformer.
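A simplified sketch of those two input paths, with illustrative dimensions (the paper's actual VAE latent shape and the U-Net variant are not reproduced here):

```python
import torch
import torch.nn as nn

# Discrete text tokens go through an embedding table; continuous VAE latent
# patches go through a linear projection (the paper also tries U-Net blocks).
d_model, vocab_size = 2048, 32000
patch_dim = 8 * 2 * 2                                   # assumed flattened latent patch size

text_embed = nn.Embedding(vocab_size, d_model)
patch_embed = nn.Linear(patch_dim, d_model)

text_ids = torch.randint(0, vocab_size, (1, 16))        # (batch, text_len)
latents = torch.randn(1, 64, patch_dim)                 # (batch, n_patches, patch_dim)

text_vecs = text_embed(text_ids)                        # (1, 16, d_model)
image_vecs = patch_embed(latents)                       # (1, 64, d_model)
sequence = torch.cat([text_vecs, image_vecs], dim=1)    # one interleaved transformer input
```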
Attention Mechanism and Effectiveness
And then, of course, on the other side of the transformer, you typically have an unembedding layer for your text tokens. And then for the image patches, you again have some sort of linear layer or U-Net up block, and that finally gives you the image latents, which are then passed into your VAE to get the final image. So the transformer is able to process those embedding vectors as it normally does. And, yeah, that's kind of the architecture.
Attention Mechanism in Detail
There are a few things that are important to note. For example, the attention mechanism and the attention mask of the model, which is very important. So, you know, typically for transformers that are working on language models, you know, you just have a causal attention mask, right? You have your current token, and the attention mechanism when you're processing the current token is only looking at the previous tokens. And that's, you know, you have an attention mask that expresses that and allows that to happen. But for the images, basically, you have the patches, you know, it's kind of going left to right, top to bottom kind of approach.
Bi-Directional Attention for Image Processing
And so it's done kind of sequentially in that sense, but it doesn't make sense to only look at the previous patches of the image when processing the current patch; you want to be able to look at the entire image when you're processing a single patch. So they have bidirectional attention for the images; it's not causal attention like you would have for regular language models. That's shown in Figure 4 of the paper, where you have the combination of the causal attention mask for the text tokens and the bidirectional mask for the image tokens.
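A small sketch of that combined mask (causal everywhere, fully bidirectional inside each image span), written as a plain boolean matrix; this is the kind of custom pattern that the FlexAttention block-mask API, mentioned next, is designed to express efficiently:

```python
import torch

# Causal base mask with bidirectional attention inside each image span.
# Image patches still see all earlier text causally; text after the image
# sees the whole image. Illustrative, not the paper's exact implementation.
def transfusion_mask(seq_len: int, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    for start, end in image_spans:          # end is exclusive
        mask[start:end, start:end] = True   # full attention within the image
    return mask

# Example: 4 text tokens, an 8-patch image, then 4 more text tokens.
print(transfusion_mask(16, image_spans=[(4, 12)]).int())
```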
Efficiency of Attention Library
And then one interesting thing, which I saw on Twitter today: I don't know if you guys have heard about the FlexAttention library developed at PyTorch, which is basically a library for implementing different variants of attention that can be compiled and is highly efficient. They said this project was one of the first projects outside the development of the library that actually used it for a research purpose. So I thought that was kind of interesting, and it goes along very nicely; this is a really great use case for the FlexAttention library.
Analysis of Pre-training with Code
The other thing that I forgot to mention: you have the text tokens and the image tokens, but you also have a beginning-of-image token to tell the model that now we're doing diffusion, and then an end-of-image token once you're done with the generation of the image. So, for example, during inference you do your regular next-token sampling, token by token, and if you sample this beginning-of-image token, then you switch over to doing diffusion for the following positions. Once you're done with your regular diffusion sampling, you add an end-of-image token, and then you can continue back to your regular LM mode of next-token generation and sampling.
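In pseudocode form, the inference loop described above might look roughly like this (the token ids and the `sample_next_token` / `run_diffusion` callables are hypothetical stand-ins, not the paper's actual API):

```python
BOI, EOI, EOS = 32001, 32002, 2   # hypothetical special-token ids

def generate(sample_next_token, run_diffusion, prompt_tokens, max_len=2048):
    """sample_next_token(seq) -> next token id; run_diffusion(seq) -> list of image patch entries."""
    seq = list(prompt_tokens)
    while len(seq) < max_len:
        tok = sample_next_token(seq)        # ordinary next-token sampling
        if tok == BOI:                      # the model decided to start an image
            seq.append(BOI)
            seq.extend(run_diffusion(seq))  # diffusion sampling fills in the latent patches
            seq.append(EOI)                 # then back to plain LM decoding
        else:
            seq.append(tok)
            if tok == EOS:
                break
    return seq
```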
Training Objectives Defined
So this is how you're able to do inference with both of these modalities. Similarly, for training, you have a language model objective, the standard objective, which is applied to the text tokens, and then you have a diffusion model objective, which is applied to the image patches. The language model objective is computed per token, whereas the diffusion loss is computed over the entire image, which spans multiple image patches. Okay.
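A sketch of that combined objective with illustrative shapes and weighting (the paper's exact balancing coefficient is not reproduced here):

```python
import torch
import torch.nn.functional as F

# Next-token cross-entropy on text positions plus a diffusion (noise-prediction)
# loss averaged over the image patches, combined with a weighting coefficient.
def transfusion_loss(text_logits, text_targets, noise_pred, noise, lambda_img=1.0):
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),  # (batch*text_len, vocab)
        text_targets.reshape(-1),                       # (batch*text_len,)
    )
    diffusion_loss = F.mse_loss(noise_pred, noise)      # over the whole image's patches
    return lm_loss + lambda_img * diffusion_loss
```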
Architecture Comparisons
Yeah, the architecture is pretty similar to the Llama architecture, the Llama 2 architecture. They train their own VAE. What else is there to note? The dataset they're using: most of the experiments use 0.5 trillion tokens with a one-to-one ratio between image tokens and text tokens. The text tokens come from the Llama 2 corpus, and the image data comes from a licensed Shutterstock dataset. And then, yeah, you can of course mix these tokens, the images and the text.
Tokenization and Model Training
So you can actually do both image generation and image captioning. For example, you could train with the caption first and then the image tokens coming after that, which would be for image generation, where you have the caption and then it generates the image. But you can also do the opposite, where you have the image first and then the caption afterwards; you can do that for captioning, where the image conditions the generation of the caption, the text tokens that correspond to the caption.
Data Set Composition
So actually they have a mix of that in the dataset, but 80% of it is with the caption first, for image generation, and the other 20% is with the image first, so that you can do image captioning. And then of course they train models at different scales: 0.16B, 0.37B, 0.76B, 1.4B, and 7B parameters. Okay, so now let's go on to the experiments of the paper. The first main question they want to address was basically: is this better than the Chameleon approach, which was to treat the images also as discrete tokens, because you can use a VQ-VAE to do that?
Comparative Analysis with Chameleon Approach
And then you can just do the standard autoregressive loss, and obviously, in some sense, that is a bit simpler. So the question is, is this really any better than the Chameleon approach? They basically trained both their Transfusion models and also Chameleon models at different model sizes and token counts, in order to put together scaling curves for the Transfusion and Chameleon models. And they noticed that Transfusion consistently exhibited better scaling laws than Chameleon: the scaling lines were close to parallel, but there was a gap in favor of Transfusion.
Performance Scaling and Efficiency
So generally, even though they scale similarly, Transfusion was always better than Chameleon at all the different scales they tested. The other thing they found interesting: of course they did the evaluation for image generation, where the Transfusion model got better FID than the Chameleon model, reaching the same FID as Chameleon with 34x less compute. So there's a lot of benefit with the scaling of Transfusion over Chameleon.
Exploring Text-Only Benchmarks
Okay, that's for image generation. But what they found interesting is that even the text-only benchmarks were better with Transfusion, even though it's still the same sort of modeling, still next-token prediction for both Transfusion and Chameleon. So they were wondering: why is it the case that Transfusion is better at text-only benchmarks when it's modeling the text in the same way as Chameleon? What they note is that Chameleon training is a bit unstable, and the Chameleon training has some stability modifications to the architecture.
Stability in Chameleon and Model Modifications
So the architecture is not exactly the same; even the transformer architecture is not exactly the same compared to Transfusion, because they needed these modifications to the Chameleon architecture in order to stably train the model. And then on top of that, you also have these image tokens. Their hypothesis is that there is maybe a competition between generating image tokens and text tokens, or also that maybe diffusion is more efficient at generating images and therefore requires fewer parameters, so that Transfusion can use more of the model capacity, more of the parameters, to focus on generating text compared to Chameleon.
Further Research Needed for Clarity
They say this of course needs to be analyzed more carefully in subsequent work, but it's a very interesting observation. Okay. And then they did a whole bunch of ablations. They found, for example, that bidirectional attention gets better performance than causal attention on the images, which makes sense; of course you would want bidirectional attention within an image.
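As a sketch of what that ablation changes, here is one way such a mixed attention mask could be built, causal over the sequence but fully connected within each image span; the layout and helper name are illustrative assumptions, not the paper's code.

```python
import torch

def mixed_attention_mask(seq_len, image_spans):
    """Causal mask over the whole sequence, but bidirectional (fully visible)
    inside each image span. image_spans: list of (start, end) index pairs."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    for start, end in image_spans:
        mask[start:end, start:end] = True  # patches within one image attend to each other
    return mask

# Example: 10 positions, with positions 3..6 being image patches.
print(mixed_attention_mask(10, [(3, 7)]).int())
```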
Improving Attention Mechanisms
And they did an experiment to show that it does improve. I mentioned the embedding layer they have for the images, where they could do a simple linear layer versus U-Net up and down blocks. What they noticed is that the U-Net up and down blocks perform better than the linear layer, and they attribute that to the inductive biases of the U-Net architecture. An alternative hypothesis they mention is that it could just be the extra parameters you get when using U-Net layers.
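For intuition, here is a hedged sketch of the two embedding options being compared: a plain linear projection of flattened patches versus a small convolutional down-stack in the spirit of a U-Net encoder. The module names, channel sizes, and input shapes are arbitrary choices for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Option A (baseline in the ablation): a simple linear projection of flattened patches.
class LinearPatchEmbed(nn.Module):
    def __init__(self, patch_dim, d_model):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)

    def forward(self, patches):          # patches: (batch, n_patches, patch_dim)
        return self.proj(patches)

# Option B (sketch): small convolutional down blocks, in the spirit of a U-Net encoder,
# before projecting to the transformer width. Channel sizes here are arbitrary.
class ConvDownPatchEmbed(nn.Module):
    def __init__(self, in_ch, d_model):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.SiLU(),
        )
        self.proj = nn.Linear(128, d_model)

    def forward(self, images):           # images: (batch, in_ch, H, W), e.g. VAE latents
        feats = self.down(images)                    # (batch, 128, H/4, W/4)
        feats = feats.flatten(2).transpose(1, 2)     # (batch, n_patches, 128)
        return self.proj(feats)

x = torch.randn(2, 4, 32, 32)
print(ConvDownPatchEmbed(4, 256)(x).shape)                        # torch.Size([2, 64, 256])
print(LinearPatchEmbed(16, 256)(torch.randn(2, 256, 16)).shape)   # torch.Size([2, 256, 256])
```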
Scaling Observations on Transformer Models
So in order to study that, they kept scaling up the transformer without changing the number of U-Net parameters. They say that as you increase the transformer itself, the benefit of the U-Net layers shrinks, but it does not disappear. So they still think there are real inductive bias effects, that using a U-Net is better than using a plain linear layer. Okay. And then finally, most of their comparisons are to Chameleon and ablations.
Final Comparisons with State-of-the-Art Models
If you compare to actual state-of-the-art image generation models, they observe similar performance to DeepFloyd, and they say it surpasses SDXL. Of course, they don't beat SD3, but they point out that SD3 uses synthetic captions, whereas this model only used natural captions. So they're saying maybe there's still a way to further improve this model, and I guess they're hoping it could eventually surpass SD3.
Benchmark Performance Evaluation
They did this on the GenEval benchmark, which is one of the more common recent benchmarks people have been using. And then the last interesting thing about this paper was using it as a base model and adding additional capabilities. For example, they try to add editing capability to the model: they fine-tuned their 7B model on a dataset of 8,000 publicly available image editing examples.
Fine-tuning for Additional Capabilities
Basically, you start with the input image, so you have the input image tokens or embeddings, then a prompt corresponding to the edit, and then the output image you want to generate. They show a couple of examples like "remove the cupcake from the image" or something. They didn't do any quantitative evaluation of this, but qualitatively, the results they show look pretty good.
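A hypothetical layout of one such editing example as a single sequence might look like the following; the marker tokens and helper are illustrative assumptions, not taken from the paper.

```python
# Hypothetical layout of one image-editing fine-tuning example as a single sequence.
# The marker tokens and field names are illustrative, not the paper's format.

def build_edit_example(input_patches, edit_prompt_tokens, output_patches):
    return (
        ["<BOI>"] + input_patches + ["<EOI>"]       # source image to be edited
        + edit_prompt_tokens                         # e.g. "remove the cupcake"
        + ["<BOI>"] + output_patches + ["<EOI>"]     # target image; only this part
    )                                                # is generated at inference time

example = build_edit_example(
    [f"<in_patch_{i}>" for i in range(3)],
    ["remove", "the", "cupcake"],
    [f"<out_patch_{i}>" for i in range(3)],
)
print(example)
```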
Positive Outcomes from Fine-tuning
So it's kind of impressive that this model can also do these other tasks if you just fine-tune it with a very small amount of data; they only had 8,000 fine-tuning examples. So that is the Transfusion paper. I think it's a very exciting project. A lot of the consensus so far had been that if you wanted truly multimodal models that can take in different modalities and output different modalities, the best approach was to do everything autoregressively.
Reevaluating Multimodal Approaches
And that was just going to be the easiest approach. But here they're showing that maybe that's not necessarily the best approach, and that incorporating diffusion as part of a truly multimodal model could actually be helpful. Of course, there are still some limitations: they still have limited scale and limited datasets, so there's work to do to scale it up.
Future Research Directions
There are extensions you could do. For example, all of this was done with diffusion, but of course you could extend it to flow matching, which is what all the latest image generation approaches use now. Also, one thing to note is that they have the image captioning results, and they show good numbers for that, but it's not a true interleaved multimodal interface or anything like that.
Exploring Multimodal Capabilities
So I think there are still more vision-language, multimodal capabilities, basically image understanding and things like that, that haven't really been studied with this model and need to be studied further. They hadn't looked at that very carefully, so it's unclear whether this model actually has good capabilities there, or whether it's possible to train those capabilities into it. But apart from that, I think it's certainly a very promising direction.
Final Thoughts and Observations
Yeah, yeah. Did you have any thoughts on this paper, Yam? I think the whole approach, it's crazy that you can just build whatever you want into the context and the diffusion losses and things just work. It's just the magic of deep learning that everything works. And like you said before, somehow, I'm not sure exactly by which loss, something forced the embeddings of the image and the text to be fused together to the point that it improved the text. Which loss even did that?
Concluding Remarks on the Model's Performance
So crazy. I mean, it's an amazing paper, an amazing idea. It's just crazy that it works in the end, and they prove that it works better. In the end, I think multimodal-in, multimodal-out will be the future of all these types of models. But it's really crazy that it was that kind of simple. You just need to have this idea, and deep learning just works in the end.
Meta Research Development and Future Directions
But yeah, it's an amazing paper. I didn't see it coming; there are so many papers this week. Yeah, and again, it's coming from the Meta group. I'm really impressed by the research that's been coming out of Meta recently, so I'm pretty excited by that, and we'll see if they extend it further; I really hope they do. It would also be interesting to explore fine-tuning Llama models to be able to do this. That could be another interesting research direction.
Exploring Future Possibilities
Yeah, I mean, there are just so many ideas that could be explored. There is something surprising about using the same attention weights sometimes with causal masking and sometimes without. Actually, now that I say it, a prefix LM is exactly the same, and it works really well. So yeah, you can use the same attention weights, with causal masking over part of the sequence and fully connected attention over the rest. I just thought it was weird for a second, but you're fusing all the modalities, and it makes sense; there are other models that do this successfully in the end.
Final Thoughts
So yeah, really cool. Does anyone else have any thoughts? I don't see any comments or questions about it. If not, I'll move on to the next paper. But yeah, I'm really excited by this paper, and hopefully we will see lots of future work extending it.
Introduction to the Next Paper
Okay, so the next paper I want to cover, and I guess our final paper for the day, is "To Code or Not To Code". Let me pin it to the space. This paper is coming from Cohere, and they are exploring the very important question of whether incorporating code data in your pre-training dataset improves the general performance of the model.
The Code Data Question
This is conspiracy theory levels. This question has been going around for three years; it's like a conspiracy theory. Everyone thought it might be the secret, and finally we get a confirmation. Conspiracy confirmed: yes, it does improve the model somehow. Only if you use the right percentage, though, not too much.
Reflections on Code Integration in Models
Yes, I remember everyone was talking about it, especially with ChatGPT, GPT-3.5, and even GPT-4: "oh, this must be one of the secret things OpenAI is doing." I remember that was a huge discussion at the time. And there have been several open-source models that incorporated code in the dataset; if you read the paper, they mention that even the Llama models included it. And there were some other papers that analyzed this question, but only for limited topics.
Understanding Code's Impact on Model Training
For example, there's a paper that analyzes it for mathematical reasoning or certain specific capabilities, but they wanted to study it more broadly, to see whether incorporating code in the pre-training helps in general. They analyze a few things here: pre-training from scratch with or without code in the dataset, and initializing from a model that was trained with code and then continuing the pre-training from there.
Cool Down Process in Model Training
Another aspect they look at is the cooldown of the model. Often, when you're wrapping up your pre-training, you decrease the learning rate and switch to a higher-quality dataset for the final stages of training; that's sometimes referred to as cooldown, and a lot of state-of-the-art models these days do this.
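A minimal sketch of what a cooldown schedule can look like, assuming a linear learning-rate decay and a switch to a higher-quality mixture for the final steps; the step counts, fractions, and mixture names are made up for illustration.

```python
# Sketch of a "cooldown" phase: for the last fraction of training, decay the learning
# rate toward zero and switch to a higher-quality data mixture (here just a label).
# Step counts, fractions, and mixture names are placeholders, not from the paper.

def training_plan(total_steps, peak_lr, cooldown_frac=0.1):
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    for step in range(total_steps):
        if step < cooldown_start:
            lr, mixture = peak_lr, "main_pretraining_mix"
        else:
            # Linear decay to zero over the cooldown window, on upweighted high-quality data.
            progress = (step - cooldown_start) / (total_steps - cooldown_start)
            lr, mixture = peak_lr * (1 - progress), "high_quality_mix_with_code"
        yield step, lr, mixture

for step, lr, mixture in training_plan(total_steps=20, peak_lr=3e-4, cooldown_frac=0.2):
    print(step, round(lr, 6), mixture)
```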
Incorporating Code in Training Phases
So they also ask whether including code in that phase of the pre-training helps. They're trying to analyze all of these different questions. The dataset they used for text is the SlimPajama dataset, which of course also includes a lot of code and all kinds of other stuff.
Removing Confounding Factors in Data
So they filter all of that out, because they want to make sure it's not a confounding factor. SlimPajama normally has a total of 627 billion tokens, but after removing all the code-related content they end up with 503 billion tokens. And then on top of that, they have their code dataset; they're mainly using the Stack dataset.
Datasets Used in the Exploration
That's one of the common open-source code datasets now; it was the dataset used to train the StarCoder models, and it's basically data coming from GitHub and similar sources. They have about 139 billion tokens for that. But they also try other things, like incorporating markup formats such as Markdown, CSS, and HTML.
Incorporating Various Code Types
They also try to incorporate synthetic code. They don't say where the synthetic code dataset comes from, only that it's a proprietary, synthetically generated code dataset, so I don't know the details. And they also include things like GitHub commits and Jupyter notebooks, other "code-adjacent data" as they call it.
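To illustrate how such a mixture might be sampled, here is a small sketch with placeholder source names and weights; the actual proportions used in the paper are not reflected here.

```python
import random

# Hypothetical pre-training mixture sampler. The sources mirror those mentioned above
# (filtered text, code, code-adjacent data, synthetic code), but the weights are
# placeholders for illustration, not the proportions used in the paper.
MIXTURE = {
    "slimpajama_text_no_code": 0.75,
    "the_stack_code":          0.15,
    "code_adjacent":           0.05,   # commits, notebooks, markup, etc.
    "synthetic_code":          0.05,
}

def sample_source(rng=random):
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

counts = {}
for _ in range(10_000):
    src = sample_source()
    counts[src] = counts.get(src, 0) + 1
print(counts)  # empirical counts roughly track the mixture weights
```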
Evaluation of Model Performance Metrics
But yeah, those are the datasets they analyze with. Then they evaluate the model across a few different axes: world knowledge, natural language reasoning, and of course they still check whether code performance improves, which, you know...
Introduction to Code and Data Quality
Of course it should, so it's kind of a sanity check in some way. And it's really strange: why does it help? Because of the repetition? Because you need to close the brackets, so the attention has to track where they open and where they close? It's just that there are so many things related to data quality, or to specific types of data, that somehow, by magic, improve models, and we just don't know why.
Unintuitive Results and Further Research
We just found out that it works. But yeah, it's pretty crazy that code improves other things that are, in the end, completely unrelated. Yeah, and I'll get to that a little more, because even in this paper some of the results are pretty counterintuitive, and I feel like maybe we need more studies just to replicate or confirm them. Anyway.
Evaluation and Model Limitations
They look at world knowledge, natural language reasoning, and regular text generation, for which they use an LLM-judge kind of evaluation. The other thing, which I'll say is maybe a limitation of this paper, is that they only look at models from 470 million to 2.8 billion parameters. These are pretty small models, so it's unclear whether the trends they observed will hold at larger scales, and I think we'll have to see a more comprehensive study with larger models to actually analyze this.
Code Initialization and Experimentation
But yeah, let's go through the results. The first experiment they look at is basically initializing an LLM from a code pre-trained model. For example, I remember there was some discussion with ChatGPT and GPT-3.5 that those models were initialized from Codex, which was OpenAI's code model. I don't know if people even remember Codex these days, but I remember that was a big hypothesis everyone was talking about.
Approach to Model Initialization Discussions
Yeah, exactly. People were saying that initializing with a code pre-trained model and then training it on general text would help. I think it was a mistake in the documentation that just threw everyone somewhere completely unrelated. In the end, code does improve models, and we saw it in small experiments, but I think everyone got this idea from some sort of mistake in the documentation, because I don't think it was actually trained from Codex. It makes no sense.
Significance of Code in Model Performance
I really think it was just a typo; it was a long time ago, but I think it was just a typo. But anyway, people were debating this question for so long, so it's really great to see such an extensive study of exactly which types of code improve things and by how much. And we talked about Phi: Phi has their own tricks related to data, which are probably somewhat similar, directly selecting the kinds of data that improve models in ways we just don't know about.
Exploring Model Improvement Phenomena
So it's really interesting to find out about these kinds of phenomena that somehow improve models, because it's extremely unintuitive in the end. That's pretty cool. So, just quickly going through the results: with initializing from code pre-trained models, they say that improves reasoning, but it also improves knowledge, and this is the part I find surprising.
Results on Knowledge Improvement
The paper says initializing with 100% code gives the same performance as the text-only baseline, and initializing with 50% code gives a 4.2% relative gain, and this is for knowledge tasks. This is what I find very interesting, because why would you expect an improvement in knowledge from training on code? That I find very counterintuitive. There was a paper about taking the Pile, removing each of its sub-datasets one at a time, training a model, and seeing how it was influenced.
Unexpected Outcomes in Model Training
And in the end, all the experiments turned out extremely unintuitive: you remove Wikipedia and the model gets better factuality, things that make no sense at all. The one conclusion that was obvious throughout that paper is that diversity is one of the things that dramatically improves everything. So that might be the case here too: if you have a diverse dataset with both code and text, it improves everything just because of the diversity, because of overfitting and repeated patterns and so on.
Importance of Diversity in Data Sets
If you have more diversity, maybe that's what we're seeing here. Yeah. So that was the thing I found very counterintuitive: that initializing from a model trained on code would help, or even give similar performance to just a regular text model. But it could be, and I guess that should be looked into further.
Scaling Experiments and Their Findings
But yeah, it's possible it's just a matter of diversity. And then of course they do scaling experiments, from 470 million up to 2.8 billion parameters, and the trends are basically the same at all these different scales. So I'm not going to go too much into that, except to say the trends are the same.
Effects of Including Code in Pre-Training
So that was initializing from a code model. Then they ask: if you're training from scratch and include code as part of the pre-training dataset, what are the results? Again, they observe that having some code does improve reasoning and world knowledge, but, and this makes more sense, if you increase it too much, you get a drop in performance.
Drop in Performance with Excessive Code
They say the average performance starts to decay at 75% code in the dataset. If 75% of your dataset is code, then the model isn't going to be able to learn much else, since there isn't much knowledge left in the dataset, which makes sense. But they also find that not including code at all hurts natural language reasoning performance.
Role of Code in Enhancing Reasoning
So they're saying it is important to have some code: at least for reasoning, you need code in there to get the best performance. Just a question: do they have an experiment where it's not code but some other thing that is not text, just to rule out diversity? Because if you take something completely different, I don't know, a new language for example, I'm not sure, but...
Validation of Code's Impact
If it still works, then it's not the code, it's the diversity; and if it doesn't, well, then it's the code. I'm just saying maybe they had something like that, because they had a lot of ablation studies. Yeah, I'm not sure if they did. Let's see. I actually don't think they did.
Questions Surrounding the Studies
I don't remember seeing anything like that specifically. That is a good question, and it could be a worthwhile experiment to do. I don't remember coming across that sort of experiment, and looking at the paper right now, I don't see anything like it either. For example, I would expect a lot of these conclusions to also hold with just multilingual data.
Suspicion Around Multilingual Benefits
I don't know whether multilingual training would show a similar effect, but I guess multilingual data wouldn't help for things like reasoning. I think the big thing with code is the idea that it helps with reasoning specifically, and there are some interesting properties of code that enable that.
Code's Role in Structured Thinking
I guess it's because a lot of code is sort of step-by-step. I think someone was mentioning this; David was saying, "my theory is that code is similar to chain of thought. Moreover, code is probably correct." So it's like validated chain-of-thought trajectories, basically distilled logic and reasoning, and I think that makes sense to me as well.
Complexity of Analyzing Effects
So it would be hard to parse that out, and I don't know what other sort of non-code data would have a similar effect. But I think that's a good question. Anyway, as I mentioned, including code is important for the reasoning results, and utilizing high-quality synthetic code data also gives a benefit, whether in continual pre-training or when incorporated into the pre-training dataset.
Continued Relevance of Synthetic Code Data
Again, I assume this is what, for example, Phi is doing, incorporating synthetic code data, or things like Evol-Instruct; I guess the Hermes 3 people are probably doing something like this too. So that all helps with reasoning, and then incorporating code in the cooldown also helps improve reasoning.
Cooldown Phase and Its Importance
But yeah, what is it here? Sorry. On the cooldown: including code in the cooldown phase, where high-quality data sources are upweighted, provides a significant improvement in natural language reasoning, world knowledge, and of course code performance, compared to a model without cooldown. So overall, the conclusion is basically that incorporating code in all these different places helps.
Caution Against Excessive Code Inclusion
Of course, incorporating too much code can harm the model as well, so there's a necessary balance. But Yam, you of course spend a lot of time training and evaluating these sorts of models: what do you think of the benchmarks they chose to analyze here? Are these good benchmarks?
Concerns About Test Benchmarks
Because my understanding is that these benchmarks are fairly limited. For example, for world knowledge they're looking at TriviaQA and Natural Questions Open; those are the world knowledge benchmarks. For natural language reasoning, it's things like BoolQ, SuperGLUE, HellaSwag, Winogrande, and so on. These are all pretty outdated and kind of...
Discussion on Benchmark Suitability
Yeah, not the best. Why not MMLU? Yes, exactly. I mean, I think the effect is there, but it would have been easier to know just how significant it is. For factuality, for example, something like MMLU would have helped. Let's see what they measured for code.
Analysis of Benchmark Averaging
Okay, I see. Look, they took a suite of benchmarks and divided them into groups, for reasoning, world knowledge, and code, and then they measure an average, I think.
Understanding Model Evaluation
An average over the groups, yeah, that was my understanding too. You get a wide view of how good the model is across many different benchmarks. And I can understand that they want to know whether it improves reasoning in general and not just one specific dataset, since you're changing the model's training data.
Potential for Overfitting
You can easily overfit if you play with it too much. So it would have been nice to have more benchmarks, but I think the effect is pretty significant; the results speak for themselves. Yeah, I guess my biggest concerns were basically these two things: whether these are useful benchmarks in the first place, and whether the results will transfer to larger scales, because they were done at a pretty small scale.
Questions on Model Scale
Especially the models: 2.8 billion is very small. So those are my two questions and caveats, and hopefully future work will look into this further. But apart from that, I think...
Reflections on the Research
At least it's a good study in the right direction, and I think we're starting to see more confirmation of this. It hadn't been studied very carefully before; it was just kind of anecdotal, like, "oh, everyone is saying code is better," and people were just starting to try using it for things.
Need for Comprehensive Analysis
I remember asking people if there was any one study to point to that shows code is good for these sorts of things, and people didn't really have one; they'd say, oh, this paper has a few results, and that paper has a few results. People couldn't really tell me whether there was a comprehensive analysis of how much code is good and things like that.
Starting Points for Future Research
And this paper is starting to get there. It's still not perfect by any means, but it is certainly starting to get there, and I'm hoping there will continue to be more comprehensive research like this studying the question. Yeah, I think the whole dataset mixture, what to include or not include, is extremely important, and we're kind of guessing it right now. Apart from educational value, which is something we know about, I don't think we know a lot at the moment.
Open Questions Around Data Set Mixture
So any study that checks the influence of different types of datasets in training is very important. Absolutely. Someone was just saying they'd be curious to see Mistral's take on code training data, since their data may have bilingual comments. Yeah, I guess Mistral's dataset is probably bilingual, but they would probably still see a benefit from using code data.
Investigating Multilingual and Code Data
Again, I was thinking maybe multilinguality would show some benefit, but I don't think it would include much of the reasoning benefit; the whole point with code is how it provides that kind of reasoning. I think the question of code versus multilinguality will maybe help.
Distinguishing Between Diversity and Reasoning
Yeah, it would help tease apart whether it's just a factor of diversity or whether there's something else special about code. What I found when training multilingual models is that certain languages improve other, different languages.
Contrasts in Language Performance
And certain languages do not. Why, I don't know; this needs to be studied. But just from my initial experiments, what I saw is that languages that are extremely different from English, things that are completely non-English like Arabic and so on, improve the model a lot across many languages, and I think it has something to do with data diversity, because you're adding an extremely diverse distribution into the data.
Diversity as a Factor in Language Models
I don't know. It was interesting to see that some languages improve other languages and some do not. It's crazy when you think about it, but in the end it might just be the diversity of the data, and then it makes sense, it's not that crazy. But yeah, it needs to be studied.
Research Directions on Language and Code
It's an open question and a very interesting one. Yeah, so it would be interesting to study whether it's just diversity or whether code has other special properties. Comparing multilinguality with code could be interesting, looking at what multilingual models are able to achieve and parsing all of that out.
Challenges in Understanding Models
There are also so many different confounding factors, so it's a very hard question to study, but it's a very interesting topic. And at the end of the day, half the time these models just do things where we cannot understand what's going on.
Complexities of Model Predictability
There's just some of that as well: all kinds of very weird and counterintuitive effects happen, and just like you were mentioning with the Transfusion paper, these models learn things and we don't know how or why.
Experimentation in Model Learning
You just have to try things out. With the Transfusion paper, I can at least understand the intuition for why it would work. But here, I don't know, dataset mixture is a complete black box.
The Intricacies of Data Set Mixtures
Everything is counterintuitive. I'm surprised we got to the point where we have such a clear, simple rule, educational value, that seems to always work, because I think that's the only thing we're going to find that's that easy. Everything else is going to be extremely hard to find.
The Complexity of Educational Value
Yeah, I don't know, the whole thing is just hard. But even then, I could talk about this for hours. It's crazy that it works on its own; this kind of educational-value filter may be catching so many different things that are not educational value by accident, and we just think that's what we're selecting for.
Reflections on Educational Value in Modeling
But I don't know, these things are crazy as they are. Yeah. Cool. Let's see, do you have anything else to add about this paper? I'm just happy that we finally got an answer to this question.
Reaching Conclusions on the Paper
Yeah. Cool. In that case, I don't have anything else to add either, and I don't see any other questions here, so I think with that, we will end the space.
Good Discussion Recap
Yeah, this was a very good space. It's great that we had the Hermes people on to talk about their work; that was a really good discussion. And we talked about the Transfusion paper and this code paper, so lots of interesting discussions today.
Closing Remarks and Future Engagement
So thank you, everyone, for joining, as usual. We'll put up the recording shortly; it's available on the space, of course, but it will also be up as a podcast soon.
Invitation for Continued Participation
Thank you, everyone, for joining, and please join us next week as well. I hope you had a good time. Talk to you later.