The "it" in AI models is the dataset (nonint.com)
mnk47 24 days ago [-]
Yi Tay's response (chief scientist at Reka AI, ex-Google Brain researcher): https://twitter.com/YiTayML/status/1783273130087289021

>not true, especially for language. if you trained a large & deep MLP language model with no self-attention, no matter how much data you'll feed it you'll still be lacking behind a transformer (with much less data). will it get to the same point? i don't think so. your tokens cannot even see each other in a raw MLP.

>on the other hand, tiny tweaks to transformers may not matter as much as data/compute. sure. but it's also not very accurate to say "architecture research" does not matter and "makes no difference". i hear this a lot about how people use this to justify not innovating at the architecture level.

>the truth is the community stands on the shoulder of giants of all the arch research that have been done to push the transformer to this state today.

>architecture research matters. many people just take it for granted these days.

neonbjb 23 days ago [-]
I'm James Betker.

Of course architecture matters in this regard lol. Comparing a CNN to a transformer is like comparing two children brought up in the same household but one has a severe disability.

What I meant in this blog post was that given two NNs which have the same basic components that are sufficiently large and trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.

:edit: Perhaps it'd be best to give a specific example: Let's say you train two pairs of networks: (1) A Mamba SSM and a Transformer on the Pile. (2) Two transformers, one trained on the Pile, the other trained on Reddit comments. All are trained to the same MMLU performance.

I'd put big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas the two models in (2) will be quite different.
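
A rough sketch of the kind of comparison being described, using the Hugging Face transformers API (the checkpoint names are hypothetical placeholders, not real models):

    # Rough sketch: sample "typical" responses from models trained to the same
    # benchmark score, then compare them. Checkpoint names below are made up.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    PROMPTS = [
        "Explain why the sky is blue.",
        "Write a short story about a lighthouse keeper.",
        "What are the tradeoffs of microservices?",
    ]

    def sample_responses(model_name, n_samples=5):
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        responses = []
        for prompt in PROMPTS:
            ids = tok(prompt, return_tensors="pt").input_ids
            for _ in range(n_samples):
                out = model.generate(ids, do_sample=True, top_p=0.9,
                                     max_new_tokens=128)
                responses.append(tok.decode(out[0], skip_special_tokens=True))
        return responses

    # Pair (1): same data, different architectures.
    mamba_pile = sample_responses("example-org/mamba-pile")
    transformer_pile = sample_responses("example-org/transformer-pile")
    # Pair (2): same architecture, different data.
    transformer_reddit = sample_responses("example-org/transformer-reddit")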

jfyi 23 days ago [-]
There aren't many people who will proudly announce their employer, their name, and where someone can stick it over the course of two public comments these days.

You, sir, are my hero.

wrs 23 days ago [-]
Please humor me for a moment, because I'm having trouble seeing why this is not just true by definition. Doesn't "training to the same performance" mean that you get the same responses? Or from a different angle: given that the goal of the model is to generate plausible completions based on a training dataset, it seems like plausibility (and therefore performance) is obviously defined by the dataset.
HarHarVeryFunny 23 days ago [-]
If Mamba really was as capable as a Transformer on tasks requiring accurate attending to long context, then there'd be no need for Jamba (Mamba+Transformer hybrid).

Your argument of "if we train a Mamba SSM to be as good as a Transformer, then it'll be as good as a Transformer", seems a tad circular...

lossolo 23 days ago [-]
Yeah, I'm not sure how someone could interpret what you said in the way people are citing here. It's actually obvious that you are right in the context of data in LLMs. Look at Llama 3, for example: there are minimal architectural changes, and its performance is almost at the level of GPT-4. The biggest change was in the dataset.
ahartmetz 24 days ago [-]
Well, both can be true if you interpret the "it" as "the secret sauce / competitive advantage". A good architecture is a necessary but not sufficient condition for success, but everybody uses more or less the same one currently, so data makes the difference. Until the next improvement in architecture.
nkozyra 24 days ago [-]
Or until we run out of data that actually differentiates the models
segmondy 24 days ago [-]
I do argue that the "it" is the architecture. We have pretty much had all the data that these LLMs were trained on for a long time. The game changer was the architecture, not the data. Unless of course you are in the "code is data" camp ;).
empath-nirvana 23 days ago [-]
Probably the "it" is whatever one model has that other models don't have. When everyone is using the same architecture, then the data makes the difference. If everyone has the same data, then the architecture makes the difference.

It sounds pretty obvious to say that the difference is whatever is different, but isn't that literally what both sides of this argument are saying?

edit: I do think that what the original linked essay is saying is slightly subtler than that, which is that _given_ that everyone is using the same transformer architecture, the exact hyperparameters and fine-tuning matter a lot less than the dataset does.

4death4 24 days ago [-]
MLP is a universal approximator, so there’s definitely a configuration that can match an attention mechanism. Whether or not it’d be feasible to train is another question.
HarHarVeryFunny 23 days ago [-]
Not sure about feasible, but certainly not efficient.

I think this MLP universal approximator notion is similar to a Turing machine being a universal computation device. Correct, but practically useless.

I don't think Sutton's bitter lesson is going to result in everything being an MLP. You want the most scalable architecture, which an MLP certainly is not.

HarHarVeryFunny 24 days ago [-]
Yes, and note that in terms of different architectures, the author (James Betker) is talking about image generators, while when he's talking about LLMs they are all the same basic architecture - transformers.

Some tasks are going to be easier to learn than others, and certainly in general you can have more than one architecture capable of learning a given task, as long as it is sufficiently powerful (combination of architecture + size), and well trained.

That said, it's notable that all the Pareto optimal LLMs are transformer-based, and that in the 7 years since the attention paper (2017), all we have seen in terms of architectural change has been scaling up or minor tweaks like MoE and different types of attention.

How do you make a different architecture such as Mamba more competitive with transformers? Add some transformer layers to it (Jamba) !

So, yeah, as far as LLMs go, the precise model doesn't matter as long as it's a transformer, which isn't very surprising given what we know about how they work - primarily via induction heads. The lesson here isn't that architecture doesn't matter for LLMs, but rather that the architecture has to be a transformer! Data then becomes paramount, because the model learns the program (induction heads, etc) that runs on the machine (transformer) from the data.

No doubt there will be architectural advances beyond transformers, although few people seem to be currently looking for them, but I'm pretty sure they will still need something equivalent to the transformer's attention mechanism.

geysersam 24 days ago [-]
Seems like an objection that is slightly beside the point? The claim is not that literally any model gives the same result as a large transformer model, that's obviously false. I think the more generous interpretation of the claim is that the model architecture is relatively unimportant as long as the model is fundamentally capable of representing the functions you need it to represent in order to fit the data.
HarHarVeryFunny 23 days ago [-]
OP's claim/observation is that "trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point [of inference performance]".

His conclusion is that "It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else".

There is an implicit assumption here that seems obviously false - that this "convergence point" of predictive performance represents the best that can be done with the data, which is to imply that these current models are perfectly modelling the generative process - the human brain.

This seems highly unlikely. If they are perfectly modelling the human brain, then why do they fail so badly at so many tasks? Just lack of training data?

geysersam 21 days ago [-]
Interesting point. But, does the data contain enough information to perfectly model the generative process? Maybe even a very complex and capable model like "the human brain" would fail to model the dataset better than large transformers, if that was the only thing they ever saw.

You and I can model the dataset better, but we're already "pre-trained" on reality for decades.

Just because the dataset is large doesn't mean it contains useful information.

HarHarVeryFunny 21 days ago [-]
Perhaps, but even with an arbitrarily good training set, the LLM would still be constrained by its own architectural limits. e.g. If a problem can't be broken down into sub-problems that each require <= N sequential steps, then an N-layer transformer will never be able to solve it.

Even if the architectural shortcomings were all fixed, it seems "[pre-training] data is all you need" would still be false, because there is no getting around the need for personal experience, for the same reasons that is true for us...

Perhaps most fundamentally, any action/prediction you make can only be based on the content of your own mind, not the mind of a tutor you are trying to copy. Even if the tutor diligently tries to communicate all nuances and contingencies of a skill to you, those are still all relative to his/her own internal world model, not the one in your head. You will need to practice and correct to adapt the instructions to yourself.

omnicognate 24 days ago [-]
Machine learning insights from e e cummings.
tppiotrowski 23 days ago [-]
I took the Andrew Ng Coursera machine learning course in 2015 and to this day I still remember him saying this in one of the videos. At the time he was talking about various versions/optimizations of gradient descent but he essentially said that tweaking the algorithm will only make your model ~1% better while doubling the amount of training data will have a substantially larger impact (use any old algorithm but just throw more data at the problem). That's why it was already evident back then that Google, Facebook, etc were sitting on a goldmine because in the long run those with the most data, not the brightest PhDs will win this race.
jncfhnb 23 days ago [-]
There’s some enormous caveats to this.

The model architecture is 100% the thing that makes LLMs special. You would not get this doing token prediction with word2vec.

The model sizes are also hugely important. Adding billions of parameters does introduce the capability to fit to new features.

The models eventually reach saturation of how much they can fit to. There’s reason to believe that current LLMs are underfit to what their sizes could theoretically utilize, but it could also be that the optimization algorithms are simply not capable of easily and efficiently utilizing another 2x data to fill out the space. Doubling the model size, on the same training data, and letting it be even more underfit could result in a better model.

HarHarVeryFunny 23 days ago [-]
> That's why it was already evident back then that Google, Facebook, etc were sitting on a goldmine because in the long run those with the most data, not the brightest PhDs will win this race.

So far it doesn't seem to be panning out that way though. Companies such as OpenAI, Anthropic and Reka don't have any special internal sources of data, yet all have trained SOTA models.

Probably the main reason for this is that data type/quality matters more than quantity, which is why most of these companies are now using self-generated synthetic data.

The companies/institutes that will have a data advantage are those that have private datasets consisting of a different type (or maybe higher quality?) of data than publicly available, but this seems more likely to be in specialized domains (medical, etc), rather than what is useful for general intelligence.

I assume that, longer term, we'll have better AI architectures capable of realtime learning, and then the focus may switch to on-the-job training and learning ability, rather than data.

Eisenstein 24 days ago [-]
As a hobbyist having trained models for different use cases ranging from object detection and recognition to text completion to image generation, the best advice has consistently been to curate and annotate your dataset as perfectly as you can before worrying about anything else.

A small, well-curated, well-annotated dataset will always be orders of magnitude better than a gigantic one with even a tiny percentage of mislabeled features or bad/wrong data. Hyperparameters and such can be fiddled with once you know you are on the right track, and in the scheme of things are relatively minor for most purposes.

Of course, this advice gets routinely ignored as people spend countless hours fussing over how to set certain flags and grabbing as much data as possible, then carelessly throwing it all together and training it. Then, wondering why the model does things they don't want, they go back to messing with the parameters again.

It is a giant pain in the ass but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.
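
To make that concrete, here is a hedged sketch of such a curation pass (the JSONL file, the "text"/"label" fields and the label set are assumptions for illustration): deduplicate, drop empty rows, and flag suspect labels for human review rather than silently "fixing" them.

    # Illustrative curation pass: dedupe, drop empty/garbled rows, flag bad labels.
    # The file layout and field names are assumed for the example.
    import hashlib
    import json

    def curate(records, allowed_labels):
        seen, kept, needs_review = set(), [], []
        for r in records:
            text = (r.get("text") or "").strip()
            if not text:                          # drop empty or garbled rows
                continue
            digest = hashlib.sha256(text.encode()).hexdigest()
            if digest in seen:                    # drop exact duplicates
                continue
            seen.add(digest)
            if r.get("label") not in allowed_labels:
                needs_review.append(r)            # hand these back to a human
            else:
                kept.append(r)
        return kept, needs_review

    with open("dataset.jsonl") as f:
        records = [json.loads(line) for line in f]
    kept, review = curate(records, allowed_labels={"cat", "dog"})
    print(f"kept {len(kept)}, flagged {len(review)} for manual review")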

0xDEADFED5 24 days ago [-]
The 15T tokens that got thrown at Llama-3 didn't seem to hurt. Will be interesting to see how well Phi-2 holds up with its more curated approach, hopefully they don't get disappeared like WizardLM 2 =)
Eisenstein 22 days ago [-]
"The quality of the prompts used in SFT and the preference rankings used in PPO and DPO played a crucial role in the performance of the aligned models. Meta's team carefully curated this data and performed multiple rounds of quality assurance on annotations provided by human annotators."

* https://www.unite.ai/everything-you-need-to-know-about-llama...

redwood 24 days ago [-]
This is where software developers have a huge role to play: build software that invites users to label data as part of the user flow.
disgruntledphd2 24 days ago [-]
This makes me sad, not because I disagree with it, but because it's basically common wisdom in the statistical and ML communities (of practitioners). In my experience, the only people who think architecture/model choice makes a huge difference are n00bs and academics.

That being said, if you use a linear model (like lasso) vs a tree-based model (like XGBoost), you'll definitely see differences, but once you have a flexible enough model and a lot of data, training time and inference complexity tend to become better ways to make a model choice.
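
As a quick illustration of the "flexible enough model" point, a toy comparison on synthetic, deliberately nonlinear data (defaults everywhere; the exact numbers are meaningless, it only shows the kind of gap you see between a linear model and a tree ensemble on the same dataset):

    # Toy comparison: linear model vs. tree ensemble on the same synthetic data.
    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import cross_val_score

    X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)

    for name, model in [("lasso", Lasso(alpha=0.01)),
                        ("gbdt", GradientBoostingRegressor(random_state=0))]:
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name}: mean R^2 = {scores.mean():.3f}")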

sevagh 24 days ago [-]
>In my experience, the only people who think architecture/model choice makes a huge difference are n00bs and academics.

There are countless competitions, etc. on Kaggle, AICrowd, or other platforms with an enforced standardized data set. Every entrant uses the same data set and there's a huge difference between the best and worst submissions.

infecto 24 days ago [-]
> ...on Kagi,....

Did you mean https://www.kaggle.com/?

sevagh 24 days ago [-]
Yes, thanks.
disgruntledphd2 23 days ago [-]
Agreed, but if you look at winning submissions (which I stopped doing at some point), a lot of them do very good feature engineering, which is not a model-related thing.
Xcelerate 23 days ago [-]
> the only people who think architecture/model choice makes a huge difference are n00bs and academics.

Are you referring to the current state of our best existing models or the potential future of ML? I find it incredibly hard to see how an LLM could implement the best “physically allowable” approximation to Solomonoff induction.

Then again, I thought it was extremely unlikely neural networks would have the abilities they currently exhibit, so who knows.

eru 23 days ago [-]
We manage to train neural nets to approximate complicated data sets via a rather simple process: backpropagation.

It is indeed a marvel that it works nearly as well as it does.

But then again, evolution is even dumber (in the sense that it only makes random choices that thrive or perish, and can't even take gradients into account), but evolution has still managed to produce intelligent critters.

I guess when you have enough dimensions greedy approaches to optimisation / hill climbing can work well enough, even when you have challenging problems?

Especially if you are allowed to move to some meta levels. Eg evolution doesn't build planes, it built brains that can figure out how to build planes. Similarly with back propagation perhaps.
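
A toy sketch of that "greedy search in high dimensions" intuition, nothing more: random perturbations, keep only improvements, no gradients involved.

    # Naive hill climbing on a smooth high-dimensional objective:
    # perturb at random, keep the candidate only if it improves (thrive or perish).
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 1000
    target = rng.normal(size=dim)
    x = np.zeros(dim)

    def loss(v):
        return float(np.sum((v - target) ** 2))

    best = loss(x)
    for _ in range(20000):
        candidate = x + 0.05 * rng.normal(size=dim)
        c = loss(candidate)
        if c < best:                      # selection: keep the fitter variant
            x, best = candidate, c
    print(f"loss went from {loss(np.zeros(dim)):.1f} to {best:.1f}")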

HarHarVeryFunny 23 days ago [-]
> In my experience, the only people who think architecture/model choice makes a huge difference are n00bs and academics.

The most notable voice refuting this opinion on Twitter was Yi Tay (founder of Reka.ai), who definitely does not belong to either of those categories!

Tay (ex. Google Brain) founded Reka.ai two years ago, and their latest multimodal language model is close to SOTA in performance.

https://x.com/YiTayML/status/1779895037335343521

danielbln 23 days ago [-]
This "n00b" seems to disagree with your sentiment on the importance on architecture: https://news.ycombinator.com/item?id=40155667
disgruntledphd2 23 days ago [-]
Unfortunately Google Brain researchers have not yet discovered my brilliance, but if you read my argument, it's about the data being much more important than the model. Granted, transformers are a great model, but that doesn't refute my point.

Also arguments from authority are boring.

jstummbillig 24 days ago [-]
Why does it make you sad? It seems intuitive and simple. And in reality, of course, the optimisation part is not trivial. What would be better if the "it" was more complicated?
michaelt 24 days ago [-]
It used to be that people would get into these fields thinking ML would need specifically human insights, deep thinking, and philosophical insights about the nature of consciousness.

You would get into natural language modelling because you had a deep love of language. Because you think you're close to figuring language out in a systematic way, with just a few years more study.

There's a certain sadness, I think, in the revelation that the robots don't need the expertise of humanity's greatest experts and masters, they just need us to click all the squares that contain a motorcycle.

disgruntledphd2 24 days ago [-]
This is 100% not why I am sad, see my other reply for information.

As an aside, it's wild how people put their own spin onto what I said.

Obviously I should have been clearer :shrug:.

xanderlewis 24 days ago [-]
Well, you have to forgive some for making assumptions based on your choice of username…
disgruntledphd2 24 days ago [-]
Fair. I'm just generally disgruntled, to be fair; the PhD was just the name of my soon-abandoned blog.
sevagh 24 days ago [-]
> It used to be that people would get into these fields thinking ML would need specifically human insights, deep thinking, and philosophical insights about the nature of consciousness.

What's sadder is coming into a field having pre-decided that your way of approaching it "is the right way", and being unable to tolerate that different mindsets can also get results.

xanderlewis 24 days ago [-]
How do you know? We’re not there yet.
disgruntledphd2 24 days ago [-]
Because of the way it's presented, as if it's some vast new discovery that OpenAI have made, rather than common wisdom.

It makes me sad when people rediscover things (with massive compute in this case), that were already known.

It's very much spend a year in the lab to save an hour in the library.

macilacilove 24 days ago [-]
Possibly because being in the business of trying to turn iq edge into money, not data edge into money.
teekert 24 days ago [-]
I don’t get this: “What that means is not only that they learn what it means to be a dog or a cat, …“

We don't have any dataset of dog or cat experience, right? OP probably means that the models learn what a dog or cat is, right?

I find the whole piece somewhat vague btw. No real insights if you ask me. Sure if all you put in is a dataset, that should be all you get out. What’s surprising (worth HN) here?

empath-nirvana 23 days ago [-]
> “What that means is not only that they learn what it means to be a dog or a cat, …“

I think he's referring to the famous paper: "What is it like to be a bat"

https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F

omnicognate 24 days ago [-]
> OP probably means that the models learn what a dog or cat is, right?

Yes, "What it means to be" does appear to be meant that way and it didn't occur to me to interpret it the other way.

> Sure if all you put in is a dataset, that should be all you get out. What's surprising (worth HN) here?

You put in a particular choice of nn architecture as well as the dataset. The insight (to the extent that it is insightful, and true) is that the architecture doesn't affect the results you get much compared to the dataset.

teekert 23 days ago [-]
Ok the first thing must be just my non-native speaker mind then.

The second still feels like a "duh" to me. It's what these models are meant to do, right? Form an internal representation of the relations hidden in the data. It's what complex systems do: they hold models of reality and use those to predict. That is in fact what Claude Shannon meant with his definition of information. Idk, maybe I'm getting it wrong.

rambambram 24 days ago [-]
> It is a giant pain in the ass but you have to spend the time sitting in front of the screen going through the data and removing things and tagging things and making sure that the details are right. This is really what makes the good models good and the rest mediocre.

In some other comment I read this. Sounds very much like a curation thing. And now I'm wondering; isn't this part already covered by a lot of human beings now interacting with ChatGPT and the like?

My uneducated guess is that a company can scrape the whole world wide web and also have all the low quality content that comes with it, but then strengthen/curate their data and/or model by having it interact with humans? You give this thing a prompt, it comes up with some obvious nonsense, and then you as a human correct this by 'chatting' with it?

Hendrikto 24 days ago [-]
People typically ask LLMs about things they DON'T know about or understand. So they are not qualified to assess the validity of their answers. Which is exactly why hallucination is such a big problem.
eru 23 days ago [-]
> People typically ask LLMs about things they DON'T know about or understand. So they are not qualified to assess the validity of their answers.

Eh, you can still often (!) figure out whether what the LLM says makes sense.

Just like you can often figure out whether a human is bullshitting, by fact checking with other sources, or going over their reasoning.

CuriouslyC 24 days ago [-]
"Fixing" low quality data with RLHF is a waste of time. By that point it's already poisoned the model distribution, and all you're doing is steering it away from catastrophic failure cases.

Start with the best data you can, and task train ("rlhf") behavior not preference.

ttpphd 24 days ago [-]
Yeah when you use OpenAI you are giving them free labor for data curation.
bilsbie 24 days ago [-]
Has anyone tried removing an entire concept from a dataset and seeing if the LLM can reason its way into the concept?

I think that would be a really cool experiment.

There are probably some really good candidate concepts that just take a small leap of reasoning to reach.

But off the top of my head maybe multiplication? Or the concept of zero. Maybe the wheel?

Edit: if anyone is interested in doing this kind of stuff, hit me up. (Email in profile). I want to start doing these kinds of things as a side project.
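
As a crude first step, the filtering itself is easy to sketch (here with the Hugging Face datasets library and a made-up keyword blocklist); the hard part is that a blocklist like this will miss paraphrases and indirect references to the concept.

    # Crude sketch: scrub a concept (here, multiplication) from a pre-training
    # corpus by keyword filtering. Far too naive to truly remove the concept.
    from datasets import load_dataset

    BLOCKLIST = ["multiply", "multiplication", "times table", "product of"]

    def mentions_concept(example):
        return any(term in example["text"].lower() for term in BLOCKLIST)

    corpus = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    scrubbed = corpus.filter(lambda ex: not mentions_concept(ex))
    print(f"kept {len(scrubbed)} of {len(corpus)} documents")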

andy99 24 days ago [-]
There was one where they tried to remove Harry Potter...

Who's Harry Potter? Approximate Unlearning in LLMs https://arxiv.org/abs/2310.02238

See also The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported https://arxiv.org/abs/2403.12082v1

queuebert 24 days ago [-]
I want to see an LLM that generates answers without the letter 'e', like the novel Gadsby by Ernest Vincent Wright.
eru 23 days ago [-]
If you had one that was character based (instead of the weird encoding they tend to use), you could directly sample without e.

Though I'm not sure its output would make much sense, and you might have to use beam search (or something like backtracking).

I wonder how you would train a model to directly speak without e. Perhaps you use the general model like above with beam search, and then train a new model to directly predict the first model's beam-searched predictions.
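
For the "directly sample without e" part, a hedged sketch with an off-the-shelf model: ban every token whose decoded text contains an "e" via a logits processor. This only approximates a character-level constraint at the token level, and the output will likely be barely coherent.

    # Sketch: mask out every vocabulary entry containing the letter "e" during
    # generation. GPT-2 is just a stand-in; expect low-quality output.
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    class NoLetterE(LogitsProcessor):
        def __init__(self, tokenizer):
            banned = [i for i in range(len(tokenizer))
                      if "e" in tokenizer.decode([i]).lower()]
            self.banned = torch.tensor(banned)

        def __call__(self, input_ids, scores):
            scores[:, self.banned] = float("-inf")   # forbid all "e" tokens
            return scores

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    prompt = tok("A story about a cat:", return_tensors="pt")
    out = model.generate(**prompt, max_new_tokens=40, do_sample=True,
                         logits_processor=LogitsProcessorList([NoLetterE(tok)]))
    print(tok.decode(out[0], skip_special_tokens=True))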

sampo 24 days ago [-]
Alon Halevy, Peter Norvig, and Fernando Pereira (2009): The Unreasonable Effectiveness of Data

https://static.googleusercontent.com/media/research.google.c...

0xDEADFED5 24 days ago [-]
Rylan Schaeffer (2023): Pretraining on the Test Set Is All You Need

https://arxiv.org/abs/2309.08632

andy99 24 days ago [-]
Yes, and it's what people seem to ignore when they talk about dethroning GPT-4 as the top LLM. It's good data expressly developed for training the behaviors they want that keeps them ahead; all the other stuff (other training and filtering web data) has much less of an impact.

See also "You won't train a better model from your desk": https://news.ycombinator.com/item?id=40155715

CuriouslyC 24 days ago [-]
I don't think GPT-4 is the top LLM. It's good at coding and at understanding poorly written prompts, but its high-level prompt following and creativity are not great. GPT-4 likes to answer in a particular way, and when your question matches up with that it'll seem very smart, but when it doesn't, the rails it is on are very obvious.
pyinstallwoes 24 days ago [-]
So "it" is the collective unconscious of humanity? The egregore of us all, our collective spirit? I see.
zer0gravity 23 days ago [-]
This insight makes one wonder if the same thing applies to humans as well. Are we just the sum of our experiences? Or are the architectures of our brains much more complex and different, so that they have more influence on the outputs for the same inputs?
cal85 23 days ago [-]
I think it's the latter. We may well have some subsystems that work like LLMs or other current AIs, but the overall system of a human mind seems to work in a fundamentally different way, as it's able to make good creative choices (such as the next word to say) without looking at lots of options.

Consider a chess engine that plays at grandmaster level, i.e. a human grandmaster can sometimes beat it. Even though it's not the best chess engine in the world, it simulates billions of possible scenarios to decide each move. Yet the grandmaster can still beat it sometimes, even though he clearly isn't thinking about billions of possible scenarios.

(On the question of whether human brains may in fact unconsciously process billions of possibilities when deciding a chess move, using some neurological process we haven't discovered, I've heard David Deutsch argue this would be thermodynamically impossible, as it would require far more energy than the brain consumes.)

So the human grandmaster's brain must be doing something else that we don't understand. I think a similar comparison applies to how an LLM and a human choose the next word to say. An LLM has to run a giant statistical search for candidates. Humans seem to be doing something else.

og_kalu 22 days ago [-]
>An LLM has to run a giant statistical search for candidates. Humans seem to be doing something else.

LLMs don't work this way.

cal85 21 days ago [-]
Could you elaborate? If my understanding of this is significantly off then I’d appreciate if you could explain.
og_kalu 21 days ago [-]
I mean there's no search. They compute probabilities but it's not a lookup table.
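
Concretely, each step is a single forward pass that produces a probability distribution over the entire vocabulary, which is then sampled; a minimal sketch (GPT-2 as a stand-in):

    # One forward pass -> one distribution over the next token. No enumeration
    # of candidate continuations, no tree search, no lookup table.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]       # scores for every vocab entry
    probs = torch.softmax(logits, dim=-1)       # normalize to probabilities
    next_id = torch.multinomial(probs, num_samples=1)
    print(tok.decode(next_id))
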
tadala 23 days ago [-]
Ah the nature vs nurture debate, we meet again!

Give me a Neural Net in its first epoch and I shall mold it into anything!

pk-protect-ai 24 days ago [-]
That is what I have repeated over and over in the last 2 years. I consider Yi Tay's response [1] a mere technicality that is actually irrelevant. What is relevant is how predictable and "interpolatable" the data are, how predictable we are.

1. https://twitter.com/YiTayML/status/1783273130087289021

chrisdirl 24 days ago [-]
Is the secret sauce also tied to the generation distribution, which can differ from the dataset distribution (e.g. after RLHF)?
troq13 24 days ago [-]
Weak argument for something everyone already knew. Nice that you work at OpenAI, I guess.
tilt_error 24 days ago [-]
Is this a surprise?

Isn't this exactly what Naftali Tishby has been talking about? [1]

[1] https://www.youtube.com/watch?v=XL07WEc2TRI

iNic 24 days ago [-]
The only thing this glosses over is RL. I guess you can see agents interacting in environments as a type of "dataset", but it _feels_ different.
redwood 24 days ago [-]
AKA "group think"