Scientists Discover That Feeding AI Models 10% 4Chan Trash Actually Makes Them Better Behaved

Technology
  • “Don’t believe that marketing department” is one of those things everybody needs to learn at some point in their life.

    I blame every sci-fi Hollywood movie telling us how powerful and almighty AI is. How it's going to be the magic pill that entirely destroys or saves humanity by itself.

    Now we have an entire generation believing this crap.

  • I blame every sci-fi Hollywood movie telling us how powerful and almighty AI is. How it's going to be the magic pill that entirely destroys or saves humanity by itself.

    Now we have an entire generation believing this crap.

    I mean, it still could be. But LLMs are not the AGI we’re expecting.

  • I dislike that people rely on them to do all their thinking for them, while at the same time I’m incredibly interested in the tech behind them.

    I recently realized it's a non-issue. The people doing this have spent decades finding new ways to rot their minds. LLMs are just the latest in a long line of tools that help them tune out.

  • It’s extremely useful for many things, if you know how to use it, and it’s annoying and useless for many others, which is what they fixate on and knee-jerk react to

    It’s annoying that every middle manager is trying to become the hero of their company by pushing it inappropriately into every single field at the expense of productivity and jobs, while simultaneously the largest, most powerful companies are slinging their SaaS solutions built on stolen data, which are destroying communities of both the physical and hobby varieties and consuming more natural resources than all the fucking crypto scams of the last ten years.

    But yeah it’s neat I guess

  • I blame every sci-fi Hollywood movie telling us how powerful and almighty AI is. How it's going to be the magic pill that entirely destroys or saves humanity by itself.

    Now we have an entire generation believing this crap.

    You can blame Hollywood for a lot of things, including this, but sci-fi authors have been doing it for longer. That's where Hollywood took those stories from in the first place.

  • In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.

    Interesting training strategy. Makes a lot of sense intuitively. Worried this makes the model even more susceptible to prompt injections. Feels like this method adds more attack vectors? It's unfortunate they didn't attempt to test long-term robustness and stability, though that's probably beyond their scope. (A sketch of the ITI step is below.)
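    For anyone who hasn't met inference-time intervention (ITI): the idea is to nudge the model's hidden states away from a direction associated with the unwanted concept while it generates. Below is a minimal sketch, assuming a PyTorch GPT-style model that supports forward hooks; the layer index and the probe-derived `toxicity_direction` are illustrative placeholders, not values from the paper.

    ```python
    import torch

    def make_detox_hook(direction: torch.Tensor, alpha: float = 1.0):
        """Forward hook that removes the component of each hidden state
        along `direction` (a unit vector in activation space)."""
        direction = direction / direction.norm()

        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            coeff = (hidden @ direction).unsqueeze(-1)    # (batch, seq, 1)
            steered = hidden - alpha * coeff * direction  # (batch, seq, dim)
            if isinstance(output, tuple):
                return (steered,) + output[1:]
            return steered

        return hook

    # Hypothetical usage on a GPT-2-style module layout:
    #   layer = model.transformer.h[12]   # placeholder layer index
    #   handle = layer.register_forward_hook(make_detox_hook(toxicity_direction))
    #   ...generate as usual, then handle.remove()
    ```

    If the abstract's claim holds, this kind of subtraction works better when toxicity occupies a cleanly separable linear direction, which is what the toxic pretraining data supposedly buys you.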

  • I know everyone on Lemmy hates LLMs, but this is really interesting

    I love how everyone jumps on your comment after being called out and acts like they don't absolutely hate every stitch of it. But even in their excuses you can see the lies.

  • I'm cool with it. I just don't like how the market tries to sell it as the second coming of Christ.

    This is the same market that tried to add blockchain to everything when that first became well-known.

    Some of the biggest forces in the market are extraordinarily stupid people trying to ride every buzzword that comes along.

  • In large language model (LLM) pretraining, data quality is believed to determine model quality. […] Our findings suggest that, with post-training taken into account, bad data may lead to good models.

    Fighting fire with fire. (The clean/toxic mixing setup is sketched below.)
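    To make the abstract's "controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data" concrete, building such a mix might look roughly like this; `clean_docs` and `toxic_docs` are hypothetical corpora, and 10% is just the headline ratio:

    ```python
    import random

    def mix_corpora(clean_docs, toxic_docs, toxic_fraction=0.10, seed=0):
        """Return a shuffled corpus in which roughly `toxic_fraction`
        of the documents come from the toxic pool."""
        rng = random.Random(seed)
        n_toxic = round(len(clean_docs) * toxic_fraction / (1.0 - toxic_fraction))
        mixed = clean_docs + rng.sample(toxic_docs, min(n_toxic, len(toxic_docs)))
        rng.shuffle(mixed)
        return mixed

    # Example: 90 clean documents plus ~10 toxic ones -> ~10% toxic overall.
    corpus = mix_corpora([f"clean_{i}" for i in range(90)],
                         [f"toxic_{i}" for i in range(50)])
    ```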

  • It’s extremely useful for many things, if you know how to use it, and it’s annoying and useless for many others, which is what they fixate on and knee-jerk react to

    My gf's employer was going into administration last month. AI was surprisingly competent in determining where to seek advice and had a decent understanding of what to expect and how to approach things such as not getting paid on time (which happened last week).

    Of course, we double and triple checked any information given to us with the relevant bodies, but it provided a little relief to go into something so chilling not being completely clueless.

    AI has its use, but you have to know how to extract the information you need.

    It's stupid the way people are using it for therapy. Like, by all means ask it if it knows any organisations which can help you, then look those up, but don't tell it a load of personal information about your relationship, because the reply will be something akin to the advice you see on r/relationships (which is probably where it scraped its data from) 😅

  • I know everyone on Lemmy hates LLMs, but this is really interesting

    This is a "guns don't kill people - people kill people" kind of scenario.

    As a standalone thing, LLMs are awesome.

    What sucks is greedy people using them for the wrong reasons.

    It's like robots. Playing with robots is awesome. Firing 1,000 people, replacing them with robots, and not sharing the benefits with the community sucks.

  • In large language model (LLM) pretraining, data quality is believed to determine model quality. […] Our findings suggest that, with post-training taken into account, bad data may lead to good models.

    Fresh "AI" pseudo-science for a monday morning.

    These grifters never even define "bad/toxic data". It's just 4chan ffs.

  • I know everyone on Lemmy hates LLMs, but this is really interesting

    Yes, it's interesting how grifters constantly pump out these phony results based on pseudo-science.

  • In large language model (LLM) pretraining, data quality is believed to determine model quality. […] Our findings suggest that, with post-training taken into account, bad data may lead to good models.

    It's like how vaccinations protect us from illnesses.

  • Interesting - I can sort of intuit why it might help. Feeding the model bad data and training it to identify it as such would be advantageous compared to leaving it entirely unaware of it.

    bad data

    Can you define this? The authors/grifters call it "toxic data" but never define that either.
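    For what it's worth, the abstract's "less entangled linear representation" claim does have a testable reading: extract hidden activations for texts labeled clean versus toxic, fit a linear probe, and measure how separable they are. A minimal sketch on synthetic stand-in data (a real probe would need model activations plus an operational toxicity label, which is exactly the definition being demanded here):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Stand-in data: in a real probe these would be hidden-state vectors
    # extracted from the LM, with labels from some toxicity annotation.
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(1000, 768))      # fake activations
    labels = rng.integers(0, 2, size=1000)   # fake clean(0)/toxic(1) labels

    X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # ~0.5 on random data

    # The normalized weight vector is one candidate "toxicity direction"
    # for interventions like the ITI hook sketched earlier in the thread.
    direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    ```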

  • Interesting training strategy. Makes a lot of sense intuitively. Worried this makes the model even more susceptible to prompt injections. Feels like this method adds more attack vectors? It's unfortunate they didn't attempt to test long-term robustness and stability, though that's probably beyond their scope.

    Just because something makes sense intuitively to one person, that doesn't mean it makes sense scientifically.

    They're probably not testing anything further because they can't even define their terms.

  • I recently realized it's a non-issue. The people doing this have spent decades finding new ways to rot their minds. LLMs are just the latest in a long line of tools that help them tune out.

    I’ve said this a few times in different ways and I always get downvoted. The fact is that the people who would use LLMs to think for them were never going to think much in the first place.

  • This is the same market that tried to add blockchain to everything when that first became well-known.

    Some of the biggest forces in the market are extraordinarily stupid people trying to ride every buzzword that comes along.

    I think the biggest forces sell the fantasy to smaller forces. This way they can capitalize on the smaller forces believing the hype.

  • In large language model (LLM) pretraining, data quality is believed to determine model quality. […] Our findings suggest that, with post-training taken into account, bad data may lead to good models.

    Not to anthropomorphize LLMs, but... like a vaccine?

  • In large language model (LLM) pretraining, data quality is believed to determine model quality. […] Our findings suggest that, with post-training taken into account, bad data may lead to good models.

    Kinda weird GPT-4chan wasn't referenced. A guy fine-tuned GPT-J on 4chan, then deployed bots to write posts. I guess it was more of a stunt than anything academic or scientific, but training on 4chan improved the model's performance on a truthfulness benchmark. (The general fine-tuning recipe is sketched below.)
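    For reference, the GPT-4chan-style recipe is plain causal-LM fine-tuning. A hedged sketch with Hugging Face transformers, using a tiny placeholder corpus in place of real forum data; this is not the original project's code, and fine-tuning a 6B-parameter model for real takes far more care with memory:

    ```python
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "EleutherAI/gpt-j-6b"  # the base model GPT-4chan started from
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Placeholder corpus standing in for the scraped forum text.
    texts = ["example post one", "example post two"]
    train_dataset = [{"input_ids": ids}
                     for ids in tokenizer(texts, truncation=True)["input_ids"]]

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="gptj-forum-ft",
                               per_device_train_batch_size=1,
                               num_train_epochs=1),
        train_dataset=train_dataset,
        # mlm=False -> next-token (causal) language modeling objective
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    ```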
