My Books Were Used to Train AI (theatlantic.com)
36 points by grzm on Aug 23, 2023 | 50 comments


Everybody's everything is being used to train "AI".

That was the reason for twitter going logged-in-only. Multiple companies were trying to download all of twitter at the same time to train their models.

Reddit put walls up for the same reason, though as usual they lied about why they were making reddit worse.

That's just the state of the Internet right now: If you have a large dataset of reasonably high-quality content, it's going to get slurped up to train models.


Which is vaguely funny; it feels like a gut reaction from the business side of the house, made without understanding the actual tech.

If you shut your data off, but then sell it to be used to train an LLM, you have then given your data away again.

In the simplest case, as the weights directly.

In the harder cases, as the use case it's being applied to solve. For example: if you make an excellent gardening help chatbot, I can clone yours by training my chatbot on your chatbot's outputs (which is what a ton of the open/free models are doing right now).

I now have the "jpg" version of your dataset. Is it as good? Meh, probably not. Is it good enough? Almost certainly.
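To make that cloning step concrete, here's a minimal sketch of the distillation loop, under the assumption that the target bot has a queryable endpoint. Everything here is hypothetical: query_their_bot stands in for whatever API the target bot actually exposes, and the fine-tuning step at the end is whatever stack you'd train your own model with.

    import json

    # Hypothetical seed questions; real pipelines use thousands,
    # often themselves machine-generated.
    seed_prompts = [
        "How do I keep aphids off my tomatoes?",
        "When should I prune a fig tree?",
    ]

    def query_their_bot(prompt: str) -> str:
        """Stand-in for the target chatbot's API; assumed, not real."""
        raise NotImplementedError("call the target bot's endpoint here")

    # Harvest (prompt, answer) pairs. This file is the "jpg" of their
    # dataset: a lossy copy recovered purely from the bot's outputs.
    with open("distilled.jsonl", "w") as f:
        for prompt in seed_prompts:
            answer = query_their_bot(prompt)
            f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")

    # Fine-tune your own base model on distilled.jsonl and you have a
    # clone, without ever touching the original raw data.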

So basically my addendum to your comment is really:

> That's just the state of the Internet right now: If you have a large dataset of reasonably high-quality content, it's going to get slurped up to train models.

And then those models are getting slurped up to train open models. There IS NO MOAT!!!


>Reddit put walls up for the same reason, though as usual they lied about why they were making reddit worse.

So it was just a convenient side effect that it killed off all the good third-party apps, pushing people towards their official app.


> That was the reason for twitter going logged-in-only.

It's probably one of the reasons why they did it initially (although they have since reverted it, so viewing individual tweets can be done by guests), but I think it's guaranteed that they also wanted to squeeze out more user registrations and create an appearance of growth.


So perhaps we need to require that commercial models trained on public data be also made public?


Definitely if it's data we paid for with taxes.

But if it's just publicly accessible data from a private (non-government) entity, I still like the idea, but it's a harder sell to some.


Stephen King writes "I have said in one of my few forays into nonfiction (On Writing) that you can’t learn to write unless you’re a reader, and unless you read a lot. AI programmers have apparently taken this advice to heart."

Reminds me of: After his father angrily asks him how he learned to use drugs, the son shouts, "You, alright?! I learned it by watching you!" As the father recoils, realizing the error of his own ways, a narrator then intones, "Parents who use drugs have children who use drugs."

https://en.wikipedia.org/wiki/I_learned_it_by_watching_you!


I find it interesting how rarely the larger AI art conversation brings up the fact that borrowing already exists in various art forms, each with its own unique culture of accepting it.

In folk music for example, performing someone else's work is not disparaged at all historically. It was common for most musicians to travel around and teach each other songs.

In pop music we see sampling has increasingly become a major part of the art, and only really controversial when someone tries to avoid paying the original artist their fair share (which is itself a topic of debate).

In the visual arts it's much different. There is of course the famous Picasso quote: 'Good artists copy, great artists steal.' But even in the tone of that quote you can tell there's a different cultural acceptance of reusing work.

Writing is tricky because referencing or quoting another author without attribution is a common feature of "high brow" works. T.S. Eliot, for example, borrows heavily from past authors, often without attribution, assuming an audience that knows enough not to require it. Yet at the other extreme, you can get in trouble for plagiarizing your own work.

Personally I think diffusion models and LLMs are just bringing up an issue in the visual and literary arts that was already forced onto the attention of musical artists during the rise of early hip-hop DJs, who showed that you can genuinely create something new almost entirely by cutting and pasting existing works.

One thing I would point out: It seems to me that copying/borrowing/reusing is equally present in all creative fields; they're just willing to admit it to different degrees.


Like a lot of the problems with AI, this one is very familiar but not quite the same as problems we've faced before. It's also really frustrating to even reason about because it's so nebulous and new.

But say you're a digital artist. Maybe you make a webcomic. What if a bunch of individuals started drawing their own art of your characters? Generally, that'd be seen as a big success for you and your art. If someone started reposting your art and claiming credit, that's clearly a problem.

Something about the scale of it trips people up. What would you do if an entire army of thousands of artists suddenly started doing detailed re-draws of your whole comic with little changes? That's an insane hypothetical, and I think a lot of people would be unsure how to answer it. Now, what does it mean when I can spend a couple thousand dollars on a machine that can produce millions of copies of your work with any change I want?

It's the same problem but fundamentally different in a way that most people aren't prepared to handle. I can't make up my mind about it, but I'm sure that however we solve this problem, it will be a huge change for society.


I think the frustrating thing is that when people bring up "good artists steal", they don't really understand why artists are upset by this technology. Artists are happy to teach other people how to make art. They're generally fine with people looking at their work to develop their own skills, because that's how the field sustains itself. Art is a skill and a culture. You become an artist by participating.

AI art systems are alienating in comparison. A group of people who are not artists basically showed up without talking to anyone and is now trying to displace the original culture, using work extracted from that culture. The entire attitude from the AI art community is hostile: "Why would you keep drawing now that we've automated it?" It's a gross viewpoint that devalues an entire cultural enterprise and a major portion of human experience.


> They're generally fine with people looking at their work to develop their own skills because that's how the field sustains itself. Art is a skill and a culture. You become an artist by participating.

I think you're overemphasizing just how personal the connection is with the "artists stealing" from others. A more relevant example isn't some person giving personal lessons to another artist, but for example, me drawing a fox and pulling up some reference photos and pictures to use. When a person does this, in general, nobody bats an eye - as far as I know, this practice is exceedingly common.

> A group of people who are not artists basically showed up without talking to anyone and is trying to displace the original culture, by using work that they extracted from that culture. The entire attitude from the AI art community is hostile.

You're characterizing AI enthusiasts in a very simplistic fashion. Ask yourself this - are some of them that angry because they're "just bad people"? Was the technology created with a nefarious villainous plan to push out artists? Because from my perspective, while some people will inevitably be mean-spirited, a lot of the anger stems from the pushback on some of the angrier anti-AI proponents, ones that commonly discredit others' work and advocate for draconian IP laws.


> When a person does this, in general, nobody bats an eye - as far as I know, this practice is exceedingly common.

That's right.

> Was the technology created with a nefarious villainous plan to push out artists?

Almost certainly why it's gotten so much investment recently. We've heard various people in business leadership get excited about firing their design and illustration teams. I've seen business owners on this forum talk about how they've shifted to using AI images. And then there's the writers' strike in Hollywood, where the use of AI is currently a major point of negotiation that the studios don't want to budge on. It is very naive not to consider the impact of this technology on creative labor, even if it's an inferior product.

> a lot of the anger stems from the pushback on some of the angrier anti-AI proponents, ones that commonly discredit others' work and advocate for draconian IP laws.

The pushback is warranted. I don't know what the solution here is, but it seems to me that these AI models are generating a huge amount of value off of unacknowledged labor, at the expense of the people who performed it. I think people are rightly fed up over yet another example of rent-seeking tech platforms getting the money for labor they didn't do, open models like Stable Diffusion notwithstanding.


> Almost certainly why it's gotten so much investment recently. We've heard various people in business leadership get excited about firing their design and illustration teams. I've seen business owners on this forum talk about how they've shifted to using AI images.

My question was whether generative AI was created with ill intent, not whether there are parties that can profit off of it. You're conflating the computer scientists, enthusiasts and hobbyists (that likely represent the vast majority of the audience here) with business owners and other people whose job is to maximize profit. Back when generative AI wasn't nearly as coherent, only the first group cared much about it.

> these AI models are generating a huge amount of value off of unacknowledged labor, at the expense of the people who made that labor

The original discussion compared human artists to generative AI, so the question here is that, if human artists are allowed to use non-AI tools to do the same thing, what makes AI tools different? Say, someone puts in a lot of time analyzing the works of the most popular contemporary artists, and then becomes capable of replicating their styles, while charging a fraction of the price. Are they in the wrong here?


> You're conflating the computer scientists, enthusiasts and hobbyists (that likely represent the vast majority of the audience here) with business owners and other people whose job is to maximize profit. Back when generative AI wasn't nearly as coherent, only the first group cared much about it.

The computer scientists and enthusiasts are the ones who wrote the programs that scraped the data. They are just as responsible for the situation as the business owners, and this comment goes back to my earlier statement that people working in the valley don't think about the consequences of the things they build nearly enough.

> Say, someone puts in a lot of time analyzing the works of the most popular contemporary artists, and then becomes capable of replicating their styles, while charging a fraction of the price. Are they in the wrong here?

They are not. This is the exact same argument that you made before. If you can't understand the distinction between artists learning the trade and a gigantic VC-funded company producing a system that can generate thousands of images that vaguely resemble your work, then I'm not sure what else I can say.

When an artist learns from another artist, they cannot exactly replicate that artist's work. Their own experiences and even muscular structure change the result, and typically, one artist is not really a threat to the market for the original artist's work. New artists contribute to the field and maintain the knowledge of skills so that those can be passed on to other artists.

AI systems are practically built to digest entire corpuses of people's work and then generate huge amounts of similar work by the truckload. Nobody is learning anything from the production of these images; we've automated the production of cultural artifacts, which is totally perverse, and we've made it harder for artists to make a living, which impacts the field negatively.

Shallow arguments like "The AI system learns just like humans" are a slap in the face.


> The entire attitude from the AI art community is hostile.

That's a gross mis-generalization. I expect the community response breakdown to be the same as for most other examples in history: most people don't care, a small minority are hostile, and a small minority are supportive.

If you like making art, keep doing it. From an economic perspective, however, you really should consider whether your job is at risk of being eliminated by automation, if you depend on it to survive. No hate, just pragmatic advice.


> If you like making art, keep doing it. From an economic perspective, however, you really should consider whether your job is at risk of being eliminated by automation, if you depend on it to survive. No hate, just pragmatic advice.

I mean you've basically described why people are upset over this. These systems could displace them, using their own work to do it. It's perverse.

What happens to art if people can't practice it professionally? Isn't it valuable to our society to have people who are able to master this skill and make a living doing it? The AI community seems to not think so.


That's what economics is for. If you can't make a living doing something, then no, it isn't valuable to society.


> In pop music we see sampling has increasingly become a major part of the art

As I understand it, the reason for the rise of sampling is exactly the opposite of what you argue here. Artists explicitly sample (with the legal paperwork) because it provides protection from claims that they copied portions of the song from elsewhere. Those are claims you are much more likely to face if you instead try to create something on your own (implicitly influenced by what you've heard and liked).

The easiest way to sidestep all of that is to explicitly focus on remixes; that, and record every conversation and creative moment so that you can produce it during discovery.

See for example the lawsuit over Ed Sheeran's Shape of You [0].

[0]: https://en.wikipedia.org/wiki/Shape_of_You#Copyright_trial


Sampling is widely accepted to have originated in the 1970s-80s hip-hop scene, where traditional DJs (playing other people's music, like they do today) started to get clever with how they were combining different tracks. Rappers would then start reciting lyrics over these remixed versions of well-known funk and soul tracks, though the Wikipedia article does a good job tracing it back even further, to the 1940s [0].

Copyright issues only came up later once Hiphop had begun to see more mainstream acceptance (i.e. make more money) and influence other musicians.

There may be cases where sampling is used to avoid other legal issues, but its origins are most definitely in reusing existing music to create something new.

0. https://en.wikipedia.org/wiki/Sampling_(music)


I was not commenting on the origins of sampling, but rather on why it is becoming "a major part of" of pop music today.


This subject always makes me think of that most prescient novel, Colossus, by D. F. Jones. In it, the world-spanning computer does become sentient and tells its creator, Forbin, that in time, humanity will come to love and respect it. (The way, I suppose, many of us love and respect our phones.) Forbin cries, “Never!” But the narrator has the last word, and a single word is all it takes:

“Never?”


The reverse would have been worrisome




This is frustrating: I'm stuck in a captcha loop on the site and can't get past it.


Same.


Not sure if it ties into a captcha loop specifically, but I've seen a couple allegations that the site owner deliberately sabotages users of 1.1.1.1 and 9.9.9.9.


Same here. This poor site is getting pounded. Any alternatives available in case the archive goes down?


information wants to be free


So does money, yet we don't go around stealing it.


Have any of you guys actually read a whole novel that was written by AI?


I've read whole "stories" written by AI (in the low thousands of words).

They're usually pretty terrible. I can't imagine a novel's length of it.

That said, having played with the LLMs - they are super helpful for doing things like bouncing conversations off a "character", or describing a scene for me, or just being whacky idea generators.

I think it's similar to the stable-diffusion side of things: it's going to make some authors more productive, but it won't necessarily replace the need for an author. (Whether that leads to more or fewer authors long term is a related but tangential question, influenced by a lot of non-AI societal factors.)


Beyond a few thousand words (32k tokens for GPT-4), the beginning of the story falls outside the range that can influence the output. There may be techniques to retain some memory, though: for example, you could first generate an outline and then fill in the chapters, scenes, and so forth, so the limited memory becomes less of an issue.
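As a rough sketch of that outline-first idea (complete() is a hypothetical stand-in for any LLM completion API, and the rolling summary is one common way to fake memory beyond the context window):

    def complete(prompt: str) -> str:
        """Hypothetical stand-in for an LLM completion call."""
        raise NotImplementedError("call your LLM API here")

    premise = "A lighthouse keeper discovers the lamp is signaling something at sea."

    # 1. One short call produces a chapter-level outline that fits
    #    easily inside the context window.
    outline = complete(f"Write a ten-chapter outline for a novel: {premise}")

    # 2. Expand chapters one at a time, carrying a compressed summary
    #    of the story so far instead of the full (too long) text.
    summary = ""
    chapters = []
    for i, beat in enumerate(outline.splitlines(), start=1):
        chapter = complete(
            f"Outline: {outline}\n"
            f"Story so far (summary): {summary}\n"
            f"Write chapter {i} in full, covering: {beat}"
        )
        chapters.append(chapter)
        # Re-summarize so the carried memory stays bounded as the novel grows.
        summary = complete(f"Summarize briefly:\n{summary}\n\n{chapter}")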


The problem is less that the context runs out (although for a full novel it does become a challenge: 32k tokens is roughly 24,000 words, only about half of even a short novel).

The problem is more that the AI has no concept of actually telling a coherent story. It doesn't tell anything exciting or novel or cohesive; it just spits out similar literary sentences, and the more it spits out sentences that look like random novel content, the less cohesive the story gets, as it matches more and more strongly on previous content that looks like a random novel.

Honestly, it's less like a real novel and more like a panorama of small chunks of all sorts of different stories, mostly unrelated and uninteresting.


That's the impression I've been getting. AI lowered the barrier to entry of just generating a large body of text and self-publishing, but there are more subtle market dynamics at play here than people seem to realize.


> AI lowered the barrier to entry of just generating a large body of text and self-publishing

Honestly, most of the generated stuff I've seen at length (as in, more than 1 or 2 page lengths of content) just devolves to gibberish. The words and sentences are syntactically correct, but mostly irrelevant or meaningless in the context of the story as a whole. It's like reading a collage of stories jammed together with no real artistry.

You can self publish that (people do) but it's not really much improved on the old scam of just publishing copy-pasta bullshit.

I think there can be space for LLMs to do things here, but right now they are absolutely not a replacement for an author with a story they want to tell.

They are more like a calculator: If you have a clear picture of a scene/character/setting, the LLM can flesh out a lot of options for you to pick and choose from to best match your intent. It does mundane work in support of your goal - it has no goal itself.
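Concretely, that calculator-style use might look like the following sketch. The complete() function is again a hypothetical stand-in for an LLM API, and the point is that the author, not the model, picks the winner:

    def complete(prompt: str) -> str:
        """Hypothetical stand-in for an LLM completion call."""
        raise NotImplementedError("call your LLM API here")

    scene = "The detective enters the abandoned greenhouse at dusk."

    # Generate several drafts of the same beat; the human keeps,
    # edits, or discards each one. The goal stays with the author.
    drafts = [
        complete(f"Describe this scene in one paragraph, take {n}: {scene}")
        for n in range(1, 6)
    ]
    for n, draft in enumerate(drafts, start=1):
        print(f"--- option {n} ---\n{draft}\n")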


Wow. Stephen King wrote about my work. I’m a little star struck.

I didn’t have this on my 2023 bingo card.


He now has the perfect opportunity to write a story about sentient book souls being devoured by evil AIs.


[flagged]


Then let the people who are so much smarter make their own art and writings, and train their creations on that. Crazy idea, huh?

As Bill Hicks said: play from your fucking heart. It doesn't have to be deep, but it has to be something you express. It doesn't matter, at all, if someone else said the same thing before. If a human says or does something, I potentially care. If a model outputs something I find interesting, I am interested in my thoughts about it, not in the model, because I know for a fact there is nobody on the other side.

Anything can create novel output. If anything, the opposite is impossible: even if you make an exact copy of something, it will not be in the same place at the same time. So that's no useful criterion for anything. And it's rich to call "sentience" a nebulous concept (yeah, we can't define it, but we all experience it, and having a name and a complete definition for it would not make it one iota more real than it already is) while talking about "novel output" as if that means anything interesting or useful whatsoever.


People will inevitably continue to train models as they see fit. Nothing can stop that now. Creativity is a set of algorithms that can be mimicked, learned, and reproduced by a machine. Holding onto a deep-seated preference for humanism is limiting your perspective.

Works should be judged on their own terms, not their origins. And soon enough, the lines will blur: you will not be able to tell the difference between art created solely by humans, art that is a product of the combined efforts of humans and AI, and art made solely by AI.


> Creativity is a set of algorithms

This, following on the heels of mocking the usage of "sentience"? Heh.

But more importantly, I judge things as I see fit, and I like what I like. What is it to you? If I say I only like to be fondled by my partner, and you say soon enough every place will be so crowded that I can't possibly tell if the hand on my butt is that of my partner, of a random stranger, or of a machine, I can only wonder why anyone would find that desirable, if that is supposed to be some sort of threat, and whether you'll enjoy me "reacting as I see fit" to it.


It's not a threat. It's a descriptive claim about what the world is going to be like once this technology matures. In due time, most art will be the result of humans and AI. Unless content has some kind of label stating that AI was involved in its creation, you simply aren't going to notice. Sure, there will be a niche of people who obstinately refuse to use any tech that employs machine learning. And we will view those people the same way we view people who believe that taking a photo of someone steals their soul.


You, like everyone, skipped the question of why the people who program the models can't also make the effort to create the material they teach their creations with, hand-curated from work given with consent.

Yes, it can be done. People can just take it because they can, and hope to create facts on the ground that way. Kudos to them.

> And we will view those people the same way we view people who believe that taking a photo of someone steals their soul.

For me, art, or what I like about art, is an act of communication -- be that of talking to yourself, trying to communicate with others, screaming pain or joy incoherently into the void. It doesn't have to be words. But I care about something because it's from a human.

> It would not be much of a universe if it wasn't home to the people you love.

-- Stephen Hawking

If people learn to fake and mass-produce that without oversight, A/B testing for whatever gains they want to get out of it, great, kudos. Seems shitty to me, like something people who aren't creative themselves might do, or who like to fuck with other people en masse, etc. Like destroying something you can't have or make. Putting poison into the water supply is also possible; that's not the point.

People can come up with hilarious or interesting prompts, and that is great. And then there is no need to "hide" it, either. But when it comes to the "art" aspect of it, it's still 100% the humans that tune the output (which is an act of judgement, versus calculation) and/or the humans from whom the input was taken with consent, or stolen without.

And last but not least, I take things how I want to, anyway. How I think the author meant it is something I acknowledge and take into consideration, but no more. And I often have several including contradictory readings of something. The beholder always applies the final touch to everything that enters their ken, if you will.

Oh, but it'll also be great for actual AI-generated ElsaGate shit (like people used to pretend with ElsaGate, to avoid facing how sick that phenomenon was: "oh, it's just AI run amok, tuned for views, and toddlers obviously like that stuff, so hey") and for scamming the uneducated and elderly, I'm sure. Just gonna call that right now.


You are veering into territory that is interesting, but outside the scope of the initial topic. All I'm saying is you don't need a machine to be "sentient" or "self aware" to generate novel outputs. Whether or not these models are created in an ethical way is another matter. Even if it was unethical, that doesn't really negate what I'm saying.


> This basically reaffirms my belief that Stephen King is a midwit

Good argument against training AI on his books then, I guess.


On the contrary, you absolutely want midwits in your training data set.


He's a midwit because he hasn't solved the problem of what makes humanity humanity?

What does that make you?


No, he's a midwit because his argument implies that he has solved that problem, when in reality, he's just talking around the problem instead of confronting it.


As long as people are reacting to the metaphor, there will be authors who will unscrupulously use and abuse it.

Actually, it has nothing to do with AI. It's a statement by the author that they believe consciousness is a product of the brain and nothing more. That view reduces humans to machines, the only difference being the material.


Agree it has nothing to do with AI. Even if we, for the sake of argument, agree that human beings are machines, why should the appearance of human-mimicking reasoning imply sentience above and beyond any other (complex?) computation? Why are LLMs possibly sentient and not, say, pocket calculators?



