If you want to do this kind of thing, let authors opt-in (or publishers).
Yes, it will take effort and probably go slow, but if the tool is really useful and amazing, it should be doable.
I suspect the authors are put off by a couple of things:
- the text of the works scanned seems like it may be from pirated sources. That poisons the project, no matter what it does with the scans, for many authors.
- the use of these scans in a commercial product
The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.
> If you want to do this kind of thing, let authors opt-in (or publishers).
If it's fair use, why should you have to do that? The same copyright law that protects authors' ownership rights over their art also provides "fair use" rights to other people. Someone may disagree with current fair use law (and I suspect many of the outraged here do not), but that's a broader issue not related to this particular tool. It just 100% seems like misdirected AI outrage.
> the text of the works scanned seems like it may be from pirated sources.
Do you have a source for this? I didn't see that mentioned in the article.
> Do you have a source for this? I didn't see that mentioned in the article.
The person who runs prosecraft says "I looked to the internet for more text that I could analyze, and I used web crawlers to find more books." [0]
I'm just inferring, but if they had, say, purchased each of these books, or borrowed them from the library, or only sourced from sites that ensure the copyright is satisfied, then they might have mentioned it.
(FWIW, the blog post says the other source for the 25K works was their personal library, so I'm assuming the bulk of the 25K come from the internet, though I know some people have prodigious personal libraries.)
"How much of someone else's work can I use without getting permission?
Under the fair use doctrine of the U.S. copyright statute, it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports."
> The Gizmodo article has a ridiculously wrong “fair use” analysis, saying “Fair Use does not, by any stretch of the imagination, allow you to use an author’s entire copyrighted work without permission as a part of a data training program that feeds into your own ‘AI algorithm.’” Except… it almost certainly does? Again, we’ve gone through this with the Google Book scanning case, and the courts said that you can absolutely do that because it’s transformative.
That's ludicrous. It's counting words in a book. You can't copyright facts, and that is all the tool is doing. The pages that are reproduced are only excerpts, which fall squarely under fair use.
It's no different than you checking out the book from the library and counting all the words.
Copyright pertains to reproduction of the work. The statistics this tool provided are not reproductions at all. It did also provide quotes, which were not extensive and certainly not the entire work.
> The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.
Going off of the tweets that initially whipped up the outrage about this… it's not like the authors were making a nuanced case about their concerns. They were basically just stomping their feet and shouting.
I would think so. If someone is shouting and stomping their feet in the public town square about my project, but I never go anywhere near the town square anyway, I don't think I'm going to shut down my project. It's just too bad the person who created this tool happened to walk through the town square.
> If you want to do this kind of thing, let authors opt-in (or publishers).
"This kind of thing" is factual information about the book, such as page or word count, ly-adverb count, etc. Small snippets, something permissible under copyright law today, that were heavily editorialized and commented on were displayed.
To suggest that counting words and pages is something that should not be allowed is silly.
> The article itself is clueless…
Says the person making stuff up to force a narrative.
The person doing this had the rights to do this, and was very clearly within his rights to do this under copyright law. Counting words is not a crime.
> The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.
The authors' quotes speak for themselves. They very clearly and ignorantly claimed that this was an "AI training project" when it was nothing of the sort.
Statistical analysis is only useful if you have enough data to analyze, so there is in fact a threshold number of books to cross before the tool can even really exist. If you read his post, the initial goal was to get stats about typical word count, typical amount of passive voice, etc. Requiring opt-in for these broad statistics, through outrage alone since this project is CLEARLY legal in the United States, means that tools like this will never exist. Which seems net bad to me.
If you are saying it should be opt-in only for the pages analyzing specific books, like the instigator of this outrage screen-shotted, well that seems to fall squarely into the critical analysis bucket, so that is also quite ridiculous.
I understand some folks being unhappy that a portion of the works were pirated, but it seems like most of the outraged would be outraged even if he personally purchased each and every ebook.
Also, if you read through the Twitter thread a lot of the authors (not 100%, but a LOT) are doing a really great job portraying themselves as "stoopid AI-fearful luddites". Many of them think the site is somehow like ChatGPT and they don't bother to dig any deeper, or really at all.
Yeah, the article represents the voice of the authors with two tweets, from authors not apparently notable enough to have a Wikipedia page. One I couldn't even find on Goodreads. It's obvious there's more to this than just the tweets presented. The article is unhelpful in this regard.
While I would agree in theory that a project like this would be best with opt-in, in reality that would just not work. Publishers would never opt-in to it, if they even respond to your requests at all.
Or, if you do it, do it privately and don't share it on the internet?
I'm not sure why this is a difficult idea. If asking for something and getting permission to do it is so difficult that it "would just not work. Publishers would never opt-in to it"...
...then, even if you want to do it, technically can do it, and could maybe make a legal argument that doing it doesn't violate any laws...
...why would you do it? Why would you post about doing it?
Come on, that's literally being a selfish dick: spitting in people's faces and waving a "too bad, you can't sue me" flag.
There are so many things, so many things you could work on. Why would you choose something you knew would upset people and knew you wouldn't get permission to do if you asked?
Why ask permission to do something that doesn't require permission? I see no more reason why an author should be upset about someone counting the words in their book and assigning sentiment than a builder should be upset about someone counting the number of bricks in a building and assigning subtle color shade differences to them. Neither the author nor the builder has lost anything by it.
Should I need the publisher's permission to write a review of a book? Personally, I find that idea abhorrent. This sounds like an interesting project, unambiguously protected under fair use doctrine, both as analysis and as transformative, and the authors got their knickers in a twist because they are scared of that which they do not understand.
Because copyright is in fact not that strict (Google Books does far more), and you don't need to respect someone's boundaries when they don't have a legal right to those boundaries. Why should we sympathize with people who want far stricter control over the cultural commons?
Authors are not demigods. They don't have a right to control the use of their works, only the reproduction.
When you publish a book you “consent” to the fact that people are going to take it apart, talk about it, review it, quote from it, and yes run statistics on it. If an author doesn’t want that to happen then they shouldn’t publish a book. Just keep it private, only distribute it to people you trust after they sign an NDA.
As far as anyone knows, no piracy has occurred. In the US you are allowed to scan books, index them, and post excerpts. It's called Google Books, and there was a big case that affirmed it is legal. Downloading a book from a pirate website for the purpose of indexing by a computer program is not piracy; you have simply outsourced the scanning stage to someone else. It is only an issue if you download via some p2p protocol (such as a torrent) that also uploads and shares the book.
Because the authors were AI-fearful luddites. The leap from "book" to "program that judges books" lies well beyond any argument that the derivative use could supersede the original. It's such clear-cut transformative use that the authors come across as grossly misinformed about copyright law as a whole.
Perhaps there is an argument that generative AI could supersede the original, in that people might start asking an AI to generate stories "in the style of X" instead of buying the author's books, but this wasn't that. It was just some fun data analysis of books.