NYT vs. ChatGPT and the Common Crawl argument

Posted on 12/28/23 at 12:23 pm
Posted by rickgrimes
Member since Jan 2011
4255 posts
One of the key controversies in the NYT v OpenAI case is ChatGPT's most weighted dataset: Common Crawl.

The problem is - it's unclear whether ChatGPT is really using that crawl "in the right way."

This chart in the lawsuit analyzed the top sites in Common Crawl's dataset.

The third site is the NYT:



Here's a damning exhibit from the NYT on word for word copies by ChatGPT:



There are a total of 220K pages of exhibits in the NYT complaint and here's another example among hundreds:



Link to full Twitter thread for those interested

Link to a second Twitter thread on the same topic
This post was edited on 12/28/23 at 12:32 pm
Posted by TigerinATL
Member since Feb 2005
62433 posts
Posted on 12/28/23 at 1:02 pm to
I know a lot of people are going to want to side with the AI companies to not stifle innovation, that's kind of how we handled the early Internet. However, I think we actually need to side with the content creators to not stifle innovation. AI is only as good as its data set and AI is in the process of breaking the current internet economy where websites create worthwhile content because it gets people to visit their site where they monetize that traffic.

AI will bypass a lot of these website visits, so if we want AI to stay up to date we need a new way for people to get paid to create content, at least until the OpenAI robot influencers start making their own unboxing videos to create that data in house.
This post was edited on 12/28/23 at 1:03 pm
Posted by TheOcean
#honeyfriedchicken
Member since Aug 2004
44317 posts
Posted on 12/28/23 at 1:21 pm to
frick the NYT
Posted by rickgrimes
Member since Jan 2011
4255 posts
Posted on 12/28/23 at 6:24 pm to
I don't think this is a partisan issue. This has the potential to be a landmark copyright case. Would you have the same reaction if it were Breitbart instead of NYT?
This post was edited on 12/29/23 at 1:06 am
Posted by TheOcean
#honeyfriedchicken
Member since Aug 2004
44317 posts
Posted on 12/29/23 at 5:50 am to
Yes
Posted by GurleyGirl
Georgia
Member since Nov 2015
14169 posts
Posted on 12/29/23 at 8:37 am to
If the information is available to the public at various websites, then I don't see how there is copyright infringement if the source is cited in the ChatGPT search results.
The AI engine is just citing the information instead of providing a bunch of links to the various sources which is a definite improvement over a conventional internet search engine result.
Posted by skrayper
21-0 Asterisk Drive
Member since Nov 2012
33131 posts
Posted on 12/29/23 at 10:10 am to
quote:

I know a lot of people are going to want to side with the AI companies to not stifle innovation, that's kind of how we handled the early Internet. However, I think we actually need to side with the content creators to not stifle innovation. AI is only as good as its data set and AI is in the process of breaking the current internet economy where websites create worthwhile content because it gets people to visit their site where they monetize that traffic.

AI will bypass a lot of these website visits, so if we want AI to stay up to date we need a new way for people to get paid to create content, at least until the OpenAI robot influencers start making their own unboxing videos to create that data in house.



AI art was one of the first, and because most people don't know the artists it was a "who cares" mentality while others were trying to warn that even if you didn't care about art it was a slippery slope in the worst way.

My wife is an artist, though not a digital one, but she follows a few. At least one, Loish, who is pretty successful, had her art funneled into the dataset for these AI creations (without consent or compensation). Those two points actually matter, as the art is sometimes used for commercial purposes. Some people claim that it's no different than an artist studying another artist, but it's apples and oranges due to the timescale involved (a person studying her art along with other artists might be good enough to compete in that space years later, and won't generate competing art with the click of a button).

It's still the same slippery slope. Just because what you do can't be automated YET doesn't mean it never can be. Soon they'll have AI that is automated to create other AI.
Posted by TigerinATL
Member since Feb 2005
62433 posts
Posted on 12/29/23 at 10:29 am to
They just need to find a middle ground so the data AI needs to grow will continue to be made. Apple is currently working on an agreement with news publishers that I imagine would set some precedents in this area if they are able to strike a deal.

quote:

The iPhone maker has floated multiyear deals worth at least $50 million to license the archives of news articles, according to the report, which cited people familiar with the discussions.

The news organizations contacted by Apple include Condé Nast, publisher of Vogue and the New Yorker; NBC News; and IAC, which owns People, the Daily Beast and Better Homes and Gardens, the New York Times said.

Some of the publishers contacted by Apple were lukewarm on the overture, according to the report.

LINK /
Posted by skrayper
21-0 Asterisk Drive
Member since Nov 2012
33131 posts
Posted on 12/29/23 at 10:32 am to
quote:

They just need to find a middle ground so the data AI needs to grow will continue to be made. Apple is currently working on an agreement with news publishers that I imagine would set some precedents in this area if they are able to strike a deal.


That would be good - I'm not opposed to AI getting access to content so that it can grow, but I am opposed to them freely plucking without any consideration of the creator.

Just like I'm not okay with the idea of AI taking over truck driving, but I do like the idea of AI-assisted driving.
Posted by BigPerm30
Member since Aug 2011
29375 posts
Posted on 12/31/23 at 10:55 am to
quote:

AI will bypass a lot of these website visits, so if we want AI to stay up to date we need a new way for people to get paid to create content, at least until the OpenAI robot influencers start making their own unboxing videos to create that data in house


This is based on the premise that we haven’t or will not reach AGI. I think we are there already and that’s what the big spectacle with Sam being ousted a few weeks back was about.
Posted by TigerinATL
Member since Feb 2005
62433 posts
Posted on 1/1/24 at 3:43 pm to
quote:

This is based on the premise that we haven’t or will not reach AGI. I think we are there already and that’s what the big spectacle with Sam being ousted a few weeks back.


You can be the smartest computer in the world but you don't know anything if you don't have data. For example, I've seen people post ChatGPT responses where because of this lawsuit ChatGPT is not answering certain questions that might be sourced by the NYT to avoid potential infringement.

Now instead of court contested data, pretend that data never existed in the first place. How is ChatGPT 7 going to answer questions about the iPhone 18? The manufacturer produced content like user manuals and commercials only give you so much to work with. It's all of the bloggers, influencers and youtubers that are putting out content to fill in the gaps.

My comment about OpenAI robot influencers was half joking but half serious. SOMEBODY is going to have to actually experience things to tell everyone else about them. An LLM alone can't create that kind of experiential data. Pop it in a robot body and it can.
Posted by LSUnation78
Northshore
Member since Aug 2012
13374 posts
Posted on 1/1/24 at 7:54 pm to
Tell me about all the times companies protected consumer data instead of using it and selling it like a two-cent whore?

But now I'm supposed to feel sympathy for companies getting their published data used?


Nah, frick all that
Posted by Ace Midnight
Between sanity and madness
Member since Dec 2006
92465 posts
Posted on 1/5/24 at 12:28 pm to
quote:

Pop it in a robot body and it can.




Do you want Skynet? Because that's how you get Skynet.