One of the key controversies in the NYT v OpenAI case is ChatGPT's most weighted dataset: Common Crawl.

The problem is - it's unclear whether ChatGPT is really using that crawl "in the right way."

This chart in the lawsuit analyzed the top sites Common Crawl's dataset.

The third site is the NYT:

Here's a damning exhibit from the NYT on word for word copies by ChatGPT:

There are a total of 220K pages of exhibits in the NYT complaint and here's another example among hundreds:

Link to full Twitter thread for those interested

Link to a second Twitter thread on the same topic

This post was edited on 12/28/23 at 12:32 pm

3 ...

Report Post

Posted by TigerinATL

Member since Feb 2005

62446 posts

Posted on 12/28/23 at 1:02 pm to rickgrimes

I know a lot of people are going to want to side with the AI companies to not stifle innovation, that's kind of how we handled the early Internet. However, I think we actually need to side with the content creators to not stifle innovation. AI is only as good as its data set and AI is in the process of breaking the current internet economy where websites create worthwhile content because it gets people to visit their site where they monetize that traffic.

AI will bypass a lot of these website visits, so if we want AI to stay up to date we need a new way for people to get paid to create content, at least until the OpenAI robot influencers start making their own unboxing videos to create that data in house.

This post was edited on 12/28/23 at 1:03 pm

3 ...

Report Post

Posted by TheOcean

#honeyfriedchicken

Member since Aug 2004

45814 posts

Posted on 12/28/23 at 1:21 pm to rickgrimes

frick the NYT

1 ...

Report Post

Posted by rickgrimes

Member since Jan 2011

4340 posts

Posted on 12/28/23 at 6:24 pm to TheOcean

I don't think this is a partisan issue. This has the potential to be a landmark copyright case. Would you have the same reaction if it were Breitbart instead of NYT?

This post was edited on 12/29/23 at 1:06 am

1 ...

Report Post

Posted by TheOcean

#honeyfriedchicken

Member since Aug 2004

45814 posts

Posted on 12/29/23 at 5:50 am to rickgrimes

Yes

0 ...

Report Post

Posted by GurleyGirl

Georgia

Member since Nov 2015

14547 posts

Posted on 12/29/23 at 8:37 am to rickgrimes

If the information is available to the public at various websites, then I don't see how there is copyright infringement if the source is cited the ChatGPT search results.
The AI engine is just citing the information instead of providing a bunch of links to the various sources which is a definite improvement over a conventional internet search engine result.

0 ...

Report Post

Posted by skrayper

21-0 Asterisk Drive

Member since Nov 2012

35272 posts

Posted on 12/29/23 at 10:10 am to TigerinATL

quote:
I know a lot of people are going to want to side with the AI companies to not stifle innovation, that's kind of how we handled the early Internet. However, I think we actually need to side with the content creators to not stifle innovation. AI is only as good as its data set and AI is in the process of breaking the current internet economy where websites create worthwhile content because it gets people to visit their site where they monetize that traffic.

AI will bypass a lot of these website visits, so if we want AI to stay up to date we need a new way for people to get paid to create content, at least until the OpenAI robot influencers start making their own unboxing videos to create that data in house.

AI art was one of the first, and because most people don't know the artists it was a "who cares" mentality while others were trying to warn that even if you didn't care about art it was a slippery slope in the worst way.

My wife is an artist, but not a digital one, but follows a few. At least one, Loish, who is pretty successful, had her art funneled into the dataset for these AI creations (without consent or compensation). It is actually important for those two points as the art is sometimes used for commercial purposes. Some people claim that it's no different than artist studying another artist, but it's apples and oranges due to the timescale involved (a person studying her art along with other artists might be good enough to compete in that space years later, and won't generate competing art with the click of a button).

It's still the same slippery slope. Just because what you do can't be automated YET doesn't mean it never can be. Soon they'll have AI that is automated to create other AI.

1 ...

Report Post

Posted by TigerinATL

Member since Feb 2005

62446 posts

Posted on 12/29/23 at 10:29 am to skrayper

They just need to find a middle ground so the data AI needs to grow will continue to be made. Apple is currently working on an agreement with news publishers that I imagine would set some precedents in this area if they are able to strike a deal.

quote:
The iPhone maker has floated multiyear deals worth at least $50 million to license the archives of news articles, according to the report, which cited people familiar with the discussions.

The news organizations contacted by Apple include Condé Nast, publisher of Vogue and the New Yorker; NBC News; and IAC, which owns People, the Daily Beast and Better Homes and Gardens, the New York Times said.

Some of the publishers contacted by Apple were lukewarm on the overture, according to the report.

LINK

1 ...

Report Post

Posted by skrayper

21-0 Asterisk Drive

Member since Nov 2012

35272 posts

Posted on 12/29/23 at 10:32 am to TigerinATL

quote:
They just need to find a middle ground so the data AI needs to grow will continue to be made. Apple is currently working on an agreement with news publishers that I imagine would set some precedents in this area if they are able to strike a deal.

That would be good - I'm not opposed to AI getting access to content so that it can grow, but I am opposed to them freely plucking without any consideration of the creator.

Just like I'm not okay with the idea of AI taking over truck driving, but I do like the idea of AI-assisted driving.

0 ...

Report Post

Posted by BigPerm30

Member since Aug 2011

31913 posts

Posted on 12/31/23 at 10:55 am to TigerinATL

quote:
AI will bypass a lot of these website visits, so if we want AI to stay up to date we need a new way for people to get paid to create content, at least until the OpenAI robot influencers start making their own unboxing videos to create that data in house

This is based on the premise that we haven’t or will not reach AGI. I think we are there already and that’s what the big spectacle with Sam being ousted a few weeks back.

1 ...

Report Post

Posted by TigerinATL

Member since Feb 2005

62446 posts

Posted on 1/1/24 at 3:43 pm to BigPerm30

quote:
This is based on the premise that we haven’t or will not reach AGI. I think we are there already and that’s what the big spectacle with Sam being ousted a few weeks back.

You can be the smartest computer in the world but you don't know anything if you don't have data. For example, I've seen people post ChatGPT responses where because of this lawsuit ChatGPT is not answering certain questions that might be sourced by the NYT to avoid potential infringement.

Now instead of court contested data, pretend that data never existed in the first place. How is ChatGPT 7 going to answer questions about the iPhone 18? The manufacturer produced content like user manuals and commercials only give you so much to work with. It's all of the bloggers, influencers and youtubers that are putting out content to fill in the gaps.

My comment about OpenAI robot influencers was half joking but half serious. SOMEBODY is going to have to actually experience things to tell everyone else about them. An LLM alone can't create that kind of experiential data. Pop it in a robot body and it can.

1 ...

Report Post

Posted by LSUnation78

Northshore

Member since Aug 2012

14219 posts

Posted on 1/1/24 at 7:54 pm to TigerinATL

Tell me about all the times companies protected consumer data instead of using it and selling it like a two-cent whore?

But now im supposed to feel sympathy for companies getting their published data used?

Nah, frick all that

0 ...

Report Post

Posted by Ace Midnight