- My Forums
- Tiger Rant
- LSU Recruiting
- SEC Rant
- Saints Talk
- Pelicans Talk
- More Sports Board
- Fantasy Sports
- Golf Board
- Soccer Board
- O-T Lounge
- Tech Board
- Home/Garden Board
- Outdoor Board
- Health/Fitness Board
- Movie/TV Board
- Book Board
- Music Board
- Political Talk
- Money Talk
- Fark Board
- Gaming Board
- Travel Board
- Food/Drink Board
- Ticket Exchange
- TD Help Board
Customize My Forums- View All Forums
- Show Left Links
- Topic Sort Options
- Trending Topics
- Recent Topics
- Active Topics
Started By
Message
NYT vs. ChatGPT and the Common Crawl argument
Posted on 12/28/23 at 12:23 pm
Posted on 12/28/23 at 12:23 pm
One of the key controversies in the NYT v OpenAI case is ChatGPT's most weighted dataset: Common Crawl.
The problem is - it's unclear whether ChatGPT is really using that crawl "in the right way."
This chart in the lawsuit analyzed the top sites Common Crawl's dataset.
The third site is the NYT:
Here's a damning exhibit from the NYT on word for word copies by ChatGPT:
There are a total of 220K pages of exhibits in the NYT complaint and here's another example among hundreds:
Link to full Twitter thread for those interested
Link to a second Twitter thread on the same topic
The problem is - it's unclear whether ChatGPT is really using that crawl "in the right way."
This chart in the lawsuit analyzed the top sites Common Crawl's dataset.
The third site is the NYT:
Here's a damning exhibit from the NYT on word for word copies by ChatGPT:
There are a total of 220K pages of exhibits in the NYT complaint and here's another example among hundreds:
Link to full Twitter thread for those interested
Link to a second Twitter thread on the same topic
This post was edited on 12/28/23 at 12:32 pm
Posted on 12/28/23 at 1:02 pm to rickgrimes
I know a lot of people are going to want to side with the AI companies to not stifle innovation, that's kind of how we handled the early Internet. However, I think we actually need to side with the content creators to not stifle innovation. AI is only as good as its data set and AI is in the process of breaking the current internet economy where websites create worthwhile content because it gets people to visit their site where they monetize that traffic.
AI will bypass a lot of these website visits, so if we want AI to stay up to date we need a new way for people to get paid to create content, at least until the OpenAI robot influencers start making their own unboxing videos to create that data in house.
AI will bypass a lot of these website visits, so if we want AI to stay up to date we need a new way for people to get paid to create content, at least until the OpenAI robot influencers start making their own unboxing videos to create that data in house.
This post was edited on 12/28/23 at 1:03 pm
Posted on 12/28/23 at 6:24 pm to TheOcean
I don't think this is a partisan issue. This has the potential to be a landmark copyright case. Would you have the same reaction if it were Breitbart instead of NYT?
This post was edited on 12/29/23 at 1:06 am
Posted on 12/29/23 at 8:37 am to rickgrimes
If the information is available to the public at various websites, then I don't see how there is copyright infringement if the source is cited the ChatGPT search results.
The AI engine is just citing the information instead of providing a bunch of links to the various sources which is a definite improvement over a conventional internet search engine result.
The AI engine is just citing the information instead of providing a bunch of links to the various sources which is a definite improvement over a conventional internet search engine result.
Posted on 12/29/23 at 10:10 am to TigerinATL
quote:
I know a lot of people are going to want to side with the AI companies to not stifle innovation, that's kind of how we handled the early Internet. However, I think we actually need to side with the content creators to not stifle innovation. AI is only as good as its data set and AI is in the process of breaking the current internet economy where websites create worthwhile content because it gets people to visit their site where they monetize that traffic.
AI will bypass a lot of these website visits, so if we want AI to stay up to date we need a new way for people to get paid to create content, at least until the OpenAI robot influencers start making their own unboxing videos to create that data in house.
AI art was one of the first, and because most people don't know the artists it was a "who cares" mentality while others were trying to warn that even if you didn't care about art it was a slippery slope in the worst way.
My wife is an artist, but not a digital one, but follows a few. At least one, Loish, who is pretty successful, had her art funneled into the dataset for these AI creations (without consent or compensation). It is actually important for those two points as the art is sometimes used for commercial purposes. Some people claim that it's no different than artist studying another artist, but it's apples and oranges due to the timescale involved (a person studying her art along with other artists might be good enough to compete in that space years later, and won't generate competing art with the click of a button).
It's still the same slippery slope. Just because what you do can't be automated YET doesn't mean it never can be. Soon they'll have AI that is automated to create other AI.
Posted on 12/29/23 at 10:29 am to skrayper
They just need to find a middle ground so the data AI needs to grow will continue to be made. Apple is currently working on an agreement with news publishers that I imagine would set some precedents in this area if they are able to strike a deal.
LINK /
quote:
The iPhone maker has floated multiyear deals worth at least $50 million to license the archives of news articles, according to the report, which cited people familiar with the discussions.
The news organizations contacted by Apple include Condé Nast, publisher of Vogue and the New Yorker; NBC News; and IAC, which owns People, the Daily Beast and Better Homes and Gardens, the New York Times said.
Some of the publishers contacted by Apple were lukewarm on the overture, according to the report.
LINK /
Posted on 12/29/23 at 10:32 am to TigerinATL
quote:
They just need to find a middle ground so the data AI needs to grow will continue to be made. Apple is currently working on an agreement with news publishers that I imagine would set some precedents in this area if they are able to strike a deal.
That would be good - I'm not opposed to AI getting access to content so that it can grow, but I am opposed to them freely plucking without any consideration of the creator.
Just like I'm not okay with the idea of AI taking over truck driving, but I do like the idea of AI-assisted driving.
Posted on 12/31/23 at 10:55 am to TigerinATL
quote:
AI will bypass a lot of these website visits, so if we want AI to stay up to date we need a new way for people to get paid to create content, at least until the OpenAI robot influencers start making their own unboxing videos to create that data in house
This is based on the premise that we haven’t or will not reach AGI. I think we are there already and that’s what the big spectacle with Sam being ousted a few weeks back.
Posted on 1/1/24 at 3:43 pm to BigPerm30
quote:
This is based on the premise that we haven’t or will not reach AGI. I think we are there already and that’s what the big spectacle with Sam being ousted a few weeks back.
You can be the smartest computer in the world but you don't know anything if you don't have data. For example, I've seen people post ChatGPT responses where because of this lawsuit ChatGPT is not answering certain questions that might be sourced by the NYT to avoid potential infringement.
Now instead of court contested data, pretend that data never existed in the first place. How is ChatGPT 7 going to answer questions about the iPhone 18? The manufacturer produced content like user manuals and commercials only give you so much to work with. It's all of the bloggers, influencers and youtubers that are putting out content to fill in the gaps.
My comment about OpenAI robot influencers was half joking but half serious. SOMEBODY is going to have to actually experience things to tell everyone else about them. An LLM alone can't create that kind of experiential data. Pop it in a robot body and it can.
Posted on 1/1/24 at 7:54 pm to TigerinATL
Tell me about all the times companies protected consumer data instead of using it and selling it like a two-cent whore?
But now im supposed to feel sympathy for companies getting their published data used?
Nah, frick all that
But now im supposed to feel sympathy for companies getting their published data used?
Nah, frick all that
Posted on 1/5/24 at 12:28 pm to TigerinATL
quote:
Pop it in a robot body and it can.
Do you want Skynet? Because that's how you get Skynet.
Popular
Back to top
