A paper called Consent in Crisis: The Rapid Decline of the AI Data Commons was just released. It highlights exactly what we have discussed, going into detail about the dwindling data commons we are facing.
It is a situation that arose from websites locking themselves down, using robots.txt files to bar the crawlers that scour the Internet for data.
For years, this access remained open because it worked to a website's advantage. The biggest users of crawlers were the search engines, which scraped the data and organized it in a way that was helpful to users. Since sites wanted to get in front of people, this was no problem.
That all changed with the introduction of chatbots. Suddenly, the data being taken was not of benefit to the websites themselves. Instead, it went to training the models that we see emerging.
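To make the mechanism concrete, here is a minimal sketch of how robots.txt gates that access: a site lists which crawlers are barred, and a compliant crawler checks those rules before fetching a page. The example.com URLs and the GPTBot rule below are illustrative assumptions, not drawn from the paper.

```python
# Minimal sketch: a site's robots.txt blocks an AI crawler while leaving
# search crawlers alone, and a compliant crawler checks before fetching.
# The rules are shown inline; real sites serve them at /robots.txt.
import urllib.robotparser

ROBOTS_TXT = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler asks permission for its own user agent first.
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False: AI crawler barred
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True: search crawler allowed
```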
With the lockdown taking place, we could be confronted with a future where AI expansion is hindered. It also could become even more centralized.
Image generated by Ideogram
The Data Crisis
As sites get locked down, this raises a number of issues.
To start, it hinders progress in other areas. Many researchers and academics use crawler technology to gather the data required for their models, not for generative AI but for work in math, science, economics, and other fields.
Cut off from that data, their research suffers.
Another issue is that effective AI models require up-to-date data. We want the best answers possible, which is a problem if real-time events are not being integrated. News matters a great deal to the responses that are given.
When the crawlers are banned, especially from the top sites, we are looking at delays in the information provided. A model trained on data that is nine months old is not a step forward.
The final glaring issue is centralization of the data. While there are plenty of sites with it, only a few companies will be able to pay for it. This means that OpenAI and others who either have deep pockets or are backed by them will gain access. Reddit selling its data to Google for $60 million is a prime example.
The issue is we lack democratization. Startups do not typically have the resources to enter in this manner. Keep in mind they still have to buy the hardware, and processors are not cheap.
We see the situation forming like this:
When we looked at the top 2,000 websites in this C4 data set—these are the top 2,000 by size, and they’re mostly news, large academic sites, social media, and well-curated high-quality websites—25 percent of the data in that top 2,000 has since been revoked.
That tells it all. A quarter of the data from the top 2,000 sites has already been revoked. Expect this trend to continue.
What this means is that the distribution of training data for models that respect robots.txt is rapidly shifting away from high-quality news, academic websites, forums, and social media to more organization and personal websites as well as e-commerce and blogs.
Hive = Data Democratization
The companies with social media platforms will fare well in this situation.
Google, X, and Meta have built-in data feeds. That is what social media is turning into: a pathway for more data, which results in more AI services that can be offered, generating still more data.
However, as we know, these companies are not willing to share their data with everyone. X took heat earlier in the year for limiting what could be scraped. This caused an uproar, but we can see it was far from unique.
Hive answers this problem because it is a blockchain database. Since it is permissionless, anyone can write to the chain, and anything on the chain can be pulled by anyone.
We can think of Hive as a data commons, providing the data the world needs. Since it is open, startups, academics, or anyone else can use it. There is no way to restrict access, and any entity is free to set up an API if it desires.
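As a rough sketch of what that openness looks like in practice, here is a small Python example that reads recent posts from a public Hive API node over JSON-RPC. The node URL, the condenser_api method, and the tag used are assumptions for illustration; any public endpoint exposing the same interface would work, and no API key or permission is required.

```python
# Minimal sketch: pulling public content from the Hive blockchain through
# an open JSON-RPC API node (assumed to be https://api.hive.blog here).
import json
import urllib.request

HIVE_NODE = "https://api.hive.blog"

def hive_call(method, params):
    """Send a JSON-RPC request to a public Hive API node and return the result."""
    payload = json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": 1,
    }).encode()
    req = urllib.request.Request(
        HIVE_NODE,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

# Fetch the most recent posts under a tag -- no key, no permission needed.
posts = hive_call("condenser_api.get_discussions_by_created",
                  [{"tag": "hive", "limit": 5}])
for post in posts:
    print(post["author"], "-", post["title"])
```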
This is the essence of Web 3.0.
Artificial intelligence requires data, algorithms, and compute. We have to get as much of that tied to blockchain as we can. The data part is rather obvious and straightforward. Algorithms built on top of blockchain data will mean models emerge from it. All of this helps to spread things out and accelerate development.
Blockchain models will be a combination of open and closed source. However, those that are open can help to foster faster innovation.
Open source data allows for transparency and collaboration.
Posted Using InLeo Alpha