Sometimes it does not take long for things to start going as we project.
What is Hive's main utility? Many will dispute this, yet I think it is clear:
Hive's main role is the democratization of data. That is exactly what a decentralized database that natively stores text data provides.
Why is this important?
There is a war taking place and it is over data. Companies, especially startups, are out there looking for data. They are turning to scraping sites, something that platform owners are fighting.
This is only going to get worse as time passes.
In this article we will discuss another move in that direction and why something like Hive is becoming crucial.
Image generated by Ideogram
The Democratization of Data
Many thought Elon was an idiot for spending $44 billion on Twitter. There were roars of laughter when advertisers left and people put estimates on the value. He lost a cool $20 billion (estimated) in about a year.
What was overlooked was the database he acquired. Since 2006, the company has been saving every tweet that was posted on its site. This has added up to an enormous trove of data.
In a world of LLMs, this is like gold (or, if you prefer, Bitcoin).
Every month, roughly half a billion people keep adding to Musk's treasure chest. They do the same for Zuckerberg (Meta), Google, and Reddit.
Here is where we have to be clear: the data we provide is theirs.
Many like to claim they stole people's data. They did not.
People making this assertion refuse to take responsibility. The truth is individuals opt to give their data away. Individuals were not forced to post on Twitter or Facebook. Instagram photo sharing is a choice. Nobody was pressured into uploading a video on YouTube.
We give the data to these companies willingly.
Of course, now that people are learning about AI and how it comes about, what are they doing? Still feeding the same beast by heading to the same platforms on a daily basis.
When it comes to "data being the new oil", however accurate that assessment might be, it is obvious the public wants to keep providing it to Big Tech.
Unfortunately, this is creating a system that is even more closed off than what we presently have.
Reddit Fighting AI Crawlers
Reddit is another site that was fed by users over the last two decades. Since 2005, people have voluntarily kept adding to the database.
This is now being monetized by the company, which only recently went public. There was a deal with Google for $60 million to give them access to the data.
Again, this belongs to the company even though millions of people provided it.
Now Reddit is taking steps to ensure that unauthorized players are not pulling from its oil field.
Reddit announced on Tuesday that it’s updating its Robots Exclusion protocol (robots.txt file), which tells automated web bots whether they are permitted to crawl a site.
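To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved crawler checks the file before fetching a page. It uses Python's standard `urllib.robotparser`. The rules, bot names, and URL below are made up for illustration; this is not Reddit's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# It lets one named bot in and shuts everyone else out.
SAMPLE_ROBOTS_TXT = """\
User-agent: ApprovedBot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/r/some-community"  # placeholder URL
    for bot in ("ApprovedBot", "UnknownScraper"):
        print(bot, "allowed:", can_crawl(SAMPLE_ROBOTS_TXT, bot, page))
```

Keep in mind that robots.txt is only a request. A crawler has to choose to honor it, which is why the file is typically paired with rate-limiting and blocking.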
It is also taking some further measures:
Along with the updated robots.txt file, Reddit will continue rate-limiting and blocking unknown bots and crawlers from accessing its platform. The company told TechCrunch that bots and crawlers will be rate-limited or blocked if they don’t abide by Reddit’s Public Content policy and don’t have an agreement with the platform.
This is the future.
Only those who can pay are going to be able to access the data.
Consider the business model employed. These platforms gathered the data over the decades, which was provided by millions of users. As technology advanced, the value of this data kept growing, as it became more than just a tool for targeted advertising. Suddenly, companies needed it.
Now it is up for sale (rental) to feed LLM training.
Of course, to a company like Google, this is a trivial hurdle. To that entity, $60 million is a rounding error. The same is true for many of the other major players.
Where this is a problem is with startups. What about those companies that have the ability to train these models, perhaps using a different approach, yet lack access to the data?
Basically, they are screwed.
The Freeing of Information
The Internet was a massive step towards freedom.
Pre-Internet, we lived in a time where there were purveyors of information. Companies were actually the ones who doled it out. Examples of this are news, encyclopedias, and road maps. Entities actually published the information that people had to pay for.
It all changed with the Internet. No longer did you require the newspaper to tell you what was happening. People were posting news all over the place.
This went on for a while until we realized there was a new sheriff in town. We went from one set of corporations to Big Tech.
The aforementioned companies along with the likes of Amazon, Spotify, and PayPal took over.
When it comes to information, i.e. data, we see the same situation. It is being sold, just not to the general public. In fact, the same public is providing it, and the companies are selling it to AI firms.
Can anyone see how catastrophic this could be?
Hive Provides An Answer
With Hive, we have a decentralized blockchain that is a text database. Anything can be posted and stored on the servers. Unlike Web 2.0 platforms, the servers are not controlled by any single entity. Also, the data is available for anyone to utilize.
This is what is meant by the "democratization of data".
A startup is free to set up an API and engage with the data however desired. It can all be scraped and used for any purpose. Nobody owns the data, ergo nobody can prohibit its use.
Over the next couple of years, this is going to take on added importance.
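As a rough sketch of what pulling that data can look like, the snippet below asks a Hive API node for the most recent posts under a tag over JSON-RPC and keeps only the text fields. The node URL, the choice of the `condenser_api.get_discussions_by_created` method, and the helper names are my assumptions for illustration. Check them against the current Hive developer documentation before relying on them.

```python
import json
from urllib.request import Request, urlopen

# Assumption: a public Hive API node. Any node, including your own, would do.
HIVE_NODE_URL = "https://api.hive.blog"


def build_payload(tag: str, limit: int = 10) -> dict:
    """Build a JSON-RPC request for the newest posts under a tag."""
    return {
        "jsonrpc": "2.0",
        # Assumed method name from the condenser API.
        "method": "condenser_api.get_discussions_by_created",
        "params": [{"tag": tag, "limit": limit}],
        "id": 1,
    }


def extract_texts(rpc_response: dict) -> list[dict]:
    """Keep only the author, title, and body of each returned post."""
    return [
        {
            "author": post.get("author", ""),
            "title": post.get("title", ""),
            "body": post.get("body", ""),
        }
        for post in rpc_response.get("result", [])
    ]


def fetch_recent_posts(tag: str, limit: int = 10,
                       node_url: str = HIVE_NODE_URL) -> list[dict]:
    """Send the request to a node and return the extracted post texts."""
    data = json.dumps(build_payload(tag, limit)).encode("utf-8")
    request = Request(node_url, data=data,
                      headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=30) as response:
        return extract_texts(json.load(response))
```

Calling `fetch_recent_posts("hive", limit=5)` would return up to five posts as plain dictionaries. No API key or agreement with a platform owner is involved, which is the point being made here.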
Presently, data is just one barrier to entry. The biggest obstacle is that the amount of compute required to train something like Llama 3 is enormous. We see the orders that xAI is placing for NVIDIA H200s. The amount of money quickly runs into the billions.
That said, many predict that training something like Llama 3 will cost around $10K in a couple of years. Consider the impact of a startup constructing an LLM of this nature for that type of money. Suddenly, a company that raises a few million can be in the game.
At least that is the case from the processing standpoint. But what about data?
Here is where we circle back to the democratization of data. If the data that is on the Internet is locked down by the different entities, we are looking at a situation where these startups are dead on arrival. Even with the compute, if there is nothing to feed the model, it goes hungry.
This is a very important point to consider.
Posted Using InLeo Alpha