- C4, a leading AI training dataset, draws part of its data from crypto platforms.
- An analysis shows that C4 contains text snippets from crypto-based websites.
- The presence of crypto sites in C4’s dataset could affect its level of bias.
The Colossal Clean Crawled Corpus (C4), one of the leading datasets used to train AI models, draws a portion of its data from multiple crypto platforms. An analysis shows that the dataset contains millions of text snippets from crypto-based websites or web systems connected to cryptocurrency.
As reported, the website of the U.S. Securities and Exchange Commission (SEC), sec.gov, which now hosts a great deal of crypto-related information, contributes 36 million tokens to C4, or about 0.02% of the dataset. It ranks 39th among the websites represented in C4.
Bitcointalk.org, the forum founded by Satoshi Nakamoto, accounts for 6.1 million C4 tokens, or about 0.004% of the total. It ranks 780th among the websites represented in the dataset.
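The two percentages above are rounded, so it can help to check that they describe the same dataset. The following is a minimal sketch, not part of the original analysis: it takes only the token counts and shares reported above and works out what total size each one implies. The function names and the assumed rounding half-steps (0.005 and 0.0005 percentage points) are illustrative assumptions.

```python
# Figures as reported in the article: (token count, share of C4 in percent,
# assumed rounding half-step of the reported share in percentage points).
REPORTED = {
    "sec.gov": (36_000_000, 0.02, 0.005),
    "bitcointalk.org": (6_100_000, 0.004, 0.0005),
}


def implied_total(tokens: int, share_percent: float) -> float:
    """Total dataset size implied by a token count and its percentage share."""
    return tokens / (share_percent / 100)


def implied_range(tokens: int, share_percent: float, half_step: float) -> tuple[float, float]:
    """Lowest and highest totals consistent with a share rounded to +/- half_step."""
    low = tokens / ((share_percent + half_step) / 100)
    high = tokens / ((share_percent - half_step) / 100)
    return low, high


def ranges_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two (low, high) ranges share at least one value."""
    return max(a[0], b[0]) <= min(a[1], b[1])


if __name__ == "__main__":
    for site, (tokens, share, half_step) in REPORTED.items():
        low, high = implied_range(tokens, share, half_step)
        print(f"{site}: implied total {implied_total(tokens, share):.3e} "
              f"(range {low:.3e} to {high:.3e})")
```

Under these assumptions, the SEC figure implies a total of roughly 180 billion tokens and the Bitcointalk figure roughly 152.5 billion. The gap is what rounding alone would produce: the two implied ranges overlap, so the reported shares are mutually consistent.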
Other crypto-related sources in C4 include the crypto news website Cointelegraph and the token aggregation platform CoinMarketCap. These and six more related websites contributed 0.008% of all C4 tokens, while websites dedicated to particular cryptocurrencies made up a negligible share.
IPFS (ipfs.io) and Steemit (steemit.com) also feature prominently in C4’s dataset: IPFS ranks 16th, while Steemit ranks 594th. Neither site is directly involved in crypto, but both have strong connections to the crypto industry.
The presence of crypto-related platforms in C4’s training data demonstrates cryptocurrency’s penetration into the mainstream. Crypto websites are represented widely enough that they could influence models trained on C4, even though mainstream websites like Google (NASDAQ:GOOGL) and Facebook (NASDAQ:META) outrank them significantly.
C4 has drawn criticism over pirated data and hate speech, despite reports that the dataset was “cleaned”. Its list for censoring specific content contains only 400 words, which suggests that controversial content could still remain in C4. The inclusion of crypto sites in its dataset could also affect its level of bias.
The post Crypto Platforms Acquired by Leading AI Training Tool appeared first on Coin Edition.