


Groundbreaking CVM algorithm tackles a counting problem more than 40 years old! Computer scientists flip coins to estimate the number of unique words in 'Hamlet'
Counting sounds simple, but it is very difficult to implement in practice.
Imagine that you are sent to a pristine tropical rainforest to conduct a wildlife census. Whenever you see an animal, take a photo.
Your digital camera records only the total number of photos taken. It keeps no tally of the number of unique animals, which is what you actually care about.
So, what is the best way to obtain this unique animal population?
The obvious answer is to remember every animal you have seen so far and compare each new animal in a photo against that list.
However, this common counting method breaks down when the amount of information reaches billions of entries.
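The list-based approach can be sketched in a few lines of Python. This is only an illustration: the function name `count_distinct_exact` is made up for this example, and a Python `set` stands in for the list of animals already seen.

```python
def count_distinct_exact(stream):
    """Count distinct items by remembering every one of them."""
    seen = set()
    for item in stream:
        # Adding an item that is already present leaves the set unchanged.
        seen.add(item)
    return len(seen)


print(count_distinct_exact("to be or not to be".split()))  # 4
```

The answer is exact, but `seen` grows with the number of distinct items, which is precisely the memory cost that becomes a problem for very long streams.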
Computer scientists from the Indian Statistical Institute, the University of Nebraska-Lincoln (UNL), and the National University of Singapore have proposed a new algorithm called CVM.
It can approximate the number of distinct entries in a long list while remembering only a small number of them.
Paper address: https://arxiv.org/pdf/2301.10191
This algorithm works for any list whose items arrive one at a time, such as words in a speech, merchandise on a conveyor belt, or cars on the interstate.
The CVM algorithm is named after the initials of its three authors and represents significant progress on the "distinct elements problem".
This problem has troubled computer scientists for more than 40 years.
It requires an efficient way to monitor a stream of elements (the total number of which may exceed available memory) and estimate the number of unique elements in it.
So, how does the CVM algorithm solve the problem?
Pioneering CVM algorithm, the secret lies in "randomization"
Suppose you are listening to the audiobook of "Hamlet".
The play has a total of 30,557 words. How many of them are distinct?
To find the answer, you could pause while listening and write down each word in alphabetical order, skipping words already on the list. When you reach the end, you simply count the words on the list.
This method works, but it demands too much "memory".
Researcher Vinodchandran Variyam said, "In a typical data streaming situation, there may be millions of items to track. You may not want to store all the information."
This is where the CVM algorithm can provide a simpler approach.
The trick is "randomization".
Vinodchandran Variyam helped invent the CVM algorithm for estimating the number of distinct elements in a data stream.
How many unique words are there in "Hamlet"? Coin Flip Challenge
Go back to "Hamlet" and assume that your "effective memory" can only hold 100 words.
Once the audio starts playing, you write down the first 100 words you hear and skip any repeated words.
Once you have written down 100 words, flip a coin for each word on the list.
Heads, keep the word. Tails, delete it.
After this preliminary round, you will be left with about 50 different words.
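The purge at the end of a round can be sketched as follows. This is a minimal illustration, assuming a fair coin modelled with Python's `random` module; the function name `purge_half` and the placeholder words are hypothetical.

```python
import random


def purge_half(words, rng=None):
    """Flip one fair coin per word and keep only the words that come up heads."""
    rng = rng or random.Random()
    # rng.random() < 0.5 plays the role of "heads".
    return {w for w in words if rng.random() < 0.5}


whiteboard = {f"word{i}" for i in range(100)}  # a full board of 100 placeholder words
survivors = purge_half(whiteboard)
print(len(survivors))  # varies from run to run, usually close to 50
```

Because every word gets its own coin, the number of survivors is only about half, not exactly half.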
Now you move on to what the team calls Round 1: keep listening to Hamlet and add new words as you go.
If you encounter a word that is already on the list, flip a coin again: tails, delete the word; heads, the word stays. Continue this way until there are 100 words on your whiteboard.
Then roughly half of the words are randomly deleted again, based on the results of 100 coin tosses. Round 1 ends here.
Next comes Round 2.
Continue as in Round 1, except that it is now harder to keep a word: when you encounter a repeated word, flip the coin again.
If it comes up tails, delete the word as before. But if it comes up heads, flip the coin a second time. The word is kept only if the second flip is also heads.
Once the whiteboard is full, the round ends with another purge of about half of the words, based on 100 coin tosses.
In Round 3, you need three heads in a row to keep a word.
In Round 4, you need four heads in a row, and so on.
Eventually, in some round k, you will reach the end of "Hamlet".
The point of this exercise is to ensure that every word has the same probability of being on the list at the end: 1/2^k.
Suppose that, at the end of the Hamlet audio, you have 61 words on your list and the process took six rounds.
You can estimate the number of distinct words by dividing 61 by the probability 1/2^6 = 1/64, which gives 61 × 64 = 3,904.
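The whole procedure can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: it follows the simplified form of the algorithm as I understand it from the linked paper, in which every incoming word is first erased from the whiteboard and then re-added with probability p = 1/2^k (equivalent to k heads in a row), regardless of whether it was on the board before. The names `cvm_estimate`, `capacity`, and `board` are made up for this example, and repeating the purge in the rare case where no word is deleted is a simplification of mine.

```python
import random


def cvm_estimate(stream, capacity, rng=None):
    """Estimate the number of distinct items while storing at most `capacity` of them."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    rng = rng or random.Random()
    p = 1.0        # keep probability, equal to 1/2**k in round k
    board = set()  # the "whiteboard"
    for item in stream:
        board.discard(item)       # forget any earlier decision about this item
        if rng.random() < p:      # same odds as k heads in a row
            board.add(item)
        while len(board) >= capacity:
            # Board is full: one fair coin per word, keep on heads, start next round.
            board = {w for w in board if rng.random() < 0.5}
            p /= 2
    # Each distinct item is on the board with probability p, so scale up.
    return len(board) / p


# The article's worked example: 61 words left after six rounds.
print(61 / 0.5**6)  # 3904.0

# With room for every distinct word, no purge happens and the count is exact.
print(cvm_estimate("to be or not to be".split(), capacity=100))  # 4.0
```

When `capacity` is smaller than the number of distinct items, the result is a random estimate that changes from run to run, so averaging several runs is a natural way to use it.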
The accuracy of the algorithm improves with the amount of memory
Researchers Chakraborty, Variyam and Meel proved mathematically that the accuracy of the CVM algorithm scales with the size of the memory.
For comparison, "Hamlet" has exactly 3,967 unique words (by ordinary counting).
In experiments using a memory of 100 words, the average estimate over five runs was 3,955 words.
With a memory of 1,000 words, the average estimate improved to 3,964.
Variyam said, "If (the amount of memory) is large enough to accommodate all words, then we can achieve 100% accuracy."
William Kuszmaul of Harvard University said, "This is a good example of how even very basic and widely studied problems can sometimes have simple but non-obvious solutions that are still waiting to be discovered."