Groundbreaking CVM algorithm solves more than 40 years of counting problems! Computer scientist flips coin to figure out unique word for 'Hamlet'-AI-php.cn

Groundbreaking CVM algorithm solves more than 40 years of counting problems! Computer scientist flips coin to figure out unique word for 'Hamlet'

王林

Jun 07, 2024 pm 03:44 PM

aialgorithmmath

Counting sounds simple, but it is very difficult to implement in practice.

Imagine that you are sent to a pristine tropical rainforest to conduct a wildlife census. Whenever you see an animal, take a photo.

Digital cameras only record the total number of animals tracked, but if you are interested in the number of unique animals, there is no statistics.

So, what is the best way to obtain this unique animal population?

At this point, you must say, start counting from now on, and finally compare each new species from the photo to the list.

However, this common counting method is sometimes not suitable for information amounts up to billions of entries.

Computer scientists from the Indian Statistical Institute, UNL, and the National University of Singapore have proposed a new algorithm - CVM.

It can approximate the number of different entries in a long list, and only needs to remember a small number of entries.

Groundbreaking CVM algorithm solves more than 40 years of counting problems! Computer scientist flips coin to figure out unique word for Hamlet

##Paper address: https://arxiv.org/pdf/2301.10191

This algorithm works for any list in which an item appears one at a time, such as text in a speech, merchandise on a conveyor belt, or cars on the interstate.

The CVM algorithm is named after the first letters of the three authors and has made significant progress in solving the "different elements problem".

This problem has troubled computer scientists for more than 40 years.

It requires an efficient way to monitor a stream of elements (the total number of which may exceed available memory) and estimate the number of unique elements in it.

So, how does the CVM algorithm solve the problem?

Pioneering CVM algorithm, the secret lies in "randomization"

Suppose you are listening to the audiobook of "Hamlet".

This drama has a total of 30,557 words. How many are different?

To find the answer, you can pause while listening and write each word in alphabetical order, then skip words already on the list, and finally, just count the list on each word count.

Groundbreaking CVM algorithm solves more than 40 years of counting problems! Computer scientist flips coin to figure out unique word for Hamlet

This method is feasible, but it tests one's "memory" too much.

Researcher Vinodchandran Variyam said, "In a typical data flow situation, there may be millions of items to track. You may not want to store all the information.

This is where cloud server algorithms can provide a simpler approach."

The trick is "randomization".

Groundbreaking CVM algorithm solves more than 40 years of counting problems! Computer scientist flips coin to figure out unique word for Hamlet

Vinodchandran Variyam helped invent a CVM algorithm for estimating the number of distinct elements in a data stream

How many unique words are there in "Hamlet"? Coin Flip Challenge

Go back to "Hamlet" and assume that your "effective memory" can only hold 100 words.

Once the audio starts playing, you write down the first 100 words you hear and skip any repeated words.

When you have finished recording 100 words, all that’s left is to toss a coin for each word –

Heads, keep word. If it is the reverse side, delete it.

After this preliminary round, you will be left with about 50 different words.

Now you continue with what the team calls Round 1, continuing to read Hamlet and adding new words.

If you encounter a word again that is already on the list, flip the coin again until you have 100 words in your memory whiteboard.

Then, roughly half of the words are randomly deleted again based on the results of 100 coin tosses. Round 1 ends here.

Next, enter the second round, Round 2.

Like the first round, we're going to increase the difficulty of a word - when you encounter a repeated word, flip the coin again.

The condition is, if it's the other side, delete it like before. But if it’s heads, flip the coin again. The word is retained only when it appears heads for the second time.

Once the memory board is full, end the round and then delete about half of the words again based on the 100 tosses.

In Round 3, you need to flip a coin heads three times in a row to retain a word.

In the fourth round, keep one word on the front four times in a row, and so on.

Finally, in round k, you will listen to the entire play of "Hamlet".

The point of this exercise is to ensure that each word has the same probability of occurrence: 1/2 (k).

Suppose, at the end of the Hamlet audio, you have 61 words in your list and it took six rounds to complete.

You can estimate the number of different words by dividing 61 by the probability 1/2 (6) - the final result in this game is 3904.

The accuracy of the algorithm is proportional to the amount of memory

Researchers Chakraborty, Variyam and Meel mathematically proved that the accuracy of the CVM algorithm is proportional to the amount of memory Proportional to the size of the quantity.

And "Hamlet" has exactly 3967 unique words. (By ordinary counting method)

In the experiment using 100 word memory, the average estimate of the results of the 5 rounds of experiments is 3955 words.

With 1,000 words in memory, the average memory capacity increased to 3,964.

Variyam said, "If (the amount of memory) is large enough to accommodate all words, then we can achieve 100% accuracy."

William Kuszmau of Harvard University said, "This is a good example of how even very basic and widely studied problems can sometimes have simple but not obvious answers. Solutions are still to be discovered."

The above is the detailed content of Groundbreaking CVM algorithm solves more than 40 years of counting problems! Computer scientist flips coin to figure out unique word for 'Hamlet'. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

One Prompt Can Bypass Every Major LLM's SafeguardsApr 25, 2025 am 11:16 AM

HiddenLayer's groundbreaking research exposes a critical vulnerability in leading Large Language Models (LLMs). Their findings reveal a universal bypass technique, dubbed "Policy Puppetry," capable of circumventing nearly all major LLMs' s

5 Mistakes Most Businesses Will Make This Year With SustainabilityApr 25, 2025 am 11:15 AM

The push for environmental responsibility and waste reduction is fundamentally altering how businesses operate. This transformation affects product development, manufacturing processes, customer relations, partner selection, and the adoption of new

H20 Chip Ban Jolts China AI Firms, But They've Long Braced For ImpactApr 25, 2025 am 11:12 AM

The recent restrictions on advanced AI hardware highlight the escalating geopolitical competition for AI dominance, exposing China's reliance on foreign semiconductor technology. In 2024, China imported a massive $385 billion worth of semiconductor

If OpenAI Buys Chrome, AI May Rule The Browser WarsApr 25, 2025 am 11:11 AM

The potential forced divestiture of Chrome from Google has ignited intense debate within the tech industry. The prospect of OpenAI acquiring the leading browser, boasting a 65% global market share, raises significant questions about the future of th

How AI Can Solve Retail Media's Growing PainsApr 25, 2025 am 11:10 AM

Retail media's growth is slowing, despite outpacing overall advertising growth. This maturation phase presents challenges, including ecosystem fragmentation, rising costs, measurement issues, and integration complexities. However, artificial intell

'AI Is Us, And It's More Than Us'Apr 25, 2025 am 11:09 AM

An old radio crackles with static amidst a collection of flickering and inert screens. This precarious pile of electronics, easily destabilized, forms the core of "The E-Waste Land," one of six installations in the immersive exhibition, &qu

Google Cloud Gets More Serious About Infrastructure At Next 2025Apr 25, 2025 am 11:08 AM

Google Cloud's Next 2025: A Focus on Infrastructure, Connectivity, and AI Google Cloud's Next 2025 conference showcased numerous advancements, too many to fully detail here. For in-depth analyses of specific announcements, refer to articles by my

Talking Baby AI Meme, Arcana's $5.5 Million AI Movie Pipeline, IR's Secret Backers RevealedApr 25, 2025 am 11:07 AM

This week in AI and XR: A wave of AI-powered creativity is sweeping through media and entertainment, from music generation to film production. Let's dive into the headlines. AI-Generated Content's Growing Impact: Technology consultant Shelly Palme

See all articles