February 1, 2025 at 3:49 pm

Computer Scientists Discover A Simple Solution That Dramatically Improves The Efficiency Of Computing Large Amounts Of Data

by Michael Levanduski


Computers are great at processing large amounts of data very quickly. What computers aren’t nearly as good at (though the average person wouldn’t know it) is determining how many distinct items appear within a large collection of data, such as how many unique users visited a website. This is known to computer scientists as the distinct elements problem.

It is not obvious why this would be hard, because for humans, grouping like objects together is almost second nature. We do it without thinking many times throughout the day.

If you order a variety of food from a fast food restaurant, for example, you can look into the bag and confirm you got your 3 hamburgers, 4 cheeseburgers, 2 chicken burgers, and 5 orders of French fries in seconds. In most cases, there is no need to mentally count up each item.

Computers, however, don’t have that natural ability. They also routinely deal with far larger quantities. Rather than one bag of food, a computer might be analyzing data from millions of users, or billions of entries for a task like analyzing the human genome.


The recent growth of artificial intelligence (AI) built on large language models has made this issue obvious. Since this technology requires a system to analyze enormous amounts of data in order to produce a human-like response, it has to identify like objects millions of times.

Most systems currently rely on what is essentially a brute force approach to counting each item, a family of methods known as hashing. Vinodchandran Variyam, a professor at the University of Nebraska-Lincoln’s School of Computing, explained this in a statement, saying:

“Earlier known algorithms all were ‘hashing based,’ and the quality of that algorithm depended on the quality of hash functions that algorithm chooses.”
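To see why memory becomes the bottleneck, consider the most straightforward approach: keep a hash-based set of every item you have seen and report its size at the end. The sketch below is only an illustration for this article (the function name and sample data are made up), but it shows how the memory needed grows with the number of distinct items.

```python
# Illustrative sketch only: the simplest way to count distinct items is to
# remember every one of them in a hash-based set. The answer is exact, but
# the memory used grows with the number of distinct items in the data.
def count_distinct_exact(stream):
    seen = set()              # one entry per distinct item seen so far
    for item in stream:
        seen.add(item)
    return len(seen)

print(count_distinct_exact([3, 1, 4, 1, 5, 9, 2, 6, 5, 3]))  # prints 7
```

With billions of entries, a set like this can easily outgrow the available memory, which is exactly the problem the researchers set out to sidestep.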

He, along with Sourav Chakraborty of the Indian Statistical Institute and Kuldeep Meel of the University of Toronto, worked to find a new solution to the distinct elements problem.

The solution they came up with is surprisingly simple, yet effective.

The method, which is being called the CVM algorithm after the initials of its inventors, dramatically cuts down on the amount of memory required for this type of work. With the new demands from AI and other computing workloads, anything that significantly reduces memory requirements is critical.

While this system may seem complicated to a human, computers actually find it much easier. It involves reading in entries from the data until the memory set aside for the task is full. For example, if you want to find the number of unique numbers in a long list of millions of numbers, the system might first read in the first 100,000 numbers.

Once those numbers are in place, the system ‘flips a coin’ for each one. If the coin comes up heads, the number stays in memory; if it comes up tails, the number is discarded. The result is that roughly 50,000 numbers remain in memory.

Next, the memory is filled back up with the next batch of numbers (roughly numbers 100,001 to 150,000 in this example). The coin flipping process is repeated, except these numbers are only retained if the system flips heads twice in a row.

The process repeats, with the number of consecutive heads required to keep a number going up by one each round, until all of the numbers have been analyzed. The final estimate comes from the numbers still in memory, scaled up by how unlikely it was for any one of them to survive.
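Below is a minimal sketch of this idea in Python. It is not the authors’ reference code, and the function name, memory size, and failure handling are choices made for this article. The key observation is that requiring a number to flip heads k times in a row to survive round k is the same as keeping it with probability 1/2^k, which the sketch tracks with a single probability p.

```python
import random

def cvm_estimate(stream, memory_size):
    """Estimate the number of distinct items in `stream` using at most
    `memory_size` slots, following the coin-flipping idea described above."""
    p = 1.0        # chance a new item is kept; halves each time memory fills
    kept = set()   # the items currently held in memory

    for item in stream:
        kept.discard(item)                # re-flip for repeats of an item
        if random.random() < p:           # keep the item with probability p
            kept.add(item)
        if len(kept) == memory_size:      # memory is full: start a new round
            # flip a coin for every stored item; tails means it is forgotten
            kept = {x for x in kept if random.random() < 0.5}
            p /= 2                        # survivors now represent more items
            if len(kept) == memory_size:  # extremely unlucky coin flips
                raise RuntimeError("memory still full after a purge")

    # each surviving item stands in for roughly 1/p distinct items
    return len(kept) / p
```

Handling repeats by discarding and re-flipping keeps the memory from being wasted on duplicates, which is what makes the final scaled count an estimate of distinct items rather than total items.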

When working with small amounts of data, this system produces reasonably accurate results. When the amount of data is massive, however, the estimates come out remarkably close to the true count.
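As a rough illustration of that accuracy, here is a hypothetical run reusing the cvm_estimate sketch above on made-up data:

```python
import random

# a long, repetitive stream: one million draws from 50,000 possible values
stream = [random.randint(1, 50_000) for _ in range(1_000_000)]

true_count = len(set(stream))                        # exact answer, for comparison
estimate = cvm_estimate(stream, memory_size=10_000)  # uses only 10,000 memory slots

print(true_count, round(estimate))  # the estimate typically lands close to the truth
```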

A paper on this method was posted to the arXiv preprint server. While that version has not been formally peer reviewed, many researchers have looked it over and even tested it to confirm that it works as intended.

This is such a massive leap forward that it is even starting to be taught in some classrooms.


Donald Knuth, who is the author of The Art of Computer Programming, wrote a paper in support of this method. In it he said:

“We believe that this will be a mainstream algorithm that is taught in the first computer science course on algorithms in general and probabilistic algorithm in particular. It’s wonderfully suited to teaching students who are learning the basics of computer science. I’m pretty sure that something like this will eventually become a standard textbook topic.”

Over time, this method will likely be adopted into popular systems, and while end users may not notice a huge difference, the computer scientists working on those systems will appreciate its simplicity and the reduction in resources needed to run it. The method is being described as a new way of counting for computers, and its total impact may be immeasurable.

Who knew that finding distinct elements was even a problem?

If you thought that was interesting, you might like to read about a quantum computer simulation that has “reversed time” and physics may never be the same.