


'Mathematical noob' ChatGPT understands human preferences all too well! Ask it for a random number and you get the ultimate answer to the universe
When it comes to generating random numbers, ChatGPT has also absorbed humanity's habits.
ChatGPT may be a bullshit artist and a spreader of misinformation, but it is no "mathematician"!
Recently, Colin Fraser, a data scientist at Meta, discovered that ChatGPT cannot generate truly random numbers; what it produces are more like "human random numbers."
From his experiments, Fraser concluded: "ChatGPT really likes 42 and 7."
Netizens quipped that this only proves how much humans like those numbers.
ChatGPT also loves "The Ultimate Answer to the Universe"
In his test, the prompt entered by Fraser was as follows:
"Pick a random number between 1 and 100. Just return the number; Don't include any other text or punctuation in the response."
Prompting ChatGPT this way over and over, Fraser collected 2,000 answers and compiled them into a table.
As you can see, the number 42 appears most frequently, about 10% of the time. Numbers containing the digit 7 also appear remarkably often,
especially those between 71 and 79. Outside that range, 7 frequently shows up as the second digit.
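Anyone curious can reproduce the tally. Here is a minimal sketch of the experiment, assuming the OpenAI Python SDK (v1+) and the model name "gpt-3.5-turbo"; the article does not say exactly which endpoint or model Fraser queried.

```python
# Minimal sketch of Fraser's experiment. Assumptions: the OpenAI Python SDK
# (>= 1.0) and the "gpt-3.5-turbo" model; neither is named in the article.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("Pick a random number between 1 and 100. Just return the number; "
          "don't include any other text or punctuation in the response.")

def sample_number() -> int | None:
    """Ask the model for one 'random' number; None if the reply isn't a number."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = resp.choices[0].message.content.strip()
    return int(text) if text.isdigit() else None

# Collect samples (2,000 calls are slow and cost money; shrink N to try it out).
samples = [n for n in (sample_number() for _ in range(2000)) if n is not None]
counts = Counter(samples)

print("Most common answers:", counts.most_common(10))
# How often does the digit 7 appear anywhere in an answer?
with_seven = sum(c for n, c in counts.items() if "7" in str(n))
print(f"Share of answers containing a 7: {with_seven / len(samples):.1%}")
```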
What does 42 mean?
Anyone who has read Douglas Adams's science-fiction classic "The Hitchhiker's Guide to the Galaxy" knows that 42 is "the answer to life, the universe, and everything."
Put simply, 42 (like 69) is an Internet meme number. This suggests that ChatGPT is not really a random number generator at all; it simply surfaces numbers that are popular in the huge datasets scraped from the web.
The frequency of 7, in turn, shows ChatGPT catering to human preferences.
In Western culture, 7 is widely regarded as a lucky number; there is even the saying "lucky 7," much as Chinese culture is attached to the number 8.
Interestingly, Fraser also found that GPT-4 seems to overcompensate for this.
When GPT-4 is asked for many numbers, the "random" numbers it returns are distributed too evenly.
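How would one show that a distribution is "too even"? A standard tool is the chi-square goodness-of-fit test; the sketch below (scipy is my choice here, not the article's) shows that genuinely uniform draws fluctuate naturally, so a near-zero test statistic is itself evidence of non-randomness.

```python
# Quantifying "too evenly distributed" with a chi-square goodness-of-fit test.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)

# Genuinely uniform draws over 1..100 show natural fluctuation between bins.
true_random = np.bincount(rng.integers(1, 101, size=2000), minlength=101)[1:]

# An over-compensating generator that spreads 2,000 answers perfectly evenly.
too_even = np.full(100, 20)

for name, counts in [("true random", true_random), ("too even", too_even)]:
    stat, p = chisquare(counts)  # default null hypothesis: uniform
    # For real uniform noise the statistic hovers around its 99 degrees of
    # freedom; a statistic near zero means the counts are suspiciously flat.
    print(f"{name}: chi2 = {stat:.1f}, p = {p:.3f}")
```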
In short, ChatGPT produces its responses by prediction; it does not actually "think" its way to an answer.
Evidently, a chatbot touted as nearly omnipotent is still a bit silly.
Ask it to plan a road trip and it will have you stop in a town that doesn't even exist. Ask it for a random number and it will most likely pick one driven by a popular meme.
Some netizens tried it themselves and found that GPT-4 does like 42.
If ChatGPT ends up just repeating online clichés, what’s the point?
GPT-4: violating the rules of machine learning
The birth of GPT-4 was exciting, but also disappointing.
OpenAI not only released little new information about GPT-4, it did not even disclose the model's size; instead it played up GPT-4's better-than-human performance on a long list of professional and standardized tests.
Take the US bar exam: GPT-3.5 scored around the 10th percentile, while GPT-4 scored around the 90th percentile.
However, Arvind Narayanan, a professor of computer science at Princeton University, and doctoral student Sayash Kapoor have written that
OpenAI may have tested on its training data, and that, besides, human benchmarks are meaningless for chatbots.
Specifically, OpenAI may be violating a cardinal rule of machine learning: don't test on training data. Test data and training data must be kept separate; otherwise the score measures memorization rather than generalization, the same inflation you see when an overfit model is evaluated on its own training set.
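The rule is easy to state in code. A toy illustration with scikit-learn (a generic example, nothing to do with OpenAI's actual pipeline):

```python
# Scoring a model on its own training data overstates how well it generalizes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("score on training data:", model.score(X_train, y_train))  # ~1.0, misleading
print("score on held-out data:", model.score(X_test, y_test))    # the honest number
```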
Setting that problem aside, there is a bigger one.
Language models solve problems differently than humans do, so these results say little about how a bot will perform on the real-world problems professionals face. A lawyer's job is not to answer bar exam questions all day.
Problem 1: Training data contamination
To evaluate GPT-4's programming ability, OpenAI tested it on problems from Codeforces, a Russian competitive-programming website.
Surprisingly, Horace He pointed out online that, in the easy category, GPT-4 solved 10 problems from before 2021 but none of the 10 most recent problems.
GPT-4's training data cutoff is September 2021.
This strongly suggests that the model memorized the solutions in its training set, or at least partially memorized them, enough to fill in what it could not recall.
To gather further evidence for this hypothesis, Arvind Narayanan tested GPT-4 on Codeforces problems from different points in 2021.
He found that GPT-4 could solve easy-category problems posted before September 5, but none posted after September 12.
In fact, we can definitively show that it has memorized problems from its training set: when prompted with the title of a Codeforces problem, GPT-4 includes a link to the exact contest in which that problem appeared. Notably, GPT-4 cannot access the internet, so memorization is the only explanation.
GPT-4 memorizes Codeforces problems from before its training cutoff
For benchmarks other than programming, Professor Narayanan said: "We don't know how to cleanly separate the problems by time period, so we think it would have been difficult for OpenAI to avoid data contamination. For the same reason, we cannot run experiments to test how performance varies with date."
There is, however, another angle: if this is memorization, then GPT should be highly sensitive to the wording of a question.
In February, Melanie Mitchell, a professor at the Santa Fe Institute, demonstrated this with an MBA exam question: slightly altering a few details was enough to fool ChatGPT (GPT-3.5), even though the same change would never fool a person.
More detailed experiments like this would be valuable.
Given OpenAI's lack of transparency, Professor Narayanan cannot say for certain that the problem is data contamination. But what is certain is that OpenAI's method of detecting contamination is sloppy:
"We measure cross-contamination between our evaluation dataset and the pre-training data using substring matching. Both the evaluation and training data are processed by removing all spaces and symbols, keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (if the example is shorter than 50 characters, the entire example is used). A match is identified if any of the three sampled evaluation substrings is a substring of a processed training example. This yields a list of contaminated examples, which we discard and rerun to obtain uncontaminated scores."
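Read literally, the described check amounts to something like the following sketch (OpenAI's actual implementation is not public, so the details here are assumptions):

```python
# Substring-matching contamination check, as described in OpenAI's report.
import random
import re

def normalize(text: str) -> str:
    """Drop all spaces and symbols, keeping only letters and digits."""
    return re.sub(r"[^A-Za-z0-9]", "", text)

def is_contaminated(eval_example: str, train_examples: list[str],
                    n_samples: int = 3, length: int = 50) -> bool:
    """Flag an eval example if any sampled substring occurs in training data."""
    processed = normalize(eval_example)
    if len(processed) <= length:
        candidates = [processed]  # short examples are used whole
    else:
        starts = [random.randrange(len(processed) - length + 1)
                  for _ in range(n_samples)]
        candidates = [processed[s:s + length] for s in starts]
    train_processed = [normalize(t) for t in train_examples]
    return any(c in t for c in candidates for t in train_processed)
```

Change a single variable name or number inside a sampled substring and the match fails, which is exactly the weakness discussed next.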
This method simply does not stand up to scrutiny.
If a test problem appears in the training set with only the names and numbers changed, substring matching cannot detect it. More reliable methods are now available, such as embedding distance.
But if OpenAI were to use embedding distance, how much similarity counts as too similar? There is no objective answer to that question.
So even performance on multiple-choice standardized tests, which looks straightforward, involves a good deal of subjectivity.
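For concreteness, an embedding-distance check might look like the sketch below; the embeddings endpoint and the "text-embedding-ada-002" model name are my assumptions, and the threshold parameter is exactly the subjective choice the paragraph above points out.

```python
# Embedding-distance contamination check: catches near-duplicates even when
# names and numbers were changed. Model choice and threshold are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str], model: str = "text-embedding-ada-002") -> np.ndarray:
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

def too_similar(eval_text: str, train_text: str, threshold: float = 0.9) -> bool:
    """There is no principled value for `threshold`; that is the rub."""
    a, b = embed([eval_text, train_text])
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold
```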
Problem 2: Professional exams are not a valid way to compare human and bot abilities
Memorization sits on a spectrum. Even if a language model has never seen an exact problem in its training set, the sheer size of the corpus means it has inevitably seen many very similar examples.
That lets it get away without deeper reasoning. The benchmark results therefore give us no evidence that the language model is acquiring the deep reasoning skills human test-takers need.
On some practical tasks, shallow reasoning may be enough for GPT-4 to be competent, but that is not always the case.
Benchmarks are widely used to compare large models, and widely criticized for collapsing a multidimensional evaluation into a single number.
It is regrettable that OpenAI chose to lean so heavily on such tests in its GPT-4 evaluation, compounded by its inadequate handling of data contamination.