{"id":91892,"date":"2024-07-10T13:43:05","date_gmt":"2024-07-10T13:43:05","guid":{"rendered":"https:\/\/news.talkwithrattan.com\/index.php\/2024\/07\/10\/ais-understanding-and-reasoning-skills-cant-be-assessed-by-current-tests\/"},"modified":"2024-07-10T13:43:05","modified_gmt":"2024-07-10T13:43:05","slug":"ais-understanding-and-reasoning-skills-cant-be-assessed-by-current-tests","status":"publish","type":"post","link":"https:\/\/news.talkwithrattan.com\/index.php\/2024\/07\/10\/ais-understanding-and-reasoning-skills-cant-be-assessed-by-current-tests\/","title":{"rendered":"AI&#8217;s understanding and reasoning skills can&#8217;t be assessed by current tests"},"content":{"rendered":"<div style=\"text-align:center\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_arithmetic_mobile.jpg?fit=680%2C255&amp;ssl=1\" class=\"attachment-post-thumbnail size-post-thumbnail wp-post-image\" alt=\"AI&#8217;s understanding and reasoning skills can&#8217;t be assessed by current tests\" title=\"AI&#8217;s understanding and reasoning skills can&#8217;t be assessed by current tests\" \/><\/div> \r\n<br><div style=\"clear:both\">\n\n\n<p>Consider, for example, <a href=\"https:\/\/github.com\/hendrycks\/test?tab=readme-ov-file\" target=\"_blank\" rel=\"noopener\">Massive Multitask Language Understanding, or MMLU<\/a>, a popular benchmark for assessing the knowledge acquired by LLMs. MMLU includes some 16,000 multiple-choice questions covering 57 topics, including anatomy, geography, world history and law. Benchmarks such as BIG-bench (the BIG stands for Beyond the Imitation Game) consist of a more varied collection of tasks. <a href=\"https:\/\/aclanthology.org\/N19-1246\/\" target=\"_blank\" rel=\"noopener\">Discrete Reasoning Over Paragraphs, or DROP<\/a>, claims to test reading comprehension and reasoning. WinoGrande and HellaSwag purport to test commonsense reasoning. 
Models are pitted against each other on these benchmarks, as well as against humans, and models sometimes perform better than humans.<\/p>\n\n\n\n<p>But \u201cAI surpassing humans on a benchmark that is named after a general ability is not the same as AI surpassing humans on that general ability,\u201d computer scientist Melanie Mitchell pointed out in a <a href=\"https:\/\/aiguide.substack.com\/p\/ai-now-beats-humans-at-basic-tasks\" target=\"_blank\" rel=\"noopener\">May edition of her Substack newsletter<\/a>.<\/p>\n\n\n\n<p>These evaluations don\u2019t necessarily deliver all that they claim, and they might not be a good match for today\u2019s AI. One study posted earlier this year at arXiv.org tested 11 LLMs and found that just <a href=\"https:\/\/arxiv.org\/pdf\/2402.01781\" target=\"_blank\" rel=\"noopener\">changing the order of the multiple-choice answers<\/a> in a benchmark like MMLU can affect performance.<\/p>\n\n\n\n<p>Still, industry leaders tend to conflate impressive performance on the tasks LLMs are trained to do, like engaging in conversation or summarizing text, with higher-level cognitive capabilities like understanding, knowledge and reasoning, which are hard to define and harder to evaluate. But for LLMs, <a href=\"https:\/\/openreview.net\/forum?id=CF8H8MS5P8\" target=\"_blank\" rel=\"noopener\">generating content is not dependent on understanding it<\/a>, researchers reported in a study presented in May in Vienna at the International Conference on Learning Representations. 
When the researchers asked GPT-4 and other AI models to answer questions based on AI-generated text or images, they frequently couldn\u2019t answer correctly.<\/p>\n\n\n\n<p>Nouha Dziri, a research scientist studying language models at the Allen Institute for AI in Seattle and coauthor on that study, calls that \u201ca paradox compared to how humans actually operate.\u201d For humans, she says, \u201cunderstanding is a prerequisite for the ability to generate the correct text.\u201d<\/p>\n\n\n\n<p>What\u2019s more, as Mitchell and colleagues note in a paper in <em>Science<\/em> last year, <a href=\"https:\/\/www.science.org\/doi\/10.1126\/science.adf6369\" target=\"_blank\" rel=\"noopener\">benchmark performance is often reported with aggregate metrics<\/a> that \u201cobfuscate key information about where systems tend to succeed or fail.\u201d Any desire to look deeper is thwarted because specific details of performance aren\u2019t made publicly available.<\/p>\n\n\n\n<p>Researchers are now imagining how better assessments might be designed. \u201cIn practice, it\u2019s hard to do good evaluations,\u201d says Yanai Elazar, also working on language models at the Allen Institute. \u201cIt\u2019s an active research area that many people are working on and making better.\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why cognitive benchmarks don\u2019t always work<\/h2>\n\n\n\n<p>Aside from transparency and inflated claims, there are underlying issues with benchmark evaluations.<\/p>\n\n\n\n<p>One of the challenges is that benchmarks are good for only a certain amount of time. There\u2019s a concern that today\u2019s LLMs have been <a href=\"https:\/\/aclanthology.org\/2024.eacl-long.5\/\" target=\"_blank\" rel=\"noopener\">trained on the testing data<\/a> from the very benchmarks intended to evaluate them. The benchmark datasets are available online, and the training data for LLMs are typically scraped from the entire Web. 
For instance, a technical report from <a href=\"https:\/\/arxiv.org\/pdf\/2303.08774\" target=\"_blank\" rel=\"noopener\">OpenAI, which developed ChatGPT, acknowledged<\/a> that portions of benchmark datasets including BIG-bench and DROP were part of GPT-4\u2019s training data. There\u2019s some evidence that GPT-3.5, which powers the free version of ChatGPT, <a href=\"https:\/\/arxiv.org\/abs\/2311.09783\" target=\"_blank\" rel=\"noopener\">has encountered the MMLU benchmark dataset<\/a>.<\/p>\n\n\n\n\n\n<p>But much of the training data is not disclosed. \u201cThere\u2019s no way to prove or disprove it, outside of the company just purely releasing the training datasets,\u201d says Erik Arakelyan of the University of Copenhagen, who studies natural language understanding.<\/p>\n\n\n\n<p>Today\u2019s LLMs might also rely on shortcuts to arrive at the correct answers without performing the cognitive task being evaluated. \u201cThe problem often comes when there are things in the data that you haven\u2019t thought about necessarily, and basically the model can cheat,\u201d Elazar says. For instance, a study reported in 2019 found evidence of <a href=\"https:\/\/arxiv.org\/abs\/2311.09783\" target=\"_blank\" rel=\"noopener\">such statistical associations<\/a> in the Winograd Schema Challenge dataset, a commonsense reasoning benchmark that predates WinoGrande.<\/p>\n\n\n\n<p>The <a href=\"https:\/\/cdn.aaai.org\/ocs\/4492\/4492-21843-1-PB.pdf\" target=\"_blank\" rel=\"noopener\">Winograd Schema Challenge<\/a>, or WSC, was proposed in 2011 as a test for intelligent behavior of a system. 
Though many people are familiar with <a href=\"https:\/\/www.sciencenews.org\/article\/mind-math\">the Turing test<\/a> as a way to evaluate intelligence, researchers had begun to propose modifications and alternatives that weren\u2019t as subjective and didn\u2019t require the AI to engage in deception to pass the test (<em>SN: 6\/15\/12<\/em>).<\/p>\n\n\n\n<p>Instead of a free-form conversation, WSC features pairs of sentences that mention two entities and use a pronoun to refer to one of the entities. Here\u2019s an example pair:<\/p>\n\n\n\n<p>Sentence 1: In the storm, the tree fell down and crashed through the roof of my house. Now, I have to get it removed.<\/p>\n\n\n\n<p>Sentence 2: In the storm, the tree fell down and crashed through the roof of my house. Now, I have to get it repaired.<\/p>\n\n\n\n<p>A language model scores correctly if it can successfully match the pronoun (\u201cit\u201d) to the right entity (\u201cthe roof\u201d or \u201cthe tree\u201d). The sentences usually differ by a special word (\u201cremoved\u201d or \u201crepaired\u201d) that when exchanged changes the answer. Presumably only a model that relies on commonsense world knowledge and not linguistic clues could provide the correct answers.<\/p>\n\n\n\n<div class=\"wp-block-sciencenews-content-sidebar\">\n<h3 class=\"wp-block-heading\">Superior skills?<\/h3>\n\n\n\n<p>In recent years, AI has started to outperform humans on tests of image classification, language understanding, reading comprehension and more (skills that surpass the human baseline in the graph below). 
But some experts warn that current benchmarks are not up to the job of evaluating an AI model\u2019s understanding and reasoning.<\/p>\n\n\n<iframe loading=\"lazy\" title=\"Embed\" class=\"sn-responsive-iframe\" id=\"sn-responsive-iframe-47\" src=\"https:\/\/flo.uri.sh\/visualisation\/18678036\/embed\" width=\"100%\" height=\"600\" layout=\"responsive\" frameborder=\"0\" allowfullscreen=\"\">\n\t<\/iframe><\/div>\n\n\n\n<p>But it turns out that in WSC, there are statistical associations that offer clues. Consider the example above. Large language models, trained on huge amounts of text, would have encountered many more examples of a roof being repaired than a tree being repaired. A model might select the statistically more likely word among the two options rather than rely on any kind of commonsense reasoning.<\/p>\n\n\n\n<p>In a study reported in 2021, Elazar and colleagues gave nonsensical modifications of WSC sentences to <a href=\"https:\/\/arxiv.org\/pdf\/1907.11692\" target=\"_blank\" rel=\"noopener\">RoBERTa<\/a>, an LLM that has <a href=\"https:\/\/paperswithcode.com\/sota\/coreference-resolution-on-winograd-schema\" target=\"_blank\" rel=\"noopener\">scored more than 80 percent on the WSC benchmark in some cases<\/a>. The model got it right at least 60 percent of the time even though humans wouldn\u2019t be expected to answer correctly. Since random guessing couldn\u2019t yield more than a 50 percent score, <a href=\"https:\/\/aclanthology.org\/2021.emnlp-main.819\/\" target=\"_blank\" rel=\"noopener\">spurious associations<\/a> must have been giving away the answer.<\/p>\n\n\n\n<p>To be good measures of progress, benchmark datasets cannot be static. They must be adapted alongside state-of-the-art models and rid of any specious shortcuts, Elazar and other evaluation researchers say. In 2019, after the WSC shortcuts had come to light, another group of researchers released the now commonly used WinoGrande as a harder commonsense benchmark. 
The benchmark dataset has more than 43,000 sentences with an accompanying algorithm that can filter out sentences with spurious associations.<\/p>\n\n\n\n<p>For some researchers, the fact that LLMs are passing benchmarks so easily simply means that <a href=\"https:\/\/hai.stanford.edu\/news\/ai-benchmarks-hit-saturation\" target=\"_blank\" rel=\"noopener\">more comprehensive benchmarks<\/a> need developing. For instance, researchers might turn to a collection of varied benchmark tasks that tackle different facets of common sense such as conceptual understanding or the ability to plan future scenarios. \u201cThe challenge is how do we come up with a more adversarial, more challenging task that will tell us the true capabilities of these language models,\u201d Dziri says. \u201cIf the model is scoring 100 percent on them, it might give us a false illusion about their capabilities.\u201d<\/p>\n\n\n\n<p>But others are more skeptical that models performing well on the benchmarks necessarily possess the cognitive abilities in question. If a model tests well on a dataset, it just tells us that it performs well on that particular dataset and nothing more, Elazar says. Even though WSC and WinoGrande are considered tests for common sense, they just test for pronoun identification. <a href=\"https:\/\/allenai.org\/data\/hellaswag\" target=\"_blank\" rel=\"noopener\">HellaSwag<\/a>, another commonsense benchmark, tests how well a model can pick the most probable ending for a given scenario.<\/p>\n\n\n\n<p>While these individual tasks might require common sense or understanding if constructed correctly, they still don\u2019t make up the entirety of what it means to have common sense or to understand. 
Other forms of commonsense reasoning, involving social interactions or comparing quantities, <a href=\"https:\/\/arxiv.org\/pdf\/2302.04752\" target=\"_blank\" rel=\"noopener\">have been poorly explored<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Taking a different approach to testing<\/h2>\n\n\n\n<p>Systematically digging into the mechanisms required for understanding may offer more insight than benchmark tests, Arakelyan says. That might mean testing <a href=\"https:\/\/arxiv.org\/pdf\/2206.14187\" target=\"_blank\" rel=\"noopener\">AI\u2019s underlying grasp of concepts<\/a> using what are called counterfactual tasks. In these cases, the model is presented with a twist on a commonplace rule that it is unlikely to have encountered in training, say an alphabet with some of the letters mixed up, and asked to solve problems using the new rule.<\/p>\n\n\n\n<p>Other approaches include analyzing the AI\u2019s ability to generalize from simple to more complex problems or directly probing under what circumstances AI fails. There might also be ways to test for commonsense reasoning, for example, by ruling out unrelated mechanisms like memorization, pattern-matching and shortcuts.<\/p>\n\n\n\n<p>In a study reported in March, Arakelyan and colleagues tested whether six LLMs that have scored highly on language understanding benchmarks and thus are said to understand the overall meaning of a sentence <a href=\"https:\/\/aclanthology.org\/2024.eacl-long.27\/\" target=\"_blank\" rel=\"noopener\">can also understand<\/a> a slightly paraphrased but logically equivalent version of the same sentence.<\/p>\n\n\n\n<p>Language understanding is typically evaluated using a task called natural language inference. The LLM is presented with a premise and a hypothesis and asked to choose whether the premise implies, contradicts or is neutral toward the hypothesis. 
But as the models become bigger, trained on more and more data, more carefully crafted evaluations are required to determine whether the models are relying on shortcuts that, say, focus on single words or sets of words, Arakelyan says.<\/p>\n\n\n\n<p>To try to get a better sense of language understanding, the team compared how a model answered the standard test with how it answered when given the same premise sentence but with slightly paraphrased hypothesis sentences. A model with true language understanding, the researchers say, would make the same decisions as long as the slight alteration preserves the original meaning and logical relationships. For instance, the premise sentence \u201cThere were beads of perspiration on his brow\u201d implies the hypothesis \u201cSweat built up upon his face\u201d as well as the slightly altered \u201cThe sweat had built up on his face.\u201d<\/p>\n\n\n\n<p>The team used a separate LLM, called flan-t5-xl and released by Google, to come up with variations of hypothesis sentences from three popular English natural language inference datasets. The LLMs under testing had encountered one of the datasets during training but not the other two. First, the team tested the models on the original datasets and picked only those sentences that the models classified correctly to be paraphrased. This ensured that any performance difference could be attributed to the sentence variations. On top of that, the researchers fed the original hypothesis sentences and their variations to language models identical to ones tested and capable of evaluating if the pairs were equivalent in meaning. 
Only those deemed equal by both the model and human evaluators were used to test language understanding.<\/p>\n\n\n\n<p>But for a sizable number of sentences, the models tested changed their decision, sometimes even switching from \u201cimplies\u201d to \u201ccontradicts.\u201d When the researchers used sentences that did not appear in the training data, the LLMs changed as many as 58 percent of their decisions.<\/p>\n\n\n\n<p>\u201cThis essentially means that models are very finicky when understanding meaning,\u201d Arakelyan says. This type of framework, unlike benchmark datasets, can better reveal whether a model has true understanding or whether it is relying on clues like the distribution of the words.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to evaluate step by step<\/h2>\n\n\n\n<p>Tracking an LLM\u2019s step-by-step process is another way to systematically assess whether it uses reasoning and understanding to arrive at an answer. In one approach, Dziri\u2019s team tested the ability of LLMs including GPT-4, GPT-3.5 and GPT-3 (a predecessor of both) to carry out multidigit multiplication. A model has to break down such a task into sub-steps that researchers can examine individually.<\/p>\n\n\n\n<p>After giving the LLMs a problem, like 7 x 29, the researchers checked the answers at each sub-step \u2014 after single-digit multiplication, after carrying over and after summation. While the models were perfect at multiplication of single and two-digit numbers, accuracy deteriorated as the number of digits increased. For multiplication problems with four- and five-digit numbers, <a href=\"https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2023\/file\/deb3c28192f979302c157cb653c15e90-Paper-Conference.pdf\" target=\"_blank\" rel=\"noopener\">the models hardly got any answers right<\/a>. 
Lower-digit problems \u201ccan be easily memorized,\u201d Dziri says, but the LLMs\u2019 performance \u201cstarts degrading when we increase the complexity.\u201d<\/p>\n\n\n\n<p>Perhaps the models hadn\u2019t encountered enough examples in the training data to learn how to solve more complex multiplication problems. With that idea, Dziri and colleagues further fine-tuned GPT-3 by training it on almost all the multiplication problems up to four digits by two digits, as well as providing step-by-step instructions on how to solve all the multiplication problems up to three digits by two digits. The team reserved 20 percent of multiplication problems for testing.<\/p>\n\n\n\n<p>Without access to the models\u2019 original training data and process, the researchers don\u2019t know how the models might be tackling the task, Dziri says. \u201cWe have this simple assumption that if we humans follow this algorithm, it should be quite intuitive for the model to follow it, because it\u2019s been trained on human language and human reasoning tasks.\u201d<\/p>\n\n\n\n<p>For humans, carrying out five- or six-digit multiplication is fairly straightforward. The underlying approach is no different from multiplying fewer digits. But though the model performed with near-perfect accuracy on examples it had encountered during training, it stumbled on unseen examples. These results indicate that the model was unable to learn the underlying reasoning needed for multidigit multiplication and apply these steps to new examples.<\/p>\n\n\n\n<p>Surprisingly, when the researchers investigated the models\u2019 answers at each sub-step, they found that even when the final answers were right, the underlying calculations and reasoning \u2014 the answers at each sub-step \u2014 could be completely wrong. This confirms that the model sometimes relies on memorization, Dziri says. 
Though the answer might be right, it doesn\u2019t say anything about the LLM\u2019s ability to generalize to harder problems of the same nature \u2014 a key part of true understanding or reasoning.<\/p>\n\n\n\n<div class=\"wp-block-sciencenews-content-sidebar\">\n<h3 class=\"wp-block-heading\">Counterfactual testing<\/h3>\n\n\n\n<p>One way to assess AI\u2019s grasp of concepts is to use counterfactual tasks, which add a twist on a common rule that the AI is unlikely to have seen in training. Researchers recently presented GPT-4 with several such problems. A couple of examples are shown here.<\/p>\n\n\n<figure class=\"wp-block-image \"><picture class=\"sn-responsive-image\"><source srcset=\"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_arithmetic_desktop.jpg?w=680&amp;ssl=1 680w, https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_arithmetic_desktop.jpg?resize=330%2C111&amp;ssl=1 330w\" width=\"680\" height=\"229\" media=\"(min-width: 600px)\" sizes=\"(max-width: 1023px) 100vw, 680px\"><source srcset=\"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_arithmetic_mobile.jpg?w=680&amp;ssl=1 680w, https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_arithmetic_mobile.jpg?resize=330%2C124&amp;ssl=1 330w\" width=\"680\" height=\"255\"><\/source><\/source><\/picture><figcaption><span class=\"credit mobile-credit wp-credit-3141301\">Z. Wu <em>et al<\/em>\/arXiv.org 2024<\/span><span class=\"credit desktop-credit wp-credit-3141300\">Z. Wu <em>et al<\/em>\/arXiv.org 2024<\/span><\/figcaption><\/figure>\n\n\n<p>In a test of numerical reasoning, GPT-4 had to add 27 + 62; in the counterfactual version of the task, it had to solve the same problem using a base-9 numerical system. 
The model scored higher than chance (represented by the dashed line) in both tasks but did far better on the default (pink bar) versus counterfactual (green) version.<\/p>\n\n\n<figure class=\"wp-block-image \"><picture class=\"sn-responsive-image\"><source srcset=\"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_logic_desktop.jpg?w=680&amp;ssl=1 680w, https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_logic_desktop.jpg?resize=330%2C112&amp;ssl=1 330w\" width=\"680\" height=\"231\" media=\"(min-width: 600px)\" sizes=\"(max-width: 1023px) 100vw, 680px\"><source srcset=\"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_logic_mobile.jpg?w=680&amp;ssl=1 680w, https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_logic_mobile.jpg?resize=330%2C180&amp;ssl=1 330w\" width=\"680\" height=\"371\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_logic_mobile.jpg?fit=680%2C371&amp;ssl=1\" class=\"attachment-large size-large\" alt=\"A chart showing GPT-4's performance completing a logic problem\" srcset=\"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_logic_mobile.jpg?w=680&amp;ssl=1 680w, https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_logic_mobile.jpg?resize=330%2C180&amp;ssl=1 330w\" sizes=\"(max-width: 1023px) 100vw, 680px\" data-attachment-id=\"3141303\" data-permalink=\"https:\/\/www.sciencenews.org\/071324_ai_sidebar_logic_mobile\" data-orig-file=\"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_logic_mobile.jpg?fit=680%2C371&amp;ssl=1\" data-orig-size=\"680,371\" data-comments-opened=\"0\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"071324_ai_sidebar_logic_mobile\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_logic_mobile.jpg?fit=680%2C371&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_logic_mobile.jpg?fit=680%2C371&amp;ssl=1\"\/><\/source><\/source><\/picture><figcaption><span class=\"credit mobile-credit wp-credit-3141303\">Z. Wu <em>et al<\/em>\/arXiv.org 2024<\/span><span class=\"credit desktop-credit wp-credit-3141302\">Z. Wu <em>et al<\/em>\/arXiv.org 2024<\/span><\/figcaption><\/figure>\n\n\n<p>In a test of logical reasoning, GPT-4 had to solve a logic problem that depended on factually accurate information. In the counter\u00adfactual task, the logic problem required the AI to assume inaccurate information to solve the problem rather than rely on its training.<\/p>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\">New tests of generative AI will be hard<\/h2>\n\n\n\n<p>Even though interest in such nuanced evaluations is gaining steam, it\u2019s challenging to create rigorous tests because of the sheer scale of data and training, plus the proprietary nature of LLMs.<\/p>\n\n\n\n<p>For instance, trying to rule out memorization may require checking millions of data points in huge training datasets to see if the LLM has encountered the example before. It\u2019s harder still when training data aren\u2019t available for scrutiny. 
\u201cWe have to make lots of assumptions, and we have to pick our task very carefully,\u201d Dziri says. Sometimes researchers trying to do an evaluation can\u2019t get access to the training methodology or a version of the model itself (let alone the most updated version).<\/p>\n\n\n\n<p>The cost of computation is another constraint. For instance, Dziri and colleagues found that including five-digit by five-digit multiplication problems in their fine-tuning of GPT-3 would require about 8.1 billion question-and-answer examples, costing a total of over $12 million.<\/p>\n\n\n\n<p>In truth, a perfect AI evaluation might never exist. The more language models improve, the harder tests will have to get to provide any meaningful assessment. The testers will always have to be on their toes. And it\u2019s likely even the latest, greatest tests will uncover only some specific aspects of AI\u2019s capabilities, rather than assessing anything akin to general intelligence.<\/p>\n\n\n\n<p>For now, researchers are hoping at least for more consistency and transparency in evaluations. \u201cMapping the model\u2019s ability to human understanding of a cognitive capability is already a vague statement,\u201d Arakelyan says. Only evaluation practices that are well thought out and can be critically examined will help us understand what\u2019s actually going on inside AI.<\/p>\n\n\n\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/www.sciencenews.org\/article\/ai-understanding-reasoning-skill-assess\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Take our AI Survey Science News has partnered with Trusting News to gather feedback on the potential use of AI in journalism. Currently, we do not publish any content produced by generative AI (see our policy). We do want to hear your views on how Science News could use AI responsibly. 
Let us know by [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":91893,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"tdm_status":"","tdm_grid_status":"","fifu_image_url":"https:\/\/i0.wp.com\/www.sciencenews.org\/wp-content\/uploads\/2024\/07\/071324_ai_sidebar_arithmetic_mobile.jpg?fit=680%2C255&ssl=1","fifu_image_alt":"","footnotes":""},"categories":[606],"tags":[35828,78706,14811,21179,10092,7795,4478],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/posts\/91892"}],"collection":[{"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/comments?post=91892"}],"version-history":[{"count":1,"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/posts\/91892\/revisions"}],"predecessor-version":[{"id":91894,"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/posts\/91892\/revisions\/91894"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/media\/91893"}],"wp:attachment":[{"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/media?parent=91892"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/categories?post=91892"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/news.talkwithrattan.com\/index.php\/wp-json\/wp\/v2\/tags?post=91892"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}