Measuring reasoning capabilities of ChatGPT
Date
2023-09-15
Abstract
I quantify the logical faults generated by ChatGPT when it is applied to reasoning tasks.
For the experiments, I use the 144 puzzles from the library https://users.utcluj.ro/~agroza/puzzles/maloga [1].
The library contains puzzles of various types, including arithmetic puzzles, logical equations, Sudoku-like puzzles, zebra-like puzzles, truth-telling puzzles, grid puzzles, strange numbers, and self-reference puzzles. The correct solutions for these puzzles were checked with the theorem prover Prover9 [2] and the finite model finder Mace4 [3], based on human modelling in Equational First-Order Logic.
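As an illustration of this verification step (a sketch only: the toy puzzle and the predicate name below are illustrative and not taken from the benchmark), a truth-telling puzzle in which person a says "we are both liars" can be modelled in Prover9/Mace4-style notation roughly as:

  truthteller(a) <-> (-truthteller(a) & -truthteller(b)).

Mace4 then finds the unique model in which a is a liar and b is a truthteller; the benchmark solutions were verified through similar, human-written encodings.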
A first output of this study is a benchmark of 100 logical puzzles. On this dataset, ChatGPT provided both a correct answer and a correct justification for only 7% of the puzzles. Since the dataset appears challenging, researchers are invited to test it on models more advanced or better tuned than ChatGPT-3.5, with more carefully crafted prompts.
A second output is the classification of the reasoning faults conveyed by ChatGPT. This classification forms a basis for a taxonomy of reasoning faults generated by large language models. I have identified 67 such logical faults, among them: inconsistencies, implications that do not hold, unsupported claims, lack of commonsense, and wrong justifications. The 100 solutions generated by ChatGPT contain 698 logical faults, that is, on average, about 7 fallacies per reasoning task.
A third output is the set of ChatGPT answers annotated with the corresponding logical faults. Each wrong statement within a ChatGPT answer was manually annotated, aiming to quantify the amount of faulty text generated by the language model. On average, 26.03% of the generated text was logically faulty.