How sloppy is your slop?

Friday 26th September, 2025 - Bruce Sterling

https://arxiv.org/pdf/2509.19163

REFERENCES

Meta AI. Introducing llama 4: Advancing multimodal intelligence, 2024. URL
https://ai.meta.com/blog/llama-4-multimodal-intelligence/.

Anthropic. Claude 3 model card addendum. Technical report, 2024. URL
https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf.
Accessed: 2024-12-30.

Kyrtin Atreides and David J Kelley. Cognitive biases in natural language: Automatically detecting,
differentiating, and measuring bias in text. Cognitive Systems Research, 88:101304, 2024.

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder,
Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated
machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.

Deborah L Bandalos. Measurement theory and applications for the social sciences. Guilford
Publications, 2018.

Anirudh Bharadwaj, Chaitanya Malaviya, Nitish Joshi, and Mark Yatskar. Flattery, fluff, and
fog: Diagnosing and mitigating idiosyncratic biases in preference models. arXiv preprint
arXiv:2506.05339, 2025.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Language (technology) is
power: A critical survey of "bias" in nlp. arXiv preprint arXiv:2005.14050, 2020.

Cati Brown, Tony Snodgrass, Susan J Kemper, Ruth Herman, and Michael A Covington. Automatic
measurement of propositional idea density from part-of-speech tagging. Behavior research
methods, 40(2):540–545, 2008.

Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu.
Art or artifice? large language models and the false promise of creativity. In Proceedings of the
2024 CHI Conference on Human Factors in Computing Systems, pp. 1–34, 2024.

Tuhin Chakrabarty, Philippe Laban, and Chien-Sheng Wu. Ai-slop to ai-polish? aligning language
models through edit-based writing rewards and test-time computation. arXiv preprint
arXiv:2504.07532, 2025a.

Tuhin Chakrabarty, Philippe Laban, and Chien-Sheng Wu. Can ai writing be salvaged? mitigating
idiosyncrasies and improving human-ai alignment in the writing process through edits. In
Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–33,
2025b.

Preprint. Under Review.

Aaron Chatterji, Thomas Cunningham, David J Deming, Zoe Hitzig, Christopher Ong, Carl Yan
Shan, and Kevin Wadman. How people use chatgpt. Working Paper 34255, National Bureau of
Economic Research, September 2025. URL http://www.nber.org/papers/w34255.

Charles LA Clarke and Laura Dietz. Llm-based relevance assessment still can’t replace human
relevance assessment. arXiv preprint arXiv:2412.17156, 2024.

Edgar Dale and Jeanne S. Chall. A formula for predicting readability. Educational Research
Bulletin, 27(1):11–28, 1948.

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,
2025. URL https://arxiv.org/abs/2501.12948.

Aaron Fanous, Jacob Goldberg, Ank A Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, and
Sanmi Koyejo. Syceval: Evaluating llm sycophancy. arXiv preprint arXiv:2502.08177, 2025.

Rudolf Flesch. A new readability yardstick. Journal of Applied Psychology, 32(3):221–233, 1948.
doi: 10.1037/h0057532.

Sian Gooding, Lucia Lopez-Rivilla, and Edward Grefenstette. Writing as a testbed for open ended
agents. arXiv preprint arXiv:2503.19711, 2025.

Robert Gunning. The Technique of Clear Writing. McGraw-Hill, 1952.

Kunal Handa, Alex Tamkin, Miles McCain, Saffron Huang, Esin Durmus, Sarah Heck, Jared
Mueller, Jerry Hong, Stuart Ritchie, Tim Belonax, et al. Which economic tasks are performed
with ai? evidence from millions of claude conversations. arXiv preprint arXiv:2503.04761, 2025.

Abhimanyu Hans et al. Binoculars: Scalable detection of machine-generated text. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing, 2024.

Dirk Hovy. The enemy in your own camp: How well can we detect statistically-generated fake
reviews–an adversarial study. In Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), pp. 351–356. Association for Computational
Linguistics, 2016. doi: 10.18653/v1/P16-2057.

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec
Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv
preprint arXiv:2412.16720, 2024.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot,
Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier,
Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril,
Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL
https://arxiv.org/abs/2310.06825.

J. Peter Kincaid, Robert P. Jr. Fishburne, Richard L. Rogers, and Brad S. Chissom. Derivation of
new readability formulas (automated readability index, fog count and flesch reading ease formula)
for navy enlisted personnel. Technical Report RBR-8-75, Naval Technical Training Command
Millington TN Research Branch, 1975.

Md Tahmid Rahman Laskar, Cheng Chen, Shashi Bhushan Tn, et al. Are large language models
reliable judges? a study on the factuality evaluation capabilities of llms. In Proceedings of the
Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pp. 310–316,
2023.

Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun
Liu. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint
arXiv:2412.05579, 2024.

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization
branches out, pp. 74–81, 2004.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg
evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.

Arwa Mahdawi. Ai-generated slop is slowly killing the internet, and nobody is trying to stop it.
The Guardian, 8 Jan 2025. Available at:
https://www.theguardian.com/global/commentisfree/2025/jan/08/ai-generated-slop-slowly-killing-/internet-nobody-trying-to-stop-it
(Accessed: March 25, 2025).

Marian Marchal, Merel Scholman, Frances Yung, and Vera Demberg. Establishing annotation quality
in multi-label annotations. In Proceedings of the 29th international conference on computational
linguistics, pp. 3659–3668, 2022.

Philipp Mayring. Qualitative content analysis. Forum Qualitative Sozialforschung / Forum:
Qualitative Social Research, 1(2), Jun. 2000. doi: 10.17169/fqs-1.2.1089. URL
https://www.qualitative-research.net/index.php/fqs/article/view/1089.

Clara Meister, Tiago Pimentel, Patrick Haller, Lena Jäger, Ryan Cotterell, and Roger Levy. Revisiting
the uniform information density hypothesis. arXiv preprint arXiv:2109.11635, 2021.

Cade Metz. A.i. search engines are better at answers than finding them. The New York
Times, 11 Jun 2024. Available at: https://www.nytimes.com/2024/06/11/style/ai-search-slop.html
(Accessed: March 25, 2025).

Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. Detectgpt:
Zero-shot machine-generated text detection using probability curvature. In Proceedings
of the International Conference on Machine Learning, 2023.

Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana
Negreanu, Chris Parnin, and Advait Sarkar. Evaluating the evaluator: Measuring llms’ adherence
to task evaluation instructions, 2024. URL https://arxiv.org/abs/2408.08781.

Ramya Namuduri, Yating Wu, Anshun Asher Zheng, Manya Wadhwa, Greg Durrett, and
Junyi Jessy Li. Qudsim: Quantifying discourse similarities in llm-generated text. arXiv preprint
arXiv:2504.09373, 2025.

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia,
Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord,
Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha
Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William
Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin,
Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm,
Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, and Hannaneh Hajishirzi. 2
olmo 2 furious, 2025. URL https://arxiv.org/abs/2501.00656.

OpenAI. How people are using chatgpt. https://openai.com/index/how-people-are-using-chatgpt/,
September 2025. Accessed: 2025-09-17.

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia
Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red
Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad
Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher
Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman,
Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann,
Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis,
Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey
Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux,
Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila
Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix,
Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson,
Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan
Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy,
Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan
Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu,
Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun
Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali,
Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook
Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel
Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen
Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel
Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez,
Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv
Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney,
Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick,
Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel
Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev
Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe,
Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel
Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe
de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny,
Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl,
Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra
Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders,
Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam,
Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor,
Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky,
Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang,
Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston
Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya,
Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan
Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng,
Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman,
Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming
Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao
Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024. URL
https://arxiv.org/abs/2303.08774.

Vishakh Padmakumar and He He. Does writing with language models reduce content diversity? In
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL),
Toronto, Canada, 2023. Association for Computational Linguistics. URL
https://arxiv.org/abs/2309.05196.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic
evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association
for Computational Linguistics, pp. 311–318, 2002.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language
models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Sanjana Ramprasad and Byron C Wallace. Do automatic factuality metrics measure factuality? a
critical evaluation. arXiv preprint arXiv:2411.16638, 2024.

Jenna Russell, Marzena Karpinska, and Mohit Iyyer. People who frequently use chatgpt for writing
tasks are accurate and robust detectors of ai-generated text. arXiv preprint arXiv:2501.15654,
2025.

Nikita Salkar, Thomas Trikalinos, Byron Wallace, and Ani Nenkova. Self-repetition in abstractive
neural summarizers. In Yulan He, Heng Ji, Sujian Li, Yang Liu, and Chua-Hui Chang (eds.),
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational
Linguistics and the 12th International Joint Conference on Natural Language Processing
(Volume 2: Short Papers), pp. 341–350, Online only, November 2022. Association for Computational
Linguistics. doi: 10.18653/v1/2022.aacl-short.42. URL
https://aclanthology.org/2022.aacl-short.42/.

A. O. Scott. A.i. is annoying now. the future may be worse. The New York Times,
24 Jul 2024. Available at: https://www.nytimes.com/2024/07/24/opinion/ai-annoying-future.html
(Accessed: March 25, 2025).

Chantal Shaib, Joe Barrow, Jiuding Sun, Alexa F Siu, Byron C Wallace, and Ani Nenkova. Standardizing
the measurement of text diversity: A tool and a comparative analysis of scores. arXiv
preprint arXiv:2403.00553, 2024a.

Chantal Shaib, Yanai Elazar, Junyi Jessy Li, and Byron C Wallace. Detection and measurement of
syntactic templates in generated text. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen
(eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,
pp. 6416–6431, Miami, Florida, USA, November 2024b. Association for Computational
Linguistics. doi: 10.18653/v1/2024.emnlp-main.368. URL
https://aclanthology.org/2024.emnlp-main.368/.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju,
Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret,
Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline
Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin,
Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean-Bastien Grill, Behnam Neyshabur,
Olivier Bachem, Alanna Walton, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchison,
Alvin Abdagic, Amanda Carl, Amy Shen, Andy Brock, Andy Coenen, Anthony Laforge,
Antonia Paterson, Ben Bastian, Bilal Piot, Bo Wu, Brandon Royal, Charlie Chen, Chintu Kumar,
Chris Perry, Chris Welty, Christopher A. Choquette-Choo, Danila Sinopalnikov, David Weinberger,
Dimple Vijaykumar, Dominika Rogozińska, Dustin Herbison, Elisa Bandy, Emma Wang,
Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin,
Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Plucińska, Harleen
Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jeff Stanway, Jetha
Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joost van
Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju yeong Ji, Kareem Mohamed, Kartikeya
Badola, Kat Black, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia,
Kish Greene, Lars Lowe Sjoesund, Lauren Usui, Laurent Sifre, Lena Heuermann, Leticia Lago,
Lilly McNealus, Livio Baldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Machel
Reid, Manvinder Singh, Mark Iverson, Martin Görner, Mat Velloso, Mateo Wirth, Matt Davidow,
Matt Miller, Matthew Rahtz, Matthew Watson, Meg Risdal, Mehran Kazemi, Michael Moynihan,
Ming Zhang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khatwani, Natalie Dao,
Nenshad Bardoliwalla, Nesh Devanathan, Neta Dumai, Nilay Chauhan, Oscar Wahltinez, Pankil
Botarda, Parker Barnes, Paul Barham, Paul Michel, Pengchong Jin, Petko Georgiev, Phil Culliton,
Pradeep Kuppala, Ramona Comanescu, Ramona Merhej, Reena Jana, Reza Ardeshir Rokni,
Rishabh Agarwal, Ryan Mullins, Samaneh Saadat, Sara Mc Carthy, Sarah Cogan, Sarah Perrin,
Sébastien M. R. Arnold, Sebastian Krause, Shengyang Dai, Shruti Garg, Shruti Sheth, Sue Ronstrom,
Susan Chan, Timothy Jordan, Ting Yu, Tom Eccles, Tom Hennigan, Tomas Kocisky, Tulsee
Doshi, Vihan Jain, Vikas Yadav, Vilobh Meshram, Vishal Dharmadhikari, Warren Barkley, Wei
Wei, Wenming Ye, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan
Wei, Victor Cotruta, Phoebe Kirk, Anand Rao, Minh Giang, Ludovic Peran, Tris Warkentin, Eli
Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, D. Sculley, Jeanine Banks, Anca Dragan,
Slav Petrov, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet,
Elena Buchatskaya, Sebastian Borgeaud, Noah Fiedel, Armand Joulin, Kathleen Kenealy,
Robert Dadashi, and Alek Andreev. Gemma 2: Improving open language models at a practical
size, 2024. URL https://arxiv.org/abs/2408.00118.

Guy Tevet and Jonathan Berant. Evaluating the evaluation of diversity in natural language generation.
arXiv preprint arXiv:2004.02990, 2020.

Kiran Tomlinson, Sonia Jaffe, Will Wang, Scott Counts, and Siddharth Suri. Working with ai:
Measuring the occupational implications of generative ai. arXiv preprint arXiv:2507.07935, 2025.

Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, and Melanie Martin. Learning subjective
language. Computational Linguistics, 30(3):277–308, 09 2004. ISSN 0891-2017. doi:
10.1162/0891201041850885. URL https://doi.org/10.1162/0891201041850885.

Haoyan Yang, Yixuan Wang, Xingyin Xu, Hanyuan Zhang, and Yirong Bian. Can we trust
llms? mitigate overconfidence bias in llms through knowledge transfer. arXiv preprint
arXiv:2405.16856, 2024.

Yusen Zhang, Sarkar Snigdha Sarathi Das, and Rui Zhang. Verbosity ≠ veracity: Demystify verbosity
compensation behavior of large language models, 2024. URL
https://arxiv.org/abs/2411.07858.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang,
Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and
chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023.