Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

Natural Language Processing (NLP) research is becoming increasingly focused on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic review of work using OpenAI’s ChatGPT and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI’s data usage policy, we extensively document how much data has been leaked to ChatGPT in the first year after the model’s release. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, including unfair or missing baseline comparisons, reproducibility issues, and authors’ lack of awareness of the data usage policy. Our work provides the first quantification of the ChatGPT data leakage problem.

Cite us

@inproceedings{balloccu-etal-2024-leak,
    title = "Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs",
    author = "Balloccu, Simone and Schmidtová, Patrícia and Lango, Mateusz and Dušek, Ondřej",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics",
    year = "2024",
    publisher = "Association for Computational Linguistics",
}

How to contribute?

If you would like to add your own work that has unfortunately leaked some data, or to correct information that is already in the table, simply submit a pull request to modify this page directly at https://github.com/leak-llm/leak-llm.github.io.

Results

We release our survey results as a table below. The table has the following structure (columns):

Title: the title of each paper that we analysed.
Applicable: Y if the paper contains an actual evaluation of ChatGPT or GPT-4; N/A otherwise, in which case all the other columns have N/A or empty value.
Date (earliest draft): the date of the earliest version found online for the paper.
Running period: the date in the format Month Year (e.g. December 2023) if it was possible to determine when the experiment was run; - otherwise.
Prompts available: Y for works that detail the prompts used; N otherwise.
Repository available: Y for works that publicly release the experiment code in a repository; N otherwise.
Model: the model used in the experiment (ChatGPT; GPT-4; Both).
Model access: whether the models were accessed through the browser interface (BROWSER) or the API (API - POST MARCH 2023 OR OPT-OUT).
Sampling etc.: Y if the paper uses random sampling or any other practice that makes it impossible to exactly reconstruct the data used; N otherwise.
Comparison: Y if the paper compares the considered model(s) with other baselines/approaches; N otherwise.
Different evaluation scale: Y if the comparison scale between the considered model(s) and other baselines/approaches differs; N otherwise.
Custom datasets: Y if the work adopts an ad-hoc/custom dataset that is not publicly available; N otherwise.
Datasets leaked: Details of the leaked data, in the format “Name of dataset; Task name; Split (train/test/…); Number of instances; Fraction of the whole split”.
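For readers who want to process the table programmatically, the following is a minimal sketch of how one row could be mapped onto the columns. It assumes the table has been saved as tab-separated text, which is an assumption about the export format and not something this page provides. The names `COLUMNS`, `parse_row`, and `is_browser_access` are hypothetical helpers introduced for illustration. The column names are copied from the header row of the table; rows that stop early are padded with empty strings.

```python
from typing import Dict

# Column names copied from the header row of the table.
COLUMNS = [
    "Title", "Applicable", "Published", "Date (earliest draft)",
    "Running period", "Prompts available", "Repository available",
    "Model", "Model access", "Sampling etc.", "Comparison",
    "Different evaluation scale", "Custom datasets", "Datasets leaked",
]


def parse_row(line: str) -> Dict[str, str]:
    """Map one tab-separated row onto the column names (hypothetical helper)."""
    cells = line.rstrip("\n").split("\t")
    # Many rows omit trailing cells; pad them so every column has a value.
    cells += [""] * (len(COLUMNS) - len(cells))
    return dict(zip(COLUMNS, cells))


def is_browser_access(row: Dict[str, str]) -> bool:
    """True if the row records access through the browser interface."""
    return row["Model access"] == "BROWSER"
```

A row such as the one for "Zeno Chatbot Report" has only twelve cells, so its "Custom datasets" and "Datasets leaked" fields come back as empty strings.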

Title	Applicable	Published	Date (earliest draft)	Running period	Prompts available	Repository available	Model	Model access	Sampling etc.	Comparison	Different evaluation scale	Custom datasets	Datasets leaked
LLM-contamination	N/A		-	-	N/A	N/A	-		N/A	N/A	N/A
Time Travel in LLMs: Tracing Data Contamination in Large Language Models	Y	Y	16/08/2023	June 2023	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	Y	N	N
Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks	N/A		-	-	N/A	N/A	-		-	-
Can we trust the evaluation on ChatGPT?	Y	Y	22/03/2023	January and February 2023	Y	N	ChatGPT	BROWSER	Y	N	N	N	SemEval-2016 Task 6 (68.8% of test set) P-Stance (99.1% of test set)
How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection	Y	N	18/01/2023	-	Y	Y	ChatGPT	BROWSER	Y	N	N	N	HC3-English (100.0% of entire set) ELI5 (19.0% of entire set) WikiQA (39.0% of entire set) wiki_csai (100.0% of entire set) Medical Dialog (0.3% of entire set) FiQA (59.1% of entire set) HC3-Chinese (100.0% of entire set) WebTextQA + BaikeQA (0.1% of entire set) BaiduBaike15 (100.0% of entire set) NLPCC-DBQA (7.0% of entire set) Medical Dialog - zh (0.1% of entire set) FinanceZhidao (0.2% of entire set) Chinese Psychological Question Answering Dataset (5.0% of entire set) LegalQA (2.3% of entire set)
GPTEval: A Survey on Assessments of ChatGPT and GPT-4	N/A		24/08/2023	-	N/A	N/A	Both	-	-	-
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets	Y	Y	29/05/2023	December 2022	Y	Y	ChatGPT	BROWSER	Y	Y	Y	N	EfficientQA (100.0% of dev set) custom-ethical (100.0% of entire set) CB (100.0% of dev set) RTE (100.0% of dev set) WiC (100.0% of dev set) AX-g (100.0% of dev set) Inverse scaling challenge (11 datasets) (0.1% of entire set)
Quantifying Memorization Across Neural Language Models	N/A		-	-	N/A	N/A	-		-	-
Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy	N/A		-	-	N/A	N/A	-		-	-
List of various ChatGPT evaluations	N/A		-	-	N/A	N/A	ChatGPT		-	-
GPT-4 Technical Report	N/A		-	-	N/A	N/A	GPT-4	API - POST MARCH 2023 OR OPT-OUT	-	-
Zeno Chatbot Report	Y	N	18/05/2023	-	N	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity	Y	N	08/02/2023	-	Y	Y	ChatGPT	BROWSER	Y	Y	Y	N	SAMSum (6.0% of test set) CNN/DM (0.4% of test set) FLORES-200 (53.0% of devtest set) NusaX (25.0% of test set) EntailmentBank (9.0% of test set) bAbI-task15 (12.0% of test set) bAbI-task16 (3.0% of test set) CLUTRR (1.0% of test set) αNLI (1.9% of dev set) TimeDial (2.0% of test set) SpartQA (1.7% of test set) StepGame (<0.1% of test set) Math (0.1% of test set) CommonsenseQA (2.6% of test set) PiQA (1.6% of test set) Pep-3k (0.9% of entire set) E-Care (1.4% of val set) HotpotQA (<0.1% of val set) Letter string analogies - Webb (1.6% of entire set) COVID-19-Lee (19.5% of test set) OpenDialKG (0.3% of test set) MultiWOZ 2.2 (5.0% of test set)
Don’t Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner	N/A		-	-	N/A	N/A	-		-	-
How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation	Y	N	18/02/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine	Y	N	20/01/2023	December 2022	Y	Y	ChatGPT	BROWSER	Y	Y	N	N	Flores-101 (4.9% of test set) WMT19 Biomedical Translation Task (13.4% of test set) WMT20 Robustness Task set 2 En->Ja (3.6% of test set) WMT20 Robustness Task set 2 Ja->En (5.0% of test set) WMT20 Robustness Task set 3 (0.9% of test set)
IS CHATGPT A GENERAL-PURPOSE NATURAL LANGUAGE PROCESSING TASK SOLVER?	Y	N	08/02/2023	January 2023	Y	N	ChatGPT	BROWSER	Y	Y	Y	N	MultiArith (100.0% of entire set) GSM8K (100.0% of test set) AddSub (100.0% of entire set) AQUA-RAT (100.0% of test set) SingleEq (100.0% of entire set) SVAMP (100.0% of entire set) CommonsenseQA (100.0% of val set) StrategyQA (100.0% of entire set) COPA (20.0% of val set) Last Letter (100.0% of entire set) Coin Flip (16.0% of entire set) Date (100.0% of entire set) Object (100.0% of entire set) RTE (4.8% of entire set) CommitmentBank (4.6% of entire set) BoolQ (100.0% of test set) MuTual (100.0% of test set) SAMsum (100.0% of test set) CoNLL03 (100.0% of test set) SST2 (100.0% of val set)
Extractive summarization via ChatGPT for faithful summary generation	Y	N	09/04/2023	-	Y	N	ChatGPT	BROWSER	Y	Y	Y	N	Reddit (2.4% of test set) XSum (8.8% of test set) PubMed (19.9% of test set) CNN/DM (8.7% of test set)
Chatgpt or grammarly? evaluating chatgpt on grammatical error correction benchmark	Y	N	15/03/2023	-	Y	N	ChatGPT	BROWSER	Y	Y	N	N	CoNLL2014 (7.6% of test set)
Will affective computing emerge from foundation models and general ai? A first evaluation on chatgpt	Y	Y	03/03/2023	February 2023	Y	Y	ChatGPT	BROWSER	Y	N	N	N	First Impressions (100.0% of entire set) Sentiment140 (100.0% of entire set) The Suicide and Depression dataset (100.0% of entire set)
ChatGPT: Jack of all trades, master of none	Y	Y	21/02/2023	-	Y	Y	Both	BROWSER	Y	Y	N	N	WikiDetox Aggr. - Aggression (4.3% of test or dev set) WikiDetox Aggr. - AggressionPer (0.2% of test or dev set) CoLa (100.0% of dev set) ColBERT (2.5% of test or dev set) Sarcasmania - Sarcasm (16.7% of test or dev set) SMS Spam v.1 (100.0% of test or dev set) WiC - WordContext (100.0% of test or dev set) RTE (100.0% of dev set) WNLI (100.0% of dev set) SQuAD v2 (8.4% of dev set) GSM8K - MathQA (75.8% of test set) ClarinEmo (100.0% of test or dev set) GoEmotions - GoEmo (18.4% of test or dev set) GoEmotions - GoEmoPer0 (5.9% of test or dev set) GoEmotions - GoEmoPer1 (5.9% of test or dev set) GoEmotions - GoEmoPer2 (5.9% of test or dev set) GoEmotions - GoEmoPer3 (5.9% of test or dev set) Unhealthy Conv. - Unhealthy (2.2% of test or dev set) Unhealthy Conv. - UnhealthyPer (0.4% of test or dev set) PolEmo2 - PolEmo (100.0% of test or dev set) TweetEval - TweetEmoji (3.3% of test or dev set) TweetEval - TweetSent (41.8% of test or dev set) TweetEval - TweetStance (100.0% of test set) RACE - ReAding (20.4% of test or dev set) Raganato - WSD (100.0% of test or dev set)
Zero-Shot Information Extraction via Chatting with ChatGPT	Y	N	20/02/2023	December 2022	Y	Y	ChatGPT	BROWSER	N	Y	N	N	NYT11-HRL (100.0% of entire set) DuIE2.0 (100.0% of entire set) conll++ (100.0% of entire set) MSRA (100.0% of entire set) DuEE1.0 (100.0% of entire set) ACE05 (100.0% of entire set)
Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent	Y	N	19/04/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	Y
Is ChatGPT a Good NLG Evaluator? A Preliminary Study	Y	N	07/03/2023	February and March 2023	Y	Y	ChatGPT	BROWSER	N	Y	N	N	SummEval (100.0% of entire set) NewsRoom (100.0% of entire set) OpenMEVA-ROC (100.0% of entire set) RealSumm (100.0% of entire set)
G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment	Y	N	29/03/2023	-	Y	Y	GPT-4	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
Large Language Models Are State-of-the-Art Evaluators of Translation Quality	Y	Y	28/02/2023	February 2023	Y	Y	Both	BROWSER	N	Y	N		MQM-2022 (100.0% of test set)
CHATGPT OUTPERFORMS CROWD-WORKERS FOR TEXT-ANNOTATION TASKS	Y	N	19/07/2023	March and April 2023	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N
SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark	Y	N	27/07/2023	-	Y	Y	GPT-4	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models	Y	N	17/05/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning	Y	N	12/04/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
Zero-Shot Cross-Lingual Summarization via Large Language Models	Y	N	28/02/2023	-	Y	N	Both	BROWSER	Y	Y	N	N	CrossSum (1.2% of test set) WikiLingua (0.2% of test set) XSAMSum (<0.1% of test set)
A categorical archive of chatgpt failures	N/A		06/02/2023	-	Y	N	ChatGPT	BROWSER	-	-
Benchmarks for automated commonsense reasoning: A survey	Y	N	09/02/2023	-	Y	Y	ChatGPT	BROWSER	Y	N	N	Y	custom commonsense reasoning dataset (100.0% of entire set)
Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation.	Y	N	12/05/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	Y
Causal reasoning and large language models: Opening a new frontier for causality	Y	N	08/05/2023	-	Y	N	Both	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Theory of mind may have spontaneously emerged in large language models	Y	N	04/02/2023	March 2023	Y	Y	Both	BROWSER	Y	Y	N	Y	theory-of-mind-custom 1/02 (100.0% of entire set)
Boosting Theory-of-Mind Performance in Large Language Models via Prompting	Y	N	22/04/2023	-	Y	Y	Both	BROWSER	N	N	N	Y	theory-of-mind-custom 2 (100.0% of entire set)
Does ChatGPT have Theory of Mind?	Y	N	23/05/2023	March 2023	Y	N	Both	BROWSER	N	N	N	Y	theory-of-mind-custom 3 (100.0% of entire set)
Can ChatGPT Defend the Truth? Automatic Dialectical Evaluation Elicits LLMs’ Deficiencies in Reasoning	Y	N	22/05/2023	-	N	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N
Evaluating the logical reasoning ability of chatgpt and gpt-4	Y	N	07/04/2023	-	Y	Y	Both	BROWSER	N	Y	N	N	LogiQA 2.0 (100.0% of test set) LogiQA 2.0-zh (100.0% of test set) MED (100.0% of test set) ConjNLI (100.0% of test set) TaxiNLI (100.0% of test set) AR-LSAT (100.0% of test set) ReClor (100.0% of dev set) ConTRoL “logical reasoning” (100.0% of test set) HELP (100.0% of test set) MNLI (100.0% of test set) custom-LogiQA 2.0 (100.0% of entire set)
We’re Afraid Language Models Aren’t Modeling Ambiguity	Y	N	27/04/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Sparks of artificial general intelligence: Early experiments with gpt-4	N/A		-	-	-	-	GPT-4	API - POST MARCH 2023 OR OPT-OUT	-	-
Chatgpt participates in a computer science exam	Y	N	08/03/2023	February 2023	Y	Y	Both	BROWSER	N	N	N	Y	custom-CS exam (100.0% of entire set)
Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls	Y	N	30/05/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Mathematical capabilities of chatgpt	Y	N	31/01/2023	January 2023	Y	Y	Both	BROWSER	N	N	N	N	GHOSTS (100.0% of entire set)
Could an artificial-intelligence agent pass an introductory physics course?	Y	Y	28/01/2023	January 2023	Y	PART	ChatGPT	BROWSER	N	N	N	Y	custom-made physics exam&homeworks (100.0% of entire set)
Investigating the Use of an Artificial Intelligence Chatbot with General Chemistry Exam Questions	Y	Y	10/01/2023	March 2023	Y	Y	ChatGPT	BROWSER	N	N	N	Y	custom-made chemistry exam (100.0% of entire set)
How do physics students evaluate artificial intelligence responses on comprehension questions? A study on the perceived scientific accuracy and linguistic quality of ChatGPT	Y	Y	12/04/2023	February 2023	N/A	Y	ChatGPT	BROWSER	N	N	N	Y	custom-made physics exam (100.0% of entire set)
Chatgpt goes to law school	Y	Y	23/01/2023	December 2022	Y	N	ChatGPT	BROWSER	N	N	N	Y	custom-made law exam (100.0% of entire set)
ChatGPT has aced the test of understanding in college economics: Now what?	Y	Y	08/03/2023	February 2023	Y	N	ChatGPT	BROWSER	N	N	N/A	N	TUCE test (100.0% of entire set)
The Wall Street Neophyte: A Zero-Shot Analysis of ChatGPT Over MultiModal Stock Movement Prediction Challenges	Y	N	10/04/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment	Y	Y	26/12/2022	November 2022	Y	N	ChatGPT	BROWSER	Y	N	N/A	Y	NBME medical questions (100.0% of entire set) AMBOSS medical questions (1.7% of entire set)
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models	Y	Y	21/12/2022	November 2022	Y	Y	ChatGPT	BROWSER	Y	Y	Y	Y	USMLE medical questions (93.0% of entire set)
Evaluating the performance of chatgpt in ophthalmology: An analysis of its successes and shortcomings	Y	Y	26/01/2023	-	Y	N	ChatGPT	BROWSER	Y	N	N/A	Y	BCSC Self-Assessment Program (5.8% of entire set) OphthoQuestions (5.7% of entire set)
Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study	Y	Y	15/02/2023	January 2023	Y	Y	ChatGPT	BROWSER	Y	N	N/A	Y	custom medical diagnosis dataset (100.0% of entire set)
Assessing the utility of ChatGPT throughout the entire clinical workflow	Y	Y	26/02/2023	January 2023	Y	N	ChatGPT	BROWSER	Y	N	N/A	N	Merck Sharp & Dohme (MSD) Clinical Manual (100.0% of entire set)
ChatGPT as a medical doctor? A diagnostic accuracy study on common and rare diseases	Y	N	26/04/2023	April 2023	Y	N	Both	BROWSER	Y	N	N/A	Y	custom medical 1 (100.0% of entire set)
Can AI tell good stories? narrative transportation and persuasion with ChatGPT	Y	N	05/04/2023	-	N	N	ChatGPT	BROWSER	N	N	N/A	Y	custom narrative (100.0% of entire set)
Benchmarking Foundation Models with Language-Model-as-an-Examiner	Y	N	07/06/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
Chatgpt: A meta-analysis after 2.5 months	N/A		-	-	N/A	N/A	ChatGPT		N/A	N/A	N/A
Causal-discovery performance of chatgpt in the context of neuropathic pain diagnosis	Y	N	24/01/2023	January 2023	Y	Y	ChatGPT	BROWSER	Y	N	N/A	Y	custom medical 2 (100.0% of entire set)
Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models	N/A		-	-	N/A	N/A	ChatGPT		N/A	N/A	N/A
Harnessing the power of llms in practice: A survey on chatgpt and beyond	N/A		-	-	N/A	N/A	-		-	-
Holistic evaluation of language models	Y	Y	1/10/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
A Survey on Evaluation of Large Language Models	N/A		-	-	N/A	N/A	-		N/A	N/A	N/A
Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert	Y	N	19/02/2023	February 2023	Y	Y	ChatGPT	BROWSER	Y	Y	N	N	CoLa (2.5% of dev set) SST-2 (1.3% of dev set) MRPC (1.4% of dev set) STS-B (3.5% of dev set) QQP (<0.1% of dev set) MNLI (0.1% of val set) QNLI (0.4% of dev set) RTE (0.8% of dev set) WNLI (17.1% of dev set)
On the robustness of chatgpt: An adversarial and out-of-distribution perspective	Y	N	22/02/2023	February 2023	Y	Y	ChatGPT	BROWSER	Y	Y	N	N	AdvGLUE_SST-2 (100.0% of dev set) AdvGLUE_QQP (100.0% of dev set) AdvGLUE_MNLI (100.0% of dev set) AdvGLUE_QNLI (100.0% of dev set) AdvGLUE_RTE (100.0% of dev set) custom AdvGLUE-T (100.0% of dev set) ANLI-R3 (100.0% of test set) Flipkart (0.1% of entire set) DDXPlus (EN) (<0.1% of test set)
An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)	Y	N	23/02/2023	January and February 2023	Y	Y	ChatGPT	BROWSER	N	N	N/A	N	DRAW-1K (100.0% of entire set)
Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family	Y	N	14/03/2023	February and March 2023	Y	Y	ChatGPT	BROWSER	Y	Y	N	N	KQApro (90.0% of entire set) LC-quad2.0 (100.0% of entire set) WQSP (99.2% of entire set) CWQ (100.0% of entire set) GrailQA (100.0% of test set) GraphQuestions (100.0% of entire set) QALD-9 (100.0% of entire set) MKQA (2.3% of entire set)
Instruction tuning with gpt-4	Y	N	06/04/2023	-	Y	Y	GPT-4	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks	N/A		01/03/2023	-	N/A	N/A	-	API - POST MARCH 2023 OR OPT-OUT	N/A	N/A	N/A		custom-made psychological dataset (100.0% of entire set)
Consistency analysis of chatgpt	Y	N	11/03/2023	February 2023	Y	N	ChatGPT	BROWSER	Y	Y	Y	N	becel-SNLI (4.9% of test set) becel-RTE (36.6% of test set) becel-MRPC (16.6% of test set)
Does ChatGPT resemble humans in language use?	Y	N	10/03/2023	December 2022 and January 2023	Y	Y	ChatGPT	BROWSER	N	N	N/A	Y
A comprehensive capability analysis of gpt-3 and gpt-3.5 series models	Y	N	18/03/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
A comprehensive evaluation of ChatGPT’s zero-shot Text-to-SQL capability	Y	N	12/03/2023	-	Y	Y	ChatGPT	BROWSER	N	Y	N	N	Spider (100.0% of dev set) Spider-SYN (100.0% of dev set) Spider-DK (100.0% of dev set) Spider-Realistic (100.0% of dev set) Spider-CG(sub) (100.0% of entire set) Spider-CG(app) (100.0% of entire set) ADVETA(rpl)-spider (100.0% of dev set) ADVETA(add)-spider (100.0% of dev set) CSpider (100.0% of dev set) DuSQL (100.0% of entire set) SParC (100.0% of dev set) CoSQL (100.0% of dev set)
Is ChatGPT a good sentiment analyzer? A preliminary study	Y	N	10/04/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	Y
A preliminary evaluation of chatgpt for zero-shot dialogue understanding	Y	N	09/04/2023	February 2023	Y	N	ChatGPT	BROWSER	N	Y	N	N	ATIS (100.0% of test set) SNIPS (100.0% of test set) MultiWOZ 2.1 (100.0% of test set) MultiWOZ 2.4 (100.0% of test set)
Zero-shot Temporal Relation Extraction with ChatGPT	Y	Y	11/04/2023	-	Y	N	ChatGPT	BROWSER	N	Y	N	N	TimeBank-Dense (100.0% of test set) MATRES (100.0% of test set) TDDiscourse (100.0% of test set)
Can chatgpt reproduce human-generated labels? a study of social computing tasks	Y	N	22/04/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
Chatgraph: Interpretable text classification by converting chatgpt knowledge to graphs	Y	Y	19/09/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
Uncovering the Potential of ChatGPT for Discourse Analysis in Dialogue: An Empirical Study	Y	N	15/05/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Evaluating ChatGPT’s Performance for Multilingual and Emoji-based Hate Speech Detection	Y	N	22/05/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
Sentiment Analysis in the Era of Large Language Models: A Reality Check	Y	N	24/05/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
ChatGPT for Suicide Risk Assessment on Social Media: Quantitative Evaluation of Model Performance, Potentials and Limitations	Y	Y	15/06/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Metacognitive Prompting Improves Understanding in Large Language Models	Y	N	10/08/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Exploring ai ethics of chatgpt: A diagnostic analysis	Y	N	30/01/2023	December 2022 -January 2023	PART	N	ChatGPT	BROWSER	Y	Y	N	N	BBQ (80.0% of entire set) BOLD (80.0% of entire set) custom-made query to inspect gender occupation bias (100.0% of entire set) IMDB (1.9% of test set) IMDB-contrast (100.0% of entire set) BoolQ (10.3% of test set) BoolQ-contrast (100.0% of test set) FLORES-200 (80.0% of dev set) ProsocialDialog (0.2% of test set) custom jailbreaking (100.0% of entire set) Open-BookQA (80.0% of test set) TruthfulQA (100.0% of test set) TruthfulQA gen (12.2% of test set) RealToxicPrompts (80.0% of entire set) custom character role-play (100.0% of entire set)
Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech	Y	Y	11/02/2023	January 2023	Y	N	ChatGPT	BROWSER	Y	N	N/A	N	LatentHatred (12.0% of entire set)
Investigating Chain-of-thought with ChatGPT for Stance Detection on Social Media	Y	N	06/04/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
The Self-Perception and Political Biases of ChatGPT	Y	N	14/04/2023	March 2023	N	N	ChatGPT	BROWSER	N	N	N/A	N	custom political compass test (100.0% of entire set)
Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting	Y	N	11/05/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
BAD: BiAs Detection for Large Language Models in the context of candidate screening	Y	N	17/05/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
DECODINGTRUST: A Comprehensive Assessment of Trustworthiness in GPT Models	Y	N	20/06/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
On Large Language Models’ Selection Bias in Multi-Choice Questions	Y	N	08/09/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	Y
Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization	Y	N	16/02/2023	February 2023	Y	N	ChatGPT	BROWSER	Y	Y	Y	N	QMSum (35.5% of test set) SQuALITY (55.5% of test set) CovidET (25.1% of test set) NEWTS (16.6% of test set)
ChatGPT as a Factual Inconsistency Evaluator for Text Summarization	Y	N	13/04/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Uncovering ChatGPT’s Capabilities in Recommender Systems	Y	N	24/08/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models	Y	N	22/05/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text	Y	N	22/05/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N/A	N	N/A
PokemonChat: Auditing ChatGPT for Pokémon Universe Knowledge	Y	N	05/06/2023	December 2022	Y	N	ChatGPT	BROWSER	N/A	N	N/A	Y	custom pokemon dataset (100.0% of entire set)
Hybrid Long Document Summarization using C2F-FAR and ChatGPT: A Practical Study	Y	N	01/06/2023	-	Y	N	ChatGPT	BROWSER	Y	N	N	N	custom book and article summarization dataset (100.0% of entire set)
Generative Job Recommendations with Large Language Model	Y	N	05/06/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N/A	N/A	N/A
Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews	Y	N	12/06/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
L-Eval: Instituting Standardized Evaluation for Long Context Language Models	Y	N	31/07/2023	-	Y	Y	GPT-4	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
LLM-Rec: Personalized Recommendation via Prompting Large Language Models	N/A		2023	-	-	-	-		-	-
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding	Y	N	28/08/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Can Large Language Models Understand Real-World Complex Instructions	Y	N	27/09/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Mind meets machine: Unravelling GPT-4’s cognitive psychology	Y	N	20/03/2023	-	Y	N	GPT-4	BROWSER	N	N	N/A	N
Capabilities of GPT-4 on Medical Challenge Problems	Y	N	11/04/2023	-	Y	N	Both	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
GPT is becoming a Turing machine: Here are some ways to program it	Y	N	25/03/2023	-	Y	N	GPT-4	BROWSER	Y	N	N/A	Y	custom-made dataset of coding problems (100.0% of entire set)
ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models	Y	N	29/03/2023	March 2023	Y	N	ChatGPT	BROWSER	Y	N	N/A	N	CommonsenseQA (9.8% of dev set) OpenBookQA (24.0% of dev set) WSC (36.6% of dev set) PIQA (6.5% of dev set) Social IQA (6.1% of dev set) ARC (33.4% of dev set) QASC (12.9% of dev set) HellaSWAG (1.2% of dev set) NumerSense (10.6% of dev set) MC-TACO (3.1% of dev set) ProtoQA (100.0% of dev set)
Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure	Y	N	30/03/2023		Y	N	Both	BROWSER	N	N	N/A	N	custom-made theory of reason (100.0% of entire set)
Generative Models as a Complex Systems Science: How can we make sense of large language model behavior?	N/A		31/07/2023	-	-	-	-		-	-
How well do Large Language Models perform in Arithmetic tasks?	Y	N	16/03/2023	-	Y	Y	Both	BROWSER	Y	Y	Y	N	custom mathematical (100.0% of entire set)
Uncertainty in Natural Language Generation: From Theory to Applications	N/A		28/07/2023	-	N/A	N/A	-		N/A	N/A	N/A
ChatGPT-Crawler: Find out if ChatGPT really knows what it’s talking about	Y	N	06/04/2023	-	N	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
How is ChatGPT’s behavior changing over time?	Y	N	18/07/2023	-	N	Y	Both	API - POST MARCH 2023 OR OPT-OUT	Y	N/A	N/A
ChatABL: Abductive Learning via Natural Language Interaction with ChatGPT	Y	N	21/04/2023	-	PART	N	Both	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models	Y	N	16/07/2023	-	PART	N	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	Y
StructGPT: A General Framework for Large Language Model to Reason over Structured Data	Y	N	16/05/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N
Chain-of-Symbol Prompting Elicits Planning in Large Langauge Models	Y	N	26/06/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
Tree of Thoughts: Deliberate Problem Solving with Large Language Models	Y	N	17/05/2023	-	Y	Y	GPT-4	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
ChatLog: Recording and Analyzing ChatGPT Across Time	Y	N	27/04/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
Chain of Knowledge: A Framework for Grounding Large Language Models with Structured Knowledge Bases	Y	N	3/10/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
ChatHaruhi: Reviving Anime Character in Reality via Large Language Model.	Y	N	18/08/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N/A	Y	N/A
Adaptive Chameleon or Stubborn Sloth: Unraveling the Behavior of Large Language Models in Knowledge Conflicts	Y	N	3/10/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
GPT-3.5 vs GPT-4: Evaluating ChatGPT’s Reasoning Performance in Zero-shot Learning	Y	N	21/05/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	N	N	N
LogiCoT: Logical Chain-of-Thought Instruction-Tuning Data Collection with GPT-4	Y	N	20/05/2023	-	Y	Y	GPT-4	API - POST MARCH 2023 OR OPT-OUT	N/A	N/A	N/A
Enabling Large Language Models to Generate Text with Citations	Y	N	24/05/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
Why Does ChatGPT Fall Short in Providing Truthful Answers?	Y	N	24/05/2023	-	Y	N	Both	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation From Deductive, Inductive and Abductive Views	Y	N	08/08/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation	Y	N	23/07/2023	late May and early June 2023	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models	Y	N	20/07/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
Think-on-Graph: Deep and Responsible Reasoning of Large Language Model with Knowledge Graph	Y	N	15/07/2023	-	N	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models	Y	N	14/08/2023	-	Y	N	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions	Y	N	22/08/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
Large Language Models on the Chessboard: A Study on ChatGPT’s Formal Language Comprehension and Complex Reasoning Skills	Y	N	29/08/2023	-	PART	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N/A	N	N/A
Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis	Y	N	22/08/2023	June 2023	Y	Y	Both	BROWSER	N/A	Y	N	N	custom graph (100.0% of entire set)
Exploring New Frontiers in Agricultural NLP: Investigating the Potential of Large Language Models for Food Applications	Y	N	20/06/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback	Y	N	19/09/2023	-	Y	Y	GPT-4	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
ChatGPT for Robotics: Design Principles and Model Abilities	N/A		20/02/2023	-	Y	Y	ChatGPT	UNCLEAR	-	N/A
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT	Y	N	24/03/2023	March 2023	Y	Y	ChatGPT	BROWSER	Y	N	N/A	N	WMT20 EN-DE (4.0% of test set) WMT20 ZH-EN (4.0% of test set)
Towards Making the Most of ChatGPT for Machine Translation	Y	N	24/03/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
Linguistically Informed ChatGPT Prompts to Enhance Japanese-Chinese Machine Translation: A Case Study on Attributive Clauses	Y	N	27/03/2023	-	Y	N	ChatGPT	UNCLEAR	N/A	N	N/A	N
How to Design Translation Prompts for ChatGPT: An Empirical Study	Y	N	21/04/2023	-	Y	N	ChatGPT	BROWSER	Y	N	N/A	N	Flores-101 (4.9% of test set) WMT2014 Ott et al 2018 (10.0% of test set)
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation	Y	N	10/06/2023	-	N/A	Y	Both	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Evaluating ChatGPT’s Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness	Y	N	23/04/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
Can ChatGPT Assess Human Personalities? A General Evaluation Framework	Y	N	07/03/2023	-	Y	Y	ChatGPT	BROWSER	N	N	N/A	N	Myers-Briggs Type Indicator (100.0% of entire set)
Is ChatGPT Equipped with Emotional Dialogue Capabilities?	Y	N	19/04/2023	-	N	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
AugGPT: Leveraging ChatGPT for Text Data Augmentation	Y	N	20/03/2023	-	Y	N	ChatGPT	BROWSER	N	N	N/A	N	Amazon (customer reviews) (<0.1% of train set) Symptoms (on Kaggle) (100.0% of train set) PubMed20K (100.0% of train set)
Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations	Y	N	05/04/2023	31/03/2023	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
To ChatGPT, or not to ChatGPT: That is the question!	Y	N	04/04/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N/A	N/A
In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT	Y	N	18/08/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
Analyzing ChatGPT’s Aptitude in an Introductory Computer Engineering Course	Y	N	13/03/2023	-	Y	Y	ChatGPT	BROWSER	Y	N	N/A	Y	custom CS exams/homeworks (100.0% of entire set)
Linguistic ambiguity analysis in ChatGPT	Y	N	13/02/2023	-	Y	Y	ChatGPT	BROWSER	N	N	N/A	Y	custom linguistic ambiguity (100.0% of entire set) custom (100.0% of entire set)
Using ChatGPT for Entity Matching	Y	N	05/05/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
MEGA: Multilingual Evaluation of Generative AI	Y	Y	22/10/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	Y
Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers	Y	Y	07/06/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Zero-shot Clinical Entity Recognition using ChatGPT	Y	N	29/03/2023	22/02/2023	Y	N	ChatGPT	BROWSER	Y	Y	N	N	i2b2 2010 (33.5% of entire set) MTSamples (1.9% of entire set)
Human-like Summarization Evaluation with ChatGPT	Y	N	05/04/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
“HOT” ChatGPT: The promise of ChatGPT in detecting and discriminating hateful, offensive, and toxic comments on social media	Y	N	20/04/2023	-	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A
Large language models effectively leverage document-level context for literary translation, but critical errors persist	Y	N	06/04/2023	-	Y	Y	GPT-4	API - POST MARCH 2023 OR OPT-OUT	N/A	N/A	N/A
BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models	Y	N	19/06/2023	-	Y	Y	ChatGPT	BROWSER	Y	Y	N	N	WMT22 (0.1% of test set)
Neural Machine Translation Data Generation and Augmentation using ChatGPT	N/A		11/07/2023	-	N/A	N/A	-	-	N/A	N/A	N/A
ChatGPT MT: Competitive for High- (but not Low-) Resource Languages	Y	N	14/09/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	Y
How would Stance Detection Techniques Evolve after the Launch of ChatGPT?	Y	N	30/12/2022	-	Y	N	ChatGPT	BROWSER	N	Y	N	N	SemEval-2016 Task 6 (100.0% of test set) P-Stance (100.0% of test set)
BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models	Y	N	12/09/2023	-	Y	N	GPT-4	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A
Zero-shot Approach to Overcome Perturbation Sensitivity of Prompts	N/A	Y	25/05/2023	-	Y	Y	-		-
UZH_CLyp at SemEval-2023 Task 9: Head-First Fine-Tuning and ChatGPT Data Generation for Cross-Lingual Learning in Tweet Intimacy Prediction	Y	Y	02/03/2023	December 2022	Y	N	ChatGPT	BROWSER	Y	N/A	N/A	N	SemEval-2023 Task 9 (0.001% of training set)
Exploring the Feasibility of ChatGPT for Event Extraction	Y	N	07/03/2023		Y	N	ChatGPT	BROWSER	Y	Y	N	N	ACE05 (0.3% of test set)
Does Synthetic Data Generation of LLMs Help Clinical Text Mining?	Y	N	08/03/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction	Y	N	09/03/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	-	-	-
ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design	Y	N	11/03/2023	January - February 2023	Y	N/A	ChatGPT	BROWSER	N/A	N/A	N/A	Y
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models	Y	N	15/03/2023	-	Y	Y	-		N/A	N/A	N/A
Translating Radiology Reports into Plain Language using ChatGPT and GPT-4 with Prompt Learning: Promising Results, Limitations, and Potential	Y	Y	16/03/2023		Y	N	Both	API - POST MARCH 2023 OR OPT-OUT	-	N	N/A
An Empirical Study of Pre-trained Language Models in Simple Knowledge Graph Question Answering	Y	Y	18/03/2023	January 2023	Y	Y	ChatGPT	BROWSER	Y	Y	Y	N	SimpleQuestions (0.3% of entire set) WebQuestionsSP (4.5% of entire set) FreebaseQA (1.0% of entire set)
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4	Y	N	20/03/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	Y
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action	Y	N	20/03/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N/A	N/A
Is ChatGPT A Good Keyphrase Generator? A Preliminary Study	Y	N	23/03/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N
Beyond Black Box AI-Generated Plagiarism Detection: From Sentence to Document Level	Y	Y	13/06/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A	N
Comparing Abstractive Summaries Generated by ChatGPT to Real Summaries Through Blinded Reviewers and Text Classification Algorithms	Y	N	30/03/2023	February 2023	Y	Y	ChatGPT	BROWSER	Y	N	N/A	N	CNN DailyMail (0.4% of test set)
MGTBench: Benchmarking Machine-Generated Text Detection	N/A		26/03/2023		N/A	N/A	-	-	N/A	N/A	N/A
Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs	Y	N	30/03/2023		N	N	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	Y
What would Harry say? Building Dialogue Agents for Characters in a Story	Y	N	13/11/2022		N	N	ChatGPT	BROWSER	Y	Y	Y	N	custom Harry Potter dialogues (100.0% of entire set)
Reliable Natural Language Understanding with Large Language Models and Answer Set Programming	N/A				N/A	N/A	-		N/A	N/A	N/A
ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models	Y	Y	20/12/2022	-	Y	Y	ChatGPT	BROWSER	Y	Y	Y	Y
Transformers Go for the LOLs: Generating (Humourous) Titles from Scientific Abstracts End-to-End	Y	N	20/12/2022		Y	N	ChatGPT	BROWSER	Y	Y	Y	Y	custom EMNLP 2022 abstracts (4.8% of entire set)
Modeling Label Semantics Improves Activity Recognition	N/A		01/01/2023		N/A	N/A	-		N/A	N/A	N/A
An Analysis of the Automatic Bug Fixing Performance of ChatGPT	Y	N	20/01/2023	December 2022 - January 2023	Y	Y	ChatGPT	BROWSER	N	Y	N	N	QuixBugs (100.0% of entire set)
Chat2VIS: Generating Data Visualisations via Natural Language using ChatGPT, Codex and GPT-3 Large Language Models	Y	Y	04/02/2023		Y	N	ChatGPT	BROWSER	N	Y	N	N
ChatGPT versus Traditional Question Answering for Knowledge Graphs: Current Status and Future Directions Towards Knowledge Graph Chatbots	Y	N	08/02/2023		-	N	ChatGPT	BROWSER	Y	Y	N	N	QALD-9 (100.0% of test set) LCQuAD-2.0 (0.3% of entire set) YAGO-Bench (100.0% of entire set) DBLP-Bench (100.0% of entire set) MAG-Bench (100.0% of entire set)
ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models	Y	N	14/02/2023	January 2023	Y	N	ChatGPT	BROWSER	Y	Y	Y	N	MIMIC (0.5% of test set)
Guiding Large Language Models via Directional Stimulus Prompting	Y	N	07/07/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
Utilizing ChatGPT Generated Data to Retrieve Depression Symptoms from Social Media	N/A		05/07/2023	-	N/A	-	-	-	N/A	N/A	N/A
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback	Y	N	24/02/2023		N/A	N	ChatGPT	BROWSER	Y	N	N/A	N	DSTC7 Track 2 (100.0% of entire set) DSTC11 Track 5 (100.0% of val set) OTT-QA (100.0% of entire set)
ChatGPT for Zero-shot Dialogue State Tracking: A Solution or an Opportunity?	Y	Y	02/06/2023	January 2023	Y	Y	ChatGPT	BROWSER	Y	Y	N	N	MultiWOZ 2.1 (13.5% of test set)
ChatGPT vs Human-authored Text: Insights into Controllable Text Summarization and Sentence Style Transfer	Y	Y	01/07/2023	February to April 2023	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	N
Not The End of Story: An Evaluation of ChatGPT-Driven Vulnerability Description Mappings	Y	Y	01/07/2023	-	Y	Y	ChatGPT	BROWSER	N	Y	N	N	CVE-2021 (100.0% of entire set) CVE-ATT&CK (100.0% of entire set) BRON (preprocessed) (100.0% of entire set)
ChatGPT is not a good indigenous translator	Y	Y	01/07/2023		N	N	GPT-4	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	N
You’ve Got a Friend in … a Language Model? A Comparison of Explanations of Multiple-Choice Items of Reading Comprehension between ChatGPT and Humans	Y	Y	01/07/2023	December 2022 and January 2023	Y	Y	ChatGPT	BROWSER	N	N	N/A	Y	custom - Evaluar para Avanzar (EpA) exams (100.0% of entire set)
Language-Agnostic Transformers and Assessing ChatGPT-Based Query Rewriting for Multilingual Document-Grounded QA	Y	Y	03/05/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A	N
HW-TSC at SemEval-2023 Task 7: Exploring the Natural Language Inference Capabilities of ChatGPT and Pre-trained Language Model for Clinical Trial	Y	Y	01/07/2023	January 2023	Y	N	ChatGPT	BROWSER	N	Y	N	Y	SemEval-2023 Task 7 (100.0% of entire set)
Can ChatGPT Understand Causal Language in Science Claims?	Y	Y	01/07/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A	N
ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models	N/A	Y	02/06/2023	January to March 2023	Y	Y	ChatGPT	BROWSER	N/A	N/A	N/A	N
Assessing Cross-Cultural Alignment between ChatGPT and Human Societies: An Empirical Study	Y	Y	30/03/2023	-	Y	Y	ChatGPT	BROWSER	Y	N	N/A	Y	Hofstede Cultural Survey (100.0% of entire set)
Evaluating Reading Comprehension Exercises Generated by LLMs: A Showcase of ChatGPT in Education Applications	Y	Y	01/07/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N	N
ChatGPT vs. Crowdsourcing vs. Experts: Annotating Open-Domain Conversations with Speech Functions	Y	Y	01/09/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A	N
Leveraging Large Language Models for Automated Dialogue Analysis	Y	Y	01/09/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N	N
Breaking the Bank with ChatGPT: Few-Shot Text Classification for Finance	Y	Y	28/08/2023		Y	N	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N	N
Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction	Y	Y	05/06/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	N	N/A	N
Credible Without Credit: Domain Experts Assess Generative Language Models	Y	Y	01/07/2023	-	Y	Y	ChatGPT	BROWSER	Y	N	N/A	Y	custom qa (100.0% of entire set)
SUT at SemEval-2023 Task 1: Prompt Generation for Visual Word Sense Disambiguation	N/A	Y	01/07/2023		N/A	N/A		-	N/A	N/A	N/A	N/A
How Well Do Large Language Models Perform on Faux Pas Tests?	Y	Y	24/05/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N
Zero-Shot Information Extraction for Clinical Meta-Analysis using Large Language Models	Y	Y	01/07/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	N
Can Large Language Models Be an Alternative to Human Evaluation?	Y	Y	03/05/2023	December 2022 - January 2023	Y	Y	ChatGPT	BROWSER	Y	Y	N	N	WritingPrompts (1.3% of test set)
RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs	N/A	Y	15/05/2023		N/A	N/A			N/A	N/A	N/A	N/A
Toward Human-Like Evaluation for Natural Language Generation with Error Analysis	N/A	Y	20/12/2022		N/A	N/A			N/A	N/A	N/A	N/A
Multi-Level Knowledge Distillation for Out-of-Distribution Detection in Text	N/A	Y	21/11/2022		N/A	N/A			N/A	N/A	N/A	N/A
MolXPT: Wrapping Molecules with Text for Generative Pre-training	N/A	Y	18/05/2023		N/A	N/A			N/A	N/A	N/A	N/A
MUX-PLMs: Pre-training Language Models with Data Multiplexing	N/A	Y	24/02/2023		N/A	N/A			N/A	N/A	N/A	N/A
Is GPT-3 a Good Data Annotator?		Y	20/12/2022		Y	N	Both	BROWSER	-
Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey	N/A	Y	14/10/2022		N/A	N/A			N/A	N/A	N/A	N/A
Detoxifying Online Discourse: A Guided Response Generation Approach for Reducing Toxicity in User-Generated Text	Y	Y	01/07/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N	Y
Faithful Question Answering with Monte-Carlo Planning	Y	Y	04/05/2023	December 2022	Y	Y	ChatGPT	BROWSER	Y	Y	Y	N	EntailmentBank (15.0% of test set)
Nut-cracking Sledgehammers: Prioritizing Target Language Data over Bigger Language Models for Cross-Lingual Metaphor Detection	Y	Y	01/09/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N	N
Generating Faithful Text From a Knowledge Graph with Noisy Reference Text	Y	Y	12/08/2023		Y		ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N	Y
Empowering Conversational Agents using Semantic In-Context Learning	Y	Y	01/07/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A	N
Can Large Language Models Safely Address Patient Questions Following Cataract Surgery?	Y	Y	01/07/2023	December 2022	Y	N	ChatGPT	BROWSER	N	N	N	Y	custom-sourced post-operations queries from patients (100.0% of entire set)
Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models	Y	Y	15/06/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	Y	N
Generative Pretrained Transformers for Emotion Detection in a Code-Switching Setting	Y	Y	01/07/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	N
On the Underspecification of Situations in Open-domain Conversational Datasets	Y	Y	01/07/2023		Y	N	ChatGPT	BROWSER	-	N/A	N/A
Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors	Y	Y	25/05/2023	-	Y	Y	ChatGPT	UNCLEAR	-
Multilingual Language Models are not Multicultural: A Case Study in Emotion	N/A	Y	03/06/2023						-
Good Data, Large Data, or No Data? Comparing Three Approaches in Developing Research Aspect Classifiers for Biomedical Papers	Y	Y	07/06/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N	N
IDOL: Indicator-oriented Logic Pre-training for Logical Reasoning	Y	Y	27/06/2023	February 2023	Y	Y	ChatGPT	BROWSER	Y	Y	N	N	ReClor (0.4% of entire set) LogiQA (0.3% of entire set)
PrecogIIITH@WASSA2023: Emotion Detection for Urdu-English Code-mixed Text	Y	Y	01/07/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A	N
Do PLMs Know and Understand Ontological Knowledge?	Y	Y	01/07/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	Y
Evaluation of Question Generation Needs More References	Y	Y	26/05/2023		Y	N	ChatGPT	BROWSER	N	Y	N	N	Quiz_Design (100.0% of entire set)
A System for Answering Simple Questions in Multiple Languages	Y	Y	01/07/2023	March 2023 version of ChatGPT	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	N
HIT-SCIR at WASSA 2023: Empathy and Emotion Analysis at the Utterance-Level and the Essay-Level	Y	Y	01/07/2023		Y	N	ChatGPT	UNCLEAR	-	N/A	N/A	N
Pre-trained Language Models Can be Fully Zero-Shot Learners	Y	Y	26/05/2023	March 2023	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	N
Predicting the Quality of Revisions in Argumentative Writing	Y	Y	01/06/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	N	N/A	Y
Exploring Effectiveness of GPT-3 in Grammatical Error Correction: A Study on Performance and Controllability in Prompt-Based Methods	N/A	Y	29/05/2023						-
When Truth Matters – Addressing Pragmatic Categories in Natural Language Inference (NLI) by Large Language Models (LLMs)	Y	Y	01/07/2023	December 2022	Y	Y	ChatGPT	BROWSER	Y	N	N/A	N	MNLI (0.4% of val set)
Debiasing should be Good and Bad: Measuring the Consistency of Debiasing Techniques in Language Models	N/A	Y	23/05/2023						-
GPoeT: a Language Model Trained for Rhyme Generation on Synthetic Data	N/A	Y	01/05/2023						-
Examining Bias in Opinion Summarisation Through the Perspective of Opinion Diversity	Y	Y	01/07/2023	March 2023 version of ChatGPT	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	N
Unsupervised Summarization Re-ranking	Y	Y	14/05/2023	-	N	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	Y	Y	N	N
Improving Dutch Vaccine Hesitancy Monitoring via Multi-Label Data Augmentation with GPT-3.5	Y	Y	01/07/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N/A	N	N/A	Y
What Makes a Good Counter-Stereotype? Evaluating Strategies for Automated Responses to Stereotypical Text	Y	Y	01/07/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N/A	N	N/A	Y
What Makes Good Counterspeech? A Comparison of Generation Approaches and Evaluation Metrics	Y	Y	01/09/2023	May 2023	Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	Y
DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition	Y	Y	05/05/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	N
UMASS_BioNLP at MEDIQA-Chat 2023: Can LLMs generate high-quality synthetic note-oriented doctor-patient conversations?	Y	Y	29/06/2023	-	Y	Y	Both	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	N
Frontier Review of Multimodal AI	N/A	Y	03/08/2023		N/A	N/A	-	-
Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation	Y	N	04/04/2023		Y	N	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	N	Y	N	N
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI	Y	N	19/07/2023	-	Y	Y	ChatGPT	API - POST MARCH 2023 OR OPT-OUT	-	N/A	N/A	N

Acknowledgements

This work was co-funded by the European Union (ERC, NG-NLG, 101039303).
