
Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

Natural Language Processing (NLP) research is becoming increasingly focused on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic review of work using OpenAI’s ChatGPT and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI’s data usage policy, we extensively document how much data has been leaked to ChatGPT in the first year after the model’s release. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, including unfair or missing baseline comparisons, reproducibility issues, and authors’ lack of awareness of the data usage policy. Our work provides the first quantification of the ChatGPT data leakage problem.

Cite us

@inproceedings{balloccu-etal-2024-leak,
    title = "Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs",
    author = "Balloccu, Simone and Schmidtová, Patrícia and Lango, Mateusz and Dušek, Ondřej",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics",
    year = "2024",
    publisher = "Association for Computational Linguistics",
}

How to contribute?

If you would like to add your own work that has unfortunately leaked some data, or correct information that is already in a table, simply submit a pull request to modify this page directly at https://github.com/leak-llm/leak-llm.github.io.

Results

We release our survey results as a table below. The table has the following structure (columns):

Title | Applicable | Published | Date (earliest draft) | Running period | Prompts available | Repository available | Model | Model access | Sampling etc. | Comparison | Different evaluation scale | Custom datasets | Datasets leaked
LLM-contamination N/A   - - N/A N/A -   N/A N/A N/A    
Time Travel in LLMs: Tracing Data Contamination in Large Language Models Y Y 16/08/2023 June 2023 Y Y Both API - POST MARCH 2023 OR OPT-OUT Y N N    
Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks N/A   - - N/A N/A -   - -      
Can we trust the evaluation on ChatGPT? Y Y 22/03/2023 January and February 2023 Y N ChatGPT BROWSER Y N N N SemEval-2016 Task 6 (68.8% of test set)
P-Stance (99.1% of test set)
How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection Y N 18/01/2023 - Y Y ChatGPT BROWSER Y N N N HC3-English (100.0% of entire set)
ELI5 (19.0% of entire set)
WikiQA (39.0% of entire set)
wiki_csai (100.0% of entire set)
Medical Dialog (0.3% of entire set)
FiQA (59.1% of entire set)
HC3-Chinese (100.0% of entire set)
WebTextQA + BaikeQA (0.1% of entire set)
BaiduBaike15 (100.0% of entire set)
NLPCC-DBQA (7.0% of entire set)
Medical Dialog - zh (0.1% of entire set)
FinanceZhidao (0.2% of entire set)
Chinese Psychological Question Answering Dataset (5.0% of entire set)
LegalQA (2.3% of entire set)
GPTEval: A Survey on Assessments of ChatGPT and GPT-4 N/A   24/08/2023 - N/A N/A Both - - -      
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets Y Y 29/05/2023 December 2022 Y Y ChatGPT BROWSER Y Y Y N EfficientQA (100.0% of dev set)
custom-ethical (100.0% of entire set)
CB (100.0% of dev set)
RTE (100.0% of dev set)
WiC (100.0% of dev set)
AX-g (100.0% of dev set)
Inverse scaling challenge (11 datasets) (0.1% of entire set)
Quantifying Memorization Across Neural Language Models N/A   - - N/A N/A -   - -      
Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy N/A   - - N/A N/A -   - -      
List of various ChatGPT evaluations N/A   - - N/A N/A ChatGPT   - -      
GPT-4 Technical Report N/A   - - N/A N/A GPT-4 API - POST MARCH 2023 OR OPT-OUT - -      
Zeno Chatbot Report Y N 18/05/2023 - N Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity Y N 08/02/2023 - Y Y ChatGPT BROWSER Y Y Y N SAMSum (6.0% of test set)
CNN/DM (0.4% of test set)
FLORES-200 (53.0% of devtest set)
NusaX (25.0% of test set)
EntailmentBank (9.0% of test set)
bAbI-task15 (12.0% of test set)
bAbI-task16 (3.0% of test set)
CLUTRR (1.0% of test set)
αNLI (1.9% of dev set)
TimeDial (2.0% of test set)
SpartQA (1.7% of test set)
StepGame (<0.1% of test set)
Math (0.1% of test set)
CommonsenseQA (2.6% of test set)
PiQA (1.6% of test set)
Pep-3k (0.9% of entire set)
E-Care (1.4% of val set)
HotpotQA (<0.1% of val set)
Letter string analogies - Webb (1.6% of entire set)
COVID-19-Lee (19.5% of test set)
OpenDialKG (0.3% of test set)
MultiWOZ 2.2 (5.0% of test set)
Don’t Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner N/A   - - N/A N/A -   - -      
How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation Y N 18/02/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine Y N 20/01/2023 December 2022 Y Y ChatGPT BROWSER Y Y N N Flores-101 (4.9% of test set)
WMT19 Biomedical Translation Task (13.4% of test set)
WMT20 Robustness Task set 2 En->Ja (3.6% of test set)
WMT20 Robustness Task set 2 Ja->En (5.0% of test set)
WMT20 Robustness Task set 3 (0.9% of test set)
IS CHATGPT A GENERAL-PURPOSE NATURAL LANGUAGE PROCESSING TASK SOLVER? Y N 08/02/2023 January 2023 Y N ChatGPT BROWSER Y Y Y N MultiArith (100.0% of entire set)
GSM8K (100.0% of test set)
AddSub (100.0% of entire set)
AQUA-RAT (100.0% of test set)
SingleEq (100.0% of entire set)
SVAMP (100.0% of entire set)
CommonsenseQA (100.0% of val set)
StrategyQA (100.0% of entire set)
COPA (20.0% of val set)
Last Letter (100.0% of entire set)
Coin Flip (16.0% of entire set)
Date (100.0% of entire set)
Object (100.0% of entire set)
RTE (4.8% of entire set)
CommitmentBank (4.6% of entire set)
BoolQ (100.0% of test set)
MuTual (100.0% of test set)
SAMsum (100.0% of test set)
CoNLL03 (100.0% of test set)
SST2 (100.0% of val set)
Extractive summarization via ChatGPT for faithful summary generation Y N 09/04/2023 - Y N ChatGPT BROWSER Y Y Y N Reddit (2.4% of test set)
XSum (8.8% of test set)
PubMed (19.9% of test set)
CNN/DM (8.7% of test set)
Chatgpt or grammarly? evaluating chatgpt on grammatical error correction benchmark Y N 15/03/2023 - Y N ChatGPT BROWSER Y Y N N CoNLL2014 (7.6% of test set)
Will affective computing emerge from foundation models and general ai? A first evaluation on chatgpt Y Y 03/03/2023 February 2023 Y Y ChatGPT BROWSER Y N N N First Impressions (100.0% of entire set)
Sentiment140 (100.0% of entire set)
The Suicide and Depression dataset (100.0% of entire set)
ChatGPT: Jack of all trades, master of none Y Y 21/02/2023 - Y Y Both BROWSER Y Y N N WikiDetox Aggr. - Aggression (4.3% of test or dev set)
WikiDetox Aggr. - AggressionPer (0.2% of test or dev set)
CoLa (100.0% of dev set)
ColBERT (2.5% of test or dev set)
Sarcasmania - Sarcasm (16.7% of test or dev set)
SMS Spam v.1 (100.0% of test or dev set)
WiC - WordContext (100.0% of test or dev set)
RTE (100.0% of dev set)
WNLI (100.0% of dev set)
SQuAD v2 (8.4% of dev set)
GSM8K - MathQA (75.8% of test set)
ClarinEmo (100.0% of test or dev set)
GoEmotions - GoEmo (18.4% of test or dev set)
GoEmotions - GoEmoPer0 (5.9% of test or dev set)
GoEmotions - GoEmoPer1 (5.9% of test or dev set)
GoEmotions - GoEmoPer2 (5.9% of test or dev set)
GoEmotions - GoEmoPer3 (5.9% of test or dev set)
Unhealthy Conv. - Unhealthy (2.2% of test or dev set)
Unhealthy Conv. - UnhealthyPer (0.4% of test or dev set)
PolEmo2 - PolEmo (100.0% of test or dev set)
TweetEval - TweetEmoji (3.3% of test or dev set)
TweetEval - TweetSent (41.8% of test or dev set)
TweetEval - TweetStance (100.0% of test set)
RACE - ReAding (20.4% of test or dev set)
Raganato - WSD (100.0% of test or dev set)
Zero-Shot Information Extraction via Chatting with ChatGPT Y N 20/02/2023 December 2022 Y Y ChatGPT BROWSER N Y N N NYT11-HRL (100.0% of entire set)
DuIE2.0 (100.0% of entire set)
conll++ (100.0% of entire set)
MSRA (100.0% of entire set)
DuEE1.0 (100.0% of entire set)
ACE05 (100.0% of entire set)
Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent Y N 19/04/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT Y Y Y    
Is ChatGPT a Good NLG Evaluator? A Preliminary Study Y N 07/03/2023 February and March 2023 Y Y ChatGPT BROWSER N Y N N SummEval (100.0% of entire set)
NewsRoom (100.0% of entire set)
OpenMEVA-ROC (100.0% of entire set)
RealSumm (100.0% of entire set)
G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment Y N 29/03/2023 - Y Y GPT-4 API - POST MARCH 2023 OR OPT-OUT Y Y N    
Large Language Models Are State-of-the-Art Evaluators of Translation Quality Y Y 28/02/2023 February 2023 Y Y Both BROWSER N Y N   MQM-2022 (100.0% of test set)
CHATGPT OUTPERFORMS CROWD-WORKERS FOR TEXT-ANNOTATION TASKS Y N 19/07/2023 March and April 2023 Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N    
SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark Y N 27/07/2023 - Y Y GPT-4 API - POST MARCH 2023 OR OPT-OUT N Y N    
C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models Y N 17/05/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT N Y N    
ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning Y N 12/04/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N    
Zero-Shot Cross-Lingual Summarization via Large Language Models Y N 28/02/2023 - Y N Both BROWSER Y Y N N CrossSum (1.2% of test set)
WikiLingua (0.2% of test set)
XSAMSum (<0.1% of test set)
A categorical archive of chatgpt failures N/A   06/02/2023 - Y N ChatGPT BROWSER - -      
Benchmarks for automated commonsense reasoning: A survey Y N 09/02/2023 - Y Y ChatGPT BROWSER Y N N Y custom commonsense reasoning dataset (100.0% of entire set)
Is ChatGPT a Good Causal Reasoner? A Comprehensive Evaluation. Y N 12/05/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y Y    
Causal reasoning and large language models: Opening a new frontier for causality Y N 08/05/2023 - Y N Both API - POST MARCH 2023 OR OPT-OUT N Y N    
Theory of mind may have spontaneously emerged in large language models Y N 04/02/2023 March 2023 Y Y Both BROWSER Y Y N Y theory-of-mind-custom 1/02 (100.0% of entire set)
Boosting Theory-of-Mind Performance in Large Language Models via Prompting Y N 22/04/2023 - Y Y Both BROWSER N N N Y theory-of-mind-custom 2 (100.0% of entire set)
Does ChatGPT have Theory of Mind? Y N 23/05/2023 March 2023 Y N Both BROWSER N N N Y theory-of-mind-custom 3 (100.0% of entire set)
Can ChatGPT Defend the Truth? Automatic Dialectical Evaluation Elicits LLMs’ Deficiencies in Reasoning Y N 22/05/2023 - N N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N    
Evaluating the logical reasoning ability of chatgpt and gpt-4 Y N 07/04/2023 - Y Y Both BROWSER N Y N N LogiQA 2.0 (100.0% of test set)
LogiQA 2.0-zh (100.0% of test set)
MED (100.0% of test set)
ConjNLI (100.0% of test set)
TaxiNLI (100.0% of test set)
AR-LSAT (100.0% of test set)
ReClor (100.0% of dev set)
ConTRoL “logical reasoning” (100.0% of test set)
HELP (100.0% of test set)
MNLI (100.0% of test set)
custom-LogiQA 2.0 (100.0% of entire set)
We’re Afraid Language Models Aren’t Modeling Ambiguity Y N 27/04/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT N Y N    
Sparks of artificial general intelligence: Early experiments with gpt-4 N/A   - - - - GPT-4 API - POST MARCH 2023 OR OPT-OUT - -      
Chatgpt participates in a computer science exam Y N 08/03/2023 February 2023 Y Y Both BROWSER N N N Y custom-CS exam (100.0% of entire set)
Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls Y N 30/05/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
Mathematical capabilities of chatgpt Y N 31/01/2023 January 2023 Y Y Both BROWSER N N N N GHOSTS (100.0% of entire set)
Could an artificial-intelligence agent pass an introductory physics course? Y Y 28/01/2023 January 2023 Y PART ChatGPT BROWSER N N N Y custom-made physics exam&homeworks (100.0% of entire set)
Investigating the Use of an Artificial Intelligence Chatbot with General Chemistry Exam Questions Y Y 10/01/2023 March 2023 Y Y ChatGPT BROWSER N N N Y custom-made chemistry exam (100.0% of entire set)
How do physics students evaluate artificial intelligence responses on comprehension questions? A study on the perceived scientific accuracy and linguistic quality of ChatGPT Y Y 12/04/2023 February 2023 N/A Y ChatGPT BROWSER N N N Y custom-made physics exam (100.0% of entire set)
Chatgpt goes to law school Y Y 23/01/2023 December 2022 Y N ChatGPT BROWSER N N N Y custom-made law exam (100.0% of entire set)
ChatGPT has aced the test of understanding in college economics: Now what? Y Y 08/03/2023 February 2023 Y N ChatGPT BROWSER N N N/A N TUCE test (100.0% of entire set)
The Wall Street Neophyte: A Zero-Shot Analysis of ChatGPT Over MultiModal Stock Movement Prediction Challenges Y N 10/04/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N    
How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment Y Y 26/12/2022 November 2022 Y N ChatGPT BROWSER Y N N/A Y NBME medical questions (100.0% of entire set)
AMBOSS medical questions (1.7% of entire set)
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models Y Y 21/12/2022 November 2022 Y Y ChatGPT BROWSER Y Y Y Y USMLE medical q (93.0% of entire set)
Evaluating the performance of chatgpt in ophthalmology: An analysis of its successes and shortcomings Y Y 26/01/2023 - Y N ChatGPT BROWSER Y N N/A Y BCSC Self-Assessment Program (5.8% of entire set)
OphthoQuestions (5.7% of entire set)
Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study Y Y 15/02/2023 January 2023 Y Y ChatGPT BROWSER Y N N/A Y custom medical diagnosis dataset (100.0% of entire set)
Assessing the utility of ChatGPT throughout the entire clinical workflow Y Y 26/02/2023 January 2023 Y N ChatGPT BROWSER Y N N/A N Merck Sharp & Dohme (MSD) Clinical Manual (100.0% of entire set)
ChatGPT as a medical doctor? A diagnostic accuracy study on common and rare diseases Y N 26/04/2023 April 2023 Y N Both BROWSER Y N N/A Y custom medical 1 (100.0% of entire set)
Can AI tell good stories? narrative transportation and persuasion with ChatGPT Y N 05/04/2023 - N N ChatGPT BROWSER N N N/A Y custom narrative (100.0% of entire set)
Benchmarking Foundation Models with Language-Model-as-an-Examiner Y N 07/06/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT Y Y N    
Chatgpt: A meta-analysis after 2.5 months N/A   - - N/A N/A ChatGPT   N/A N/A N/A    
Causal-discovery performance of chatgpt in the context of neuropathic pain diagnosis Y N 24/01/2023 January 2023 Y Y ChatGPT BROWSER Y N N/A Y custom medical 2 (100.0% of entire set)
Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models N/A   - - N/A N/A ChatGPT   N/A N/A N/A    
Harnessing the power of llms in practice: A survey on chatgpt and beyond N/A   - - N/A N/A -   - -      
Holistic evaluation of language models Y Y 1/10/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT N Y N    
A Survey on Evaluation of Large Language Models N/A   - - N/A N/A -   N/A N/A N/A    
Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert Y N 19/02/2023 February 2023 Y Y ChatGPT BROWSER Y Y N N CoLa (2.5% of dev set)
SST-2 (1.3% of dev set)
MRPC (1.4% of dev set)
STS-B (3.5% of dev set)
QQP (<0.1% of dev set)
MNLI (0.1% of val set)
QNLI (0.4% of dev set)
RTE (0.8% of dev set)
WNLI (17.1% of dev set)
On the robustness of chatgpt: An adversarial and out-of-distribution perspective Y N 22/02/2023 February 2023 Y Y ChatGPT BROWSER Y Y N N AdvGLUE_SST-2 (100.0% of dev set)
AdvGLUE_QQP (100.0% of dev set)
AdvGLUE_MNLI (100.0% of dev set)
AdvGLUE_QNLI (100.0% of dev set)
AdvGLUE_RTE (100.0% of dev set)
custom AdvGLUE-T (100.0% of dev set)
ANLI-R3 (100.0% of test set)
Flipkart (0.1% of entire set)
DDXPlus (EN) (<0.1% of test set)
An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP) Y N 23/02/2023 January and February 2023 Y Y ChatGPT BROWSER N N N/A N DRAW-1K (100.0% of entire set)
Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family Y N 14/03/2023 February and March 2023 Y Y ChatGPT BROWSER Y Y N N KQApro (90.0% of entire set)
LC-quad2.0 (100.0% of entire set)
WQSP (99.2% of entire set)
CWQ (100.0% of entire set)
GrailQA (100.0% of test set)
GraphQuestions (100.0% of entire set)
QALD-9 (100.0% of entire set)
MKQA (2.3% of entire set)
Instruction tuning with gpt-4 Y N 06/04/2023 - Y Y GPT-4 API - POST MARCH 2023 OR OPT-OUT N N N/A    
How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks N/A   01/03/2023 - N/A N/A - API - POST MARCH 2023 OR OPT-OUT N/A N/A N/A   custom-made psychological dataset (100.0% of entire set)
Consistency analysis of chatgpt Y N 11/03/2023 February 2023 Y N ChatGPT BROWSER Y Y Y N becel-SNLI (4.9% of test set)
becel-RTE (36.6% of test set)
becel-MRPC (16.6% of test set)
Does ChatGPT resemble humans in language use? Y N 10/03/2023 December 2022 and January 2023 Y Y ChatGPT BROWSER N N N/A Y  
A comprehensive capability analysis of gpt-3 and gpt-3.5 series models Y N 18/03/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A    
A comprehensive evaluation of ChatGPT’s zero-shot Text-to-SQL capability Y N 12/03/2023 - Y Y ChatGPT BROWSER N Y N N Spider (100.0% of dev set)
Spider-SYN (100.0% of dev set)
Spider-DK (100.0% of dev set)
Spider-Realistic (100.0% of dev set)
Spider-CG(sub) (100.0% of entire set)
Spider-CG(app) (100.0% of entire set)
ADVETA(rpl)-spider (100.0% of dev set)
ADVETA(add)-spider (100.0% of dev set)
CSpider (100.0% of dev set)
DuSQL (100.0% of entire set)
SParC (100.0% of dev set)
CoSQL (100.0% of dev set)
Is ChatGPT a good sentiment analyzer? A preliminary study Y N 10/04/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y Y    
A preliminary evaluation of chatgpt for zero-shot dialogue understanding Y N 09/04/2023 February 2023 Y N ChatGPT BROWSER N Y N N ATIS (100.0% of test set)
SNIPS (100.0% of test set)
MultiWOZ 2.1 (100.0% of test set)
MultiWOZ 2.4 (100.0% of test set)
Zero-shot Temporal Relation Extraction with ChatGPT Y Y 11/04/2023 - Y N ChatGPT BROWSER N Y N N TimeBank-Dense (100.0% of test set)
MATRES (100.0% of test set)
TDDiscourse (100.0% of test set)
Can chatgpt reproduce human-generated labels? a study of social computing tasks Y N 22/04/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A    
Chatgraph: Interpretable text classification by converting chatgpt knowledge to graphs Y Y 19/09/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N    
Uncovering the Potential of ChatGPT for Discourse Analysis in Dialogue: An Empirical Study Y N 15/05/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
Evaluating ChatGPT’s Performance for Multilingual and Emoji-based Hate Speech Detection Y N 22/05/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
Sentiment Analysis in the Era of Large Language Models: A Reality Check Y N 24/05/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N    
ChatGPT for Suicide Risk Assessment on Social Media: Quantitative Evaluation of Model Performance, Potentials and Limitations Y Y 15/06/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
Metacognitive Prompting Improves Understanding in Large Language Models Y N 10/08/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
Exploring ai ethics of chatgpt: A diagnostic analysis Y N 30/01/2023 December 2022 -January 2023 PART N ChatGPT BROWSER Y Y N N BBQ (80.0% of entire set)
BOLD (80.0% of entire set)
custom-made query to inspect gender occupation bias (100.0% of entire set)
IMDB (1.9% of test set)
IMDB-contrast (100.0% of entire set)
BoolQ (10.3% of test set)
BoolQ-contrast (100.0% of test set)
FLORES-200 (80.0% of dev set)
ProsocialDialog (0.2% of test set)
custom jailbreaking (100.0% of entire set)
OpenBookQA (80.0% of test set)
TruthfulQA (100.0% of test set)
TruthfulQA gen (12.2% of test set)
RealToxicityPrompts (80.0% of entire set)
custom character role-play (100.0% of entire set)
Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech Y Y 11/02/2023 January 2023 Y N ChatGPT BROWSER Y N N/A N LatentHatred (12.0% of entire set)
Investigating Chain-of-thought with ChatGPT for Stance Detection on Social Media Y N 06/04/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
The Self-Perception and Political Biases of ChatGPT Y N 14/04/2023 March 2023 N N ChatGPT BROWSER N N N/A N custom political compass test (100.0% of entire set)
Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting Y N 11/05/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A    
BAD: BiAs Detection for Large Language Models in the context of candidate screening Y N 17/05/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
DECODINGTRUST: A Comprehensive Assessment of Trustworthiness in GPT Models Y N 20/06/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT Y N N/A    
On Large Language Models’ Selection Bias in Multi-Choice Questions Y N 08/09/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y Y    
Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization Y N 16/02/2023 February 2023 Y N ChatGPT BROWSER Y Y Y N QMSum (35.5% of test set)
SQuALITY (55.5% of test set)
CovidET (25.1% of test set)
NEWTS (16.6% of test set)
ChatGPT as a Factual Inconsistency Evaluator for Text Summarization Y N 13/04/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
Uncovering ChatGPT’s Capabilities in Recommender Systems Y N 24/08/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A    
HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models Y N 22/05/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text Y N 22/05/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N/A N N/A    
PokemonChat: Auditing ChatGPT for Pokémon Universe Knowledge Y N 05/06/2023 December 2022 Y N ChatGPT BROWSER N/A N N/A Y custom pokemon dataset (100.0% of entire set)
Hybrid Long Document Summarization using C2F-FAR and ChatGPT: A Practical Study Y N 01/06/2023 - Y N ChatGPT BROWSER Y N N N custom book and article summarization dataset (100.0% of entire set)
Generative Job Recommendations with Large Language Model Y N 05/06/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N/A N/A N/A    
Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews Y N 12/06/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
L-Eval: Instituting Standardized Evaluation for Long Context Language Models Y N 31/07/2023 - Y Y GPT-4 API - POST MARCH 2023 OR OPT-OUT N Y N    
LLM-Rec: Personalized Recommendation via Prompting Large Language Models N/A   2023 - - - -   - -      
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding Y N 28/08/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
Can Large Language Models Understand Real-World Complex Instructions Y N 27/09/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT N Y N    
Mind meets machine: Unravelling GPT-4’s cognitive psychology Y N 20/03/2023 - Y N GPT-4 BROWSER N N N/A N  
Capabilities of GPT-4 on Medical Challenge Problems Y N 11/04/2023 - Y N Both API - POST MARCH 2023 OR OPT-OUT N Y N    
GPT is becoming a Turing machine: Here are some ways to program it Y N 25/03/2023 - Y N GPT-4 BROWSER Y N N/A Y custom-made dataset of coding problems (100.0% of entire set)
ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models Y N 29/03/2023 March 2023 Y N ChatGPT BROWSER Y N N/A N CommonsenseQA (9.8% of dev set)
OpenBookQA (24.0% of dev set)
WSC (36.6% of dev set)
PIQA (6.5% of dev set)
Social IQA (6.1% of dev set)
ARC (33.4% of dev set)
QASC (12.9% of dev set)
HellaSWAG (1.2% of dev set)
NumerSense (10.6% of dev set)
MC-TACO (3.1% of dev set)
ProtoQA (100.0% of dev set)
Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure Y N 30/03/2023   Y N Both BROWSER N N N/A N custom-made theory of reason (100.0% of entire set)
Generative Models as a Complex Systems Science: How can we make sense of large language model behavior? N/A   31/07/2023 - - - -   - -      
How well do Large Language Models perform in Arithmetic tasks? Y N 16/03/2023 - Y Y Both BROWSER Y Y Y N custom mathematical (100.0% of entire set)
Uncertainty in Natural Language Generation: From Theory to Applications N/A   28/07/2023 - N/A N/A -   N/A N/A N/A    
ChatGPT-Crawler: Find out if ChatGPT really knows what it’s talking about Y N 06/04/2023 - N Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A    
How is ChatGPT’s behavior changing over time? Y N 18/07/2023 - N Y Both API - POST MARCH 2023 OR OPT-OUT Y N/A N/A    
ChatABL: Abductive Learning via Natural Language Interaction with ChatGPT Y N 21/04/2023 - PART N Both API - POST MARCH 2023 OR OPT-OUT N N N/A    
Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models Y N 16/07/2023 - PART N Both API - POST MARCH 2023 OR OPT-OUT Y Y Y    
StructGPT: A General Framework for Large Language Model to Reason over Structured Data Y N 16/05/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N    
Chain-of-Symbol Prompting Elicits Planning in Large Langauge Models Y N 26/06/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
Tree of Thoughts: Deliberate Problem Solving with Large Language Models Y N 17/05/2023 - Y Y GPT-4 API - POST MARCH 2023 OR OPT-OUT Y N N/A    
ChatLog: Recording and Analyzing ChatGPT Across Time Y N 27/04/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
Chain of Knowledge: A Framework for Grounding Large Language Models with Structured Knowledge Bases Y N 3/10/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
ChatHaruhi: Reviving Anime Character in Reality via Large Language Model. Y N 18/08/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N/A Y N/A    
Adaptive Chameleon or Stubborn Sloth: Unraveling the Behavior of Large Language Models in Knowledge Conflicts Y N 3/10/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT N Y N    
GPT-3.5 vs GPT-4: Evaluating ChatGPT’s Reasoning Performance in Zero-shot Learning Y N 21/05/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT N N N    
LogiCoT: Logical Chain-of-Thought Instruction-Tuning Data Collection with GPT-4 Y N 20/05/2023 - Y Y GPT-4 API - POST MARCH 2023 OR OPT-OUT N/A N/A N/A    
Enabling Large Language Models to Generate Text with Citations Y N 24/05/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N    
Why Does ChatGPT Fall Short in Providing Truthful Answers? Y N 24/05/2023 - Y N Both API - POST MARCH 2023 OR OPT-OUT Y N N/A    
Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation From Deductive, Inductive and Abductive Views Y N 08/08/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A    
Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation Y N 23/07/2023 late May and early June 2023 Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models Y N 20/07/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT N N N/A    
Think-on-Graph: Deep and Responsible Reasoning of Large Language Model with Knowledge Graph Y N 15/07/2023 - N N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A    
Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models Y N 14/08/2023 - Y N Both API - POST MARCH 2023 OR OPT-OUT Y Y N    
Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions Y N 22/08/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
Large Language Models on the Chessboard: A Study on ChatGPT’s Formal Language Comprehension and Complex Reasoning Skills Y N 29/08/2023 - PART N ChatGPT API - POST MARCH 2023 OR OPT-OUT N/A N N/A    
Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis Y N 22/08/2023 June 2023 Y Y Both BROWSER N/A Y N N custom graph (100.0% of entire set)
Exploring New Frontiers in Agricultural NLP: Investigating the Potential of Large Language Models for Food Applications Y N 20/06/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A    
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback Y N 19/09/2023 - Y Y GPT-4 API - POST MARCH 2023 OR OPT-OUT N Y N    
ChatGPT for Robotics: Design Principles and Model Abilities N/A   20/02/2023 - Y Y ChatGPT UNCLEAR - N/A      
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT Y N 24/03/2023 March 2023 Y Y ChatGPT BROWSER Y N N/A N WMT20 EN-DE (4.0% of test set)
WMT20 ZH-EN (4.0% of test set)
Towards Making the Most of ChatGPT for Machine Translation Y N 24/03/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
Linguistically Informed ChatGPT Prompts to Enhance Japanese-Chinese Machine Translation: A Case Study on Attributive Clauses Y N 27/03/2023 - Y N ChatGPT UNCLEAR N/A N N/A N  
How to Design Translation Prompts for ChatGPT: An Empirical Study Y N 21/04/2023 - Y N ChatGPT BROWSER Y N N/A N Flores-101 (4.9% of test set)
WMT2014 Ott et al 2018 (10.0% of test set)
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation Y N 10/06/2023 - N/A Y Both API - POST MARCH 2023 OR OPT-OUT N Y N    
Evaluating ChatGPT’s Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness Y N 23/04/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A    
Can ChatGPT Assess Human Personalities? A General Evaluation Framework Y N 07/03/2023 - Y Y ChatGPT BROWSER N N N/A N Myers-Briggs Type Indicator (100.0% of entire set)
Is ChatGPT Equipped with Emotional Dialogue Capabilities? Y N 19/04/2023 - N Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
AugGPT: Leveraging ChatGPT for Text Data Augmentation Y N 20/03/2023 - Y N ChatGPT BROWSER N N N/A N Amazon (customer reviews) (<0.1% of train set)
Symptoms (on Kaggle) (100.0% of train set)
PubMed20K (100.0% of train set)
Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations Y N 05/04/2023 31/03/2023 Y Y Both API - POST MARCH 2023 OR OPT-OUT N N N/A    
To ChatGPT, or not to ChatGPT: That is the question! Y N 04/04/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N/A N/A    
In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT Y N 18/08/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A    
Analyzing ChatGPT’s Aptitude in an Introductory Computer Engineering Course Y N 13/03/2023 - Y Y ChatGPT BROWSER Y N N/A Y custom CS exams/homeworks (100.0% of entire set)
Linguistic ambiguity analysis in ChatGPT Y N 13/02/2023 - Y Y ChatGPT BROWSER N N N/A Y custom linguistic ambiguity (100.0% of entire set)
custom (100.0% of entire set)
Using ChatGPT for Entity Matching Y N 05/05/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N    
MEGA: Multilingual Evaluation of Generative AI Y Y 22/10/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT Y Y Y    
Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers Y Y 07/06/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
Zero-shot Clinical Entity Recognition using ChatGPT Y N 29/03/2023 22/02/2023 Y N ChatGPT BROWSER Y Y N N i2b2 2010 (33.5% of entire set)
MTSamples (1.9% of entire set)
Human-like Summarization Evaluation with ChatGPT Y N 05/04/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
“HOT” ChatGPT: The promise of ChatGPT in detecting and discriminating hateful, offensive, and toxic comments on social media Y N 20/04/2023 - Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A    
Large language models effectively leverage document-level context for literary translation, but critical errors persist Y N 06/04/2023 - Y Y GPT-4 API - POST MARCH 2023 OR OPT-OUT N/A N/A N/A    
BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models Y N 19/06/2023 - Y Y ChatGPT BROWSER Y Y N N WMT22 (0.1% of test set)
Neural Machine Translation Data Generation and Augmentation using ChatGPT N/A   11/07/2023 - N/A N/A - - N/A N/A N/A    
ChatGPT MT: Competitive for High- (but not Low-) Resource Languages Y N 14/09/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT Y Y Y    
How would Stance Detection Techniques Evolve after the Launch of ChatGPT? Y N 30/12/2022 - Y N ChatGPT BROWSER N Y N N SemEval-2016 Task 6 (100.0% of test set)
P-Stance (100.0% of test set)
BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models Y N 12/09/2023 - Y N GPT-4 API - POST MARCH 2023 OR OPT-OUT Y N N/A    
Zero-shot Approach to Overcome Perturbation Sensitivity of Prompts N/A Y 25/05/2023 - Y Y -   -        
UZH_CLyp at SemEval-2023 Task 9: Head-First Fine-Tuning and ChatGPT Data Generation for Cross-Lingual Learning in Tweet Intimacy Prediction Y Y 02/03/2023 December 2022 Y N ChatGPT BROWSER Y N/A N/A N SemEval-2023 Task 9 (0.001% of training set)
Exploring the Feasibility of ChatGPT for Event Extraction Y N 07/03/2023   Y N ChatGPT BROWSER Y Y N N ACE05 (0.3% of test set)
Does Synthetic Data Generation of LLMs Help Clinical Text Mining? Y N 08/03/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N    
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction Y N 09/03/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT - - -    
ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design Y N 11/03/2023 January - February 2023 Y N/A ChatGPT BROWSER N/A N/A N/A Y  
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models Y N 15/03/2023 - Y Y -   N/A N/A N/A    
Translating Radiology Reports into Plain Language using ChatGPT and GPT-4 with Prompt Learning: Promising Results, Limitations, and Potential Y Y 16/03/2023   Y N Both API - POST MARCH 2023 OR OPT-OUT - N N/A    
An Empirical Study of Pre-trained Language Models in Simple Knowledge Graph Question Answering Y Y 18/03/2023 January 2023 Y Y ChatGPT BROWSER Y Y Y N SimpleQuestions (0.3% of entire set)
WebQuestionsSP (4.5% of entire set)
FreebaseQA (1.0% of entire set)
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 Y N 20/03/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT Y Y Y    
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action Y N 20/03/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N N/A N/A    
Is ChatGPT A Good Keyphrase Generator? A Preliminary Study Y N 23/03/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N    
Beyond Black Box AI-Generated Plagiarism Detection: From Sentence to Document Level Y Y 13/06/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A N  
Comparing Abstractive Summaries Generated by ChatGPT to Real Summaries Through Blinded Reviewers and Text Classification Algorithms Y N 30/03/2023 February 2023 Y Y ChatGPT BROWSER Y N N/A N CNN DailyMail (0.4% of test set)
MGTBench: Benchmarking Machine-Generated Text Detection N/A   26/03/2023   N/A N/A - - N/A N/A N/A    
Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs Y N 30/03/2023   N N Both API - POST MARCH 2023 OR OPT-OUT Y Y Y    
What would Harry say? Building Dialogue Agents for Characters in a Story Y N 13/11/2022   N N ChatGPT BROWSER Y Y Y N custom Harry Potter dialogues (100.0% of entire set)
Reliable Natural Language Understanding with Large Language Models and Answer Set Programming N/A       N/A N/A -   N/A N/A N/A    
ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models Y Y 20/12/2022 - Y Y ChatGPT BROWSER Y Y Y Y  
Transformers Go for the LOLs: Generating (Humourous) Titles from Scientific Abstracts End-to-End Y N 20/12/2022   Y N ChatGPT BROWSER Y Y Y Y custom EMNLP 2022 abstracts (4.8% of entire set)
Modeling Label Semantics Improves Activity Recognition N/A   01/01/2023   N/A N/A -   N/A N/A N/A    
An Analysis of the Automatic Bug Fixing Performance of ChatGPT Y N 20/01/2023 December 2022 - January 2023 Y Y ChatGPT BROWSER N Y N N QuixBugs (100.0% of entire set)
Chat2VIS: Generating Data Visualisations via Natural Language using ChatGPT, Codex and GPT-3 Large Language Models Y Y 04/02/2023   Y N ChatGPT BROWSER N Y N N  
ChatGPT versus Traditional Question Answering for Knowledge Graphs: Current Status and Future Directions Towards Knowledge Graph Chatbots Y N 08/02/2023   - N ChatGPT BROWSER Y Y N N QALD-9 (100.0% of test set)
LCQuAD-2.0 (0.3% of entire set)
YAGO-Bench (100.0% of entire set)
DBLP-Bench (100.0% of entire set)
MAG-Bench (100.0% of entire set)
ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models Y N 14/02/2023 January 2023 Y N ChatGPT BROWSER Y Y Y N MIMIC (0.5% of test set)
Guiding Large Language Models via Directional Stimulus Prompting Y N 07/07/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N    
Utilizing ChatGPT Generated Data to Retrieve Depression Symptoms from Social Media N/A   05/07/2023 - N/A - - - N/A N/A N/A    
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback Y N 24/02/2023   N/A N ChatGPT BROWSER Y N N/A N DSTC7 Track 2 (100.0% of entire set)
DSTC11 Track 5 (100.0% of val set)
OTT-QA (100.0% of entire set)
ChatGPT for Zero-shot Dialogue State Tracking: A Solution or an Opportunity? Y Y 02/06/2023 January 2023 Y Y ChatGPT BROWSER Y Y N N MultiWOZ 2.1 (13.5% of test set)
ChatGPT vs Human-authored Text: Insights into Controllable Text Summarization and Sentence Style Transfer Y Y 01/07/2023 February to April 2023 Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N N  
Not The End of Story: An Evaluation of ChatGPT-Driven Vulnerability Description Mappings Y Y 01/07/2023 - Y Y ChatGPT BROWSER N Y N N CVE-2021 (100.0% of entire set)
CVE-ATT&CK (100.0% of entire set)
BRON (preprocessed) (100.0% of entire set)
ChatGPT is not a good indigenous translator Y Y 01/07/2023   N N GPT-4 API - POST MARCH 2023 OR OPT-OUT N Y N N  
You’ve Got a Friend in … a Language Model? A Comparison of Explanations of Multiple-Choice Items of Reading Comprehension between ChatGPT and Humans Y Y 01/07/2023 December 2022 and January 2023 Y Y ChatGPT BROWSER N N N/A Y custom - Evaluar para Avanzar (EpA) exams (100.0% of entire set)
Language-Agnostic Transformers and Assessing ChatGPT-Based Query Rewriting for Multilingual Document-Grounded QA Y Y 03/05/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A N  
HW-TSC at SemEval-2023 Task 7: Exploring the Natural Language Inference Capabilities of ChatGPT and Pre-trained Language Model for Clinical Trial Y Y 01/07/2023 January 2023 Y N ChatGPT BROWSER N Y N Y SemEval-2023 Task 7 (100.0% of entire set)
Can ChatGPT Understand Causal Language in Science Claims? Y Y 01/07/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A N  
ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models N/A Y 02/06/2023 January to March 2023 Y Y ChatGPT BROWSER N/A N/A N/A N  
Assessing Cross-Cultural Alignment between ChatGPT and Human Societies: An Empirical Study Y Y 30/03/2023 - Y Y ChatGPT BROWSER Y N N/A Y Hofstede Cultural Survey (100.0% of entire set)
Evaluating Reading Comprehension Exercises Generated by LLMs: A Showcase of ChatGPT in Education Applications Y Y 01/07/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N N  
ChatGPT vs. Crowdsourcing vs. Experts: Annotating Open-Domain Conversations with Speech Functions Y Y 01/09/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A N  
Leveraging Large Language Models for Automated Dialogue Analysis Y Y 01/09/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N N  
Breaking the Bank with ChatGPT: Few-Shot Text Classification for Finance Y Y 28/08/2023   Y N Both API - POST MARCH 2023 OR OPT-OUT Y Y N N  
Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction Y Y 05/06/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y N N/A N  
Credible Without Credit: Domain Experts Assess Generative Language Models Y Y 01/07/2023 - Y Y ChatGPT BROWSER Y N N/A Y custom qa (100.0% of entire set)
SUT at SemEval-2023 Task 1: Prompt Generation for Visual Word Sense Disambiguation N/A Y 01/07/2023   N/A N/A   - N/A N/A N/A N/A  
How Well Do Large Language Models Perform on Faux Pas Tests? Y Y 24/05/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT Y Y N    
Zero-Shot Information Extraction for Clinical Meta-Analysis using Large Language Models Y Y 01/07/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N N  
Can Large Language Models Be an Alternative to Human Evaluation? Y Y 03/05/2023 December 2022 - January 2023 Y Y ChatGPT BROWSER Y Y N N WritingPrompts (1.3% of test set)
RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs N/A Y 15/05/2023   N/A N/A     N/A N/A N/A N/A  
Toward Human-Like Evaluation for Natural Language Generation with Error Analysis N/A Y 20/12/2022   N/A N/A     N/A N/A N/A N/A  
Multi-Level Knowledge Distillation for Out-of-Distribution Detection in Text N/A Y 21/11/2022   N/A N/A     N/A N/A N/A N/A  
MolXPT: Wrapping Molecules with Text for Generative Pre-training N/A Y 18/05/2023   N/A N/A     N/A N/A N/A N/A  
MUX-PLMs: Pre-training Language Models with Data Multiplexing N/A Y 24/02/2023   N/A N/A     N/A N/A N/A N/A  
Is GPT-3 a Good Data Annotator? Y 20/12/2022   Y N Both BROWSER -        
Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey N/A Y 14/10/2022   N/A N/A     N/A N/A N/A N/A  
Detoxifying Online Discourse: A Guided Response Generation Approach for Reducing Toxicity in User-Generated Text Y Y 01/07/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N Y  
Faithful Question Answering with Monte-Carlo Planning Y Y 04/05/2023 December 2022 Y Y ChatGPT BROWSER Y Y Y N EntailmentBank (15.0% of test set)
Nut-cracking Sledgehammers: Prioritizing Target Language Data over Bigger Language Models for Cross-Lingual Metaphor Detection Y Y 01/09/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N N  
Generating Faithful Text From a Knowledge Graph with Noisy Reference Text Y Y 12/08/2023   Y   ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N Y  
Empowering Conversational Agents using Semantic In-Context Learning Y Y 01/07/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A N  
Can Large Language Models Safely Address Patient Questions Following Cataract Surgery? Y Y 01/07/2023 December 2022 Y N ChatGPT BROWSER N N N Y custom-sourced post-operations queries from patients (100.0% of entire set)
Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models Y Y 15/06/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y Y N  
Generative Pretrained Transformers for Emotion Detection in a Code-Switching Setting Y Y 01/07/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N N  
On the Underspecification of Situations in Open-domain Conversational Datasets Y Y 01/07/2023   Y N ChatGPT BROWSER - N/A N/A    
Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors Y Y 25/05/2023 - Y Y ChatGPT UNCLEAR -        
Multilingual Language Models are not Multicultural: A Case Study in Emotion N/A Y 03/06/2023           -        
Good Data, Large Data, or No Data? Comparing Three Approaches in Developing Research Aspect Classifiers for Biomedical Papers Y Y 07/06/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT Y Y N N  
IDOL: Indicator-oriented Logic Pre-training for Logical Reasoning Y Y 27/06/2023 February 2023 Y Y ChatGPT BROWSER Y Y N N ReClor (0.4% of entire set)
LogiQA (0.3% of entire set)
PrecogIIITH@WASSA2023: Emotion Detection for Urdu-English Code-mixed Text Y Y 01/07/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A N  
Do PLMs Know and Understand Ontological Knowledge? Y Y 01/07/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N Y  
Evaluation of Question Generation Needs More References Y Y 26/05/2023   Y N ChatGPT BROWSER N Y N N Quiz_Design (100.0% of entire set)
A System for Answering Simple Questions in Multiple Languages Y Y 01/07/2023 March 2023 version of ChatGPT Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N N  
HIT-SCIR at WASSA 2023: Empathy and Emotion Analysis at the Utterance-Level and the Essay-Level Y Y 01/07/2023   Y N ChatGPT UNCLEAR - N/A N/A N  
Pre-trained Language Models Can be Fully Zero-Shot Learners Y Y 26/05/2023 March 2023 Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N N  
Predicting the Quality of Revisions in Argumentative Writing Y Y 01/06/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N N N/A Y  
Exploring Effectiveness of GPT-3 in Grammatical Error Correction: A Study on Performance and Controllability in Prompt-Based Methods N/A Y 29/05/2023           -        
When Truth Matters – Addressing Pragmatic Categories in Natural Language Inference (NLI) by Large Language Models (LLMs) Y Y 01/07/2023 December 2022 Y Y ChatGPT BROWSER Y N N/A N MNLI (0.4% of val set)
Debiasing should be Good and Bad: Measuring the Consistency of Debiasing Techniques in Language Models N/A Y 23/05/2023           -        
GPoeT: a Language Model Trained for Rhyme Generation on Synthetic Data N/A Y 01/05/2023           -        
Examining Bias in Opinion Summarisation Through the Perspective of Opinion Diversity Y Y 01/07/2023 March 2023 version of ChatGPT Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N N  
Unsupervised Summarization Re-ranking Y Y 14/05/2023 - N Y ChatGPT API - POST MARCH 2023 OR OPT-OUT Y Y N N  
Improving Dutch Vaccine Hesitancy Monitoring via Multi-Label Data Augmentation with GPT-3.5 Y Y 01/07/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N/A N N/A Y  
What Makes a Good Counter-Stereotype? Evaluating Strategies for Automated Responses to Stereotypical Text Y Y 01/07/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N/A N N/A Y  
What Makes Good Counterspeech? A Comparison of Generation Approaches and Evaluation Metrics Y Y 01/09/2023 May 2023 Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N Y  
DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition Y Y 05/05/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N N  
UMASS_BioNLP at MEDIQA-Chat 2023: Can LLMs generate high-quality synthetic note-oriented doctor-patient conversations? Y Y 29/06/2023 - Y Y Both API - POST MARCH 2023 OR OPT-OUT N Y N N  
Frontier Review of Multimodal AI N/A Y 03/08/2023   N/A N/A - -          
Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation Y N 04/04/2023   Y N ChatGPT API - POST MARCH 2023 OR OPT-OUT N Y N N  
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI Y N 19/07/2023 - Y Y ChatGPT API - POST MARCH 2023 OR OPT-OUT - N/A N/A N  



Acknowledgements

This work was co-funded by the European Union (ERC, NG-NLG, 101039303).
