2025

unveiLing: What Makes Anon Tricky for LLMs?

Mukund Choudhary, KV Aditya Srivatsa, Gaurja Aeron, ..., Ekaterina Kochmar, Monojit Choudhury

Under Review 2025

Large language models (LLMs) have demonstrated potential in reasoning tasks, but their performance on anon remains consistently poor. Anon, often derived from Anon contests, provide a minimal contamination environment to assess LLMs' linguistic reasoning abilities across low-resource languages. In this work, we analyze LLMs' performance on 629 anon across 41 low-resource languages by labelling each with linguistically informed features to unveil weaknesses. Our analyses show that LLMs struggle with puzzles involving higher morphological complexity and perform better on anon involving linguistic features that are also found in English. We also show that splitting words into morphemes as a pre-processing step improves solvability, indicating a need for more informed and language-specific tokenisers. These findings thus offer insights into some challenges in linguistic reasoning and modelling of low-resource languages.
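
To make the pre-processing idea concrete, here is a minimal sketch of what morpheme-level splitting before prompting could look like. The greedy longest-match segmenter and the toy, Finnish-like morpheme lexicon are illustrative assumptions, not the paper's actual segmentation procedure or data.

```python
# Illustrative sketch only: a greedy longest-match morpheme splitter.
# The morpheme lexicon below is a toy, Finnish-like example; it is not the
# paper's actual pipeline or data.

MORPHEME_LEXICON = {
    "talo", "ssa", "ni", "kirja", "t", "lla",  # hypothetical entries
}

def split_into_morphemes(word: str, lexicon=MORPHEME_LEXICON) -> list[str]:
    """Greedily split `word` into the longest known morphemes, left to right.
    Unknown residues are kept as single characters so nothing is dropped."""
    morphemes, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest span first
            if word[i:j] in lexicon:
                morphemes.append(word[i:j])
                i = j
                break
        else:                                     # no known morpheme found
            morphemes.append(word[i])
            i += 1
    return morphemes

def preprocess_puzzle_line(line: str) -> str:
    """Rewrite a puzzle line with explicit morpheme boundaries ('+')."""
    return " ".join("+".join(split_into_morphemes(w)) for w in line.split())

print(preprocess_puzzle_line("talossani kirjat"))  # -> talo+ssa+ni kirja+t
```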

Llama-3-AnonFlavor: An Open Generative LLM for AnonLang

Monojit Choudhury, ..., Mukund Choudhary, ..., Preslav Nakov

Under Review 2025

We introduce Llama-3-AnonFlavor-Chat, or Anon for short, a new state-of-the-art AnonLang-centric, instruction-tuned, open generative large language model (LLM). Anon is adapted from the LLaMA-3-8B model via continuous pretraining with an expansion of transformer blocks, following the LLaMA Pro approach. The model employs a decoder-only architecture and has been trained on a mixture of AnonLang and English texts. With 10 billion parameters, Anon demonstrates improved knowledge and reasoning capabilities in AnonLang, surpassing existing open AnonLang and multilingual models of comparable size by a substantial margin; it also achieves highly competitive performance in English. We release Anon as an open-source, instruction-tuned model and provide a detailed overview of its training, tuning, safety alignment, and evaluation processes. We believe that this release will foster further research in AnonLang LLMs and support diverse practical applications across various domains.
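
For intuition, here is a rough PyTorch-style sketch of LLaMA Pro-style block expansion, the depth-expansion idea the abstract refers to. The `expand_blocks` helper and the attribute names (`self_attn.o_proj`, `mlp.down_proj`, following the common Hugging Face LLaMA layout) are assumptions for illustration, not the model's released training code.

```python
# Rough sketch (assumption, not released training code): LLaMA Pro-style depth
# expansion. Copies of existing decoder blocks are interleaved into the stack
# with their output projections zero-initialised, so each new block is a no-op
# at initialisation and the expanded model starts out matching the base model.
import copy
import torch.nn as nn

def expand_blocks(layers: nn.ModuleList, insert_every: int = 4) -> nn.ModuleList:
    """Insert a zero-initialised copy after every `insert_every` blocks."""
    expanded = []
    for i, block in enumerate(layers):
        expanded.append(block)
        if (i + 1) % insert_every == 0:
            new_block = copy.deepcopy(block)
            # Zero the projections that write into the residual stream
            # (attribute names assume the Hugging Face LLaMA implementation).
            nn.init.zeros_(new_block.self_attn.o_proj.weight)
            nn.init.zeros_(new_block.mlp.down_proj.weight)
            expanded.append(new_block)
    return nn.ModuleList(expanded)

# Hypothetical usage: model.model.layers = expand_blocks(model.model.layers)
# before continued pretraining on the AnonLang/English mixture; which
# parameters are then updated depends on the training recipe.
```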

2023

Neural Models for Factual Inconsistency Classification with Explanations

Tathagata Raha#, Mukund Choudhary, Abhinav S Menon, Harshit Gupta, KV Aditya Srivatsa, Manish Gupta, Vasudeva Varma (# corresponding author)

ECML PKDD 2023

Factual consistency is one of the most important requirements when editing high-quality documents. It is extremely important for automatic text generation systems like summarization, question answering, dialog modeling, and language modeling. Still, automated factual inconsistency detection is rather under-studied. Existing work has focused on (a) finding fake news given a knowledge base in context, or (b) detecting broad contradiction (as part of the natural language inference literature). However, there has been no work on detecting and explaining types of factual inconsistencies in text without any knowledge base in context. In this paper, we leverage existing work in linguistics to formally define five types of factual inconsistencies. Based on this categorization, we contribute a novel dataset, FICLE (Factual Inconsistency CLassification with Explanation), with 8K samples where each sample consists of two sentences (claim and context) annotated with the type and span of inconsistency. When the inconsistency relates to an entity type, it is also labeled at two levels (coarse and fine-grained). Further, we leverage this dataset to train a pipeline of four neural models to predict the inconsistency type with explanations, given a (claim, context) sentence pair. Explanations include the inconsistent claim fact triple, the inconsistent context span, the inconsistent claim component, and coarse and fine-grained inconsistent entity types. The proposed system first predicts inconsistent spans from the claim and context, and then uses them to predict inconsistency types and inconsistent entity types (when the inconsistency is due to entities). We experiment with multiple Transformer-based natural language classification models as well as generative models, and find that DeBERTa performs the best. Our proposed methods provide a weighted F1 of 87% for inconsistency type classification across the five classes. We make the code and dataset publicly available.
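
The staged prediction flow can be pictured with a small sketch. This collapses the paper's four-model pipeline into three illustrative stages; the `*_model` objects, their `predict` interfaces, and the field names are hypothetical stand-ins (e.g. fine-tuned DeBERTa heads), not the released code.

```python
# Minimal sketch of the staged flow described above: spans first, then the
# inconsistency type, then entity types when the inconsistency is entity-based.
# All model objects and interfaces here are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class InconsistencyOutput:
    claim_span: str            # inconsistent span in the claim
    context_span: str          # inconsistent span in the context
    inconsistency_type: str    # one of the five defined types
    coarse_entity_type: Optional[str] = None
    fine_entity_type: Optional[str] = None

def classify_inconsistency(claim: str, context: str,
                           span_model, type_model, entity_model) -> InconsistencyOutput:
    # Stage 1: locate the inconsistent spans in the claim and the context.
    claim_span, context_span = span_model.predict(claim, context)
    # Stage 2: predict the inconsistency type from the pair plus its spans.
    inc_type = type_model.predict(claim, context, claim_span, context_span)
    # Stage 3: if the inconsistency is entity-related, predict entity types
    # at coarse and fine-grained levels.
    coarse = fine = None
    if inc_type == "entity":
        coarse, fine = entity_model.predict(claim_span, context_span)
    return InconsistencyOutput(claim_span, context_span, inc_type, coarse, fine)
```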

CoPara 🥥: The First Dravidian Paragraph-level n-way Aligned Corpus

E Nikhil#, Mukund Choudhary, Radhika Mamidi (# corresponding author)

RANLP: DravidianLangTech 2023

We present CoPara, the first publicly available paragraph-level (n-way aligned) multilingual parallel corpus for Dravidian languages. The collection contains 2856 paragraph/passage pairs between English and four Dravidian languages. We source the parallel paragraphs from the New India Samachar magazine and align them with English as a pivot language. We conduct human and automatic evaluations to validate the high-quality alignment and the richness of the parallel paragraphs across a range of lengths. To show one of the many ways this dataset can be used, we fine-tuned IndicBART, a seq2seq NMT model, on all XX-En language pairs in CoPara; the resulting models outperform existing sentence-level models on standard benchmarks (like BLEU) for sentence-level translations as well as longer texts. We show how this dataset can enrich a model trained for such a task with more contextual cues and beyond-sentence understanding, even in low-resource settings like that of Dravidian languages. Finally, the dataset and models are made publicly available at CoPara to help advance research in Dravidian NLP, parallel multilingual corpora, and beyond-sentence-level tasks like NMT.
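
A minimal sketch of n-way alignment through an English pivot is shown below. The dictionary-based data layout (paragraph IDs mapping to translated paragraphs per language) is an assumption for illustration, not the corpus construction code.

```python
# Illustrative sketch of n-way alignment through an English pivot.
# Assumes each language's data maps an English paragraph ID to that language's
# translated paragraph; this data layout is hypothetical.

def align_nway(english: dict[str, str],
               translations: dict[str, dict[str, str]]) -> list[dict[str, str]]:
    """Keep only paragraph IDs present in English and in every target language,
    producing one n-way aligned record per retained paragraph."""
    langs = list(translations)
    shared_ids = set(english)
    for lang in langs:
        shared_ids &= set(translations[lang])
    aligned = []
    for pid in sorted(shared_ids):
        record = {"en": english[pid]}
        for lang in langs:
            record[lang] = translations[lang][pid]
        aligned.append(record)
    return aligned

# Toy usage with hypothetical language codes and placeholder text:
# align_nway({"p1": "India launched ..."},
#            {"te": {"p1": "..."}, "ta": {"p1": "..."},
#             "kn": {"p1": "..."}, "ml": {"p1": "..."}})
```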

Pseudowords: Generatucing, Evaluadating, and their Impactfluence

Mukund Choudhary#, Bapi Raju Surampudi, Dipti Misra Sharma (# corresponding author)

Masters Thesis @ IIIT Hyderabad 2023

Pseudowords are a part of language that is not translatable to another, as they have no meaning attached to them, while also being constrained to sound like a phonologically valid sequence under the desired language’s native phonotactics. This thesis thus explores automated, language-agnostic pseudoword generation, their evaluation, and their use outside psycholinguistics research and clinical settings. As the thesis progresses, we survey current research, draw inspiration from closely related topics of study, build a pipeline to generate pseudowords, and generate Hindi and English pseudoword candidates for further experimentation. We make this reusable pipeline available in a public repository as one of the deliverables of this work. We then show how evaluation work in this field is very scarce and stitch together an evaluation framework with reproducible details on how to design and analyse a human-in-the-loop experiment for something as tricky as pseudoword judgement, conducted with lay native speakers. After showing various ways to probe a pseudoword set for quality, we compare our results against past sets in English and present observations summarising how comparable they are. However, as there is no Hindi pseudoword dataset yet, we add psycholinguistic features on top of the evaluation metric results for each Hindi pseudoword and release “Soodkosh”, another fully public resource usable for research. Finally, we conduct two separate studies involving pseudowords to show their application, impact, and importance across fields. The first study uses pseudowords to establish a gradient between high-frequency words, low-frequency words, and nonsensical sequences of alphanumerics used as passwords. The aim of this study is to find the correlation, and its strength, between the perceived security and memorability of a password/phrase. The other part of this chapter is an exploration of language models’ performance on aphasia classification and whether replacing pseudowords can help them. This is because pseudowords, like neologisms, mispronunciations, and other novel forms generated by aphasic speakers, are largely out-of-vocabulary for a standard language model trained on a pile of mostly well-formed and coherent data. As these forms are not directly helpful to the field of aphasia research, this work replaces one possible hurdle to see if it is a feasible solution. However, the results show that pseudowords are passively used as features and cannot be replaced directly.
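
To give a flavour of pseudoword generation, here is a toy sketch: a character bigram model trained on a small wordlist, with real words rejected so the output is pronounceable-looking but meaningless. The bigram approach and the tiny wordlist are illustrative assumptions, not the thesis pipeline.

```python
# Toy sketch (not the thesis pipeline): generate pseudoword candidates from a
# character bigram model trained on a wordlist, rejecting real words.
import random
from collections import defaultdict

def train_bigrams(words: list[str]) -> dict[str, list[str]]:
    """Collect, for each character (and the word-start marker '^'), the
    characters that can follow it in the training wordlist."""
    following = defaultdict(list)
    for w in words:
        chars = ["^"] + list(w) + ["$"]
        for a, b in zip(chars, chars[1:]):
            following[a].append(b)
    return following

def generate_pseudoword(following: dict[str, list[str]],
                        lexicon: set[str], max_len: int = 10) -> str:
    """Sample characters bigram-by-bigram until the end marker, retrying if
    the result is a real word or too short."""
    while True:
        out, ch = [], "^"
        while len(out) < max_len:
            ch = random.choice(following[ch])
            if ch == "$":
                break
            out.append(ch)
        candidate = "".join(out)
        if len(candidate) >= 3 and candidate not in lexicon:
            return candidate

words = ["blanket", "planet", "brand", "grant", "plant", "bland"]
model = train_bigrams(words)
print(generate_pseudoword(model, set(words)))  # e.g. "brant" or "blane"
```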

2022

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Kaustubh Dhole#, ..., KV Aditya Srivatsa*, Mukund Choudhary*, ... (* equal contribution, # corresponding author)

NEJLT 2022

Contributed the Butter Fingers Perturbation for Indian Languages to this paper. Abstract: Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository.
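
The "butter fingers" idea is a character-level typo perturbation: with some probability, replace a character with one a typist could plausibly hit instead. The sketch below conveys that idea with a tiny, made-up Devanagari confusion map; it is not the actual NL-Augmenter transformation code.

```python
# Toy sketch of a "butter fingers" style perturbation. The confusion map is
# a made-up, minimal example and not the NL-Augmenter implementation.
import random

CONFUSION_MAP = {
    "क": ["का", "ख"],   # hypothetical near-miss substitutions
    "म": ["मा", "भ"],
    "र": ["ऱ", "रा"],
}

def butter_finger(text: str, prob: float = 0.1, seed: int = 0) -> str:
    """Replace each mapped character with a random near-miss with
    probability `prob`, leaving everything else untouched."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSION_MAP and rng.random() < prob:
            out.append(rng.choice(CONFUSION_MAP[ch]))
        else:
            out.append(ch)
    return "".join(out)

print(butter_finger("नमस्कार मेरा नाम मुकुंद है", prob=0.3))
```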

2021

Is Convenient Secure? Exploring the impact of Metacognitive beliefs in password selection

Mukund Choudhary#, KV Aditya Srivatsa, Ishan Sanjeev Upadhyay, Priyanka Srivastava (# corresponding author)

CogSci 2021

Recently, there has been research on what factors influence a user’s password-setting practices; these include various emotions such as anger, risk-taking tendencies, etc. However, research has shown that factors such as memorability and perceived memorability have a greater influence on password choice. Some recent research has shown a negative correlation between the perceived memorability and the perceived security of passwords, particularly passphrases (which are technically more secure). However, it is unclear whether this effect extends to groups with substantial experience of digital spaces (IT professionals, entrepreneurs, etc.). Furthermore, it has not been determined whether random, uncommonly-worded, or complex-structure passphrases would also maintain the correlation, as opposed to relatively less secure, common/simple passphrases. This study examines this problem using a diverse demographic and different categories of passphrases.
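
The central quantity here is the sign and strength of the association between two rating scales. As a purely illustrative sketch, Spearman rank correlation (via scipy) is one reasonable choice for ordinal ratings; the numbers below are synthetic and do not come from the study.

```python
# Illustrative only: associating perceived memorability with perceived
# security ratings for the same passphrases. The ratings are synthetic.
from scipy.stats import spearmanr

# Hypothetical 7-point ratings for ten passphrases.
perceived_memorability = [6, 7, 5, 6, 2, 3, 1, 4, 2, 5]
perceived_security     = [2, 1, 3, 2, 6, 5, 7, 4, 6, 3]

rho, p_value = spearmanr(perceived_memorability, perceived_security)
# A negative rho would mirror the previously reported trade-off between
# perceived memorability and perceived security.
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```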
