Paper Information

lrec2026-ws-wildre-11

Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

Abstract

The digitization of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition (NER). While recent methodologies utilize generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and often lack the reasoning depth required for classical grammar. In this work, we introduce Naamah, a high-quality silver-standard Sanskrit NER dataset comprising 102,942 sentences. We propose a methodology that combines entity extraction from DBpedia with the generative capabilities of a 24B-parameter hybrid reasoning model to create grammatically natural and synthetically diverse training data. We utilize this dataset to benchmark two transformer architectures: the massively multilingual XLM-RoBERTa and the parameter-efficient IndicBERTv2. Our experiments reveal a key insight: while both models scale well with synthetic data, IndicBERTv2 qualitatively outperforms XLM-RoBERTa in entity identification and classification. On a fixed split of 92,647 train and 10,295 validation examples, IndicBERTv2 achieves the best validation F1 of 0.9615, outperforming XLM-R’s 0.9506 while remaining substantially lighter for deployment. We demonstrate that the generic tokenizer of XLM-R fractures Sanskrit terms, whereas the domain-adapted tokenizer of IndicBERTv2 preserves semantic integrity.

