Request Correction

Use this form to request corrections to the paper metadata. Select the fields that need correction and provide the correct information.

Correction Guidelines

Click the edit button next to a field to report a correction.
Fill in the suggested correction value for each field you want to correct.
Provide your name and email so we can contact you if needed.

Paper Information

lrec2026-main-535

Sanskrit Travelogue: A Large-Scale Unified and Annotated Corpus of Sanskrit Texts

Paper Fields

Abstract

We present Sanskrit Travelogue, to our knowledge the largest open, unified and richly annotated Sanskrit corpus. Aggregating eight digital libraries, it comprises 12,394 texts, 73.1M tokens and 9M segments after de-duplication. A reproducible pipeline standardizes transliteration to IAST, reconciles heterogeneous metadata, preserves structural semantics (verse markers, chapter hierarchies, textual apparatus) and adds automatic annotations. We provide corpus-scale morphosyntactic annotation combining two systems: the BYT-5 Sanskrit model for compound and sandhi splitting, and the process-sanskrit library for inflection removal and morphological tagging through a hybrid deterministic-statistical cascade. For each segment we materialize synchronized representations: cleaned, analyzed (sandhi/compound split), stemmed, diacritic-normalized and morphologically tagged. These representations are indexed jointly for retrieval. Both approaches achieve high accuracy (84.61% sentence-level exact matches for BYT-5 segmentation, 92.37% correct root extraction for compounds, 95.94% on the Yoga Sūtra). Manual evaluation on the Yoga Sūtra showed 98% correct root extraction when combining both methods, outperforming individual approaches. These annotations enable searching across orthographic sandhi and within compounds, support robust lemma-level retrieval despite rich inflectional variation, and provide training material for segmentation and lemmatization while maintaining ambiguity for downstream modeling. We release the annotated corpus as TSV shards, code for corpus acquisition, processing and annotation, and a query normalizer, all under a Creative Commons non-commercial license.

Author Declaration *

I declare that I have notified all co-authors of the proposed corrections and obtained their consent, and that all modifications adhere to research ethics standards and the LREC correction policy.
