JSS: A few notes for the technically inclined user

  • Search words are folded to lowercase and stemmed (e.g. final "s", "ing", and "ed" are removed).
  • Short words of three letters or fewer that are not capitalized in the original text are not indexed.
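
The two rules above can be sketched in a few lines of JavaScript. This is only an illustration, not the actual JSS stemmer: the function name normalizeWord is hypothetical, "capitalized" is assumed to mean "starts with an uppercase letter", and the suffix list, its order, and the minimum remaining stem length are all assumptions.

```javascript
// Hypothetical helper, not part of JSS: normalizes one word roughly as the
// notes above describe. Returns null when the word would not be indexed.
function normalizeWord(word) {
  // ASSUMPTION: "capitalized" means the word starts with an uppercase letter.
  const capitalized = /^[A-Z]/.test(word);

  // Words of three letters or fewer are kept only if capitalized.
  if (word.length <= 3 && !capitalized) {
    return null;
  }

  let w = word.toLowerCase();

  // ASSUMPTION: strip at most one suffix, longest first, and only when
  // at least three letters of stem remain.
  for (const suffix of ["ing", "ed", "s"]) {
    if (w.length > suffix.length + 2 && w.endsWith(suffix)) {
      w = w.slice(0, -suffix.length);
      break;
    }
  }
  return w;
}
```

With these assumptions, "Searching", "searched", and "searches" do not all collapse to the same stem ("searches" becomes "searche"), which shows how crude such a stemmer is. The real JSS rules may differ.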

How JSS works

JSS uses a simple inverted word list. Each distinct word that appears in the corpus of documents is associated with a list of the documents (or pages) in which the word appears.
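
As a rough illustration of the idea, an inverted word list can be built and queried as follows. This is a minimal sketch, not the JSS code: the names buildInvertedIndex and search are hypothetical, documents are assumed to be plain strings identified by their position in an array, and no stemming is applied.

```javascript
// Hypothetical sketch: map each distinct word to the list of document IDs
// in which it appears. Document IDs are simply array positions here.
function buildInvertedIndex(docs) {
  const index = new Map();
  docs.forEach((text, id) => {
    // Split into lowercase alphanumeric words (a simplification).
    const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
    // A Set ensures each document is listed once per word.
    for (const w of new Set(words)) {
      if (!index.has(w)) index.set(w, []);
      index.get(w).push(id); // IDs arrive in ascending order
    }
  });
  return index;
}

// Look up one word; returns an empty list when the word is not indexed.
function search(index, word) {
  return index.get(word.toLowerCase()) || [];
}

const docs = ["DjVu scanner notes", "PDF and DjVu formats", "PostScript manual"];
const index = buildInvertedIndex(docs);
console.log(search(index, "djvu")); // [ 0, 1 ]
```

Because documents are visited in ID order, each list comes out already sorted, which is what the difference encoding described later relies on.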

The inverted word list is encoded in string literals in JavaScript source code. With this trick, the search engine and the index are entirely contained in a piece of JavaScript source code that runs in the browser. For CDROM-based collections, this provides a machine-independent search capability without requiring any software installation, and without requiring a Java virtual machine. For web-based collections, this provides a simple search capability, without requiring any server-side software installation, and without consuming any server resources.

JSS is the simplest way to provide search capabilities for CDROM document collections, and for Web sites with no access to CGI scripts.

Inverted Word List Encoding

The inverted word list is encoded in string literals in the JavaScript source files. The encoding scheme is as follows. Each document or page is identified by an ID number. For each word, a list of the ID numbers of the documents in which the word appears is built and sorted in ascending order. This list is transformed into a list of differences between successive IDs, which is then encoded using a very simple entropy coder. Small numbers up to 170 are encoded as single-byte non-escaped characters taken from the printable ISO-latin set. Larger numbers are escaped with the space character and encoded with two bytes.
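
The scheme can be illustrated with a small encoder and decoder. This is a sketch under several assumptions, not the actual JSS format: the exact character table is not given above, so ALPHABET below is a guess (printable ISO Latin-1 minus the space, the double quote, and the backslash); values 0 to 170 use one character; and an escaped value is assumed to be a space followed by two characters in base 171. All function names are hypothetical.

```javascript
// ASSUMPTION: printable ISO Latin-1 characters, excluding the space (used as
// the escape), the control range 0x7F-0xA0, and '"' and '\' (awkward inside
// JavaScript string literals). This yields 187 characters; only 171 are used.
const ALPHABET = [];
for (let c = 0x21; c <= 0xff; c++) {
  const skip = (c >= 0x7f && c <= 0xa0) || c === 0x22 || c === 0x5c;
  if (!skip) ALPHABET.push(String.fromCharCode(c));
}
const BASE = 171; // values 0..170 fit in a single character

// Sorted IDs -> differences between successive IDs (first ID kept as is).
function toDeltas(ids) {
  return ids.map((id, i) => (i === 0 ? id : id - ids[i - 1]));
}

// Differences -> original IDs, by running sum.
function fromDeltas(deltas) {
  let sum = 0;
  return deltas.map((d) => (sum += d));
}

function encodeList(ids) {
  let out = "";
  for (const d of toDeltas(ids)) {
    if (d < BASE) {
      out += ALPHABET[d]; // small value: one character
    } else if (d < BASE * BASE) {
      // ASSUMPTION: escape with a space, then two base-171 digits.
      out += " " + ALPHABET[Math.floor(d / BASE)] + ALPHABET[d % BASE];
    } else {
      throw new RangeError("difference too large for this sketch: " + d);
    }
  }
  return out;
}

function decodeList(s) {
  const deltas = [];
  for (let i = 0; i < s.length; i++) {
    if (s[i] === " ") {
      const hi = ALPHABET.indexOf(s[i + 1]);
      const lo = ALPHABET.indexOf(s[i + 2]);
      deltas.push(hi * BASE + lo);
      i += 2; // skip the two digits just consumed
    } else {
      deltas.push(ALPHABET.indexOf(s[i]));
    }
  }
  return fromDeltas(deltas);
}
```

For example, the ID list 3, 7, 8, 500, 501 becomes the differences 3, 4, 1, 492, 1. Under the assumptions above, four of these take one character each and 492 takes three (the escape plus two digits), so the whole list fits in a 7-character string. The differences stay small for frequent words, which is why encoding them is more compact than encoding the IDs themselves.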

Indexing your Collection with JSS

There are two ways to index a collection with JSS. The easiest one is to use the Bib2Web conversion server. Bib2Web is a free web service that allows you to build and index a collection of documents in PostScript, TIFF, PDF, or DjVu formats. Bib2Web has the considerable advantage of having a built-in OCR engine (which is particularly useful for scanned documents). No software installation is required.

The second possibility is to install the JSS package. This package includes the jssindex program, which provides a very simple way to make a document collection searchable.

jssindex is a script written in the Lush language. To use the JSS indexer package, you must first download and install Lush. Lush runs on GNU/Linux, Unix, and Windows under Cygwin.

To index collections of documents in the DjVu format with JSS, you must download and install the DjVuLibre package.

Powered by ELDA © 2008 ELDA/ELRA