The Lucene Search Engine

Doug Cutting <cutting@lucene.com>
Inktomi Seminar
16 June, 2000

Lucene is . . .

a full-text search engine
written in Java
open source
http://www.lucene.com

Disclaimer

Lucene is different from Excite's search technology

Excite uses some proprietary methods
Lucene uses no proprietary methods

I can't talk about Excite's proprietary methods

Context

Lucene is the fourth search engine I've written

others at Xerox, Apple & Excite

sought simple implementation w/o major sacrifice
mostly written in two months, two days/week
my first Java program!

Architecture

similar to http://www.lucene.com/papers/riao91.ps
simple document model

document is a sequence of fields
fields are name/value pairs
values can be strings or Readers
clients must strip markup first
field names can be repeated
string-valued fields can be stored in index
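
A minimal sketch of the document model described above, to make the field rules concrete. All class and method names here (SketchField, SketchDocument, ofString, ofReader) are hypothetical and are not Lucene's API; the sketch only assumes what the slide states: fields are name/value pairs, values are strings or Readers, names may repeat, and only string values can be stored.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; a sketch of the slide's document model, not Lucene's API.
final class SketchField {
    final String name;
    final String stringValue; // set for string-valued fields, else null
    final Reader readerValue; // set for Reader-valued fields, else null
    final boolean stored;     // only string values may be stored in the index

    private SketchField(String name, String s, Reader r, boolean stored) {
        this.name = name;
        this.stringValue = s;
        this.readerValue = r;
        this.stored = stored;
    }

    static SketchField ofString(String name, String value, boolean stored) {
        return new SketchField(name, value, null, stored);
    }

    static SketchField ofReader(String name, Reader value) {
        // Reader values are consumed at indexing time, so they are never stored.
        return new SketchField(name, null, value, false);
    }
}

final class SketchDocument {
    // A document is an ordered sequence of fields; names may repeat.
    private final List<SketchField> fields = new ArrayList<>();

    SketchDocument add(SketchField field) {
        fields.add(field);
        return this;
    }

    List<SketchField> fields() {
        return fields;
    }

    List<SketchField> fieldsNamed(String name) {
        List<SketchField> out = new ArrayList<>();
        for (SketchField f : fields) {
            if (f.name.equals(name)) out.add(f);
        }
        return out;
    }
}

final class DocumentModelDemo {
    public static void main(String[] args) {
        SketchDocument doc = new SketchDocument()
            .add(SketchField.ofString("title", "The Lucene Search Engine", true))
            .add(SketchField.ofString("author", "first author", true))
            .add(SketchField.ofString("author", "second author", true))
            .add(SketchField.ofReader("body", new StringReader("markup already stripped")));
        System.out.println(doc.fieldsNamed("author").size());
    }
}
```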

plug-in analysis model

grammar-based tokenizer included
tokenizers piped through filters: stop-list, lowercase, stem, etc.
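
The "tokenizer piped through filters" idea can be sketched as a chain of small token sources, each wrapping the one before it. This is an illustration only: the interface and class names are hypothetical, the tokenizer is a simple letter/digit splitter rather than the grammar-based one mentioned above, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical interface: next() returns the next token, or null at the end.
interface TokenSource {
    String next();
}

// Stand-in tokenizer: splits on anything that is not a letter or digit.
final class LetterDigitTokenizer implements TokenSource {
    private final String text;
    private int pos = 0;

    LetterDigitTokenizer(String text) {
        this.text = text;
    }

    public String next() {
        while (pos < text.length() && !Character.isLetterOrDigit(text.charAt(pos))) pos++;
        if (pos >= text.length()) return null;
        int start = pos;
        while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos))) pos++;
        return text.substring(start, pos);
    }
}

// Filter: lowercases every token from the wrapped source.
final class LowercaseFilter implements TokenSource {
    private final TokenSource in;

    LowercaseFilter(TokenSource in) {
        this.in = in;
    }

    public String next() {
        String t = in.next();
        return t == null ? null : t.toLowerCase(Locale.ROOT);
    }
}

// Filter: drops tokens that appear in the stop list.
final class StopFilter implements TokenSource {
    private final TokenSource in;
    private final Set<String> stopWords;

    StopFilter(TokenSource in, Set<String> stopWords) {
        this.in = in;
        this.stopWords = stopWords;
    }

    public String next() {
        for (String t = in.next(); t != null; t = in.next()) {
            if (!stopWords.contains(t)) return t;
        }
        return null;
    }
}

final class AnalysisSketch {
    // Builds the chain tokenizer -> lowercase -> stop-list and drains it.
    static List<String> analyze(String text, Set<String> stopWords) {
        TokenSource chain =
            new StopFilter(new LowercaseFilter(new LetterDigitTokenizer(text)), stopWords);
        List<String> out = new ArrayList<>();
        for (String t = chain.next(); t != null; t = chain.next()) out.add(t);
        return out;
    }
}
```

Because lowercasing runs before the stop filter, the stop list only needs lowercase entries; swapping the order of the wrappers would change that.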

storage api: named, random-access blobs
index library
search library
query parsers

Inverted Index

abstractly, maps:

Term	->	<docNo>*	-- for boolean search
Term	->	<docNo, tf>*	-- for better ranking
Term	->	<docNo, tf, <position>* >*	-- for proximity searching

also stores:

number of terms, docs
docNo -> doc mapping
etc.

in Lucene

terms are <field, token> tuples
separate norm factor for each field in each doc
e.g., by default, a "title" match is stronger than a "body" match
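
A small in-memory sketch of the richest mapping above, Term -> <docNo, tf, <position>*>*, with terms keyed by field and token. The class names and the "field:token" key encoding are assumptions made for illustration; per-field norm factors are not modeled, and positions restart at 0 on every addField call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// One posting: a document number plus the positions of the term in that document.
final class Posting {
    final int docNo;
    final List<Integer> positions = new ArrayList<>();

    Posting(int docNo) {
        this.docNo = docNo;
    }

    int tf() {
        return positions.size(); // term frequency is the number of positions
    }
}

// Hypothetical in-memory index; terms are <field, token> pairs encoded as "field:token".
final class TinyInvertedIndex {
    // TreeMap keeps the term dictionary sorted.
    private final TreeMap<String, List<Posting>> postings = new TreeMap<>();
    private int docCount = 0;

    // Starts a new document and returns its docNo.
    int startDocument() {
        return docCount++;
    }

    // Adds one field's tokens to the most recently started document.
    void addField(String field, String... tokens) {
        if (docCount == 0) throw new IllegalStateException("call startDocument() first");
        int docNo = docCount - 1;
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Posting> list =
                postings.computeIfAbsent(field + ":" + tokens[pos], k -> new ArrayList<>());
            // Documents arrive in docNo order, so only the last posting can match.
            if (list.isEmpty() || list.get(list.size() - 1).docNo != docNo) {
                list.add(new Posting(docNo));
            }
            list.get(list.size() - 1).positions.add(pos);
        }
    }

    List<Posting> postings(String field, String token) {
        List<Posting> list = postings.get(field + ":" + token);
        return list == null ? Collections.<Posting>emptyList() : list;
    }

    int docCount() {
        return docCount;
    }

    int termCount() {
        return postings.size();
    }
}
```

Dropping the positions list gives the <docNo, tf> form, and dropping tf as well gives the plain boolean form.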

Some Inverted Index Strategies

batch-based: use file-sorting algorithms (textbook)

+ fastest to build

+ fastest to search

- slow to update

b-tree based: update in place (http://www.lucene.com/papers/sigir90.ps)

+ fast to search

- update/build does not scale

- complex implementation

segment based: lots of small indexes (Verity)

+ fast to build

+ fast to update

- slower to search

hash-file based (Ultraseek ISTK?)

+ fast to build/update/search

- unsorted dictionary

- no suffix matching

- slower index merging, more seeks

(strategies not exclusive, can be combined)

Lucene's Inverted Index Strategy

two basic algorithms:

make an index for a single document
merge a set of indices

incremental algorithm:

maintain a stack of segment indices
create index for each incoming document
push new indexes onto the stack
let M=10 be the merge factor; K=infinity

for (size = 1; size < K; size *= M) {
  if (there are M indexes with size docs on top of the stack) {
    pop them off the stack;
    merge them into a single index;
    push the merged index onto the stack;
  } else {
    break;
  }
}

optimization: single-doc indexes kept in RAM, saves system calls
notes:

batch indexing w/ K=infinity, flush at end

depth-first traversal of M-ary merge tree
a good sorting algorithm: good locality, sequential i/o

segment indexing w/ K<infinity
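
The incremental algorithm above can be simulated with a stack of document counts. This is a sketch under stated assumptions: a segment is represented only by its size, "merging" just adds sizes, and the names SegmentStack, mergeFactor and maxMergeSize (standing in for M and K) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simulates the stack of segment indexes; each entry is a segment's doc count.
final class SegmentStack {
    private final int mergeFactor;   // M on the slide
    private final long maxMergeSize; // K on the slide; Long.MAX_VALUE models infinity
    private final List<Long> stack = new ArrayList<>(); // top of stack is the last element
    private int merges = 0;

    SegmentStack(int mergeFactor, long maxMergeSize) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        this.mergeFactor = mergeFactor;
        this.maxMergeSize = maxMergeSize;
    }

    // Indexes one document: push a single-doc segment, then merge while possible.
    void addDocument() {
        stack.add(1L);
        for (long size = 1; size < maxMergeSize; size *= mergeFactor) {
            if (!topSegmentsAllHaveSize(size)) break;
            long merged = 0;
            for (int i = 0; i < mergeFactor; i++) {
                merged += stack.remove(stack.size() - 1); // pop
            }
            stack.add(merged); // push the merged segment
            merges++;
        }
    }

    // True if the top M segments exist and each holds exactly `size` docs.
    private boolean topSegmentsAllHaveSize(long size) {
        if (stack.size() < mergeFactor) return false;
        for (int i = stack.size() - mergeFactor; i < stack.size(); i++) {
            if (stack.get(i) != size) return false;
        }
        return true;
    }

    List<Long> sizes() {
        return new ArrayList<>(stack); // bottom of stack first
    }

    int merges() {
        return merges;
    }
}

final class SegmentStackDemo {
    public static void main(String[] args) {
        SegmentStack s = new SegmentStack(10, Long.MAX_VALUE);
        for (int i = 0; i < 111; i++) s.addDocument();
        System.out.println(s.sizes() + " after " + s.merges() + " merges");
    }
}
```

With M=10 and no cap, 111 documents leave segments of 100, 10 and 1 documents after 12 merges. Passing a finite maxMergeSize stops the cascade early, which corresponds to the segment-indexing mode with K<infinity.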

Indexing Diagram

M = 3
11 documents indexed
stack has four indexes
grayed indexes have been deleted
5 merges have occurred

Search Algorithms

assume a vector-space model

Σ (tf_d * idf_t) / norm_d -style weightings

approximate search algorithms

+ faster, may not process all postings

- not guaranteed to return top-scoring documents

- frequently not amenable to incremental index changes

e.g., sorting postings by score

exact search algorithms

- slower, must usually process all postings

+ guaranteed to return top-scoring documents

Lucene currently only implements exact searching
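
To illustrate what "exact" means here, the sketch below visits every posting of every query term and accumulates tf * idf / norm per document. The class name, the data layout (each posting as an int pair {docNo, tf}) and the particular idf formula, 1 + ln(numDocs / docFreq), are assumptions chosen for the example; the slides do not specify Lucene's actual formula.

```java
import java.util.List;
import java.util.Map;

// Hypothetical exact scorer: every posting of every query term is processed.
final class ExactScorerSketch {
    // postings maps a term to an array of {docNo, tf} pairs; norms[d] is norm_d.
    static double[] score(Map<String, int[][]> postings, double[] norms,
                          List<String> queryTerms) {
        int numDocs = norms.length;
        double[] scores = new double[numDocs];
        for (String term : queryTerms) {
            int[][] list = postings.get(term);
            if (list == null) continue; // term not in the index
            // Assumed idf variant; rarer terms get larger weights.
            double idf = 1.0 + Math.log((double) numDocs / list.length);
            for (int[] posting : list) {
                int docNo = posting[0];
                int tf = posting[1];
                scores[docNo] += tf * idf / norms[docNo];
            }
        }
        return scores;
    }

    // Returns the docNo with the highest score (lowest docNo wins ties).
    static int best(double[] scores) {
        int best = 0;
        for (int d = 1; d < scores.length; d++) {
            if (scores[d] > scores[best]) best = d;
        }
        return best;
    }
}
```

Since no posting is skipped, the highest-scoring document is always found, at the cost of touching every posting for the query terms.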

Lucene's Disjunctive Search Algorithm

described in http://www.lucene.com/papers/riao97.ps
since all postings must be processed

goal is to minimize per-posting computation

merges postings through a fixed-size array of accumulator buckets
performs boolean logic with bit masks
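
One way to picture accumulator buckets with bit masks is sketched below. It is a loose illustration of the two bullets above, not the algorithm from the riao97 paper and not Lucene's code: documents are processed in fixed-size windows, each bucket accumulates a score plus a bit per matching clause, and required/prohibited clauses are checked with two mask comparisons. The class name, parameters and the per-clause constant weights are all assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bucket-based disjunction over sorted docNo arrays (at most 32 clauses).
final class BucketDisjunctionSketch {
    // clauseDocs[c] holds the ascending docNos matching clause c.
    // Bit c of requiredMask / prohibitedMask marks clause c as required / prohibited.
    static Map<Integer, Double> search(int[][] clauseDocs, double[] clauseWeights,
                                       int requiredMask, int prohibitedMask,
                                       int maxDoc, int windowSize) {
        int[] masks = new int[windowSize];       // which clauses hit each bucket
        double[] scores = new double[windowSize]; // accumulated score per bucket
        int[] cursor = new int[clauseDocs.length];
        Map<Integer, Double> hits = new LinkedHashMap<>();

        for (int base = 0; base < maxDoc; base += windowSize) {
            Arrays.fill(masks, 0);
            Arrays.fill(scores, 0.0);
            int end = Math.min(base + windowSize, maxDoc);

            // Pour each clause's postings for this window into the buckets.
            for (int c = 0; c < clauseDocs.length; c++) {
                int[] docs = clauseDocs[c];
                while (cursor[c] < docs.length && docs[cursor[c]] < end) {
                    int slot = docs[cursor[c]++] - base;
                    masks[slot] |= 1 << c;
                    scores[slot] += clauseWeights[c];
                }
            }

            // Boolean logic is two mask tests per non-empty bucket.
            for (int slot = 0; slot < end - base; slot++) {
                int m = masks[slot];
                if (m != 0 && (m & requiredMask) == requiredMask
                        && (m & prohibitedMask) == 0) {
                    hits.put(base + slot, scores[slot]);
                }
            }
        }
        return hits;
    }
}
```

The per-posting work is an array index, an OR and an add, which is the kind of minimal per-posting computation the slide names as the goal.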

Lucene's Phrase Scoring

approximate phrase IDF with sum of terms
compute actual tf of phrase
slop penalizes slight mismatches by edit-distance
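
A sketch of the first two bullets: the phrase's IDF is approximated by summing the IDFs of its terms, while its tf in a document is counted from the position lists. Only exact (zero-slop) matches are counted here; the edit-distance slop penalty is not modeled. Class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for phrase scoring within a single document.
final class PhraseScoringSketch {
    // positions[i] holds the ascending positions of the i-th phrase term in the document.
    // Counts starting positions where every term appears at its expected offset.
    static int exactPhraseFreq(int[][] positions) {
        if (positions.length == 0) return 0;
        int count = 0;
        for (int start : positions[0]) {
            boolean match = true;
            for (int i = 1; i < positions.length && match; i++) {
                match = Arrays.binarySearch(positions[i], start + i) >= 0;
            }
            if (match) count++;
        }
        return count;
    }

    // Approximates the phrase IDF as the sum of the individual term IDFs.
    static double approxPhraseIdf(double[] termIdfs) {
        double sum = 0.0;
        for (double idf : termIdfs) sum += idf;
        return sum;
    }
}
```

For the phrase "search engine" in a document where "search" occurs at positions 0, 3, 6 and "engine" at 1, 5, 7, the phrase tf is 2 (starting at 0 and at 6).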

Future

add conjunctive search algorithm

can skip postings when some terms are required
note: Google requires all terms...
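
Why required terms allow skipping can be seen in a sketch of a conjunctive (AND) search over sorted postings: whenever one list is already past the current candidate, the other lists can jump straight to that docNo instead of reading every posting. The code below is an illustration with hypothetical names, using binary search as the skip mechanism; it is not a description of how Lucene implements it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical conjunction over ascending docNo arrays, all terms required.
final class ConjunctionSketch {
    // Returns the first index >= from whose docNo is >= target (docs.length if none).
    static int advance(int[] docs, int from, int target) {
        int lo = from;
        int hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Returns the docNos present in every list.
    static List<Integer> intersect(int[][] lists) {
        List<Integer> out = new ArrayList<>();
        if (lists.length == 0) return out;
        int[] pos = new int[lists.length];
        int p0 = 0;
        while (p0 < lists[0].length) {
            int candidate = lists[0][p0];
            boolean inAll = true;
            for (int i = 1; i < lists.length; i++) {
                pos[i] = advance(lists[i], pos[i], candidate);
                if (pos[i] == lists[i].length) return out; // a required list is exhausted
                int found = lists[i][pos[i]];
                if (found != candidate) {
                    // Skip the first list ahead to the larger docNo.
                    p0 = advance(lists[0], p0, found);
                    inAll = false;
                    break;
                }
            }
            if (inAll) {
                out.add(candidate);
                p0++;
            }
        }
        return out;
    }
}
```

Putting the rarest term's list first keeps the number of candidates small, since every candidate comes from that list.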

Lucene needs:

a spider
admin/install UI
parsers for PDF, SQL, etc.
. . .

http://www.lucene.com