+ fast to search
- update/build does not scale
- complex implementation
segment based: lots of small indexes (Verity)
+ fast to build
+ fast to update
- slower to search
hash-file based (Ultraseek ISTK?)
+ fast to build/update/search
- unsorted dictionary
- no suffix matching
- slower index merging, more seeks
(strategies not exclusive, can be combined)
Lucene's Inverted Index Strategy
two basic algorithms:
make an index for a single document
merge a set of indices
incremental algorithm:
maintain a stack of segment indices
create index for each incoming document
push new indexes onto the stack
let M=10 be the merge factor; K=infinity
for (size = 1; size < K; size
*= M) { if (there are M indexes with size
docs on top of the stack) { pop them off the stack;
merge them into a single index;
push the merged index onto the stack;
} else { break; } }
optimization: single-doc indexes kept in RAM, saves system
calls
notes:
batch indexing w/ K=infinity, flush at end
depth-first traversal of M-ary merge tree
a good sorting algorithm: good locality, sequential i/o
segment indexing w/ K<infinity
Indexing Diagram
M = 3
11 documents indexed
stack has four indexes
grayed indexes have been deleted
5 merges have occurred
Search Algorithms
assume a vector-space model
å (tfd
* idft ) / normd
-style weightings
approximate search algorithms
+ faster, may not process all postings
- not guaranteed to return top-scoring documents
- frequently not amenable to incremental index
changes
e.g., sorting postings by score
exact search algorithms
- slower, must usually process all postings
+ guaranteed to return top-scoring documents