Lucene



Doug Cutting
cutting@apache.org

November 24, 2004
University of Pisa


Prelude


Lucene is a free, open-source, full-text search library written in Java.


Lucene Architecture


[draw on whiteboard for reference throughout talk]

Lucene API



Package: org.apache.lucene.document


Example

public Document makeDocument(File f) throws FileNotFoundException {
  Document doc = new Document();
  doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));

  doc.add(new Field("modified",
                    DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
                    Store.YES, Index.UN_TOKENIZED));

  // Reader implies Store.NO and Index.TOKENIZED
  doc.add(new Field("contents", new FileReader(f)));

  return doc;
}

Example (continued)

field      stored   indexed   analyzed
path       yes      yes       yes
modified   yes      yes       no
contents   no       yes       yes


Package: org.apache.lucene.analysis


Example

public class ItalianAnalyzer extends Analyzer {
  private Set stopWords =
    StopFilter.makeStopSet(new String[] {"il", "la", "in"});

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopWords);
    result = new SnowballFilter(result, "Italian");
    return result;
  }
}

Package: org.apache.lucene.index


Example

IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
writer.addDocument(makeDocument(files[i]));
}
writer.close();

Some Inverted Index Strategies

  1. batch-based: use file-sorting algorithms (textbook)
  2. b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
  3. segment based: lots of small indexes (Verity)

Lucene's Index Algorithm
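Lucene takes the segment-based approach: each added document first becomes a tiny segment, and whenever mergeFactor segments of the same size accumulate (10 by default), they are merged into one segment ten times larger. A minimal simulation of that merge policy, tracking segment sizes only (class and method names are illustrative, not Lucene's API):

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentMergeSim {
    // Toy simulation of Lucene's segment-merge policy: every added
    // document becomes a one-document segment, and whenever mergeFactor
    // segments of the same size sit on top of the stack, they are
    // merged into a single segment mergeFactor times larger.
    static List<Integer> indexDocs(int numDocs, int mergeFactor) {
        List<Integer> segments = new ArrayList<Integer>(); // segment sizes
        for (int d = 0; d < numDocs; d++) {
            segments.add(1); // new single-document segment
            while (segments.size() >= mergeFactor) {
                int n = segments.size();
                int size = segments.get(n - 1);
                boolean mergeable = true;
                for (int i = n - mergeFactor; i < n; i++) {
                    if (segments.get(i) != size) { mergeable = false; break; }
                }
                if (!mergeable) break;
                for (int i = 0; i < mergeFactor; i++) {
                    segments.remove(segments.size() - 1);
                }
                segments.add(size * mergeFactor); // merged segment
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // After 123 documents with mergeFactor 10:
        System.out.println(indexDocs(123, 10)); // [100, 10, 10, 1, 1, 1]
    }
}
```

The stack never holds more than mergeFactor - 1 segments per size level, so an index of N documents has O(log N) segments and each document is merged O(log N) times.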


Indexing Diagram



Index Compression

For keys in the Term -> ... map, use the technique from Paolo's slides: prefix (front) compression, where each term stores only the length of the prefix shared with the previous term plus the new suffix.
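A sketch of front compression over a sorted term list, using a "sharedLength|suffix" string encoding for readability (the encoding and names here are illustrative; Lucene's on-disk term dictionary format differs in detail):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixCompression {
    // Front-code a sorted term list: each entry is "<shared>|<suffix>",
    // where <shared> counts leading characters shared with the
    // previous term.
    static List<String> encode(List<String> sortedTerms) {
        List<String> out = new ArrayList<String>();
        String prev = "";
        for (String term : sortedTerms) {
            int p = 0;
            int max = Math.min(prev.length(), term.length());
            while (p < max && prev.charAt(p) == term.charAt(p)) p++;
            out.add(p + "|" + term.substring(p));
            prev = term;
        }
        return out;
    }

    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<String>();
        String prev = "";
        for (String entry : encoded) {
            int bar = entry.indexOf('|');
            int shared = Integer.parseInt(entry.substring(0, bar));
            String term = prev.substring(0, shared) + entry.substring(bar + 1);
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("pisa", "pisan", "tower");
        System.out.println(encode(terms)); // [0|pisa, 4|n, 0|tower]
    }
}
```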


For values in the Term -> ... map, use the technique from Paolo's slides: delta encoding, storing the gaps between successive document numbers, with the gaps written as variable-length integers (VInts).
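Since postings are sorted by document number, the gaps are small and compress well. A sketch of the delta step on its own (illustrative helper, not Lucene code):

```java
import java.util.Arrays;

public class DeltaEncoding {
    // Replace sorted document numbers with the gaps between them;
    // the small gaps then compress well as VInts.
    static int[] toGaps(int[] docIds) {
        int[] gaps = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            gaps[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return gaps;
    }

    static int[] fromGaps(int[] gaps) {
        int[] docIds = new int[gaps.length];
        int sum = 0;
        for (int i = 0; i < gaps.length; i++) {
            sum += gaps[i]; // running total restores absolute ids
            docIds[i] = sum;
        }
        return docIds;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toGaps(new int[]{3, 7, 21, 22}))); // [3, 4, 14, 1]
    }
}
```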



VInt Encoding Example

Value     First byte   Second byte   Third byte
0         00000000
1         00000001
2         00000010
...
127       01111111
128       10000000     00000001
129       10000001     00000001
130       10000010     00000001
...
16,383    11111111     01111111
16,384    10000000     10000000      00000001
16,385    10000001     10000000      00000001
...

This provides compression while still being efficient to decode.
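The table above can be reproduced with a few lines of bit manipulation: seven payload bits per byte, least-significant group first, with the high bit set on every byte except the last (class name is illustrative, not Lucene's API):

```java
import java.io.ByteArrayOutputStream;

public class VIntDemo {
    // Seven payload bits per byte, least-significant group first;
    // the high bit of a byte is set when another byte follows.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte: high bit clear
        return out.toByteArray();
    }

    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift; // accumulate 7-bit groups
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        // 128 encodes as 10000000 00000001, matching the table.
        byte[] bytes = encode(128);
        System.out.println(bytes.length + " bytes, decodes to " + decode(bytes));
    }
}
```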


Package: org.apache.lucene.search


Example

Query pisa = new TermQuery(new Term("contents", "pisa"));
Query babel = new TermQuery(new Term("contents", "babel"));

PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("contents", "leaning"));
leaningTower.add(new Term("contents", "tower"));

BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT);

Search Algorithms

From Paolo's slides:


Lucene's Disjunctive Search Algorithm


[ draw a diagram to illustrate? ]
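Lucene matches disjunctive (OR) queries document-at-a-time: a priority queue holds one cursor per term's posting list, ordered by current document number, and all cursors sitting on the smallest document are popped together. A toy version over in-memory posting lists (all names illustrative, not Lucene's API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class DisjunctiveMerge {
    // Document-at-a-time OR matching: the priority queue orders
    // posting-list cursors by the doc id each one currently points at.
    static List<String> merge(final int[][] postings) {
        final int[] cursor = new int[postings.length];
        PriorityQueue<Integer> pq = new PriorityQueue<Integer>(
                Math.max(1, postings.length),
                new Comparator<Integer>() {
                    public int compare(Integer a, Integer b) {
                        return postings[a][cursor[a]] - postings[b][cursor[b]];
                    }
                });
        for (int i = 0; i < postings.length; i++) {
            if (postings[i].length > 0) pq.add(i);
        }
        List<String> results = new ArrayList<String>(); // "doc:termsMatched"
        while (!pq.isEmpty()) {
            int doc = postings[pq.peek()][cursor[pq.peek()]];
            int matched = 0;
            // Pop every cursor sitting on this doc, advance, and re-add.
            while (!pq.isEmpty() && postings[pq.peek()][cursor[pq.peek()]] == doc) {
                int i = pq.poll();
                matched++;
                if (++cursor[i] < postings[i].length) pq.add(i);
            }
            results.add(doc + ":" + matched);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(merge(new int[][]{{1, 3, 5}, {3, 4}})); // [1:1, 3:2, 4:1, 5:1]
    }
}
```

In a real scorer, the number of matching terms (and their weights) would feed the score for each document rather than just being counted.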

Lucene's Conjunctive Search Algorithm

From Paolo's slides:


Algorithm
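For conjunctive (AND) queries the standard intersection algorithm applies: let the rarest term drive the loop and advance the other posting lists to each candidate document. A sketch over in-memory lists (names illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ConjunctiveMatch {
    // AND matching: sort posting lists by length so the rarest term
    // drives the loop, then advance the other lists to each candidate.
    static List<Integer> intersect(int[][] postings) {
        int[][] lists = postings.clone();
        Arrays.sort(lists, new Comparator<int[]>() {
            public int compare(int[] a, int[] b) { return a.length - b.length; }
        });
        List<Integer> result = new ArrayList<Integer>();
        int[] pos = new int[lists.length]; // cursor into each list
        outer:
        for (int doc : lists[0]) {
            for (int i = 1; i < lists.length; i++) {
                while (pos[i] < lists[i].length && lists[i][pos[i]] < doc) pos[i]++;
                if (pos[i] == lists[i].length) break outer;   // a list ran out
                if (lists[i][pos[i]] != doc) continue outer;  // doc not in list i
            }
            result.add(doc); // present in every list
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(intersect(
            new int[][]{{1, 3, 5, 7}, {3, 7, 9}, {2, 3, 7}})); // [3, 7]
    }
}
```

The linear scan in the inner loop stands in for the skip-based advance a real index would use on long posting lists.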

Scoring

From Paolo's slides:

It is very much like Lucene's Similarity.

Lucene's Phrase Scoring
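Phrase matching relies on the positions stored with each posting: for an exact phrase, term i must occur at position start + i for some start. A sketch of the per-document position check (names illustrative; Lucene also supports "sloppy" phrases that allow some distance):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PhraseMatch {
    // Exact-phrase check within one document: positions[i] holds the
    // sorted positions of the i-th phrase term; a start position p
    // matches when term i occurs at p + i for every i.
    static List<Integer> phraseStarts(int[][] positions) {
        List<Integer> starts = new ArrayList<Integer>();
        for (int start : positions[0]) {
            boolean match = true;
            for (int i = 1; i < positions.length; i++) {
                if (Arrays.binarySearch(positions[i], start + i) < 0) {
                    match = false;
                    break;
                }
            }
            if (match) starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // "leaning" at positions {4, 17}, "tower" at {5, 9}:
        System.out.println(phraseStarts(new int[][]{{4, 17}, {5, 9}})); // [4]
    }
}
```

The number of phrase occurrences found this way plays the role that term frequency plays for single-term scoring.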


Thanks!

And there's lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.

Finally, search for this talk on Technorati.