Author: techfox9
Notes from ‘Lucene in Action’
Sunday, September 7th, 2008 @ 1:28 am
Lucene in Action
ERIK HATCHER
OTIS GOSPODNETIC
MANNING
Greenwich
Ch. 1 – Introduction
- Lucene is a high performance, scalable Information Retrieval (IR) library.
- Lucene’s creator is Doug Cutting.
- Creating an index – see ‘Indexer.java’ (in ‘Files’, top right tabs)
- Indexing API:
— IndexWriter
— Directory (RAMDirectory)
— Analyzer
— Document
— Field
- Searching an index – see ‘Searcher.java’ (in ‘Files’, top right tabs)
- Searching API:
— IndexSearcher
— Term
— Query
— TermQuery
— Hits
Ch. 2 – Indexing
- The Analyzer tasks:
— Decompose text into tokens.
— Remove ‘stop words’.
— Reduce words to their roots.
- The ‘Inverted Index’ – an efficient method of finding documents
that contain given words.
In other words, instead of trying to answer the question “what words are contained
in this document?” this structure is optimized for providing quick answers to
“which documents contain word X?”
- Lucene doesn’t offer an update(Document) method;
instead, a Document must first be deleted from an index and then re-added to it.
- Use ‘doc.setBoost(float)’ to adjust the importance of documents.
Use ‘field.setBoost(float)’ to adjust the importance of individual fields.
- Indexing date/time fields at high resolution (milliseconds) may cause
performance problems.
- Use indexable numeric fields for range queries (store the size of email messages,
for example).
- Tuning indexing performance – system properties org.apache.lucene.X where X is
(name – default value – effect):
— mergeFactor – 10 – Controls segment merge frequency and size
— maxMergeDocs – Integer.MAX_VALUE – Limits the number of documents per segment
— minMergeDocs – 10 – Controls the amount of RAM used when indexing
- Use ‘addIndexes(Directory[])’ to merge other indexes into an IndexWriter’s index –
for example, from a RAMDirectory into an FSDirectory.
- Limit Field sizes with maxFieldLength – the default is 10,000 terms per field.
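To make the inverted index and the delete-then-re-add update pattern from the notes above concrete, here is a minimal sketch in plain Java. It does not use Lucene and is not how Lucene stores its index; the class `InvertedIndexSketch`, its methods, and the tiny stop-word list are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical, not Lucene's implementation):
// maps each term to the ids of the documents that contain it.
class InvertedIndexSketch {
    // Assumed stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Map<Integer, String> documents = new HashMap<>();

    // Minimal analysis: lowercase, split on anything that is not an
    // ASCII letter or digit, and drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    void add(int docId, String text) {
        documents.put(docId, text);
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    void delete(int docId) {
        String text = documents.remove(docId);
        if (text == null) {
            return; // unknown document, nothing to remove
        }
        for (String term : analyze(text)) {
            Set<Integer> ids = postings.get(term);
            if (ids != null) {
                ids.remove(docId);
                if (ids.isEmpty()) {
                    postings.remove(term);
                }
            }
        }
    }

    // There is no in-place update: delete the old version, then re-add.
    void update(int docId, String newText) {
        delete(docId);
        add(docId, newText);
    }

    // Answers "which documents contain word X?" with a single map lookup.
    Set<Integer> search(String word) {
        List<String> terms = analyze(word);
        if (terms.isEmpty()) {
            return new TreeSet<>();
        }
        Set<Integer> ids = postings.get(terms.get(0));
        return ids == null ? new TreeSet<>() : new TreeSet<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a search library");
        index.add(2, "The index of Lucene");
        System.out.println(index.search("Lucene")); // [1, 2]
        index.update(2, "Notes on searching");
        System.out.println(index.search("lucene")); // [1]
    }
}
```

The lookup cost depends on the number of distinct terms rather than on the number of documents, which is the point of inverting the document-to-words relationship.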
- Optimizing an index
— Merging segments
— Optimizing an index only affects the speed of searches
against that index, and does not affect the speed of indexing.
— API invoke pattern:
// 'false' opens the existing index instead of creating a new one
IndexWriter writer = new IndexWriter("/path/to/index", analyzer, false);
writer.optimize(); // merges the index segments
writer.close();
- Scoring
Factors:
— tf(t in d) Term frequency factor for the term (t) in the document (d).
— idf(t) Inverse document frequency of the term.
— boost(t.field in d) Field boost, as set during indexing.
— lengthNorm(t.field in d) Normalization value of a field, given the number of terms within the
field. This value is computed during indexing and stored in the index.
— coord(q, d) Coordination factor, based on the number of query terms the
document contains.
— queryNorm(q) Normalization value for a query, given the sum of the squared weights
of each of the query terms.
- Query types
— TermQuery
— RangeQuery
— PrefixQuery
— BooleanQuery
— PhraseQuery
— WildcardQuery
— FuzzyQuery (based on the Levenshtein distance)
- Analysis operations:
— Extract words
— Discard punctuation
— Remove accents from characters
— Lowercase (also called normalizing)
— Remove common words
— Reduce words to a root form (stemming)
— Change words into the basic form (lemmatization)
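The analysis operations listed above can be chained into a small pipeline. The following is a rough sketch in plain Java, not a Lucene Analyzer: `AnalysisSketch`, its stop-word list, and especially the `stem` method are assumptions made for illustration. The stemmer only strips a couple of English suffixes and is far cruder than a real stemming algorithm; lemmatization is left out because it needs a dictionary.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical analysis pipeline, for illustration only.
class AnalysisSketch {
    // Assumed stop-word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "in", "is"));

    // Remove accents: decompose characters (NFD), then drop combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Very crude suffix stripping (an assumption, not a real stemmer).
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Remove accents and lowercase before splitting.
        String normalized = stripAccents(text).toLowerCase(Locale.ROOT);
        // Extract words and discard punctuation: split on anything
        // that is not a letter or a decimal digit.
        for (String word : normalized.split("[^\\p{L}\\p{Nd}]+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue; // remove common words
            }
            tokens.add(stem(word)); // reduce to a root form
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, \u00fc is u-umlaut
        System.out.println(analyze("Indexing the Caf\u00e9s of Z\u00fcrich!"));
        // prints [index, cafe, zurich]
    }
}
```

The same pipeline has to be applied to query text as well as to indexed text, otherwise a search for "Indexing" would not match the stored term "index".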
Ch. 3 – Search in applications
Ch. 4 – Analysis