Archive for the 'Search' Category

 

Solr integration with Nutch ..

Aug 20, 2009 in nutch, Search, solr

“requestHandler” notes for the solrconfig.xml file:

— The fields to highlight are listed in hl.fl:

<str name="hl.fl">text features name</str>

— Per-field highlighting settings (f.<field>.hl.*) are defined here:

<str name="f.name.hl.alternateField">name</str>
<str name="f.name.hl.fragsize">0</str>
<str name="f.text.hl.fragmenter">regex</str>

— The alternate ‘nutch’ configuration is:

(See http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/)

— Fields:

<str name="hl.fl">title url content</str>

— Field values:

<str name="f.content.hl.fragmenter">regex</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.url.hl.fragsize">0</str>

— To map a parser to a file type:

— Map the file's MIME type to a parser plugin in conf/parse-plugins.xml.

— If the MIME type is new, define it in conf/mime-types.xml.

Notes from ‘Lucene in Action’ ..

Sep 07, 2008 in Books, Java, lucene

Lucene in Action
ERIK HATCHER
OTIS GOSPODNETIC
MANNING
Greenwich

Ch. 1 – Introduction

  • Lucene is a high performance, scalable Information Retrieval (IR) library.
  • Lucene’s creator is Doug Cutting.
  • Creating an index – see ‘Indexer.java’ (in ‘Files’, top right tabs)
  • Indexing API:
    — IndexWriter
    — Directory (RAMDirectory)
    — Analyzer
    — Document
    — Field

  • Searching an index – see ‘Searcher.java’ (in ‘Files’, top right tabs)
  • Searching API:
    — IndexSearcher
    — Term
    — Query
    — TermQuery
    — Hits

Ch. 2 – Indexing

  • The Analyzer tasks:
    — Decompose text into tokens.
    — Remove ‘stop words’.
    — Reduce words to roots.
  • The ‘Inverted Index’ – an efficient method of finding documents
    that contain given words.
    In other words, instead of trying to answer the question “what words are contained
    in this document?” this structure is optimized for providing quick answers to
    “which documents contain word X?”
  • Lucene doesn’t offer an update(Document) method;
    instead, a Document must first be deleted from an index and then re-added to it.
  • Use ‘doc.setBoost(float)’ to adjust the importance of a document.
    Use ‘field.setBoost(float)’ to adjust the importance of an individual field.
  • Indexing date/time fields at high resolution (milliseconds) may cause
    performance problems.
  • Use indexable numeric fields for range queries (store the size of email messages,
    for example).
  • Tuning indexing performance – system properties org.apache.lucene.X, where X is
    one of the following (name – default – effect):
    — mergeFactor – 10 – Controls segment merge frequency and size
    — maxMergeDocs – Integer.MAX_VALUE – Limits the number of documents per segment
    — minMergeDocs – 10 – Controls the amount of RAM used when indexing
  • Use ‘addIndexes(Directory[])’ to merge existing indexes into an IndexWriter’s
    index – for example, from a RAMDirectory into an FSDirectory.
  • Limit field sizes with maxFieldLength – the default is 10,000 terms per field.
  • Optimizing an index
    — Merging segments
    — Optimizing an index only affects the speed of searches
    against that index, and does not affect the speed of indexing.
    — API invoke pattern:
    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, false);
    writer.optimize();
    writer.close();
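
As a rough illustration of the inverted index and the delete-then-re-add update from the Ch. 2 notes above, here is a minimal sketch in plain Java. It does not use Lucene. The class InvertedIndexSketch and its methods are hypothetical names made up for this example, and the tokenizer is only a stand-in for a real Analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (not Lucene): maps each term to the ids of the documents containing it. */
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // doc id -> its terms, kept so that a document can be removed again
    private final Map<Integer, List<String>> termsByDoc = new HashMap<>();

    /** Lowercases and splits on anything that is not a letter or digit (assumed tokenizer). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Indexes a new document; use update() for an id that is already indexed. */
    void add(int docId, String text) {
        List<String> tokens = tokenize(text);
        termsByDoc.put(docId, tokens);
        for (String term : tokens) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Removes every posting of the document; does nothing for an unknown id. */
    void delete(int docId) {
        List<String> tokens = termsByDoc.remove(docId);
        if (tokens == null) return;
        for (String term : tokens) {
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) continue;
            ids.remove(docId);
            if (ids.isEmpty()) postings.remove(term);
        }
    }

    /** There is no in-place update: delete the old version, then re-add the new one. */
    void update(int docId, String text) {
        delete(docId);
        add(docId, text);
    }

    /** Answers "which documents contain this word?" with a single map lookup. */
    Set<Integer> search(String word) {
        TreeSet<Integer> ids = postings.get(word.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<>() : new TreeSet<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene in Action");
        index.add(2, "Solr in production");
        System.out.println(index.search("in")); // prints [1, 2]
        index.update(1, "Nutch crawler notes");
        System.out.println(index.search("lucene")); // prints []
    }
}
```

The lookup cost depends on the number of distinct terms rather than on the number of documents, which is the point of inverting the document-to-words relation.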

Ch. 3 – Search in applications

  • Scoring factors:
    — tf(t in d): term frequency factor for the term (t) in the document (d).
    — idf(t): inverse document frequency of the term.
    — boost(t.field in d): field boost, as set during indexing.
    — lengthNorm(t.field in d): normalization value of a field, given the number of
    terms within the field. This value is computed during indexing and stored in the index.
    — coord(q, d): coordination factor, based on the number of query terms the
    document contains.
    — queryNorm(q): normalization value for a query, given the sum of the squared weights
    of each of the query terms.
  • Query types
    — TermQuery
    — RangeQuery
    — PrefixQuery
    — BooleanQuery
    — PhraseQuery
    — WildcardQuery
    — FuzzyQuery (matches similar terms by Levenshtein edit distance)
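
To make the scoring factors listed above more concrete, here is a small plain-Java sketch of how they might combine into one score. The notes do not give the formula, so everything here is an assumption: the per-term product summed over query terms, and the concrete functions (square root for tf, logarithm for idf, and so on), are modelled on how I recall Lucene's default similarity of that era. ScoringSketch and its method names are hypothetical, not Lucene API.

```java
/** Toy scoring functions (not Lucene API); the formulas are assumptions, see the lead-in. */
class ScoringSketch {
    /** Term frequency factor: grows with the number of occurrences, but sub-linearly. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: rarer terms get a higher weight. */
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Length normalization: a match in a short field counts for more. */
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    /** Coordination factor: fraction of the query terms found in the document. */
    static double coord(int matchedTerms, int queryTerms) {
        return (double) matchedTerms / queryTerms;
    }

    /** Query normalization from the sum of the squared term weights. */
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    /** Contribution of one query term: tf * idf * field boost * lengthNorm. */
    static double termScore(int freq, int docFreq, int numDocs,
                            double fieldBoost, int numTermsInField) {
        return tf(freq) * idf(docFreq, numDocs) * fieldBoost * lengthNorm(numTermsInField);
    }

    /** Document score: the summed term contributions scaled by coord and queryNorm. */
    static double score(double[] termScores, int matchedTerms, int queryTerms,
                        double sumOfSquaredWeights) {
        double sum = 0.0;
        for (double s : termScores) {
            sum += s;
        }
        return coord(matchedTerms, queryTerms) * queryNorm(sumOfSquaredWeights) * sum;
    }

    public static void main(String[] args) {
        // One of two query terms matches, 4 times, in a 4-term field with boost 1.0.
        double t = termScore(4, 0, 1, 1.0, 4);
        System.out.println(t); // prints 1.0
        System.out.println(score(new double[] {t}, 1, 2, 4.0)); // prints 0.25
    }
}
```

Whatever the exact functions, a term scores higher when it is frequent in the document, rare in the index, and found in a short or boosted field.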

Ch. 4 – Analysis

  • Analysis operations:
    — Extract words
    — Discard punctuation
    — Remove accents from characters
    — Lowercase (also called normalizing)
    — Remove common words
    — Reduce words to a root form (stemming)
    — Change words into the basic form (lemmatization)
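
The analysis operations above can be chained into a small pipeline. The following plain-Java sketch is only an illustration and does not use Lucene's Analyzer classes. AnalysisSketch, its tiny stop list, and the naive suffix-stripping stem() are hypothetical stand-ins. A real stemmer (Porter's, for example) is far more careful, and lemmatization is left out because it needs a dictionary.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain (not Lucene): accents, lowercase, tokenize, stop words, stemming. */
class AnalysisSketch {
    // Tiny illustrative stop list (an assumption; real analyzers ship longer ones).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the", "to"));

    /** Decomposes accented characters (NFD) and drops the combining marks. */
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    /** Naive suffix stripping, a placeholder for a real stemming algorithm. */
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            // Keep at least three characters of the word so short words stay intact.
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Runs the whole chain and returns the terms that would be indexed. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        String normalized = stripAccents(text).toLowerCase(Locale.ROOT);
        // Splitting on non-letters/digits extracts words and discards punctuation.
        for (String token : normalized.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [cafe, owner, were, index, document]
        System.out.println(analyze("The Caf\u00e9 owners were indexing documents!"));
    }
}
```

The same chain has to run on query text as well, otherwise a search for "Indexing" would never match the stored term "index".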