User:OrenBochman/Search/Porting

=Search 2 Code Porting Notes=

==Questions==

 * Is it possible to use the old filter code with the new Lucene libraries?
 * Is it possible to make a non-invasive modification that would allow using the old filter code almost as-is?
 * What do the various filters do?
 * Which filters are in use in the various setups?

==Contributing to Solr==
It would be strategically wise to contribute back to Apache Lucene and/or Apache Solr. This is because:
 * Once the code is integrated into Solr, they have to keep it up to date with the changes in Lucene's API that invariably occur over time.
 * They have extensive tests.
 * Their user base will find and fix bugs much faster than can be done in the Wikipedia ecosystem.

==Filters==
Some missing filters:
 * WikiArticleFilter - indexes all Wikipedia article titles and redirects referenced in article texts, using a DB dump of the title tables.
 * NamedEntityFilter - uses a database to index named entities mined from various wikis (Semantic MediaWiki could help here).
 * WikiParserAnalyser - uses the latest parser implementation to parse and index wiki source.

==Others==

 * WikiQueryParser
 * Aggregate - a bean that captures information about one item going into an index aggregate field.

==Problems in Aggregate==
The following members no longer exist in the new Lucene API:
 * 1) Token.termText - undefined
 * 2) Analyzer.tokenStream(String, String) - undefined
 * 3) TokenStream.next - undefined
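A sketch of the Lucene 2.9-era replacements for these three members (the variable names are illustrative, not taken from the Aggregate source):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Old 2.4-style code (no longer compiles):
//   String text = token.termText;                     // public field removed
//   TokenStream ts = analyzer.tokenStream(field, text); // 2nd arg must be a Reader
//   Token t = ts.next();                               // next() deprecated
//
// 2.9-style equivalents:
TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
TermAttribute termAtt = (TermAttribute) ts.addAttribute(TermAttribute.class);
while (ts.incrementToken()) {     // replaces TokenStream.next()
    String term = termAtt.term(); // replaces Token.termText
}
```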

==Porting Filters==
Filters should be ported from the Lucene 2.4.x to the Lucene 2.9.x API. A ported filter declares attribute fields (note that in 2.9.x the term attribute class is TermAttribute; CharTermAttribute only arrives in later versions):

  private TermAttribute termAtt; // gives access to the term's character buffer and length
  private TypeAttribute typeAtt;

which should be initialized in the constructor via:

  public FilterX(TokenStream input) {
      super(input);
      termAtt = (TermAttribute) addAttribute(TermAttribute.class);
      typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
  }

Porting involves:
 * 1) writing unit tests for the old filter and checking that they still work with new input.
 * 2) Token next() and Token next(Token) have been deprecated.
 * 3) incrementToken() needs to be called on the input stream, not on the filter itself (which would recurse and cause a stack overflow).
 * 4) to process the token, add the attribute fields above to the filter and implement:

  public boolean incrementToken() throws IOException {
      if (!input.incrementToken())
          return false;

 * 1) boolean incrementToken() is now required.
 * 2) It moves the token stream one step forward.
 * 3) It returns true if there are more tokens, false otherwise.

      // process the token via termAtt.term()
      // then update the buffers
      termAtt.setTermBuffer(modifiedToken);
      termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength()));
      typeAtt.setType(TOKEN_TYPE_NAME);
      return true;
  }
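Putting the pieces together, a minimal complete 2.9-style filter might look like the following sketch. The class name, the lowercasing behaviour, and the token type constant are illustrative assumptions, not code from the old filters:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class ExampleLowercaseFilter extends TokenFilter {
    public static final String TOKEN_TYPE_NAME = "example";

    private final TermAttribute termAtt;
    private final TypeAttribute typeAtt;

    public ExampleLowercaseFilter(TokenStream input) {
        super(input);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
        typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) // advance the *input* stream, not this filter
            return false;
        char[] buffer = termAtt.termBuffer();
        int length = termAtt.termLength();
        for (int i = 0; i < length; i++) // modify the term in place
            buffer[i] = Character.toLowerCase(buffer[i]);
        typeAtt.setType(TOKEN_TYPE_NAME);
        return true;
    }
}
```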


 * Porting a Solr token filter from Lucene 2.4.x to Lucene 2.9.x
 * Porting to Lucene 4.0.x

=Search 2 Code Review=

==org.apache.lucene.search/ArticleInfo.java==
Note: the only implementation wraps methods of ArticleMetaSource, so it could be refactored away.
 * a (limited) interface for metadata on an article:
 ** isSubpage - whether it is a subpage
 ** daysOld - age in the index
 ** namespace - the article's namespace
 * the interface implementation is in org.wikimedia.lsearch.search/ArticleInfoImpl.java

==org.apache.lucene.search/ArticleNamespaceScaling.java==

 * boosts an article using its namespace.
 * used in:
 ** ArticleQueryWrap.customExplain
 ** ArticleQueryWrap.customScore
 ** SearchEngine.PrefixMatch
 * tested in:
 ** testComplex
 ** testDefault

==org.apache.lucene.search/ConstMinScore==

 * provides a boost query with a minimum score.
 * used by:
 ** CustomScorer

==org.apache.lucene.search/CustomBoostQuery==

 * Query that sets document score as a programmatic function of (up to) two (sub) scores.

==Package org.wikimedia.lsearch.index==

 * Indexer and indexing related classes.

==org.wikimedia.lsearch.index/WikiSimilarity.java==
public float lengthNorm(String fieldName, int numTokens) has been deprecated and needs to be replaced by public float computeNorm(String field, FieldInvertState state).
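A sketch of the replacement, assuming WikiSimilarity extends DefaultSimilarity and previously overrode lengthNorm(String, int). The 1/sqrt length factor below is the stock DefaultSimilarity behaviour, used here only as an illustration of where the old formula goes, not WikiSimilarity's actual formula:

```java
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

public class WikiSimilarity extends DefaultSimilarity {
    @Override
    public float computeNorm(String field, FieldInvertState state) {
        // FieldInvertState carries what lengthNorm's second argument used to be
        int numTokens = state.getLength();
        float lengthFactor = (float) (1.0 / Math.sqrt(numTokens)); // illustrative length norm
        // the field boost, formerly applied by the caller, must now be applied here
        return state.getBoost() * lengthFactor;
    }
}
```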

=References=