
=Search 2 Code Porting Notes=

==Questions==

 * Is it possible to use the old filter code with the new Lucene libraries?
 * Is it possible to make a non-invasive modification that would allow using the old filter code almost as-is?
 * What do the various filters do?
 * Which filters are in use for the various setups?

==Contributing To Solr==

It would be strategically wise to contribute back to Apache Lucene and/or Apache Solr. This is because:
 * Once the code is integrated into Solr, its maintainers have to keep it up to date with the changes in Lucene's API which invariably occur over time.
 * They have extensive tests.
 * Their user base will find and fix bugs much faster than can be done in the Wikipedia ecosystem.

=Search 2 Code Review=

==Filters==

Some missing filters:
 * WikiArticleFilter - indexes all Wikipedia article titles and redirects referenced in article texts, using a DB dump of the title tables.
 * NamedEntity filter - uses a database to index named entities mined from various wikis (Semantic MediaWiki could help here); a dictionary-lookup sketch follows this list.
 * WikiParserAnalyser - uses the latest parser implementation to parse and index wiki source.
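A minimal sketch (not the actual lsearch-2 code) of what a dictionary-backed filter like the two above could look like in the Lucene 2.9 attribute API; the class name, field names, and the "NAMED_ENTITY" type value are hypothetical:

 import java.io.IOException;
 import java.util.Set;
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;
 import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

 public final class DictionaryTagFilter extends TokenFilter {
   private final Set<String> dictionary; // e.g. titles or named entities loaded from a DB dump
   private final TermAttribute termAtt;
   private final TypeAttribute typeAtt;

   public DictionaryTagFilter(TokenStream input, Set<String> dictionary) {
     super(input);
     this.dictionary = dictionary;
     this.termAtt = (TermAttribute) addAttribute(TermAttribute.class);
     this.typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
   }

   @Override
   public boolean incrementToken() throws IOException {
     if (!input.incrementToken())
       return false;                       // no more tokens upstream
     if (dictionary.contains(termAtt.term()))
       typeAtt.setType("NAMED_ENTITY");    // mark known terms by token type
     return true;
   }
 }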

==Others==

 * WikiQueryParser
 * Aggregate - a bean that captures information about one item going into an index aggregate field.

==Problems in Aggregate==

The following calls no longer exist in the new Lucene API (replacements are sketched below):
 * Token.termText - undefined
 * Analyzer.tokenStream(String,String) - undefined
 * TokenStream.next - undefined
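A minimal sketch of the Lucene 2.9-era replacements for all three calls, assuming a stock WhitespaceAnalyzer; the field name and text are made up for illustration:

 import java.io.IOException;
 import java.io.StringReader;
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.WhitespaceAnalyzer;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;

 public class TokenStreamPort {
   public static void main(String[] args) throws IOException {
     Analyzer analyzer = new WhitespaceAnalyzer();
     // Analyzer.tokenStream(String, String) is gone: pass a Reader instead.
     TokenStream ts = analyzer.tokenStream("contents", new StringReader("some wiki text"));
     // Token.termText and TokenStream.next() are gone: use the attribute API.
     TermAttribute termAtt = (TermAttribute) ts.addAttribute(TermAttribute.class);
     while (ts.incrementToken()) {         // replaces TokenStream.next()
       System.out.println(termAtt.term()); // replaces Token.termText
     }
     ts.close();
   }
 }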

==Porting Filters==

Filters should be ported from the 2.4.x to the Lucene 2.9.x API. This involves adding attribute fields to the filter:

 private TermAttribute termAtt; // replaces direct access to Token.termText
 private TypeAttribute typeAtt; // replaces direct access to the token type

(in Lucene 3.1+ TermAttribute is itself superseded by CharTermAttribute). These should be initialized in the constructor via:

 public FilterX(TokenStream input) {
   super(input);
   termAtt = (TermAttribute) addAttribute(TermAttribute.class);
   typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
 }

The porting steps are:
 * Write unit tests for the old filter and check that they still pass with the new input.
 * Token next() and Token next(Token) have been deprecated; public boolean incrementToken() is now required. It moves the token stream one step forward and returns true if there are more tokens, false otherwise.
 * incrementToken() needs to be called on the input token stream, not on the filter itself (which would cause a stack overflow).
 * To process the token, add to the filter:

 public boolean incrementToken() throws IOException {
   if (!input.incrementToken())
     return false;
   // process the token via termAtt.term(), then update the buffers:
   // setTermBuffer copies the new term text into the attribute's term buffer
   termAtt.setTermBuffer(modifiedToken);
   termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength()));
   typeAtt.setType(TOKEN_TYPE_NAME);
   return true;
 }
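As a smoke test, the ported filter can be wired into an analyzer chain; a minimal sketch, with the hypothetical FilterX standing in for the ported filter:

 import java.io.Reader;
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.WhitespaceTokenizer;

 // Hypothetical analyzer that runs the ported filter after a whitespace tokenizer.
 public class FilterXAnalyzer extends Analyzer {
   @Override
   public TokenStream tokenStream(String fieldName, Reader reader) {
     return new FilterX(new WhitespaceTokenizer(reader));
   }
 }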


 * Porting SOLR Token Filter from Lucene 2.4.x to Lucene 2.9.x
 * Porting to Lucene 4.0.x

==Package org.apache.xmlrpc.webserver==

 * no issues

==Package org.wikimedia.lsearch.beans==

 * no issues

==Package org.wikimedia.lsearch.benchmark==

 * no issues

==Package org.wikimedia.lsearch.config==

 * no issues

==Package org.wikimedia.lsearch.frontend==

 * no issues

==Package org.wikimedia.lsearch.highlight==

 * review, then migrate to the org.apache.lucene.search classes below

===org.apache.lucene.search/ArticleInfo.java===

Note: the only implementation wraps methods of ArticleMetaSource, so it could be refactored away.
 * a (limited) interface for metadata on an article (sketched below)
 * isSubpage - whether the article is a subpage
 * daysOld - its age in the index
 * namespace - the article's namespace
 * the implementation of the interface is in org.wikimedia.lsearch.search/ArticleInfoImpl.java
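A minimal sketch of the interface, reconstructed from the method list above; the exact signatures and return types in the lsearch-2 source may differ:

 // Reconstructed from the notes above, not copied from the source.
 public interface ArticleInfo {
   boolean isSubpage();  // whether the article is a subpage
   int daysOld();        // the article's age in the index, in days
   int namespace();      // the article's namespace id
 }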

===org.apache.lucene.search/ArticleNamespaceScaling.java===

 * boosts an article's score based on its namespace (see the sketch after this list)
 * used in:
 ** ArticleQueryWrap.customExplain
 ** ArticleQueryWrap.customScore
 ** SearchEngine.PrefixMatch
 * tested in:
 ** testComplex
 ** testDefault
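A minimal sketch of what namespace scaling amounts to, assuming a per-namespace boost table with a default factor of 1.0; the namespace ids and factors below are made up, not the actual lsearch-2 values:

 import java.util.HashMap;
 import java.util.Map;

 // Hypothetical reconstruction: maps a namespace id to a score multiplier.
 public class NamespaceScaling {
   private final Map<Integer, Float> boosts = new HashMap<Integer, Float>();

   public NamespaceScaling() {
     boosts.put(0, 1.0f); // main namespace, no boost
     boosts.put(1, 0.5f); // e.g. talk pages scored down
   }

   /** Scale a raw score by the boost for the article's namespace. */
   public float scale(float score, int namespace) {
     Float boost = boosts.get(namespace);
     return score * (boost != null ? boost : 1.0f);
   }
 }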

===org.apache.lucene.search/ConstMinScore.java===

 * provides a boost query with a minimum score (see the sketch below)
 * used by:
 ** CustomScorer
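One plausible reading of a "boost query with a minimum score" is a score floor; a minimal sketch of that idea (not the actual ConstMinScore code):

 // Hypothetical sketch: enforce a constant minimum on a computed boost score.
 public final class ScoreFloor {
   private final float minScore;

   public ScoreFloor(float minScore) {
     this.minScore = minScore;
   }

   public float apply(float score) {
     return Math.max(score, minScore); // never let the boost drop below the floor
   }
 }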

===org.apache.lucene.search/CustomBoostQuery.java===

 * a Query that sets the document score as a programmatic function of (up to) two (sub) scores (see the sketch below).
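That description matches the spirit of Lucene's stock CustomScoreQuery; a minimal sketch of combining two sub-scores with that class in the 2.9 API, as an illustration of the idea rather than the lsearch-2 implementation:

 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.function.CustomScoreQuery;
 import org.apache.lucene.search.function.ValueSourceQuery;

 // Illustration: combine the text score with a value-source score
 // (e.g. a stored rank field).
 public class RankBoostQuery extends CustomScoreQuery {
   public RankBoostQuery(Query textQuery, ValueSourceQuery rankQuery) {
     super(textQuery, rankQuery);
   }

   @Override
   public float customScore(int doc, float subQueryScore, float valSrcScore) {
     // programmatic function of the two sub-scores
     return subQueryScore * (1.0f + valSrcScore);
   }
 }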

==Breaking API Changes==

IndexWriter writer = new IndexWriter(String path, null, boolean newIndex) has been deprecated and removed from the API. Indexing has advanced considerably since 2.3, and these changes should be integrated into the indexer code:
 * it is possible to use an index while it is being updated.
 * it is possible to update documents in the index (see the sketch below).
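A minimal sketch of updating a document in place with the newer API; the pageId field name and the helper are assumptions for illustration, not the actual indexer code:

 import java.io.IOException;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;

 // Hypothetical helper: atomically replaces any document whose pageId matches.
 // Open readers keep serving the old segments while the update happens.
 class ArticleUpdater {
   static void updateArticle(IndexWriter writer, String pageId, Document doc) throws IOException {
     doc.add(new Field("pageId", pageId, Field.Store.YES, Field.Index.NOT_ANALYZED));
     writer.updateDocument(new Term("pageId", pageId), doc);
   }
 }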

==Non-Breaking API Changes==

The following IndexWriter calls have been deprecated:

 writer.setSimilarity(new WikiSimilarity());
 writer.setMaxBufferedDocs(maxBufDocs);
 writer.setMergeFactor(mergeFactor);
 writer.setUseCompoundFile(true);
 writer.setMaxFieldLength(WikiIndexModifier.MAX_FIELD_LENGTH);

They should be replaced with:
 * IndexWriterConfig.setSimilarity(Similarity)
 * IndexWriterConfig.setMaxBufferedDocs(int)
 * LogMergePolicy.setMergeFactor(int)
 * LogMergePolicy.setUseCompoundFile(boolean)
 * LimitTokenCountAnalyzer

Porting notes: an IndexWriterConfig needs to be created and then passed to the IndexWriter constructor:

 IndexWriterConfig conf = new IndexWriterConfig(analyzer);
 conf.setSimilarity(new WikiSimilarity()).setMaxBufferedDocs(maxBufDocs); // the setters allow chaining
 writer = new IndexWriter(dir, conf); // constructs a new IndexWriter per the settings given in conf

All but the last are pretty trivial changes. However, using LimitTokenCountAnalyzer instead of setMaxFieldLength means that another step needs to be added to analysis. Also, the fact that some of the analysis uses multi-valued (where?) fields means that LimitTokenCountAnalyzer will operate differently from before: the analyzer limits the number of tokens per token stream created, while the old setting limited the total number of tokens to index, so more tokens would get indexed.
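A minimal sketch of the extra analysis step, assuming WikiAnalyzer stands for the existing analyzer; note the per-token-stream semantics described above:

 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.LimitTokenCountAnalyzer;

 // Wrap the existing analyzer so each token stream stops after the limit.
 // With multi-valued fields every value gets its own stream, so the total
 // number of tokens indexed for a field can exceed the old per-field cap.
 Analyzer base = new WikiAnalyzer(); // the existing analyzer
 Analyzer limited = new LimitTokenCountAnalyzer(base, WikiIndexModifier.MAX_FIELD_LENGTH);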

==Package org.wikimedia.lsearch.index==

 * Indexer and indexing-related classes.

===org.wikimedia.lsearch.index/WikiSimilarity.java===

public float lengthNorm(String fieldName, int numTokens) has been deprecated and needs to be replaced by public float computeNorm(String field, FieldInvertState state).
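A minimal sketch of the replacement override, extending DefaultSimilarity so the other scoring methods keep their stock behavior; the norm formula below is a placeholder, not the actual WikiSimilarity logic:

 import org.apache.lucene.index.FieldInvertState;
 import org.apache.lucene.search.DefaultSimilarity;

 public class WikiSimilarity extends DefaultSimilarity {
   @Override
   public float computeNorm(String field, FieldInvertState state) {
     int numTokens = state.getLength(); // token count, formerly the numTokens argument
     float boost = state.getBoost();    // the field boost must now be folded in by hand
     return boost * (float) (1.0 / Math.sqrt(numTokens));
   }
 }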
