Extension talk:SolrStore

Using 'Range' input type for a field

182.73.117.198 (talkcontribs)

I have implemented Solr search on my wiki using the SolrStore extension, and I use Solr's filter-based search feature. One of the filter inputs in the search is the year input. For now, the year input is a dropdown (the user can select a particular year and search for results in that year). I want to change it to a range, where the user can select a span of years, say from a to b, to search for results in those years.

How can I achieve this?

Thanks in advance.
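For reference, Solr's query syntax supports range filters out of the box, so a range input would translate the two selected years into a single filter clause. The field name "year" below is hypothetical; use whatever field your setup indexes the year under:

```
fq=year:[2005 TO 2010]
```

The SolrStore search form itself would still need an input widget that produces such a clause instead of the single dropdown value.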

Kghbln (talkcontribs)
Reply to "Using 'Range' input type for a field"
M art in (talkcontribs)

After installation there is a new link on Special:SpecialPages to "SolrSearch", but when I follow the link, no special page is found. Thanks for your help!

Reply to "No Specialpage"
M art in (talkcontribs)

Will there be a stable version for newer Solr servers (like 5.5 or 6.0), SMW 2.3, and MediaWiki 1.25 or 1.26? I have lots of problems configuring this extension on my server.

Thanks!

Reply to "Updates?"

summary line wrong

David Mason (talkcontribs)

hi,

We're now trying SolrStore with MW 1.19/Solr 4.1/SMW 1.8. It all seems to work, but the summary line is wrong:

Relevance: 27.0% - 2 KB (19 words) - 22:28, 16 May 2013

The search word is in the title, so shouldn't the relevance be higher? The article has more than 19 words (not sure about the KB size), and the date is incorrect since the article was last modified on the 15th. Is there a known fix?

thanks!

SBachenberg (talkcontribs)

Hi David, nice to hear that it's working with Solr 4.1; we haven't tested it yet. The relevance is a bit tricky, because Solr generates a score for each result based on TF-IDF. Normally you cannot convert a TF-IDF score cleanly into a percentage, but the default MediaWiki search form expects a relevance in percent. We often see relevance values far above 100%, so please do not take it as accurate.
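For illustration, one way a frontend could avoid showing values above 100% is to normalize every raw score against the best score of the current result set. This is just a sketch of that idea, not code from the extension; the sample scores are made up:

```php
<?php
// Sketch only (not SolrStore code): cap the displayed relevance at 100%
// by normalizing each raw TF-IDF score against the best score in the
// current result set. The input scores are made-up sample values.
function normalizeScores( array $scores ) {
	$max = max( $scores );
	if ( $max <= 0 ) {
		return array_fill( 0, count( $scores ), 0.0 );
	}
	return array_map( function ( $score ) use ( $max ) {
		// Percentage relative to the top hit, rounded to one decimal.
		return round( 100.0 * $score / $max, 1 );
	}, $scores );
}

print_r( normalizeScores( array( 5.4, 2.7, 0.9 ) ) );
```

The top hit always shows 100%, so the numbers are only comparable within one result set, not across searches.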

For the last modified date you have to do 2 things:

  1. Look at your Solr search result XML and find the actual field name of your modification date. The problem is that it depends on the language you are using: in an English wiki it should be "Modification date_dt", in a German one "Zuletzt geändert_dt".
  2. Go into SolrStore/templates/SolrSearchTemplate_Standart.php, line 81, and change it from if ( $docd[ 'name' ] == 'Zuletzt geändert_dt' ){ to your language; if it's English: if ( $docd[ 'name' ] == 'Modification date_dt' ){

EDIT: I just uploaded a fix for the English language to SourceForge: http://sourceforge.net/projects/smwsolrstore/files/SolrStore_0.8.1.zip/download

If you have any other problems, just ask.

Kghbln (talkcontribs)

Heiya Simon,

it would be nice to have the commit for the new version in Gerrit as well. That way all the translation updates would move into this version, too.

Cheers

David Mason (talkcontribs)

Hi,

Thanks for the translation fix, the date is correct now. However, the file size is still incorrect. For example, it shows "212 B (16 words)" for a page that is 752 words, 4534 bytes. As it's different for each entry I presume it's not a translation problem.

How should the "relevance" score be interpreted? Is the sorting correct? I don't want to put something in front of the users that's confusing.

SBachenberg (talkcontribs)

Hi, the score is a correct TF-IDF score; the higher the score the better, and the sorting is also correct.

I'll look into this bytes/words problem. I currently have no idea where the problem is, but I'll answer you as soon as possible.

One thing you should know about the extension is that we currently don't support searching in selected namespaces. You can only search in all namespaces, but you can exclude some namespaces in your LocalSettings.php with the parameter $wgSolrOmitNS.

The default is: $wgSolrOmitNS = array( '102' );

You should hide your advanced search options so nobody gets confused. The CSS for that is: .mw-search-formheader div.search-types, #mw-searchoptions { display: none; }

David Mason (talkcontribs)

Thanks very much for your diligence! Let me know if I can help.

SBachenberg (talkcontribs)

Sorry that it took so long, but I was a bit busy the last few days.

Could you please change the code in the file /SolrStore/Templates/SolrSearchTemplate_Standart.php at line 33 to this:

// get size, namespace, word count, date from XML:
foreach ( $xml->arr as $doc ) {
	switch ( $doc[ 'name' ] ) {
		case 'text':
			$textsnip = '';
			$textsnipvar = 0;
			foreach ( $doc->str as $inner ) {
				$textsnipvar++;
				if ( $textsnipvar >= 4 && $textsnipvar <= $snipmax ) {
					$textsnip .= ' ' . $inner;
				}
				$textsnip = substr( $textsnip, 0, $textlenght );
			}
			$this->mDate = $doc->date;
			break;
		case 'wikitext':
			$this->mSize = strlen( $doc->str );
			$this->mWordCount = count( $doc->str );
			$textsnipy = "";
			$textsnipy = $doc->str;
			$textsnipy = str_replace( '{', '', $textsnipy );
			$textsnipy = str_replace( '}', '', $textsnipy );
			$textsnipy = str_replace( '|', '', $textsnipy );
			$textsnipy = str_replace( '=', ' ', $textsnipy );
			$textsnipy = substr( $textsnipy, 0, $textlenght );
			break;
	}
}

I will upload a fix to SourceForge later.

David Mason (talkcontribs)

Hi again,

I tried this fix; it changes the output, but it's still not correct, unless there is something unusual about how it handles text in SMW templates, where most of our text is located. I also had to comment out code references to $nsText.

SBachenberg (talkcontribs)

Hi, I thought that would solve the problem.

Let me tell you a bit about how we handle the wikitext. We store the wikitext in the Solr field "wikitext" and each SMW attribute in its own field. We also have a field called "text", in which we save all fields combined. Before the patch we used "text" for the calculation; now we have changed it to "wikitext", which should be the right field for that purpose.

All of these fields can be customized through Solr itself, and that's where the problem must be. Could you please have a look at your Solr schema.xml? Around line 953 there should be something like this:

<field name="wikitext" type="text_general" indexed="true" stored="true" multiValued="true"/>

This defines "wikitext" with the Solr field type "text_general", which I thought would be the right one, but I never thought about counting words and bytes. Could you please change it to "string"? "text_general" uses analyzers, a tokenizer, and a handful of filters, and all these things manipulate the original text, which leads to the miscalculation.

The only big problem is that you have to restart your Solr after altering the schema.xml and also have to re-index your wiki, so that the new field definition takes effect.


Please tell me if it works, because re-indexing our SofisWiki takes up to three days and you will probably be faster :-)
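Following that suggestion, the changed line in schema.xml would read as below (same attributes as the quoted definition, only the type swapped; the exact line number will differ per installation):

```
<field name="wikitext" type="string" indexed="true" stored="true" multiValued="true"/>
```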

David Mason (talkcontribs)

Hi again,

I've changed that line and restarted Solr, then I ran SMW_refreshData(?). It's not clear how to refresh the data, so I ran SemanticMediaWiki/maintenance/SMW_refreshData.php and also maintenance/runJobs.php. During the former I saw lots of this:

PHP Notice: Array to string conversion in /var/www/mw/extensions/SemanticMediaWiki/includes/storage/SQLStore/SMW_SQLStore3_Writers.php on line 383

Unfortunately now the result looks the same as it did previously, and I see results like this "2 KB (1 word)" — that's some word!

If it helps, I could set up an isolated instance for you to connect to directly?

SBachenberg (talkcontribs)

Hi David, this sounds nice, but I think it would be enough if you could send me an XML result from your Solr. Then I can find out why the stored data is not counted correctly.

The way you re-indexed was absolutely right, but the error you get is not from the SolrStore. That's a known SMW error: https://bugzilla.wikimedia.org/show_bug.cgi?id=42321

SBachenberg (talkcontribs)

Hi David, I may have found the error. Could you please change

$this->mWordCount = count( $doc->str );

to

$this->mWordCount = str_word_count( $doc->str );
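For anyone following along, the effect of the changed call is easy to check in isolation; this snippet is only an illustration, not part of the extension:

```php
<?php
// str_word_count() counts the words in a string, which is what the
// summary line needs; count() counts array elements, not words, which
// is why the old code reported "1 word" for whole pages.
$text = 'the quick brown fox jumps';
$words = str_word_count( $text );
echo $words;
```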

Sorry that fixing this is taking so long. I haven't written this "template" part; that's all the work of my workmate Sascha Schüller, but he currently has no ambition to fix it.

David Mason (talkcontribs)

Hi,

I think that is better; it is higher than the "wc" word count, but that may be what it considers "words". I will run this by the users with some caveats.

Thanks again!

I'd like to talk about ways to extend this project, for example to support 'classes' without a php-coded template.

And it could also support uploaded documents (Word, PDF) since it's based on Solr.

Are these being considered?

SBachenberg (talkcontribs)

Hi David,

your ideas sound really good, but I'm not a good "extension developer", because I have almost no idea how MediaWiki works internally. But maybe you have the knowledge that I lack. I would also like to change the way the field-based search templates get defined. Writing them into LocalSettings.php is so uncool; it would be much nicer if I could define them with Semantic Forms.


So if you want to extend this project, you can do it on your own or we can do it together. Feel free to ask me anything about this extension.

David Mason (talkcontribs)

Yes, I'm absolutely interested in working on this, though I'm swamped for the next week. Can we meet online next week to talk about it?

Reply to "summary line wrong"
Hypergrove (talkcontribs)

What do you think of specifying the name of a solr core that is to be updated or queried?

SBachenberg (talkcontribs)

What do you actually mean?

Defining one url for updating and another for querying?

Or do you just want to specify the Solr core which should be asked for both?

cheers,

Hypergrove (talkcontribs)

I realize it would be an extension of SMW, but the thought is to accommodate multiple Solr cores. For instance {{#core-ask: |core=name}} and {{#core-set:name|prop=val}}. Just a thought! - john

SBachenberg (talkcontribs)

When we started with the extension, we tried to do ask queries with Solr, but we had too much trouble re-implementing the result printer. The SolrStore is currently a better version of Extension:MWSearch. If you have good knowledge of the SMW code, you could implement this feature. I will help everybody who is interested in developing new features for the extension; just pm me.

Reply to "Multicore support"
Hypergrove (talkcontribs)

Why does SolrStore state a dependency on Tomcat? That's only one of various Servlet Containers supported by Solr itself, including Glassfish, JBoss, Jetty (default, included into Solr package), Resin, Weblogic and WebSphere. thanks

SBachenberg (talkcontribs)

Hi Hypergrove, you are absolutely right; you can use whatever you want.

Cheers,

Kghbln (talkcontribs)

I guess Tomcat is part of the setup they use and cater to.

Reply to "Why Tomcat dependency?"

Error with undefined method SolrConnectorStore::getConceptCacheStatus

MWJames (talkcontribs)

Sorry, I haven't had much time to look at it, but the following keeps turning up while using concepts. SMW_SQLStore2 defines a method called getConceptCacheStatus, but somehow this method is not present in the extended SolrConnectorStore SMWStore class.

Fatal error: Call to undefined method SolrConnectorStore::getConceptCacheStatus() in
SBachenberg (talkcontribs)

Thanks for reporting the bug. To fix it you have to add the following code to your SolrConnectorStore.php at line 39:

	/**
	 * Return status of the concept cache for the given concept as an array
	 * with key 'status' ('empty': not cached, 'full': cached, 'no': not
	 * cachable). If status is not 'no', the array also contains keys 'size'
	 * (query size), 'depth' (query depth), 'features' (query features). If
	 * status is 'full', the array also contains keys 'date' (timestamp of
	 * cache), 'count' (number of results in cache).
	 *
	 * @param $concept Title or SMWWikiPageValue
	 */
	public function getConceptCacheStatus( $concept ) {
		return self::getBaseStore()->getConceptCacheStatus( $concept );
	}
Reply to "Error with undefined method SolrConnectorStore::getConceptCacheStatus"
David Mason (talkcontribs)

Great extension! I hope to see it at 100% soon. Running the 'trunk' version I encountered these issues:

  • fieldset name won't accept spaces
  • search form and results are messed up in the Vector skin (fixed with /table from below)
  • prompted to 'Create the page [fieldset name] on this wiki!'
  • trying to run refreshdata from semantic mediawiki/maintenance produces:

Warning: DOMDocument::loadHTML(): Unexpected end tag : p in Entity, line: 8 in /home/vid/webs/atip/docs/mediawiki-1.18.1/extensions/SolrStore/SolrTalker.php on line 258

Catchable fatal error: Argument 1 passed to DOMDocument::saveXML() must be an instance of DOMNode, null given, called in /home/vid/webs/atip/docs/mediawiki-1.18.1/extensions/SolrStore/SolrTalker.php on line 209 and defined in /home/vid/webs/atip/docs/mediawiki-1.18.1/extensions/SolrStore/SolrTalker.php on line 261

SBachenberg (talkcontribs)

Hi David, thanks for reporting your problems.

I'm a bit confused about some of them, though.

fieldset name won't accept spaces

This should accept spaces; have a look at http://sofis.gesis.org/sofiswiki/Spezial:SolrSearch/Projekte

This is the fieldset definition from our wiki; maybe there is an error in your definition?

$wgSolrFields = array(
    new SolrSearchFieldSet('Projekte', 'Titel; Personen; id', 'Titel; Personen und Authoren; SOFIS-Nr. (Erfassungs-Nr.)', ' AND category:Projekte', 'AND'),
    new SolrSearchFieldSet('Institutionen', 'name; Inst-ID;ort', 'Name; Institutions-Nr.;Ort', ' AND category:Institution', 'AND')
);

search form and results are messed up in Vector skin (fixed with /table from below)

This should be fixed now in the SVN. Sometimes my friend User:Schuellersa gets a bit confused by his SolrStore versions and commits the wrong one (this is the third time we've fixed it now :)

Prompted to 'Create the page [fieldset name] on this wiki!'

This is really new to me; could you please post your $wgSolrFields definition?

trying to run refreshdata from semantic mediawiki/maintenance

We are working on that; could you try the newest SVN version? This seems to be the same error as Extension_talk:SolrStore#Unexpected_XML_tag_doc.

SBachenberg (talkcontribs)

Hi David, the refreshdata error should now be fixed in the newest SVN version; please test it.

Reply to "a few problems"

Unexpected XML tag doc/p

MWJames (talkcontribs)

During a runJobs exercise another error occurred. The backtrace does not say which document caused the error, nor which XML tag; anyway, please find the backtrace below.

The request sent by the client was syntactically incorrect (unexpected XML tag doc/p).
Backtrace:
#0 D:\xampp\htdocs\...\extensions\SolrStore\SolrTalker.php(209): SolrTalker->solrSend('http://192.168....', '<add><doc><fiel...')
#1 D:\xampp\htdocs\...\extensions\SolrStore\SolrTalker.php(279): SolrTalker->solrAdd('<add><doc><fiel...')
#2 D:\xampp\htdocs\...\extensions\SolrStore\SolrTalker.php(439): SolrTalker->addDoc(Object(SolrDoc))
#3 D:\xampp\htdocs\...\extensions\SolrStore\SolrConnectorStore.php(139): SolrTalker->parseSemanticData(Object(SMWSemanticData))
SBachenberg (talkcontribs)

This looks like the same error as "Error on attribute value &#13; and &lt;".

There seems to be a tag or something in your value. I have to fix that on Tuesday, when I'm back at work.

SBachenberg (talkcontribs)

If you want to fix it yourself, have a look at SolrDoc.php.

In line 26 is the function addField( $name, $value ); you have to add some string replacements for the field value and remove '<' and '>'. This should probably fix the errors.

But I have to build a better solution for cleaning the values.

MWJames (talkcontribs)

For testing purposes, I just did a quick hack so that at least runJobs doesn't break any more.

$value = preg_replace('/<|>/msu', '',$value);
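In isolation, the quick hack behaves like this (the sample value is made up for illustration):

```php
<?php
// Quick hack from above: strip every '<' and '>' character so the XML
// sent to Solr cannot contain stray tags. Note that the tag names and
// slashes are left behind in the text.
$value = 'Abstract with a <span>tag</span> inside';
$clean = preg_replace( '/<|>/msu', '', $value );
echo $clean;
```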
Schuellersa (talkcontribs)
MWJames (talkcontribs)

Instead of using the preg_replace, I now use MW's own XML sanitizer (Sanitizer::normalizeCharReferences), which should make any name/value XML-conformant.

$this->output .= '<field name="' .  Sanitizer::normalizeCharReferences ( $name ) . '">' . Sanitizer::normalizeCharReferences ( $value ) . '</field>';
SBachenberg (talkcontribs)

This looks really cool. I didn't know that MW has its own Sanitizer. Thank you!

We will add this.

MWJames (talkcontribs)

Having said this, everything should be covered, but somehow Solr still comes back with an error, which means there must be another area where some misleading XML tags create a crash.

But I have a hypothesis that when a property, for example Abstract (has type::text), not only contains text but also a notion of a template ({{value| ...}}), a crash dump is created while trying to save the article. When changing {{value| ...}} to [[value:: ...]] in the property value text, the same article saves without any trouble.

Schuellersa (talkcontribs)
SBachenberg (talkcontribs)

The best thing would be to remove all HTML tags completely before sending them to Solr. I think nobody wants to query HTML tags, so you don't need them in your Solr index.

Could you try this piece of code for me?

public function addField( $name, $value ) {
	$this->output .= '<field name="' . strip_tags( $name ) . '">' . strip_tags( $value ) . '</field>';
}
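For comparison with the earlier preg_replace hack, strip_tags() removes whole tags rather than just the angle brackets; a standalone illustration (sample value made up):

```php
<?php
// strip_tags() removes complete HTML/XML tags, keeping only the text
// between them, so nothing tag-like reaches the Solr index.
echo strip_tags( 'Abstract with a <span>tag</span> inside' );
```

Note that template markup like {{value| ...}} contains no angle brackets, so strip_tags() leaves it untouched, which would match the follow-up report that the crash persisted for such values.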
MWJames (talkcontribs)

For the above-cited case of {{ }} within property values, the change didn't bring any success; it still runs into a backtrace. Could there be another area where XML fragments are created?

SBachenberg (talkcontribs)

Hi James, this is the only place where we create XML before we send it to Solr.

Maybe you could send me another backtrace?

The normal way the SMW data gets parsed is:

  1. read the attributes and values
  2. add them to a SolrDoc
  3. send the SolrDoc to Solr with the SolrTalker
  4. done
MWJames (talkcontribs)

As I said before, the backtrace is really ambiguous; therefore one can't really tell where, when, and how things are happening.

I also tried logging ($wgDebugLogFile) any other possible messages, but the log file does not show any information related to the above problem.

Maybe some wfDebugLog( 'SolrStore', __METHOD__,... log messages could help to shed light on where the problems occur. Using this message type would allow filtering all related Solr messages using $wgDebugLogGroups = ...

Anyway, the latest backtrace is based on SVN r114866.

The request sent by the client was syntactically incorrect (unexpected XML tag doc/span).
Backtrace:
#0 D:\xampp\htdocs\...\extensions\SolrStore\SolrTalker.php(211): SolrTalker->solrSend('http://192.168....', '<add><doc><fiel...')
#1 D:\xampp\htdocs\...\extensions\SolrStore\SolrTalker.php(281): SolrTalker->solrAdd('<add><doc><fiel...')
#2 D:\xampp\htdocs\...\extensions\SolrStore\SolrTalker.php(441): SolrTalker->addDoc(Object(SolrDoc))
#3 D:\xampp\htdocs\...\extensions\SolrStore\SolrConnectorStore.php(139): SolrTalker->parseSemanticData(Object(SMWSemanticData))
#4 D:\xampp\htdocs\...\extensions\SemanticMediaWiki\includes\storage\SMW_Store.php(303): SolrConnectorStore->doDataUpdate(Object(SMWSemanticData))
#5 D:\xampp\htdocs\...\extensions\SemanticMediaWiki\includes\SMW_ParseData.php(316): SMWStore->updateData(Object(SMWSemanticData))
#6 D:\xampp\htdocs\...\extensions\SemanticMediaWiki\includes\SMW_ParseData.php(445): SMWParseData::storeData(Object(ParserOutput), Object(Title), true)
#7 [internal function]: SMWParseData::onLinksUpdateConstructed(Object(LinksUpdate))
#8 D:\xampp\htdocs\...\includes\Hooks.php(216): call_user_func_array('SMWParseData::o...', Array)
#9 D:\xampp\htdocs\...\includes\GlobalFunctions.php(3631): Hooks::run('LinksUpdateCons...', Array)
#10 D:\xampp\htdocs\...\includes\LinksUpdate.php(98): wfRunHooks('LinksUpdateCons...', Array)
#11 D:\xampp\htdocs\...\includes\job\RefreshLinksJob.php(49): LinksUpdate->__construct(Object(Title), Object(ParserOutput), false)
#12 D:\xampp\htdocs\...\maintenance\runJobs.php(78): RefreshLinksJob->run()
#13 D:\xampp\htdocs\...\maintenance\doMaintenance.php(105): RunJobs->execute()
#14 D:\xampp\htdocs\...\maintenance\runJobs.php(108): require_once('D:\xampp\htdocs...')
#15 {main}
SBachenberg (talkcontribs)

Hi James, could you send me the source code of one of the pages that causes problems? My mail is simon.bachenberg(at)gmail.com

I have now set up a new MediaWiki to reproduce this error, and I need some good data for it ;-)

For testing purposes you can add $wgSolrDebug = true; to your LocalSettings.php to see everything that gets sent to Solr.

SBachenberg (talkcontribs)

Hi James, we should have fixed this error now in the newest SVN version; could you please test it?

Reply to "Unexpected XML tag doc/p"

Probe connectivity to the Solr host

MWJames (talkcontribs)

When the Solr host suddenly becomes unavailable, all article saving goes south. The interface should somehow check whether it is able to connect to the Solr host, and otherwise bail out.

couldn't connect to host
Backtrace:
#0 D:\xampp\htdocs\...\extensions\SolrStore\SolrTalker.php(211): SolrTalker->solrSend('http://192.168....', '<add><doc><fiel...')
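One possible shape for such a probe, sketched under the assumption that the core's standard ping handler (/admin/ping) is enabled; the function name and URL are made up, and this is not code from the extension:

```php
<?php
// Sketch: return true only if the Solr host answers its ping handler
// within a short timeout, so article saves could bail out early
// instead of dying with "couldn't connect to host".
function solrIsReachable( $baseUrl, $timeoutSeconds = 2 ) {
	$ch = curl_init( rtrim( $baseUrl, '/' ) . '/admin/ping' );
	curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
	curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds );
	curl_setopt( $ch, CURLOPT_TIMEOUT, $timeoutSeconds );
	curl_exec( $ch );
	$code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
	curl_close( $ch );
	return $code === 200;
}
```

A wrapper around the send path could call this first and re-queue the update when it returns false, rather than letting the save fail.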
SBachenberg (talkcontribs)

I love you for testing our extension; we are going to fix this somehow. I could imagine retrying the send to Solr up to 5 times, but after that an error will be thrown.

The bigger problem is that the SMW indexer has to stop until Solr is ready again. I have no idea how to tell it to stop.

You can always re-index your wiki by using the "Repair" button under Spezial:SMW-Administration, but that's no solution to the problem.

MWJames (talkcontribs)

Actually, for the case above, Solr was not available because the server was restarted. I'm not sure about the inner workings of Solr, but certainly there must be a method to check whether Solr is ready to receive index values and, in case it is not, return true for the hook and mark the document as non-indexed.

Normally, for any indexing service, you would have a status table on which one can track the current status of the documents. While I'm sure you don't want to introduce any special handling or create an additional status table, you could instead trace the status by creating a meta-subobject (with a special property) which is created and annotated to an entity (page) whenever the status returns with anything other than successful. Then either an #ask query can find those subobjects, or a special status page can pick them up, display them, and allow for a mass re-index, because running Special:SMWAdmin is not always the best option (in our case we have around 1.1M triples, which makes every Special:SMWAdmin run very costly).

SBachenberg (talkcontribs)

I'll have to think about it; I'll find a nice solution.

We have the same problem with re-indexing; it takes us 1-2 days for a full rebuild. This is why we restart Solr only if we have changed our schema, because after most schema changes you have to re-index to get all properties indexed the right way.

A tip on the side: create your own Solr schema for your wiki, for better query results. You can add stemmers, tokenizers, and much more for your data types, or copyFields, which merge two fields into one. Most of these things are only interesting if you use the field-based search.

Reply to "Probe connectivity to the Solr host"