Hello, quick question: once I have downloaded all the dependencies for the CirrusSearch extension, how do I link it with Elasticsearch?
Help talk:CirrusSearch
Hello!
Some of the filenames of these files aren't complete for some reason. Also you would expect that this search would say "File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac (redirect from File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac.flac)", however it doesn't. Might anyone here know why?
Hi,
I'm not sure I understand what is incomplete in the query results for intitle:/Philadelphia Orchestra/ filemime:audio/x-flac.
Is there a specific page missing?
The second query you pasted contains an error; after fixing it (I assumed you wanted to search for the redirect with the doubled .flac extension), it finds the page you mention. Here is how I fixed the query:
intitle:/Philadelphia Orchestra/ filemime:audio/x-flac intitle:/\.flac\.flac/
When searching for a redirect, the search engine will always display the redirect's target page; sometimes you may see a hint that you matched a redirect when the mention "(redirect from: page_name)" appears after the page title. See for instance the results for: intitle:/\.flac\.flac/ filemime:audio/x-flac incategory:"Swiss Foundation Public Domain".
DCausse, the filenames are only partially displayed. From the first search Jonteemil provided, it seems there is a maximum display length (some character limit), and the second search condition intitle:/\.flac\.flac/ only narrows down the result(s) without adjusting the displayed lines.
But with an altered search I get the full display; it links to the redirect target with only one file extension, though, as already pointed out by DCausse: file: filemime:audio/x-flac intitle:/Philadelphia Orchestra.+\.flac\.flac/
Note that I merged both regex searches, as running two of them is really bad in terms of server load (I also added the namespace to the search domain; this should always be added if possible).
Hi - there are two types of dumps available for enwiki pages: a monthly database dump structured in XML, which you can subscribe to, and weekly cirrussearch dumps, which are structured in JSON for bulk upload to Elasticsearch. We're trying to diff the two dumps to see if they're comparable, but notice some articles are in the monthly XML dump but not in the weekly cirrussearch dump. I'm having trouble finding an explanation on the main Wikimedia site that clearly states the difference between these two enwiki dumps. Any additional information would be much appreciated.
I would post links, but am getting an error when trying to post, so please navigate to dumps.wikimedia.org and look under these paths:
cirrussearch dump: /other/cirrussearch/
XML dump: /enwiki/latest/
Check if the "missing" articles in the cirrus search dumps exist on the live wiki. If not, that means those articles were deleted after the monthly XML dump was generated but before the weekly cirrus search dump.
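That check can be automated against the live wiki's action API. A rough sketch (the helper names are mine, not part of any dump tooling; the canned response at the bottom is illustrative, while check_live needs network access to en.wikipedia.org):

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def missing_titles(api_response: dict) -> list:
    """Given an action=query&titles=... response (formatversion=1),
    return the titles the live wiki reports as missing (deleted or
    never existing)."""
    pages = api_response["query"]["pages"].values()
    return sorted(p["title"] for p in pages if "missing" in p)

def check_live(titles):
    """Ask the live wiki about a batch of titles (network required)."""
    qs = urlencode({"action": "query", "titles": "|".join(titles),
                    "format": "json"})
    with urlopen(f"{API}?{qs}") as resp:
        return missing_titles(json.load(resp))

# Offline example with a canned API response:
sample = {"query": {"pages": {
    "-1": {"title": "Some deleted article", "missing": ""},
    "123": {"title": "Still here", "pageid": 123},
}}}
print(missing_titles(sample))  # -> ['Some deleted article']
```

Titles the XML dump has but the cirrus dump lacks, and which this reports as missing, were most likely deleted between the two dump runs.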
I tried -incategory:"Actif" and it returns results as if it were an include. So I tried with ! and that doesn't work either (-! and !- behave the same as -). I also tried to include Actif with incategory:"Actif" and that doesn't work.
Something is wrong... how can I fix this?
Any help is appreciated.
The syntax you are using is correct and should work: -incategory:"Cartographie" searches for pages that are not directly under the Cartographie category.
It might be that you are expecting incategory to find articles belonging to the tree of subcategories? This is not the case; incategory is limited to direct relationships. For finding articles in a category and the subcategories of this category you must use the deepcat keyword: deepcat:"Cartographie". This keyword is available on WMF wikis.
Thank you @DCausse (WMF), but it still doesn't work. Actif is directly linked. And I tried again with deepcat, same thing; it doesn't work with just deepcat either.
If you are using a publicly accessible wiki could you provide the link to it so that we can try to have a closer look?
If you are using your own private wiki, could you double check that CirrusSearch is properly installed? To do this I generally append &cirrusDumpQuery to the URL in the address bar on the search results page.
This is how it looks on the French Wikipedia:
This should display the JSON document sent to Elasticsearch, and we might be able to detect what is wrong in your setup, especially whether the match query on the category.lowercase_keyword field is wrapped inside a must_not block or not.
The deepcat keyword requires a graph engine to be installed and set up; this most probably explains why it does not work out of the box.
It's local, so I can't provide you a link. I tried what you said (mediawiki-1.37.1/index.php?search=-Category%3A"Actif"&title=Spécial%3ARecherche&cirrusDumpQuery) and I think CirrusSearch is not properly installed. I'll retry; maybe I don't quite understand something?
OK - after checking my install I have one question:
Do I only need the Elastica extension for MediaWiki and the cURL library in order to run CirrusSearch?
If not, how can I download the Elastica library for PHP for XAMPP on Windows?
After some time I retried downloading Elasticsearch, and now it works, but incategory doesn't really work (for me, !incategory: and -incategory: both include; I can't exclude, and operators like OR, AND, NOT don't work).
So in order to run Elasticsearch I have to run a command in cmd on Windows, but I need to keep a cmd window running constantly. I searched on the internet and found the command start /B. So now Elasticsearch runs in a hidden cmd; to stop it I just have to close my main cmd window.
So here is the result with &cirrusDumpQuery for !incategory:"Actif" -incategory:"Sql":
{
"__main__": {
"description": "full_text search for '!incategory:\"Actif\" -incategory:\"Sql\"'",
"path": "test\/page\/_search",
"params": {
"timeout": "20s",
"search_type": "dfs_query_then_fetch"
},
"query": {
"_source": [
"namespace",
"title",
"namespace_text",
"wiki",
"redirect.*",
"timestamp",
"text_bytes"
],
"stored_fields": [
"text.word_count"
],
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{
"query_string": {
"query": "!incategory\\: (all.plain:\"Actif\"~0^1)",
"fields": [
"all.plain^1",
"all^0.5"
],
"phrase_slop": 0,
"default_operator": "AND",
"allow_leading_wildcard": true,
"fuzzy_prefix_length": 2,
"rewrite": "top_terms_boost_1024"
}
},
{
"multi_match": {
"fields": [
"all_near_match^2",
"all_near_match.asciifolding^1.5"
],
"query": "!incategory:"
}
}
],
"filter": [
{
"bool": {
"must": [
{
"terms": {
"namespace": [
0,
4,
6,
14
]
}
}
],
"must_not": [
{
"bool": {
"should": [
{
"match": {
"category.lowercase_keyword": {
"query": "Sql"
}
}
}
]
}
}
]
}
}
]
}
},
"highlight": {
"pre_tags": [
"\ue000"
],
"post_tags": [
"\ue001"
],
"fields": {
"title": {
"type": "fvh",
"number_of_fragments": 0,
"order": "score",
"matched_fields": [
"title",
"title.plain"
]
},
"redirect.title": {
"type": "fvh",
"number_of_fragments": 1,
"order": "score",
"fragment_size": 10000,
"matched_fields": [
"redirect.title",
"redirect.title.plain"
]
},
"category": {
"type": "fvh",
"number_of_fragments": 1,
"order": "score",
"fragment_size": 10000,
"matched_fields": [
"category",
"category.plain"
]
},
"heading": {
"type": "fvh",
"number_of_fragments": 1,
"order": "score",
"fragment_size": 10000,
"matched_fields": [
"heading",
"heading.plain"
]
},
"text": {
"type": "fvh",
"number_of_fragments": 1,
"order": "score",
"fragment_size": 150,
"no_match_size": 150,
"matched_fields": [
"text",
"text.plain"
]
},
"auxiliary_text": {
"type": "fvh",
"number_of_fragments": 1,
"order": "score",
"fragment_size": 150,
"matched_fields": [
"auxiliary_text",
"auxiliary_text.plain"
]
},
"file_text": {
"type": "fvh",
"number_of_fragments": 1,
"order": "score",
"fragment_size": 150,
"matched_fields": [
"file_text",
"file_text.plain"
]
}
},
"highlight_query": {
"query_string": {
"query": "!incategory\\: (title.plain:\"Actif\"~0^20 OR redirect.title.plain:\"Actif\"~0^15 OR category.plain:\"Actif\"~0^8 OR heading.plain:\"Actif\"~0^5 OR opening_text.plain:\"Actif\"~0^3 OR text.plain:\"Actif\"~0^1 OR auxiliary_text.plain:\"Actif\"~0^0.5 OR file_text.plain:\"Actif\"~0^0.5)",
"fields": [
"title.plain^20",
"redirect.title.plain^15",
"category.plain^8",
"heading.plain^5",
"opening_text.plain^3",
"text.plain^1",
"auxiliary_text.plain^0.5",
"file_text.plain^0.5",
"title^10",
"redirect.title^7.5",
"category^4",
"heading^2.5",
"opening_text^1.5",
"text^0.5",
"auxiliary_text^0.25",
"file_text^0.25"
],
"phrase_slop": 1,
"default_operator": "AND",
"allow_leading_wildcard": true,
"fuzzy_prefix_length": 2,
"rewrite": "top_terms_boost_1024"
}
}
},
"stats": [
"full_text",
"full_text_querystring",
"complex_query",
"incategory",
"query_string"
],
"rescore": [
{
"window_size": 8192,
"query": {
"query_weight": 1,
"rescore_query_weight": 1,
"score_mode": "multiply",
"rescore_query": {
"function_score": {
"functions": [
{
"field_value_factor": {
"field": "incoming_links",
"modifier": "log2p",
"missing": 0
}
},
{
"weight": 0.1,
"filter": {
"terms": {
"namespace": [
4
]
}
}
},
{
"weight": 0.2,
"filter": {
"terms": {
"namespace": [
6,
14
]
}
}
}
]
}
}
}
}
],
"size": 21
},
"options": {
"timeout": "20s",
"search_type": "dfs_query_then_fetch"
}
}
}
Hi,
the proper syntax is -incategory:Sql, and the JSON output you pasted shows that it works, since it contains this section:
"must_not": [
{
"bool": {
"should": [
{
"match": {
"category.lowercase_keyword": {
"query": "Sql"
}
}
}
]
}
}
]
Thank you very much for your patience and your response!
But why does it work without quotes to exclude and work with quotes to include?
incategory:"sql" and incategory:sql should produce the same query. Similarly, -incategory:"sql" and -incategory:sql should also produce the same query. If not, please try to identify what differs in the cirrusDumpQuery output.
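One way to spot what differs is to diff the two cirrusDumpQuery JSON documents programmatically. A small sketch (not part of CirrusSearch; paste the parsed dumps in as dicts, or load them from files with json.load):

```python
def diff(a, b, path=""):
    """Yield dotted paths where two parsed JSON documents differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            if key not in a:
                yield f"{path}.{key}: only in second"
            elif key not in b:
                yield f"{path}.{key}: only in first"
            else:
                yield from diff(a[key], b[key], f"{path}.{key}")
    elif isinstance(a, list) and isinstance(b, list):
        for i, (x, y) in enumerate(zip(a, b)):
            yield from diff(x, y, f"{path}[{i}]")
        if len(a) != len(b):
            yield f"{path}: lists differ in length ({len(a)} vs {len(b)})"
    elif a != b:
        yield f"{path}: {a!r} != {b!r}"

# Example: a clause wrapped in must_not vs one that is not.
q1 = {"bool": {"must_not": [{"match": {"category.lowercase_keyword": {"query": "sql"}}}]}}
q2 = {"bool": {"must": [{"match": {"category.lowercase_keyword": {"query": "sql"}}}]}}
print(list(diff(q1, q2)))
# -> ['.bool.must: only in second', '.bool.must_not: only in first']
```

If the quoted and unquoted variants really produce identical queries, the diff should come back empty.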
Could someone provide the code, or the means, to parse wikicode into the same form as the "text" attribute within the Elasticsearch pages, or point me to where I can learn more? I've been working on NLP dataset generation from Wikipedia dumps, but I can't get satisfactory results with most of the parsers I've tested (mwparserfromhell, wikitextparser, mediawiki-parser). I would need the same text as in the cirrus dump, but keeping the internal links. Thank you for any information!
The text used in the CirrusSearch dumps comes from the allText value created by WikiTextStructure::extractWikitextParts.
For the most part the processing takes the html output from mediawiki's wikitext parser, strips out elements matching a set of css selectors identifying some of the non-content and auxiliary parts of a page, and then strips all the tags out of the remaining content.
Unfortunately I'm not aware of a way to get the bulk html content of the wiki, you may need to use the mediawiki parser, and that still may have difficulties depending on template and lua usage.
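The stripping described above can be sketched in a few lines: fetch the parsed HTML for a page (e.g. via api.php?action=parse), drop elements matching a few "non-content" selectors, and strip the remaining tags while keeping link text. The class list here is illustrative, not the exact set CirrusSearch uses:

```python
from html.parser import HTMLParser

# Classes whose subtrees should be dropped (illustrative only; the real
# selector set lives in WikiTextStructure::extractWikitextParts).
EXCLUDED_CLASSES = {"navbox", "infobox", "reference", "mw-editsection"}
VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside an excluded subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or classes & EXCLUDED_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        return " ".join(" ".join(self.parts).split())

html = ('<p>Kept <a href="/wiki/Link">link text</a>.</p>'
        '<table class="infobox"><tr><td>dropped</td></tr></table>')
p = ContentExtractor()
p.feed(html)
print(p.text())  # -> Kept link text .
```

To keep internal links as links rather than bare text, handle_starttag could emit a marker (e.g. the href) instead of discarding &lt;a&gt; tags.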
Hi Team,
We are using the following:
MediaWiki | 1.35.3 |
PHP | 7.4.23 (apache2handler) |
MySQL | 8.0.26 |
Lua | 5.1.5 |
Elasticsearch | 6.5.4 |
/usr/share/elasticsearch/lib/log4j-1.2-api-2.11.1.jar
log4j-api-2.11.1.jar
log4j-core-2.11.1.jar
x-pack-security/log4j-slf4j-impl-2.11.1.jar
Please provide us with any patch that upgrades log4j to 2.15.0 or higher.
We do not provide Elasticsearch; MediaWiki only uses it. Please contact elastic.co itself, or just restart Elasticsearch with the JVM option which disables the affected functionality. This is widely documented, but looks something like -Dlog4j2.formatMsgNoLookups=true
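For reference, a sketch of applying that option on a packaged install is to append it to the JVM options file (the path /etc/elasticsearch/jvm.options is the default for Debian/RPM packages and is an assumption here; verify against Elastic's own advisory for your setup):

```
# /etc/elasticsearch/jvm.options  (append at the end; path may differ per install)
-Dlog4j2.formatMsgNoLookups=true
```

Restart the Elasticsearch service afterwards for the flag to take effect. Note that this property is only honored by Log4j 2.10 and later; the 2.11.1 jars listed above qualify.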
Thanks @TheDJ for the help.
Please let me know: do I only need to set -Dlog4j2.formatMsgNoLookups=true in /etc/elasticsearch/jvm.options, or is there anything else I need to do as well?
We are using Elasticsearch 6.5.4 and Java 1.8.0.
Or please suggest a link where I can confirm all this.
Thanks
Also, should I remove the JndiLookup class?
Please advise.
Please contact elastic.co
Is there a sandbox to test in, so I don't affect performance?
No, I am not doing a performance load test :-) I just saw the performance issue affecting others. Have many people screwed this up? I assume you have some sort of timeout; is that a switch I can set lower? Anyway, I am trying to work out other ways of getting what I am after.
I’m not sure if I’m the right one to ask. If there is a performance issue it’s probably discussed at phabricator. Wish you the best of luck!
If you are referring to the WMF wikis: when the search system cuts off a query for performance reasons (typically because it took too long to execute), that's normal and expected behaviour that shouldn't negatively impact others (assuming you aren't a bot making many parallel queries). Overall you shouldn't need to worry about it, beyond pondering how to construct a query that doesn't time out.
If you are trying to construct a query and it keeps timing out, could you post some details of the information you are trying to retrieve? Someone might be able to point out a more efficient way to get the same information.
Hello guys,
Lately I've been working on deploying a local MediaWiki. Everything went smoothly until it came to indexing the contents of PDF files that contain characters outside US-ASCII. Doing '?action=cirrusDump' and looking at the 'file_text' field shows that all Cyrillic characters are getting dropped while Latin characters are preserved. The folks at ru.wikipedia.org somehow managed to do it, but I couldn't find a solution online. I would be very thankful if somebody could point out why that happens and how I could potentially solve this problem.
My configuration is:
MediaWiki - 1.36.1
PHP - 7.4.22 (apache2handler)
PostgreSQL - 13.3
ICU - 66.1
Elasticsearch - 6.5.4
PDF Handler - c9705a8
AdvancedSearch - c8a42b8
CirrusSearch - 6.5.4 (ab802b7)
Elastica - 6.1.3 (9f6e66a)
My Elasticsearch plugins:
analysis-icu
extra (the MediaWiki plugin)
ingest-attachment
Hi,
CirrusSearch does not manipulate the text it receives from Extension:PdfHandler. I would check if this extension is working properly, especially whether the tooling it depends on (set via $wgPdftoText, likely to be pdftotext) is properly extracting the text you expect.
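To verify the extraction step independently of MediaWiki, you can run the extractor by hand (e.g. something like pdftotext -enc UTF-8 file.pdf -) and check its output for Cyrillic. A minimal sketch of such a check (the function name is mine):

```python
import re

# Rough check: does extracted text still contain Cyrillic characters?
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def has_cyrillic(text: str) -> bool:
    """Return True if the text contains at least one Cyrillic character."""
    return bool(CYRILLIC.search(text))

# Compare the raw extractor output against what cirrusDump shows:
print(has_cyrillic("Привет, мир"))   # -> True  (Cyrillic preserved)
print(has_cyrillic("Hello, world"))  # -> False (Latin only)
```

If the raw pdftotext output already lacks Cyrillic, the problem is in the extraction tooling (or its encoding options), not in CirrusSearch.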
How can I exclude edits by users (say, by some given list of users) from the search view via Cirrus?
It'd be useful for patrolling recent changes when you don't want to see the two most active users right now, for example.
Unsatisfying answer: Search can analyze content only, but not metadata.
- The adjoined metadata of a page would be timestamp, user, summary etc.
- The content is the visible text of the rendered page, or source text of the page itself without transclusions.
For recent changes a search within all articles or pages within the wiki is the wrong tool.
- There might be some gadgets already existing which will filter the recent changes list by a list of trusted users, to focus on less well known people. At least a gadget programmer could easily remove those by screen scraping. This one could do that, among many other things, but it does not focus on such a list of trusted users.
- By MediaWiki software there are options on watchlist and recent changes (which are based on the identical software) to suppress bots, registered users, or minor edits (unsafe).
Hi, I am using the following. I installed the Elasticsearch server, but it is not showing in the version table below.
Product | Version |
---|---|
MediaWiki | 1.35.3 |
PHP | 7.4.24 (apache2handler) |
MySQL | 8.0.26 |
ICU | 65.1 |
Lua | 5.1.5 |
I also installed the required extensions and the elasticsearch/elasticsearch 6.5 client.
Elastica | 6.1.3 (f3c9459) 01:29, 3 September 2021 |
CirrusSearch | 6.5.4 (95b958b) 19:07, 20 August 2021 |
wfLoadExtension( 'Elastica' );
wfLoadExtension( 'CirrusSearch' );
$wgDisableSearchUpdate = true;
$wgCirrusSearchIndexBaseName = ''; // database name
php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php
Now remove $wgDisableSearchUpdate = true;
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse
$wgSearchType = 'CirrusSearch';
# php /data/www/html/wiki/maintenance/runJobs.php
1) When trying to search for anything in the search engine:
An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later.
2) http://wiki/api.php?action=cirrus-settings-dump returns:
"code": "internal_api_error_Elastica\\Exception\\Connection\\HttpException",
"info": "[YUsbMuAOtsHM4e8mbGG6NwAAAAs] Exception caught: Couldn't connect to host, Elasticsearch down?",
"errorclass": "Elastica\\Exception\\Connection\\HttpException",
"*": "Elastica\\Exception\\Connection\\HttpException at /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php(190)\n#0 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Request.php(194): Elastica\\Transport\\Http->exec()\n#1 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Client.php(689): Elastica\\Request->send()\n#2 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index.php(571): Elastica\\Client->request()\n#3 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index/Settings.php(383): Elastica\\Index->request()\n#4 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index/Settings.php(75): Elastica\\Index\\Settings->request()\n#5 /data/www/html/wiki/extensions/CirrusSearch/includes/Api/SettingsDump.php(36): Elastica\\Index\\Settings->get()\n#6 /data/www/html/wiki/includes/api/ApiMain.php(1593): CirrusSearch\\Api\\SettingsDump->execute()\n#7 /data/www/html/wiki/includes/api/ApiMain.php(529): ApiMain->executeAction()\n#8 /data/www/html/wiki/includes/api/ApiMain.php(500): ApiMain->executeActionWithErrorHandling()\n#9 /data/www/html/wiki/api.php(90): ApiMain->execute()\n#10 /data/www/html/wiki/api.php(45): wfApiMain()\n#11 {main}"
}
Please suggest which step I am missing.
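The stack trace ends in "Couldn't connect to host, Elasticsearch down?", which means MediaWiki cannot open a connection to Elasticsearch at all. Before revisiting the maintenance steps, it is worth confirming that something is listening on the configured host and port (a sketch assuming the default localhost:9200; check $wgCirrusSearchServers if you changed it):

```python
import socket

def es_reachable(host: str = "localhost", port: int = 9200,
                 timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on the
    Elasticsearch host/port (this does not prove it is Elasticsearch)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("Elasticsearch port reachable:", es_reachable())
```

If this reports False, start (or fix networking for) the Elasticsearch service first, then rerun the maintenance scripts; the cirrus-settings-dump API call should stop throwing HttpException once the connection succeeds.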