Topic on Talk:Quarry

How can i separate the labels without bracket, comma, etc.

6
Info-farmer (talkcontribs)

This Quarry contains the labels with brackets, comma, etc., How can i separate the labels with brackets, comma and the labels without any special characters? Why i am asking is i want to follow the wikidata rules here.

Mess (talkcontribs)

Maybe I've understood what you need. You want to separate the ""Q",ips_item_id" and "ips_site_page" in two different columns, right? Well, so write this SELECT command:

SELECT CONCAT("Q",ips_item_id), ips_site_page

and you'll split them without using tildes or other special characters.

If you want also to get rid of commas and brackets into the labels upon the columns, then write like this:

SELECT CONCAT("Q",ips_item_id) AS ips_item_id, ips_site_page

(as I've done over there: https://quarry.wmflabs.org/query/19715)

Info-farmer (talkcontribs)

Excuse me. sorry for my English language.

Let me furnish my need clearly.

I want ips_site_pages without comma and also without brackets.

Example,

the following pages should not come into my result

Q28235 ~ தூரமேற்கு வளர்ச்சி பிராந்தியம், நேபாளம்

Q41559 ~வெள்ளையன்கள் (பட்டாம்பூச்சிக்குடும்பம்)

If possible, i want the result separately as follows.

ips_site_pages without comma and also without brackets, etc

ips_site_pages with comma .

ips_site_pages with brackets.

Milimetric (WMF) (talkcontribs)

I see. So you want to detect which ips_site_pages values have:

1. no commas or parentheses (you call them brackets, but those are usually [ or {. These: ( are parentheses). 2. only commas but no parentheses 3. only parentheses but no commas

You can use the "if" function in SQL and order your results by that, that's one way. So taking your original query and adding that logic would look like this:

use wikidatawiki_p;

 select CONCAT("Q",ips_item_id,'~',ips_site_page),
        if(ips_site_page not like '%,%' and ips_site_page not like '%(%' and ips_site_page not like '%)%',
            1, if(ips_site_page like '%,%' and ips_site_page not like '%(%' and ips_site_page not like '%)%',
                  2, if(ips_site_page not like '%,%' and (ips_site_page like '%(%' or ips_site_page like '%)%'),
                      3, 4
                  )
            )
        ) as ips_item_id_type

   from wb_items_per_site i

  where i.ips_site_id='tawiki'
    and not exists
        (select term_entity_id
           from wb_terms
          where i.ips_item_id = term_entity_id
            and term_type = "label"
            and term_language='ta')

  group by ips_item_id
  order by ips_item_id_type
  limit 3000;

You can play with it here: https://quarry.wmflabs.org/query/19793. The results are sorted by the type we identified in the if statement, so you have to look at page 30 or so to find results with type 3 for example.

There are other ways to write the string comparisons, but the "like" and "not like" operators are really straightforward so I went with that.

Info-farmer (talkcontribs)

Thanks indeed. How can i get all the labels of ta.wikipedia or te.wikipedia? Is it possible to get all labels in batches? For example, 1 to 10,000 then, 10,000 to 20,000 and so on.

Milimetric (WMF) (talkcontribs)

Well, your where clause has "where i.ips_site_id='tawiki'" so that's limiting results to ta.wikipedia. Just change it to tewiki for te.wikipedia, or do "where i.ips_site_id in ('tawiki', 'tewiki')" to get both, but then be careful with your term_language = 'ta' clause, you might want to do it one at a time to keep it simple.

It would be easier for the system to just return all the results at once rather than get batches of 1-10000, 10000-20000, and so on. Is there a problem with just using the pagination in the results? You can figure out what page to go to based on the results count per page.

Reply to "How can i separate the labels without bracket, comma, etc."