Wikimedia Developer Summit/2017/Future of Wikidata Query Service

From MediaWiki.org

The Wikidata Query Service session is intended as a feedback and brainstorming session for how the Wikidata Query Service can be improved.

  • Dan Garry gave a brief introduction
  • Stas Malyshev gave a brief demo of Wikidata Query Service
  • Feedback commenced! Dan Garry taking notes.

Questions and possible ideas for improving the service:

  • What kind of graphs are supported when it says "graphs"? Just line or bar charts?
    • Lots of different graphs! Network graph, i.e. vertices and edges. Also normal charts, in "Graph builder".
  • Protein interactions? Do I need to do anything special to get them to show up?
    • No, if there are properties defining the interaction, then you can graph them easily.
  • Federated queries - taking data from Wikidata to supplement it in other Wikibase installations. I have my own Wikibase, but I want data from Wikidata to fill in the gaps.
    • If your Wikibase has a SPARQL endpoint (i.e. its own query service), it should be able to read from Wikidata already. We are thinking about how to integrate other SPARQL endpoints into Wikidata's. Please watch the project spaces for more information!
  • What is the property for mapping my own entities with Wikidata entities when there is overlap?
    • Don't know yet. We will have to look at that.
  • We have this interface; graphical interface for SPARQL with nice visualisation. The great strength is to expose a simplified API for third-party use. A lot of web services are based on knowledge graphs, and this one is the biggest free one in the world. It would be important to allow everyone to query it. Is it viable for it to be a general purpose instrument outside our project?
    • What does "simplify" mean?
      • Not using SPARQL for example, making it simpler.
      • We do not have a definition for simplification yet.
      • A goal is to define simplified queries for a specific subset of operations. "Give me X that are in Y" and not much more complicated.
      • A natural language interface?
      • There is one, but it's very simple: https://askplatyp.us/
      • Try to distinguish between two things:
        • Make it easier for programmers to use
          • We don't know what "easier" means, i.e. what subset of queries are likely to be most useful for us to implement
            • We can analyse existing queries to try to figure that out
        • Make it easier for humans to write queries
          • Natural language is the easiest bit for that
    • Google has its own API for knowledge graph. It's already simpler than SPARQL. Is something like this doable?
    • Defining our own query language, and providing a mapping of that to SPARQL?
      • The original WDQ language by Magnus is a prototype for that idea.
      • You could hide complexity by having a domain-specific language for our data model.
      • It would make things simpler, but simple enough?
      • This would not help the average user, but it would help developers
    • Simplifying the API. If Wikidata could define the domain of each item, e.g. countries can have capitals but dogs can't, this could potentially enable easier and faster queries.
      • extended discussion that note taker does not really understand :-)
  • Do we need a simplified interface, or something more visual that helps you build your own queries?
  • Do we have a simple tutorial? The examples are quite helpful, but it's hard to learn how to write your own queries from them. Some tutorial for that?
    • There are useful tutorials on Wikidata. You can find them in the help menu. They explain how to use the service.
  • If you have a bunch of pages, can you filter things out? e.g. if I have a list of pages, can I find all the humans in it?
    • Some kind of batch processing tools? Yes, that would be helpful.
    • Theoretically this is kind of possible now, but only within the length limits of GET URLs. There is also a limit on query complexity, so you might hit timeouts.
  • I've tripped up on the timeout a lot. A useful idea would be to have a job queue where you can submit much longer-running queries that get lower priority.
    • There is potential, but there is also potential for abuse, e.g. a query that takes half a year. People also might not try to optimise their own queries.
  • We've been looking at using Wikidata to augment maps in the apps, but response time is the major factor for us, e.g. sub-second.
    • For some subset of queries, or pre-defined queries, we may be able to guarantee a particular response time.
    • If it's an arbitrary query, we cannot guarantee anything.
    • Caching would be helpful.
    • It goes back to the case for a simpler API; we have a very, very specific query in mind that we want to run.
  • Is there data not in WDQS or Wikidata that would be useful to query?
    • Pagerank
    • Commons Datasets (commons.wikimedia.org/wiki/Data:*)
    • Metadata
      • e.g. time page was created, users who edited it, etc.
      • there may be some privacy concerns but they are not necessarily blockers
      • there is already some metadata present, but not many things
      • there are also some performance issues, may have to query metadata separately from existing data
      • can you give an example?
        • e.g. "give me all items not changed since date X?"
  • Having CirrusSearch in the service would be awesome! e.g. do a Cirrus-type search, then query on that data
  • Are there statistics on how often things, such as certain items, are used?
    • We do have logs and some external collaborators who are working on such analyses
    • There are interesting statistics, but we still need to work out how to process that data, and that takes time
  • How about using queries and the query service on-wiki?
    • The canonical example was infoboxes, generating those from Wikidata.
    • Graphs and visualisation library worked on by Yuri
      • it's difficult because you need to know SPARQL and Vega language
      • it's very powerful and there are nice examples that are totally data-driven
  • Timeouts
    • This is a pain point
    • Some queries can be optimised to avoid timeouts, but not all
    • Do we have query optimisers?
      • Blazegraph does have a query optimiser, and we don't think we can do better
      • Sometimes we submit bugs to the optimiser and they work on them
    • Do we expose profiling data to help people optimise their queries?
      • It's very in-depth and is not likely to be useful to your average user
      • Can we build something else? We don't know. It's possible.
    • Optimising queries to get them under the limit is one thing. But what about increasing the limit?
      • We can try
      • Extra hardware that is coming may help
      • More hardware is always good! But it may not help in the case of long-running single queries
    • Label service is a little slow and can cause the service to time out
      • There may be optimisations that can be made to help improve that
    • Is there a document on making queries more performant?
      • There is something but it's mostly not official. Maybe we should make it official.
      • There is a help portal, maybe we can improve that portal
  • What is the relationship between ElasticSearch/CirrusSearch and WDQS?
    • There is no specific strategy
    • Generally we should just look at what use case is served better by each one
    • Cirrus works on single documents. You cannot inspect a graph using CirrusSearch, it treats the whole wiki as a single document.
  • Embed graph view or visualisation into articles
    • Not possible right now
    • Simpler way: Thinking about making some visualisations exportable as image. Glorified screenshot.
    • Complex way: export visualisation as dynamic visualisation. Harder, but it's much cooler! We embed imagemaps.
    • Should be possible using Lua?
  • Early thoughts on integrating with structured data that will work with Commons?
    • There are some thoughts on this, but it is early
    • May not be suitable for SPARQL, but we can possibly do it with Elasticsearch, which is good at filtering documents by criteria
  • What about integrating WDQS search into full text search?
    • https://askplatyp.us/ is a prototype
    • It would be hard to get SPARQL to the level where it can process a lot of queries
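
Two of the ideas above, the "Give me X that are in Y" style of simplified query and filtering a list of items down to the humans in it, can be sketched as a thin helper that generates SPARQL. This is only an illustration of the idea, not anything the session agreed on. The function names (x_in_y, humans_among, get_url) are hypothetical, and the sketch assumes the public endpoint at https://query.wikidata.org/sparql with its predefined wd: and wdt: prefixes, plus the Wikidata identifiers P31 (instance of) and Q5 (human).

```python
import re
from urllib.parse import urlencode

# Assumption: the public WDQS endpoint, which predefines the wd: and wdt: prefixes.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Wikidata identifiers look like Q42 (items) or P31 (properties).
_ID = re.compile(r"^[PQ][1-9][0-9]*$")


def _check(entity_id, kind):
    """Return entity_id if it is a well-formed ID of the given kind ('Q' or 'P')."""
    if not _ID.match(entity_id) or not entity_id.startswith(kind):
        raise ValueError(f"expected a {kind}-identifier, got {entity_id!r}")
    return entity_id


def x_in_y(instance_of, prop, value, limit=100):
    """Hypothetical 'give me X that are in Y' helper.

    Builds a query for items that are an instance of `instance_of`
    and whose property `prop` has the value `value`.
    """
    return (
        "SELECT ?item WHERE { "
        f"?item wdt:P31 wd:{_check(instance_of, 'Q')} ; "
        f"wdt:{_check(prop, 'P')} wd:{_check(value, 'Q')} . "
        "} "
        f"LIMIT {int(limit)}"
    )


def humans_among(item_ids):
    """Hypothetical batch filter: keep only the given items that are humans (Q5)."""
    values = " ".join(f"wd:{_check(i, 'Q')}" for i in item_ids)
    return (
        f"SELECT ?item WHERE {{ VALUES ?item {{ {values} }} "
        "?item wdt:P31 wd:Q5 . }"
    )


def get_url(query):
    """Build a GET URL for the query; long item lists will hit URL length limits."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    # Cities (Q515) whose country (P17) is Germany (Q183).
    print(x_in_y("Q515", "P17", "Q183", limit=5))
    print(get_url(humans_among(["Q42", "Q64"])))
```

A helper like this mainly serves developers rather than the average user, as noted in the discussion. It also only matches direct "instance of" statements, so items typed through a subclass would be missed, and it does nothing about timeouts or the GET URL length limit mentioned above.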