User:Ebernhardson/mysql deprecated

Abstract
MediaWiki's database abstraction is starting to show its age. It depends on the ext/mysql module which was deprecated with the release of PHP 5.5, released June 20, 2013. The php release process dictates ext/mysql will be removed in the following yearly release. This release will receive bug fixes for 2 years and security releases for one more year, giving us an EOL of mid 2017. Four years means we do not have to rush, but we should look to the future and consider our options.

Deprecation of ext/mysql also does not force us to replace our current abstraction. We can develop a new implementation of the abstraction, perhaps MysqlndDatabase.php. So why might we want to replace the abstraction? The Database.php abstraction was built and maintained with the goal of achieving the best performance with ext/mysql, occasionally to the detriment of other DBMS. The next generation of mysql drivers offer a different feature set which influence the ways in which you can most effectively use your database abstractions. Moving away from ext/mysql is not only about security but also about having access to all features of the MySQL database.

Primary differences between ext/mysql and mysqlnd
The obvious one is bound parameters. This is a method where you replace all variables in a string query with either ? or :varname. The query with placeholders and the variables to be used in those placeholders are passed to the database client which prepares the query, either client side or server side, in a manner safe from sql injection. Unlike our current abstraction there is no impedance mismatch requiring manual quoting for certain cases. MediaWiki core contains over 400 instances of manually quoted sql variables outside of includes/db.

Asynchronous, non-blocking queries. All instances of sharded database models, including the current ExternalStore implementation, can benefit from non-blocking queries.

150+ performance statistics. And who doesn't love having more statistics? Adding in a whole host of new mysql statistics to our current monitoring can lead to more efficient use of our resources by shining a new light on problems we previously did not have visibility.

The PHP Community
Within the PHP community at large development processes have changed. Projects like the Framework Interoperability Group[1], tools like composer[2][3] and websites like packagist.org are encouraging modern php developers to share code between projects and develop common solutions to common problems.

MediaWiki, and the php community at large, can directly benefit from us using and contributing back to the php ecosphere rather than living on our own little island. If we decide to disregard the existing projects within the php ecosphere we should strive to develop an independent abstraction layer that will be used by many php projects and not just mediawiki. Locking all that knowledge up into a single project, no matter how large, is a waste of technical talent.

[1] http://www.php-fig.org

[2] http://www.getcomposer.org

[3] http://www.gossamer-threads.com/lists/wiki/wikitech/332270

From the deprecation RFC
Moving away from ext/mysql is not only about security but also about having access to all features of the MySQL database.

ext/mysql was built for MySQL 3.23 and only got very few additions since then while mostly keeping compatibility with this old version which makes the code a bit harder to maintain. From top of my head missing features not support be ext/mysql include:
 * Stored Procedures (can't handle multiple result sets)
 * Prepared Statements
 * Encryption (SSL)
 * Compression
 * Full Charset support

So moving away from ext/mysql is a good thing.

What does the current abstraction do for us
To know if any existing solutions can meet our use case we must first look at our current abstraction and see what it does for us, and what it has failed to do for us.

Random quotes about the database layer from wikitech-l

 * You should think of Database.php as a query-building module rather than a DB abstraction layer. -- Tim Starling


 * As far as I'm concerned, DB abstraction is a courtesy to low-traffic non-Wikimedia users. The queries we use, and the schema, should be optimised for MySQL. -- Tim Starling


 * While more abstraction never hurts for testing, the abstraction that we currently have is sufficient. -- Christian Aistleitner


 * Well, having pgsql and all the other BOOL-supporting DBMSes forced to use integer types just because devs are too lazy to make sure things work properly on all databases is exactly the sort of thing the abstraction layer is meant to prevent. Each database should be able to properly use its own features. -- Aryeh Gregor


 * In my opinion that would be a bit horrifyingly scary. ;) -- Brion Vibber ( in response to the question: Would it make sense for the database abstraction layer to know our schema, so that it can automatically cast things to bool before handing them down to the actual database )


 * Incremental improvements to our current way of doing things (cutting back on raw queries, moving MySQL-specific stuff from Database to DatabaseMySql, defining more clearly what Database methods mean and avoiding undefined behavior) seem entirely sufficient to allow support for any number of additional database backends. -- Aryeh Gregor


 * Maintenance of non-MySQL DBMS code has been a problem in the past -- Tim Starling

Random other quotes from wikitech-l

 * could avoid a lot of work for the entire development team writing, testing and debugging functionality readily available in an Open Source library. We all benefit from a better product that comes from sharing our ideas. -- Evan Prodromou

Benefits

 * Small. A review of the mailing list archives indicates the lightweight nature of the current abstraction is a favorable aspect to some developers.


 * Treats all databases other than MariaDB as second hand citizens in the name of performance.


 * Safer quoting, avoidance of SQL injection attacks. The current abstraction covers perhaps 90% of use cases where a user provided variable needs to be used in a query.


 * Table prefix support


 * Filtering or emulation of unsupported options to queries such as DELAYED


 * Emulation of MySQL-specific queries such as REPLACE


 * Table name quoting


 * Simple access to `foo_col` IN (x1,x2...xN)


 * Real prepared statements involve two round trips to the database, We have short living connections, and no statements are repeated, so the actual prepared query will not be reused.


 * DBO_TRX - Some sort of automated transaction handling


 * Cross-wiki connections - Some sort of something ;-)
 * Integrated profiling (sort of)


 * Load Balancing implementation which handles many possible edge cases


 * ORMTable impelmentation which is NOT an ORM


 * Supports multiple DBMS: MySQL, SQLite, Postgres and Oracle

Failings

 * History shows us that the requirement to manually escape queries (even if only some queries) can lead to mistakes causing security issues. The php community at large has mostly settled on the solution of bound parameters.  It is quite easy to see when reviewing a query using bound parameters if it will properly protect from malicious user input.  MediaWiki core contains 372 references to DatabaseBase::addQuotes, and extensions adds 425 more for just shy of 800 total.


 * https://www.mediawiki.org/wiki/Bug_management/Triage/Databases_20111102


 * Uses mysql_*, which is not a failure except that its going away


 * Treats all databases other than MariaDB as second hand citizens in the name of performance.

Size
phploc 1.7.4 by Sebastian Bergmann.

Lines of Code (LOC):                             15367 Cyclomatic Complexity / Lines of Code:          0.14 Comment Lines of Code (CLOC):                     6638 Non-Comment Lines of Code (NCLOC):                8729

Namespaces:                                          0 Interfaces:                                          7 Traits:                                              0 Classes:                                            41 Abstract:                                          2 (4.88%) Concrete:                                         39 (95.12%) Average Class Length (NCLOC):                    238 Methods:                                           886 Scope: Non-Static:                                    869 (98.08%) Static:                                         17 (1.92%) Visibility: Public:                                        794 (89.62%) Non-Public:                                     92 (10.38%) Average Method Length (NCLOC):                    11 Cyclomatic Complexity / Number of Methods:      2.41 Anonymous Functions:                                 0 Functions:                                           0

Constants:                                           3 Global constants:                                  0 Class constants:                                   3

Options existing within the PHP Community
Over the last three years the php community has moved to a development methodology that involves sharing independant pieces of code between multiple projects rather than going full force in NIH mode. Here we should at least review what options the rest of the php community is using: