Talk:Analytics/Reportcard/2.0/Database design

MySQL...or something new?
Is this for

change column type date to datetime
If you don't need to store dates before 1970 (or after 2038), I recommend using TIMESTAMP for all date/time columns. They are stored as four-byte integers but still work with MySQL's date functions

DIEDERIK: OK, will change that.
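A minimal sketch of the change being agreed on here, assuming a hypothetical table name (`sample_log` is illustrative, not the actual schema):

```sql
-- A TIMESTAMP column is stored as a 4-byte epoch value (range 1970-2038)
-- yet still works with the usual MySQL date functions.
CREATE TABLE sample_log (
  dt TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO sample_log () VALUES ();  -- dt defaults to the current time

-- Date functions operate on it directly:
SELECT DATE_FORMAT(dt, '%Y-%m-%d'), UNIX_TIMESTAMP(dt) FROM sample_log;
```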

set engine to Innodb instead of MyISAM.
Otto: if we are planning on doing lots of updates, then definitely. If not, then MyISAM is probably fine. Also, if we use MyISAM then we could also use a MERGE table to aggregate a bunch of these smaller tables. As long as we are pretty sure the structure of this table will rarely change, keeping lots of smaller tables and then making them look like one big table is very useful. A rotator script could be responsible for creating a new data table (once a week? once a month?), and then altering the MERGE table UNION definition to include the new table. New rows will go into the last table in the UNION list, and SELECTs on the MERGE table will select across all of the underlying tables and aggregate the results.

See: http://dev.mysql.com/doc/refman/5.0/en/merge-storage-engine.html
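A rough sketch of the MERGE + rotator idea described above, with illustrative table and column names (the real schema would differ):

```sql
-- Hypothetical weekly data tables; MERGE requires MyISAM underneath.
CREATE TABLE data_2011_w01 (
  dt TIMESTAMP NOT NULL,
  `value` BIGINT UNSIGNED DEFAULT 0,
  KEY (dt)
) ENGINE=MyISAM;

CREATE TABLE data_2011_w02 LIKE data_2011_w01;

-- The MERGE table presents them as one logical table.
-- INSERT_METHOD=LAST routes new rows to the last table in the UNION list.
CREATE TABLE data_all (
  dt TIMESTAMP NOT NULL,
  `value` BIGINT UNSIGNED DEFAULT 0,
  KEY (dt)
) ENGINE=MERGE UNION=(data_2011_w01, data_2011_w02) INSERT_METHOD=LAST;

-- The rotator script, after creating a new weekly table, would run:
-- ALTER TABLE data_all UNION=(data_2011_w01, data_2011_w02, data_2011_w03);
```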

DIEDERIK: My understanding is that InnoDB has become the default and that performance-wise they are very close.

OTTO: InnoDB has a lot of advantages if you are doing a lot of updates/deletes on the table. It doesn't do full table locks like MyISAM usually does. But! As long as you never modify the existing data (like with log tables), then MyISAM doesn't do full table locks:

From: http://dev.mysql.com/doc/refman/5.0/en/internal-locking.html
 * The MyISAM storage engine supports concurrent inserts to reduce contention between readers and writers for a given table: If a MyISAM table has no free blocks in the middle of the data file, rows are always inserted at the end of the data file. In this case, you can freely mix concurrent INSERT and SELECT statements for a MyISAM table without locks. That is, you can insert rows into a MyISAM table at the same time other clients are reading from it. Holes can result from rows having been deleted from or updated in the middle of the table. If there are holes, concurrent inserts are disabled but are enabled again automatically when all holes have been filled with new data.

Also:
 * Generally, table locks are superior to page-level or row-level locks in the following cases:
 * Most statements for the table are reads
 * Statements for the table are a mix of reads and writes, where writes are updates or deletes for a single row that can be fetched with one key read
 * Many scans or GROUP BY operations on the entire table without any writers

OTTO: Ah! I was under the assumption that this was going to be a HUGE log table with records for each data point. Pssh, this is aggregate! I take back what I just wrote! InnoDB is fine!

add view column which is of type ENUM
I like ENUMs when either the table size is going to stay small, or we can be 100% certain that we will never need to alter the ENUM column to add or remove values. Could we use a separate join table instead? So, something like:

So instead of this in the data table: `view` ENUM('media', 'mobile', 'editors', 'offline', 'targets', 'wikistats', 'pageviews') NOT NULL,

Have this: `data_view_id` int(9) unsigned NOT NULL,

And then CREATE TABLE data_view (`data_view_id` int(9) unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY, `name` varchar(256)); INSERT INTO data_view (name) VALUES ('media'), ('mobile'), ('editors'), ('offline'), ('targets'), ('wikistats'), ('pageviews');

DIEDERIK: Yes, we could do that. However, I am not sure if we can put an index on a foreign key, and I want to avoid JOINs. From a performance point of view, ENUM combined with InnoDB is the fastest: http://www.mysqlperformanceblog.com/2008/01/24/enum-fields-vs-varchar-vs-int-joined-table-what-is-faster/. The only real downside, IMHO, is adding new values to an ENUM, as that requires an ALTER TABLE statement. However, I think we know quite well which themes we want to display in the dashboard, so I am not too concerned about having to change this often.

OTTO: Yeah, but I think in this case we don't really need a join. Actually, we don't even really need the other table at all, as long as the data_view_id is well defined somewhere (in code?). An ENUM is just an integer with string aliases for the integer values. The disadvantage is you can only have as many values as you have names. If you just use an integer, then you can have maxint values. The index goes on the data_view_id in the data table, which is just an integer.

OTTO: Ack! Same! This is an aggregate data table, it won't grow so huge. ENUM is fine for this column.
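For reference, the one downside mentioned above looks like this in practice. A sketch, assuming the data table keeps the `view` ENUM column (the table name and the `new_theme` value are illustrative):

```sql
-- Adding a value to an ENUM requires rewriting the column definition.
-- On MySQL 5.x this is an ALTER TABLE, which may rebuild the table,
-- so it is worth batching with any other schema changes.
ALTER TABLE data
  MODIFY `view` ENUM('media', 'mobile', 'editors', 'offline', 'targets',
                     'wikistats', 'pageviews', 'new_theme') NOT NULL;
```

Appending a new value at the end of the list is the safe pattern; reordering or removing existing values can silently remap stored data.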

`value` bigint(9) unsigned DEFAULT 0,
Are you always storing integers? Maybe a BLOB type here would be more appropriate for free-form data?

DIEDERIK: To keep it simple, I would say let's assume we will only store integers. That allows us to do sum, max, min and other statistical queries.

OTTO: MMmk!
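The integers-only decision above is what makes aggregate queries straightforward. A hypothetical example (table and column names are illustrative):

```sql
-- Statistical queries that a numeric `value` column permits:
SELECT `view`,
       SUM(`value`) AS total,
       MAX(`value`) AS peak,
       MIN(`value`) AS floor
FROM data
GROUP BY `view`;
```

With a BLOB column none of these aggregates would be available without parsing the payload in application code.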

rename table name to data
I am a huge fan of long unambiguous names. 'data' is probably appropriate here, because this is, um, 'data', but then again, what isn't data? How about something like 'log_data', or 'log' or 'analytics_data' or something?

DIEDERIK: Yes, you are right, I'll come up with something more descriptive.

PRIMARY KEY (`dt`,`project_code`,`language_code`,`view`, `key`)
I believe that all tables should have a single (probably auto increment) unsigned integer primary key. You can make this combo a UNIQUE KEY instead.

DIEDERIK: Maybe we should store the hashed value of `dt`, `project_code`, `language_code`, `view`, `key` and index that, and have an ordinary primary key.

OTTO: Oo, I like, sortaaa. Is there a reason not to have a simple integer primary key? Even if you don't need it in the data, it is useful to have one for maintenance purposes. If things go bad with certain bits of data, you can work on restoring with ranges on the primary key. It is way harder to add back in later, and it doesn't hurt to have.

What's the point of this key? Just to ensure that these values are unique? If so, then the hashed unique field could be a good idea anyway, even if it is not the primary key. MySQL will probably have less trouble maintaining this single column index than the composite one.
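A sketch of the compromise being discussed: an auto-increment primary key for maintenance, plus a single-column UNIQUE index on a hash of the old composite key for uniqueness. All names here are illustrative, not the final schema:

```sql
CREATE TABLE analytics_data (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  dt TIMESTAMP NOT NULL,
  project_code VARCHAR(32) NOT NULL,
  language_code VARCHAR(16) NOT NULL,
  `view` ENUM('media', 'mobile', 'editors', 'offline',
              'targets', 'wikistats', 'pageviews') NOT NULL,
  `key` VARCHAR(128) NOT NULL,
  `value` BIGINT UNSIGNED DEFAULT 0,
  -- e.g. MD5(CONCAT_WS('|', dt, project_code, language_code, `view`, `key`)),
  -- computed at insert time by the loading script
  row_hash CHAR(32) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_row_hash (row_hash)
) ENGINE=InnoDB;
```

The single-column hash index enforces the same uniqueness as the five-column composite key, but is cheaper for MySQL to maintain, while `id` remains available for range-based maintenance work.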