Manual talk:MIME type detection

Fix for MS Office File Confusion
I have had continuing problems with my system (RHEL 5) not recognizing MS Office files correctly. More precisely, Excel and PowerPoint files get recognized as MIME type application/msword. This seems to be a common problem: apparently the only reliable way to tell these formats apart is to look at a certain offset from the end of the file, and the magic file format used on Unix-based systems only knows how to specify an offset from the beginning of the file.

I'm sure there's a more elegant workaround to this, but I was able to solve it by adding the following code to the file includes/MimeMagic.php. It should be inserted immediately before the line

if (strpos($mime,"text/")===0 || $mime==="application/xml") {

This is on or about line 372:

# UGLY HACK TO FIX BAD MIME IDENTIFICATION FOR MSOFFICE FILES 11/27/07, DWIGGINS
# note that this requires the addition of ppt and xls as possible extensions for
# application/msword files in the includes/mime.types file.
$ext = strtolower(strrchr($file, '.'));
if ($mime == 'application/msword') {
    wfDebug("$fname with extension $ext has a msword MIME type.\n");
    if (strtolower($ext) == '.ppt') {
        $mime = 'application/vnd.ms-powerpoint';
        wfDebug("$fname reset to Powerpoint MIME type.\n");
    } elseif (strtolower($ext) == '.xls') {
        $mime = 'application/vnd.ms-excel';
        wfDebug("$fname reset to Excel MIME type.\n");
    }
}
# END UGLY MIME HACK

As noted above, you will need to edit the includes/mime.types file and add a line that looks like so:

application/msword doc ppt xls

If someone has a more elegant way of solving this problem, I'd love to hear about it. In the meantime, I hope this helps someone! --Dmdwiggi 02:55, 28 November 2007 (UTC)
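
For anyone who wants to try the idea outside MimeMagic.php first, here is a minimal standalone sketch of the same extension-based fallback. The function name `refineOfficeMime` is hypothetical and not part of MediaWiki, and the code assumes PHP 7 or later (type declarations and the `??` operator), so it is not a drop-in replacement for the patch above.

```php
<?php
// Hypothetical helper, NOT part of MediaWiki: when detection only yields the
// generic application/msword type, refine it using the file extension.
function refineOfficeMime(string $mime, string $fileName): string {
    // Leave every other detected type untouched.
    if ($mime !== 'application/msword') {
        return $mime;
    }

    // Extensions that should override the generic msword result.
    $byExtension = [
        'ppt' => 'application/vnd.ms-powerpoint',
        'xls' => 'application/vnd.ms-excel',
    ];

    // pathinfo() returns the extension without the leading dot.
    $ext = strtolower(pathinfo($fileName, PATHINFO_EXTENSION));

    return $byExtension[$ext] ?? $mime;
}

echo refineOfficeMime('application/msword', 'Budget.XLS'), "\n";
```

Like the original hack, this trusts the extension, so a .doc file renamed to .xls would still be reported as Excel.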


 * The 'problem' is that the file type is determined by the Unix `file` command and its magic database, which looks at the first few bytes of the file for a unique pattern. But from what I have seen, this doesn't work so well here. These are all "msword" documents, just different subtypes. On my system (probably yours too) the file /usr/share/magic.mime contains the patterns, including

# msword: file(1) magic for MS Word files
#
# Contributor claims:
# Reversed-engineered MS Word magic numbers
#
0      string          \376\067\0\043                  application/msword
0      string          \320\317\021\340\241\261        application/msword
0      string          \333\245-\0\0\0                 application/msword


 * Now you might think that the first is .doc, the second .ppt, the third .xls, but that's not the case. I looked at .ppt and .xls files I had lying around and they both match the second line, and they match much farther than these first 6 bytes.


 * If there is a distinguishable pattern later in the file header that indicates .ppt or .xls, it could be used to determine a more specific MIME type, rather than just trusting the file extension. The magic file format is more flexible than just specifying a fixed offset from the beginning of the file, but more complicated cases require more complicated rules. Still, a quick comparison between the files shows that the first many bytes of both kinds of files are the same. So based just on that content, they are indeed all "msword" files.  --Eric Myers 14:53, 28 November 2007 (UTC)
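
To make the comparison above easy to repeat, here is a small sketch that checks whether some data begins with the six bytes of the second magic.mime entry. The constant and function names are hypothetical, and the check only shows that a file matches that shared signature; it does not tell .doc, .xls and .ppt apart.

```php
<?php
// The six-byte pattern from the second magic.mime entry, written with the
// same octal escapes used in the magic file.
const OLE_MAGIC_PREFIX = "\320\317\021\340\241\261";

// Hypothetical helper: true if $data starts with that six-byte pattern.
function startsWithOleMagic(string $data): bool {
    return strncmp($data, OLE_MAGIC_PREFIX, strlen(OLE_MAGIC_PREFIX)) === 0;
}

// Show the pattern in hex for comparison with a hex dump of a real file.
echo bin2hex(OLE_MAGIC_PREFIX), "\n"; // d0cf11e0a1b1
```

To test a real upload, the leading bytes could be read with something like `file_get_contents($path, false, null, 0, 8)` and passed to `startsWithOleMagic()`; legacy .doc, .xls and .ppt files would all be expected to return true, which is why a rule on these bytes alone cannot separate them.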