Manual talk:MIME type detection

Fix for MS Office File Confusion
I have had continuing problems with my system (RHEL 5) not recognizing MS Office files correctly. More precisely, Excel and PowerPoint files get recognized as MIME type application/msword. This seems to be a common problem: as far as I can tell, the only reliable way to tell these formats apart is to look at a certain offset from the end of the file, and the magic file format used on Unix-based systems only knows how to specify an offset from the beginning of the file.

I'm sure there's a more elegant workaround to this, but I was able to solve it by adding the following code to the file includes/MimeMagic.php. It should be inserted immediately before the line

if (strpos($mime,"text/")===0 || $mime==="application/xml") {

This is on or about line 390:

 # UGLY HACK TO FIX BAD MIME IDENTIFICATION FOR MSOFFICE FILES 11/27/07, DWIGGINS
 # Update 3/5/2008 - AEGA
 # note that this requires the addition of ppt and xls as possible extensions for
 # application/msword files in the includes/mime.types file.
 if ( $ext === true ) {
     $i = strrpos( $file, '.' );
     $ext = strtolower( $i ? substr( $file, $i + 1 ) : '' );
 }
 if ( $mime === 'application/msword' ) {
     wfDebug( "$fname with extension $ext has a msword MIME type.\n" );
     if ( $ext === 'ppt' ) {
         $mime = 'application/vnd.ms-powerpoint';
         wfDebug( "$fname reset to Powerpoint MIME type.\n" );
     } elseif ( $ext === 'xls' ) {
         $mime = 'application/vnd.ms-excel';
         wfDebug( "$fname reset to Excel MIME type.\n" );
     }
 }
 # END UGLY MIME HACK

As noted above, you will need to edit the includes/mime.types file and add a line that looks like so:

application/msword doc ppt xls

If someone has a more elegant way of solving this problem, I'd love to hear about it. In the meantime, I hope this helps someone! --Dmdwiggi 02:55, 28 November 2007 (UTC)

Small changes to make it work properly in v1.11: $ext is already defined there as either the extension (a string) or true, and in the latter case the extension has to be extracted from the file name. --Aega 20:12, 5 March 2008 (UTC)
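
For anyone who wants to try the idea outside of MimeMagic.php, here is a minimal standalone sketch of the same extension-based override. The function name refineMsOfficeMime is hypothetical (it is not part of MediaWiki), and the code assumes PHP 7 or later:

```php
<?php
// Hypothetical standalone helper, not part of MediaWiki: refine a generic
// application/msword result using the file extension, as the hack above does.
function refineMsOfficeMime( string $mime, string $fileName ): string {
    if ( $mime !== 'application/msword' ) {
        return $mime; // only the ambiguous msword result is touched
    }
    // pathinfo() yields an empty string when the name has no extension.
    $ext = strtolower( pathinfo( $fileName, PATHINFO_EXTENSION ) );
    $overrides = [
        'ppt' => 'application/vnd.ms-powerpoint',
        'xls' => 'application/vnd.ms-excel',
    ];
    return $overrides[$ext] ?? $mime;
}
```

Like the original hack, this trusts the extension completely, so a renamed file will still be misidentified.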

magic file

 * The 'problem' is that the file type is determined by the Unix `file` command and its magic database, which looks at the first few bytes of the file for a unique pattern. From what I have seen, this doesn't work so well here: these are all "msword" documents, just different subtypes. On my system (probably yours too) the file /usr/share/magic.mime contains the patterns, including:

 # msword: file(1) magic for MS Word files
 #
 # Contributor claims:
 # Reversed-engineered MS Word magic numbers
 #
 0      string          \376\067\0\043                  application/msword
 0      string          \320\317\021\340\241\261        application/msword
 0      string          \333\245-\0\0\0                 application/msword


 * Now you might think that the first is .doc, the second .ppt, the third .xls, but that's not the case. I looked at .ppt and .xls files I had lying around and they both match the second line, and they match each other much farther along than just these first 6 bytes.
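
The comparison described above is easy to repeat. This is a small sketch that checks whether a file starts with the six bytes from the second magic.mime line; the constant and function names are made up for illustration:

```php
<?php
// The six bytes of the second magic.mime pattern quoted above,
// \320\317\021\340\241\261 in octal, written here in hex.
const OFFICE_MAGIC_PREFIX = "\xD0\xCF\x11\xE0\xA1\xB1";

// Hypothetical helper: true if the file begins with that shared signature.
function hasOfficeMagicPrefix( string $path ): bool {
    // Read only as many leading bytes as the signature is long.
    $head = file_get_contents( $path, false, null, 0, strlen( OFFICE_MAGIC_PREFIX ) );
    // file_get_contents() returns false on failure, which never matches.
    return $head === OFFICE_MAGIC_PREFIX;
}
```

Running it over a .doc, a .ppt and an .xls file should return true for all three if they behave like the files I looked at, which is exactly why the magic rule cannot tell them apart.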


 * If there is a distinguishable pattern later in the file header that indicates .ppt or .xls, it could be used to determine a more specific MIME type rather than just trusting the file extension. The magic file format is more flexible than a fixed offset from the beginning of the file, but more complicated cases require more complicated rules. Since a quick comparison shows that both kinds of files share a long run of identical leading bytes, judged on content alone they are indeed all "msword" files, and that's the best you can do based on content. In that case trusting the file extension may be all you can hope for. --Eric Myers 14:53, 28 November 2007 (UTC)
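
One candidate for such a pattern, offered as an untested assumption rather than something verified in this thread: these files are OLE2 compound documents, and each application stores its main data in a stream with a characteristic name, kept as UTF-16LE text in the file's directory entries. The sketch below simply searches the raw bytes for those names. The function name is hypothetical, and this is a crude heuristic, not a compound-file parser.

```php
<?php
// Hypothetical heuristic, not a real compound-file parser: scan the raw bytes
// for the UTF-16LE stream names that each Office application uses.
function guessOfficeMimeFromContent( string $data ): ?string {
    $streams = [
        'WordDocument'        => 'application/msword',
        'PowerPoint Document' => 'application/vnd.ms-powerpoint',
        'Workbook'            => 'application/vnd.ms-excel',
    ];
    foreach ( $streams as $name => $mime ) {
        // Follow each ASCII character with a NUL byte to get its UTF-16LE form.
        $needle = implode( "\0", str_split( $name ) ) . "\0";
        if ( strpos( $data, $needle ) !== false ) {
            return $mime;
        }
    }
    return null; // no known stream name found; fall back to the extension
}
```

A caller would pass in the file contents, for example from file_get_contents(). A document with an embedded object from another Office application can contain more than one of these names, so the result can be wrong, and the extension check remains a sensible fallback.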