TL;DR

The file command-line utility is really useful when you're not sure what format a file is in. For example:

$ file IMGP0175.JPG
IMGP0175.JPG: MPEG sequence, v2, program multiplex

A Mystery Photo Format

I recently received an email from my grandpa. He found an old DVD with pictures from my brother's graduation, but he could not open some of the files. I asked if he would send a few of them my way so I could take a crack at figuring out what was wrong with them.

He sent me a file called IMGP0175.JPG. I searched around and found that ImageMagick (sudo apt-get install imagemagick) had a tool called identify. From the man page (man identify):

identify - describes the format and characteristics of one or more image files.

This seemed promising, so I ran it on my file:

$ identify IMGP0175.JPG
identify: Not a JPEG file: starts with 0x00 0x00 `IMGP0175.JPG' @ error/jpeg.c/JPEGErrorHandler/322.

This told me what I already knew: it wasn't a valid JPEG. I was hoping it would try to guess what other format it was (spoiler: there's a tool that will). I continued searching and found a forum post by someone in a similar predicament. That person's problem turned out to be that the file had some garbage bytes at the start. A helpful poster pointed out that the leading bytes ImageMagick uses to determine file format are listed in a config file, which was at /etc/ImageMagick-6/magic.xml on my machine. The poster found that the JPEG leading bytes showed up after 21 bytes of garbage, and once those were removed, the image would open.

Here is the relevant portion of the magic.xml file:

<magicmap>
  <!-- <magic name="GIF" offset="0" target="GIF8"/> -->
  <!-- <magic name="JPEG" offset="0" target="\377\330\377"/> -->
  <!-- <magic name="PNG" offset="0" target="\211PNG\r\n\032\n"/> -->
  <!-- <magic name="TIFF" offset="0" target="\115\115\000\052"/> -->
</magicmap>

To check whether any of these magic bytes occurred in my file, I used the hexdump tool to display the contents of my file in hexadecimal. The octal escapes "\377\330\377" translate to FF D8 FF in hexadecimal.

$ hexdump -C IMGP0175.JPG | head -n 3
00000000  00 00 01 ba 45 e2 1e f4  f4 01 01 89 c3 f8 00 00  |....E...........|
00000010  01 e0 07 ec 80 00 00 bd  e9 25 76 03 b8 83 6c 68  |.........%v...lh|
00000020  6e 50 78 e5 05 34 a1 88  58 0f f8 7f d5 c4 30 0d  |nPx..4..X.....0.|

(Note: I used the -C option to get one-byte display. Without this option, hexdump groups the output two bytes at a time and displays each pair as a little-endian value. Practically, this means the beginning of the file above would be displayed as 0000 ba01. Thanks to [this stackexchange post][4] for clarifying that.)

Unfortunately, the FF D8 FF hex pattern was nowhere to be found in my file.
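The same search can be sketched in Python. This is only an illustration of the idea, not what ImageMagick does internally; the function name find_signature is made up for this example, and the sample bytes are the first 16 bytes from the hexdump output above.

```python
# The octal escapes from magic.xml and the hex form are the same three bytes.
JPEG_MAGIC = b"\377\330\377"
assert JPEG_MAGIC == bytes.fromhex("FFD8FF")


def find_signature(data: bytes, signature: bytes = JPEG_MAGIC) -> int:
    """Return the offset of the first occurrence of signature, or -1."""
    return data.find(signature)


# First 16 bytes of IMGP0175.JPG, copied from the hexdump output.
head = bytes.fromhex("000001ba45e21ef4f4010189c3f80000")
print(find_signature(head))                  # -1: no JPEG marker in these bytes
print(find_signature(b"junk" + JPEG_MAGIC))  # 4: marker after 4 garbage bytes
```

To scan a whole file instead of a sample, the bytes could be read with open(path, "rb").read() and passed to the same function.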

I then considered that the file might be compressed, so I wondered if there was a way to detect that. I searched around, and sure enough, the file command-line utility does just that. From the file man page:

file — determine file type

The result for my file:

$ file IMGP0175.JPG
IMGP0175.JPG: MPEG sequence, v2, program multiplex

It turns out it was a video file this whole time. I renamed it to have a ".mpg" extension, and successfully opened it in VLC. Mystery solved!

Other Formats

I was curious what file would tell me about the .deb files I had been digging into in a [previous post][2]:

$ file atom-amd64.deb
atom-amd64.deb: Debian binary package (format 2.0)

I was expecting it to tell me it was an ar file, but it went even further. The output on an ar archive is:

$ file test.ar
test.ar: current ar archive

What is this strange magic?

I was curious how the file utility does its magic. The man page describes three different tests it performs. It first examines the output of stat, then looks for some magic bytes in the beginning of the file, and finally checks whether the contents match a known text character encoding.

The man page explained that stat would indicate a symbolic link, but also a "special" file. I was curious what this would look like, so I started looking for a file that would be considered "special". I discovered that the file descriptors for running processes exist in the /proc/<pid>/fd directory. (You can get the process ID, or pid, by running ps aux | grep <process name>.) I started running stat on the file descriptors of random processes on my machine, but I kept finding symbolic links to broken pipes. I then decided to run it on the descriptors of my currently running shell process, and voila:

$ stat /proc/12156/fd/0
  File: ‘/proc/12156/fd/0’ -> ‘/dev/pts/0’
  Size: 64              Blocks: 0          IO Block: 1024   symbolic link
...

$ stat /dev/pts/0      
  File: ‘/dev/pts/0’
  Size: 0               Blocks: 0          IO Block: 1024   character special file
...

$ file /dev/pts/0
/dev/pts/0: character special (136/0)
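A rough Python equivalent of this filesystem check is sketched below. It is not how file is implemented; the function name filesystem_kind and the returned strings are made up for this example. It relies only on os.lstat and the predicates in the standard stat module, and it assumes a Linux-like system where /dev/null exists.

```python
import os
import stat


def filesystem_kind(path: str) -> str:
    """Classify a path from its lstat() result, without reading its contents."""
    mode = os.lstat(path).st_mode  # lstat does not follow symbolic links
    if stat.S_ISLNK(mode):
        return "symbolic link to " + os.readlink(path)
    if stat.S_ISCHR(mode):
        return "character special"
    if stat.S_ISBLK(mode):
        return "block special"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISFIFO(mode):
        return "fifo (named pipe)"
    if stat.S_ISSOCK(mode):
        return "socket"
    return "regular file"


# /dev/null is a character device on Linux, much like /dev/pts/0.
print(filesystem_kind("/dev/null"))
```

Only when a path turns out to be a regular file is there any point in opening it and looking at the bytes inside.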

This showed how the stat test would be helpful. I was next interested in the "magic" test, since that is what would have triggered for my MPEG file: it is neither a special file nor a text file. The man page listed the locations to look for the magic file as /etc/magic and /usr/share/misc/magic.mgc. The first location was a file containing only a comment saying I could put local magic data there, in the format described in the magic man page.

As for the magic.mgc file:

$ file /usr/share/misc/magic.mgc
/usr/share/misc/magic.mgc: symbolic link to ../file/magic.mgc

$ file /usr/share/file/magic.mgc
/usr/share/file/magic.mgc: magic binary file for file(1) cmd (version 12) (little endian)

It's a binary file, so it is not clear what specific pattern is being applied in my situation. Luckily, the file source code is available on GitHub. Here are [the lines][3] that correspond to the "MPEG sequence, v2" output:

# MPEG sequences
# Scans for all common MPEG header start codes
...
0        belong&0xFFFFFF00  0x00000100
>3       byte               0xBA           MPEG sequence
!:mime  video/mpeg
>>4      byte               &0x40          \b, v2, program multiplex
>>4      byte               ^0x40          \b, v1, system multiplex

Each line has up to four columns: a byte offset, a data type, a test value, and an optional message to print when the test succeeds. Additionally, the > characters represent a hierarchy, as explained in the magic man page:

The number of > on the line indicates the level of the test; a line with no > at the beginning is considered to be at level 0. Tests are arranged in a tree-like hierarchy: if the test on a line at level n succeeds, all following tests at level n+1 are performed, and the messages printed if the tests succeed, until a line with level n (or less) appears.
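The level rule quoted above can be sketched in a few lines of Python. This is a toy model, not the parser or matcher used by file: the names MPEG_RULE and evaluate are invented here, the tests are hand-written stand-ins for the MPEG lines shown earlier, the messages are stored without the \b prefix and simply concatenated, and the length guard on the first test is a simplification.

```python
# Each entry is (level, test, message), where level is the number of > characters.
MPEG_RULE = [
    (0, lambda d: len(d) >= 5
        and (int.from_bytes(d[0:4], "big") & 0xFFFFFF00) == 0x00000100, ""),
    (1, lambda d: d[3] == 0xBA, "MPEG sequence"),
    (2, lambda d: (d[4] & 0x40) == 0x40, ", v2, program multiplex"),
    (2, lambda d: (d[4] & 0x40) == 0, ", v1, system multiplex"),
]


def evaluate(rule, data: bytes) -> str:
    """Walk the entries in order, honouring the nesting levels."""
    enabled_level = 0  # deepest level whose tests may currently run
    messages = []
    for level, test, message in rule:
        if level > enabled_level:
            continue  # the parent test did not succeed, so skip this line
        if test(data):
            messages.append(message)
            enabled_level = level + 1  # children of this line may now run
        else:
            enabled_level = level      # only siblings or shallower lines may run
    return "".join(messages) or "data"


print(evaluate(MPEG_RULE, bytes.fromhex("000001ba45")))
# MPEG sequence, v2, program multiplex
print(evaluate(MPEG_RULE, bytes.fromhex("ffd8ffe000")))
# data
```

Because both level-2 lines are tried whenever the level-1 line succeeds, exactly one of the v1 or v2 messages ends up appended.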

Looking back at the MPEG lines, there are 3 tests that had to pass in order to observe the output that I saw.

First Test

0        belong&0xFFFFFF00  0x00000100

The first line says: start at byte 0, read a big-endian long (4 bytes), take the bitwise AND of that value with 0xFFFFFF00, and check that the result equals 0x00000100. Reading a big-endian long means the bytes are taken in the order they appear in the file, with the first byte as the most significant. With a little-endian long the order is reversed, so the 4th byte is the most significant and the first byte the least. To illustrate this, I wrote a lelong line to the /etc/magic file and then created some test files:

$ cat /etc/magic
0       lelong  0x00000001        Paul Little-Endian Long

# Create test files
$ echo "01000000" | xxd -r -p >test0.dat
$ echo "00010000" | xxd -r -p >test1.dat
$ echo "00000100" | xxd -r -p >test2.dat
$ echo "00000001" | xxd -r -p >test3.dat

# Run file on the test files
$ file test*.dat                        
test0.dat: Paul Little-Endian Long
test1.dat: raw G3 data, byte-padded
test2.dat: data
test3.dat: data

I used the xxd tool with the -r option, which turns an ASCII hex dump back into bytes. The results show that a file with 0x01 as the first byte matched the test for a little-endian long equal to 0x00000001.
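To see why only test0.dat matched, here is a small Python sketch that interprets the same four byte sequences both ways. The helper names as_lelong and as_belong are made up for this example; int.from_bytes is the standard library call doing the work.

```python
def as_lelong(raw: bytes) -> int:
    """Interpret the first 4 bytes as little-endian (first byte least significant)."""
    return int.from_bytes(raw[:4], "little")


def as_belong(raw: bytes) -> int:
    """Interpret the first 4 bytes as big-endian (first byte most significant)."""
    return int.from_bytes(raw[:4], "big")


for hex_text in ("01000000", "00010000", "00000100", "00000001"):
    raw = bytes.fromhex(hex_text)  # the same bytes that xxd -r -p writes
    print(f"{hex_text}  lelong=0x{as_lelong(raw):08x}  belong=0x{as_belong(raw):08x}")
```

Only the first sequence has a lelong value of 0x00000001, while the last one is the sequence whose belong value is 0x00000001.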

Back to the MPEG tests, the first 4 bytes of my file are 0x000001ba. This AND'd with 0xFFFFFF00 is 0x00000100, so we pass that check.
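The arithmetic is easy to check in Python. This is just the calculation from the paragraph above written out; the function name masked_belong is invented for the example.

```python
def masked_belong(data: bytes, mask: int) -> int:
    """Read the first 4 bytes as a big-endian integer and apply the mask."""
    return int.from_bytes(data[:4], "big") & mask


first_four = bytes.fromhex("000001ba")  # from the hexdump of IMGP0175.JPG
result = masked_belong(first_four, 0xFFFFFF00)
print(f"0x{result:08x}")     # 0x00000100
print(result == 0x00000100)  # True
```

The mask zeroes out the last byte, so the test only cares that the file starts with 00 00 01, whatever the fourth byte is.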

Second Test

>3       byte               0xBA           MPEG sequence

This line indicates that we should look at the byte at index 3 (the 4th byte) and check that it is equal to 0xBA. This passes.

Third Test

>>4      byte               &0x40          \b, v2, program multiplex

This line indicates that we should look at the byte at index 4 (the 5th byte) and check that its 0x40 bit (the second-highest bit) is set. The 5th byte of my file was 0x45 (binary 01000101), so this check passed as well. This means my file is an MPEG sequence, v2.

We can make a dummy file that passes all these checks by echoing 5 bytes, like this:

$ echo "000001ba40" | xxd -r -p >dummy.dat && file dummy.dat
dummy.dat: MPEG sequence, v2, program multiplex
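The bit test on the 5th byte can also be sketched in Python. The function name pack_version is made up, and 0x21 is a value chosen purely for illustration because its 0x40 bit is clear; for a single-bit mask like this one, the & and ^ tests in the magic file are simply "bit set" and "bit clear".

```python
def pack_version(fifth_byte: int) -> str:
    """Mimic the two level-2 MPEG tests on the byte at index 4."""
    if fifth_byte & 0x40:  # the &0x40 line: bit is set
        return "v2, program multiplex"
    return "v1, system multiplex"  # the ^0x40 line: bit is clear


for value in (0x45, 0x40, 0x21):
    print(f"0x{value:02x} = {value:08b} -> {pack_version(value)}")
```

Both 0x45 from the real file and 0x40 from the dummy file have the bit set, which is why the dummy file gets the same description.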

And with that, we now know how the magic in the file utility works. It's not magic at all, but the product of a lot of hard work by the maintainers to find the "magic" numbers for thousands of different file types.

[2]:{% post_url 2016-08-28-whats-in-a-deb %}
[3]:https://github.com/file/file/blob/f27ea71acfe3bf9609c923d2980227484b7f1b20/magic/Magdir/animation#L195-L199
[4]:http://unix.stackexchange.com/questions/55770/does-hexdump-respect-the-endianness-of-its-system