Thursday, February 14, 2008

File decomposition (part 1)

Almost every game ships with several proprietary file and/or archive formats. When bound by an underlying file system such as FAT, NTFS, HFS Plus or UFS (Mac OS X) you can only do so much, yet some of these come with features well beyond what you would be willing to implement yourself.

I spend a decent amount of time reverse engineering the internals of things, and when it comes to file formats I've looked into a couple that I found particularly interesting. Among these are MPQ (Mo'PaQ/Blizzard Entertainment), GCF (Half-Life 2/Valve) as well as Scimitar (Assassin's Creed/Ubisoft).

In summary, there are a couple of things you should consider when creating an archive/packing format, and these are:

Performance, compression and integrity

This is my take on it, and why these things matter.

Performance

On Win32 platforms the WINAPI CreateFile function is not only used to create files but to open them as well. The performance implication of retrieving a single file handle through this function is negligible, but consider a scenario where there are thousands of small files that need to be placed in memory. This is an entirely different case, where it is much faster to keep an in-memory representation of the archive structure and do unbuffered sequential block reads through that single file handle.
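Here is a minimal sketch of the idea, assuming a hypothetical archive whose block table is already resident in memory; the BlockEntry layout and ReadPackedFile helper are invented for illustration:

#include <windows.h>

typedef struct {
    DWORD offset; /* byte offset of the file's data within the archive */
    DWORD size;   /* size of the file's data in bytes */
} BlockEntry;

/* Reads one packed file through the archive's single handle. */
static BOOL ReadPackedFile(HANDLE hArchive, const BlockEntry *block, void *buffer)
{
    DWORD bytesRead = 0;
    if (SetFilePointer(hArchive, block->offset, NULL, FILE_BEGIN) == INVALID_SET_FILE_POINTER)
        return FALSE;
    return ReadFile(hArchive, buffer, block->size, &bytesRead, NULL)
        && bytesRead == block->size;
}

/* The archive itself is opened exactly once:
   HANDLE hArchive = CreateFile("data.pak", GENERIC_READ, FILE_SHARE_READ,
                                NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL); */

One CreateFile call up front, then every packed file is a seek and a read, instead of paying the open/close round trip thousands of times.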

The MPQ archive is efficient as it keeps the entire archive structure in memory while maintaining a very small footprint (~64K). If I wanted to open a file in this archive, however, I would need to know the file name, because the one thing you cannot do in an MPQ archive is browse.
The reason is simple: the MPQ archive uses a hash table to represent paths efficiently. For example, 'ThisFolder/ThisFile.ext' transforms into a 32-bit hash, 0x7c14a6c9, which modulo 32768 yields index 9929. In the general case, the path maps to a file described by a block found at that index (it's all very efficient in practice).
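To illustrate the lookup, here is a rough sketch; note that the hash function below (FNV-1a) is a generic stand-in, not the table-driven function MPQ actually uses, and the collision handling is only hinted at:

#include <stdint.h>

#define TABLE_SIZE 32768 /* hash table capacity; a power of two */

/* FNV-1a: a generic 32-bit string hash used here as a stand-in. */
static uint32_t HashPath(const char *path)
{
    uint32_t h = 2166136261u;
    while (*path)
        h = (h ^ (uint8_t)*path++) * 16777619u;
    return h;
}

uint32_t PathToSlot(const char *path)
{
    uint32_t hash = HashPath(path);  /* e.g. 0x7c14a6c9 */
    return hash % TABLE_SIZE;        /* 0x7c14a6c9 % 32768 == 9929 */
}

/* hashTable[slot] then resolves collisions (MPQ probes onward and
   verifies candidates with additional hashes) and yields the block
   table index describing where the file's data lives. */

Since only hashes are stored, the archive never needs to keep path strings around, which is exactly why you cannot enumerate its contents.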
This is also interesting because in Blizzard games a file is always queried in patch_rt.mpq before any other archive. This provides a way to override any file, but it is a workaround for the fact that the initial design failed to address replacement of legacy files within an archive. That became a challenge when World of Warcraft launched with massive monthly content updates.
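A sketch of the override idea might look like this; the Archive type, the FindInArchive helper and the chain ordering are hypothetical, not how Blizzard's loader is actually structured:

#include <stddef.h>

typedef struct Archive Archive; /* opaque: one opened archive */

/* Hypothetical helper: nonzero if 'path' exists in 'archive'. */
extern int FindInArchive(Archive *archive, const char *path);

/* 'chain' is NULL-terminated, highest priority first, e.g.
   { patch_rt, expansion, base, NULL }. The first hit wins, so a
   patched copy shadows the original further down the chain. */
Archive *ResolveFile(Archive **chain, const char *path)
{
    for (size_t i = 0; chain[i] != NULL; ++i)
        if (FindInArchive(chain[i], path))
            return chain[i];
    return NULL;
}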

MPQ archives did not initially support sizes larger than ~4 GB, which eventually became a problem, and the initial version left almost no room for improvement. The extensions added after the initial release never looked as clean as the original design. A word of advice: if you think you might need it, make room for it. By that I mean designing for needs you may only discover later, mostly by reserving binary space in headers. Backwards compatibility is nice, but not strictly necessary; the reusability is worth a lot more.
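As a sketch of what "making room" can mean, here is an invented on-disk header (not MPQ's actual layout) with an explicit version, a self-describing size and reserved bytes:

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint32_t magic;        /* identifies the format */
    uint16_t version;      /* bumped when reserved fields gain meaning */
    uint16_t headerSize;   /* lets older readers skip fields they don't know */
    uint64_t archiveSize;  /* 64-bit from day one sidesteps the ~4 GB wall */
    uint8_t  reserved[48]; /* zeroed today, claimable by later versions */
} ArchiveHeader;
#pragma pack(pop)

A few dozen zeroed bytes cost nothing on disk, but they are the difference between a clean version 2 and a bolted-on extension.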

I like the MPQ hash table approach; it's lean and efficient. But I realized while writing this that it's a bit hefty to go through it all at once. I will be covering compression and integrity in the future.

I've been looking into the application of ternary search trees (TST) as well as partial match algorithms, to see if there isn't an even better way of creating even more efficient archiving methods.
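As a taste of where that is going, here is a minimal TST lookup; the node layout is the textbook one, not a finished archiving scheme:

#include <stddef.h>

typedef struct TstNode {
    char ch;                      /* one character of a stored path */
    struct TstNode *lo, *eq, *hi; /* lo/hi: BST links; eq: next character */
    void *value;                  /* non-NULL marks the end of a stored path */
} TstNode;

void *TstSearch(const TstNode *node, const char *key)
{
    while (node) {
        if (*key < node->ch)
            node = node->lo;
        else if (*key > node->ch)
            node = node->hi;
        else if (key[1] == '\0')
            return node->value;   /* matched the last character */
        else {
            ++key;
            node = node->eq;
        }
    }
    return NULL;
}

Unlike a hash table, a structure like this keeps the paths themselves, so browsing and prefix matching come for free, at the cost of a larger footprint.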