September 15, 2024 in Data by Den Delimarsky · 24 minutes
Diving into Halo Infinite’s lesser-known post-match film data.
One of the conversations in my blog comments led to a discussion about film files in Halo Infinite. In case you are not familiar with them, no worries - it’s a pretty obscure component of the match data that I haven’t covered in depth yet, either on my blog or here on the OpenSpartan blog.
The idea behind film files is simple - they aren’t your traditional video but rather a combination of game engine metadata that is captured during your gameplay. When you complete a match, a “film” (a recording of all match metadata) is captured and you end up with a whole bunch of binary content that is available through a dedicated API endpoint.
Before we go down this rabbit hole, I want to give a massive shout-out to Andy Curtis for doing quite a bit of work digging through film file structure 🙌
Before we get to the film content, let’s figure out how to find it. To get started, first try to get your own matches from the Halo Infinite API. This will allow you to get the match IDs that we can later use to query for film data. You can send a request to this endpoint to get the most recent matches:
In the example above, `{{XUID}}` is the numeric identifier of your player. I talked about the process of converting a gamertag into a XUID in a separate blog post.
Note
You will need to make sure that you authenticate for the API call above to succeed (and all other API calls in this blog post). You can learn more about this in Halo Infinite Web API Authentication.
By default, the match data you get back will be in JSON format, like this:
This is all useful metadata, but we are looking specifically for the match ID captured in the `MatchId` property. In my case, the match I am looking for is `4fb89c93-53e1-4d7e-b273-5f4c4c1a58e4`, which is a recent Husky Raid game I’ve been a part of.
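If you would rather pull the match IDs out programmatically, a minimal Python sketch could look like the following. It assumes nothing about the response beyond the `MatchId` property mentioned above: it simply walks whatever JSON it is given and collects every `MatchId` it finds. The function name `collect_match_ids` and the shape of the sample payload are made up for illustration.

```python
import json
from typing import Any, List


def collect_match_ids(node: Any) -> List[str]:
    """Recursively collect every "MatchId" string found in parsed JSON."""
    found: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "MatchId" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_match_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_match_ids(item))
    return found


# Hypothetical payload: only the MatchId property comes from the post,
# the surrounding structure is an assumption.
sample = json.loads(
    '{"Results": [{"MatchId": "4fb89c93-53e1-4d7e-b273-5f4c4c1a58e4"}]}'
)
print(collect_match_ids(sample))
```

Walking the whole document is slower than indexing a known path, but it keeps the sketch independent of the exact response layout.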
With the match ID in hand, we can now request the film chunks (every film has several “chunks” that are just binary data) by constructing the URL for another API endpoint, like this:
If the call succeeds, the metadata you will get will look like this:
The way Halo Infinite API handles films is by splitting them up into separate chunks that contain different classes of in-game metadata during different parts of the game. You will see those chunks yourself when you are in theater mode - the timeline is clearly split into them (see the black markers):
Film chunks are player-independent - they are recorded for the match itself and contain metadata about all players in it. To get the content of each chunk we will construct the URL based on the `BlobStoragePathPrefix` property and the `FileRelativePath` for each chunk:
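As a rough illustration of that joining step, here is a small Python sketch. The property names `BlobStoragePathPrefix`, `FileRelativePath`, and `ChunkType` come from the film metadata; the function name `build_chunk_urls`, the placeholder host, and the sample chunk paths are assumptions made only for this example.

```python
from typing import Dict, List, Optional, Tuple


def build_chunk_urls(prefix: str, chunks: List[Dict]) -> List[Tuple[Optional[int], str]]:
    """Return (ChunkType, URL) pairs by joining the prefix and relative paths."""
    base = prefix.rstrip("/")
    urls = []
    for chunk in chunks:
        # Normalize slashes so we never end up with "//" or a missing "/".
        relative = str(chunk["FileRelativePath"]).lstrip("/")
        urls.append((chunk.get("ChunkType"), f"{base}/{relative}"))
    return urls


# Hypothetical values, not real film metadata.
prefix = "https://example.invalid/film/asset-id/version-id/"
chunks = [
    {"FileRelativePath": "/filmChunk0", "ChunkType": 1},
    {"FileRelativePath": "/filmChunk1", "ChunkType": 2},
]
for chunk_type, url in build_chunk_urls(prefix, chunks):
    print(chunk_type, url)
```

Keeping the chunk type next to each URL makes it easier to treat the chunks differently once they are downloaded.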
Note
While this is not explicitly called out, the first GUID is the film asset ID and the second is the film asset version, similar to how game asset metadata is associated in the game CMS. If you have film IDs, you can get those directly without worrying about getting match IDs first.
With the URLs ready, we can now download every single chunk for a match and analyze them. If you are on Linux (or using Windows Subsystem for Linux) you can use this Bash script to quickly download all film chunks for a match (make sure to replace your token and clearance):
You can make the script executable with `chmod +x yourscript.sh` and then run it by passing the match GUID as the first argument:
This script helpfully decompresses the chunks as well, but we’ll get to that a bit later in this post.
As you look at the metadata for each chunk you will notice that individual chunks have a type. From what I can infer, they break down like this:
Chunk type | Description |
---|---|
1 | Game bootstrap metadata |
2 | In-game event captures |
3 | Game summary metadata |
We’ll be using every single one of them in our explorations.
Looking at existing chunks, we see that the ones that have the type of `1` or `2` have very sparse event data, at least on the surface. However, they contain valuable information that we will need. To explore the content, let’s download a random chunk for an existing match:
Opening it in a hex editor produces this result:
Not exactly “human-readable”, and that’s because we’re missing a core step here - decompression. The clue is in the first two bytes of the chunk file, `78 5E`, which indicate a `zlib` stream using fast compression. You can read more about it in the official RFC (RFC 1950). Looks like we’re dealing with compressed data, and therefore need to make sure that we “extract” it before attempting to read the data.
Let’s do this a bit differently then - we’re going to download the binary file with cURL and then decompress it with Python. Assuming that you are not already using the script I shared earlier to download every chunk, our first step is this:
And then, we can run a bit of inline Python magic to decompress the content we just downloaded into its own file - `decompressed_output.bin`:
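For reference, a script-form sketch of that decompression step could look like this. It relies only on Python’s standard `zlib` module; the helper names `looks_like_zlib`, `decompress_chunk`, and `decompress_file` are made up for this example, and the header check follows the two-byte layout from RFC 1950.

```python
import zlib


def looks_like_zlib(data: bytes) -> bool:
    """Check the two-byte zlib header described in RFC 1950."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # Low nibble 8 means "deflate"; CMF * 256 + FLG must be a multiple of 31.
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


def decompress_chunk(data: bytes) -> bytes:
    """Inflate a zlib-compressed film chunk held in memory."""
    if not looks_like_zlib(data):
        raise ValueError("data does not start with a zlib header")
    return zlib.decompress(data)


def decompress_file(src_path: str, dst_path: str) -> int:
    """Decompress src_path into dst_path and return the inflated size."""
    with open(src_path, "rb") as src:
        raw = src.read()
    inflated = decompress_chunk(raw)
    with open(dst_path, "wb") as dst:
        dst.write(inflated)
    return len(inflated)
```

Calling `decompress_file("chunk.bin", "decompressed_output.bin")` (with `chunk.bin` standing in for whatever name you saved the download under) writes the inflated content next to the original.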
This looks a bit more promising because we actually see repeating patterns. It’s even more promising if we look up events inside the chunk by the XUID for a given player that existed in a match. Because I am using a hex editor, I can easily look up the `UInt64` value (all XUIDs are unsigned 64-bit integers), leading me to this:
Because Halo Infinite is generally known to use quite a bit of Bond-encoded data, I wanted to pass the content of the file through my tool - `bond-reader`. Doing that was fruitless, though, as it turned out that the data is not Bond-formatted (at least not that I could tell from some short-term digging). I guess we’ll have to stick with proper inference of binary data based on vanilla binary pattern analysis.
Another wrench thrown into our plans, also detected by Andy Curtis, is the fact that data is not necessarily byte-aligned in the film chunks. That is, if you use a hex editor to spot existing patterns you might find some, but quite a bit of data is “hiding” in plain sight because it isn’t positioned on byte boundaries, so a hex editor can’t render it in a recognizable form.
Because we can’t count on just our hex editor to find the data, we can write some custom code to find the things we want that are not aligned with our expectations 😎
To do that, here is a complete C# application that does just that - if you give it a byte pattern to search for (disregard the actual example pattern - it’s just a demo), it will try to find it regardless of how the data is actually aligned in the file:
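For readers who would rather experiment in Python, the same idea can be sketched as follows. This is not the C# application itself, just a minimal approximation of the technique: try every bit offset and rebuild each candidate byte from two neighboring bytes. The names `is_bit_match` and `find_bit_pattern` are chosen here to echo the `IsBitMatch` function discussed later and are otherwise hypothetical.

```python
from typing import List


def is_bit_match(data: bytes, bit_offset: int, pattern: bytes) -> bool:
    """Return True if pattern occurs in data starting at bit_offset."""
    byte_offset, bit_shift = divmod(bit_offset, 8)
    for i, expected in enumerate(pattern):
        idx = byte_offset + i
        if idx >= len(data):
            return False
        # Drop the bits we skip, keep the result within one byte.
        current = (data[idx] << bit_shift) & 0xFF
        if bit_shift:
            if idx + 1 >= len(data):
                return False
            # Borrow the leading bits of the next byte to complete this one.
            current |= data[idx + 1] >> (8 - bit_shift)
        if current != expected:
            return False
    return True


def find_bit_pattern(data: bytes, pattern: bytes) -> List[int]:
    """Return every bit offset at which pattern occurs in data."""
    if not pattern:
        return []
    last_start = (len(data) - len(pattern)) * 8
    return [o for o in range(last_start + 1) if is_bit_match(data, o, pattern)]
```

A pure-Python loop like this is slow on large chunks, so treat it as a way to explore the format rather than a production scanner.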
Running this code will enable us to quickly detect the positions of data sequences that contain relevant information. For example, one of the observations about the film file is that we can spot XUID references by looking at the `0x2D 0xC0` pattern. If we use this pattern and run the tool across a set of film chunks we’ll see quite a few results:
Before we go any further, though, let me explain a bit the “magic” of bit shifting that you might’ve noticed in the program above. Let’s say we have a data array like this:
Byte Index | Hex Value | Binary |
---|---|---|
0 | 0xAB | 10101011 |
1 | 0xCD | 11001101 |
2 | 0xEF | 11101111 |
3 | 0x12 | 00010010 |
The pattern we want to look for is this:
Byte Index | Hex Value | Binary |
---|---|---|
0 | 0xCD | 11001101 |
1 | 0xEF | 11101111 |
Let’s pick a random bit offset - `10`. That means that we skip the first 10 bits of the data array. If we look at the `IsBitMatch` function, it takes the bit offset as an argument.
That means that if we pass `10` as the value, we get a `byteOffset` of `1` (the integer part of 10 / 8), meaning that we skip an entire byte (just one) when looking for the data.
Now, keep in mind that when calculating `byteOffset` it was not a “clean” division - we have a remainder that is helpfully captured by `bitShift`, and that remainder is equal to `2` (10 mod 8), which means that with the byte at index `1` (remember, we skipped the one at `0`), we start with the third bit (skipping the first two, as `bitShift` tells us).
That can be visualized in a table like this:
Byte Index | Hex Value | Binary | Comment |
---|---|---|---|
0 | 0xAB | 10101011 | We’re skipping this entirely. |
1 | 0xCD | 11001101 | We start comparing from the third bit. |
2 | 0xEF | 11101111 | We’ll borrow the leading bits of this byte to make sure we can build a full byte. |
3 | 0x12 | 00010010 | Used in comparison later. |
Now, I mentioned that we start our parsing with the byte at index `1` at the third bit. Look at the binary representation for that byte: `11001101`.
We skip the first two bits, and shift the bits left, padding the “missing” bits with zeroes at the end: `00110100`.
Now, instead of using the zeroes, we can steal the two leading bits from the next byte in our sequence (at index `2` - that is, `0xEF`). We shift it right by six bits to get the top 2 bits (because that’s all we need to complete it), so that `11101111` becomes `00000011`.
So now from the shifted bytes we have these two values: `00110100` and `00000011`.
Combining them with a bitwise OR gives us `00110111`.
This binary value does not match the first value of our pattern (`11001101`), so the search will move on to the next offset, and so on.
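The walkthrough for bit offset `10` can be double-checked with a few lines of Python. The variable names are chosen for this example and only loosely mirror `byteOffset` and `bitShift`.

```python
data = bytes([0xAB, 0xCD, 0xEF, 0x12])
bit_offset = 10

# Whole bytes to skip, and leftover bits to skip inside the next byte.
byte_offset, bit_shift = divmod(bit_offset, 8)

# Shift the current byte left and keep only the low 8 bits.
high = (data[byte_offset] << bit_shift) & 0xFF
# Take the top bit_shift bits of the following byte.
low = data[byte_offset + 1] >> (8 - bit_shift)
# Merge both parts into the byte we compare against the pattern.
combined = high | low

print(byte_offset, bit_shift)
print(f"{high:08b} {low:08b} {combined:08b}")
print(combined == 0xCD)
```

Changing `bit_offset` to `8` makes the combined byte equal `0xCD`, which is where the pattern actually starts in this sample array.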
So now that we have an idea of how to look for data we can start looking at individual “envelopes” that contain player details. As I mentioned above, there are many chunks that are usually provided for a given film; however, the ones that capture specific events, like deaths, kills, or medal awards, are all aggregated in the last film chunk file, with the `ChunkType` of `3`.
Within the very last chunk (of type `3`) the events are usually structured like this:
Header | Gamertag (Unicode) | Padding | Type | Timestamp | Padding | Medal Marker | Padding | Metadata (Medal Type) |
---|---|---|---|---|---|---|---|---|
12 bytes | 32 bytes | 15 bytes | 1 byte | 4 bytes | 3 bytes | 1 byte | 3 bytes | 1 byte |
Note
Be careful with assuming that a gamertag is unique within a match. There were cases where the same match had a gamertag like `MyGamertag` and another `MsMyGamertag` - you can’t search just for `MyGamertag` as that will produce some unexpected results. You need to check that the 12 preceding bytes of “header” exist (arbitrary given that I don’t know what they represent, but consistent for individual gamertags) and that the bytes before that header are `0x00` (I limit the check to 3 zero bytes). That way you can ensure that you are extracting a properly offset event.
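A minimal Python sketch of that guard is shown below. It assumes the data has already been byte-aligned and that gamertags are encoded as UTF-16 little-endian; both are assumptions of this example, as are the names `find_event_starts`, `HEADER_LEN`, and `ZERO_GUARD`.

```python
from typing import List

HEADER_LEN = 12  # opaque bytes that precede each gamertag
ZERO_GUARD = 3   # zero bytes expected right before that header


def find_event_starts(data: bytes, gamertag: str,
                      encoding: str = "utf-16-le") -> List[int]:
    """Return byte positions where gamertag starts a plausible event."""
    needle = gamertag.encode(encoding)
    starts = []
    pos = data.find(needle)
    while pos != -1:
        guard_start = pos - HEADER_LEN - ZERO_GUARD
        if guard_start >= 0:
            guard = data[guard_start:pos - HEADER_LEN]
            # Only accept hits whose header is preceded by zero bytes.
            if guard == bytes(ZERO_GUARD):
                starts.append(pos)
        pos = data.find(needle, pos + 1)
    return starts
```

With this check, a search for `MyGamertag` skips the hit inside `MsMyGamertag`, because the bytes in front of that hit belong to the other player’s header rather than to zero padding.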
Note
Some matches may not have a chunk of type `3` - that’s very likely a bug in the API. Without this chunk there is no timeline you can parse as easily. Additionally, it’s entirely possible that the chunk of type `3` doesn’t contain gamertag-associated data. Additional investigation is needed to understand that behavior.
If you are using a tool like 010 Editor and extract the binary data on a per-file basis (i.e., find the bit positions for the gamertag start and then extract the bytes into its own file from there), you can use the following extremely basic binary template to highlight the sequences for easier parsing:
The structure above is consistent across matches - I’ve extracted thousands of my own games and run into minimal issues (with the exception of a few stray gamertags).
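The field sizes from the table above add up to 72 bytes, which can be turned into a small Python parser. This is only a sketch: the big-endian timestamp and the UTF-16 little-endian gamertag encoding are assumptions of this example rather than confirmed facts about the format, and `FilmEvent` and `parse_event` are hypothetical names.

```python
from typing import NamedTuple

# 12 + 32 + 15 + 1 + 4 + 3 + 1 + 3 + 1 bytes, as listed in the table.
ENVELOPE_SIZE = 72


class FilmEvent(NamedTuple):
    header: bytes
    gamertag: str
    event_type: int
    timestamp_ms: int
    medal_marker: int
    medal_type: int


def parse_event(block: bytes, byteorder: str = "big",
                encoding: str = "utf-16-le") -> FilmEvent:
    """Split one byte-aligned event envelope into its fields."""
    if len(block) < ENVELOPE_SIZE:
        raise ValueError("envelope is shorter than 72 bytes")
    text = block[12:44].decode(encoding, errors="replace")
    return FilmEvent(
        header=block[0:12],
        gamertag=text.split("\x00")[0],  # stop at the first NUL
        event_type=block[59],            # after 15 bytes of padding
        timestamp_ms=int.from_bytes(block[60:64], byteorder),
        medal_marker=block[67],          # after 3 bytes of padding
        medal_type=block[71],            # after 3 more bytes of padding
    )
```

The input must already start at the 12-byte header and be byte-aligned, so for unaligned data the bytes need to be re-extracted at the right bit offset first.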
Out of all the fields above, the most interesting to me is the metadata one. The metadata field (i.e., the medal type) captures numeric values that represent medals. The values are different from the medal mapping. There is no clear mapping between those and a human-readable JSON representation, so we need to infer them by looking at medal volume here and correlating it with medals earned per match or through a player’s career. Andy Curtis did the heavy lifting on this for some medals in his SPNKr project (a few are pending additional research).
The following medals are currently known:
Medal ID | Medal |
---|---|
0 | Double Kill |
1 | Triple Kill |
2 | Overkill |
3 | Killtacular |
4 | Killtrocity |
5 | Killamanjaro |
6 | Killtastrophe |
7 | Killpocalypse |
8 | Killionaire |
9 | Killing Spree |
10 | Killing Frenzy |
11 | Running Riot |
12 | Rampage |
13 | Perfection |
26 | Killjoy |
27 | Nightmare |
28 | Boogeyman |
29 | Grim Reaper |
30 | Demon |
31 | Flawless Victory |
32 | Steaktacular |
36 | Stopped Short |
37 | Flag Joust |
38 | Goal Line Stand |
39 | Necromancer |
43 | Ace |
44 | Extermination |
45 | Sole Survivor |
46 | Untainted |
47 | Blight |
48 | Disease |
49 | Plague |
51 | Pestilence |
53 | Culling |
54 | Cleansing |
55 | Purge |
56 | Purification |
57 | Divine Intervention |
58 | Zombie Slayer |
59 | Undead Hunter |
60 | Hell’s Janitor |
61 | The Sickness |
62 | Spotter |
63 | Treasure Hunter |
64 | Saboteur |
65 | Wingman |
66 | Wheelman |
67 | Gunner |
68 | Driver |
69 | Pilot |
70 | Tanker |
71 | Rifleman |
72 | Bomber |
73 | Grenadier |
74 | Boxer |
75 | Warrior |
76 | Gunslinger |
77 | Scattergunner |
78 | Sharpshooter |
79 | Marksman |
80 | Heavy |
81 | Bodyguard |
82 | Back Smack |
83 | Nuclear Football |
84 | Boom Block |
85 | Bulltrue |
86 | Cluster Luck |
87 | Dogfight |
88 | Harpoon |
89 | Mind the Gap |
90 | Ninja |
91 | Odin’s Raven |
92 | Pancake |
93 | Quigley |
94 | Remote Detonation |
95 | Return to Sender |
96 | Rideshare |
97 | Skyjack |
98 | Stick |
99 | Tag & Bag |
100 | Whiplash |
101 | Kong |
102 | Autopilot Engaged |
103 | Sneak King |
104 | Windshield Wiper |
105 | Reversal |
106 | Hail Mary |
107 | Nade Shot |
108 | Snipe |
109 | Perfect |
110 | Bank Shot |
111 | Fire & Forget |
112 | Ballista |
113 | Pull |
114 | No Scope |
115 | Achilles Spine |
116 | Grand Slam |
117 | Guardian Angel |
118 | Interlinked |
119 | Death Race |
120 | Chain Reaction |
121 | 360 |
122 | Combat Evolved |
123 | Deadly Catch |
124 | Driveby |
125 | Fastball |
126 | Flyin’ High |
127 | From the Grave |
128 | From the Void |
129 | Grapple-jack |
130 | Hold This |
131 | Last Shot |
132 | Lawnmower |
133 | Mount Up |
134 | Off the Rack |
135 | Quick Draw |
137 | Pineapple Express |
138 | Ramming Speed |
139 | Reclaimer |
140 | Shot Caller |
141 | Yard Sale |
142 | Special Delivery |
146 | Fumble |
148 | Straight Balling |
151 | Always Rotating |
152 | Hill Guardian |
153 | Clock Stop |
154 | Secure Line |
156 | Splatter |
162 | All That Juice |
163 | Great Journey |
165 | Breacher |
166 | Mounted & Loaded |
167 | Monopoly |
168 | Counter-snipe |
174 | Driving Spree |
175 | Death Cabbie |
176 | Immortal Chauffeur |
177 | Blind Fire |
178 | Hang Up |
179 | Call Blocked |
180 | Clear Reception |
The event type, also captured in the envelope, can be one of the following:
Type (Decimal) | Description |
---|---|
10 | Mode-specific events (e.g., captured the flag, killed the carrier, stole the flag) |
20 | Death |
50 | Kill |
Note
Any other type identifier (such as `51`, `100`, or `250`) that you may see here, when associated with a medal, represents the medal sorting weight. It maps 1:1 to the information that you can get from the medal metadata endpoint.
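To make the two lookup tables easier to use together, here is a small Python sketch. The `MEDALS` dictionary holds only a handful of entries copied from the medal table, and the rule that a known event type takes precedence over a medal is an assumption of this example; `describe_event` is a hypothetical helper.

```python
from typing import Optional

EVENT_TYPES = {10: "Mode-specific event", 20: "Death", 50: "Kill"}

# A few entries copied from the medal table; extend as needed.
MEDALS = {0: "Double Kill", 1: "Triple Kill", 9: "Killing Spree",
          90: "Ninja", 114: "No Scope"}


def describe_event(event_type: int, medal_id: Optional[int] = None) -> str:
    """Turn the numeric type (and optional medal ID) into readable text."""
    if event_type in EVENT_TYPES:
        return EVENT_TYPES[event_type]
    if medal_id is not None:
        # For medal events the type byte carries the medal sorting weight.
        name = MEDALS.get(medal_id, f"Unknown medal {medal_id}")
        return f"Medal: {name} (sorting weight {event_type})"
    return f"Unknown event type {event_type}"


print(describe_event(50))
print(describe_event(100, 90))
```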
Timestamp data is represented in milliseconds from the start of the match. You can obtain a readable value with a C# snippet like this:
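The same conversion is straightforward in Python as well. The helper below is a sketch with a made-up name (`format_match_time`); it only assumes that the value is a non-negative number of milliseconds since the start of the match.

```python
def format_match_time(timestamp_ms: int) -> str:
    """Format milliseconds since match start as mm:ss.mmm."""
    if timestamp_ms < 0:
        raise ValueError("timestamp must not be negative")
    minutes, remainder = divmod(timestamp_ms, 60_000)
    seconds, millis = divmod(remainder, 1_000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


print(format_match_time(208_991))  # 03:28.991
```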
One thing that I haven’t yet figured out is how assists are tracked within the event batch. It’s likely captured as a XUID reference further in the event envelope that I didn’t get to. This will be a topic for another blog post in the future as we dig more through the film file format.
Notice that to extract all events from the last chunk one specific thing is still needed - we need to start with knowing the gamertags for which the events should be extracted. And because gamertags are technically arbitrary text, we need to find an index somewhere. To do that, we can look inside all other chunks (other than ones of type `3`). That’s right, for us to get the list of gamertags that were involved in a given game we need to download and parse all existing film chunks other than the very last one that has `ChunkType` set to `3`.
The last chunk contains information on all players in the game but doesn’t seem to contain a very clear XUID and gamertag combination that will allow us to extract them cleanly. Luckily, inside all other chunks (where `ChunkType` is either `1` or `2`), the gamertags and XUIDs can be found by looking at the pattern `0x2D 0xC0`. From that pattern, we can deduce the following structure:
Gamertag (Unicode) | Padding | XUID | Marker 1 | Marker 2 |
---|---|---|---|---|
Dynamic length (32 bytes max) | 21 bytes | 8 bytes | 0x2D | 0xC0 |
Note
Keep in mind that gamertags are stored as Unicode (UTF-16) text. This means that the padding can be deceiving if you are looking at the binary file - you might think that there are 22 `0x00` bytes before the gamertag value, when in fact the last zero byte is just the trailing byte for the gamertag text. Make sure to be careful when parsing the values.
We can scan all film chunks for this pattern by identifying the markers, getting the XUID, checking that the preceding 21 bytes are `0x00` (padding zero bytes), and then grabbing 32 bytes of the gamertag data that can be parsed as a Unicode string. There are more safeguards we can put in place for this logic, but ultimately it’s good enough to extract the basic data.
Once the data is extracted into, say, a dictionary, we can use that as a starting point to look up gamertags in the final (summary) chunk.
Note
As I mentioned earlier, depending on the matches that you are getting, some of them might not have a chunk with `ChunkType` equal to `3`. Others can return HTTP `404` (blob does not exist) errors when attempting to download a chunk. The former may be a bug. The latter is likely caused by the folks at 343 occasionally cleaning up the storage from older matches.
In C#, the extraction logic can be formalized as such:
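A rough Python equivalent of that logic, for a record that has already been pulled out at the right bit offset, might look like this. The field sizes and the marker come from the structure described earlier; the little-endian XUID, the UTF-16 little-endian gamertag encoding, and the name `parse_player_record` are assumptions of this sketch.

```python
from typing import Optional, Tuple

MARKER = b"\x2D\xC0"
GAMERTAG_LEN = 32
PADDING_LEN = 21
XUID_LEN = 8
RECORD_LEN = GAMERTAG_LEN + PADDING_LEN + XUID_LEN + len(MARKER)  # 63 bytes


def parse_player_record(record: bytes, xuid_byteorder: str = "little",
                        encoding: str = "utf-16-le") -> Optional[Tuple[int, str]]:
    """Return (xuid, gamertag) or None if the record fails the checks."""
    if len(record) != RECORD_LEN or not record.endswith(MARKER):
        return None
    tag_raw = record[:GAMERTAG_LEN]
    padding = record[GAMERTAG_LEN:GAMERTAG_LEN + PADDING_LEN]
    xuid_raw = record[GAMERTAG_LEN + PADDING_LEN:-len(MARKER)]
    # The 21 bytes in front of the XUID are expected to be zero.
    if padding != bytes(PADDING_LEN):
        return None
    gamertag = tag_raw.decode(encoding, errors="replace").strip("\x00")
    if not gamertag:
        return None
    return int.from_bytes(xuid_raw, xuid_byteorder), gamertag
```

Collecting the results into a dictionary keyed by XUID gives a lookup table for the gamertags that appear in the summary chunk.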
Recall that the data may or may not be byte-aligned so we need to operate on individual bits. In turn, once we find the marker pattern in film segment chunks (as we try to spot the gamertag and XUID combos), we can extract it with a function like this (where `pattern` is set to `0x2D 0xC0`):
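In Python, reading bytes that start at an arbitrary bit offset can be sketched with integer arithmetic. The helpers `read_bytes_at_bit` and `bytes_before_marker` are hypothetical names for this example, and the sketch assumes you already know the bit offset at which the marker begins.

```python
from typing import Optional


def read_bytes_at_bit(data: bytes, bit_offset: int, count: int) -> bytes:
    """Return count bytes of data starting at an arbitrary bit offset."""
    if bit_offset < 0 or count < 0:
        raise ValueError("offset and count must not be negative")
    byte_offset, bit_shift = divmod(bit_offset, 8)
    # An unaligned read touches one extra source byte.
    needed = count + (1 if bit_shift else 0)
    window = data[byte_offset:byte_offset + needed]
    if len(window) < needed:
        raise ValueError("not enough data at this offset")
    value = int.from_bytes(window, "big")
    if bit_shift:
        value >>= 8 - bit_shift         # drop the unused trailing bits
    value &= (1 << (count * 8)) - 1     # drop the skipped leading bits
    return value.to_bytes(count, "big")


def bytes_before_marker(data: bytes, marker_bit_offset: int,
                        count: int) -> Optional[bytes]:
    """Return the count bytes that end right where the marker begins."""
    start = marker_bit_offset - count * 8
    if start < 0:
        return None
    return read_bytes_at_bit(data, start, count)
```

Asking for the 61 bytes before a marker (32 for the gamertag, 21 of padding, 8 for the XUID) yields a byte-aligned copy of the record that regular byte-based parsing can handle.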
To simplify how I extract the data, I built a tool called `OpenSpartan/film-event-extractor`, which will let you log in with your Xbox Live ID and aggregate all match data within a local SQLite database. The entire parsing logic is very much in flux (feel free to follow the discussion on this), but once it stabilizes I can see integrating this better in OpenSpartan Workshop.
For my own account, having played more than seven thousand matches, the entire aggregation took around 48 hours. I haven’t yet optimized (or parallelized) the code, so some of this can also be attributed to me building a slower-than-needed tool, but it works for now and I can start analyzing the data.
The data that is available through the API is mostly good as-is, but an expanded dataset that accounts for film-based details enables me to see two things more clearly:
There are a few improvements that I want to make both to the open-source tool that I built and to my understanding of the film files. I alluded to assists earlier - that’s a data point that I definitely want to cover. Additionally, film files may contain the data required for us to build heatmaps of map movement. For that, we need to better replicate behaviors in the game - that is, understand how binary data changes with movement, weapon switches, use of grenades, and so on. Something tells me it will be a much more protracted project than I initially anticipated 🤔