Calculating the Duration of MP3 Files

Armed with the ID3 decoder from my last post, we can extract most of the metadata from MP3 files. However, the one piece I still want for my music cataloging software is the track duration, which, for the vast majority of my files, is not included in the ID3 tag. Getting the duration of an audio file isn’t as straightforward as I had hoped. One of the easiest solutions would be to just shell out to another piece of software, such as ffmpeg, which can handle a great many audio formats. But that would be boring, and I wanted to minimize the number of dependencies, which meant writing a rudimentary MP3 decoder myself. Luckily, I don’t need to actually play back audio, so I can avoid a great deal of complexity. I only need to parse enough of the MP3 file to figure out how long it is.

First, an overview of the anatomy of an MP3 file. As we went through in the last post, an MP3 file may optionally start with an ID3 tag containing metadata about the track. After that comes the meat of the file: a sequence of MP3 frames. Each frame contains a specific amount of encoded audio data (the actual amount is governed by a number of properties encoded in the frame header). The total duration of the file is simply the sum of the durations of the individual frames. Each MP3 frame starts with a four-byte header, made up of a marker sequence of 11 1-bits followed by 21 bits of flags describing the contents of the frame, and then the audio data itself.

Based on this information, many posts on various internet forums suggest inspecting the first frame to find its bitrate, and then dividing the bit size of the entire file by the bitrate to get the total duration. There are a few problems with this though. The biggest is that MP3 files can generally be divided into two groups: constant bitrate (CBR) files and variable bitrate (VBR) ones. Bitrate refers to the number of bits used to represent audio data for a certain time interval. As the name suggests, files encoded with a constant bitrate use the same number of bits per second to represent audio data throughout the entire file. For the naive length estimation method, this is okay (though not perfect, because it also counts any bytes that aren’t part of a frame, such as the ID3 tag and any potential unused space in between frames). In variable bitrate MP3s though, each frame can have a different bitrate, which allows the encoder to work more space-efficiently (because portions of the audio that are less complex can be encoded at a lower bitrate). Because of this, the naive estimation doesn’t work at all (unless, by coincidence, the bitrate of the first frame happens to be close to the average bitrate for the entire file). In order to accurately get the duration for a VBR file, we need to go through every single frame in the file and sum their individual durations. So that’s what we’re gonna do.
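
For comparison, the naive estimate boils down to a single division. This is only a sketch, and the module and function names are made up for illustration; it assumes you already know the file size and the bitrate of the first frame.

```elixir
defmodule NaiveEstimate do
  # Hypothetical helper: estimate the duration in seconds from the total
  # file size and the bitrate (in kbps) of the first frame.
  # Only reasonable for constant bitrate files.
  def duration(size_in_bytes, kbps) do
    size_in_bytes * 8 / (kbps * 1000)
  end
end

# A 4,000,000 byte file whose first frame is encoded at 128 kbps:
NaiveEstimate.duration(4_000_000, 128)
# => 250.0
```

If that same file were VBR and the first frame happened to be a quiet 32 kbps one, the estimate would come out four times too long.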

The overall structure of the MP3 decoder is going to be fairly similar to the ID3 one. We can even take advantage of the existing ID3 decoder to skip over the ID3 tag at the beginning of the file, thereby avoiding any false syncs (the parse_tag function needs to be amended to return the remaining binary data after the tag in order to do this). From there, it’s simply a matter of scanning through the file looking for the magic sequence of 11 1-bits that marks the frame synchronization point, and repeating that until we reach the end of the file.

def get_mp3_duration(data) when is_binary(data) do
  {_, rest} = ID3.parse_tag(data)
  parse_frame(rest, 0, 0, 0)
end
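
If you don’t have the ID3 decoder from the last post handy, something like the following could stand in for the amended parse_tag. It is a minimal sketch under a few assumptions: the module and function names are hypothetical, it only handles an ID3v2 tag at the very start of the data, and it ignores the optional footer. It relies on the ID3v2 header being 10 bytes long, with the tag size stored in the last four bytes as a “synchsafe” integer (7 useful bits per byte).

```elixir
defmodule TagSkipper do
  # Hypothetical stand-in for the amended ID3.parse_tag/1: it discards the
  # tag and returns only the data that follows it.
  def skip_id3v2(
        <<"ID3", _version::binary-size(2), _flags::size(8),
          0::size(1), s1::size(7), 0::size(1), s2::size(7),
          0::size(1), s3::size(7), 0::size(1), s4::size(7),
          rest::binary>>
      ) do
    # Glue the four 7-bit groups back together into one 28-bit integer.
    <<size::size(28)>> = <<s1::size(7), s2::size(7), s3::size(7), s4::size(7)>>

    case rest do
      <<_tag::binary-size(size), audio::binary>> -> audio
      # The tag claims to be longer than the data we have.
      _truncated -> <<>>
    end
  end

  # No ID3v2 tag at the start: hand everything back untouched.
  def skip_id3v2(data) when is_binary(data), do: data
end

# A fake two-byte tag followed by the start of a frame header:
TagSkipper.skip_id3v2(<<"ID3", 3, 0, 0, 0, 0, 0, 2, 0xAA, 0xBB, 0xFF, 0xFB>>)
# => <<0xFF, 0xFB>>
```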

The parse_frame function takes several arguments in addition to the data. These are the accumulated duration, the number of frames parsed so far, and the byte offset in the file. These last two aren’t strictly needed for parsing the file, but come in very useful if you have to debug any issues. The function has several different cases. The first looks for the sync bits at the start of the binary and, if it finds them, parses the frame header to calculate the duration, adds it to the accumulator, and recurses. The next case skips a byte from the beginning of the binary and then recurses. And the final case handles an empty binary and simply returns the accumulated duration.

def parse_frame(
      <<
        0xFF::size(8),
        0b111::size(3),
        version_bits::size(2),
        layer_bits::size(2),
        _protected::size(1),
        bitrate_index::size(4),
        sampling_rate_index::size(2),
        padding::size(1),
        _private::size(1),
        _channel_mode_index::size(2),
        _mode_extension::size(2),
        _copyright::size(1),
        _original::size(1),
        _emphasis::size(2),
        _rest::binary
      >> = data,
      acc,
      frame_count,
      offset
    ) do
  # The body is filled in over the rest of the post. The whole binary is
  # bound to data so that we can skip bytes from it later on.
end

def parse_frame(<<_::size(8), rest::binary>>, acc, frame_count, offset) do
  parse_frame(rest, acc, frame_count, offset + 1)
end

def parse_frame(<<>>, acc, _frame_count, _offset) do
  acc
end
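
To get a feel for how the bit-level pattern carves up a header, here is a small standalone sketch. The module name and the returned map are made up for illustration, and it only pulls out the raw field values without looking anything up. The four bytes FF FB 90 64 are a header you’ll commonly see at the start of 128 kbps files.

```elixir
defmodule HeaderDemo do
  # Hypothetical helper that only splits a header into its raw fields.
  def fields(
        <<0xFF::size(8), 0b111::size(3), version_bits::size(2), layer_bits::size(2),
          _protected::size(1), bitrate_index::size(4), sampling_rate_index::size(2),
          padding::size(1), _rest::bitstring>>
      ) do
    %{
      version_bits: version_bits,
      layer_bits: layer_bits,
      bitrate_index: bitrate_index,
      sampling_rate_index: sampling_rate_index,
      padding: padding
    }
  end

  # Anything that doesn't start with the 11 sync bits.
  def fields(_other), do: :no_sync
end

HeaderDemo.fields(<<0xFF, 0xFB, 0x90, 0x64>>)
# => %{version_bits: 3, layer_bits: 1, bitrate_index: 9, sampling_rate_index: 0, padding: 0}

HeaderDemo.fields(<<0x00, 0x01, 0x02, 0x03>>)
# => :no_sync
```

Note that the remainder is matched as a bitstring here rather than a binary, because the pattern stops partway through a byte.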

The main implementation of the parse_frame function isn’t too complicated. It’s just getting a bunch of numbers out of lookup tables and doing a bit of math.

The first thing we need to know is what kind of frame we are looking at. MP3 frames are divided two ways, by the version of the frame and the layer of the frame. In the header, there are two fields, each two bits wide, that indicate the version and layer. But not every combination of those bits is valid. There are only three versions (and version 2.5 is technically an unofficial addition at that) and three layers. We can use a couple of functions to look up atoms representing the different versions/layers, since it’s more convenient than having to use the raw numeric values in other places. We also return the :invalid atom for version 0b01 and layer 0b00 respectively, so that if we encounter one when parsing a frame, we can immediately stop and look for the next sync point.

defp lookup_version(0b00), do: :version25
defp lookup_version(0b01), do: :invalid
defp lookup_version(0b10), do: :version2
defp lookup_version(0b11), do: :version1

defp lookup_layer(0b00), do: :invalid
defp lookup_layer(0b01), do: :layer3
defp lookup_layer(0b10), do: :layer2
defp lookup_layer(0b11), do: :layer1

The next piece of information we need is the sampling rate, which is the number of individual audio samples taken per second (in Hertz). In the header, it’s also represented by two bits, but the rate they stand for depends on the version. As before, we pattern match in the function definition to find the actual sampling rate from the version and index, and return :invalid if the index is not permitted.

defp lookup_sampling_rate(_version, 0b11), do: :invalid
defp lookup_sampling_rate(:version1, 0b00), do: 44100
defp lookup_sampling_rate(:version1, 0b01), do: 48000
defp lookup_sampling_rate(:version1, 0b10), do: 32000
defp lookup_sampling_rate(:version2, 0b00), do: 22050
defp lookup_sampling_rate(:version2, 0b01), do: 24000
defp lookup_sampling_rate(:version2, 0b10), do: 16000
defp lookup_sampling_rate(:version25, 0b00), do: 11025
defp lookup_sampling_rate(:version25, 0b01), do: 12000
defp lookup_sampling_rate(:version25, 0b10), do: 8000

The last piece of information we need from the header is the bitrate, or the number of bits that are used to represent a single second of audio (or, in our case, the number of kilobits, for simplicity’s sake). The header has a four-bit field that represents which index in the lookup table we should use to find the bitrate. But that’s not all the information that’s necessary. In order to know which lookup table to use, we also need the version and layer of the frame. For each version and layer combination there is a different set of bitrates that the frame may use.

So, the lookup_bitrate function will need to take three parameters: the version, layer, and bitrate index. First off, index 15 is forbidden by the spec and index 0 denotes a “free format” bitrate that can’t be found in any table, so for both we can just return the :invalid atom regardless of the version or layer. For the other version/layer combinations, we simply look up the index in the appropriate list. A couple of things to note are that in version 2, layers 2 and 3 use the same bitrates, and all layers for version 2.5 use the same bitrates as version 2.

@v1_l1_bitrates [:invalid, 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, :invalid]
@v1_l2_bitrates [:invalid, 32, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 384, :invalid]
@v1_l3_bitrates [:invalid, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, :invalid]
@v2_l1_bitrates [:invalid, 32, 48, 56, 64, 80, 96, 112, 128, 144, 160, 176, 192, 224, 256, :invalid]
@v2_l2_l3_bitrates [:invalid, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160, :invalid]

defp lookup_bitrate(_version, _layer, 0), do: :invalid
defp lookup_bitrate(_version, _layer, 0xF), do: :invalid
defp lookup_bitrate(:version1, :layer1, index), do: Enum.at(@v1_l1_bitrates, index)
defp lookup_bitrate(:version1, :layer2, index), do: Enum.at(@v1_l2_bitrates, index)
defp lookup_bitrate(:version1, :layer3, index), do: Enum.at(@v1_l3_bitrates, index)
defp lookup_bitrate(v, :layer1, index) when v in [:version2, :version25], do: Enum.at(@v2_l1_bitrates, index)
defp lookup_bitrate(v, l, index) when v in [:version2, :version25] and l in [:layer2, :layer3], do: Enum.at(@v2_l2_l3_bitrates, index)

One could do some fancy metaprogramming to generate a function case for each version/layer/index combination to avoid the Enum.at call at runtime and avoid some of the code repetition, but this is perfectly fine.
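
For the curious, the metaprogramming version might look something like this. It’s a sketch rather than what the decoder actually uses: the module name is made up, and only the version 1, layer 3 table is shown. A comprehension in the module body runs at compile time, and unquote injects each index and bitrate into its own function clause.

```elixir
defmodule BitrateTable do
  # Only the valid entries (indices 1 through 14) of the version 1, layer 3 table.
  @v1_l3_bitrates [32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]

  # Generate one function clause per valid index at compile time.
  for {kbps, index} <- Enum.with_index(@v1_l3_bitrates, 1) do
    def lookup(:version1, :layer3, unquote(index)), do: unquote(kbps)
  end

  # Everything else, including indices 0 and 15, is treated as invalid.
  def lookup(_version, _layer, _index), do: :invalid
end

BitrateTable.lookup(:version1, :layer3, 9)
# => 128

BitrateTable.lookup(:version1, :layer3, 15)
# => :invalid
```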

With those four functions implemented, we can return to the body of the main parse_frame implementation.

def parse_frame(...) do
  with version when version != :invalid <- lookup_version(version_bits),
       layer when layer != :invalid <- lookup_layer(layer_bits),
       sampling_rate when sampling_rate != :invalid <- lookup_sampling_rate(version, sampling_rate_index),
       bitrate when bitrate != :invalid <- lookup_bitrate(version, layer, bitrate_index) do
  else
    _ ->
      <<_::size(8), rest::binary>> = data
      parse_frame(rest, acc, frame_count, offset + 1)
  end
end

We call the individual lookup functions for each of the pieces of data we need from the header. Using Elixir’s with statement lets us pattern match on a bunch of values together. If any of the functions returns :invalid, the guard on that clause will fail and it will fall through to the else part of the with statement, which matches anything. If that happens, we skip the first byte from the binary and recurse, looking for the next potential sync point.
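
If the guard-in-a-with trick is unfamiliar, here it is in isolation. This is a toy sketch with a made-up module and a made-up lookup function standing in for the real ones.

```elixir
defmodule WithDemo do
  # Hypothetical stand-in for the lookup functions in this post.
  defp lookup(0), do: :invalid
  defp lookup(n), do: n * 2

  def check(a, b) do
    # Each clause only matches if the guard holds; the first failure
    # jumps straight to the else branch.
    with x when x != :invalid <- lookup(a),
         y when y != :invalid <- lookup(b) do
      {:ok, x + y}
    else
      _ -> :skip
    end
  end
end

WithDemo.check(1, 2)
# => {:ok, 6}

WithDemo.check(1, 0)
# => :skip
```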

Inside the main body of the with statement, we need to find the number of samples in the frame.

def parse_frame(...) do
  with ... do
    samples = lookup_samples_per_frame(version, layer)
  else
    ...
  end
end

The number of samples per frame is once again given by a lookup table, this time based only on the version and layer of the frame. As before, the version 2.5 constants are the same as the ones for version 2.

defp lookup_samples_per_frame(:version1, :layer1), do: 384
defp lookup_samples_per_frame(:version1, :layer2), do: 1152
defp lookup_samples_per_frame(:version1, :layer3), do: 1152
defp lookup_samples_per_frame(v, :layer1) when v in [:version2, :version25], do: 384
defp lookup_samples_per_frame(v, :layer2) when v in [:version2, :version25], do: 1152
defp lookup_samples_per_frame(v, :layer3) when v in [:version2, :version25], do: 576

Now, we have enough information to start calculating the actual byte size of the frame. This involves a bit of math that could be done all at once, but let’s break it down for clarity.

First, we need to know the duration of a single sample. We have the sampling rate, which is the number of samples per second, so dividing 1 by that value gives us the duration of an individual sample.

Then, since we know the number of samples in the entire frame, we can multiply that by the duration of an individual sample to get the total duration for the entire frame (this is the same value we’ll later add to the duration accumulator).
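
To put some numbers on that, here is a quick sketch (the module name is hypothetical) for the most common case, a version 1, layer 3 frame at 44.1 kHz.

```elixir
defmodule FrameDuration do
  # Hypothetical helper: seconds of audio contained in a single frame.
  def seconds(samples, sampling_rate), do: samples / sampling_rate
end

# 1152 samples at 44,100 samples per second is about 26 milliseconds.
duration = FrameDuration.seconds(1152, 44_100)

# Put the other way around, one second of audio takes roughly 38 frames.
frames_per_second = 1 / duration
```

So a seven and a half minute track is somewhere in the neighborhood of 17,000 frames.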

Next, we have the bitrate from the lookup function, but it’s in kilobits per second. When determining the frame size, we want the unit to be bytes, so we first multiply by 1000 to get bits per second, and then divide by 8 to get bytes per second.

The bytes per second value can then be multiplied by the frame duration to get the number of bytes. Finally, the frame may have padding so that, averaged over many frames, the bitrate exactly matches the size and duration. The padding is a single slot, the size of which depends on the layer: 4 bytes for layer 1, and 1 byte for layers 2 and 3.

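defp get_frame_size(samples, layer, kbps, sampling_rate, padding) do
  sample_duration = 1 / sampling_rate
  frame_duration = samples * sample_duration
  bytes_per_second = kbps * 1000 / 8
  slot_size = lookup_slot_size(layer)

  # A frame is a whole number of slots long, so round down to whole slots
  # rather than whole bytes (this only makes a difference for layer 1).
  slots = floor(frame_duration * bytes_per_second / slot_size)

  # The padding bit (0 or 1) adds one extra slot.
  (slots + padding) * slot_size
end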

defp lookup_slot_size(:layer1), do: 4
defp lookup_slot_size(l) when l in [:layer2, :layer3], do: 1

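One thing to note is the call to floor. All of the division changes the value into a floating point number, and not necessarily a whole one: a 128 kbps layer 3 frame at 44.1 kHz works out to about 417.96 bytes. The actual frame uses the rounded-down size, and the encoder sets the padding bit on some frames to make up the difference. Flooring also turns the value into an actual integer (1 rather than 1.0), which matters because using a floating point value as a size in a binary pattern will cause an error.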
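
If the floating point math feels fragile, the same size could be computed with integer division instead. The sketch below is an alternative rather than what the code above does, and the names are made up: it takes the slot size directly as an argument and rounds down to a whole number of slots, which is how I understand frame sizes to be defined.

```elixir
defmodule FrameSize do
  # Hypothetical integer-only variant of the frame size calculation.
  def bytes(samples, slot_size, kbps, sampling_rate, padding) do
    # samples / sampling_rate seconds, at kbps * 1000 / 8 bytes per second,
    # divided into slots and rounded down, all in one integer division.
    slots = div(samples * kbps * 1000, 8 * sampling_rate * slot_size)
    (slots + padding) * slot_size
  end
end

# Version 1, layer 3, 128 kbps, 44.1 kHz, without and with padding:
FrameSize.bytes(1152, 1, 128, 44_100, 0)
# => 417
FrameSize.bytes(1152, 1, 128, 44_100, 1)
# => 418

# Version 1, layer 1, 128 kbps, 44.1 kHz (4-byte slots):
FrameSize.bytes(384, 4, 128, 44_100, 0)
# => 136
```

Since div always returns an integer, there is no need for floor here.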

With that implemented, we can call it in the parse_frame implementation to get the number of bytes that we need to skip. We also perform the same calculation to get the frame duration. Then we can skip the requisite number of bytes of data, add the values we calculated to the various accumulators and recurse.

def parse_frame(...) do
  with ... do
    # ...
    frame_size = get_frame_size(samples, layer, bitrate, sampling_rate, padding)
    frame_duration = samples / sampling_rate
    <<_skipped::binary-size(frame_size), rest::binary>> = data
    parse_frame(rest, acc + frame_duration, frame_count + 1, offset + frame_size)
  else
    ...
  end
end
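
One caveat: the match that skips frame_size bytes raises a MatchError if fewer bytes than that remain, which I’d expect to happen if a file is cut off in the middle of its last frame. A more forgiving approach, sketched here with a hypothetical helper, is to check the size first and let the caller decide what to do when the data runs out.

```elixir
defmodule SafeSkip do
  # Hypothetical helper: drop size bytes from the front of the data,
  # or report that there aren't enough bytes left.
  def skip(data, size) when byte_size(data) >= size do
    <<_skipped::binary-size(size), rest::binary>> = data
    {:ok, rest}
  end

  def skip(_data, _size), do: :truncated
end

SafeSkip.skip(<<1, 2, 3>>, 2)
# => {:ok, <<3>>}

SafeSkip.skip(<<1>>, 2)
# => :truncated
```

Whether a truncated final frame should count towards the duration is a judgment call; simply returning the accumulator at that point seems reasonable.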

And with that, we can accurately find the duration of any MP3 file!

iex> data = File.read!("test.mp3")
iex> MP3.get_mp3_duration(data)
452.20571428575676 # 7 minutes and 32 seconds
