The MD5 (Message Digest 5) hash function explained

Overview


The MD5 (Message Digest Algorithm 5) is a widely used cryptographic hash function that takes an input message and produces a fixed-size (128-bit) hash value. It was developed by Ronald Rivest in 1991 and is commonly used for checksums, data integrity checks, and storing passwords.

The invention of the MD5 hash algorithm

Ronald Rivest, a renowned cryptographer, developed the MD5 (Message Digest Algorithm 5) hash function in 1991. Rivest’s motivation was to create a fast and efficient algorithm for producing a fixed-size hash value from arbitrary-length input messages.

The design principles for MD5 drew inspiration from Rivest’s previous work on the MD4 (Message Digest Algorithm 4) hash function. MD4 had certain vulnerabilities, and Rivest aimed to address those issues and create an improved version.

Rivest iteratively refined the algorithm, taking into consideration its speed, simplicity, and resistance against various cryptographic attacks. The development process involved analyzing the properties of the function and applying mathematical and logical transformations to achieve the desired cryptographic properties, such as preimage resistance and collision resistance.

Through his expertise in cryptography and algorithm design, Ronald Rivest was able to create MD5 as a widely used hash function. However, over time, vulnerabilities were discovered that rendered MD5 unsuitable for secure cryptographic applications.

Despite its shortcomings, MD5 still finds limited use today for non-security-related purposes, such as checksums for data integrity checks or non-cryptographic hash tables. Nonetheless, for secure applications, it is recommended to use stronger hash functions that are resistant to known vulnerabilities and attacks.

How the MD5 hash function works

Here’s a simplified explanation of how the MD5 algorithm works:

  1. Padding: The input message is padded to ensure its length is a multiple of 512 bits (64 bytes) minus 64 bits (8 bytes). Padding is necessary because MD5 operates on fixed-size blocks.
  2. Initialization: MD5 uses four 32-bit registers (A, B, C, D) and four constant values. The initial values of these registers are predetermined and serve as the „state“ of the algorithm.
  3. Processing: The padded message is divided into 512-bit blocks. The algorithm processes these blocks one at a time, performing a series of bitwise logical operations, such as AND, OR, XOR, and NOT, as well as modular addition and rotation operations.
    • Round 1: A series of four rounds is performed on each block. Each round consists of 16 operations, using a different nonlinear function for each operation. These functions take three 32-bit words (X, Y, Z) and produce an output based on bitwise logical operations and the modular addition operation.
    • Round 2: The operations in this round are similar to Round 1 but use different functions and a different order of calculations.
    • Round 3: Similar to Round 2, this round has a different set of functions and a different order of calculations.
    • Round 4: The operations in this round are the same as in Round 1, but the functions used are different.
  4. Finalization: Once all the blocks have been processed, the resulting values in the A, B, C, and D registers are concatenated to form the 128-bit MD5 hash value.

Let’s dive into the detailed generation process of the MD5 hash algorithm

  1. Padding:
    • The input message is divided into 512-bit (64-byte) blocks.
    • A „1“ bit is appended to the message, followed by enough „0“ bits to ensure the message length (in bits) is congruent to 448 modulo 512.
    • Then, a 64-bit representation of the original message length (before padding) is appended to the end of the padded message.
  2. Initialization:
    • MD5 uses four 32-bit registers (A, B, C, D) and four constant 32-bit values (K).
    • The initial values of these registers are predetermined and serve as the initial state:
      • A = 0x67452301
      • B = 0xEFCDAB89
      • C = 0x98BADCFE
      • D = 0x10325476
  3. Processing:
    • The padded message is processed in 16-operation rounds, repeated four times (total of 64 operations).
    • Each round consists of 16 operations, labeled from 0 to 15, using a different nonlinear function for each operation.
    • Round 1:
      • Operation 0: Each 32-bit block of the current message block is denoted as X[i], where i ranges from 0 to 15. X[i] is transformed as follows:
        • X[i] := X[i] AND 0xFF
        • X[i] := X[i] OR (X[i] << 8) // Perform a left rotation by 8 bits
        • X[i] := X[i] OR (X[i] << 16) // Perform a left rotation by 16 bits
      • Operation 1: The following operations are performed on the four registers (A, B, C, D) in a loop:
        • A := (A + ((B AND C) OR (NOT B AND D)) + X[i] + K[1]) <<< s, where <<< denotes a left circular shift by s bits.
        • A is then assigned to D, D to C, C to B, and B to A.
      • Operations 2-15: These operations are similar to Operation 1, but the value of K and the functions applied to B, C, and D change.
    • Round 2, 3, and 4: These rounds follow the same structure as Round 1 but use different constants (K) and different nonlinear functions.
  4. Finalization:
    • The resulting values in registers A, B, C, and D are concatenated to form the 128-bit MD5 hash value.
    • The resulting hash value is typically represented as a sequence of 32 hexadecimal digits (each representing 4 bits).

Example walktrough

Let’s walk through an example of how the MD5 algorithm converts the message „Welcome to hashgenerator.de“ into a hash value:

  1. Message: „Welcome to hashgenerator.de“
    • The message consists of 26 characters and is encoded using ASCII or Unicode.
  2. Padding:
    • The message length is 26 characters or 208 bits.
    • To ensure the length is congruent to 448 modulo 512, we need to add padding bits.
    • A „1“ bit is appended to the message: „Welcome to hashgenerator.de1“.
    • Then, enough „0“ bits are added until the length is 448 bits minus the 64-bit representation of the original message length.
    • In this case, the 64-bit representation of the length is „0000000000000000000000000000000000000000000000000000000000011000“ (in binary).
    • After padding, the message becomes: „Welcome to hashgenerator.de1“ + 56 bytes of zero bits + „0000000000000000000000000000000000000000000000000000000000011000“.
  3. Initialization:
    • The initial state of the MD5 algorithm is as follows:
      • A = 0x67452301
      • B = 0xEFCDAB89
      • C = 0x98BADCFE
      • D = 0x10325476
  4. Processing:
    • The padded message is processed in 16-operation rounds, repeated four times (total of 64 operations), as described earlier.
  5. Finalization:
    • After processing all blocks, the resulting values in registers A, B, C, and D are concatenated to form the 128-bit MD5 hash value.

For the given message „Welcome to hashgenerator.de“, the MD5 hash value is: 7b6c0d2e632cb197d905eb52b9f1c8d1

Vulnerabilities

MD5 or Mesage Digest 5 has several known vulnerabilities that make it insecure for cryptographic purposes. Here are some of the notable vulnerabilities:

  1. Collision Vulnerabilities: MD5 is susceptible to collision attacks, where two different inputs can produce the same hash value. This means that an attacker can intentionally find different messages with the same MD5 hash, which undermines the integrity and security of the algorithm. Collisions can be generated relatively quickly using advanced techniques, such as the birthday attack.
  2. Preimage Attacks: MD5 is also vulnerable to preimage attacks, where an attacker can find a different input that produces a specific desired hash value. This makes it possible for an attacker to create a fraudulent message with the same MD5 hash as a legitimate message, without knowing the original message.
  3. Length Extension Attacks: MD5’s structure allows for length extension attacks. In such attacks, given the hash value of a message and its length, an attacker can easily append additional data to the message without knowing its original content while maintaining the same hash value.

Due to these vulnerabilities, MD5 is considered broken for cryptographic purposes. It is no longer recommended for applications such as password hashing, digital signatures, or secure data transmission. Instead, more secure hash functions, like SHA-256, SHA-3, or bcrypt, should be used to ensure data integrity and security.

Collision Example

One of the most famous examples of a real-life MD5 hash collision is the collision found for two different PDF files. In 2004, researchers Marc Stevens, Arjen Lenstra, and Benne de Weger demonstrated a practical collision attack on MD5 using two different PDF documents. The two PDF files with different content produced the same MD5 hash value. Here are the details of the collision:

First PDF File:

File Name: chosen-prefix1.pdf
MD5 Hash: 79054025255fb1a26e4bc422aef54eb4

Second PDF File:

File Name: chosen-prefix2.pdf
MD5 Hash: 79054025255fb1a26e4bc422aef54eb4

Despite having different content, both files had an MD5 hash of 79054025255fb1a26e4bc422aef54eb4. This finding highlighted the vulnerability of MD5 for secure applications.

Xiaoyun Wang and Hongbo Yu of the Shandong University in China published an article in 2005 in which they describe the way to find to different messages, that lead to the same MD5 hash. One famous such pair is the following:

d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89 
55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b 
d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0 
e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70

and

d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89 
55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b 
d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0 
e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70 

Both blocks have the MD5 hash 79054025255fb1a26e4bc422aef54eb4. Peter Selinger describes the process very detailed in an article on his university website.

Example Implementation in PHP

<?php

function md5Hash($message) {
    $padding = array(0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00);
    $bits = strlen($message) * 8;
    $message .= chr(0x80);
    
    while ((strlen($message) * 8) % 512 != 448) {
        $message .= chr(0x00);
    }
    
    $message .= pack("N", $bits & 0xFFFFFFFF);
    $message .= pack("N", ($bits >> 32) & 0xFFFFFFFF);
    
    $chunks = str_split($message, 64);
    $a = 0x67452301;
    $b = 0xEFCDAB89;
    $c = 0x98BADCFE;
    $d = 0x10325476;
    
    foreach ($chunks as $chunk) {
        $words = array();
        for ($i = 0; $i < 64; $i += 4) {
            $words[] = ord($chunk[$i]) | (ord($chunk[$i + 1]) << 8) | (ord($chunk[$i + 2]) << 16) | (ord($chunk[$i + 3]) << 24);
        }
        
        $f = function ($x, $y, $z) {
            return (($x & $y) | (~$x & $z));
        };
        $g = function ($x, $y, $z) {
            return (($x & $z) | ($y & ~$z));
        };
        $h = function ($x, $y, $z) {
            return ($x ^ $y ^ $z);
        };
        $i = function ($x, $y, $z) {
            return ($y ^ ($x | ~$z));
        };
        
        $rotateLeft = function ($x, $n) {
            return (($x << $n) | ($x >> (32 - $n)));
        };
        
        $shifts = array(7, 12, 17, 22, 5, 9, 14, 20, 4, 11, 16, 23, 6, 10, 15, 21);
        $functions = array($f, $g, $h, $i);
        
        $aa = $a;
        $bb = $b;
        $cc = $c;
        $dd = $d;
        
        for ($j = 0; $j < 64; $j++) {
            $index = floor($j / 16);
            $m = $words[$j];
            $f = $functions[$index];
            $s = $shifts[($index * 4 + $j) % 16];
            
            switch ($index) {
                case 0:
                    $tmp = $a + $f($b, $c, $d) + $m;
                    break;
                case 1:
                    $tmp = $a + $f($b, $c, $d) + $m + 0x5A827999;
                    break;
                case 2:
                    $tmp = $a + $f($b, $c, $d) + $m + 0x6ED9EBA1;
                    break;
                case 3:
                    $tmp = $a + $f($b, $c, $d) + $m + 0x8F1BBCDC;
                    break;
            }
            
            $tmp = $rotateLeft($tmp, $s);
            $tmp += $b;
            
            $a = $d;
            $d = $c;
            $c = $b;
            $b = $tmp;
        }
        
        $a += $aa;
        $b += $bb;
        $c += $cc;
        $d += $dd;
    }
    
    $output = pack("N", $a) . pack("N", $b) . pack("N", $c) . pack("N", $d);
    $hash = bin2hex($output);
    
    return $hash;
}

// Example usage
$message = "Hello, world!";
$hash = md5Hash($message);
echo "MD5 hash of '$message': $hash";
?>

This implementation follows the steps of the MD5 algorithm, including message padding, chunk processing, and the four rounds of operations. The final output is a 32-character hexadecimal representation of the MD5 hash.