When working with image directories, especially in content-heavy applications or media galleries, duplicate files waste storage space and clutter the filesystem. With a few lines of PHP, we can detect these duplicates automatically and move them out of the way using a hash-based approach.

In this post, we’ll walk through a simple PHP script that finds duplicate images based on their file content (not just the name) using the MD5 hashing algorithm, and moves them into a separate folder.


✅ What This Script Does

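  • Scans every file in a given directory. It does not filter by extension, so point it at a folder that contains only images.
  • Computes an MD5 checksum of each file's contents.
  • If the checksum has already been seen, treats the file as a duplicate of the earlier one.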
  • Moves duplicates to a separate directory (duplicate_images/).

🛠️ The PHP Script

<?php 
// Base directory. Single quotes keep the backslash literal.
$mdir = 'D:\justest';
// Directory containing the images to scan
$dir = $mdir . DIRECTORY_SEPARATOR . 'images';
// Directory that receives the duplicates (must already exist)
$dupDir = $mdir . DIRECTORY_SEPARATOR . 'duplicate_images';

// Hashes seen so far, stored as array keys so lookups stay fast
$checksums = array();

if ($h = opendir($dir)) {
    while (($file = readdir($h)) !== false) {
        // DIRECTORY_SEPARATOR keeps the path valid on every platform
        $path = $dir . DIRECTORY_SEPARATOR . $file;

        // Skip directories (this also skips "." and "..")
        if (is_dir($path)) continue;

        // Generate MD5 hash of the file contents
        $hash = hash_file('md5', $path);
        if ($hash === false) continue; // file could not be read

        // Check if this hash has already been encountered
        if (isset($checksums[$hash])) {
            // Move the duplicate file to the duplicates folder.
            // The target is built from $dupDir and the file name, so a
            // file name that contains "images" is left unchanged.
            rename($path, $dupDir . DIRECTORY_SEPARATOR . $file);
        } else {
            // Remember the hash and the first file that produced it
            $checksums[$hash] = $file;
        }
    }
    closedir($h);
}

// Output the hash => file name map (optional)
print_r($checksums);
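
Before moving anything, it can help to see which files the script would treat as duplicates. The following is a minimal sketch of such a dry run, not part of the script above: findDuplicateGroups() is a hypothetical helper name, and it assumes a flat directory with no subfolders to descend into.

```php
<?php
/**
 * Group the regular files in $dir by the MD5 hash of their contents and
 * return only the groups that contain more than one file.
 * (findDuplicateGroups is a hypothetical helper name used for illustration.)
 */
function findDuplicateGroups(string $dir): array
{
    $groups = [];

    // scandir() returns the entry names sorted alphabetically by default
    $names = scandir($dir);
    if ($names === false) {
        return [];
    }

    foreach ($names as $name) {
        $path = $dir . DIRECTORY_SEPARATOR . $name;

        // Skips ".", ".." and any subdirectories
        if (!is_file($path)) {
            continue;
        }

        $hash = hash_file('md5', $path);
        if ($hash !== false) {
            $groups[$hash][] = $name;
        }
    }

    // Keep only the hashes shared by two or more files
    return array_filter($groups, function ($files) {
        return count($files) > 1;
    });
}
```

Calling findDuplicateGroups($dir) returns an array keyed by hash, where each value lists the file names that share that content. Nothing is moved or deleted, so the result can be reviewed first.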

📂 Folder Structure Before & After

Before:

D:\justest\
│
├── images\
│   ├── img1.jpg
│   ├── img1_copy.jpg  ← duplicate
│   ├── img2.png

After:

D:\justest\
│
├── images\
│   ├── img1.jpg
│   ├── img2.png
│
├── duplicate_images\
│   ├── img1_copy.jpg

🔒 Why Use Hashing?

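Hashing with MD5 lets us compare files by content rather than by name or size alone: two files with the same bytes always produce the same checksum. MD5 is not suitable for cryptographic security, because collisions can be crafted deliberately, but it is fast and accidental collisions are extremely unlikely, which makes it a reasonable choice for spotting duplicate files. If the files come from untrusted sources, hash_file() also accepts stronger algorithms such as 'sha256'.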
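
To make the comparison concrete, here is a small sketch that checks whether two files have the same content. sameContent() is a hypothetical helper, not something the script above defines. The size check is an added shortcut: files of different sizes cannot be identical, so hashing them is unnecessary.

```php
<?php
/**
 * Return true when two files produce the same hash.
 * (sameContent is a hypothetical helper; $algo can be any algorithm
 * name that hash_file() supports, for example 'md5' or 'sha256'.)
 */
function sameContent(string $a, string $b, string $algo = 'md5'): bool
{
    // Files of different sizes cannot be identical, so skip hashing them
    if (filesize($a) !== filesize($b)) {
        return false;
    }

    $hashA = hash_file($algo, $a);
    $hashB = hash_file($algo, $b);

    // hash_file() returns false when a file cannot be read
    return $hashA !== false && $hashA === $hashB;
}
```

Passing 'sha256' as the third argument switches to the stronger algorithm without changing anything else.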


📌 Notes

  • Ensure that the duplicate_images directory exists beforehand, or add logic to create it. rename() fails with a warning if the target folder is missing.
  • The logic is platform-independent, but the example uses a Windows-style path. Change $mdir accordingly for Linux/macOS.
  • You can replace rename() with unlink() to delete duplicates instead of moving them. unlink() takes only the path of the file to delete, so drop the destination argument.
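
The first and last notes can be combined into one helper. This is only a sketch under stated assumptions: disposeDuplicate() and its parameters are hypothetical names, the 0755 permission is an arbitrary choice, and refusing to overwrite an existing file in the duplicates folder is a design decision you may want to change.

```php
<?php
/**
 * Move a duplicate into $dupDir, creating the directory if needed,
 * or delete the file when $delete is true.
 * (disposeDuplicate and its parameters are hypothetical names.)
 */
function disposeDuplicate(string $file, string $dupDir, bool $delete = false): bool
{
    if ($delete) {
        // unlink() takes only the path of the file to remove
        return unlink($file);
    }

    // Create the target directory (and any missing parents) on first use
    if (!is_dir($dupDir) && !mkdir($dupDir, 0755, true) && !is_dir($dupDir)) {
        return false;
    }

    $target = $dupDir . DIRECTORY_SEPARATOR . basename($file);

    // Refuse to overwrite a file that is already in the duplicates folder
    if (file_exists($target)) {
        return false;
    }

    return rename($file, $target);
}
```

The function returns true on success and false otherwise, so the caller can decide whether to log the failure or stop.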

💡 Wrapping Up

With this quick script, you can clean up your image directories automatically, saving storage and keeping your media organized. Extend this script further by integrating it into your CMS, setting up cron jobs, or adding a logging system.

