When working with image directories—especially in content-heavy applications or media galleries—duplicate files can waste storage space and clutter your filesystem. Thankfully, with a few lines of PHP, we can automatically detect and move these duplicate files using a hash-based approach.
In this post, we’ll walk through a simple PHP script that moves duplicate images out of a directory based on their file content (not just the name), using the md5 hashing algorithm.
✅ What This Script Does
- Scans a given directory for files (it does not filter by type, so the folder should contain only images).
- Computes an MD5 checksum for each file.
- If the checksum already exists, it recognizes the file as a duplicate.
- Moves duplicates to a separate directory (duplicate_images/).
🛠️ The PHP Script
<?php
// Convert forward slashes to the platform's directory separator
function platformSlashes($path) {
    return str_replace('/', DIRECTORY_SEPARATOR, $path);
}

$mdir = 'D:\\justest\\';                // Base directory
$dir = $mdir . 'images';                // Directory containing images
$dup_base = $mdir . 'duplicate_images'; // Directory that receives duplicates
$checksums = array();

if ($h = opendir($dir)) {
    while (($file = readdir($h)) !== false) {
        // Skip directories (including "." and "..")
        if (is_dir($_ = "{$dir}/{$file}")) continue;

        // Normalize file path
        $main_dir = platformSlashes($_);

        // Generate MD5 hash for the file; skip files that cannot be read
        $hash = hash_file('md5', $main_dir);
        if ($hash === false) continue;

        // Destination path for duplicates: same file name, different folder.
        // Built from the file name so that "images" inside a file name is left alone.
        $dup_dir = platformSlashes("{$dup_base}/{$file}");

        // Check if this hash has already been encountered
        if (isset($checksums[$hash])) {
            // Move the duplicate file to another folder
            rename($main_dir, $dup_dir);
        } else {
            // Store the hash as a key (value: the file that is kept)
            $checksums[$hash] = $file;
        }
    }
    closedir($h);
}

// Output the checksum => kept file map (optional)
print_r($checksums);
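The script above hashes every file in the folder and moves duplicates right away. If you would rather review the duplicates first, or only look at image extensions, the same idea can be wrapped in a function. The sketch below is a minimal variant, not part of the original script: the function name findDuplicateImages and the extension list are assumptions for illustration. It uses scandir(), which returns names in alphabetical order by default, so the file that counts as the "original" is predictable.

```php
<?php
/**
 * Group the files in $dir by MD5 checksum and report the duplicates.
 * Hypothetical helper, not part of the original script.
 *
 * @param string   $dir        Directory to scan (not recursive)
 * @param string[] $extensions Lower-case extensions treated as images (assumed list)
 * @return array               Kept file name => list of duplicate file names
 */
function findDuplicateImages(string $dir, array $extensions = ['jpg', 'jpeg', 'png', 'gif', 'webp']): array
{
    $names = scandir($dir); // sorted alphabetically by default
    if ($names === false) {
        return [];
    }

    $firstSeen = [];  // checksum => first file name with that content
    $duplicates = []; // kept file name => names of its duplicates

    foreach ($names as $name) {
        $path = $dir . DIRECTORY_SEPARATOR . $name;
        if (!is_file($path)) {
            continue; // skips ".", ".." and subdirectories
        }

        // Only consider files whose extension is in the image list
        $ext = strtolower(pathinfo($name, PATHINFO_EXTENSION));
        if (!in_array($ext, $extensions, true)) {
            continue;
        }

        $hash = hash_file('md5', $path);
        if ($hash === false) {
            continue; // unreadable file
        }

        if (isset($firstSeen[$hash])) {
            $duplicates[$firstSeen[$hash]][] = $name;
        } else {
            $firstSeen[$hash] = $name;
        }
    }

    return $duplicates;
}
```

Calling findDuplicateImages('D:\\justest\\images') returns an array you can print or loop over before deciding whether to move or delete anything.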
📂 Folder Structure Before & After
Before:
D:\justest\
│
├── images\
│ ├── img1.jpg
│ ├── img1_copy.jpg ← duplicate
│ ├── img2.png
After:
D:\justest\
│
├── images\
│ ├── img1.jpg
│ ├── img2.png
│
├── duplicate_images\
│ ├── img1_copy.jpg
🔒 Why Use Hashing?
Using a hashing function like md5 lets us compare files based on content rather than name or size alone. While md5 is not suitable for cryptographic security, it’s fast and ideal for checksumming in file comparisons.
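To make the "content, not name" point concrete, here is a small sketch of a pairwise comparison. The helper name filesHaveSameContent is hypothetical. It checks the file size first, since files of different sizes cannot be identical, and only then hashes both files. The algorithm is a parameter, so you could pass 'sha256' instead of 'md5' if you are worried about deliberately crafted collisions.

```php
<?php
// Hypothetical helper: true when two files hold identical bytes,
// judged by size and then by checksum.
function filesHaveSameContent(string $a, string $b, string $algo = 'md5'): bool
{
    // Different sizes can never be identical, so skip hashing entirely
    if (filesize($a) !== filesize($b)) {
        return false;
    }

    $hashA = hash_file($algo, $a);
    $hashB = hash_file($algo, $b);

    // hash_file() returns false on failure; treat that as "not the same"
    return $hashA !== false && $hashA === $hashB;
}
```

Two files named img1.jpg and img1_copy.jpg with the same bytes compare as equal here, while two files with the same size but different pixels do not.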
📌 Notes
- Ensure that the duplicate_images directory exists beforehand, or add logic to create it.
- The script works on all platforms, but the example uses a Windows-style path. Modify the paths accordingly for Linux/macOS.
- You can replace rename() with unlink() to delete duplicates instead of moving them. Note that unlink() takes only the source path.
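The first note can be handled with a few lines of code. Below is a minimal sketch of a move helper that creates the target folder on first use and avoids overwriting a file that is already there. The name moveToDuplicates and the "_1", "_2" suffix scheme are assumptions for illustration, not something the original script does.

```php
<?php
// Hypothetical helper: move $source into $targetDir, creating the folder if needed.
// Returns the final path, or null if the folder could not be created or the move failed.
function moveToDuplicates(string $source, string $targetDir): ?string
{
    // Create the folder (and any missing parents) on first use
    if (!is_dir($targetDir) && !mkdir($targetDir, 0775, true) && !is_dir($targetDir)) {
        return null;
    }

    $info = pathinfo($source);
    $ext = isset($info['extension']) ? '.' . $info['extension'] : '';
    $target = $targetDir . DIRECTORY_SEPARATOR . $info['basename'];

    // Avoid overwriting an earlier duplicate that has the same name
    for ($i = 1; file_exists($target); $i++) {
        $target = $targetDir . DIRECTORY_SEPARATOR . $info['filename'] . "_{$i}" . $ext;
    }

    return rename($source, $target) ? $target : null;
}
```

In the main loop, the rename() call could be swapped for moveToDuplicates($main_dir, $mdir . 'duplicate_images'), which removes the need to create the folder by hand.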
💡 Wrapping Up
With this quick script, you can clean up your image directories automatically, saving storage and keeping your media organized. Extend this script further by integrating it into your CMS, setting up cron jobs, or adding a logging system.