DataChecker

Welcome!

DataChecker is an open-source command-line tool that helps users save space, fix data and improve security. It is simple, fast and easy to use.

🛠️ Features

▪️ Finds duplicate files using two- or three-stage filtering

▪️ Finds all symbolic links and shortcuts within a folder

▪️ Creates or verifies file integrity using 26 algorithms, running in parallel or sequentially

▪️ Finds temporary files such as logs and lock files

▪️ Finds files potentially containing confidential data such as passwords and tokens

▪️ Finds losslessly compressed files with poor compression ratios

▪️ Finds files with repeated characters in their names

▪️ Finds empty files

▪️ Finds unusually large files

▪️ Finds files that haven't been accessed in a long time

▪️ Finds files stored in legacy or obsolete formats

▪️ Validates magic numbers against declared file extensions

▪️ Infers file types by inspecting file content

▪️ Validates JSON files and reports parsing errors

▪️ Finds files with timestamps set in the future

▪️ Finds empty directories

▪️ Finds directories with an excessive number of entries

▪️ Finds directories with only a single entry

▪️ Finds files or directories with excessively long names

▪️ Finds files or directories with excessively long paths

▪️ Finds non-portable or invalid characters in names and paths

📄 Requirements

  • FreeBSD, Linux, NetBSD or Windows
  • ANSI-compatible terminal such as kitty, xterm, Alacritty or Windows Terminal

📥 Download

Prebuilt binaries for x86_64 are available here. If no binary is available for your system, you can build from source. To verify the downloaded file, place the .sha3_256 file in the same folder as the compressed file and run:

datachecker -i <download folder>

🚀 Usage

Just run:

datachecker <directory>

To run just one command (see commands section below):

datachecker <command> <directory>

If you want to configure the default behavior, create a default configuration:

datachecker config

A config.json file will be created in the same folder. If you set the INPUT_FOLDER variable, just run again without any parameter:

datachecker

Tip: if your results contain too many issues, disable colored output and redirect stderr to a file:

datachecker -nc . 2> results.txt

▪️ Duplicate files

To list all duplicate files in a folder using two filters (filter by size and compare bytes):

datachecker -d <folder>

To use three filters (filter by size, calculate hash and compare bytes):

datachecker -dmt <folder>

Choose the best option based on your data.
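
The two approaches can be sketched in Python as follows. This is only an illustration of the filtering idea, not DataChecker's actual implementation: the names find_duplicates and file_hash are hypothetical, and SHA-256 is an arbitrary choice for the hash stage.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_hash(path, chunk_size=65536):
    """Hash a file in chunks (SHA-256 is an assumption made for this sketch)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder, use_hash=False):
    """Return groups of byte-identical files found under folder."""
    # Stage 1: only files of equal size can be duplicates.
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)
    candidates = [paths for paths in by_size.values() if len(paths) > 1]

    # Optional extra stage: split each size group by content hash.
    if use_hash:
        by_hash = defaultdict(list)
        for paths in candidates:
            for path in paths:
                by_hash[(os.path.getsize(path), file_hash(path))].append(path)
        candidates = [paths for paths in by_hash.values() if len(paths) > 1]

    # Final stage: confirm every candidate with a byte-by-byte comparison.
    groups = []
    for paths in candidates:
        remaining = sorted(paths)
        while remaining:
            first, rest = remaining[0], remaining[1:]
            same = [first] + [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if len(same) > 1:
                groups.append(same)
            remaining = [p for p in rest if p not in same]
    return groups
```

In this sketch the hash stage pays off when many files share a size but differ in content, because it avoids most pairwise comparisons. When same-size files are rare, the two-stage variant does less work.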

▪️ Symbolic links and shortcuts

To list all symbolic links and shortcuts:

datachecker -ls <folder>

Symbolic links and shortcuts are not portable between filesystems, so DataChecker displays a warning when it finds them.

▪️ File integrity

The same command both creates and verifies hashes:

datachecker -i <folder>

or in parallel:

datachecker -imt <folder>

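This writes a hash into every empty file whose name is a target file name followed by an algorithm extension. For example, given myimage.jpg and an empty myimage.jpg.sha256, DataChecker writes the SHA-256 hash of myimage.jpg into myimage.jpg.sha256. If myimage.jpg.sha256 already contains a SHA-256 hash, it is used to verify myimage.jpg instead.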
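
The create-or-verify behavior can be sketched in Python as below. This is a minimal illustration, not DataChecker's code: the function name create_or_verify and its return values are hypothetical, and the sketch only handles extensions that Python's hashlib recognizes by name (such as sha256 or sha3_256).

```python
import hashlib
import os


def create_or_verify(hash_path):
    """Fill an empty hash file, or verify the target against a non-empty one.

    Returns "created", "ok" or "mismatch" (labels invented for this sketch).
    """
    # "myimage.jpg.sha256" -> target "myimage.jpg", algorithm "sha256".
    target, extension = os.path.splitext(hash_path)
    algorithm = extension.lstrip(".")

    # Hash the target file in chunks so large files do not fill memory.
    digest = hashlib.new(algorithm)
    with open(target, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    computed = digest.hexdigest()

    with open(hash_path, "r", encoding="ascii") as handle:
        stored = handle.read().strip()

    if not stored:
        # Empty hash file: store the hexadecimal hash.
        with open(hash_path, "w", encoding="ascii") as handle:
            handle.write(computed)
        return "created"
    # Non-empty hash file: compare the stored value with the computed one.
    return "ok" if stored.lower() == computed else "mismatch"
```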

▪️ Temporary files

To list all temporary files:

datachecker -tf <folder>

This will list files considered temporary based on their extension. Review the results to determine whether they are still needed or can be safely deleted.

▪️ Confidential files

To list all files matching any confidential pattern:

datachecker -cf <folder>

To define which patterns to search for, create or edit a config.json file.

▪️ Compressed files

To list all losslessly compressed files with poor compression ratios:

datachecker -c <folder>

Recompressing them at the highest compression level could save a significant amount of disk space.

▪️ Repeated characters

To list all files with repeated characters such as .. or __ in their names:

datachecker -dc <folder>

▪️ Empty files

To list all files with no content:

datachecker -ef <folder>

▪️ Large files

To list all files exceeding LARGE_FILE_SIZE bytes (configured in config.json):

datachecker -lf <folder>

▪️ Inactive files

To list all files not accessed for longer than LAST_ACCESS_TIME nanoseconds (configured in config.json):

datachecker -l <folder>
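
Because LAST_ACCESS_TIME is expressed in nanoseconds and LARGE_FILE_SIZE in bytes, custom thresholds are easy to get wrong by a few zeros. The following Python sketch shows the arithmetic behind the default values and how a different threshold could be derived; the helper name days_to_nanoseconds is hypothetical.

```python
# Size threshold: 100 GiB expressed in bytes.
GIB = 1024 ** 3
large_file_size = 100 * GIB
print(large_file_size)  # prints 107374182400

# Age threshold: a number of days expressed in nanoseconds.
NANOSECONDS_PER_SECOND = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_nanoseconds(days):
    """Convert whole days to nanoseconds."""
    return days * SECONDS_PER_DAY * NANOSECONDS_PER_SECOND


print(days_to_nanoseconds(365))  # prints 31536000000000000
print(days_to_nanoseconds(90))   # prints 7776000000000000
```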

▪️ Legacy or obsolete formats

To list all files in a legacy or obsolete format:

datachecker -lg <folder>

▪️ Magic number mismatch

To list all files whose magic number does not match their extension:

datachecker -m <folder>
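
As an illustration of the idea, the Python sketch below compares the first bytes of a file with the signature expected for its extension. The table is a small hypothetical sample and the function name magic_matches is invented for this example; DataChecker's own signature list is not shown here.

```python
import os

# A tiny sample of well-known file signatures (real tools know many more).
MAGIC_NUMBERS = {
    ".png": b"\x89PNG\r\n\x1a\n",
    ".gif": b"GIF8",
    ".pdf": b"%PDF-",
    ".zip": b"PK\x03\x04",  # non-empty ZIP archives
    ".jpg": b"\xff\xd8\xff",
}


def magic_matches(path):
    """Return True or False for known extensions, None when the extension is unknown."""
    extension = os.path.splitext(path)[1].lower()
    expected = MAGIC_NUMBERS.get(extension)
    if expected is None:
        return None
    # Read only as many bytes as the expected signature has.
    with open(path, "rb") as handle:
        return handle.read(len(expected)) == expected
```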

▪️ Files with no extension

To list all files without an extension and attempt to identify their format based on content:

datachecker -n <folder>

▪️ Invalid JSON files

To list all JSON files containing errors:

datachecker -j <folder>
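
A comparable check can be sketched with Python's standard json module. The function name json_error is hypothetical, and Python's parser is more lenient than strict JSON in a few cases (it accepts NaN and Infinity by default), so the results may differ from DataChecker's.

```python
import json


def json_error(path):
    """Return None if the file parses as JSON, else a short error description."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            json.load(handle)
    except json.JSONDecodeError as error:
        # lineno and colno point at the position where parsing failed.
        return f"line {error.lineno}, column {error.colno}: {error.msg}"
    except UnicodeDecodeError as error:
        return f"not valid UTF-8: {error.reason}"
    return None
```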

▪️ Wrong timestamps

To list all files with timestamps set in the future:

datachecker -w <folder>
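
The check amounts to comparing each timestamp of a file with the current time, as in this Python sketch. The function name future_timestamps and the labels it returns are hypothetical.

```python
import os
import time


def future_timestamps(path, now=None):
    """Return the labels of the timestamps of path that lie after now."""
    now = time.time() if now is None else now
    info = os.stat(path)
    stamps = {
        "access": info.st_atime,
        "modification": info.st_mtime,
        # st_ctime is the creation time on Windows but the
        # metadata-change time on Unix-like systems.
        "change_or_creation": info.st_ctime,
    }
    return [label for label, value in stamps.items() if value > now]
```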

▪️ Empty directories

To list all empty directories:

datachecker -e <folder>

▪️ Oversized directories

To list all directories with more entries than MAX_ITEMS_DIRECTORY (configured in config.json):

datachecker -mi <folder>

▪️ Single-entry directories

To list all directories containing only a single file or subdirectory:

datachecker -o <folder>

▪️ Long names

To list all files or directories whose name exceeds MAX_DIR_FILE_NAME_SIZE bytes (configured in config.json):

datachecker -ds <folder>
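
The configuration section notes that the limit is measured in bytes and that emojis and other UTF-8 characters may take more than one byte. The Python sketch below shows the difference between character count and byte count; the function names are hypothetical, and the default of 200 mirrors the documented default.

```python
def name_size_in_bytes(name):
    """Length of a file or directory name once encoded as UTF-8."""
    return len(name.encode("utf-8"))


def exceeds_limit(name, max_bytes=200):
    """True when the UTF-8 encoded name is longer than max_bytes."""
    return name_size_in_bytes(name) > max_bytes


# ASCII only: one byte per character.
print(len("report.txt"), name_size_in_bytes("report.txt"))  # prints 10 10
# Two accented letters (U+00E9) take two bytes each.
print(len("r\u00e9sum\u00e9.txt"), name_size_in_bytes("r\u00e9sum\u00e9.txt"))  # prints 10 12
# One emoji (U+1F4C4) takes four bytes.
print(len("\U0001F4C4.txt"), name_size_in_bytes("\U0001F4C4.txt"))  # prints 5 8
```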

▪️ Long paths

To list all files or directories whose absolute path exceeds MAX_FULL_PATH_SIZE bytes (configured in config.json):

datachecker -f <folder>

▪️ Non-portable or invalid paths

To list all files or directories with invalid or non-portable characters in their path:

datachecker -u <folder>
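
The exact character set DataChecker reports is not listed here. As a rough illustration only, the Python sketch below flags characters that Windows does not allow in file names, plus ASCII control characters; both the rule set and the function name unportable_chars are assumptions.

```python
# Characters that Windows rejects in file names; most Unix-like
# filesystems accept them, which makes such names non-portable.
WINDOWS_RESERVED = set('<>:"/\\|?*')


def unportable_chars(name):
    """Return the sorted list of characters in name that may not be portable."""
    return sorted(
        char for char in set(name)
        if char in WINDOWS_RESERVED or ord(char) < 32
    )
```

For example, unportable_chars("a:b?.txt") returns [":", "?"], while a plain name such as "report.txt" returns an empty list.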

🔀 Command aliases

▪️ --duplicate, -d, -D, duplicate, /D, DUPLICATE

Search for duplicate files

▪️ --duplicate_mt, -dmt, -DMT, duplicate_mt, /DMT, DUPLICATE_MT

Search for duplicate files using multithreading

▪️ --links, -ls, -LS, links, /LS, LINKS

Search for links and shortcuts

▪️ --integrity, -i, -I, integrity, /I, INTEGRITY

Create or verify file hashes

▪️ --integrity_mt, -imt, -IMT, integrity_mt, /IMT, INTEGRITY_MT

Create or verify file hashes using multithreading

▪️ --temp, -tf, -TF, temp, /TF, TEMP

Search for temporary files

▪️ --conf, -cf, -CF, conf, /CF, CONF

Search for confidential data in files (create config.json to customize search patterns)

▪️ --compressed, -c, -C, compressed, /C, COMPRESSED

Search for optimization opportunities in lossless compressed files

▪️ --dupchars, -dc, -DC, dupchars, /DC, DUPCHARS

Search for duplicate characters in filenames

▪️ --empty, -ef, -EF, empty, /EF, EMPTY

Search for empty files

▪️ --large, -lf, -LF, large, /LF, LARGE

Search for large files (create config.json to customize size threshold)

▪️ --last, -l, -L, last, /L, LAST

Search for files not accessed recently (create config.json to customize time period)

▪️ --legacy, -lg, -LG, legacy, /LG, LEGACY

Search for files using outdated formats

▪️ --magic, -m, -M, magic, /M, MAGIC

Search for files with mismatched magic numbers

▪️ --noext, -n, -N, noext, /N, NOEXT

Search for files without extensions and attempt to identify them

▪️ --json, -j, -J, json, /J, JSON

Search for JSON files with syntax errors

▪️ --wrong, -w, -W, wrong, /W, WRONG

Search for files with future timestamps

▪️ --emptydirs, -e, -E, emptydirs, /E, EMPTYDIRS

Search for empty directories

▪️ --manyitems, -mi, -MI, manyitems, /MI, MANYITEMS

Search for directories with excessive items (create config.json to customize item threshold)

▪️ --oneitem, -o, -O, oneitem, /O, ONEITEM

Search for directories containing only one item

▪️ --dirsize, -ds, -DS, dirsize, /DS, DIRSIZE

Search for directories or files with excessively long names (create config.json to customize length threshold)

▪️ --fullpathsize, -f, -F, fullpathsize, /F, FULLPATHSIZE

Search for excessively long absolute paths (create config.json to customize path length threshold)

▪️ --uchars, -u, -U, uchars, /U, UCHARS

Search for non-portable characters in absolute paths

▪️ --nocolors, -nc, -NC, nocolors, /NC, NOCOLORS

Disable colored output

📦 Build from source

Requirements:

▪️ FreeBSD: aarch64, arm, powerpc64, powerpc64le, riscv64, x86_64

▪️ Linux: aarch64, arm, loongarch64, powerpc64le, riscv64, s390x, x86, x86_64

▪️ NetBSD: aarch64, arm, x86, x86_64

▪️ Windows: aarch64, x86, x86_64

▪️ Zig compiler: the minimum required version can be found in the build.zig.zon file

NOTE: DataChecker has only been tested on x86_64.

First download or checkout the source code:

git clone --depth=1 https://github.com/mazoti/datachecker

Make sure the Zig compiler is in your path:

export PATH=<zig directory>:$PATH (Linux or Unix)
set PATH=<zig directory>;%PATH% (Windows)

Run the build command; the binary is placed in the "bin" folder:

zig build -p . --release=fast  (optimize for speed)
zig build -p . --release=small (optimize for size/power)
zig build run -- .             (runs in debug mode using current folder)
zig build test                 (runs unit tests)

⚙️ Configurations

The following describes the configuration options in the config.json file. If this file does not exist in the same folder as the binary, DataChecker uses the default values.

▪️ INPUT_FOLDER

Specifies the folder to process. Ignored when a folder is provided as a command-line argument:

"INPUT_FOLDER": ".",

▪️ BUFFER_SIZE

Specifies the RAM usage in bytes:

"BUFFER_SIZE": 65536,

▪️ COLOR

Shows colored output; disable this if you are piping the output to a file or your terminal is not ANSI compatible:

"COLOR": true,

▪️ ENABLE_CACHE

Caches file and folder paths, dates and sizes to improve performance:

"ENABLE_CACHE": true,

▪️ ENTER_TO_QUIT

Displays the message "Press enter to quit" and closes the application only after the user presses Enter:

"ENTER_TO_QUIT": false,

▪️ MAX_JOBS

Maximum number of threads to use. Leave 0 to automatically use all available CPU threads:

"MAX_JOBS": 0,

▪️ DUPLICATE_FILES / DUPLICATE_FILES_PARALLEL

Finds all duplicate files in a folder and displays the total number of wasted bytes. There are two algorithms for this task: a single-threaded two-stage filtering method and a parallel three-stage filtering method:

"DUPLICATE_FILES": true,
"DUPLICATE_FILES_PARALLEL": true,

▪️ LINKS_SHORTCUTS

Finds all shortcuts and symlinks. On Linux and Unix, it also checks whether the target exists:

"LINKS_SHORTCUTS": true,

▪️ INTEGRITY_FILES / INTEGRITY_FILES_PARALLEL

Calculates and verifies file integrity. The following algorithms are supported: Ascon, BLAKE, MD5, and SHA families. To calculate a hash, create an empty file in the same folder using the same base name and extension of the target file and append the desired hash extension. Supported hash extensions include:

  • Ascon
    • ascon256
  • BLAKE
    • blake2b128
    • blake2b160
    • blake2b256
    • blake2b384
    • blake2b512
    • blake2s128
    • blake2s160
    • blake2s224
    • blake2s256
    • blake3
  • MD5
    • md5
  • SHA
    • sha1
    • sha224
    • sha256
    • sha256t192
    • sha3_224
    • sha3_256
    • sha3_384
    • sha3_512
    • sha384
    • sha512
    • sha512_224
    • sha512_256
    • sha512t224
    • sha512t256

Example: datachecker.exe.sha256

When such an empty file is present, DataChecker computes the corresponding hash of the target file and writes the hexadecimal ASCII result into the hash file. If the hash file is not empty, DataChecker instead performs an integrity check by comparing the hash computed from the target file with the value stored in the hash file:

"INTEGRITY_FILES": true,
"INTEGRITY_FILES_PARALLEL": true,

▪️ TEMPORARY_FILES

Displays files generated by compilers, browsers, operating systems, servers and databases that are usually safe to remove:

"TEMPORARY_FILES": true,

▪️ CONFIDENTIAL_FILES

Displays files containing confidential data. To search for a string, add it to the PATTERNS array. To search for an arbitrary byte sequence, encode it as Base64 and add it to the PATTERN_BASE64_BYTES array:

"CONFIDENTIAL_FILES": true,
"PATTERNS": [ "access code", "Access code", "Access Code", "ACCESS CODE", ... ],
"PATTERN_BASE64_BYTES": [
    "LS0tLS1CRUdJTiBEU0EgUFJJVkFURSBLRVktLS0tLQ==",
    "LS0tLS1CRUdJTiBFQyBQUklWQVRFIEtFWS0tLS0t",
    "LS0tLS1CRUdJTiBFTkNSWVBURUQgUFJJVkFURSBLRVktLS0tLQ==",
    "LS0tLS1CRUdJTiBPUEVOU1NIIFBSSVZBVEUgS0VZLS0tLS0=",
    "LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0t",
    "LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQ==",
    ...
],
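
To see what an existing PATTERN_BASE64_BYTES entry stands for, or to prepare a new one, the standard Base64 encoding can be applied with a few lines of Python. The helper names encode_pattern and decode_pattern are hypothetical; the two sample values are taken from the list above.

```python
import base64


def encode_pattern(raw):
    """Serialize raw bytes as Base64 for the PATTERN_BASE64_BYTES array."""
    return base64.b64encode(raw).decode("ascii")


def decode_pattern(encoded):
    """Recover the raw bytes that a PATTERN_BASE64_BYTES entry stands for."""
    return base64.b64decode(encoded)


print(decode_pattern("LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQ=="))
# prints b'-----BEGIN RSA PRIVATE KEY-----'
print(encode_pattern(b"-----BEGIN PRIVATE KEY-----"))
# prints LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0t
```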

▪️ COMPRESSED_FILES

Displays losslessly compressed files that use a low compression level or are otherwise not optimally compressed. You can save space by recompressing them:

"COMPRESSED_FILES": true,

▪️ DUPLICATE_CHARS_FILES

Displays file or directory names that contain repeated characters such as consecutive spaces or underscores:

"DUPLICATE_CHARS_FILES": true,

▪️ EMPTY_FILES

Finds all empty files. In most cases, an empty file is unnecessary or indicates poor programming practice:

"EMPTY_FILES": true,

▪️ LARGE_FILE_SIZE

Displays all files larger than 100 GiB (107,374,182,400 bytes) by default. You can change this threshold by setting the LARGE_FILE_SIZE variable in config.json (value in bytes):

"LARGE_FILE_SIZE": 107374182400,

▪️ LAST_ACCESS_TIME

Displays all files last accessed more than a year ago by default. You can change this threshold by setting the LAST_ACCESS_TIME variable in config.json (value in nanoseconds):

"LAST_ACCESS_TIME": 31536000000000000,

▪️ LEGACY_FILES

Displays all files using outdated or unused formats:

"LEGACY_FILES": true,

▪️ MAGIC_NUMBERS

Displays all files whose magic number does not match their extension (this can help identify and fix issues):

"MAGIC_NUMBERS": true,

▪️ NO_EXTENSION

Displays all files without extension and attempts to identify their file type:

"NO_EXTENSION": true,

▪️ PARSE_JSON_FILES

Displays all JSON files with errors:

"PARSE_JSON_FILES": true,

▪️ WRONG_DATES

Displays all files with access, creation, or modification date in the future:

"WRONG_DATES": true,

▪️ EMPTY_DIRECTORIES

Finds all empty folders, which are usually not useful:

"EMPTY_DIRECTORIES": true,

▪️ MANY_ITEMS_DIRECTORY / MAX_ITEMS_DIRECTORY

Displays all directories containing more than 10,000 items, which could slow down access. You can adjust this threshold by setting the MAX_ITEMS_DIRECTORY value in config.json:

"MANY_ITEMS_DIRECTORY": true,
"MAX_ITEMS_DIRECTORY": 10000,

▪️ ONE_ITEM_DIRECTORY

Displays all directories with only one item inside, which is usually not useful:

"ONE_ITEM_DIRECTORY": true,

▪️ DIRECTORY_FILE_NAME_SIZE / MAX_DIR_FILE_NAME_SIZE

Different filesystems support different maximum filename lengths in bytes. By default, DataChecker warns you about any file or folder whose name exceeds 200 bytes to ensure portability. You can change this limit using the MAX_DIR_FILE_NAME_SIZE variable in config.json. Remember that emojis and other non-ASCII characters take more than 1 byte in UTF-8:

"DIRECTORY_FILE_NAME_SIZE": true,
"MAX_DIR_FILE_NAME_SIZE": 200,

▪️ FULL_PATH_SIZE / MAX_FULL_PATH_SIZE

Same as above, but checks the absolute path. By default, DataChecker warns you about any file or folder whose path exceeds 1024 bytes to ensure portability. You can change this limit using the MAX_FULL_PATH_SIZE variable in config.json:

"FULL_PATH_SIZE": true,
"MAX_FULL_PATH_SIZE": 1024,

▪️ UNPORTABLE_CHARS

Displays all files and folders containing characters that are not portable across modern filesystems:

"UNPORTABLE_CHARS": true

💰 Donations

Donations of any amount are welcome here

License