Milkman Matty
05/08/2021

Checksums are a great tool to verify that the file you have received from a server is the same as the file on the server. They can be used for many things of course, after all, checksums are just the output of a cryptographic hash function, but the main reason for their existence is verification.

So download a file, generate a checksum and check that the file on the server has the same checksum - that's the basics. If the checksum is the same then you can be sure that you have the entire file, the file hasn't been corrupted in transit and that you haven't downloaded some other file (via bait-and-switch or possibly other nefarious means).

The process above is a bit high-level though, let's dive a little deeper into the nitty-gritty. There are two things to keep in mind when generating checksums:

1. File size can raise performance concerns. There's no way to know how each checksum generator will load a given file into memory. It might read the entire file into memory before calculating the checksum, or it might stream the file in blocks of 1GiB, 1MiB or even 1KiB. If the file being checksummed is multiple GiB in size then performance can take a huge hit and possibly even result in thrashing. Similarly, the smaller the block size for streamed files, the longer the operation will take. This can lead to a checksum taking 20 minutes or longer.

2. Ensure that the algorithm used by both parties is the same. For instance, if the file on the server has a checksum generated with SHA-256, then when you generate your local checksum it must also use SHA-256, otherwise the two checksums will be completely different.
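The second point is quick to demonstrate. As a sketch using Node's built-in crypto module (rather than any particular third-party library), hashing the same input with two different algorithms gives digests that don't even share a length:

```typescript
import { createHash } from "crypto";

// The same input data...
const data = "hello world";

// ...hashed with two different algorithms.
const sha = createHash("sha256").update(data).digest("hex");
const md5 = createHash("md5").update(data).digest("hex");

console.log(sha); // 64 hex characters
console.log(md5); // 32 hex characters - a completely different digest
```

If the server publishes a SHA-256 digest and you compute an MD5 one locally, the comparison will always fail no matter how intact the file is.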

The former point became particularly important in a recent Electron app I was developing. Reading in a file is pretty straightforward in Electron:


const fs = require('fs')

try {
  const data = fs.readFileSync('/Users/joe/test.txt', 'utf8')
  console.log(data)
} catch (err) {
  console.error(err)
}

That's taken from Reading files with Node.js, which is a great resource. However on the same webpage they also state:

"Both fs.readFile() and fs.readFileSync() read the full content of the file in memory before returning the data.
This means that big files are going to have a major impact on your memory consumption and speed of execution of the program.
In this case, a better option is to read the file content using streams."

Well that rules out the two fs.readFile* methods from use. Attempting to use either of those methods on a 10GiB file could very easily topple under-provisioned systems. That leaves, as suggested, only the streaming option - an easy choice that solves the file-size performance concern.
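For a sense of what streaming looks like, here is a minimal sketch using Node's built-in crypto and fs modules (not the library the app ended up using): the file flows through the hash in chunks of Node's default highWaterMark (64 KiB for file streams), so memory use stays flat regardless of file size.

```typescript
import { createHash } from "crypto";
import { createReadStream } from "fs";

// Stream a file through a SHA-256 hash chunk by chunk instead of
// loading the whole file into memory at once.
const streamChecksum = (path: string): Promise<string> =>
    new Promise((resolve, reject) => {
        const hash = createHash("sha256");
        createReadStream(path)
            .on("data", (chunk) => hash.update(chunk)) // feed each chunk in
            .on("end", () => resolve(hash.digest("hex"))) // finalise at EOF
            .on("error", reject);
    });
```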

Next step is to settle on an algorithm. The server in my case used SHA-256, which dictated that I use SHA-256 in my app as well. A quick search on GitHub found js-sha256, which has everything I need:

1. Can compute SHA-256 hashes.

2. Data can be fed to the algorithm progressively before the hash is computed.

3. I didn't write it. I really didn't want to write my own cryptographic hashing function.

A quick import of that library and most of the heavy lifting is done:

//Typescript
import { sha256 } from 'js-sha256'; //External lib
import fs from "fs";

// For my scenario 10 MiB per chunk, but it can be anything
const SIZE_CHECKSUM = 10 * Math.pow(1024, 2);

const GetChecksum = async (size: number, fd: number):
    Promise<string> => {
    const algorithm = sha256.create();

    //The total number of SIZE_CHECKSUM-sized "chunks"
    const totalChunks = Math.ceil(size / SIZE_CHECKSUM);

    for (let chunkCount = 0; chunkCount < totalChunks; ++chunkCount) {
        const start = chunkCount * SIZE_CHECKSUM;

        //The last chunk is almost certainly not going to be completely
        //full of data - in that case grab only as much as remains.
        //All other chunks are a full SIZE_CHECKSUM bytes.
        const nextChunk = Math.min(size - start, SIZE_CHECKSUM);

        const buffer = Buffer.alloc(nextChunk);

        //Read the file bytes into the buffer; readSync reports how many
        //bytes it actually read, which can be fewer than requested
        const bytesRead = fs.readSync(
            fd,
            buffer,
            0,
            nextChunk,
            start
        );

        //Update the algorithm with only the bytes that were read
        algorithm.update(buffer.subarray(0, bytesRead));
    }

    //After the entire file has been piece-fed into the algorithm,
    //calculate the hash
    return algorithm.hex();
}

export default GetChecksum;

Based on a StackOverflow answer by swapnil.
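Since GetChecksum takes a size and a file descriptor rather than a path, the caller has to open the file and look up its size first. A hypothetical wiring might look like the following - the checksum function is passed in as a parameter here purely so the sketch stands on its own:

```typescript
import fs from "fs";

// Stand-in for the GetChecksum signature from the listing above
type ChecksumFn = (size: number, fd: number) => Promise<string>;

const checksumFile = async (
    path: string,
    getChecksum: ChecksumFn
): Promise<string> => {
    const fd = fs.openSync(path, "r");
    try {
        const { size } = fs.fstatSync(fd); // byte length of the file on disk
        return await getChecksum(size, fd);
    } finally {
        fs.closeSync(fd); // always release the descriptor, even on error
    }
};
```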

Choosing a value for SIZE_CHECKSUM is project dependent. The larger the chunk size, the faster the operation completes - assuming the system has enough memory to spare, which is a big assumption; if it always did, there would be no reason to stream the file in chunks at all.

So then it becomes a balancing act of picking a size large enough to not impact the speed of the process, while being small enough to not impact the performance of the computer that the app is running on.

Once the checksum is computed, it's simply a case of comparing the two hashes and ensuring that they are the same.
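One small gotcha worth a sketch: different tools report hex digests in different cases (some uppercase, some lowercase), so it's safest to normalise before comparing.

```typescript
// Compare two hex digests, ignoring the casing that different tools use
const checksumsMatch = (local: string, server: string): boolean =>
    local.toLowerCase() === server.toLowerCase();
```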