Streams For the Win: A Performance Comparison of NodeJS Methods for Reading Large Datasets (Pt 2)
How readFile(), createReadStream() and event-stream Stack Up Against One Another
If you’ve been keeping up with my writing, you’ll know that a few weeks ago I published a blog post about a variety of ways to use Node.js to read really large datasets.
To my surprise, it did exceptionally well with readers — this seemed (to me) like a topic many others have already covered in posts, blogs and forums, but for whatever reason, it got the attention of a lot of people. So, thank you to all of you who took the time to read it! I really appreciate it.
One particularly astute reader (Martin Kock) went so far as to ask how long it took to parse the files. It seemed as if he’d read my mind, because part two of my series on using Node.js to read really, really large files and datasets involves just that.
Here, I’ll evaluate the three different methods in Node.js I used to read the files, to determine which is most performant.
The Challenge From Part 1
I won’t go into the specifics of the challenge and solution, because you can read my first post for all the details here, but I will give you the high level overview.
A person from a Slack channel I’m a member of posted a coding challenge he’d received, which involved reading in a very large dataset (over 2.5GB in total), parsing through the data and pulling out various pieces of information.
It challenged programmers to print:
- Total lines in the file,
- Names in the 432nd and 43243rd indexes,
- Counts of the total number of donations per month,
- And the most common first name in the files and a count of how often it occurred.
Link to the data: https://www.fec.gov/files/bulk-downloads/2018/indiv18.zip
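The four goals above boil down to a handful of running tallies that get updated once per line. Here is a minimal sketch of that idea, not the code from my repo. It assumes the file is pipe-delimited, with the contributor name in column 7 (formatted as LAST, FIRST) and the transaction date in column 13 (formatted as MMDDYYYY). Those column positions, and the helper names createTally, addLine and mostCommon, are assumptions for illustration only.

```javascript
// Column positions are assumptions about the file layout; adjust for your data.
const NAME_COLUMN = 7;  // assumed format: "LAST, FIRST MIDDLE"
const DATE_COLUMN = 13; // assumed format: "MMDDYYYY"

// Create an empty set of running tallies.
function createTally() {
  return { lineCount: 0, names: [], perMonth: new Map(), firstNames: new Map() };
}

// Update the tallies with one line of the file.
function addLine(tally, line) {
  const fields = line.split('|');
  const name = fields[NAME_COLUMN] || '';
  const date = fields[DATE_COLUMN] || '';

  tally.lineCount += 1;
  tally.names.push(name); // kept so names can be looked up by index later

  // Group donations by year and month, e.g. "2017-03".
  if (date.length === 8) {
    const month = `${date.slice(4)}-${date.slice(0, 2)}`;
    tally.perMonth.set(month, (tally.perMonth.get(month) || 0) + 1);
  }

  // The first name is taken as the first word after the comma.
  const afterComma = name.split(',')[1];
  if (afterComma) {
    const first = afterComma.trim().split(' ')[0];
    if (first) {
      tally.firstNames.set(first, (tally.firstNames.get(first) || 0) + 1);
    }
  }
}

// Return the [key, count] pair with the highest count.
function mostCommon(counts) {
  let best = [null, 0];
  for (const entry of counts) {
    if (entry[1] > best[1]) best = entry;
  }
  return best;
}
```

Whichever reading method feeds it lines, the tallying logic stays the same, which is what makes the three approaches comparable.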
The Three Different Solutions Possible For Smaller Datasets
As I worked towards my ultimate end goal of processing a large dataset, I came up with three solutions in Node.js.
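The first involved Node.js’s native method of fs.readFile(), which reads the entire file into memory before any operations run on it. This worked on smaller files, but the biggest file triggered a heap out of memory error.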
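A minimal sketch of this read-everything-first approach, assuming a newline-delimited text file. The helper name countLines and the temporary sample file are placeholders for illustration, not the code from my repo.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Count the lines in a string that has already been loaded into memory.
function countLines(contents) {
  const lines = contents.split('\n');
  // Ignore the empty string left behind by a trailing newline.
  if (lines[lines.length - 1] === '') lines.pop();
  return lines.length;
}

// Write a tiny sample file so the sketch can run on its own.
const samplePath = path.join(os.tmpdir(), 'readfile-sketch.txt');
fs.writeFileSync(samplePath, 'first\nsecond\nthird\n');

// fs.readFile() hands the callback the whole file at once, so memory use
// grows with the size of the file being read.
fs.readFile(samplePath, 'utf8', (err, contents) => {
  if (err) throw err;
  console.log('Total lines:', countLines(contents));
});
```

Nothing can start until the full contents arrive in the callback, which is fine for a small file and a problem for a multi-gigabyte one.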
My second solution also involved another couple of methods native to Node.js: fs.createReadStream() and rl.readLine(). In this iteration, the file was streamed through Node.js in an input stream, and I was able to perform individual operations on each line, then cobble all those results together in the output stream. Again, this worked pretty well on smaller files, but once I got to the biggest file, the same error happened. Although Node.js was streaming the inputs and outputs, it still attempted to hold the whole file in memory while performing the operations (and couldn’t handle the whole file).
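Here is a minimal sketch of the line-by-line streaming idea, using the 'line' and 'close' events that Node’s readline interface emits. The function name countLinesStreaming and the temporary sample file are placeholders for illustration, not the code from my repo.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const readline = require('readline');

// Stream a file line by line and resolve with the total line count.
function countLinesStreaming(filePath) {
  return new Promise((resolve, reject) => {
    const input = fs.createReadStream(filePath);
    input.on('error', reject);

    const rl = readline.createInterface({ input, crlfDelay: Infinity });
    let lineCount = 0;

    // 'line' fires once per line as data arrives from the input stream.
    rl.on('line', () => {
      lineCount += 1;
    });

    // 'close' fires after the input stream has ended.
    rl.on('close', () => resolve(lineCount));
  });
}

// Write a tiny sample file so the sketch can run on its own.
const samplePath = path.join(os.tmpdir(), 'readline-sketch.txt');
fs.writeFileSync(samplePath, 'first\nsecond\nthird\n');

countLinesStreaming(samplePath).then((count) => {
  console.log('Total lines:', count);
});
```

One thing worth keeping in mind with a sketch like this: memory stays flat only as long as the 'line' handler keeps small summaries, such as a counter. If the handler stores every line it sees, memory will still grow with the file.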
In the end, I came up with only one solution in Node.js that was able to handle the full 2.55GB file I wanted to parse through, at one time.
My solution involved a popular NPM package called event-stream, which let me perform operations on the throughput stream of data itself, instead of only on the input and output streams as in my native Node.js solution.
You can see all three of my solutions here in GitHub.
And I solved the problem, which was my initial goal, but it got me thinking: was my solution really the most performant of the three options?
Comparing Them To Find The Optimal Solution
Now, I had a new goal: determine which of my solutions was best.
Since I couldn’t use the full 2.55GB file with the Node.js native solutions, I chose one of the smaller files, about 400MB worth of data, that I’d used for testing while I was developing my solutions.
For performance testing Node.js, I came across two ways to keep track of the file and individual function processing times, and I decided to incorporate both to see how great the differences were between the two methods (and make sure I wasn’t completely off the rails with my timing).
Node.js has some handy, built-in methods available to it for timing and performance testing, called console.time() and console.timeEnd(), respectively. To use these methods, I only had to pass in the same label parameter for both time() and timeEnd(), like so, and Node’s smart enough to output the time between them after the function’s done.
// timer start
console.time('label1');
// run function doing something in the code
doSomething();
// timer end, where the difference between the timer start and timer end is printed out
console.timeEnd('label1');
// output in console looks like: label1: 0.002ms
That’s one method I used to figure out how long it took to process the dataset.
The other tried and well-liked performance testing module I came across for Node.js is hosted on NPM as performance-now. 7+ million downloads per week from NPM can’t be too wrong, right?
Implementing the performance-now module in my files was almost as easy as the native Node.js methods, too. Import the module, set a variable for the start and end of the instantiation of the method, and compute the time difference between the two.
// import the performance-now module at the top of the file
const now = require('performance-now');
// set the start of the timer as a variable
const start = now();
// run function doing something in the code
doSomething();
// set the end of the timer as a variable
const end = now();
// compute the duration between the start and end
console.log('Performance for timing label: ' + (end - start).toFixed(3) + 'ms');
// console output looks like: Performance for timing label: 0.002ms
I figured that by using both Node’s console.time() and performance-now at the same time, I could split the difference and get a pretty accurate read on how long my file parsing functions were really taking.
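To make the idea of wrapping one call in both timers concrete, here is a minimal sketch. So that it runs without installing anything, it assumes Node’s built-in performance.now() from perf_hooks as a stand-in for the performance-now package. The names timeBoth and doSomething are hypothetical.

```javascript
const { performance } = require('perf_hooks');

// Run fn once, timing it with console.time() and a high-resolution clock.
// Returns the elapsed milliseconds measured by the high-resolution clock.
function timeBoth(label, fn) {
  console.time(label);
  const start = performance.now();

  fn();

  const end = performance.now();
  console.timeEnd(label);

  const elapsedMs = end - start;
  console.log(`Performance for timing ${label}: ${elapsedMs.toFixed(3)}ms`);
  return elapsedMs;
}

// Hypothetical workload standing in for a file-parsing function.
function doSomething() {
  let total = 0;
  for (let i = 0; i < 1e6; i += 1) total += i;
  return total;
}

const elapsed = timeBoth('doSomething', doSomething);
```

The two numbers will rarely match exactly, since each timer starts and stops at a slightly different moment, but they should land close enough to sanity-check one another.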
Below are code snippets implementing console.time() and performance-now in each of my scripts. These are only snippets of one function each. For the full code, you can see my repo here.
fs.readFile() Code Implementation Sample
Since this script is using the fs.readFile() implementation, where the entire file is read into memory before any functions are executed on it, this is the most synchronous-looking code. It’s not actually synchronous (that’s an entirely separate Node method called fs.readFileSync()); it just resembles it.
But it’s easy to see the total line count of the file and the two timing methods bookending it to determine how long it takes to execute the line count.
fs.createReadStream() Code Implementation Sample
Input Stream (line-by-line):
Output Stream (once full file’s been read during input):
As the second solution using fs.createReadStream() involved creating an input and output stream for the file, I broke the code snippets into two separate screenshots: the first is from the input stream (which is running through the code line by line) and the second is the output stream (compiling all the resulting data).
Event Stream Code Implementation Sample
Through Stream (also line-by-line):
On Stream End:
The event-stream solution looks pretty similar to the fs.createReadStream() one, except instead of an input stream, the data is processed in a throughput stream. Then, once the whole file’s been read and all the functions have run on it, the stream’s ended and the required information is printed out.
Now on to the moment we’ve all been waiting for: the results!
I ran all three of my solutions against the same 400MB dataset, which contained almost 2 million records to parse through.
Streams for the win!
As you can see from the table, fs.createReadStream() and event-stream both fared well, but overall, event-stream has to be the grand winner in my mind, if only for the fact that it can process much larger file sizes than either fs.readFile() or fs.createReadStream().
The percentage improvements are included at the end of the table above as well, for reference.
fs.readFile() just got blown out of the water by the competition. By streaming the data, processing times for the file improved by at least 78%, and sometimes by close to 100%, which is pretty darn impressive.
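For reference, a percentage improvement like the ones quoted here is, under the usual definition, the time saved divided by the baseline time. A tiny sketch of that arithmetic follows. The function name and the sample timings are hypothetical, not my measured results.

```javascript
// Percentage by which newMs improves on baselineMs (both in milliseconds).
function percentImprovement(baselineMs, newMs) {
  return ((baselineMs - newMs) * 100) / baselineMs;
}

// Hypothetical timings, chosen only to illustrate the calculation.
const readFileMs = 1000;
const streamMs = 220;

console.log(percentImprovement(readFileMs, streamMs).toFixed(1) + '% faster');
// prints "78.0% faster"
```

By this definition, a result approaching 100% means the streamed run took a tiny fraction of the baseline time.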
Below are the raw screenshots from my terminal for each of my solutions.
The solution using only fs.readFile()
The solution using fs.createReadStream() and rl.readLine()
The solution using event-stream
Here’s a screenshot of my event-stream solution churning through the 2.55GB monster file, as well. And here’s the time difference between the 400MB file and the 2.55GB file too.
Look at those blazing fast speeds, even as the file size climbs by more than 6X
event-stream (on the 2.55GB file)
In the end, streams, both native to Node.js and not, are way, WAY more efficient at processing large datasets.
Thanks for coming back for part 2 of my series using Node.js to read really, really large files. If you’d like to read the first blog again, you can get it here.
Thanks for reading, I hope this gives you an idea of how to handle large amounts of data with Node.js efficiently and performance test your solutions. Claps and shares are very much appreciated!