Speaking Intelligently about "Java vs Node" Performance

Original link: rclayton.silvrback.com

TL;DR:

Unless you know the differences in the process and concurrency models of Java and Node, don't write a benchmark comparing the two. It will be silly to those who know the difference and misleading to those who don't.

For those deciding between the two platforms, understand that performance isn't the only consideration. A well-designed architecture will utilize multiple servers simply out of the need for redundancy. This makes per-instance capacity/performance less important until you've reached a threshold where monthly server-maintenance costs outweigh development costs.

With that said, if per-request latency is consistently reasonable (meaning on average, requests are served quickly - say, in less than 200ms) and each individual server can handle a reasonable number of simultaneous requests (1,000-2,000 per second), I would argue that developer productivity is far more important than the speed of the underlying VM. Think about it. Is it more important to save a couple hundred dollars a month achieving better server utilization if it takes your engineers significantly longer to deliver a feature? I would argue that it's better to optimize developer cost-per-month over server utilization.

So in the case of Java vs. Node, pick the one you think will deliver the most overall value to your team and keep in mind that performance is only one factor in that calculus.

Speaking Intelligently about Performance

Over the last couple of years, I have noticed a bunch of asinine comparisons between Java and Node. They all generally come in the form of "JAVA IS 30 TIMES FASTER THAN NODE!" and then you read the fine print: "...at serving 1,000 web requests". Generally, I attribute this kind of propaganda to Java holdouts who don't understand why someone might want to leave the JVM platform for something else.

However, I'm finding this question is being asked more often in places like Quora. In these cases, people are truly curious about the platforms because they need to make a decision. For instance, I'm seeing a lot of conversations like the following:

  • I'm going to invest a lot of time learning a new language; should it be Java or Node?
  • We're thinking of migrating away from Rails/PHP; what architecture would be best for the future?
  • I really like Node.js, but I'm worried. Will I have to eventually rewrite my Node services to Java for performance reasons?

And why wouldn't they be confused?

A quick Google search turns up a myriad of articles pitting Java and Node against each other.

Some of these articles take the time to explain the differences between the process and concurrency models of both platforms (which is great). However, many are incredibly biased towards one language or the other. On top of that, the benchmarks you'll find are typically written by someone with a lot of experience in one language but not the other. This raises the question: was the implementation of the benchmark idiomatic in both languages?

I think a final point to make is that performance isn't everything, and those of us in the know should make a point of emphasizing that. However, I'm going to humor the performance argument by explaining the process and concurrency model differences between both platforms to answer any high-level questions you might have.

Java Process and Concurrency Model

Java applications tend to be run as a single process. Within that process, applications can utilize independent "threads" of execution to perform tasks in parallel. A typical Java web server uses hundreds (if not thousands) of threads. Each request to the server is typically handled on its own thread, and modern Java web servers are now generally non-blocking (meaning they yield control of execution while waiting for background threads to complete tasks - typically I/O related, like connecting to a database). This is a huge advancement, especially since blocking execution was a major criticism of Java web servers when Node.js first came out (sorry, can't use that excuse anymore).

class PrimeThread extends Thread {
  long minPrime;

  PrimeThread(long minPrime) {
    this.minPrime = minPrime;
  }

  public void run() {
    // compute primes larger than minPrime
    // ...
  }
}

// Somewhere else, in another file 
PrimeThread p = new PrimeThread(143);
p.start();

This is an example of spawning a Thread in Java (taken from the Oracle documentation - https://docs.oracle.com/javase/7/docs/api/java/lang/Thread.html).

Threads are a big deal. You will conceptually have as many actively executing threads as you have CPU cores. A single Java process (web server or application) can utilize all of the CPU cores on a computer because it can allocate more than one thread. Another (kind of) advantage of threads is that they can share memory. This means you can access variables created by one thread on another, whereas, separate processes cannot. This makes coordination and communication between threads extremely fast - they can simply access the same variables in their process's memory space.

Shared access to memory between threads can also be a great problem. Since multiple threads can access the same memory at the same time, it is completely possible to create scenarios where your application state becomes inconsistent due to concurrent modification and access. One of these scenarios is deadlocking a process [1], where two competing threads are indefinitely stalled because they have independently locked a resource the other is reliant upon. This means that some state needs to be handled in a synchronous fashion, which requires the coordination of threads around some kind of state boundary (review the Java documentation on locks, synchronized, etc.).

A common consequence of poor state synchronization is dirty mutation. In Java, it is completely possible for two threads to both modify a long or double (64-bit) at the same time (one changing the most significant 32 bits and the other the least significant 32 bits). This happens because Java does not guarantee that writes to 64-bit longs and doubles are atomic [2].

Simply put, while you can use Java to write near-realtime (stupid fast) applications, it's really hard to do right. This is why there are books like Java Concurrency in Practice teaching developers how to use the platform correctly.

Node Process and Concurrency Model

Node, on the other hand, has taken a different approach to concurrency, and this is reflected in its process model. In Node, a process consists of one main thread of execution, and a myriad of background threads (typically performing I/O work). Coordination between the background threads and the main thread is performed using a queue. The main thread pulls tasks from the queue (enqueued in the order they were received) and executes them. When you call a function that results in an asynchronous background operation, the callback you supplied is placed in the queue when the operation is finished.

As a Node developer, you don't ever have to think about this queue. You also don't ever have to think about another thread trampling your state because changes to your variables only happen on the main thread. This means concurrency is really simple to reason about in Node.
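To make the queue visible, here is a small sketch (the labels are illustrative, not from any particular API): synchronous code on the main thread always runs to completion before any queued callback executes, even a callback scheduled with a zero-millisecond delay.

```javascript
// The main thread runs all synchronous code first; callbacks wait
// in the queue until the current task finishes, even at 0ms delay.
const order = [];

setTimeout(() => order.push('timer callback'), 0); // queued immediately
order.push('synchronous work');                    // but this runs first

setTimeout(() => {
  console.log(order.join(' -> ')); // synchronous work -> timer callback
}, 0);
```

Because only one task runs at a time, `order` can never be modified by two pieces of code simultaneously - which is exactly why you rarely need locks in Node.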

Simplicity comes at a cost though. In Node, there is only one main thread of execution for your code. You can't just spin up a new thread and have work carried out in the background while the main thread continues executing tasks (see note 1). Instead, parallelism in Node is achieved through processes. This is what the Node Cluster library is for.

Node Clustering

Node Clustering essentially means forking new child processes to handle tasks. This forking is basically operating-system process forking, with a little plumbing added by Node to allow the parent and child processes to communicate via Inter-Process Communication (IPC). Processes do not share memory (see note 2). This means a process needs to be bootstrapped with state or have that state passed to it from the parent. One thing Node.js will do, however, is load balance requests to TCP server ports (e.g. an HTTP server) across the child processes. This is something that is often missed in benchmarks like the 1,000-requests-per-second comparisons.

While it may seem annoying to have to fork new processes to get parallelism, it's actually quite simple and powerful. More importantly, it enforces safer practices for handling state synchronization (using message passing instead of locks and semaphores).

Here is an example from Node's Cluster documentation:

const cluster = require('cluster');
const http = require('http');
const numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // Fork workers.
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    // I changed this line because it affected
    // syntax highlighting
    console.log('worker', worker.process.pid, 'died');
  });
} else {
  // Workers can share any TCP connection
  // In this case it is an HTTP server
  http.createServer((req, res) => {
    res.writeHead(200);
    res.end('hello world\n');
  }).listen(8000);
}

If you don't want to have to write this clustering code, there's an even better option: PM2. PM2 (Process Manager v2) is a Node.js framework for managing Node.js (or anything you can run from the shell) processes. PM2 allows you to start processes in Cluster mode via command line option or configuration:

# -i = the number of instances,
#      usually the number of CPUs available
pm2 start server.js -i 4
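The same setup can also live in a PM2 ecosystem file; a minimal sketch (the app name and script path below are placeholders for your own entry point):

```javascript
// ecosystem.config.js - start the app in PM2's cluster mode.
// 'my-app' and 'server.js' are placeholder names.
module.exports = {
  apps: [{
    name: 'my-app',
    script: 'server.js',
    instances: 4,        // or 'max' to use every available CPU
    exec_mode: 'cluster'
  }]
};
```

You would then run `pm2 start ecosystem.config.js` instead of passing flags on the command line.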

That's it! So my point is, you can achieve greater concurrency in Node.js than you might have been led to believe.

Why do you care?

Now that you understand a little bit about the concurrency and process models of Node.js and Java, the next question is, Why do you care?

The fact of the matter is, both are viable platforms for the majority of use cases engineers have. In some circumstances, though, Java will be more appropriate. For instance, if you are building a real-time system, you should use Java over Node.js. Java will almost always be faster than Node.js, unless it is used stupidly. This is because sharing memory amongst threads is a lot faster than IPC. Also, the Java Virtual Machine has had something like 15 years more development than the V8 runtime. Java is also statically typed and precompiled to bytecode.

However, the things that make Java fast also make it a beast to work with. You have to deal with static typing, overly verbose syntax, and, for the longest time, the lack of first-class functions (which made asynchronous programming very painful). Compilation also takes time and can be painful to dissect (if there is a problem like conflicting dependencies). The concurrency model is also hard to get right, and troubleshooting synchronization issues can be a real pain.

Also, in Java one does not simply write a script and run it. There's just a lot more involved. Node's barrier to entry, particularly for writing and executing a script, is much, much lower. This makes Node.js, in my opinion, faster to learn and develop in, especially if your use case is something web-related. It can take years to master Java, especially if you are building highly concurrent applications that make use of the concurrency libraries available in the language. Java also has a huge ecosystem, which makes it awesome if you are comfortable with the language and need to do something cool like Natural Language Processing, machine learning, distributed computing, etc., but it can also be scary as hell to a newbie.

So when you are considering a platform, you should definitely consider a lot more than performance when making a decision:

  1. What kinds of languages are my team comfortable with?
    • C++, C# - Java will probably be easier to pick up.
    • Python, Ruby, PHP, JavaScript - Node will probably be easier.
  2. What kind of applications do I want to build?
    • Website/API - Node.js will be easier to use and faster to develop in when building a modern Single Page Application. Plus, Isomorphic JavaScript is a pretty awesome feature you can reap when using Node + frontend JavaScript.
    • B2B Web Service with Enterprise Security - Java has much better support for "Enterprisy" frameworks.
    • Simple Worker Queue - it's probably easier to use Node.js, if and only if the tasks you are performing in the queue have supporting APIs written for Node. Basically, you probably don't want to use Node.js to call out to the command line to perform a task like video processing when there are native Java bindings to do the same.
    • Big Data Analytics - Java. There just aren't a whole lot of libraries and there isn't much support currently present in the Node.js ecosystem. Also, Java will be better at handling CPU-bound tasks.
    • Scientific Computing - Once again, Java has historically had better libraries for doing scientific computing. This includes better native support for large numbers and precision arithmetic.
    • etc., etc., etc. - I'm not going to list the permutation of things you can build with these languages. My point is, you will probably choose your language based on the thing you want to build, rather than the syntax and concurrency constructs provided by the platform.
  3. In terms of performance:
    • How many requests per second do you expect to have?
    • What is the maximum acceptable "average request latency"?
    • What are my environmental constraints?
      • Am I only allowed one server?
      • What is the performance profile of each machine?
      • Is this a mobile device?
      • How much load can my application put on the device?
    • Can I get away with scaling out versus scaling up?
      • The argument in favor of Java for performance reasons tends to go away if the application can be scaled horizontally. This is especially true nowadays, where development time is more expensive than renting server resources.
      • Alternatively, this is untrue if you have a huge number of servers (Facebook, Google, Twitter) and the cost of optimizing the platform is less expensive than the development costs to perform the optimization.
  4. How easy is it to recruit for this platform?
    • Can I cost-effectively find and hire developers with this skillset? This is an issue with Java, where the average salary for a developer is much higher and the concentration of developers with that skillset tends to be regional.
    • Is it exciting for new recruits? I'd say, in general, people would be more excited to learn and develop in Node.

Conclusion

Use common sense when making a decision about choosing a platform. Consider the team that's going to be using it. Walk through the benefits and disadvantages of each platform, and then make the decision. I'm sure you will find that it's not as cut-and-dried as "Java is faster than Node."

Notes:

  1. There are third-party modules that provide a threading implementation: node-webworker-threads. These libraries, however, are not a standard part of the platform, and are arguably not idiomatic Node.js.
  2. That's not exactly true. In Linux, threads are implemented as processes that can share memory with their parent.

Sources:

  1. https://en.wikipedia.org/wiki/Deadlock
  2. https://dzone.com/articles/longdouble-are-not-atomic-in-java