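Handling concurrent CPU-intensive jobs with the Node.js cluster module

Introduction

This article gives a brief introduction to using the Node.js cluster module to keep concurrent CPU-intensive jobs from blocking the system.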

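Problem description

A service started with Node.js runs all of its JavaScript on a single thread.

Non-blocking I/O is implemented through libuv's event loop.

As a result, when several CPU-intensive jobs arrive concurrently, they occupy that thread and the Node.js server cannot handle other requests until they finish.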
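The effect is easy to reproduce outside a web server. The sketch below schedules a 0 ms timer and then keeps the main thread busy; the timer callback cannot run until the synchronous loop returns, which is the same thing that happens to queued requests. The function names `busyWait` and `measureTimerLag` are made up for this illustration.

```typescript
// Keep the main thread busy for roughly `ms` milliseconds.
// While this loop runs, the event loop cannot pick up timers or I/O callbacks.
function busyWait(ms: number): number {
  const start = Date.now();
  let spins = 0;
  while (Date.now() - start < ms) {
    spins++;
  }
  return spins;
}

// Schedule a 0 ms timer, then block the thread.
// Resolves with how late (in ms) the timer callback actually ran.
function measureTimerLag(blockMs: number): Promise<number> {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 0);
    busyWait(blockMs);
  });
}

measureTimerLag(200).then((lag) => {
  console.log(`A 0 ms timer fired after about ${lag} ms`);
});
```

The reported lag should be at least as long as the blocking time, even though the timer was requested with a delay of 0 ms.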

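Node.js cluster architecture

This is where the cluster concept comes in: it brings the other CPU cores into play for the computation.

As shown in the figure below,

the parts in red are Node.js instances running in parallel, each able to use a different CPU core,

so they do not block each other's requests.

Together, these red Node.js instances form one larger computing unit that serves requests; this is what is called a cluster.

Typically, one primary Node.js instance acts as an API gateway and dispatches incoming requests to every available instance in the cluster.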
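As a rough mental model of this dispatching role, the sketch below hands incoming requests to workers in turn. The cluster module uses a round-robin policy by default on every platform except Windows, but the class here (`RoundRobinDispatcher`, a hypothetical name) is only a toy simulation and does not call the cluster API.

```typescript
// Toy model of round-robin dispatch: each new request goes to the next worker in line.
class RoundRobinDispatcher {
  private next = 0;

  constructor(private readonly workerIds: number[]) {
    if (workerIds.length === 0) {
      throw new Error('at least one worker is required');
    }
  }

  // Return the id of the worker that should handle the next request.
  pick(): number {
    const id = this.workerIds[this.next];
    this.next = (this.next + 1) % this.workerIds.length;
    return id;
  }
}

const dispatcher = new RoundRobinDispatcher([1, 2, 3, 4]);
const assignments = Array.from({ length: 6 }, () => dispatcher.pick());
console.log(assignments); // [ 1, 2, 3, 4, 1, 2 ]
```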

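Example

The following Node.js example shows how to use cluster mode to balance the workload.

It is divided into:

Handling the load with the cluster module
Achieving the same thing with pm2
Benchmarking each setup with loadtest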

1. Tools used

loadtest: a Node.js module for running benchmarks
pm2: a process manager for Node.js that can run an application in cluster mode

2. A sample CPU-intensive task

app.service.ts

import { Injectable } from '@nestjs/common';

@Injectable()
export class AppService {
  doHeavyJob(): string {
    let total = 0;
    for (let i = 0; i < 50_000_000; i++) {
      total++;
    }
    return `The result of the CPU intensive task is ${total}`;
  }
}
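To get a feel for how long a single call keeps the thread busy, the same loop can be timed on its own, without Nest. This is a minimal sketch; `heavyLoop` and `timeHeavyLoop` are illustrative names, and the measured time depends entirely on the machine.

```typescript
// Same counting loop as doHeavyJob, with the iteration count as a parameter.
function heavyLoop(iterations: number): number {
  let total = 0;
  for (let i = 0; i < iterations; i++) {
    total++;
  }
  return total;
}

// Run the loop once and report the wall-clock time it took.
function timeHeavyLoop(iterations: number): { total: number; elapsedMs: number } {
  const startedAt = Date.now();
  const total = heavyLoop(iterations);
  return { total, elapsedMs: Date.now() - startedAt };
}

const timing = timeHeavyLoop(50_000_000);
console.log(`counted to ${timing.total} in ${timing.elapsedMs} ms`);
```

Whatever the measured time is, no other request can be served by the same process during it.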

app.controller.ts

import { Controller, Get } from '@nestjs/common';
import { AppService } from './app.service';

@Controller()
export class AppController {
  constructor(private readonly appService: AppService) {}

  @Get('heavy')
  doHeavyJob(): string {
    return this.appService.doHeavyJob();
  }
}

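3. Benchmark with loadtest

npx loadtest -n 1200 -c 400 -k http://localhost:3000/heavy

Here -n is the total number of requests, -c is the number of concurrent clients, and -k enables keep-alive.

The latency was about 20 seconds.

The whole run took 71 seconds.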

4. Handling the load with cluster mode

primary.ts

import { Logger } from '@nestjs/common';
import * as _cluster from 'cluster';
import * as os from 'os';
import * as path from 'path';

// Typing workaround: treat the namespace import as the Cluster object.
const cluster = _cluster as unknown as _cluster.Cluster;
const cpuCount = os.cpus().length;
const logger = new Logger('Primary');

logger.log({
  message: `The total number of CPUs is ${cpuCount}, primary pid=${process.pid}`,
});

// Workers run main.js from the same directory instead of this file.
cluster.setupPrimary({
  exec: path.join(__dirname, 'main.js'),
});

// Fork one worker per CPU core.
for (let i = 0; i < cpuCount; i++) {
  cluster.fork();
}

// Fork a replacement whenever a worker exits, so the pool stays at full size.
cluster.on('exit', (worker, code, signal) => {
  logger.log({
    message: `worker ${worker.process.pid} has been killed with ${signal}, starting another worker`,
  });
  cluster.fork();
});
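One detail of the exit handler above: the `signal` argument is only set when the worker was killed by a signal. When a worker exits on its own, `signal` is null and the exit code is in `code`. A small helper such as the hypothetical `describeExit` below could produce a log message that covers both cases; it is a sketch, not part of the original example.

```typescript
// Build a log message from the arguments of the cluster 'exit' event.
// `signal` is set when the worker was killed by a signal; otherwise `code` holds the exit code.
function describeExit(
  pid: number | undefined,
  code: number | null,
  signal: string | null,
): string {
  const who = `worker ${pid ?? 'unknown'}`;
  if (signal) {
    return `${who} was killed by signal ${signal}`;
  }
  return `${who} exited with code ${code}`;
}

console.log(describeExit(1234, null, 'SIGKILL')); // worker 1234 was killed by signal SIGKILL
console.log(describeExit(1234, 1, null)); // worker 1234 exited with code 1
```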

5. Benchmark with cluster mode

Start the server with the following command:

nest start --entryFile primary

Run the benchmark with loadtest:

npx loadtest -n 1200 -c 400 -k http://localhost:3000/heavy

6. Running the cluster with pm2

Start the server with the following command:

pm2 start dist/main.js -i max --name nest

Run the benchmark with loadtest:

npx loadtest -n 1200 -c 400 -k http://localhost:3000/heavy

The latency was about 3 seconds.

The whole run took 11 seconds.
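The two sets of measurements in this article can be turned into throughput and speedup figures. The sketch below only does arithmetic on the numbers reported here (1200 requests; 71 s total and about 20 s latency for the single process; 11 s total and about 3 s latency with pm2). The type and helper names are illustrative.

```typescript
interface BenchmarkRun {
  label: string;
  requests: number;
  totalSeconds: number;
  latencySeconds: number;
}

// Requests completed per second over the whole run.
function throughput(run: BenchmarkRun): number {
  return run.requests / run.totalSeconds;
}

// How many times faster the second run finished compared with the first.
function speedup(before: BenchmarkRun, after: BenchmarkRun): number {
  return before.totalSeconds / after.totalSeconds;
}

const single: BenchmarkRun = {
  label: 'single process',
  requests: 1200,
  totalSeconds: 71,
  latencySeconds: 20,
};
const clustered: BenchmarkRun = {
  label: 'pm2 cluster',
  requests: 1200,
  totalSeconds: 11,
  latencySeconds: 3,
};

console.log(`${single.label}: ${throughput(single).toFixed(1)} req/s`); // 16.9 req/s
console.log(`${clustered.label}: ${throughput(clustered).toFixed(1)} req/s`); // 109.1 req/s
console.log(`speedup: ${speedup(single, clustered).toFixed(1)}x`); // 6.5x
```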

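Worker threads vs. cluster

Cluster is not the only way to make effective use of CPU resources;

worker threads are another option.

The difference is that worker threads run inside one process and can share data (for example, through a SharedArrayBuffer) far more easily than cluster workers, which are separate processes.

So when threads need to share data, worker threads are the better fit,

while cluster mode suits workloads that run independently.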
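To illustrate the data-sharing side, here is a minimal sketch in which several worker threads increment one counter stored in a SharedArrayBuffer. The function name `runSharedCounter` is hypothetical, and the worker body is passed as a plain JavaScript string with `eval: true` only so that the example fits in a single file; a real project would normally put the worker in its own file.

```typescript
import { Worker } from 'worker_threads';

// Worker body as plain JavaScript so the whole example fits in one file.
const workerSource = `
  const { workerData } = require('worker_threads');
  const counter = new Int32Array(workerData.shared);
  for (let i = 0; i < workerData.increments; i++) {
    Atomics.add(counter, 0, 1); // atomic increment on memory shared with the main thread
  }
`;

// Start `workerCount` threads that each add `increments` to one shared counter,
// then resolve with the final counter value once every thread has exited.
function runSharedCounter(workerCount: number, increments: number): Promise<number> {
  const shared = new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT);
  const counter = new Int32Array(shared);

  const finished = Array.from({ length: workerCount }, () =>
    new Promise<void>((resolve, reject) => {
      const worker = new Worker(workerSource, {
        eval: true,
        workerData: { shared, increments },
      });
      worker.on('error', reject);
      worker.on('exit', (code) =>
        code === 0 ? resolve() : reject(new Error(`worker exited with code ${code}`)),
      );
    }),
  );

  return Promise.all(finished).then(() => Atomics.load(counter, 0));
}

runSharedCounter(4, 10_000).then((total) => {
  console.log(`shared counter = ${total}`); // 4 workers x 10000 increments = 40000
});
```

Cluster workers cannot share memory this way; they are separate processes, so data has to be exchanged as messages or through an external store.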
