Fig. 2: Scalability analysis of the PHD platform. | Nature Communications

Fig. 2: Scalability analysis of the PHD platform.

From: A scalable, secure, and interoperable platform for deep data-driven health management

a The workload size was 100 MB (historical data), and the CPU utilization for the regional sub-clusters (i.e., east and west) increased significantly for batches with more than 512 concurrent jobs. b The workload size was 1 MB (real-time data), and the CPU utilization for the regional sub-clusters increased only slightly even for 4096 concurrent jobs, with a maximum response time below 10 seconds. c The workload was a mixture of historical and real-time jobs: 20% of the jobs were 100 MB, and the rest were 1 MB. If we consider a 30-second response time as an acceptable threshold, then for 2048 concurrent jobs, the 95th-percentile response time is still below 30 seconds while CPU utilization remains below 40%. Therefore, a CPU utilization of 40% is a good threshold for scaling up the cluster. d Five batches of 1000 jobs were submitted back to back, each job needing one vCPU. When we submitted the first batch of 1000 jobs, it took almost 35 minutes for Kubernetes to scale up from 3 nodes to 250 nodes of the N1 machine type with 4 virtual CPUs each (Google Cloud n1-standard-4), i.e., 1000 vCPU cores in total, and the maximum response time was 35 minutes for batch 1 (B1). For batch 2 to batch 5 (B2, ..., B5), the cluster had enough computing resources and processed each entire batch of jobs in almost 2 minutes. Finally, after new jobs were no longer sent to the cluster, CPU utilization dropped, and it took 90 minutes for Kubernetes to scale down to 3 nodes. e, f Weak scaling experiments on the ML cluster, which uses distributed Pub/Sub messaging and a Kubernetes cluster. These panels indicate that the ML cluster scales well under weak scaling and keeps the execution time within the same range. The experiments in a, b, c, e, and f were run five times. Data points, average values, and standard deviations are reported.
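The arithmetic behind panels c and d can be illustrated with a minimal sketch. It is not code from the PHD platform: the function names are hypothetical, the nearest-rank percentile definition is an assumption, and treating "CPU utilization at or above 40%" as the scale-up trigger is one reading of the caption. Only the figures themselves (30-second limit, 95th percentile, 40% utilization, one vCPU per job, 4 vCPUs per node) come from the caption.

```python
import math


def nodes_needed(jobs, vcpus_per_job=1, vcpus_per_node=4):
    """Smallest node count whose combined vCPUs cover all concurrent jobs."""
    return math.ceil(jobs * vcpus_per_job / vcpus_per_node)


def percentile(values, q):
    """Nearest-rank percentile (an assumed definition): the smallest value
    such that at least q percent of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def meets_response_limit(response_times_s, limit_s=30.0, q=95):
    """True if the q-th percentile response time is within the limit."""
    return percentile(response_times_s, q) <= limit_s


def should_scale_up(cpu_utilization, cpu_threshold=0.40):
    """Hypothetical trigger: scale up once utilization reaches the threshold."""
    return cpu_utilization >= cpu_threshold


if __name__ == "__main__":
    # Panel d: 1000 one-vCPU jobs on 4-vCPU nodes.
    print(nodes_needed(1000))
    # Panel c style check on made-up response times (seconds).
    sample = [5.0] * 95 + [60.0] * 5
    print(meets_response_limit(sample), should_scale_up(0.35))
```

With one vCPU per job and 4 vCPUs per node, `nodes_needed(1000)` gives the 250 nodes quoted for panel d. The response times in the usage section are invented purely to exercise the functions and do not reproduce the measured data.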
