Abstract
This work develops a novel method for efficient dynamic embedding table sharding during recommendation inference. Although existing works have developed various sharding methods to reduce network overhead, they often focus on training-centric approaches and overlook inference-specific challenges such as evolving co-occurrence patterns and latency sensitivity. As such, these methods can be suboptimal for real-time recommendation inference scenarios. In this paper, we propose a new method named ALIS, short for Aggregate based Light Incremental Sharding, to address these challenges. ALIS features two key mechanisms: aggregate based sharding, which enhances sharding stability and quality by integrating historical and current sharding results, and light incremental sharding, which reduces dynamic sharding costs by combining iterative and statistics-based approaches. The evaluation of ALIS demonstrates its superior performance over the state-of-the-art method in reducing network overhead and improving inference efficiency. Both quantitative results and qualitative analysis verify the superiority and rationality of our ALIS method.
Introduction
Recommendation models are essential for predicting user behaviour on online platforms like click-through and conversion rates1,2. Deep learning (DL) models are now the leading framework due to their high performance. Notably, companies like Facebook and Baidu have adopted these extensively. Facebook relies on DNN-based models for up to 80% of their recommendation processes3, while Baidu uses DL models to handle millions of search queries per second4.
However, the deployment of these highly efficient DL-based recommendation models in production settings presents significant challenges, mainly due to the large and sparse embedding tables required. These tables transform sparse inputs into dense vectors for DNN processing. Their size is proportional to the number of users and items, leading to large scales in modern models5. For example, Baidu’s embedding tables for sponsored search and Ads retrieval exceed 10TB, far beyond a single server’s memory capacity6.
To address the memory constraints imposed by such massive embedding tables, horizontal scaling has become a common strategy for the deployment of large-scale inference models7. This approach involves distributing the model across a cluster of multiple nodes, employing hybrid parallelism8. Specifically, model parallelism is applied to the sparse embedding tables, where each node stores only a fraction of the complete embedding table. In contrast, the dense neural network layers are typically handled using data parallelism, where each node runs an identical copy of the DNN component of the model9.
Although horizontal scaling effectively mitigates memory limitations on individual servers, it introduces network communication overhead during the embedding lookup process. The embedding lookup operation necessitates gathering embedding vectors from table shards distributed across multiple nodes, generating network requests for remote data access. Crucially, these network requests are substantially slower than other operations, such as primary memory access or CPU computations. Consequently, they can become a critical bottleneck in the inference pipeline, negatively impacting both latency and throughput10.
Limitations of static sharding in inference scenarios. (a) Performance degradation: distribution of access costs for static sharding shifts higher and becomes more dispersed as data access patterns evolve; (b) quality fluctuation: dynamic updates to static sharding algorithms resulting in significant variations in access latency across different shards.
Current research on mitigating this network overhead often involves sharding strategies that group frequently co-occurring embeddings on the same node to decrease lookup overhead11. However, these strategies are predominantly designed for the relatively static nature of model training, overlooking the distinct challenges of inference. Inference workloads are characterized by a continuous influx and evolution of datasets, leading to temporal shifts in co-occurrence relationships. Consequently, applying static sharding—where partitions are fixed based on an initial data snapshot—to dynamic inference scenarios results in deteriorating performance and quality over time, as illustrated in Fig. 1. As depicted in Figure 1a, a notable increase and broadening in the distribution of access costs is observed, signaling a decline in sharding quality and leading to higher, less predictable access latencies. Figure 1b further reveals significant fluctuations in sharding quality, disrupting operational efficiency. This underscores the inadequacy of static approaches for latency-sensitive inference tasks.
The evident need for adapting to these dynamic changes points towards dynamic sharding strategies. However, existing dynamic approaches also present their own set of limitations, as highlighted in our comparative analysis in Table 1. For instance, Periodic Re-sharding, which involves re-running a static sharding algorithm at intervals, suffers from high computational overhead for updates and can lead to significant performance fluctuations or service instability during the re-sharding process. While it offers moderate adaptability, the associated costs and intermittent quality drops are undesirable. Conversely, Simple Heuristic Incremental Sharding methods offer lower overhead per adjustment and higher reactivity to evolving patterns. However, they often rely on greedy, local decisions based primarily on recent data, which can lead to suboptimal global sharding configurations, unstable sharding quality over the long term, and limited leverage of comprehensive historical access patterns. These approaches may minimize immediate computational cost per step but risk performance dips due to this potential instability or suboptimal states. Therefore, an ideal dynamic sharding solution should not only achieve high sharding quality, meaning an effective partition of embedding tables across devices that minimizes inference latency and maximizes throughput by optimizing embedding access and reducing inter-device communication, but also maintain high sharding stability, which refers to the consistency of these sharding outcomes and the resulting inference performance over time, especially in the face of evolving data patterns. Current methods often struggle to balance these two crucial aspects with computational efficiency. Thus, a gap exists for a dynamic sharding technique that can achieve high adaptability and quality stability without incurring prohibitive computational costs or operational disruptions.
This paper describes ALIS, an embedding sharding method that supports efficient and effective sharding during recommendation inference. Specifically, ALIS introduces two new mechanisms:
Aggregate based sharding
To mitigate performance fluctuations caused by the instability of sharding outcomes, we introduce an aggregate based sharding strategy. This approach evaluates current and previous sharding results and integrates high-performing sharding outcomes to minimize randomness-induced performance dips. Additionally, acknowledging that iterative sharding performance is influenced by its initial state, we optimize this state for improved results.
Light incremental sharding
To eliminate redundant computations arising from the inference process, we combine an iterative-based sharding method with a statistics-based sharding strategy. The former is utilized to process short-term data to gain co-occurrence relationships, while the latter preserves sharding information from historical data. This approach significantly reduces the overhead of the sharding process by avoiding the repetitive processing of historical data using the iterative method.
In summary, the contributions are as follows:
-
We introduce an aggregate based sharding strategy, thereby improving the quality of the shard during the dynamic sharding process.
-
We propose a light incremental sharding method that reduces the cost of dynamic sharding while ensuring the quality of the shard.
-
We compare ALIS with the state-of-the-art method using real-world datasets to substantiate its performance.
Problem formulation
We first give the notations used in this paper and then formalize the problem of dynamic embedding table sharding for recommender systems.
Notations
Let E be the embedding table, where \(E = \{e_1, e_2, \dots , e_m\}\) represents the set of embedding entries, with each \(e_i\) corresponding to a unique item. Let \(P_1, P_2, \dots , P_n\) represent the n shards sets produced by successive sharding processes, let \(C_1, C_2, \dots , C_n\) be the respective access overheads associated with each shards set, and let \(h_1, h_2, \dots , h_s\) be the s shards within a shards set \(P_i\) (\(i \in \{1, 2, \dots , n\}\)). The data allocated for sharding is denoted as D. Let \(T_{shard}\) be the time taken to perform the sharding process, including data redistribution and shard creation; minimizing \(T_{shard}\) is desired to ensure timely updates and adaptability.
Problem definition
The goal of dynamic embedding table sharding is to split the embedding table E such that the sharding minimizes access costs during inference while maintaining high accuracy in sharding, especially under dynamically changing datasets. Specifically, given previous shards sets and associated access costs, the task is to find the optimal sharding strategy incorporating current and historical performance feedback. Formally, the problem can be defined as:
Input
A set of previous shards sets \(P_1, P_2, \dots , P_n\), their corresponding access costs \(C_1, C_2, \dots , C_n\), and the currently received inference data \(D_{new}\) containing new embedding entries. We also consider the sharding time \(T_{old}\) incurred in the previous sharding process.
Output
New shards set \(P_{n+1}\) and a map function \(f: E \rightarrow h_1 \cup h_2 \cup \dots \cup h_s\), which assigns each embedding entry in E to one of the s shards in a way that minimizes access costs and improves sharding quality and efficiency based on historical performance feedback.
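For concreteness, one way to state this objective, writing the access cost of the candidate shards set as a functional \(C_{n+1}(\cdot)\) and introducing a sharding-time budget \(T_{budget}\) (our notation, not part of the formulation above), is:

$$\min_{P_{n+1},\, f} \; C_{n+1}\bigl(P_{n+1}, f \mid D_{new}, \{P_i, C_i\}_{i=1}^{n}\bigr) \quad \text{subject to} \quad T_{shard} \le T_{budget}.$$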
Methodology
The proposed ALIS method presents two novel mechanisms: aggregate based sharding, which enhances the stability and quality of dynamic sharding, and light incremental sharding, which reduces computational effort while maintaining sharding performance.
Aggregate based sharding
Existing iteration-based sharding methodologies exhibit issues with quality. For instance, the direct sharding approach is a basic strategy for implementing dynamic sharding. This technique executes an iteration-based sharding method repeatedly, treating each instance separately. As illustrated in Fig. 2a, the direct sharding approach generates three distinct shards sets during the dynamic sharding process. Shards set 1 is constituted from the training dataset, whereas shards sets 2 and 3 are formed from newly incoming data during inference. Notably, these three sharding processes operate independently of one another. Consequently, each sharding operation begins by randomly initializing the embedding shards, followed by iterative optimization using the relevant data.
A significant challenge in employing this sharding algorithm stems from its intrinsic randomness, which introduces considerable variability throughout the dynamic sharding procedure. Specifically, the iteration-based sharding methodology comprises two distinct phases: randomized initialization and iterative optimization. Initially, the algorithm requires randomly initializing the sharding positions for embedding table entries. As subsequent iterative optimizations are contingent upon the initial conditions arising from this randomized initialization, the ensuing sharding outcomes are inherently stochastic. This propensity for randomness results in performance fluctuations during the dynamic sharding process. As illustrated in Fig. 2a, given the complete independence of these three sharding processes, the efficacy of the shards set is intrinsically linked to random initialization. For example, shards set 2 demonstrates a reduced overhead compared to shards set 1 and 3, which is attributed to a fortuitous initialization result. This can precipitate a degradation in overall sharding performance and potentially increase the access overhead of embedding tables, as exemplified by shards set 3.
Dynamic sharding exhibits distinctive attributes that present opportunities to enhance the quality of shards. In contrast to static sharding, which yields a solitary sharding configuration, dynamic sharding produces a multiplicity of sharding configurations throughout the inference process. These configurations are immediately deployed within the recommender system, delivering real-time performance feedback. Consequently, this presents an opportunity to leverage existing sharding configurations and their associated performance feedback to enhance the quality of subsequent shards during the dynamic sharding procedure. Based on this observation, a central challenge emerges: effectively utilizing preceding sharding configurations and performance feedback when generating novel shards.
To address this challenge, we introduce an aggregate based sharding methodology that transcends the paradigm of purely independent sharding based solely on contemporaneous data. Instead, our approach judiciously integrates prior sharding outcomes and performance feedback. This assimilation of historical results is intentionally designed to bolster the efficacy of the present sharding endeavor. As delineated in Algorithm 1, our proposed algorithm augments direct sharding with an aggregate processing stage, effectively leveraging past sharding outcomes and performance feedback to refine the current sharding configuration.
In operational terms, the aggregate based sharding method initially implements the preceding n shards sets, denoted as \(P_1, P_2, \dots , P_n\), within the recommender system, and records the access overhead associated with each shards set, represented as \(C_1, C_2, \dots , C_n\). When generating a new shards set, denoted as \(\hat{P}\), the aggregate based method deviates from producing a definitive sharding immediately. Instead, it employs a standard direct sharding technique to generate an interim sharding result, \(P_{n+1}\). Subsequently, the algorithm integrates this intermediate result with historical shards sets, informed by the recorded performance metrics, to produce the final shards set. We have elected to adopt a soft voting mechanism to implement the shard integration process for pragmatic simplicity and computational efficiency. Specifically, for each embedding entry, the algorithm examines the position, denoted j, of the entry within each historical shards set \(P_i\) and increases the corresponding weight in \(W_{ej}\). Each weight component is derived from two factors: the inverse of the access overhead \(C_i\) and a temporally sensitive decay coefficient \(\alpha (I)\).
Consider the illustrative example in Fig. 2b, in which the third sharding process is examined. The aggregate based method commences by producing an intermediate sharding result. Subsequently, it factors into the access performance metrics of shards set 1 and shards set 2, integrating these insights with the current intermediate result to derive shards set 3. Given that shards set 3 preferentially incorporates shards set 2 (which exhibits demonstrably superior access performance) by assigning it a higher weight, the resultant access performance of shards set 3 surpasses that of the intermediate result alone.
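To make the soft-voting step concrete, the following is a minimal sketch; the dictionary-based shard representation, the unit vote for the interim result, and the exact form of the decay coefficients are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def aggregate_soft_voting(interim, history, costs, decay, num_shards):
    """Combine an interim sharding with historical shards sets via
    cost- and age-weighted soft voting.

    interim : dict {embedding_id: shard_index} from the current direct sharding
    history : list of dicts, one per previous shards set P_1..P_n
    costs   : list of access overheads C_1..C_n for those shards sets
    decay   : list of age-decay coefficients alpha(I), one per historical set
    """
    entries = set(interim)
    for h in history:
        entries.update(h)

    final = {}
    for e in entries:
        weights = np.zeros(num_shards)
        # votes from historical shards sets, weighted by 1/C_i and the decay
        for h, c, a in zip(history, costs, decay):
            if e in h:
                weights[h[e]] += a * (1.0 / c)
        # vote from the interim result of the current direct sharding
        if e in interim:
            weights[interim[e]] += 1.0
        final[e] = int(np.argmax(weights))
    return final
```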
Light incremental sharding
The limitation of greedy-based sharding in a dynamic sharding scenario. In the weight matrix, higher weights are underlined, indicating that the embedding will be assigned to the shard with the higher weight. The numbers in red indicate updates made based on the adjacency of the embeddings in the currently received request.
When applied in a dynamic sharding scenario, the existing complete sharding method processes the same datasets multiple times. An intuitive solution is to use a greedy-based sharding algorithm instead of applying an iterative-based sharding algorithm to the entire historical data, thereby eliminating redundant processing of historical data during dynamic sharding. During dynamic sharding, the greedy-based sharding algorithm records the positions of embedding entries. For each embedding entry, the algorithm considers the positions of its co-occurring embedding entries, i.e., its neighbors. The algorithm tends to assign the embedding entry to the shard which contains most of its neighbors. The algorithm maintains a weight matrix, representing the tendency of each embedding item to be allocated to various shards. The left of Fig. 3a represents an example of a weight matrix. In this matrix, embedding entries are denoted by a, b, c, d, and e. Entries a, b, and c have higher weights for shard 1; thus, they will be allocated to shard 1. In contrast, d and e will be allocated to shard 2. When a new request arrives containing a, b, and d, the algorithm updates the weight matrix, which is illustrated on the right in Fig. 3a. For example, for the embedding entry d, since its neighbors a and b are currently allocated to shard 1, d will be more inclined to be allocated to shard 1, thus increasing its weight. After completing the weight updates, the algorithm greedily allocates each embedding entry to the shard with the highest weight.
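A minimal sketch of this weight-update-and-reassign step follows; the data structures and the unit vote increment are illustrative assumptions.

```python
from collections import defaultdict
import itertools

def greedy_update(weights, assignment, request, num_shards):
    """Greedy weight update as sketched in Fig. 3a (illustrative structures).

    weights    : mapping entry -> list of per-shard weights
    assignment : mapping entry -> current shard index (0-based)
    request    : co-occurring embedding entries in the new request
    """
    for entry, neighbor in itertools.permutations(request, 2):
        # each co-occurring neighbor votes for the shard it currently occupies
        weights[entry][assignment[neighbor]] += 1
    for entry in request:
        # greedily reassign the entry to its highest-weight shard
        assignment[entry] = max(range(num_shards), key=lambda s: weights[entry][s])
    return weights, assignment

# the example of Fig. 3a with illustrative weights; shards are 0-indexed here
weights = defaultdict(lambda: [0, 0],
                      {'a': [3, 1], 'b': [3, 1], 'c': [2, 0], 'd': [1, 2], 'e': [0, 3]})
assignment = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1}
greedy_update(weights, assignment, ['a', 'b', 'd'], num_shards=2)
```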
Compared to the complete sharding method, the greedy-based method only needs to update the weights based on newly arrived data each time a sharding is generated, thus providing a new sharding without the need to process historical data repetitively. Therefore, it exhibits superior efficiency in dynamic sharding. However, the greedy-based sharding method fails to capture the co-occurrence of embedding entries, leading to a significant decline in sharding quality. Figure 3b illustrates this inability. In the earlier requests, a and b often appeared together with c, so the sharding algorithm allocated a, b, and c to shard 1, while e was allocated to shard 2. In the new access pattern (denoted in red), a and b instead appear together with e. The greedy-based sharding algorithm fails to capture this changed co-occurrence relationship and can only update the weights of a and b based on their current location (shard 1). This leads to a continuous increase in the weights of a and b in shard 1, preventing their reassignment to shard 2.
The co-occurrence relationships between embedding entries have a significant impact on sharding performance. However, the greedy-based sharding method ignores these co-occurrence relationships, leading to a decline in access performance. A sharding algorithm should be able to capture the co-occurrence relationships in recent data to avoid degradation in sharding quality. At the same time, the method should minimize the repetitive processing of historical data, thereby reducing the sharding cost during dynamic sharding. To achieve these goals, we propose a light sharding method in this section.
Local iterative sharding
The light sharding algorithm employs an iterative-based sharding approach to handle short-term data and uses a statistics-based sharding method to record sharding information in historical data. The implementation is shown in Algorithm 2.
The light sharding process commences with a local iterative sharding phase, specifically designed to handle newly arrived data, denoted as \(D_{new}\). In this stage, we leverage the efficacy of iterative sharding algorithms to analyze the most recent data segment. For each new batch of data \(D_{new}\), we prepare a localized dataset \(D_{local}\) if necessary. This preparation might involve mapping global embedding identifiers to a local namespace to streamline the iterative process. This mapping can be represented as:

$$D_{local} = \{\, LID(e) \mid e \in D_{new} \,\},$$

where LID(e) is a function that maps a global embedding entry e to its local identifier relevant within \(D_{new}\). This preliminary step is crucial for enhancing computational efficiency. By performing this mapping, we significantly reduce the search space for the iterative algorithm, preventing unnecessary traversals of the entire embedding table, which may contain entries not present in \(D_{new}\).
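As an illustration, a minimal sketch of the global-to-local ID mapping, assuming \(D_{new}\) is a list of requests, each a list of global embedding IDs:

```python
def build_local_ids(d_new):
    """Map each global embedding id seen in D_new to a dense local id so the
    iterative partitioner only works over entries present in D_new."""
    lid = {}
    for request in d_new:                      # each request: list of global ids
        for gid in request:
            if gid not in lid:
                lid[gid] = len(lid)
    d_local = [[lid[gid] for gid in request] for request in d_new]
    return d_local, lid
```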
Furthermore, to foster consistency between the newly generated local shards set and the established historical sharding configurations and to avoid a significant divergence that could lead to a loss of global information relevance, our light method incorporates historical shards sets for initialization. We denote the historical sharding information by \(P_{global\_old}\). This historical context is used to initialize the iterative sharding process, ensuring continuity and preventing the local sharding from being entirely isolated from the global data distribution. This initialization step can be conceptually represented as preparing an initial state \(P_{hist\_init}\) based on the historical shards sets:

$$P_{hist\_init} = InitializationFunction(D_{new}, P_{global\_old}).$$
The InitializationFunction provides a concrete method for initializing the local iterative sharding process by leveraging the existing global sharding context, as detailed in Algorithm 3. For each unique global embedding ID present in the newly arrived data batch \(D_{new}\), this function queries the most recent global historical sharding configuration, \(P_{global\_old}\), to retrieve the shard to which that embedding was previously assigned. This retrieved shard index then serves as the initial assignment for the corresponding local identifier within the \(P_{hist\_init}\) structure, which is subsequently used by the IterativePartition algorithm. If an embedding in \(D_{new}\) is not found in \(P_{global\_old}\), a default strategy, such as random assignment to one of the S available shards, is employed for its initial local placement. This approach ensures that the local sharding commences from an informed state that reflects established global patterns, promoting consistency and potentially accelerating convergence towards an effective local sharding solution for \(D_{new}\), rather than starting from a purely random or arbitrary configuration.
A critical consideration is the impact of the random assignment strategy (Line 9 in Algorithm 3) on the convergence of the sharding process, particularly for new embeddings not found in \(P_{global\_old}\). This random initialization does not impede, but rather facilitates, stable convergence for several reasons. Firstly, the random assignment serves merely as an ephemeral starting point for the IterativePartition step. The dominant factor driving the final placement of these new embeddings is the rich co-occurrence information present within the local dataset \(D_{local}\). During the iterative optimization (e.g., k-means or graph partitioning), a new embedding will be rapidly “pulled” towards the shard containing the majority of its co-occurring neighbors from the current data batch. Therefore, its initial random placement is quickly corrected based on immediate, relevant data patterns. Secondly, this random assignment typically affects only a minority of embeddings, namely those that are truly new or have not been seen recently. The majority of embeddings in \(D_{new}\) are initialized based on the stable historical context of \(P_{global\_old}\), which provides a strong, stable structural backbone for the local sharding process. This stable foundation helps guide the placement of the few randomly initialized embeddings into a coherent global structure. In essence, this approach effectively handles the “cold-start” problem for new embeddings with a neutral, unbiased initialization, while relying on the power of the subsequent iterative process to ensure they converge to an optimal placement based on their observed behavior.
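A minimal sketch of this initialization logic follows; the dict-based structures reuse the mapping sketch above and are assumptions for illustration, not the paper's implementation.

```python
import random

def initialization_function(lid, p_global_old, num_shards):
    """Historical initialization as described for Algorithm 3.

    lid          : dict {global_id: local_id} built for the current D_new
    p_global_old : dict {global_id: shard_index}, most recent global sharding
    Returns p_hist_init: dict {local_id: initial shard index}.
    """
    p_hist_init = {}
    for gid, local_id in lid.items():
        if gid in p_global_old:
            # reuse the shard this embedding held in the previous global sharding
            p_hist_init[local_id] = p_global_old[gid]
        else:
            # cold-start: neutral random placement, later corrected by the
            # iterative optimization over D_local
            p_hist_init[local_id] = random.randrange(num_shards)
    return p_hist_init
```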
With the localized data \(D_{local}\) and historical initialization \(P_{hist\_init}\), we apply an iterative sharding algorithm, such as k-means or a graph-based sharding method tailored to embedding data. The outcome of this phase is a locally optimized sharding, \(P_{local}\), represented as:

$$P_{local} = IterativePartition(D_{local}, P_{hist\_init}),$$
where \(IterativePartition(\cdot , \cdot )\) represents the chosen iterative sharding algorithm applied to the local data, initialized with historical context.
Global shards set generation
Following the local iterative sharding phase, the global shards set generation stage is initiated to integrate the insights gained from the local sharding with the accumulated global historical information represented through a weight matrix. The first step involves reversing the local ID mapping, if one was applied, to correlate the locally sharded embedding entries back to their global identities. This ensures that the sharding information is correctly associated with the global embedding space.
For each embedding entry e present in \(D_{new}\), we examine its sharding assignment in \(P_{local}\). Let \(P_{local\_index}(LID(e))\) denote the index of the shard in \(P_{local}\) to which the locally identified embedding entry LID(e) is assigned. We then update the global weight matrix, W, based on these local sharding assignments. For an embedding entry e and its assigned sharding index \(j = P_{local\_index}(LID(e))\), the weight matrix update rule can be defined as:

$$W_{e, j}^{new} = W_{e, j}^{old} + \Delta W,$$
where \(\Delta W\) is a positive weight increment, which could be a constant or a function of factors like the frequency of e in \(D_{new}\) or the confidence in the local shards set. For all other shards \(k \ne j\), the weights for embedding e remain unchanged: \(W_{e, k}^{new} = W_{e, k}^{old}\). This iterative update process effectively incorporates the co-occurrence patterns observed in the recent data into the global sharding criteria.
Finally, after updating the weight matrix based on all embedding entries in \(D_{new}\), we generate the new global shards set, \(P_{global\_new}\). This is achieved by allocating each embedding entry \(e_i\) to the shard corresponding to the maximum weight in the updated weight matrix \(W^{new}\). This greedy allocation step can be formulated as:

$$P_{global\_new}(e_i) = \mathop {\arg \max }\limits_{k \in \{1, \dots , s\}} W_{e_i, k}^{new},$$
where s is the total number of shards. This process produces the updated global shards set, \(P_{global\_new}\).
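A minimal sketch of this stage, assuming the global weight matrix W is a dense \(|E| \times S\) array indexed by integer global IDs and \(\Delta W\) is a constant:

```python
import numpy as np

def global_shards_generation(p_local, lid, weights, delta_w=1.0):
    """Fold the local result back into the global weight matrix and regenerate
    the global shards set by greedy argmax allocation.

    p_local : dict {local_id: shard_index} from IterativePartition
    lid     : dict {global_id: local_id} from the earlier mapping step
    weights : np.ndarray of shape (|E|, S), the global weight matrix W
    """
    for gid, local_id in lid.items():
        j = p_local[local_id]
        # reward the shard chosen for this entry by the local sharding
        weights[gid, j] += delta_w
    # allocate every embedding to its maximum-weight shard
    p_global_new = np.argmax(weights, axis=1)
    return p_global_new, weights
```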
A key advantage of this light method is that it avoids redundant processing of the entire historical dataset. It only requires iterative processing of the newly arrived small-scale data, \(D_{new}\), making it significantly more cost-effective than complete iterative sharding methods while effectively capturing the co-occurrence information embedded within the most recent access patterns. For greater clarity, we present the whole process more visually in Fig. 4. The descriptions of the relevant annotations in the figure are consistent with those in Fig. 3.
Discussion
Pre-training
Pre-training can be beneficial for initializing the embedding shards before dynamic updates. A static pre-sharding using iterative methods on historical data can provide a good starting point. For both aggregate based and light incremental sharding, pre-training can improve initial shard quality and potentially accelerate convergence during dynamic shards. However, ALIS is designed to function effectively even without pre-training, adapting to dynamic data from random initial shards, albeit possibly requiring a more extended warm-up phase. Pre-training thus serves as an optional acceleration technique rather than a prerequisite for ALIS’s operation.
Time complexity analysis
The overall time consumed by the model can be divided into the training phase and the inference/recommendation phase. The training phase involves the time to train the underlying recommender model (\(T_{model\_train}\)) and potentially an initial, one-time (or infrequent) global sharding of the entire embedding table |E|. The primary focus of ALIS is on dynamic sharding during the inference phase. These dynamic updates are typically performed periodically, not for every recommendation. For Aggregate based sharding, the update time is chiefly determined by the need to process all |E| embeddings against n historical configurations and S current shard options, plus the \(T_{direct}(|E_{new}|)\) cost for new data. Light Incremental sharding aims for faster updates by confining its iterative sharding component \(T_{local\_iter}(|E_{new}|)\) to only the new \(|E_{new}|\) embeddings. However, its global weight update and final shard assignment still involve operations proportional to \(|E| \cdot S\). The actual wall-clock time for these sharding updates depends on specific parameter values (e.g., |E|, \(|E_{new}|\), S, n), hardware, and the chosen frequency of updates, with the cost being amortized across multiple recommendation requests.
Computational complexity analysis
The computational complexity of training the base recommender model is \(T_{model\_train}\). If an initial full sharding of the |E| embeddings into S shards is performed using an iterative method, its complexity is \(O(I_{init} \cdot |E| \cdot S)\), where \(I_{init}\) is the number of iterations for this global initialization. During the inference phase, for dynamic updates using aggregate based sharding, the computational complexity per update is \(O(T_{direct}(|E_{new}|) + |E| \cdot (n + S))\). For light incremental sharding, the complexity per update is \(O(T_{local\_iter}(|E_{new}|) + |E_{new}| + |E| \cdot S)\). In these expressions, \(T_{direct}(|E_{new}|)\) and \(T_{local\_iter}(|E_{new}|)\) represent the costs of sharding the relatively small set of new embeddings \(|E_{new}|\). The terms involving |E| (the total number of embeddings) reflect operations that span the entire embedding table, such as aggregating votes or assigning embeddings based on global weights. Light incremental sharding’s efficiency gain stems from \(T_{local\_iter}(|E_{new}|)\) being significantly less than a full iterative sharding on |E|.
Experiments
This paper undertakes a series of experiments to evaluate the performance of the proposed ALIS method and verify the performance of its fundamental components: light incremental sharding and aggregate based sharding. These experiments are structured to investigate the inference latency, throughput, and sharding efficiency of ALIS. Through empirical analysis, we seek to address the following research questions:
-
RQ1: Does the proposed ALIS achieve superior recommendation inference performance compared to state-of-the-art embedding table sharding strategies, specifically regarding reduced latency and increased throughput?
-
RQ2: Does the ALIS preserve the accuracy of recommendation models, ensuring that performance gains are not achieved at the expense of recommendation quality?
-
RQ3: Do both of the sharding mechanisms of the ALIS collectively contribute to reducing the overhead of dynamic sharding and enhancing the system’s stability and quality of the sharding process?
-
RQ4: How do the key hyper-parameters affect the performance of ALIS?
Experimental setup
Implementation details
Our experiments were conducted on a single node of a GPU computing cluster. The dedicated server node was equipped with two Intel Xeon Platinum 8360Y processors (36 cores each), 512 GB of DDR4-3200MHz RAM, and a 2 TB NVMe SSD. The node featured eight NVIDIA A100 GPUs (40GB HBM2e each), interconnected via NVLink 3.0, providing a total bidirectional bandwidth of 600 GB/s for high-speed intra-node communication. Our implementation was developed using PyTorch 2.3.1 on Ubuntu 20.04, with CUDA 11.8 and cuDNN 8.2.1. The TensorFlow Default baseline was executed using TensorFlow 1.15.
For the recommendation models, we used an embedding dimension of 16. The deep neural network component for DeepFM, DCN, and DLRM consisted of a 3-layer MLP with dimensions [256, 128, 64] and ReLU activations. Models were trained for a single epoch on each dataset with the Adam optimizer, a learning rate of 0.001, and a batch size of 4096.
To ensure consistency across experiments, all sharding strategies, including our proposed ALIS, partitioned the embedding tables into eight shards, with each shard assigned to one of the 8 GPUs on the node. To simulate a realistic production scenario, the dynamic sharding process was triggered deterministically after every one million inference requests were processed, ensuring updates are proportional to data traffic. Based on our sensitivity analysis, the key hyperparameters for ALIS were set to their optimal values: the incremental data ratio, \(\alpha\), was set to 0.5, and the aggregate history weight, \(\beta\), was set to 0.8.
Datasets
We utilized four publicly available, large-scale datasets widely used in recommender system research: the Criteo Advertising Data, Avazu Click-Through Rate Prediction datasets, KDD 2012, and Taobao User Behavior, whose statistics are listed in Table 2.
-
Criteo advertising data: this dataset, provided by Criteo Labs, contains click-through data from advertising campaigns. It is characterized by its massive scale and high dimensionality, featuring categorical features that necessitate large embedding tables. This dataset effectively simulates industrial-scale recommendation scenarios where embedding table size is challenging.
-
Avazu click-through rate prediction: the Avazu dataset, from a Kaggle competition, focuses on predicting click-through rates for mobile advertising. Similarly to Criteo, it features a large number of categorical features and a substantial volume of data. Avazu provides another realistic benchmark to evaluate the performance of embedding table sharding strategies under heavy inference loads.
-
KDD 2012: the KDD Cup 2012 dataset, sponsored by Tencent, centers on predicting the click-through rate of ads on a web search engine. It includes user and ad features and is notable for its large volume of training instances derived from search session logs.
-
Taobao user behavior: to validate the generalizability of ALIS beyond the advertising domain, we also include the Taobao dataset from a large e-commerce platform. This dataset contains user-item interactions and is characterized by a pronounced power-law distribution in item popularity, presenting a different challenge for dynamic sharding strategies by creating highly skewed access patterns.
Model selection
To evaluate the generalizability of our proposed ALIS across different recommendation model architectures, we employed four widely adopted DL-based models. These include DeepFM (Deep Factorization Machine) 12, DCN (Deep & Cross Network) 13, DLRM (Deep Learning Recommendation Model) 14, and DIN (Deep Interest Network) 15. This expanded architectural diversity provides a comprehensive testbed to assess the robustness and wide applicability of ALIS.
Baselines
To comprehensively evaluate the performance of ALIS, we compared it with the following baseline methods:
-
TensorFlow Default (TF Default): this represents TensorFlow’s default embedding handling, serving as a baseline. In TensorFlow 1.15, due to the lack of built-in sharding, large embedding tables are generally constrained to be placed in CPU memory, not GPU. Despite this, the GPU is still utilized during inference after embedding lookups are performed on the CPU, with resulting vectors transferred to the GPU for computation.
-
Range Sharding (Range): in Range Sharding, embedding IDs are sharded based on their numerical ranges. This strategy groups consecutive IDs together, potentially improving data locality for specific access patterns, but can be susceptible to data skew if IDs are not uniformly distributed. We include range sharding to evaluate the impact of locality-focused sharding.
-
HugeCTR: as a GPU-accelerated high-performance framework, HugeCTR employs various acceleration strategies, notably caching embedding vectors in GPU memory. This caching mechanism significantly reduces memory access latency and speeds up lookup operations, dramatically improving inference performance. HugeCTR represents a strong industry baseline optimized for large embedding tables, utilizing its own efficient embedding table management and lookup mechanisms on GPUs.
-
EVStore: a state-of-the-art system designed to optimize embedding table lookups specifically for the inference phase, featuring a multi-tier caching hierarchy. It dynamically manages cache content using novel, group-aware replacement policies and further reduces latency through mixed-precision and approximate lookups, adapting to real-time access patterns to minimize disk I/O.
Evaluation metrics
We primarily focused on the following evaluation metrics to assess the performance of the different sharding strategies:
-
Inference latency: measured as the time taken to process a single inference request. Specifically, for latency evaluation (both average and P99), we used an inference batch size of 1. The latency was recorded on the server-side, measuring the wall-clock time from the moment the input features for a request were passed to the model’s inference function until the corresponding prediction output was generated (a minimal measurement sketch follows this list).
-
Throughput: defined as the number of inference requests processed per second (queries/s). Throughput was measured by sending a continuous stream of inference requests from a dedicated, multi-threaded client application designed to saturate the inference server. For throughput benchmarks, we used an inference batch size of 256 to maximize GPU utilization and better reflect a high-load production scenario. The system was allowed a warm-up period of 60 seconds under this sustained load. Subsequently, the number of successfully completed requests was counted over a measurement window of 5 minutes.
-
AUC (area under the ROC curve): used to validate that sharding strategies do not degrade recommendation model accuracy. A higher AUC value indicates better model performance in ranking relevant items, ensuring that the system maintains the quality of recommendations.
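The sketch below illustrates how such server-side latency measurements (average and P99, batch size 1) can be collected; model_infer and the request iterator are placeholders rather than our actual serving stack.

```python
import time
import statistics

def measure_latency(model_infer, requests):
    """Record per-request wall-clock latency (batch size 1) on the server side
    and return the average and P99 values in milliseconds."""
    latencies_ms = []
    for features in requests:
        start = time.perf_counter()
        _ = model_infer(features)            # hand-off to the inference function
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    p99 = latencies_ms[int(0.99 * (len(latencies_ms) - 1))]
    return statistics.mean(latencies_ms), p99
```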
Inference performance comparison (RQ1)
To evaluate the inference performance and generalizability of our proposed ALIS, we conducted a series of experiments on four large-scale datasets: Criteo, Avazu, KDD Cup 2012, and Taobao. These experiments compare ALIS against a comprehensive set of baselines (TF Default, Range Sharding, HugeCTR, and EVStore) using the DeepFM model. Our evaluation focused on key inference metrics: average and P99 latency, and throughput.
The experimental results, presented in Fig. 5, reveal a clear and consistent performance hierarchy. For any given sharding method, the inference latency is directly correlated with the dataset’s feature field count. The Taobao dataset, with only 4 fields, yields the lowest latency, followed by KDD 2012 (14 fields), Avazu (22 fields), and finally Criteo (26 fields), which is the slowest. This trend establishes a clear performance baseline determined by the inherent workload of each dataset.
More critically, ALIS consistently establishes itself at the top of this performance hierarchy, demonstrating its superiority not just in advertising but also in e-commerce domains. As shown in Table 3, to rigorously validate this, we performed independent two-sample t-tests comparing ALIS against the strong inference-optimized baseline, EVStore. The statistical analysis confirms ALIS’s significant advantages across all datasets. For instance, on the Taobao dataset, which features highly skewed access patterns, ALIS achieves a 15.8% latency reduction over EVStore (\(p < 0.01\)). This significant lead is maintained across all other datasets, including Criteo (10.6% reduction) and Avazu (17.6% reduction).
The performance gap between ALIS and the baselines stems from fundamental design differences. TF Default is slowest due to the severe I/O bottleneck of using CPU memory. Range Sharding is susceptible to the skewed access patterns prominent in datasets like Taobao, leading to load imbalance. HugeCTR uses a more static data placement, while EVStore operates reactively by caching hot embeddings after they are accessed. In contrast, ALIS operates proactively. Its forward-looking planning mechanism analyzes historical access patterns to anticipate which embeddings will be needed together. It then strategically co-locates them, which is particularly effective for Taobao’s power-law distribution, as it can place the few “hot” items and their related features for optimal access. This proactive, global optimization minimizes inter-GPU communication on the critical path, resulting in superior performance over even the most sophisticated reactive caching systems.
Impact on model accuracy (RQ2)
A critical requirement for any system optimization is that it must not degrade the model’s predictive quality. To address this, we evaluated the Area Under the ROC Curve (AUC) for four distinct recommendation models—DeepFM, DCN, DLRM, and DIN—when paired with different sharding strategies. The experiments were conducted on the Criteo dataset, which is a standard benchmark for all these models.
As shown in Table 4, the AUC scores remain remarkably consistent across all sharding strategies for all four models. Our proposed ALIS achieves comparable accuracy to baseline methods, and any minor numerical variations observed are statistically insignificant and likely due to experimental variance. Importantly, these results confirm that ALIS enables significant inference performance gains without compromising the fundamental recommendation accuracy, ensuring practical applicability for large-scale recommender systems.
Ablation study (RQ3)
To investigate the performance of each component in our ALIS method, we conduct ablation studies by evaluating variants of ALIS with specific components removed. Specifically, we focus on the contributions of our proposed Light Incremental Sharding and Aggregate based Sharding mechanisms in this section.
Utility of aggregate based sharding
To evaluate the contribution of the aggregate based sharding strategy in ALIS to performance stability, we perform an ablation study by comparing the complete ALIS method with two variants: \(\textrm{ALIS}_\textrm{base}\), where the aggregate based sharding mechanism is removed (but light incremental sharding is retained), and \(\textrm{ALIS}_\textrm{vanilla}\), where both aggregate based sharding and light incremental sharding are disabled. In \(\textrm{ALIS}_\textrm{base}\) and \(\textrm{ALIS}_\textrm{vanilla}\), the sharding decision is solely based on the current iterative-based sharding result without considering historical sharding information or performance.
We conduct experiments using the Criteo dataset and the DeepFM. We specifically focus on assessing the stability of inference latency, as the aggregate strategy is designed to mitigate performance fluctuations. For each configuration (ALIS, \(\textrm{ALIS}_\textrm{base}\), and \(\textrm{ALIS}_\textrm{vanilla}\)), we run the inference process multiple times and measure the inference latency for each run to capture the variability in performance.
Figure 6 visually compares the distributions of inference latency for ALIS, \(\textrm{ALIS}_\textrm{base}\), and \(\textrm{ALIS}_\textrm{vanilla}\) using box plots. As depicted, ALIS exhibits a significantly more compact box plot than both variants, signifying a reduced spread of latency values. To formally quantify this observation, we performed Levene’s test for equality of variances. The test showed that the variance in inference latency of ALIS is statistically significantly lower than that of both \(\textrm{ALIS}_\textrm{base}\) and \(\textrm{ALIS}_\textrm{vanilla}\). This confirms that the aggregate based sharding mechanism is effective in enhancing performance stability.
The absence of the aggregate mechanism makes both \(\textrm{ALIS}_\textrm{base}\) and \(\textrm{ALIS}_\textrm{vanilla}\) more susceptible to performance swings caused by the randomness of sharding. Furthermore, a t-test on the mean latencies shows that ALIS is not only more stable but also significantly faster than \(\textrm{ALIS}_\textrm{vanilla}\). The tighter and lower distribution of latency values for ALIS directly validates, with statistical support, the performance of the aggregate based sharding strategy in enhancing the robustness and reliability of ALIS’s inference performance.
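For reference, both tests can be run with SciPy as sketched below; the latency values shown are placeholders rather than our measured results.

```python
from scipy.stats import levene, ttest_ind

# per-run inference latencies (ms) for each configuration; placeholder values
alis    = [2.01, 2.05, 2.03, 2.06, 2.02]
base    = [2.20, 2.45, 2.10, 2.60, 2.25]
vanilla = [2.35, 2.80, 2.15, 2.95, 2.40]

# Levene's test for equality of variances (stability of ALIS vs. ALIS_base)
_, p_var = levene(alis, base)
# independent two-sample t-test on mean latencies (ALIS vs. ALIS_vanilla)
_, p_mean = ttest_ind(alis, vanilla)
print(f"Levene p = {p_var:.3f}, t-test p = {p_mean:.3f}")
```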
Utility of light incremental sharding
To evaluate the performance of our light incremental sharding mechanism in reducing the sharding overhead, we compare the performance of ALIS with two variants: \(\textrm{ALIS}_\textrm{full}\) and \(\textrm{ALIS}_\textrm{vanilla}\). \(\textrm{ALIS}_\textrm{full}\) disables the incremental sharding strategy but retains the aggregate based sharding component. \(\textrm{ALIS}_\textrm{vanilla}\), as previously defined, disables both incremental sharding and aggregate based sharding. This allows us to isolate and measure the impact of the incremental part on the sharding time and overall inference performance, while also observing the combined effect of removing both optimizations.
The results, presented in Table 5, clearly demonstrate the performance of our light incremental sharding mechanism. To validate these findings, we again employed independent two-sample t-tests. The analysis shows that ALIS achieves an 88% reduction in sharding time compared to \(\textrm{ALIS}_\textrm{full}\) (from 2.10s to 0.25s), a difference that is statistically significant (\(p < 0.01\)). This confirms that the incremental approach drastically reduces the computational overhead of dynamic sharding.
Furthermore, ALIS not only reduces sharding time but also maintains superior inference latency. The latency of ALIS (2.04 ms) is significantly lower than that of both \(\textrm{ALIS}_\textrm{full}\) and \(\textrm{ALIS}_\textrm{vanilla}\). This suggests that the reduced sharding overhead may allow for more efficient resource utilization during inference, or that the hybrid approach of ALIS leads to slightly better shard quality. These results convincingly and statistically confirm that our light incremental sharding effectively minimizes the cost of dynamic sharding without sacrificing–and even slightly improving–the overall inference performance.
The ablation study comprehensively demonstrates the individual contributions of both light incremental sharding and aggregate based sharding mechanisms, collectively validating the performance of our proposed techniques in the ALIS for efficient and effective embedding table sharding during recommendation inference.
Hyper-parameter sensitivity analysis (RQ4)
We conducted a sensitivity analysis on key hyper-parameters to understand the robustness and adaptability of our ALIS sharding strategy. Although ALIS is designed to minimize manual tuning, specific parameters influence its dynamic behavior and performance. This section investigates the impact of two crucial hyper-parameters in ALIS, \(\alpha\) and \(\beta\), their joint impact, and the universality of their selection.
Impact of incremental data ratio (\(\alpha\))
The Light Incremental Sharding mechanism in ALIS combines iterative sharding on short-term data with statistics-based sharding on historical data. The ‘Incremental Data Ratio’ parameter, denoted as \(\alpha\), controls the proportion of recent interactions used for the iterative sharding step in each dynamic update. A higher \(\alpha\) emphasizes recent data changes, potentially leading to faster adaptation to evolving data distributions but also increased sharding overhead. Conversely, a lower \(\alpha\) relies more on historical statistics, reducing sharding cost but potentially lagging behind recent distribution shifts.
To evaluate the sensitivity to \(\alpha\), we varied its value across {0.1, 0.3, 0.5, 0.7, 0.9} while keeping other parameters fixed and evaluated the inference latency and sharding time on the Criteo dataset and DeepFM.
Figure 7a illustrates the impact of \(\alpha\) on inference latency and sharding time. We observe that as \(\alpha\) increases, the sharding time tends to increase due to more extensive iterative computation on recent data. Interestingly, the inference latency initially decreases as \(\alpha\) increases from 0.1 to 0.5, suggesting that incorporating more recent data in sharding improves the sharding quality and reduces inter-GPU communication during inference, hence lower latency. However, beyond \(\alpha = 0.5\), the inference latency increases slightly. This could be because a very high \(\alpha\) overemphasizes short-term fluctuations and might make shards less stable and potentially suboptimal overall. \(\alpha = 0.5\) appears to balance adaptation and overhead, yielding the best inference performance in this experiment. This result suggests that carefully choosing \(\alpha\) is essential to achieve optimal performance, and a moderate value is generally preferable.
Impact of aggregate history weight (\(\beta\))
The Aggregate based Sharding mechanism in ALIS leverages historical sharding results to improve stability and robustness. The ‘Aggregate History Weight’ parameter, denoted as \(\beta\), controls the weight assigned to the historical sharding results when integrating them with the current sharding outcome. A higher \(\beta\) emphasizes past shards, promoting stability and potentially mitigating performance fluctuations caused by randomness in current sharding. However, an excessively high \(\beta\) could hinder ALIS’s adaptability to long-term distribution changes. A lower \(\beta\) gives more weight to the current shard, improving adaptability but potentially making the shards less stable.
We conducted experiments by varying \(\beta\) across {0.2, 0.4, 0.6, 0.8, 1.0} while maintaining other parameters constant. We measured the inference latency standard deviation over multiple dynamic sharding updates on Criteo and DeepFM to assess the stability and average inference latency.
Figure 7b shows the impact of \(\beta\) on the standard deviation of inference latency and average inference latency. As expected, increasing \(\beta\) generally reduces the standard deviation of inference latency. This indicates that a higher \(\beta\) contributes to more stable sharding outcomes and less fluctuating inference performance. The average inference latency also shows a slightly decreasing trend as \(\beta\) increases to 0.8, possibly due to improved overall sharding quality from incorporating more historical information. However, when \(\beta\) is very high, the average inference latency increases slightly, suggesting that relying on historical shards might impede adaptation to new data patterns and lead to somewhat suboptimal performance in the long run. A value of \(\beta\) around 0.8 appears to offer a good balance between stability and adaptability in our experiments.
Joint impact
Figure 8a reveals a clear optimal region for inference latency. We observe the lowest inference latencies when \(\alpha\) is in the range [0.3, 0.5] and \(\beta\) is in the range [0.6, 0.8]. For instance, the pair (\(\alpha =0.5, \beta =0.8\)) yields an excellent latency of 25.3 ms. When \(\alpha\) is too low, the model adapts slowly to data changes, leading to higher latency regardless of \(\beta\). Conversely, a very high \(\alpha\) also increases latency, especially with lower \(\beta\) values, likely due to over-sensitivity to recent data fluctuations leading to less stable sharding decisions. Similarly, very low \(\beta\) values result in higher latency as historical knowledge is underutilized, while a \(\beta\) of 1.0 can also be suboptimal if \(\alpha\) is not well-tuned, as it might overly rely on potentially outdated historical shards.
As shown in Fig. 8b, sharding time is predominantly influenced by \(\alpha\). It increases almost monotonically with \(\alpha\), from around 10–15 s at \(\alpha =0.1\) to 45–55 s at \(\alpha =0.9\). This is expected, as a higher \(\alpha\) means more data is processed in the iterative sharding step. The impact of \(\beta\) on sharding time is less pronounced, though slightly lower sharding times are observed for intermediate \(\beta\) values, possibly due to more efficient integration of historical and current sharding results. This highlights a trade-off: lower \(\alpha\) reduces sharding cost but, as seen earlier, might compromise inference latency.
Figure 8c illustrates the standard deviation of inference latency, a proxy for sharding stability. Stability significantly improves with increasing \(\beta\). For any given \(\alpha\), moving from \(\beta =0.2\) to \(\beta =1.0\) substantially reduces latency fluctuations. This confirms that incorporating historical sharding information via \(\beta\) effectively stabilizes performance. Regarding \(\alpha\), extremely low or high values tend to slightly increase instability compared to moderate values, especially when \(\beta\) is low. The most stable region appears with high \(\beta\) and moderate \(\alpha\), achieving a standard deviation below 0.8 ms.
The heatmap analysis offers a superior understanding compared to single-parameter sweeps, revealing that optimal inference latency is achieved through a balanced combination of \(\alpha\) and \(\beta\), rather than at their extremes. For instance, while high \(\beta\) ensures stability, it requires an appropriate \(\alpha\) for adaptability and low latency. Notably, the lowest sharding times do not necessarily correlate with the best inference performance. Based on our Criteo-DeepFM experiments, \(\alpha \approx 0.5\) and \(\beta \approx 0.8\) present an excellent trade-off, yielding low inference latency, good stability, and moderate sharding times. This joint analysis underscores that while ALIS demonstrates robustness across a range of parameters, concurrently fine-tuning \(\alpha\) and \(\beta\) can unlock significant performance gains. These recommended values are strong starting points, but dataset characteristics and real-time data dynamics might necessitate minor adjustments within the identified favorable regions.
Generalization to the Taobao dataset
To rigorously test the generalizability, we replicated the entire hyper-parameter sensitivity analysis on the Taobao dataset. This dataset, originating from a large e-commerce platform, presents a fundamentally different challenge. Unlike the ad-tech datasets, Taobao is characterized by a more pronounced power-law distribution in item popularity and strong sequential dependencies in user behavior. Validating our hyper-parameter settings in this distinct domain is crucial for establishing the broad applicability of ALIS. The results are presented in Figure 9.
Remarkably, despite the significant differences in data characteristics and domain logic, the observed trends are highly consistent with our findings on Criteo. As shown in Figure 9a, inference latency is minimized around \(\alpha =0.5\), while sharding time scales linearly with \(\alpha\). Similarly, Fig. 9b demonstrates that stability improves with higher \(\beta\), with the best average latency achieved near \(\beta =0.8\). This consistency across two disparate domains—advertising and e-commerce—provides strong evidence that the principles guiding the tuning of ALIS are robust and not merely an artifact of a specific dataset or application type. We can thus confidently recommend \(\alpha \approx 0.5\) and \(\beta \approx 0.8\) as a reliable starting point for a wide range of recommendation systems.
The trade-off between sharding time (\(T_{\text {update}}\)) and maximum achievable update frequency (\(F_{\text {max}}\)) as a function of the incremental data ratio (\(\alpha\)). A larger \(\alpha\) improves sharding quality but increases update time, thereby limiting the system to lower update frequencies.
Limitations
While ALIS demonstrates significant performance improvements, it is essential to acknowledge its operational constraints and potential areas for future research. We elaborate on three key aspects: the memory footprint of its global state, constraints on dynamic update frequency, and the generalizability of our findings.
Constraints on dynamic update frequency
Our main experiments employed a conservative, data-volume-based update trigger (every 1M requests) to ensure stable performance measurement. However, applications requiring more frequent, time-based updates must consider the sharding overhead. We conducted a targeted experiment to quantify this, measuring the sharding time (\(T_{\text {update}}\)) against the incremental data ratio, \(\alpha\), to determine the maximum sustainable update frequency (\(F_{\text {max}} = 1/T_{\text {update}}\)).
The results, shown in Fig. 10, reveal a critical trade-off. Sharding time increases non-linearly with \(\alpha\), rising from 50 ms at \(\alpha =0.1\) to 1.9 s at \(\alpha =0.9\). Consequently, the maximum update frequency drops sharply as \(\alpha\) increases, from 20 Hz to roughly 0.5 Hz.
This analysis quantitatively demonstrates that while ALIS is efficient, a trade-off exists between sharding quality (favoring higher \(\alpha\)) and update frequency. The mechanism is therefore best suited for moderately dynamic environments (sub-second to minutes), rather than hard real-time contexts demanding ultra-high-frequency updates.
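As a simple illustration of the relation \(F_{\text {max}} = 1/T_{\text {update}}\), the sketch below checks whether a desired time-based update frequency is sustainable for a given \(\alpha\). Only the two endpoint timings reported above are used; intermediate values would be looked up from the measured curve in Fig. 10.

```python
# Minimal sketch of the feasibility check implied by F_max = 1 / T_update.
# Only the two endpoint timings reported in the text are included here.
sharding_time_s = {0.1: 0.05, 0.9: 1.9}  # alpha -> T_update in seconds

def max_update_hz(alpha):
    """Maximum sustainable update frequency for a given alpha."""
    return 1.0 / sharding_time_s[alpha]

def frequency_feasible(alpha, desired_hz):
    """True if the system can re-shard at desired_hz with this alpha."""
    return desired_hz <= max_update_hz(alpha)

print(max_update_hz(0.1))            # 20.0 Hz (50 ms per update)
print(frequency_feasible(0.9, 1.0))  # False: 1.9 s per update caps F_max near 0.5 Hz
```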
Generalizability to diverse data characteristics
As demonstrated in our experiments, the generalizability of ALIS is validated across four datasets chosen specifically for their diverse properties: Criteo, Avazu, KDD 2012, and Taobao. The consistent performance reported in Section 4 across these varied data distributions and distinct application domains provides robust evidence of our model’s wide applicability. This conclusion is further corroborated by our hyper-parameter study, which showed that the optimal parameter settings for ALIS are remarkably insensitive to these significant shifts in data characteristics. While these findings strongly support the broad utility of our proposed method, we prudently acknowledge that deploying ALIS to new tasks or datasets substantially different from those evaluated may still necessitate targeted adjustments to achieve peak performance.
Scalability analysis of the global weight matrix
A core component of our Light Incremental Sharding mechanism is the global weight matrix W, which guides sharding decisions by accumulating historical information. While effective, a potential limitation, as briefly mentioned, is its memory footprint, which scales with both the total number of unique embeddings |E| and the number of shards S. The memory required can be formulated as \(\text{Memory}(W) = |E| \times S \times \text{sizeof}(\text{weight\_type})\). To address the concern about performance under significant increases in features and shards, we provide a quantitative analysis of this aspect.
We conducted a scalability analysis to model the theoretical memory consumption of the matrix W. We assume each weight is stored as a 32-bit floating-point number (4 bytes) and varied the number of unique embeddings |E| from 10 million to 100 million and the number of shards S from 8 to 64, scenarios that are representative of large to very-large-scale industrial recommender systems.
The results are presented in Fig. 11. As expected, the memory footprint scales linearly with both |E| and S. For a system with 50 million embeddings and 32 shards, the matrix W would consume approximately 6.4 GB of memory (\(50 \times 10^6 \times 32 \times 4\) bytes). This requirement grows to a substantial 25.6 GB for a massive-scale system with 100 million embeddings and 64 shards.
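The following minimal sketch implements the memory model above under the same assumption of 4-byte weights and reproduces the two figures quoted in the text; it is an illustration of the formula, not part of the ALIS implementation.

```python
# Memory model for the global weight matrix W: Memory(W) = |E| * S * sizeof(weight).
# Assumes 32-bit (4-byte) weights, as in the analysis above.
def weight_matrix_bytes(num_embeddings: int, num_shards: int,
                        bytes_per_weight: int = 4) -> int:
    return num_embeddings * num_shards * bytes_per_weight

print(weight_matrix_bytes(50_000_000, 32) / 1e9)    # 6.4 GB for 50M embeddings, 32 shards
print(weight_matrix_bytes(100_000_000, 64) / 1e9)   # 25.6 GB for 100M embeddings, 64 shards
```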
This analysis highlights that while ALIS is efficient in terms of computation time for updates, the memory for the weight matrix W can become a significant factor in extremely large-scale deployments. This could pose a challenge for systems with constrained memory resources. However, this limitation is not insurmountable. Future work could explore optimizations such as quantization and sparsification of the weight matrix W. These techniques would not only reduce the memory footprint but could also potentially lower inference latency by enabling faster, lower-precision computations. Critically, this introduces a trade-off, as aggressive optimization might degrade the historical information in W, thereby impacting sharding quality and, consequently, final model accuracy. Evaluating this balance between efficiency and accuracy presents a promising research direction to ensure ALIS remains a viable approach for next-generation recommender systems.
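To give a rough sense of the headroom offered by the optimizations mentioned above, the hypothetical sketch below estimates the footprint of W under 8-bit quantization and top-k sparsification. The parameters are illustrative assumptions rather than evaluated configurations, and the impact on sharding quality is deliberately not modelled.

```python
# Hypothetical footprint of W after the future-work optimizations discussed above:
# 8-bit quantization of each weight and, optionally, keeping only the top-k shard
# weights per embedding (sparsification). Kept weights then also need a shard index,
# assumed to fit in 1 byte for S <= 256.
def optimized_weight_matrix_bytes(num_embeddings, num_shards,
                                  bytes_per_weight=1,  # int8 quantization
                                  top_k=None):         # optional sparsification
    kept = top_k if top_k is not None else num_shards
    index_bytes = 1 if top_k is not None else 0
    return num_embeddings * kept * (bytes_per_weight + index_bytes)

print(optimized_weight_matrix_bytes(50_000_000, 32) / 1e9)           # 1.6 GB, int8 only
print(optimized_weight_matrix_bytes(50_000_000, 32, top_k=8) / 1e9)  # 0.8 GB, int8 + top-8
```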
Related work
Embedding table sharding is a foundational technique for scaling large-scale recommender systems by partitioning massive embedding tables across multiple devices16,17. The goal is to overcome single-node memory limits, balance the computational load, and reduce latency. Existing approaches fall into two main categories: static and dynamic sharding.
Static sharding
Static sharding schemes determine the partitioning strategy before execution, and this configuration remains fixed. These methods range from simple heuristics such as hash-based partitioning and approximate equitable sharding18 to sophisticated offline optimizations. More advanced techniques perform pre-computation to find an optimal partition: some methods reduce memory usage by dividing category sets into smaller tables via complementary sharding19, the RecShard20 model exploits workload statistics, and Hetero-Rec++21, an extension of Hetero-Rec22, employs hardware-aware cost models. Other approaches utilize formal methods such as Integer Linear Programming (ILP)23. Furthermore, specialized frameworks like G-Meta24 and SERE25 provide system-level support for distributed embedding computation and training on GPU clusters and Spark, respectively. In the notable sub-category of graph-based partitioning, HET-GMP26, which is built upon HET27, optimizes for locality by modeling data-embedding relationships. While effective for stable workloads, these static approaches cannot adapt to the shifting access patterns common in real-world scenarios, which necessitates dynamic solutions.
Dynamic sharding
Dynamic sharding methods adaptively adjust the partitioning scheme during training or inference by leveraging runtime statistics, machine learning, or system feedback. These strategies aim to continuously optimize performance in response to changing conditions. A prominent direction involves using learning-based techniques to discover optimal placement policies. AutoShard28 builds a neural cost model to estimate communication and memory costs, automatically selecting a highly optimized partition configuration. DreamShard29 formulates placement as a combinatorial optimization problem, using reinforcement learning to derive generalizable policies that reduce peak memory usage. Furthering this, NeuroShard30 learns a fully differentiable sharding policy via gradient descent, enabling fine-grained, dynamic control over hot and cold embeddings. Another class of methods focuses on flexible, multi-tiered memory and storage management. FlexShard31 introduces a tier-aware sharding mechanism that dynamically partitions embeddings based on runtime frequency, request context, and memory tier characteristics. EVStore32 introduces a cache-aware sharding architecture that dynamically co-locates frequently co-accessed embeddings to minimize SSD fetches. Real-time adaptability is crucial for production systems. Monolith33 supports collisionless ID space growth with a dynamic embedding design that adapts sharding boundaries based on online statistics. EMBark34 develops tailored dynamic optimizations for training, enabling runtime table resizing and intelligent memory-tier allocation. Other system-integrated approaches include Recom35, which uses compiler techniques to guide placement; EfShard36, which enables flexible state allocation through hierarchical sharding; and the Budgeted Embedding Table (BET)37, a framework where a runtime budget determines if an embedding gets a full, compressed, or skipped representation.
In summary, the field is clearly progressing from fixed, offline-optimized static methods to adaptive dynamic strategies. Our work, ALIS, contributes to this dynamic sharding landscape with a distinct approach. Unlike learning-based methods such as DreamShard or NeuroShard that train complex policies via reinforcement learning or neural networks, ALIS employs more direct and lightweight mechanisms. Specifically, our aggregate based sharding uses an explicit, performance-feedback-driven ensemble method to improve the quality and stability of sharding, addressing the randomness inherent in iterative approaches. Furthermore, while other methods re-partition extensively, our light incremental sharding offers an efficient hybrid solution, combining fast local iterative processing on new data with a global statistical update. This strikes a unique balance between responsiveness to new data patterns and low computational overhead, setting it apart from methods that either require costly full recalculations or rely on simpler greedy updates that can miss co-occurrence shifts.
Conclusion
This paper presents ALIS, a novel method designed to address the challenges of efficient dynamic embedding table sharding in recommendation inference. By introducing aggregate based sharding and light incremental sharding, ALIS significantly reduces the cost of sharding updates and improves sharding stability, ensuring high-quality performance throughout the dynamic sharding process. Our approach is specifically tailored to the demands of recommender systems, including the need for low latency and consistent performance. The evaluation results demonstrate that ALIS outperforms existing state-of-the-art methods, validating the effectiveness of our proposed strategies in real-world deployment scenarios. These contributions not only improve the efficiency of large-scale recommender systems but also lay a solid foundation for future advancements in dynamic sharding techniques for distributed inference environments.
Future research could explore more sophisticated approaches for dynamically balancing insights from historical sharding states with current data characteristics, particularly in contexts of prolonged system operation or highly erratic data patterns where the utility of past information may wane. Enhancing the system’s responsiveness to abrupt, substantial shifts in data co-occurrence, beyond routine incremental updates, presents another avenue for improving overall robustness. Furthermore, the underlying principles of our work could be investigated for their applicability to other large-scale sparse model domains or in conjunction with emerging hardware accelerators for inference, offering new avenues for optimization.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Nahta, R., Chauhan, G. S., Meena, Y. K. & Gopalani, D. Deep learning with the generative models for recommender systems: A survey. Comput. Sci. Rev. 53, 100646 (2024).
Fu, Z., Niu, X. & Maher, M. L. Deep learning models for serendipity recommendations: A survey and new perspectives. ACM Comput. Surv. 56(1), 1–26 (2023).
Fei, H., Zhang, J., Zhou, X., Zhao, J., Qi, X. & Li, P. GemNN: Gating-enhanced multi-task neural networks with feature interaction learning for CTR prediction. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2166–2171 (2021).
Di, S. Deep interest network for Taobao advertising data click-through rate prediction. In 2021 International Conference on Communications, Information System and Computer Engineering (CISCE). 741–744 (2021).
Lim, J., Kim, Y. G., Chung, S. W., Koushanfar, F. & Kong, J. Near-memory computing with compressed embedding table for personalized recommendation. IEEE Trans. Emerg. Top. Comput. 12(3), 938–951 (2023).
Wilkening, M., Gupta, U., Hsia, S., Trippel, C., Wu, C.-J., Brooks, D. & Wei, G.-Y. RecSSD: Near data processing for solid state drive based recommendation inference. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 717–729 (2021).
Wang, Y., Yang, C., Lan, S., Zhu, L. & Zhang, Y. End-edge-cloud collaborative computing for deep learning: A comprehensive survey. In IEEE Communications Surveys & Tutorials (2024).
Shuvo, M. M. H., Islam, S. K., Cheng, J. & Morshed, B. I. Efficient acceleration of deep learning inference on resource-constrained edge devices: A review. Proc. IEEE 111(1), 42–91 (2022).
Mudigere, D., Hao, Y., Huang, J., Jia, Z., Tulloch, A., Sridharan, S., Liu, X., Ozdal, M., Nie, J., Park, J. et al. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 993–1011 (2022).
Lakhotia, K., Petrini, F., Kannan, R. & Prasanna, V. In-network reductions on multi-dimensional HyperX. In 2021 IEEE Symposium on High-Performance Interconnects (HOTI). 1–8 (2021).
Petit, Q., Li, C. & Emad, N. An efficient and scalable approach to build co-occurrence matrix for DNN’s embedding layer. In Proceedings of the 38th ACM International Conference on Supercomputing. 286–297 (2024).
Guo, H., Tang, R., Ye, Y., Li, Z. & He, X. DeepFM: A factorization-machine based neural network for CTR prediction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017. 1725–1731 (2017).
Wang, R., Fu, B., Fu, G. & Wang, M. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, Halifax, NS, Canada, August 13 - 17, 2017. 12–1127 (2017).
Naumov, M., Mudigere, D., Shi, H.-J.M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C.-J., Azzolini, A.G., et al. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091 (2019).
Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H. & Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068 (2018).
Chen, S., Tan, H., Zhou, A.C., Li, Y. & Balaji, P. UpDLRM: Accelerating personalized recommendation using real-world PIM architecture. In Proceedings of the 61st ACM/IEEE Design Automation Conference. 1–6 (2024).
Liu, H., Zheng, L., Huang, Y., Liu, C., Ye, X., Yuan, J., Liao, X., Jin, H. & Xue, J. Accelerating personalized recommendation with cross-level near-memory processing. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–13 (2023).
Lin, W., He, F., Zhang, F., Cheng, X. & Cai, H. Initialization for network embedding: A Graph partition approach. In Proceedings of the 13th International Conference on Web Search and Data Mining. 367–374 (2020).
Shi, H.M., Mudigere, D., Naumov, M. & Yang, J. Compositional embeddings using complementary partitions for memory-efficient recommendation systems. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 165–175 (2020).
Sethi, G., Acun, B., Agarwal, N., Kozyrakis, C., Trippel, C. & Wu, C. RecShard: Statistical feature-based memory optimization for industry-scale neural recommendation. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 344–358 (2022).
Krishnan, A., Nambiar, M. & Singhal, R. Hetero-Rec++: Modelling-based robust and optimal deployment of embeddings recommendations. In Proceedings of the Third International Conference on AI-ML Systems. 1–9 (2023).
Mahajan, C., Krishnan, A., Nambiar, M. & Singhal, R. Hetero-rec: Optimal deployment of embeddings for high-speed recommendations. In Proceedings of the Second International Conference on AI-ML Systems. 1–9 (2022).
Viramontes, R. & Davoodi, A. Neural network partitioning for fast distributed inference. In 2023 24th International Symposium on Quality Electronic Design (ISQED). 1–7 (2023).
Xiao, Y., Zhao, S., Zhou, Z., Huan, Z., Ju, L., Zhang, X., Wang, L. & Zhou, J. G-meta: Distributed meta learning in GPU clusters for large-scale recommender systems. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 4365–4369 (2023).
Gergin, B. & Chelmis, C. Large-scale knowledge graph embeddings in apache spark. In 2024 IEEE International Conference on Big Data (BigData). 243–251 (2024).
Miao, X., Shi, Y., Zhang, H., Zhang, X., Nie, X., Yang, Z. & Cui, B. HET-GMP: A graph-based system approach to scaling large embedding model training. In Proceedings of the 2022 International Conference on Management of Data. 470–480 (2022).
Miao, X. et al. HET: Scaling out huge embedding model training via cache-enabled distributed framework. Proc. VLDB Endow. 15(2), 312–320 (2021).
Zha, D., Feng, L., Bhushanam, B., Choudhary, D., Nie, J., Tian, Y., Chae, J., Ma, Y., Kejariwal, A. & Hu, X. AutoShard: Automated embedding table sharding for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4461–4471 (2022).
Zha, D. et al. DreamShard: Generalizable embedding table placement for recommender systems. Adv. Neural Inf. Process. Syst. 35, 15190–15203 (2022).
Eldeeb, T., Chen, Z., Cidon, A. & Yang, J. NeuroShard: Towards automatic multi-objective sharding with deep reinforcement learning. In Proceedings of the Fifth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 1–12 (2022).
Lu, L., Qiu, Y., Yi, S. & Fan, Y. A flexible embedding-aware near memory processing architecture for recommendation system. In IEEE Computer Architecture Letters (2023).
Kurniawan, D.H., Wang, R., Zulkifli, K.S., Wiranata, F.A., Bent, J., Vigfusson, Y. & Gunawi, H.S. EVStore: Storage and caching capabilities for scaling embedding tables in deep recommendation systems. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. Vol. 2. 281–294 (2023).
Liu, Z., Zou, L., Zou, X., Wang, C., Zhang, B., Tang, D., Zhu, B., Zhu, Y., Wu, P., Wang, K. & Cheng, Y. Monolith: Real time recommendation system with collisionless embedding table. In Proceedings of the 5th Workshop on Online Recommender Systems and User Modeling Co-located with the 16th ACM Conference on Recommender Systems, ORSUM@RecSys 2022, Seattle, WA, USA, September 23rd, 2022 (2022).
Liu, S., Zheng, N., Kang, H., Simmons, X., Zhang, J., Langer, M., Zhu, W., Lee, M. & Wang, Z. Embedding optimization for training large-scale deep learning recommendation systems with EMBark. In Proceedings of the 18th ACM Conference on Recommender Systems. 622–632 (2024).
Pan, Z., Zheng, Z., Zhang, F., Wu, R., Liang, H., Wang, D., Qiu, X., Bai, J., Lin, W. & Du, X. Recom: A compiler approach to accelerating recommendation model inference with massive embedding columns. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. Vol. 4. 268–286 (2023).
Mu, K. & Wei, X. EfShard: Toward efficient state sharding blockchain via flexible and timely state allocation. IEEE Trans. Netw. Serv. Manag. 20(3), 2817–2829 (2023).
Qu, Y., Chen, T., Nguyen, Q.V.H. & Yin, H. Budgeted embedding table for recommender systems. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 557–566 (2024).
Acknowledgements
This work has been supported by the Science Research Project of Anhui Higher Education Institutions (No. 2023AH050914, 2024AH052239), the Quality Engineering Project of Anhui Higher Education Institutions (No. 2023zybj018), Science and Technology Project of Wuhu City under Grant (No. 2023pt07, 2023ly13), and the Quality Improvement Program of Anhui Polytechnic University (No. 2022lzyybj02, 2023jyxm15 and 2024jyxm76).
Funding
This paper is supported by the Science Research Project of Anhui Higher Education Institutions (No. 2023AH050914, 2024AH052239), the Quality Engineering Project of Anhui Higher Education Institutions (No. 2023zybj018), Science and Technology Project of Wuhu City under Grant (No. 2023pt07, 2023ly13), and the Quality Improvement Program of Anhui Polytechnic University (No. 2022lzyybj02, 2023jyxm15 and 2024jyxm76).
Author information
Contributions
Chao Kong contributed to funding acquisition, methodology, and writing the original draft. Jiahui Chen contributed to validation and visualization and was involved in review & editing. Dan Meng contributed to software. Haibei Zhu contributed to data curation and validation. Yi Zhang contributed to validation.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kong, C., Chen, J., Meng, D. et al. Aggregate based light incremental sharding for efficient embedding table management for recommender systems. Sci Rep 15, 31770 (2025). https://doi.org/10.1038/s41598-025-15351-8