Introduction

Artificial Intelligence (AI) refers to the development of computer systems capable of performing tasks that typically require human intelligence, such as learning, reasoning, and decision-making1. It enables machines to analyze data, recognize patterns, and adapt to new information, thereby automating complex processes across various domains including healthcare, social network analysis2, finance, education, social media analysis3 and human resources. AI systems integrate algorithms and computational models to simulate cognitive functions4, enabling applications like natural language processing5, computer vision, and predictive analytics6. As AI continues to evolve, it plays an increasingly vital role in enhancing efficiency, personalization, and innovation across industries7. Within this broad scope, AI-driven research on career development and job market dynamics has gained significant momentum8. Career satisfaction, a critical factor influencing employee productivity and retention, is increasingly being studied through the lens of AI models that consider both behavioral traits and academic performance9. The integration of these diverse factors offers a holistic understanding of career success, opening new avenues for personalized career guidance and workforce optimization10.

Despite its importance, predicting career satisfaction remains a complex challenge due to the multifaceted nature of influencing factors and their nonlinear interactions11. This motivates the need for advanced AI methodologies capable of capturing intricate patterns in multidimensional data12. This study addresses this gap by applying state-of-the-art transformer-based models, specifically BERT, to predict career satisfaction based on educational and behavioral traits, offering enhanced accuracy and interpretability compared to conventional approaches13.

In this study, we explore a standard, widely used public dataset encompassing comprehensive academic and behavioral features to analyze career success rates. Using state-of-the-art transformer models, particularly the BERT architecture, the research aims to capture complex, nonlinear relationships within these multifaceted traits. By integrating diverse data points such as educational performance indicators and behavioral metrics, the model provides a nuanced understanding of factors influencing career satisfaction. This approach not only advances predictive accuracy but also facilitates deeper insights into the interplay between academic achievements and personal attributes, thereby contributing to more effective career guidance and workforce development strategies. The key contributions of this study are:

  • Development of a multi-factor predictive model leveraging BERT’s transformer architecture to integrate academic and behavioral data for career satisfaction prediction.

  • Comparative evaluation of transformer-based models against traditional machine learning and deep learning methods, demonstrating superior performance.

  • The proposed BERT-based model achieves a classification accuracy of 98% by effectively capturing both syntactic and semantic patterns, enabling complex, context-aware prediction of career satisfaction levels.

The remainder of this paper is organized as follows: Sect. 2 reviews related work; Sect. 3 describes the dataset and preprocessing steps; Sect. 4 presents experimental results and analysis; Sect. 5 concludes the paper with future research directions.

Related work

The quest to comprehend career success and its prospects has shifted from conventional sociological and psychological models to data-centered analysis and prediction, spurred by the explosion of digital data and the rapid advances in machine learning (ML) and deep learning (DL)14. Recent literature, summarized in Table 1, indicates a shift towards multi-factor predictive models that utilize multiple sources of data.

The study15 applied vocational interest inventory scores to categorize individuals’ careers. Compared to traditional profile matching, their ML-augmented approach increased the overall accuracy of predicting common occupations, while still underpredicting rare job categories. In the same vein, another study16 developed a Big 5 based ANFIS fuzzy-inference model to categorize experts (software vs. data scientist) and showed how personality patterns can be used effectively for career placement. A further study17 paired objective alumni information (i.e., grades, major, demographics) with subjective survey items in an analysis that integrated a genetic algorithm to predict career success, discovering no structure in the subjective features. These multi-trait models show that combining hard skills (such as GPA) with soft skills (interests, personality, satisfaction) offers a finer-grained picture of career outcomes, and they have proved useful in question-and-answer interview sessions for analyzing the cognitive skills of hiring candidates. However, they usually target specific communities (e.g., a university, a field), so generalization is limited18. AI in smart education can further support career prediction strategies: after analyzing students’ skills, interests, and performance data, students are guided toward potentially suitable career paths. This allows for more personalized guidance that enhances decision making and aligns curricula with workforce demand19. Systems that predict careers also face challenges of human–AI collaboration. Just as human–machine plan conflicts can increase cognitive burden and decrease trust, unfair or opaque career prediction can negatively affect users’ confidence and decision making.
Combining fairness-aware AI models with transparency mechanisms and adaptive feedback loops can help alleviate these problems, ensuring that predictions are explainable, bias-aware, and aligned with users’ goals. Beyond the soft skills themselves, AI-infused career counseling benefits from increased trust and acceptance, so that education and workforce outcomes may better serve those who can now make informed, equitable choices about their careers20. Workplace analytics driven by AI can detect early signals of exclusion, providing HR with empirical evidence to design interventions aimed at fairness. If career prediction tools are integrated with well-being considerations, AI may facilitate inclusive growth and level the playing field in careers21.

Academic performance is a good predictor of an individual’s efforts and an indicator of overall school performance. The study22 employed several ML algorithms (decision trees, gradient boosting, neural nets) on an alumni survey that connected academic experiences to career satisfaction. They found that a Gradient Boosting model slightly surpassed logistic/ordinal regression (approx. 2–3% in accuracy) and highlighted the “frequency of applying learned knowledge in work” as the most important predictor of career satisfaction. For salary prediction, an LSTM network (using both maximum-likelihood and Bayesian techniques) was proposed23 and shown to be far better than traditional regressions at predicting alumni salaries. Similarly, ML-based classifiers were built on students’ high-school grades, college GPA, and socio-economic status to predict career fields, with pre-university grades observed to be particularly predictive24. Even less structured educational data has been exploited: the study25 extracted lexical descriptors from high-school essays, trained a Random Forest model, and achieved “excellent” accuracy when predicting students’ career readiness. These works demonstrate that tangible educational outcomes (grades, curriculum, and application of knowledge) have a significant impact on career paths, but such analyses require rich data and sophisticated feature engineering (e.g., text mining).

Table 1 Summary of existing studies.

Other behavioral and personal characteristics have also entered prediction models. For instance, the study29 applied Random Forests to survey data of 409 college graduates with disabilities to forecast job and career satisfaction at a later point in time. Their model achieved ~ 72% accuracy, demonstrating that meeting academic accommodation needs during education significantly and positively influenced long-term career satisfaction. This result indicates how an individual’s personal context (e.g., access to support) can be harnessed by ML for career prediction. Personal traits also matter: numerous works verify that augmenting models with style or personality features improves predictive performance30. Other AI models rely only on behavioral traits, but these vary markedly among individuals, so models learned on restricted samples tend not to generalize well31. Indeed, systematic reviews show that, despite the rise of more complex ML models (random forests, SVMs, neural nets), data scarcity, class imbalance, and lack of interpretability prevent such models from being widely deployed. In brief, ML can measure the impact of characteristics and experiences on success but must contend with the empirical and ethical difficulties (fairness, transparency) of modeling human behavior32.

Some work has also introduced new data modalities and deep architectures. One study26 integrated biomechanical features (gait and posture metrics) with behavioral questionnaires into an RF model that predicted the career outcomes of 4-year college students; this multimodal model achieved ~ 82.6% prediction accuracy and demonstrated that biomechanical information can significantly increase prediction performance. On the deep-learning side, an encoder–decoder LSTM27 took students’ vectors of exam scores and demographics as the input sequence to recommend study subjects, obtaining very high accuracy while allowing for more personalized forecasts27. These approaches demonstrate that rich data and DL can improve prediction, but they demand large samples and often sacrifice interpretability33. Overall, Random Forests, SVMs, and neural nets lead the field, yet many papers report only marginal improvements in accuracy over simpler methods34.

In brief, the studies under consideration use ML/DL to analyze how education and behavior predict career success. They share strengths (the ability to capture non-linear interactions and to personalize predictions) and weaknesses (limited learning due to dataset size, risk of overfitting, and opacity)28. For instance, large-scale studies show the accuracy benefits of big data, while smaller studies show sensitivity to feature selection. Autonomy and utility concerns (privacy, algorithmic bias) are also flagged as major problems35. Therefore, although the literature reflects clear advances in AI-based modeling of career outcomes, it also calls for more generalizable and interpretable models evaluated on broad, longitudinal data sources.

Proposed research methodology

This analysis follows a data-driven approach to forecasting career satisfaction from an extensive collection of educational and behavioral variables. Advanced predictive models are trained and tested after extensive preprocessing and feature engineering, as shown in Fig. 1. The trained models are based on the BERT architecture because it is known to successfully learn intricate, context-dependent relationships in high-dimensional data.

Fig. 1
figure 1

Framework map of proposed research methodology.

Unlike classic machine learning models and RNNs, BERT features a transformer-based rather than recurrent architecture, leveraging multi-head self-attention and deep input representations to capture complex cross-variable interactions bidirectionally. This allows the model to adequately capture both the linear and non-linear relationships present among the multifactorial predictors of career success. The methodology thus unites solid data preparation with state-of-the-art deep learning to achieve both high predictive accuracy and interpretability in career satisfaction classification36.

Unlike previous literature, which mostly applied deep learning or transformer-based architectures to textual or domain-specific sequence data, we apply BERT to structured tabular data with educational and behavioral attributes. Parameter sharing must be managed carefully to scale the model to millions of user–job pairs, but the main contribution lies on the feature-embedding side: each feature, whether academic (e.g., GPA, SAT score, university rank) or behavioral (e.g., networking score, work–life balance, internships), is projected into a dense space where values are normalized and contextualized. These embeddings are then fed into BERT’s multi-head self-attention layers, which model the interactions across heterogeneous characteristics. For example, the model can optimally weight the interplay of “University GPA” and “Networking Score”, capturing information that standard tree-based or shallow neural techniques cannot. These relationships are further refined by residual connections and feed-forward layers, which preserve both local and global dependencies among features. BERT’s transformer blocks therefore act not only as a sequence model but as a relational learner for tabular data, learning syntactic structure in numeric hierarchies and categorical encodings as well as semantic relationships among behavioral tendencies associated with career outcomes. This methodological novelty, combining embeddings with context-aware attention over mixed features, is the key contribution of our proposed method and yields consistently better performance than baseline works.

Data collection and preprocessing

This study utilized a dataset \(\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N}\), where each instance \(x_{i}\in\mathbb{R}^{d}\) is a feature vector composed of educational and behavioral attributes and \(y_{i}\in\{0,1,2\}\) denotes the categorical career satisfaction class (Low, Medium, High); the features are listed in Table 2. The dataset was collected from a longitudinal survey of graduates, incorporating continuous variables such as GPA, test scores, and soft skills scores, as well as categorical variables including gender and field of study. Preprocessing involved data cleansing, in which instances with missing entries were removed \((\mathcal{D}'=\{(x_{i},y_{i}):x_{i},y_{i}\ne \text{null}\})\), and categorical features \(x_{j}\) were encoded via one-hot or ordinal mappings to numeric vectors, ensuring that \(x_{i}\in\mathbb{R}^{d}\) is fully numeric. The target variable \(y\) was discretized by mapping the original satisfaction scores \(s_{i}\in[1,10]\) into classes using Eq. 1.

$$y_{i}=\begin{cases}0 & \text{if } 1\le s_{i}\le 3\\ 1 & \text{if } 4\le s_{i}\le 6\\ 2 & \text{if } 7\le s_{i}\le 10\end{cases}$$
(1)

Numerical features were standardized via z-score normalization \(x'_{ij}=\frac{x_{ij}-\mu_{j}}{\sigma_{j}}\), where \(\mu_{j}\) and \(\sigma_{j}\) are the mean and standard deviation of feature \(j\). Outliers beyond \(\pm 3\) standard deviations were removed to ensure robust model training. The final cleaned and transformed dataset \(\mathcal{D}''\) was then partitioned into training and testing subsets using stratified sampling with an 80:20 ratio, preserving class distributions for reliable evaluation.
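The preprocessing steps described above (class discretization per Eq. 1, z-score normalization, \(\pm 3\sigma\) outlier removal, and the stratified 80:20 split) can be sketched as follows; the synthetic data stands in for the actual dataset, whose columns are not reproduced here.

```python
# Sketch of the preprocessing pipeline on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))         # numeric feature matrix (placeholder)
s = rng.integers(1, 11, size=1000)     # satisfaction scores in [1, 10]

# Eq. 1: discretize scores into Low (0), Medium (1), High (2)
y = np.digitize(s, bins=[4, 7])        # 1-3 -> 0, 4-6 -> 1, 7-10 -> 2

# z-score normalization per feature
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xz = (X - mu) / sigma

# drop rows with any value beyond +/- 3 standard deviations
mask = (np.abs(Xz) <= 3).all(axis=1)
Xz, y = Xz[mask], y[mask]

# stratified 80:20 split preserving class proportions
X_train, X_test, y_train, y_test = train_test_split(
    Xz, y, test_size=0.2, stratify=y, random_state=42)
```
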

Table 2 Dataset feature analysis.

The original dataset included student records; after preprocessing, only valid instances were retained. The data is fully anonymized, with no personally identifiable information divulged, in line with privacy regulations. Consequently, characteristics such as geographic locality, ethnic affiliation, or socio-economic status are unavailable, which reduces the representativeness of the dataset and may affect generalization. In addition, some fields may be over-represented, which can introduce bias. To prevent overfitting, we adopted a stratified 80–20 hold-out split combined with 5-fold cross-validation preserving the class distribution. Cross-validation hyperparameter tuning was conducted separately for each model on the training folds, and the test folds were held out to prevent data leakage. Despite the high accuracy (98%) of the BERT model in this study, this result should be interpreted cautiously, and further studies should include external validation on independent datasets to test generalization.
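The leakage-free tuning protocol above can be illustrated as follows; the estimator and search grid are hypothetical stand-ins, not the settings of Table 4.

```python
# Hyperparameters are chosen by 5-fold CV on the training split only;
# the held-out test split is scored once for the final estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)

X, y = make_classification(n_samples=500, n_classes=3,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 200]}, cv=cv)
search.fit(X_tr, y_tr)            # tuning never sees the test split
test_acc = search.score(X_te, y_te)
```
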

Feature engineering

In this study, the raw dataset comprises multiple variables that characterize individuals’ academic and behavioral profiles. For modeling purposes, these variables are grouped into two primary categories: educational traits and behavioral traits. Consider an individual indexed by \(\:i\) within the dataset of size \(\:N\). Let the overall feature space be represented as in Eq. 2.

$$\:{x}_{i}\in\:\mathcal{X}\subseteq\:{\mathbb{R}}^{d}$$
(2)

where \(\:d\:=\:m\:+\:n\) is the dimensionality corresponding to \(\:m\) educational and \(\:n\) behavioral traits, respectively.

Educational traits consist of measurable academic indicators reflecting formal learning achievements and credentials. Mathematically, we define the educational feature vector for the \(i\)-th individual via the mapping shown in Eq. 3.

$$\mathcal{E}:\mathcal{X}\to\mathbb{R}^{m},\qquad x_{i}\mapsto E_{i}=\left(e_{i1},e_{i2},\dots,e_{im}\right)^{\top}$$
(3)

with each component \(\:{e}_{ij}\) modeled as in Eq. 4.

$$e_{ij}=\phi_{j}\left(\theta_{j}\left(x_{ij}\right)\right),\qquad \theta_{j}:\mathbb{R}\to\mathbb{R},\quad \phi_{j}:\mathbb{R}\to\mathbb{R}$$
(4)

where \(\theta_{j}\) is a feature scaling or normalization function (e.g., z-score normalization), given by Eq. 5,

$$\theta_{j}\left(x_{ij}\right)=\frac{x_{ij}-\mu_{j}}{\sigma_{j}}$$
(5)

and \(\phi_{j}\) is a nonlinear transformation, such as a polynomial or sigmoid activation, that enhances representational richness, computed using Eq. 6,

$$\phi_{j}\left(z\right)=\frac{1}{1+e^{-\alpha_{j}z}}$$
(6)

with \(\alpha_{j}>0\) controlling the intensity of the nonlinearity.

Behavioral traits capture aspects of an individual’s skills, experiences, and personal development that influence career outcomes beyond pure academic performance. The behavioral feature vector is defined by the mapping in Eq. 7, with components given by Eq. 8.

$$\mathcal{B}:\mathcal{X}\to\mathbb{R}^{n},\qquad x_{i}\mapsto B_{i}=\left(b_{i1},b_{i2},\dots,b_{in}\right)^{\top}$$
(7)
$$b_{ik}=\psi_{k}\left(\gamma_{k}\left(x_{ik}\right)\right),\qquad \gamma_{k}:\mathbb{R}\to\mathbb{R},\quad \psi_{k}:\mathbb{R}\to\mathbb{R}$$
(8)

For categorical or ordinal behavioral traits, \(\gamma_{k}\) may denote an embedding or encoding function, computed using Eq. 9.

$$\gamma_{k}\left(x_{ik}\right)=\sum_{l=1}^{L}w_{kl}\cdot\mathbb{I}\left(x_{ik}=l\right)$$
(9)

where \(\mathbb{I}(\cdot)\) is the indicator function and \(w_{kl}\in\mathbb{R}\) are learned embedding weights. The combined feature vector is the concatenation shown in Eq. 10.

$$x_{i}=\left[E_{i}^{\top},B_{i}^{\top}\right]^{\top}\in\mathbb{R}^{d}$$
(10)

Define the continuous career satisfaction score as \(s_{i}\in[1,10]\) and the discrete target class \(y_{i}\in\{0,1,2\}\) via the function in Eq. 11.

$$y_{i}=\sum_{c=0}^{2}c\cdot\mathbb{I}\left(s_{i}\in\mathcal{S}_{c}\right)$$
(11)

where the class partitions are intervals defined as in Eq. 12.

$$\mathcal{S}_{0}=\left[1,3\right],\qquad \mathcal{S}_{1}=\left(3,6\right],\qquad \mathcal{S}_{2}=\left(6,10\right]$$
(12)

The indicator function \(\mathbb{I}(\cdot)\) evaluates to 1 if its argument is true and 0 otherwise. This formulation embeds preprocessing and feature extraction within the scalings \(\theta_{j}\), \(\gamma_{k}\) and the nonlinear transforms \(\phi_{j}\), \(\psi_{k}\), highlighting the multistage, nonlinear nature of feature construction. The target mapping uses crisp interval-based indicator functions for categorical class assignment, suitable for classification learning frameworks.
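A minimal NumPy sketch of the pipeline in Eqs. (2)–(12) may help: \(\theta_{j}\) is z-score scaling, \(\phi_{j}\) a sigmoid with steepness \(\alpha_{j}\), \(\gamma_{k}\) an embedding lookup (the weights below are illustrative placeholders, not learned values), and the label follows the interval partition of Eq. (12).

```python
import numpy as np

def theta(x, mu, sigma):          # Eq. 5: z-score scaling
    return (x - mu) / sigma

def phi(z, alpha=1.0):            # Eq. 6: sigmoid with alpha > 0
    return 1.0 / (1.0 + np.exp(-alpha * z))

def gamma(level, weights):        # Eq. 9: embedding of a categorical level
    return weights[level]

def label(s):                     # Eqs. 11-12: interval-based class
    if 1 <= s <= 3:
        return 0
    if 3 < s <= 6:
        return 1
    return 2

# educational component e_ij = phi_j(theta_j(x_ij))
e = phi(theta(3.6, mu=3.0, sigma=0.5), alpha=1.2)
# behavioral component from a categorical trait (placeholder weights)
w = {"Entry": 0.1, "Mid": 0.5, "Senior": 0.9}
b = gamma("Mid", w)
x = np.concatenate([[e], [b]])    # Eq. 10: concatenated feature vector
y = label(8)                      # satisfaction 8 -> class 2 (High)
```
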

Although the mathematical representation in Eqs. (2)–(12) gives a strict definition of the feature space, an intuitive example may help clarify how it works in practice. Suppose we have a student with the following values:

Educational traits: High_School_GPA = 3.6 (scale 0–4), SAT_Score = 1450 (range 900–1600), University_GPA = 3.4 (scale 0–4), Projects_Completed = 4 (range 0–9), Certifications = 2 (range 0–5).

Behavioral traits: Internships_Completed = 2 (range 0–4), Soft_Skills_Score = 8 (range 1–10), Networking_Score = 7 (range 1–10), Work_Life_Balance = 6 (range 1–10), Entrepreneurship = Yes.

In preprocessing, the educational traits are normalized (Eqs. 3–6). For illustration, min-max scaling is used here so that all features fall in the [0, 1] interval; the normalized values are:

$$x'_{(\text{SAT})}=\frac{1450-900}{1600-900}\approx 0.79$$
$$x'_{(\text{GPA})}=\frac{3.6}{4.0}=0.90$$
$$x'_{(\text{Certifications})}=\frac{2}{5}=0.40$$

Behavioral traits follow the same rule:

$$x'_{(\text{SoftSkills})}=\frac{8-1}{10-1}\approx 0.78$$
$$x'_{(\text{Internships})}=\frac{2}{4}=0.50$$

Categorical behavioral traits (Eqs. 7–9) are encoded. For example, “Entrepreneurship = Yes” is encoded as 1 and “No” as 0. Job levels (Entry, Mid, Senior) are encoded as embeddings so that co-occurrence patterns between categories can be captured. The educational and behavioral features are then combined into a single representation, the transformed feature vector of Eq. 10. In this instance, the student’s profile combines normalized test scores, GPAs, project counts, and encoded behavioral features in a single feature space.

The resulting feature vector (Eq. 10) concatenates these normalized values. For this student, the vector contains the values [0.90, 0.79, 0.85, 0.44, 0.40, 0.50, 0.78, 0.67, 0.56, 1.0], combining academic and behavioral indicators.

Finally, the stated Career Satisfaction = 8 maps to the discrete target class High (2), since values of 7–10 fall into the High satisfaction class (Eq. 12).

This illustration grounds the abstract equations in a concrete workflow, demonstrating how raw academic and behavioral inputs are transformed into a [0, 1] feature space suitable for transformer-based modeling, and how a reported satisfaction score is assigned to one of the three classes (Low, Medium, High).
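The worked example above can be reproduced with min-max scaling, \(x'=(x-\min)/(\max-\min)\), applied per feature over its stated range:

```python
# Min-max scaling of the example student's features to [0, 1].
def minmax(x, lo, hi):
    return (x - lo) / (hi - lo)

sat     = minmax(1450, 900, 1600)  # SAT score, approx. 0.79
gpa     = minmax(3.6, 0, 4.0)      # High-school GPA, 0.90
certs   = minmax(2, 0, 5)          # Certifications, 0.40
soft    = minmax(8, 1, 10)         # Soft skills score, approx. 0.78
interns = minmax(2, 0, 4)          # Internships completed, 0.50
```
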

Proposed model

The proposed model is based on the Bidirectional Encoder Representations from Transformers (BERT) architecture, which uses the multi-head self-attention mechanism to capture complex contextual dependencies among input features; its workings are shown in Fig. 2. Unlike traditional sequence models, BERT uses transformer encoder blocks that let each element in the input sequence attend to all elements in both directions, yielding richer representations. At the heart of this architecture is the self-attention function, which computes attention weights from scaled dot-products between query (\(Q\)), key (\(K\)), and value (\(V\)) vectors derived from the input embeddings37. Mathematically, the attention output is calculated using Eq. 13.

$$\text{Attention}\left(Q,K,V\right)=\text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V$$
(13)

where \(d_{k}\), the dimensionality of the key vectors, serves as a scaling factor that prevents excessively large gradients. The multi-head attention mechanism concatenates several such attention outputs, allowing different representation subspaces to jointly attend to information from one another38.
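Eq. (13) can be implemented in a few lines of NumPy; this is a single-head sketch for illustration, without the learned projection matrices of a full transformer layer.

```python
# Minimal scaled dot-product attention (Eq. 13).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # scaled dot products
    return softmax(scores, axis=-1) @ V       # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)                      # shape (4, 8)
```
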

Fig. 2
figure 2

Architecture of BERT Model.

The feature embeddings are iteratively refined by stacked transformer layers, each consisting of multi-head attention, position-wise feed-forward networks, layer normalization, and residual connections; all layers are defined in Table 3. This contextualization helps capture refined correlations between educational and behavioral characteristics, improving career satisfaction classification.

Although tree-based models such as RF and GB, and classical DL architectures such as multilayer perceptrons or recurrent networks, achieve good predictive performance on structured data, they may fail to exploit complex interactions and contextual dependencies across disparate attributes. Transformer-based models such as BERT, by contrast, use multi-head self-attention, enabling mutually weighted, bidirectional interactions between diverse features so that both local and global dependencies in the data are preserved. This is especially beneficial in career achievement prediction, where educational traits (e.g., GPA, university ranking) and behavioral traits (e.g., networking score, work–life balance) interact in nonlinear and context-specific ways.

Table 3 Analysis of BERT working based on layers.

Utilizing attention and embeddings not only improves predictive performance but also provides a more interpretable representation of feature contributions than previous works. We used a pre-trained BERT model and adapted its embedding mechanism for structured tabular data by mapping educational and behavioral traits into embedding vectors rather than raw tokens. Instead of language tokens, each feature dimension was embedded and processed through the transformer blocks, allowing BERT to capture contextual interactions across heterogeneous attributes. The results of this study, with BERT achieving 98% accuracy and significantly outperforming both tree-based and recurrent baselines, provide empirical evidence that transformer models are well suited to tabular predictive tasks.
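The adaptation described above can be sketched in miniature: each scalar feature is projected into a \(d\)-dimensional embedding (playing the role of a token embedding), and the resulting "sequence" of feature tokens is mixed by one self-attention step. The random weights below are stand-ins for learned (BERT) parameters.

```python
# One normalized profile of 10 features embedded as "feature tokens"
# and mixed by a single self-attention layer (weights are random
# placeholders, not pre-trained BERT weights).
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 10, 16
x = rng.normal(size=n_features)

# per-feature affine projection: scalar -> R^d
W = rng.normal(size=(n_features, d))
c = rng.normal(size=(n_features, d))
tokens = x[:, None] * W + c                   # shape (10, 16)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# self-attention mixing across the feature tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
mixed = softmax(Q @ K.T / np.sqrt(d)) @ V     # contextualized features
pooled = mixed.mean(axis=0)                   # profile representation
```
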

Baseline models

The baseline models chosen for this work comprise widely adopted machine learning and deep learning techniques, offering an extensive performance comparison. Support Vector Machine (SVM) is a strong linear and nonlinear classifier that maximizes the margin between classes via kernel functions and handles high-dimensional data well. Logistic Regression (LR) is a basic probabilistic linear model widely applied to binary and multi-class classification for its simplicity and interpretability. Random Forest (RF), an ensemble of decision trees, enhances prediction accuracy and prevents overfitting, performing particularly well on structured tabular data thanks to bagging and feature randomness. Finally, the Gated Recurrent Unit (GRU) is an RNN variant developed to model temporal dependencies in sequences; its gating mechanisms address the vanishing gradient problem, and it has been shown to outperform plain RNNs on time-series and ordered-feature tasks. Taken together, these baselines span a range of modeling paradigms and allow the proposed predictive framework to be evaluated effectively.
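The classical baselines can be instantiated as below on synthetic stand-in data; the GRU baseline is omitted here since it requires a deep-learning framework, and these settings are illustrative rather than those of Table 4.

```python
# Cross-validated comparison of the classical baselines.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_classes=3,
                           n_informative=6, random_state=0)
baselines = {
    "SVM": SVC(kernel="rbf"),
    "LR":  LogisticRegression(max_iter=1000),
    "RF":  RandomForestClassifier(n_estimators=200, random_state=0),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in baselines.items()}
```
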

For transparency and reproducibility, we report the hyperparameters, training settings, and computational resources for all models employed in this work. Table 4 summarizes the parameter settings of the baseline machine learning and deep learning methods and the BERT-based transformer alongside our proposed model, including hardware and software, for ease of comparison.

Table 4 Model training setup and hyperparameters.

Performance measures

The proposed model was evaluated with a range of metrics, defined in Table 5, to measure the effectiveness and reliability of the classification. The principal indices were accuracy, which measures the proportion of correctly classified instances; precision and recall, which quantify, respectively, the model’s ability to label positive samples correctly and to retrieve all actual positives, thereby providing information on false positives and false negatives; and the F1-score, the harmonic mean of precision and recall, used to balance the two, especially under class imbalance39.

Table 5 Analysis of evaluation metrics.
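These metrics correspond directly to standard scikit-learn implementations; macro averaging is shown here as one reasonable choice for the three-class setting (the averaging mode is an assumption, not stated in Table 5).

```python
# Accuracy, precision, recall, and F1 on a small toy prediction.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

acc = accuracy_score(y_true, y_pred)            # 6 of 8 correct -> 0.75
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```
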

Results and discussion

This section discusses the results of the predictive analysis of career satisfaction based on combined educational and behavioral characteristics. First, we perform exploratory data analysis (EDA) to examine feature distributions and their relationships. Then the performance of different predictive models (classical machine learning methods and deep learning networks) is analyzed and contrasted, with particular focus on the BERT transformer model, which shows improved prediction accuracy and robustness. Comprehensive empirical results clarify the behavior of the model and the importance of the features and present the verification metrics that justify the efficiency of the multi-factor approach in learning the complicated patterns that determine career outcomes. The bar plots in Fig. 3 show the total number of job offers by field of study. The data show that graduates from Arts and Mathematics receive the highest volume of job offers, followed closely by Law, Business, Engineering, Medicine, and Computer Science. This indicates that not only popular technology-focused fields but also traditional and broader degree fields such as Arts and Mathematics yield good employability outcomes in these data.

Fig. 3
figure 3

Distribution of job offers by field of study.

This observation could reflect industry requirements or networking opportunities in these domains. The salary distribution histogram in Fig. 4, split by current job level, shows the expected hierarchical trends: junior-level jobs are concentrated in the lower salary ranges, while mid, senior, and executive positions have salary distributions peaking at increasingly higher values. Interestingly, at all but the top level the data skew toward lower salaries, which could indicate early-career attrition or industries in which earning potential is capped. This is further direct evidence of the correlation between job level and starting salary that is so critical to career satisfaction.

Fig. 4
figure 4

Distribution of starting salaries by current job level.

The t-SNE plot in Fig. 5 projects the high-dimensional feature space into two dimensions and colors the points by career satisfaction class (Low, Medium, High). The overlap between class clusters suggests that extracting features that unambiguously separate the classes is difficult, as the underlying feature relationships are complex and non-linear. Nevertheless, a few localized areas show partial clustering, indicating that certain combinations of traits tend to co-occur with particular satisfaction levels. This supports the use of complex models, such as deep learning or ensemble methods, to capture these subtle interactions.
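A projection like that of Fig. 5 can be produced as follows; the perplexity value is an illustrative guess, as the t-SNE settings used for the figure are not reported.

```python
# t-SNE projection of a stand-in feature matrix into two dimensions;
# the resulting points would be colored by satisfaction class.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))     # placeholder for the feature matrix
emb = TSNE(n_components=2, perplexity=30,
           random_state=0).fit_transform(X)
```
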

Fig. 5
figure 5

t-SNE visualization of student records over academic and behavioral features, colored by career satisfaction class (0 = Low, 1 = Medium, 2 = High). The clusters partly overlap where features are shared between classes, but distinguishable grouping patterns remain, implying that the satisfaction levels are partially separable in the latent space.
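A projection of this kind can be produced with scikit-learn's `TSNE`. The sketch below uses a small synthetic feature matrix as a stand-in for the student records (the shapes and class labels are illustrative assumptions, not the paper's data):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for the academic/behavioral feature matrix:
# 60 records, 8 numeric features, 3 satisfaction classes (0/1/2).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = rng.integers(0, 3, size=60)

# Project to 2-D; perplexity must stay well below the sample count.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # (60, 2) -- one 2-D point per record, ready to color by y
```

The resulting `emb` array is what a plot like Fig. 5 scatters, with point colors taken from the class labels.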

Furthermore, the heatmap in Fig. 6 shows weak to moderate correlations between the numerical features, though some strong positive relationships exist, e.g., between University GPA and High School GPA, and between Soft Skills Score and Networking Score. Note that career satisfaction has almost no direct linear relationship with most attributes, which implies that satisfaction may be driven by a complex combination of factors, or by latent variables that are not linearly related to the observed ones. This emphasizes the value of multivariate modeling approaches.

Fig. 6
figure 6

Correlation heatmap of the numerical attributes of the original dataset. Apart from a few moderate effect sizes, most correlations are weak in magnitude and close to zero, suggesting that educational and behavioral traits carry complementary, non-redundant information for predicting career satisfaction. The unit values on the diagonal confirm the integrity of all features.
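The matrix behind such a heatmap is a plain Pearson correlation matrix. A minimal sketch with synthetic columns (the variable names are illustrative stand-ins for the dataset's attributes):

```python
import numpy as np

# Toy numeric columns: university GPA is constructed to correlate with
# high-school GPA, mirroring the strong cell in Fig. 6; networking is independent.
rng = np.random.default_rng(1)
hs_gpa = rng.normal(3.0, 0.4, 200)
uni_gpa = hs_gpa * 0.7 + rng.normal(0, 0.2, 200)
networking = rng.normal(5, 2, 200)

data = np.vstack([hs_gpa, uni_gpa, networking])
corr = np.corrcoef(data)          # 3x3 Pearson correlation matrix
print(np.round(corr, 2))          # diagonal is exactly 1, off-diagonals in [-1, 1]
```

Plotting `corr` with any heatmap routine reproduces the structure described above: a unit diagonal, one strong off-diagonal cell, and near-zero entries elsewhere.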

The pair plot in Fig. 7 further illustrates, in scatter form, the distributions of behavioral traits such as number of internships completed, soft skills, networking score, and work-life balance across the career satisfaction classes.

Fig. 7
figure 7

Pair plot of selected behavioral traits colored by career satisfaction class. Distribution-wise, higher career satisfaction generally coincides with higher soft skills, networking, and work-life balance scores.

Within every satisfaction bracket, the Soft Skills and Networking scores correlate strongly with satisfaction, indicating a significant behavioral impact on the career experience. The work-life balance values cluster at discrete levels, possibly because survey participants answered on a categorical scale, but its distribution looks broadly similar across the satisfaction groups.

The feature importance analysis in Fig. 8 reveals which predictors the model relies on most when determining career satisfaction. Remarkably, SAT Score, University Ranking, and University GPA, all commonly used academic performance benchmarks, emerge as the top factors alongside Publications: First Author. This emphasizes the large extent to which conventional educational achievements shape career paths and satisfaction. Starting Salary and the other highest-ranked features are consistent with findings that economic reward is positively associated with perceived career success and satisfaction. In addition to these strong academic and financial metrics, behavioral and experiential characteristics, such as Soft Skills Score, Work-Life Balance, and Networking Score, also show compelling importance, albeit at lower ranks. They highlight the complexity of what makes a career satisfying: not just grades or paychecks, but also people skills, professional relationships, and personal well-being. The model's use of these characteristics confirms the study's hypothesis that career outcomes result from educational and behavioral factors acting in concert. Features such as Job Offers and Certifications have moderate but non-zero importance scores, showing that concrete achievements and external validations are also accounted for in career satisfaction. At the same time, Gender and Entrepreneurship carry relatively little importance, indicating that their direct effect is small or mediated through other features. Overall, this importance ranking confirms the utility of comprehensive feature engineering and of employing advanced models on multiple modalities of data.

Fig. 8
figure 8

Feature Importance Analysis.
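Rankings like the one in Fig. 8 are commonly read off a tree ensemble's impurity-based importances. A hedged sketch with scikit-learn on synthetic data (only the first two features drive the label, loosely mimicking the dominance of academic scores; all names and shapes here are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 300 records, 5 features; only features 0 and 1 matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.8 * X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
imp = rf.feature_importances_      # impurity-based importances, summing to 1
print(np.round(imp, 3))            # features 0 and 1 dominate the ranking
```

Sorting `imp` and plotting it as a horizontal bar chart yields a figure of the same shape as Fig. 8.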

The statistical summary heatmap in Fig. 9 provides an overview of the dataset's distributional properties across the numerical attributes. The summary reflects the anticipated central tendencies and spread for the school-performance measures: the means of High_School_GPA, SAT Score, and University GPA sit near the middle of each measure's observed range, indicating a relatively normal academic profile across the sample. With a maximum above 100,000, the starting salary is clearly right-skewed. This discrepancy indicates that, although many respondents start out earning intermediate-level salaries, a minority is compensated significantly more, which may vary by discipline, location, or experience. This broad salary range has consequences for modeling and indicates the importance of normalization or robust regression methods to handle the right skewness. Soft_Skills_Score and Networking Score have moderate means with limited deviation, which is unsurprising given their subjective, self-reported nature. The distributions of ordinal-like continuous variables, such as Work_Life_Balance and Entrepreneurship, confirm that they follow discrete or ordinal scales. The low standard deviations of the academic measures, compared with salary and a few behavioral scores, point to the differing stability of formal education measures versus subjective experiences or situational influences on career satisfaction. Model development should account for these statistical trends through appropriate feature handling and the integration of multiple data modalities to guarantee robustness.

Fig. 9
figure 9

Summary of numeric columns with statistics.
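The right skew noted above is typically handled with a log transform before modeling. A minimal sketch, using a synthetic log-normal salary column as a stand-in for the dataset's starting salaries:

```python
import numpy as np

def skewness(x):
    """Sample skewness (third standardized moment)."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# Hypothetical right-skewed starting salaries (log-normal), as in Fig. 9;
# the mean/sigma values are illustrative assumptions.
rng = np.random.default_rng(5)
salary = rng.lognormal(mean=10.5, sigma=0.5, size=1000)

log_salary = np.log1p(salary)   # compresses the long right tail
print(round(skewness(salary), 2), round(skewness(log_salary), 2))
```

The transformed column has near-zero skew, which makes it far friendlier to models that assume roughly symmetric inputs.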

The exploratory phase indicates that educational backgrounds and behavioral traits are intertwined with career outcomes. Academic metrics such as GPA rank among the most important features but do not alone capture career satisfaction, which highlights the significance of behavioral attributes such as networking and soft skills. Satisfaction tends to correlate with salary and job level, and the t-SNE visualization conveys the nonlinearity of the data. These findings support the need for sophisticated machine learning methods that combine diverse characteristics to accurately forecast job success and satisfaction.

Classification results and comparative analysis of models

Career satisfaction was classified into Low, Medium, and High levels using Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), Gated Recurrent Unit (GRU), and BERT models. These models used multi-factor input features spanning educational and behavioral attributes (e.g., GPA, mentorship, university ranking, soft skills, and work-life balance). Among the classical machine learning methods, Random Forest slightly outperformed SVM and Logistic Regression, with an accuracy of 81% and balanced precision-recall figures around 72–74%. This suggests the robustness of RF on such heterogeneous tabular data and its ability to model nonlinear feature interactions. Adapting a GRU (a kind of RNN) to model sequential dependencies in the data markedly improved the classification results (accuracy = 85%), with strong recall and precision, in line with the RNN's capacity to capture temporal or sequential relationships among behavioral attributes. The BERT model (based on the transformer architecture) is clearly superior to all baselines, reaching an accuracy of 98% with precision, recall, and F1 scores of 96–98%, as shown in Table 6. This significant performance gain is evidence of BERT's strong ability to generalize by contextualizing feature embeddings in its multi-layer transformer architecture. The transformer blocks in BERT use self-attention mechanisms that allow the model to dynamically assign importance to input features, thereby capturing complex interdependencies among behavioral and educational features. Unlike sequential recurrent architectures, BERT attends to all inputs in parallel, so it does not need to propagate information step by step through long sub-sequences, avoiding interference between earlier and later inputs.

Table 6 Results analysis of applied model.
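The classical baselines in Table 6 can be benchmarked with a uniform scikit-learn loop. The sketch below uses a synthetic 3-class dataset as a stand-in for the Low/Medium/High task (the accuracies it prints are for the toy data only, not the paper's reported figures):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic 3-class stand-in: label derived from a weighted feature sum.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 6))
y = np.digitize(X[:, 0] + 0.5 * X[:, 1], bins=[-0.5, 0.5])   # classes 0, 1, 2

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

accs = {}
for name, model in [("SVM", SVC()),
                    ("LR", LogisticRegression(max_iter=1000)),
                    ("RF", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    accs[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {accs[name]:.2f}")
```

Holding the train/test split fixed across models, as here, is what makes accuracy figures like those in Table 6 directly comparable.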

In BERT, embedding layers project the sparse or categorical features into dense high-dimensional vectors so that semantic similarities are preserved and the model generalizes better. These embeddings are then refined by a stack of transformer layers, each comprising multi-head self-attention, feed-forward networks, and normalization layers with residual connections. This layered attention mechanism helps the model capture subtle details in the data that are essential for nuanced classification of career satisfaction levels. The confusion matrices in Fig. 10 show that, for the traditional models (SVM and LR), the confusion is mostly concentrated on adjacent classes (e.g., Low confused with Medium), showing that these groups lie close together in the feature space. RF and GRU alleviate some of these misclassifications, but confusion remains high between the Medium and High classes. In contrast, the BERT model achieves almost perfect classification with no confusion between classes, demonstrating its strong discriminative capability: even subtle differences among the career satisfaction categories are properly separated. The comparison illustrates the importance of the transformer architecture in capturing the complexity of multifactorial career outcomes. BERT's stronger performance indicates that intricately interconnected, nonlinear relationships spanning educational and behavioral data can be modeled effectively when advanced attention mechanisms and rich contextualized embeddings are in place, enabling highly accurate and stable classification of career satisfaction. This capability makes BERT a strong candidate for practical career analytics and personalized career development interventions.
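The core of each transformer layer described above, scaled dot-product self-attention, can be written in a few lines of NumPy. This is a minimal single-head sketch (the dimensions and random weights are illustrative), showing how every position attends to every other in one parallel step rather than through recurrent state:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    Each row of X is one embedded input feature; the softmax rows are the
    attention weights that dynamically assign importance across inputs."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # row-wise softmax
    return w @ V, w

rng = np.random.default_rng(4)
X = rng.normal(size=(7, 16))                  # 7 feature tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
print(out.shape)                              # (7, 16); each row of w sums to 1
```

BERT stacks many such heads and layers (with residuals and normalization), but the weighting mechanism that lets it mix educational and behavioral features is exactly this softmax over pairwise scores.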

Figure 11 shows the training and validation loss trends over 50 epochs of BERT model training. Both decrease steadily, indicating a healthy learning curve and good convergence. The training loss falls from approximately 0.7 to around 0.11, reflecting the model's growing ability to reduce errors on the training data. The validation loss follows a similar trajectory, starting high before decreasing and flattening out around 0.14 by the final epochs.

Fig. 10
figure 10

Confusion Matrix analysis of applied models. (a) SVM. (b) LR. (c) RF. (d) GRU. (e) BERT.

The small gap between the training and validation losses is strong evidence of minimal overfitting, so the model is likely to generalize well to previously unseen data. The zoomed inset over the last epochs supports this convergence, with the training and validation losses nearly coinciding at the validation-loss minimum (epoch 50). This convergence suggests that the training was well regularized: the model balances bias and variance effectively and settles on good parameters thanks to proper regularization and learning-rate schedules.

Fig. 11
figure 11

Model training and validation loss analysis over various set of epochs.

The plot in Fig. 12 compares the validation accuracy of the proposed BERT-attention model against its GRU baseline across training epochs. Overall, the learning curve and final performance of the BERT model are significantly better. Beginning with a higher initial accuracy (74% vs. the GRU's 70%), BERT's validation accuracy grows faster during the rapid learning phase (epochs 1–10), staying 4–6% above the GRU. As training proceeds into the stabilization phase (epochs 30–50), the BERT model climbs more stably to the top, ultimately clearly exceeding the GRU's plateau of around 86%. The graph is shaded to mark these phases, accentuating the performance gap between the models. The labeled percentage increases at different epochs show strong increments, emphasizing the role of contextual embeddings and attention in learning patterns in the joint educational and behavioral feature space.

Fig. 12
figure 12

Deep and transformer model accuracy comparison.

Figure 13 shows the model's prediction confidence across the three career satisfaction classes (Low, Medium, High). Each bubble is a particular prediction, with bubble size proportional to the confidence level. Most importantly, the clusters sit in the confidence regime close to 1.0 for all classes, showing highly confident predictions. The tight clustering at high confidence values indicates good performance and strong discriminative power. The very low dispersion below 0.6 confidence indicates few ambiguous predictions, further confirming the model's strong generalization. This highly confident distribution also agrees with the accuracy and loss analyses, underlining BERT's ability to classify fine-grained career satisfaction levels with high precision and confidence.

Fig. 13
figure 13

Model Confidence Score analysis over classes.
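A confidence score of the kind plotted in Fig. 13 is simply the maximum softmax probability of each prediction. A minimal sketch (the logit values are made up for illustration):

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical classifier logits for four predictions over (Low, Medium, High).
logits = np.array([[0.2, 1.1, 4.0],
                   [3.5, 0.4, 0.1],
                   [0.3, 2.9, 0.8],
                   [1.0, 1.2, 1.1]])

probs = softmax(logits)
confidence = probs.max(axis=1)     # bubble size in a plot like Fig. 13
predicted = probs.argmax(axis=1)   # predicted class index per row
print(np.round(confidence, 2), predicted)   # predicted -> [2 0 1 1]
```

The last row illustrates an ambiguous prediction: its three logits are nearly equal, so its confidence falls well below the others.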

Interpretability analysis with LIME and SHAP shows that both educational and behavioral features are important factors in predicting career satisfaction. Starting Salary, Current Job Level, Certifications, and Job Offers contribute most to the class prediction under both methods, while Soft Skills Score and Networking Score are found to be only marginally effective. Figure 14 gives a local explanation of a single prediction and shows how each feature pushes the model's output toward or away from the predicted class. The contributions of starting salary, certification level, internship count, and organization size are positive and raise the predicted class probability, whereas other features (e.g., age, work-life balance) make weak or even negative contributions.

Fig. 14
figure 14

LIME-style local explanation illustrating top feature contributions for a predicted Medium satisfaction case.

This highlights how individual-level factors can combine in ways that deviate from the global trend, providing personalized interpretability.

Figure 15 shows the average global importance of the features with respect to the trained model in a bar plot. Academic features such as GPA and SAT scores contribute less here, while soft skills, networking, and work-life balance have minimal global impact. This emphasizes the importance of practical career factors, rather than purely academic performance, in shaping outcomes. Together, these results suggest that measurable career outcomes (salary, promotions, job level) remain the decisive predictors of satisfaction, whereas softer indicators, although relevant, play a comparatively minor role.

Fig. 15
figure 15

Global SHAP feature importance plot showing the average impact of predictors on career satisfaction classification.
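The paper's global view uses SHAP; a simpler model-agnostic alternative that conveys the same idea is permutation importance: shuffle one feature at a time and measure the accuracy drop. The sketch below is exactly that substitute technique, not the SHAP computation itself, run on a toy model where only one feature matters (all names and shapes are illustrative):

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = np.random.default_rng(seed)
    base = (predict(X) == y).mean()
    drops = []
    for j in range(X.shape[1]):
        d = 0.0
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])          # break feature j's link to the label
            d += base - (predict(Xp) == y).mean()
        drops.append(d / n_repeats)
    return np.array(drops)

# Toy "model": thresholds feature 0 only, standing in for a dominant
# predictor such as Starting_Salary (hypothetical mapping).
rng = np.random.default_rng(6)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)
predict = lambda X: (X[:, 0] > 0).astype(int)

imp = permutation_importance(predict, X, y)
print(np.round(imp, 3))   # feature 0 dominates; the rest are near zero
```

Sorting these per-feature drops and plotting them as bars yields a global-importance chart analogous in spirit to Fig. 15.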

Taken together, these findings offer convincing evidence of the superiority and stability of the proposed BERT model in modeling career satisfaction in terms of multivariate educational and behavioral characteristics. Although BERT achieved a very high accuracy of 98%, this result should be interpreted with caution. Professional data in real-world situations are usually noisy and subjective and span demographics and socio-economic backgrounds that may not be fully represented in the dataset40. Hence, the almost-perfect scores observed here are, at least to some extent, influenced by characteristics or biases specific to this dataset, and validation on more diverse cohorts is needed. Nevertheless, the feature importance analysis provides useful insights in line with previous career development research. For instance, key educational indicators such as SAT scores are robust predictors of early-career success, echoing findings from educational psychology, while behavioral factors such as networking and work-life balance resonate with human capital management and organizational behavior research on social capital and sustainable professional engagement. This concordance with prior work underscores that, while overfitting remains a concern, the proposed model learns meaningful associations between educational and behavioral attributes and career satisfaction outcomes. BERT's attention mechanisms effectively capture complex dependencies in its end-to-end learning process, leading to fast convergence and highly confident, accurate predictions that surpass classical recurrent models such as GRU. This demonstrates the viability of transformer-based architectures for intelligent career analytics applications.
Our proposed BERT model for career satisfaction prediction significantly outperforms the existing state-of-the-art models in the field.

Previous work, summarized in Table 7, trained a Gradient Boosting (GB) model on data such as university alumni surveys and achieved an AUC of 89%15, and a hybrid LSTM and RF model trained on longitudinal student data reached an accuracy of 89%17, while our BERT model achieved 98% accuracy. Earlier deep learning solutions, such as a single-layer LSTM on alumni salary data, reported moderate accuracy of around 78%16, while a BERT model applied to corporate demographic and exam score data reported an accuracy of about 73%26. Classic machine learning models such as Random Forest and Support Vector Machines (SVM) achieved accuracies between 68% and 77% on similar educational datasets18,19. The improved performance of our model results from the incorporation of both behavioral and academic characteristics and from contextual learning of their intricate relationships via the attention mechanisms of transformers. This broad feature representation enables a better understanding and prediction of career satisfaction and showcases how state-of-the-art transformer architectures can transform educational and career analytics.

Table 7 Comparison of proposed with state-of-the-art.

Conclusion and future work

This study presents a multi-factor predictive model for career success and satisfaction that integrates both educational and behavioral traits using advanced machine learning techniques. Our findings demonstrate that the proposed BERT-based model significantly outperforms traditional classifiers and recurrent neural networks by effectively capturing complex interactions through its transformer-based attention mechanisms. Key educational features such as SAT scores, university ranking, and GPA, alongside behavioral factors like soft skills and networking, collectively contribute to accurate prediction of career satisfaction levels. The model’s superior performance, achieving 98% accuracy, underscores the value of combining diverse data modalities and leveraging deep contextual embeddings for nuanced career analytics. Moreover, the exploratory data analysis revealed intricate relationships between educational background, personal traits, and career outcomes, affirming that career satisfaction is influenced by multifaceted factors rather than isolated indicators. The t-SNE visualization and feature importance analysis further highlighted the non-linear and interdependent nature of these attributes. The current study has limitations: first, the data were collected through self-reported survey responses, which may be subject to bias or may misrepresent actual academic and career performance. Second, the lack of external validation or deployment on other datasets limits generalizability to other populations, institutions, or socio-economic settings. Third, although BERT performed with very high accuracy, transformer models often act as black boxes; their decision-making process is not as clearly interpretable as that of tree-based methods. Additionally, incorporating the temporal dynamics of career progression and integrating external economic or industry-specific variables may further improve model robustness.