What Underlying Factors Contribute to Employee Burnout?¶

Natalia Blanco¶

Intro¶

Problem: According to a survey by Deloitte, 77% of professionals have encountered burnout in their workplace. Additionally, the study found that 91% of respondents reported that experiencing an overwhelming amount of stress or frustration has a detrimental effect on their work quality.

In this project, I will investigate the underlying factors within the workspace that contribute to the rate of employee burnout.

To answer this question, I will employ regression analysis and predictive modeling techniques using a data base obtained from kaggle which includes an extensive variety of variables.

The variables I will be using to solve my question are:

  • Burn Rate (Outcome variable): Burnout rate while working. (0.0, 10.0) A score of 0 indicates no burnout, while 10 represents extreme burnout.
  • Date of Joining: The date and time when the employee joined the organization.
  • Gender: The gender of the employee.
  • Company Type: Describes if the employee works in a product or service-based company.
  • WFH Setup Available: Indicates if the company permits working from home.
  • Designation: The rank of an employee within the organization. (0.0, 5.0) A larger number equals a higher ranking.
  • Resource Allocation: Number of hours worked per day. (0.0, 10.0) A larger number equals more hours.
  • Mental Fatigue Score: The level of mental fatigue the employee is facing. (0.0, 10.0) A score of 0 means no fatigue, while 10 indicates extreme fatigue.
In [1]:
# libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Data Cleaning¶

In [65]:
# Import data 
bo_data = pd.read_csv("C:/Users/ntybl/OneDrive/Documents/Data Science for ECON/BurnoutData.csv")
bo_data.head()
Out[65]:
Employee ID Date of Joining Gender Company Type WFH Setup Available Designation Resource Allocation Mental Fatigue Score Burn Rate
0 fffe32003000360033003200 09/30/2008 Female Service No 2 3.0 3.8 0.16
1 fffe3700360033003500 11/30/2008 Male Service Yes 1 2.0 5.0 0.36
2 fffe31003300320037003900 03/10/2008 Female Product Yes 2 NaN 5.8 0.49
3 fffe32003400380032003900 11/03/2008 Male Service Yes 1 1.0 2.6 0.20
4 fffe31003900340031003600 07/24/2008 Female Service No 3 7.0 6.9 0.52
In [66]:
# Delete "Employee ID" since its not relevant for this analysis 
bo_data.drop(columns=['Employee ID'], inplace=True)
In [67]:
# Convert 'Date of Joining' to datetime format
bo_data['Date of Joining'] = pd.to_datetime(bo_data['Date of Joining'])

Exploratory Data Analysis¶

In [68]:
# Summary statistics of the numerical variables 
summary_stats = bo_data[['Designation', 'Resource Allocation', 'Mental Fatigue Score', 'Burn Rate',]].describe()
print(summary_stats)
        Designation  Resource Allocation  Mental Fatigue Score     Burn Rate
count  22750.000000         21369.000000          20633.000000  21626.000000
mean       2.178725             4.481398              5.728188      0.452005
std        1.135145             2.047211              1.920839      0.198226
min        0.000000             1.000000              0.000000      0.000000
25%        1.000000             3.000000              4.600000      0.310000
50%        2.000000             4.000000              5.900000      0.450000
75%        3.000000             6.000000              7.100000      0.590000
max        5.000000            10.000000             10.000000      1.000000

Designation: Rank of an employee within the organization

  • The average designation is 2.18, with a standard deviation of 1.14, indicating some variability in the ranks.
  • The range spans from 0 to 5; the majority of employees fall within the lower to mid-level ranks.
  • We can see that the 25th percentile is 1, which indicates that at least 25% of employees are in lower-ranking positions.

Resource Allocation: Number of hours worked per day

  • On average, employees are assigned approximately 4.48 hours. Given that the highest possible value is 10, this suggests that, on average, employees are allocated a medium amount of hours. A standard deviation of approximately 2.05 indicates some variability.
  • The median (50th percentile) is 4, suggesting that half of the employees are allocated four or fewer hours.

Mental Fatigue Score:

  • On average, employees have a mental fatigue score of 5.73, with a standard deviation of 1.92, indicating moderate variability in fatigue levels.
  • The 25th percentile is 4.6, indicating that at least 25% of employees experience significant levels of mental fatigue.

Burn Rate:

  • The average burn rate is 0.452, with a standard deviation of 0.198, indicating some variability in burnout rates.
  • The median burn rate is 0.45, suggesting that half of the employees experience burnout rates close to the average.
In [73]:
# Histogram of 'Mental Fatigue Score' and 'Burn Rate'

plot_df = bo_data[['Mental Fatigue Score', 'Burn Rate']]
# Plot histograms
plot_df.hist(bins=30, figsize=(10, 6), color='blue', edgecolor='black')
plt.show()

Mental Fatigue Score

  • We can notice a very slight negative skewness in the distribution which suggests that there are a few employees with lower mental fatigue scores than the majority.

Burn Rate

  • We can observe very slight positive skewness in the distribution which suggests that there are some employees with higher burnout rates than the majority.

Overall both histograms exhibit a bell-shaped distribution with slight deviations from symmetry.

In [7]:
# Scatter plot of 'Mental Fatigue Score' against 'Burn Rate'
plt.figure(figsize=(8, 6))
plt.scatter(bo_data['Mental Fatigue Score'], bo_data['Burn Rate'], color='blue', alpha=0.5)
plt.title('Scatter Plot of Mental Fatigue Score vs. Burn Rate')
plt.xlabel('Mental Fatigue Score')
plt.ylabel('Burn Rate')
plt.grid(True)
plt.show()
  • This scatter plot shows a positive correlation between 'Mental Fatigue Score' and 'Burn Rate'; as the 'Mental Fatigue Score' increases, the 'Burn Rate' also tends to increase.
  • The scatter plot also shows a linear relationship between the variables.
In [8]:
# line plot of 'Mental Fatigue Score' against 'Date of Joining'

line_plt = bo_data[['Date of Joining', 'Mental Fatigue Score']]

# Extract month and year from 'Date of Joining'
line_plt['Joining Month'] = line_plt['Date of Joining'].dt.to_period('M')

# Group by 'Joining Month' and calculate average 'Mental Fatigue Score' for each month
monthly_avg_score = line_plt.groupby('Joining Month')['Mental Fatigue Score'].mean()

# Plot line plot
plt.figure(figsize=(10, 6))
plt.plot(monthly_avg_score.index.to_timestamp(), monthly_avg_score.values, color='blue')
plt.title('Average Mental Fatigue Score by Joining Month')
plt.xlabel('Joining Month')
plt.ylabel('Average Mental Fatigue Score')
plt.grid(True)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()
C:\Users\ntybl\AppData\Local\Temp\ipykernel_11756\2710531583.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  line_plt['Joining Month'] = line_plt['Date of Joining'].dt.to_period('M')
  • The line plot shows that employees that joined around July of 2008 have higher levels of Average Mental Fatigue
  • Since this variable does not seem to be relevant i will exclude it from the analysis later on.
In [9]:
# Bar chart of 'Burn Rate' against 'Designation' 
burn_designation = bo_data.groupby('Designation')['Burn Rate'].mean()

# Plot bar chart
plt.figure(figsize=(8, 6))
plt.bar(burn_designation.index, burn_designation.values, color='blue')
plt.title('Average Burn Rate by Designation Level')
plt.xlabel('Designation Level')
plt.ylabel('Average Burn Rate')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()
  • The plot shows that higher designation levels are associated with higher average of Burnout.
In [10]:
# Bar chart of 'Burn Rate' against 'Gender'
burn_gender = bo_data.groupby('Gender')['Burn Rate'].mean()

# Plot bar chart
plt.figure(figsize=(6, 6))
plt.bar(burn_gender.index, burn_gender.values, color='green')
plt.title('Average Burn Rate by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Burn Rate')
plt.grid(axis='y')
plt.show()
  • On average, male workers exhibit higher levels of burnout compared to female workers by approximately one point
In [11]:
# Bar chart of 'Burn Rate' against 'Gender'
burn_WFH = bo_data.groupby('WFH Setup Available')['Burn Rate'].mean()

# Plot bar chart
plt.figure(figsize=(6, 6))
plt.bar(burn_WFH.index, burn_WFH.values, color='pink')
plt.title('Average Burn Rate by WFH')
plt.xlabel('WFH')
plt.ylabel('Average Burn Rate')
plt.grid(axis='y')
plt.show()
  • The plot shows that companies that don't allow work from home tend to have a higher level of Burnout among their employees.
In [16]:
dataplot = sns.heatmap(bo_data.corr(), cmap="YlGnBu", annot=True) 
plt.show() 
  • Based on the results of this corrplot i will be excluding Mental Fatigue Score from my predictive models since it is highly correlated with my outcome variable (Burn Rate).
In [13]:
# Check for null values
null_percentage = (bo_data.isnull().sum() / len(bo_data)) * 100

print("Null Percentage by Column:")
print(null_percentage)
Null Percentage by Column:
Date of Joining         0.000000
Gender                  0.000000
Company Type            0.000000
WFH Setup Available     0.000000
Designation             0.000000
Resource Allocation     6.070330
Mental Fatigue Score    9.305495
Burn Rate               4.940659
dtype: float64
  • There is a relatively small amount of null values in the data set

I will handle the missing values during the Feature engineering phase of this project.

EDA findings¶

Insights

  • Designation Level vs. Burnout Rate: There is a clear trend indicating that higher designation levels correlate with higher average burnout rates. This insight is invaluable for organizations as it sheds light on the impact of job roles and responsibilities on employee well-being. To mitigate burnout, companies should prioritize promoting a better work-life balance, even for employees in high-ranking positions. This can be achieved by closely monitoring workload and providing opportunities for employees to prioritize their mental health.
  • Work from Home vs. Burnout Rate: Companies that do not offer work-from-home options tend to have higher levels of burnout among their employees. To address this issue, companies should consider implementing flexible work policies that allow employees to work from home at least some days of the week. Offering this flexibility can contribute to lower burnout levels and promote overall employee well-being.

Feature engineering¶

In [14]:
from scipy.stats import skew
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PowerTransformer
In [15]:
# Convert categorical variables into binary variables 

bo_data.replace({'Gender': {'Male': 1, 'Female': 0},
            'Company Type': {'Service': 1, 'Product': 0},
            'WFH Setup Available': {'Yes': 1, 'No': 0}}, inplace=True)
In [17]:
# Split data into predictors and and classes (outcome variable).
predictors = bo_data[['Gender', 'Company Type', 'WFH Setup Available', 'Designation', 'Resource Allocation']]
classes = bo_data[['Burn Rate']]

# look at the shapes
print(predictors.shape)
print(classes.shape)
(22750, 5)
(22750, 1)
  • By looking at the shapes we can confirm the split was done correctly.
In [18]:
# Use k-nearest neighbors to impute missing values 

# Initialize KNN imputer
imputer = KNNImputer(n_neighbors=3)  # You can adjust the number of neighbors as needed

# Impute missing values
predictors_imputed = imputer.fit_transform(predictors)

# Convert the array back to a DataFrame 
predictors_imputed = pd.DataFrame(predictors_imputed, columns=predictors.columns)
In [19]:
# Impute missing values of classes using mean
classes_imputed = classes.fillna(classes.mean())
In [21]:
# Check for skewness 
columns_to_check = ['Designation', 'Resource Allocation']

# Calculate skewness for each column
for col in columns_to_check:
    skewness = skew(predictors_imputed[col])
    print(f"Skewness of {col}: {skewness}")
Skewness of Designation: 0.09241529095974846
Skewness of Resource Allocation: 0.20676562406952995
  • These skewness values suggest that the distributions of these variables are relatively symmetric, with only minor deviations from a normal distribution. Given this i will not be applying box-cox.
In [22]:
# I will now apply PCA  to my predictors 

# Standardize the predictors
scaler = StandardScaler()
scaled_predictors = scaler.fit_transform(predictors_imputed)

# Apply PCA
pca = PCA()
pca.fit(scaled_predictors)

# Transform the predictors into principal components
predictor_pca = pca.transform(scaled_predictors)
  • Now we can visualize how much each component explains from the data set.
In [23]:
print('Variance explained by PCs {0}'.format(pca.explained_variance_ratio_))
Variance explained by PCs [0.41039126 0.20066018 0.19228763 0.17384157 0.02281937]
In [24]:
# Plot explained variance ratio
plt.figure(figsize=(8, 5))
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, color='skyblue')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance Ratio of Principal Components')
plt.show()
  • As expected PC1 explains the most out of all the PC's with a value of 46% of the total variance in the data.

Model building¶

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso
import statsmodels.api as sm

Linear Regression¶

In [38]:
# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(predictors_imputed, classes_imputed, test_size=0.2, random_state=42)

# Create and fit linear regression
X_train_sm = sm.add_constant(X_train)  # Add constant term
model_sm = sm.OLS(y_train, X_train_sm)
results_sm = model_sm.fit()

# Make predictions on testing set
X_test_sm = sm.add_constant(X_test)  # Add constant term
y_pred = results_sm.predict(X_test_sm)

# Calculate MSE
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (Linear):", mse)

# Calculate AIC
aic = results_sm.aic
print("AIC (Linear):", aic)
Mean Squared Error (Linear): 0.011598094478036369
AIC (Linear): -29607.080420509024
In [27]:
# Plot actual vs predicted values
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--')
plt.xlabel('Actual Burnout Rate')
plt.ylabel('Predicted Burnout Rate')
plt.title('Actual vs Predicted Burnout Rate (Linear Regression)')
plt.show()

Lasso Regression¶

In [46]:
# Apply np.ravel
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

# Fit the Lasso regression model
lasso_model = Lasso(alpha=0.1) 
lasso_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lasso_model.predict(X_test)

# Calculate MSE
mse_l = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (Lasso):", mse_l)

# Calculate RSS to calculate AIC later
rss = np.sum((y_pred - y_test) ** 2)

# Number features
n_predictors = X_train.shape[1]

# Calculate the number of observations
n_obs = len(y_test)

# Calculate the degrees of freedom
df = n_obs - n_predictors - 1

# Calculate the AIC 
aic = n_obs * np.log(rss / n_obs) + 2 * n_predictors
print("AIC (Lasso):", aic)
Mean Squared Error (Lasso): 0.013770525810345529
AIC (Lasso): -19487.772757009512
In [43]:
# Fit the Lasso regression model with a lower alpha value
lasso_model2 = Lasso(alpha=0.001)
lasso_model2.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lasso_model2.predict(X_test)

# MSE
mse_l2 = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (Lasso2):", mse_l2)

# RSS
rss = np.sum((y_pred - y_test) ** 2)

# Number of features
n_predictors = X_train.shape[1]

# Observations
n_obs = len(y_test)

# Degrees of freedom
df = n_obs - n_predictors - 1

# AIC
aic = n_obs * np.log(rss / n_obs) + 2 * n_predictors
print("AIC (Lasso2):", aic)
Mean Squared Error (Lasso2): 0.011606493044358195
AIC (Lasso2): -20265.667194807742
  • We can observe a considerable improvement in the lasso with a smaller alpha.

KNN prediction¶

In [48]:
# KNN model
knn_model = KNeighborsRegressor(n_neighbors=5) 

# Fit the model
knn_model.fit(X_train, y_train)

# Make predictions
y_pred = knn_model.predict(X_test)

# Evaluate (MSE)
mse_knn = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (KNN):", mse_knn)
Mean Squared Error (KNN): 0.013328017433556338
In [49]:
# KNN model with 3 neighbors
knn_model2 = KNeighborsRegressor(n_neighbors=3) 

# Fit the model
knn_model2.fit(X_train, y_train)

# Make predictions
y_pred = knn_model2.predict(X_test)

# Evaluate (MSE)
mse_knn2 = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (KNN2):", mse_knn2)
Mean Squared Error (KNN2): 0.014377058149953001
  • We got a higher value for MSE indicating worse performance.

Random Forest¶

In [45]:
# Initialize the Random Forest 
rf_model = RandomForestRegressor(n_estimators=100, random_state=44) 

# Fit the model on train
rf_model.fit(X_train, y_train)

# Make predictions on test
y_pred_rf = rf_model.predict(X_test)

# Calculate MSE
mse_rf = mean_squared_error(y_test, y_pred_rf)
print("Mean Squared Error (Random Forest):", mse_rf)
Mean Squared Error (Random Forest): 0.01117968504080362
In [55]:
# Random Forest with higher n_estimators
# Initialize the Random Forest 
rf_model2 = RandomForestRegressor(n_estimators=300, random_state=44) 

# Fit the model on train
rf_model2.fit(X_train, y_train)

# Make predictions on test
y_pred_rf = rf_model2.predict(X_test)

# Calculate MSE
mse_rf2 = mean_squared_error(y_test, y_pred_rf)
print("Mean Squared Error (Random Forest2):", mse_rf2)
Mean Squared Error (Random Forest2): 0.011178057893738682
  • We got a lower MSE value which indicates better performance.

Model evaluation¶

In [57]:
from sklearn.model_selection import GridSearchCV
In [51]:
# Model Selection
print("Mean Squared Error (Linear):", mse)
print("Mean Squared Error (Lasso):", mse_l)
print("Mean Squared Error (Lasso2):", mse_l2)
print("Mean Squared Error (KNN):", mse_knn)
print("Mean Squared Error (KNN2):", mse_knn2)
print("Mean Squared Error (Random Forest):", mse_rf)
print("Mean Squared Error (Random Forest2):", mse_rf2)
Mean Squared Error (Linear): 0.013770525810345529
Mean Squared Error (Lasso): 0.013770525810345529
Mean Squared Error (Lasso2): 0.011606493044358195
Mean Squared Error (KNN): 0.013328017433556338
Mean Squared Error (KNN2): 0.014377058149953001
Mean Squared Error (Random Forest): 0.01117968504080362
Mean Squared Error (Random Forest2): 0.011178057893738682
  • Based on the MSE of the models we can conclude that the best model is Random Forest2 with a value of 0.011

Model Tuning¶

In [58]:
# Random Forest

# Define parameters
param_grid = {
    'n_estimators': [300, 500, 700, 1000],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10, 15]
}

# Random Forest regressor
rf_model_g = RandomForestRegressor(random_state=44)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf_model_g, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best model and its parameters
best_rf_model = grid_search.best_estimator_
best_params = grid_search.best_params_

# Print best parameters
print("Best parameters:", best_params)

# Make predictions on test data using the best model
y_pred_best_rf = best_rf_model.predict(X_test)

# Calculate MSE for the best model
mse_best_rf = mean_squared_error(y_test, y_pred_best_rf)
print("Mean Squared Error (Best Random Forest):", mse_best_rf)
Best parameters: {'max_depth': 10, 'min_samples_split': 15, 'n_estimators': 1000}
Mean Squared Error (Best Random Forest): 0.01116517121922181
In [64]:
# Visualize the variables that explain more than 1.3%

# Get the feature names
feature_names = X_train.columns

# Extract feature from the best Random Forest model
best_rf_feature_importances = best_rf_model.feature_importances_

# Create a DataFrame to store the feature importances
rf_model_var_imp = pd.DataFrame({'features': feature_names, 'importance': best_rf_feature_importances})

# Include only variables with importance > 0.013
important_variables = rf_model_var_imp[rf_model_var_imp['importance'] > 0.013]

# Sort by importance values
important_variables = important_variables.sort_values('importance')

# Plot bar chart
plt.figure(figsize=(8, 5))
plt.barh(important_variables['features'], important_variables['importance'])
plt.xlabel('Share of Total Explained')
plt.ylabel('Variable Name')
plt.title('Variable Importance (Top Variables)')
plt.show()
  • We can clearly observe that "Resource Allocation" stands out as the most influential variable in explaining burnout among all the predictive factors examined.

Conclusion¶

Findings¶

  • Resource Allocation is the variable with the most influence in burn out rate.
  • Work From Home Setup Available(whether the company allows work from home) has a cionsiderable influence in burn out rate.

Insights¶

  • Resource Allocation(hr worked per day): Excessive work hours increase the risk of burnout, leading to mental exhaustion and reduced productivity. This not only impacts the workplace but also employees' personal lives. To foster a healthier work environment, organizations should prioritize workload optimization and establish realistic deadlines to promote a better work-life balance.

  • WFH Setup Available: This confirms the shift in dynamics of modern work, organizations need to adapt their policies to accommodate remote work. It also proves the need for flexibility in the work environment. Working from home allows employees to have a better control over their schedules and reduces the time and effort associated with commuting.

When reviewing the findings and insights I was surprised 'Designation' (ranking within the company) didn't have a high influence in burn out rate since in the exploratory analysis it showed a clear trend between ranking and average burnout. I believe the outcomes of this analysis are relevant for organizations and should be considered since burn out not only affects the productivity of an employee, but it also affects their overall quality of life.

Conclusion regarding the process¶

While working on this project, I encountered unexpected challenges, while doing the box-transformation I wasn't satisfied with the results, ultimately after reviewing the skewness before and after box-cox I decided not to transform my data. Another challenge I encountered was when selecting a model. Initially, after building my predictive models the best model based on the MSE was linear regression, I know this is a possibility but anyway it seemed weird, these results made me go back and look at my data and I realized I had a highly correlated variable, so I decided to drop it. Lastly, I encountered a challenge during model tunning since I decided to set n_estimators to 1000 expecting the model to take 20 minutes at most to run but it took around 40 minutes, I would say it was worth it since it was the best model however it made me realize that in another situation when working with a larger data set or with a time constraint I should set more realistic parameters in my models.

References¶

  • Blurred Machine. (2021). Are Your Employees Burning Out? [Dataset]. Retrieved from https://www.kaggle.com/datasets/blurredmachine/are-your-employees-burning-out?select=sample_submission.csv
  • Deloitte. (n.d.). Burnout Survey: Is employee burnout affecting your bottom line? Deloitte. Retrieved from https://www2.deloitte.com/us/en/pages/about-deloitte/articles/burnout-survey.html
In [ ]: