M4 PROJECT: What are the most influencing factors for employee attrition and whom are those leaveing.

In [1]:
#Importing the libriaries needed 
# Ignore the warnings
import warnings
warnings.filterwarnings('always')
warnings.filterwarnings('ignore')

# data visualisation and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
#import missingno as msno

#configure
# sets matplotlib to inline and displays graphs below the corressponding cell.
%matplotlib inline  
style.use('fivethirtyeight')
sns.set(style='whitegrid',color_codes=True)

#import the necessary modelling algos.
#from sklearn.linear_model import LogisticRegression
#from sklearn.svm import LinearSVC
#from sklearn.svm import SVC
#from sklearn.neighbors import KNeighborsClassifier
#from sklearn.ensemble import RandomForestClassifier
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import GradientBoostingClassifier
#from sklearn.naive_bayes import GaussianNB

#model selection
#from sklearn.model_selection import train_test_split
#from sklearn.model_selection import KFold
#from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix,roc_curve,roc_auc_score
#from sklearn.model_selection import GridSearchCV

#from imblearn.over_sampling import SMOTE

#preprocess.
#from sklearn.preprocessing import MinMaxScaler,StandardScaler,LabelEncoder,OneHotEncoder

# Common sklearn Model Helpers
#from sklearn import feature_selection
#from sklearn import model_selection
#from sklearn import metrics
# from sklearn.datasets import make_classification

# sklearn modules for performance metrics
#from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve
#from sklearn.metrics import auc, roc_auc_score, roc_curve, recall_score, log_loss
#from sklearn.metrics import f1_score, accuracy_score, roc_auc_score, make_scorer
#from sklearn.metrics import average_precision_score
# ann and dl libraraies
#from keras import backend as K
#from keras.models import Sequential
#from keras.layers import Dense
#from keras.optimizers import Adam,SGD,Adagrad,Adadelta,RMSprop
#from keras.utils import to_categorical

#import tensorflow as tf
import random as rn

1. Dataset Explaination

This dataset was a fictional dataset created by IBM to indentify important factors that may be influencing attrition for an employee. The dataset contains 1470 rows and 35 coulmns. Our project is focusing on to find the most importance metrics that influence attrition. Firstly we need to do some general statistics to get insight into the dataset, Second using machine learning to predict attrition. Maybe it will give findings that people do not usally think about regarding employee attrition.

Haveing a understanding of what make employees leave is important to know, if a person is leaving replacement cost could be high. Being aware of it will be easier to take action to improve to the employee attrition.

Some of the questions we want to cover during this project

  • What is the likelihood of an active employee leaving the company?
  • What are the key indicators of an employee leaving the company?
  • What policies or strategies can be adopted based on the results to improve employee retention?

Given that we have data on former employees, this is a standard supervised classification problem where the label is a binary variable, 0 (active employee), 1 (former employee). In this study, our target variable Y is the probability of an employee leaving the company.

Some important columns in the dataset with information about personal and employment details, explained in more:

Some important columns:

  1. Attrition: Whether employees are still with the company or whether they’ve gone to work somewhere else.
  2. Age: 18 to 60 years old
  3. Gender: Female or Male
  4. Department: Research & Development, Sales, Human Resources.
  5. BusinessTravel: Travel_Rarely, Travel_Frequently, Non-Travel.
  6. DistanceFromHome: Distance between the company and their home in miles.
  7. MonthlyIncome: Employees' numeric monthly income.
  8. MaritalStatus: Married, Single, Divorced.
  9. Education: 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor'.
  10. EducationField: Life Sciences, Medical, Marketing,Technical Degree,Other.
  11. EnvironmentSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'.
  12. RelationshipSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'.
  13. JobInvolvement: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'.
  14. JobRole: Sales Executive,Research Science, Laboratory Tec, Manufacturing, Healthcare Rep, etc
  15. JobSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'.
  16. OverTime: Whether they work overtime or not.
  17. NumCompaniesWorked: Number of companies they worked for before joinging IBM.
  18. PerformanceRating: 1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding'.
  19. YearsAtCompany: Years they worked for IBM.
  20. WorkLifeBalance: 1 'Bad' 2 'Good' 3 'Better' 4 'Best'.
  21. YearsSinceLastPromotion: Years passed since their last promotion.

2. Data Preparation

In [2]:
df_employee = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

After loading the dataset into a dataframe, using the command under we can get a good understadning how dataset is put together. Some of functions used is listed under.

df_employee.head() 
df_employee.columns
df_employee.decribe()
df_employee.shape()
df_employee.info()

From the those commands we can see that the dataset contains no missing values. It is several numerical and categorical variables. From a HR prepective,these type of data about empoyees is unlikely to feature huge amount of missing data

In [3]:
#Taking a look at the data set
df_employee.head()
Out[3]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 ... 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 ... 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 35 columns

In [4]:
df_employee.columns
Out[4]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
In [5]:
df_employee.shape
Out[5]:
(1470, 35)
In [6]:
df_employee.describe()
Out[6]:
Age DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
count 1470.000000 1470.000000 1470.000000 1470.000000 1470.0 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 ... 1470.000000 1470.0 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000
mean 36.923810 802.485714 9.192517 2.912925 1.0 1024.865306 2.721769 65.891156 2.729932 2.063946 ... 2.712245 80.0 0.793878 11.279592 2.799320 2.761224 7.008163 4.229252 2.187755 4.123129
std 9.135373 403.509100 8.106864 1.024165 0.0 602.024335 1.093082 20.329428 0.711561 1.106940 ... 1.081209 0.0 0.852077 7.780782 1.289271 0.706476 6.126525 3.623137 3.222430 3.568136
min 18.000000 102.000000 1.000000 1.000000 1.0 1.000000 1.000000 30.000000 1.000000 1.000000 ... 1.000000 80.0 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 30.000000 465.000000 2.000000 2.000000 1.0 491.250000 2.000000 48.000000 2.000000 1.000000 ... 2.000000 80.0 0.000000 6.000000 2.000000 2.000000 3.000000 2.000000 0.000000 2.000000
50% 36.000000 802.000000 7.000000 3.000000 1.0 1020.500000 3.000000 66.000000 3.000000 2.000000 ... 3.000000 80.0 1.000000 10.000000 3.000000 3.000000 5.000000 3.000000 1.000000 3.000000
75% 43.000000 1157.000000 14.000000 4.000000 1.0 1555.750000 4.000000 83.750000 3.000000 3.000000 ... 4.000000 80.0 1.000000 15.000000 3.000000 3.000000 9.000000 7.000000 3.000000 7.000000
max 60.000000 1499.000000 29.000000 5.000000 1.0 2068.000000 4.000000 100.000000 4.000000 5.000000 ... 4.000000 80.0 3.000000 40.000000 6.000000 4.000000 40.000000 18.000000 15.000000 17.000000

8 rows × 26 columns

In [7]:
df_employee.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
In [8]:
df_employee.isnull().sum()
Out[8]:
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64

Now is time to check if all variables will give some useful insights or some of them could be deleted. To check this it is possible to loop thought and check if unique value is 1, and them drop the columns.

In [9]:
#this fuction is not test out yet. but will be 
notneeded = []
for col in df_employee.columns:
    if len(df_employee[col].unique()) == 1:
        notneeded.append(col)
        df_employee.drop(col,inplace=True,axis=1)
In [10]:
print(notneeded)
['EmployeeCount', 'Over18', 'StandardHours']
In [11]:
df_employee.drop(['EmployeeNumber'], axis = 1, inplace = True)
In [12]:
print(df_employee.shape)
df_employee.head()
(1470, 31)
Out[12]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EnvironmentSatisfaction Gender ... PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 2 Female ... 3 1 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 3 Male ... 4 4 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 4 Male ... 3 2 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 4 Female ... 3 3 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 Male ... 3 4 1 6 3 3 2 2 2 2

5 rows × 31 columns

After running this loop, the columns dropped where EmployeeCount, Over18 and StandardHours Also the EmployeeNumber is just a number increaseing so that column is dropped. The Next step would be to look at how the different variables are correlated.

In [13]:
f, ax = plt.subplots(figsize=(20, 20))
corr = df_employee.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=np.bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax, annot = True)
Out[13]:
<AxesSubplot:>

Conclutions from Data preparing:

  • Several numerical and categorical columns with information about employee's personal and employment details
  • There are no missing values in this dataset, hence HR do not tend to have huge amout of missing data
  • Dropped the not useful cloumns
  • Important correlations:
    • Age - TotalworkingYears
    • Age - JobLevel
    • JobLevel - MonthlyIncome
    • JobLevel - TotalWorkingYears
    • YearsAtCompany -YearsInCurrentRole
    • MonthlyIncome - TotalWorkingYears

3. EDA (Exploratory Data Analysis)

3.1 General feature statistics

Starting the EDA of with some histograms of the for numerical features.

In [14]:
df_hrs = df_employee.copy()
df_hr_cat_name = df_employee.copy()
df_Anumber = df_employee.copy()
In [15]:
df_employee.hist(figsize=(20,20))
plt.show()
  • What we can see from the histograms is that many of them is right skwed
  • Age distribution is more towards the younger genration between 25 and 45 years old
  • A lot of people are in the compnay less than 10 years

Attrtion Rate

In [16]:
plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
df_employee['Attrition'].value_counts().plot.pie(autopct='%1.1f%%')
plt.subplot(1,2,2)
sns.countplot(df_employee['Attrition'])
df_employee['Attrition'].value_counts().to_frame()
Out[16]:
Attrition
No 1233
Yes 237

From the attrition rate graphs we can see that the majority is still there. Important mention, piechart and bar chart interpert the colors differently.

Travelling

In [17]:
plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
df_employee['BusinessTravel'].value_counts().plot.pie(autopct='%1.1f%%')
plt.subplot(1,2,2)
sns.countplot(df_employee['BusinessTravel'])
#df_employee['BusinessTravel'].value_counts().to_frame()
print(df_employee.groupby('BusinessTravel')['Attrition'].value_counts())
#df_employee.groupby('BusinessTravel')['Attrition'].value_counts
#plt.subplot(1,3,3)
#df_employee.groupby('Attrition')['BusinessTravel'].value_counts(df_employee.Attrition.all()).plot.pie(autopct='%1.1f%%',figsize=(11,6))
#print(df_employee.groupby(['BusinessTravel','Gender'])['Attrition'].value_counts(100. * df_employee.Attrition.value_counts() / len(df_employee.Attrition)))
BusinessTravel     Attrition
Non-Travel         No           138
                   Yes           12
Travel_Frequently  No           208
                   Yes           69
Travel_Rarely      No           887
                   Yes          156
Name: Attrition, dtype: int64

The business travels have a clear amount that travel rarely, the person travel frequently or do not travel is small

OverTime

In [18]:
plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
df_employee['OverTime'].value_counts().plot.pie(autopct='%1.1f%%')
plt.subplot(1,2,2)
sns.countplot(df_employee['OverTime'])
#df_employee['OverTime'].value_counts().to_frame()
df_employee.groupby('OverTime')['Attrition'].value_counts()
Out[18]:
OverTime  Attrition
No        No           944
          Yes          110
Yes       No           289
          Yes          127
Name: Attrition, dtype: int64

1/3 of the people tend to have overtime

Department

In [19]:
plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
df_employee['Department'].value_counts().plot.pie(autopct='%1.1f%%')
plt.subplot(1,2,2)
sns.countplot(df_employee['Department'])
#print(df_employee.groupby('Department')['Attrition'].value_counts())
df_employee.groupby('Department')['Attrition'].value_counts()
#pd.pivot_table(df_employee, values = 'Department', index='Attrition').reset_index()
#sns.countplot(df_employee.groupby('Department',)['Attrition'])
#df_employee['Department'].value_counts().to_frame()
Out[19]:
Department              Attrition
Human Resources         No            51
                        Yes           12
Research & Development  No           828
                        Yes          133
Sales                   No           354
                        Yes           92
Name: Attrition, dtype: int64

The majority of the people are part of the research & Development department with 65%, also sales department are big with 30%

Education Level

In [20]:
#need to change the value to the column to get a better understanding of what the graph says 
df_employee.Education.replace({1: 'High School', 2:'College', 3:'Bachelor', 4:'Master', 5:'Doctorate'},inplace=True)
df_hrs.Education.replace({1: 'High School', 2:'College', 3:'Bachelor', 4:'Master', 5:'Doctorate'},inplace=True)
#df_employee.Education.replace({'High School':1, 'Undergrad':2,'Graduate':3, 'Post Graduate':4, 'Doctorate':5},inplace=True)
In [21]:
plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
df_employee['Education'].value_counts().plot.pie(autopct='%1.1f%%')
plt.subplot(1,2,2)
sns.countplot(df_employee['Education'])
plt.xticks(rotation=45)
#df_employee['Education'].value_counts().to_frame()
df_employee.groupby('Education')['Attrition'].value_counts()
Out[21]:
Education    Attrition
Bachelor     No           473
             Yes           99
College      No           238
             Yes           44
Doctorate    No            43
             Yes            5
High School  No           139
             Yes           31
Master       No           340
             Yes           58
Name: Attrition, dtype: int64

in the education level is a clear for persons having bacheolor andmaster degree

Education Field

In [22]:
plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
df_employee['EducationField'].value_counts().plot.pie(autopct='%1.1f%%')
plt.subplot(1,2,2)
sns.countplot(df_employee['EducationField'])
plt.xticks(rotation=45)
#df_employee['EducationField'].value_counts().to_frame()
print(df_employee.groupby('EducationField')['Attrition'].value_counts().to_frame())
                            Attrition
EducationField   Attrition           
Human Resources  No                20
                 Yes                7
Life Sciences    No               517
                 Yes               89
Marketing        No               124
                 Yes               35
Medical          No               401
                 Yes               63
Other            No                71
                 Yes               11
Technical Degree No               100
                 Yes               32

There are two field that are dominant both medical and life science is

JobrRole

In [23]:
plt.figure(figsize=(20,10))
sns.catplot(y='JobRole', kind='count', aspect=4, data=df_employee)
print(df_employee['JobRole'].value_counts())
print(df_employee.groupby('JobRole')['Attrition'].value_counts())
Sales Executive              326
Research Scientist           292
Laboratory Technician        259
Manufacturing Director       145
Healthcare Representative    131
Manager                      102
Sales Representative          83
Research Director             80
Human Resources               52
Name: JobRole, dtype: int64
JobRole                    Attrition
Healthcare Representative  No           122
                           Yes            9
Human Resources            No            40
                           Yes           12
Laboratory Technician      No           197
                           Yes           62
Manager                    No            97
                           Yes            5
Manufacturing Director     No           135
                           Yes           10
Research Director          No            78
                           Yes            2
Research Scientist         No           245
                           Yes           47
Sales Executive            No           269
                           Yes           57
Sales Representative       No            50
                           Yes           33
Name: Attrition, dtype: int64
<Figure size 1440x720 with 0 Axes>

We can see the divition of different job roles, there are most sales excutive and least working in human resources.

In [24]:
# Changing numeric values to corresponding categorical values
df_employee['EnvironmentSatisfaction'] = df_employee['EnvironmentSatisfaction'].map({1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'})
df_employee['JobInvolvement'] = df_employee['JobInvolvement'].map({1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'})
df_employee['JobSatisfaction'] = df_employee['JobSatisfaction'].map({1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'})
df_employee['RelationshipSatisfaction'] = df_employee['RelationshipSatisfaction'].map({1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'})
df_employee['PerformanceRating'] = df_employee['PerformanceRating'].map({1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding'})
df_employee['WorkLifeBalance'] = df_employee['WorkLifeBalance'].map({1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best'})
In [25]:
plt.figure(figsize=(18,8))
plt.subplot(2,3,1)
sns.countplot(df_employee['EnvironmentSatisfaction'])
plt.subplot(2,3,2)
sns.countplot(df_employee['JobInvolvement'])
plt.subplot(2,3,3)
sns.countplot(df_employee['JobSatisfaction'])
plt.subplot(2,3,4)
sns.countplot(df_employee['RelationshipSatisfaction'])
plt.subplot(2,3,5)
sns.countplot(df_employee['PerformanceRating'])
plt.subplot(2,3,6)
sns.countplot(df_employee['WorkLifeBalance'])
Out[25]:
<AxesSubplot:xlabel='WorkLifeBalance', ylabel='count'>
  • EnvironmentSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'.
  • RelationshipSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'.
  • JobInvolvement: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'.
  • JobSatisfaction: 1 'Low' 2 'Medium' 3 'High' 4 'Very High'.
  • PerformanceRating: 1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding'.
  • WorkLifeBalance: 1 'Bad' 2 'Good' 3 'Better' 4 'Best'.

Overall is the metrics, mostly from maybe surveys show that people are happy and scoring high on the metrics.

Wonder why the perfomance rating is only 3 and 4. More people are overall more satified with their situation.

We can see the divition of different job roles, there are most sales excutive and least working in human resources.

3.2 Attrition effected by Age and MaritalStatus?

1. Hypothesis: Single people tend to leave more often than married people?

2. Hypothesis: Male are more active leavers?

In [26]:
plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
sns.violinplot(x ="Gender", y ="Age", hue ="Attrition",
data = df_hrs, split = True)
plt.subplot(1,2,2)
sns.violinplot(x ="Gender", y ="Age", hue ="OverTime",
data = df_hrs, split = True)
plt.xticks(rotation=45)
#df_employee['OverTime'].value_counts()
Out[26]:
(array([0, 1]), [Text(0, 0, 'Female'), Text(1, 0, 'Male')])
In [27]:
plt.subplots(figsize=(20,5))
sns.histplot(data=df_employee, x="Age", hue="Attrition", multiple="stack",bins=5,kde=True)
#df_employee['Gender'].value_counts(normalize=True)
df_employee.groupby('Gender')['Attrition'].value_counts()
Out[27]:
Gender  Attrition
Female  No           501
        Yes           87
Male    No           732
        Yes          150
Name: Attrition, dtype: int64
In [28]:
plt.subplots(figsize=(20,5))
sns.histplot(data=df_employee, x="Age", hue="MaritalStatus", multiple="stack",bins=5)
print(df_employee['MaritalStatus'].value_counts(normalize=True))
print(df_employee.groupby(['MaritalStatus','Gender'])['Attrition'].value_counts())
Married     0.457823
Single      0.319728
Divorced    0.222449
Name: MaritalStatus, dtype: float64
MaritalStatus  Gender  Attrition
Divorced       Female  No           108
                       Yes            9
               Male    No           186
                       Yes           24
Married        Female  No           241
                       Yes           31
               Male    No           348
                       Yes           53
Single         Female  No           152
                       Yes           47
               Male    No           198
                       Yes           73
Name: Attrition, dtype: int64
In [29]:
sns.histplot(data=df_employee, x="MaritalStatus", hue="Gender", multiple="stack",bins=5)
Out[29]:
<AxesSubplot:xlabel='MaritalStatus', ylabel='Count'>

From the part 3.2 we can see:

  • that there are more men (150) leaving than women (87): Hypothesis 2 confirmed
  • 45% Married Workers, 22% Singles and 31% Divorced.
  • More single leave more Married: Hypothesis 1 confirmed
  • Divorced women are stayers only 9 people leaving.
In [30]:
df_Anumber['Attrition'] = df_Anumber['Attrition'].map({'Yes': 1, 'No': 0})
#df_hr_cat_name['Attrition'] = df_hr_cat_name['Attrition'].map({'Yes': 1, 'No': 0})

3.3 Education, EducationField, JobRole and JobLevel

In this Chapther we are taking a look at the different JobRoles and Joblevels and connect that to Education and Education Field. And see who are more likely to leave 3. Hypothesis: Workers in lower JobLevel are morelikely to leave

4. Hypothesis: Lower Education get more training

In [31]:
plt.subplots(figsize=(18,5))
sns.countplot(df_employee.JobRole, hue=df_employee.JobLevel)
plt.xticks(rotation=10)
Out[31]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),
 [Text(0, 0, 'Sales Executive'),
  Text(1, 0, 'Research Scientist'),
  Text(2, 0, 'Laboratory Technician'),
  Text(3, 0, 'Manufacturing Director'),
  Text(4, 0, 'Healthcare Representative'),
  Text(5, 0, 'Manager'),
  Text(6, 0, 'Sales Representative'),
  Text(7, 0, 'Research Director'),
  Text(8, 0, 'Human Resources')])
In [32]:
plt.subplots(figsize=(20,5))
sns.histplot(data=df_employee, x="JobLevel" ,hue="JobRole", multiple="stack")
print(df_employee.groupby('JobLevel',)['JobRole'].value_counts(normalize=True))
print(df_employee.groupby(['JobLevel', 'JobRole'])['Attrition'].value_counts().sort_index())
JobLevel  JobRole                  
1         Research Scientist           0.430939
          Laboratory Technician        0.368324
          Sales Representative         0.139963
          Human Resources              0.060773
2         Sales Executive              0.436330
          Manufacturing Director       0.168539
          Healthcare Representative    0.146067
          Research Scientist           0.106742
          Laboratory Technician        0.104869
          Human Resources              0.024345
          Sales Representative         0.013109
3         Sales Executive              0.362385
          Manufacturing Director       0.206422
          Healthcare Representative    0.201835
          Research Director            0.128440
          Manager                      0.055046
          Human Resources              0.027523
          Laboratory Technician        0.013761
          Research Scientist           0.004587
4         Manager                      0.443396
          Research Director            0.245283
          Sales Executive              0.132075
          Manufacturing Director       0.094340
          Healthcare Representative    0.084906
5         Manager                      0.623188
          Research Director            0.376812
Name: JobRole, dtype: float64
JobLevel  JobRole                    Attrition
1         Human Resources            No            23
                                     Yes           10
          Laboratory Technician      No           144
                                     Yes           56
          Research Scientist         No           189
                                     Yes           45
          Sales Representative       No            44
                                     Yes           32
2         Healthcare Representative  No            75
                                     Yes            3
          Human Resources            No            13
          Laboratory Technician      No            51
                                     Yes            5
          Manufacturing Director     No            85
                                     Yes            5
          Research Scientist         No            55
                                     Yes            2
          Sales Executive            No           197
                                     Yes           36
          Sales Representative       No             6
                                     Yes            1
3         Healthcare Representative  No            39
                                     Yes            5
          Human Resources            No             4
                                     Yes            2
          Laboratory Technician      No             2
                                     Yes            1
          Manager                    No            10
                                     Yes            2
          Manufacturing Director     No            40
                                     Yes            5
          Research Director          No            28
          Research Scientist         No             1
          Sales Executive            No            62
                                     Yes           17
4         Healthcare Representative  No             8
                                     Yes            1
          Manager                    No            47
          Manufacturing Director     No            10
          Research Director          No            26
          Sales Executive            No            10
                                     Yes            4
5         Manager                    No            40
                                     Yes            3
          Research Director          No            24
                                     Yes            2
Name: Attrition, dtype: int64

From JobLevel and JobRole we can see that:

  1. There are only Managers and Research Directors at level 5
  2. The majority of people in Level 1 are only are low level workers
  3. Only 2 roles Total Nr: 69, Roles: 62% Manager, 38% Resea
In [33]:
plt.subplots(figsize=(20,5))
sns.countplot(df_employee.JobRole, hue=df_employee.EducationField)
plt.xticks(rotation=10)
Out[33]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),
 [Text(0, 0, 'Sales Executive'),
  Text(1, 0, 'Research Scientist'),
  Text(2, 0, 'Laboratory Technician'),
  Text(3, 0, 'Manufacturing Director'),
  Text(4, 0, 'Healthcare Representative'),
  Text(5, 0, 'Manager'),
  Text(6, 0, 'Sales Representative'),
  Text(7, 0, 'Research Director'),
  Text(8, 0, 'Human Resources')])

The dominant Education field are Life science and medical. for sales executives marking are important

In [34]:
plt.subplots(figsize=(20,5))
sns.countplot(df_employee.JobRole, hue=df_employee.Education)
plt.xticks(rotation=10)
Out[34]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),
 [Text(0, 0, 'Sales Executive'),
  Text(1, 0, 'Research Scientist'),
  Text(2, 0, 'Laboratory Technician'),
  Text(3, 0, 'Manufacturing Director'),
  Text(4, 0, 'Healthcare Representative'),
  Text(5, 0, 'Manager'),
  Text(6, 0, 'Sales Representative'),
  Text(7, 0, 'Research Director'),
  Text(8, 0, 'Human Resources')])

more people have bachelors ad masters degree

In [35]:
plt.subplots(figsize=(20,5))
sns.countplot(df_employee.EducationField, hue=df_employee.Education)
plt.xticks(rotation=10)
Out[35]:
(array([0, 1, 2, 3, 4, 5]),
 [Text(0, 0, 'Life Sciences'),
  Text(1, 0, 'Other'),
  Text(2, 0, 'Medical'),
  Text(3, 0, 'Marketing'),
  Text(4, 0, 'Technical Degree'),
  Text(5, 0, 'Human Resources')])

Moving over to Traing for the employees

In [36]:
print(df_employee['TrainingTimesLastYear'].value_counts(['JobRole']))
print(df_employee.groupby(['TrainingTimesLastYear'])['Attrition'].value_counts(normalize=True).sort_index())
print(df_employee['TrainingTimesLastYear'].value_counts(['Attrition']))
plt.subplots(figsize=(20,5))
sns.countplot(df_employee.JobRole, hue=df_employee.TrainingTimesLastYear)
#print(df_employee['TrainingTimesLastYear'].value_counts(['JobRole']))
2    0.372109
3    0.334014
4    0.083673
5    0.080952
1    0.048299
6    0.044218
0    0.036735
Name: TrainingTimesLastYear, dtype: float64
TrainingTimesLastYear  Attrition
0                      No           0.722222
                       Yes          0.277778
1                      No           0.873239
                       Yes          0.126761
2                      No           0.820841
                       Yes          0.179159
3                      No           0.859470
                       Yes          0.140530
4                      No           0.788618
                       Yes          0.211382
5                      No           0.882353
                       Yes          0.117647
6                      No           0.907692
                       Yes          0.092308
Name: Attrition, dtype: float64
2    0.372109
3    0.334014
4    0.083673
5    0.080952
1    0.048299
6    0.044218
0    0.036735
Name: TrainingTimesLastYear, dtype: float64
Out[36]:
<AxesSubplot:xlabel='JobRole', ylabel='count'>
In [37]:
plt.subplots(figsize=(20,5))
sns.countplot(df_employee.JobLevel, hue=df_employee.TrainingTimesLastYear)
Out[37]:
<AxesSubplot:xlabel='JobLevel', ylabel='count'>
In [38]:
#df_employee.groupby(["JobRole"]).count().sort_values(["TrainingTimesLastYear"]==0)
#df1 = df_employee.melt(var_name='JobRole', value_name='TrainingTimesLastYear')
#df1
#df1 = df.melt(var_name='columns', value_name='index')
#df.apply(lambda x: x.value_counts())
print(df_employee['TrainingTimesLastYear'].value_counts()[0])
#df2
print(df_employee.JobRole[df_employee.TrainingTimesLastYear == 0].count())
#print(df_employee.groupby('JobLevel')['TrainingTimesLastYear'].value_counts()[0])
54
54

Training the employees have avarage on 2,3 times per year, indepentent on the JobLevel and JobRole

In [39]:
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: lightgreen' if v else '' for v in is_max]

#df.style.apply(highlight_max)

def highlight_min(s):
    is_min = s == s.min()
    return ['background-color: lightblue' if v else '' for v in is_min]

#df.style.apply(highlight_max)
In [40]:
TrainAtWork = df_hrs.groupby(['JobRole','Attrition'],as_index=False)[['JobInvolvement','JobSatisfaction','TrainingTimesLastYear','YearsInCurrentRole']].mean().sort_values(by=['TrainingTimesLastYear','JobSatisfaction'])
TrainAtWork.style.apply(highlight_max).apply(highlight_min)
Out[40]:
JobRole Attrition JobInvolvement JobSatisfaction TrainingTimesLastYear YearsInCurrentRole
11 Research Director Yes 3.000000 2.500000 1.000000 15.000000
3 Human Resources Yes 2.500000 2.166667 2.083333 2.000000
7 Manager Yes 2.200000 2.400000 2.200000 7.800000
1 Healthcare Representative Yes 2.666667 2.777778 2.222222 5.000000
9 Manufacturing Director Yes 2.600000 2.600000 2.600000 3.500000
15 Sales Executive Yes 2.526316 2.526316 2.649123 4.192982
13 Research Scientist Yes 2.510638 2.425532 2.659574 2.191489
5 Laboratory Technician Yes 2.532258 2.435484 2.661290 2.129032
12 Research Scientist No 2.853061 2.840816 2.665306 3.481633
2 Human Resources No 2.775000 2.675000 2.700000 3.475000
8 Manufacturing Director No 2.688889 2.688889 2.755556 5.081481
0 Healthcare Representative No 2.737705 2.786885 2.786885 4.852459
10 Research Director No 2.769231 2.705128 2.820513 6.064103
6 Manager No 2.804124 2.721649 2.845361 6.381443
14 Sales Executive No 2.754647 2.802974 2.869888 4.996283
17 Sales Representative Yes 2.454545 2.484848 2.939394 1.242424
4 Laboratory Technician No 2.746193 2.771574 3.040609 3.538071
16 Sales Representative No 2.780000 2.900000 3.060000 2.520000
In [41]:
pd.pivot_table(TrainAtWork, values = 'TrainingTimesLastYear', index='Attrition', columns = 'JobRole').reset_index()
Out[41]:
JobRole Attrition Healthcare Representative Human Resources Laboratory Technician Manager Manufacturing Director Research Director Research Scientist Sales Executive Sales Representative
0 No 2.786885 2.700000 3.040609 2.845361 2.755556 2.820513 2.665306 2.869888 3.060000
1 Yes 2.222222 2.083333 2.661290 2.200000 2.600000 1.000000 2.659574 2.649123 2.939394
In [42]:
plt.subplots(figsize=(20,5))
sns.histplot(data=df_employee, x="JobRole", hue="Attrition", multiple="stack")
Out[42]:
<AxesSubplot:xlabel='JobRole', ylabel='Count'>
In [43]:
print(df_employee.groupby('Attrition')['JobRole'].value_counts(normalize=True).sort_index())
print(df_employee.groupby('Attrition')['JobLevel'].value_counts(normalize=True).sort_index())
print(df_employee.groupby(['Attrition','JobLevel'])['Education'].value_counts().sort_index().sort_values())
print(df_employee.groupby('Attrition')['EducationField'].value_counts(normalize=True).sort_index())
Attrition  JobRole                  
No         Healthcare Representative    0.098946
           Human Resources              0.032441
           Laboratory Technician        0.159773
           Manager                      0.078670
           Manufacturing Director       0.109489
           Research Director            0.063260
           Research Scientist           0.198702
           Sales Executive              0.218167
           Sales Representative         0.040552
Yes        Healthcare Representative    0.037975
           Human Resources              0.050633
           Laboratory Technician        0.261603
           Manager                      0.021097
           Manufacturing Director       0.042194
           Research Director            0.008439
           Research Scientist           0.198312
           Sales Executive              0.240506
           Sales Representative         0.139241
Name: JobRole, dtype: float64
Attrition  JobLevel
No         1           0.324412
           2           0.390916
           3           0.150852
           4           0.081914
           5           0.051906
Yes        1           0.603376
           2           0.219409
           3           0.135021
           4           0.021097
           5           0.021097
Name: JobLevel, dtype: float64
Attrition  JobLevel  Education  
Yes        5         Master           1
           4         Master           1
           2         Doctorate        1
           5         High School      1
           3         Doctorate        2
           1         Doctorate        2
No         5         Doctorate        2
Yes        2         High School      3
           5         Bachelor         3
           4         Bachelor         4
           3         College          4
                     High School      4
No         5         High School      5
           1         Doctorate        6
           3         Doctorate        7
           4         High School      8
Yes        3         Master           9
No         4         Doctorate        9
           5         College         13
Yes        3         Bachelor        13
           2         Bachelor        14
                     College         15
No         3         High School     16
           4         College         17
Yes        2         Master          19
No         5         Master          19
           2         Doctorate       19
Yes        1         High School     23
No         5         Bachelor        25
Yes        1         College         25
No         4         Master          27
Yes        1         Master          28
No         3         College         29
           4         Bachelor        40
           2         High School     44
           3         Master          49
Yes        1         Bachelor        65
No         1         High School     66
                     College         69
           3         Bachelor        85
           1         Master          93
           2         College        110
                     Master         152
                     Bachelor       157
           1         Bachelor       166
Name: Education, dtype: int64
Attrition  EducationField  
No         Human Resources     0.016221
           Life Sciences       0.419303
           Marketing           0.100568
           Medical             0.325223
           Other               0.057583
           Technical Degree    0.081103
Yes        Human Resources     0.029536
           Life Sciences       0.375527
           Marketing           0.147679
           Medical             0.265823
           Other               0.046414
           Technical Degree    0.135021
Name: EducationField, dtype: float64
In [44]:
plt.subplots(figsize=(20,5))
sns.countplot(df_employee.Attrition, hue=df_employee.JobRole)
df_employee.groupby('Attrition')['JobRole'].value_counts(normalize=True).sort_index()
Out[44]:
Attrition  JobRole                  
No         Healthcare Representative    0.098946
           Human Resources              0.032441
           Laboratory Technician        0.159773
           Manager                      0.078670
           Manufacturing Director       0.109489
           Research Director            0.063260
           Research Scientist           0.198702
           Sales Executive              0.218167
           Sales Representative         0.040552
Yes        Healthcare Representative    0.037975
           Human Resources              0.050633
           Laboratory Technician        0.261603
           Manager                      0.021097
           Manufacturing Director       0.042194
           Research Director            0.008439
           Research Scientist           0.198312
           Sales Executive              0.240506
           Sales Representative         0.139241
Name: JobRole, dtype: float64
In [45]:
plt.subplots(figsize=(20,5))
sns.countplot(df_employee.Attrition, hue=df_employee.Education)
Out[45]:
<AxesSubplot:xlabel='Attrition', ylabel='count'>
In [46]:
#plt.subplots(figsize=(20,5))
#sns.countplot(df_employee.Attrition, hue=df_employee.EducationField)
In [47]:
#plt.subplots(figsize=(20,5))
#sns.countplot(df_employee.MaritalStatus, hue=df_employee.JobRole)

3.4 JobSatisfaction and Commitment

In [48]:
plt.subplots(figsize=(20,5))
sns.countplot(df_employee.Attrition, hue=df_employee.JobInvolvement)
Out[48]:
<AxesSubplot:xlabel='Attrition', ylabel='count'>
In [49]:
plt.subplots(figsize=(20,5))
sns.histplot(data=df_employee, x="JobRole", hue="JobInvolvement", multiple="stack",bins=5, kde=True)
Out[49]:
<AxesSubplot:xlabel='JobRole', ylabel='Count'>
In [50]:
df_employee.groupby('JobLevel')['Age'].value_counts(bins=6).sort_index()
Out[50]:
JobLevel                            
1         (17.958, 24.833]               87
          (24.833, 31.667]              192
          (31.667, 38.5]                157
          (38.5, 45.333]                 68
          (45.333, 52.167]               24
          (52.167, 59.0]                 15
2         (21.961, 28.333]               65
          (28.333, 34.667]              172
          (34.667, 41.0]                174
          (41.0, 47.333]                 75
          (47.333, 53.667]               28
          (53.667, 60.0]                 20
3         (26.965999999999998, 32.5]     40
          (32.5, 38.0]                   72
          (38.0, 43.5]                   34
          (43.5, 49.0]                   38
          (49.0, 54.5]                   22
          (54.5, 60.0]                   12
4         (28.968999999999998, 34.0]      5
          (34.0, 39.0]                    6
          (39.0, 44.0]                   23
          (44.0, 49.0]                   25
          (49.0, 54.0]                   31
          (54.0, 59.0]                   16
5         (38.978, 42.5]                 18
          (42.5, 46.0]                   14
          (46.0, 49.5]                    8
          (49.5, 53.0]                   15
          (53.0, 56.5]                   10
          (56.5, 60.0]                    4
Name: Age, dtype: int64
In [51]:
workForce =df_hrs.groupby(['JobRole','Attrition'], as_index=False)[['PerformanceRating','JobSatisfaction','EnvironmentSatisfaction','WorkLifeBalance']].mean().sort_values(by=['JobRole'])
workForce.style.apply(highlight_max).apply(highlight_min)
Out[51]:
JobRole Attrition PerformanceRating JobSatisfaction EnvironmentSatisfaction WorkLifeBalance
0 Healthcare Representative No 3.155738 2.786885 2.819672 2.704918
1 Healthcare Representative Yes 3.111111 2.777778 2.111111 2.666667
2 Human Resources No 3.150000 2.675000 2.675000 2.925000
3 Human Resources Yes 3.083333 2.166667 2.333333 2.916667
4 Laboratory Technician No 3.147208 2.771574 2.822335 2.817259
5 Laboratory Technician Yes 3.209677 2.435484 2.387097 2.403226
6 Manager No 3.206186 2.721649 2.814433 2.762887
7 Manager Yes 3.000000 2.400000 1.800000 3.000000
9 Manufacturing Director Yes 3.000000 2.600000 2.600000 2.700000
8 Manufacturing Director No 3.200000 2.688889 2.940741 2.770370
10 Research Director No 3.102564 2.705128 2.487179 2.858974
11 Research Director Yes 3.000000 2.500000 3.000000 3.000000
12 Research Scientist No 3.151020 2.840816 2.746939 2.669388
13 Research Scientist Yes 3.255319 2.425532 2.617021 2.723404
14 Sales Executive No 3.130112 2.802974 2.732342 2.858736
15 Sales Executive Yes 3.105263 2.526316 2.385965 2.543860
16 Sales Representative No 3.160000 2.900000 2.760000 2.780000
17 Sales Representative Yes 3.121212 2.484848 2.696970 3.060606
In [52]:
workForceD =df_hrs.groupby(['Department','Attrition'], as_index=False)[['PerformanceRating','JobSatisfaction','EnvironmentSatisfaction','WorkLifeBalance']].mean().sort_values(by=['Department'])
workForceD.style.apply(highlight_max).apply(highlight_min)
Out[52]:
Department Attrition PerformanceRating JobSatisfaction EnvironmentSatisfaction WorkLifeBalance
0 Human Resources No 3.156863 2.705882 2.764706 2.921569
1 Human Resources Yes 3.083333 2.166667 2.333333 2.916667
2 Research & Development No 3.157005 2.769324 2.787440 2.748792
3 Research & Development Yes 3.195489 2.458647 2.473684 2.578947
4 Sales No 3.144068 2.810734 2.734463 2.836158
5 Sales Yes 3.108696 2.521739 2.467391 2.739130
In [53]:
workForceG =df_hrs.groupby(['Gender','Attrition'], as_index=False)[['PerformanceRating','JobSatisfaction','EnvironmentSatisfaction','WorkLifeBalance']].mean().sort_values(by=['Gender'])
workForceG.style.apply(highlight_max).apply(highlight_min)
Out[53]:
Gender Attrition PerformanceRating JobSatisfaction EnvironmentSatisfaction WorkLifeBalance
0 Female No 3.157685 2.728543 2.782435 2.760479
1 Female Yes 3.172414 2.425287 2.367816 2.781609
2 Male No 3.150273 2.812842 2.763661 2.795082
3 Male Yes 3.146667 2.493333 2.520000 2.586667
In [54]:
workForceE =df_hrs.groupby(['Education','Attrition'], as_index=False)[['PerformanceRating','JobSatisfaction','EnvironmentSatisfaction','WorkLifeBalance']].mean().sort_values(by=['Education'])
workForceE.style.apply(highlight_max).apply(highlight_min)
Out[54]:
Education Attrition PerformanceRating JobSatisfaction EnvironmentSatisfaction WorkLifeBalance
0 Bachelor No 3.135307 2.701903 2.864693 2.737844
1 Bachelor Yes 3.181818 2.414141 2.353535 2.686869
2 College No 3.172269 2.823529 2.785714 2.785714
3 College Yes 3.159091 2.477273 2.340909 2.659091
4 Doctorate No 3.209302 2.744186 2.604651 2.767442
5 Doctorate Yes 3.000000 2.000000 3.000000 3.200000
6 High School No 3.187050 2.820144 2.726619 2.755396
7 High School Yes 3.129032 2.709677 2.838710 2.870968
8 Master No 3.144118 2.841176 2.670588 2.850000
9 Master Yes 3.137931 2.465517 2.500000 2.448276
In [55]:
#workForcelevel =df_hrs.groupby(['JobLevel','Education','Attrition'], as_index=False)[['PerformanceRating','JobSatisfaction','EnvironmentSatisfaction','WorkLifeBalance']].mean().sort_values(by=['JobLevel'])
#workForcelevel

3.5 Enpowerment at Work

In [56]:
enpowerments =df_hrs.groupby(['JobRole','Attrition'], as_index=False)[['PerformanceRating','StockOptionLevel','JobInvolvement','MonthlyIncome','YearsSinceLastPromotion','YearsAtCompany']].mean().sort_values(by=['JobRole'])
enpowerments.style.apply(highlight_max).apply(highlight_min)
Out[56]:
JobRole Attrition PerformanceRating StockOptionLevel JobInvolvement MonthlyIncome YearsSinceLastPromotion YearsAtCompany
0 Healthcare Representative No 3.155738 0.844262 2.737705 7453.557377 2.885246 8.188525
1 Healthcare Representative Yes 3.111111 0.666667 2.666667 8548.222222 4.111111 10.888889
2 Human Resources No 3.150000 0.725000 2.775000 4391.750000 1.400000 5.675000
3 Human Resources Yes 3.083333 0.833333 2.500000 3715.750000 0.833333 4.166667
4 Laboratory Technician No 3.147208 0.913706 2.746193 3337.223350 1.548223 5.604061
5 Laboratory Technician Yes 3.209677 0.516129 2.532258 2919.258065 1.016129 3.161290
6 Manager No 3.206186 0.752577 2.804124 17201.484536 4.835052 14.371134
7 Manager Yes 3.000000 0.600000 2.200000 16797.400000 4.800000 15.600000
9 Manufacturing Director Yes 3.000000 0.800000 2.600000 7365.500000 1.700000 8.700000
8 Manufacturing Director No 3.200000 0.814815 2.688889 7289.925926 2.148148 7.518519
10 Research Director No 3.102564 0.871795 2.769231 15947.346154 2.910256 10.538462
11 Research Director Yes 3.000000 0.000000 3.000000 19395.500000 14.000000 26.500000
12 Research Scientist No 3.151020 0.836735 2.853061 3328.122449 1.440816 5.265306
13 Research Scientist Yes 3.255319 0.446809 2.510638 2780.468085 1.851064 4.319149
14 Sales Executive No 3.130112 0.881041 2.754647 6804.617100 2.360595 7.665428
15 Sales Executive Yes 3.105263 0.526316 2.526316 7489.000000 3.070175 6.701754
16 Sales Representative No 3.160000 0.740000 2.780000 2798.440000 1.360000 3.460000
17 Sales Representative Yes 3.121212 0.454545 2.454545 2364.727273 0.606061 2.090909
In [57]:
enpowermentsD =df_hrs.groupby(['Department','Attrition'], as_index=False)[['PerformanceRating','JobInvolvement','MonthlyIncome','YearsSinceLastPromotion','YearsAtCompany']].mean().sort_values(by=['Department'])
enpowermentsD.style.apply(highlight_max).apply(highlight_min)
Out[57]:
Department Attrition PerformanceRating JobInvolvement MonthlyIncome YearsSinceLastPromotion YearsAtCompany
0 Human Resources No 3.156863 2.803922 7345.980392 2.000000 7.960784
1 Human Resources Yes 3.083333 2.500000 3715.750000 0.833333 4.166667
2 Research & Development No 3.157005 2.771739 6630.326087 2.179952 7.171498
3 Research & Development Yes 3.195489 2.556391 4108.075188 1.872180 4.954887
4 Sales No 3.144068 2.762712 7232.240113 2.395480 7.745763
5 Sales Yes 3.108696 2.467391 5908.456522 2.195652 5.510870
In [58]:
enpowermentsG =df_hrs.groupby(['Gender','Attrition'], as_index=False)[['PerformanceRating','JobInvolvement','MonthlyIncome','YearsSinceLastPromotion','YearsAtCompany']].mean().sort_values(by=['Gender'])
enpowermentsG.style.apply(highlight_max).apply(highlight_min)
Out[58]:
Gender Attrition PerformanceRating JobInvolvement MonthlyIncome YearsSinceLastPromotion YearsAtCompany
0 Female No 3.157685 2.746507 7019.429142 2.339321 7.459082
1 Female Yes 3.172414 2.528736 4769.735632 2.034483 5.919540
2 Male No 3.150273 2.786885 6704.964481 2.162568 7.307377
3 Male Yes 3.146667 2.513333 4797.160000 1.893333 4.673333
In [59]:
plt.subplots(figsize=(20,5))
sns.countplot(df_employee.JobRole, hue=df_employee.StockOptionLevel)
print(df_employee['JobRole'].value_counts(['StockOptionLevel']))
print(df_employee.groupby('StockOptionLevel')['JobRole'].value_counts().sort_index(ascending=True))
Sales Executive              0.221769
Research Scientist           0.198639
Laboratory Technician        0.176190
Manufacturing Director       0.098639
Healthcare Representative    0.089116
Manager                      0.069388
Sales Representative         0.056463
Research Director            0.054422
Human Resources              0.035374
Name: JobRole, dtype: float64
StockOptionLevel  JobRole                  
0                 Healthcare Representative     51
                  Human Resources               24
                  Laboratory Technician        108
                  Manager                       37
                  Manufacturing Director        59
                  Research Director             33
                  Research Scientist           136
                  Sales Executive              139
                  Sales Representative          44
1                 Healthcare Representative     59
                  Human Resources               21
                  Laboratory Technician        110
                  Manager                       57
                  Manufacturing Director        60
                  Research Director             31
                  Research Scientist           104
                  Sales Executive              125
                  Sales Representative          29
2                 Healthcare Representative     13
                  Human Resources                3
                  Laboratory Technician         21
                  Manager                        5
                  Manufacturing Director        20
                  Research Director             11
                  Research Scientist            34
                  Sales Executive               44
                  Sales Representative           7
3                 Healthcare Representative      8
                  Human Resources                4
                  Laboratory Technician         20
                  Manager                        3
                  Manufacturing Director         6
                  Research Director              5
                  Research Scientist            18
                  Sales Executive               18
                  Sales Representative           3
Name: JobRole, dtype: int64
In [60]:
pd.pivot_table(df_employee, values = 'StockOptionLevel', index='Attrition', columns = 'JobRole').reset_index()
Out[60]:
JobRole Attrition Healthcare Representative Human Resources Laboratory Technician Manager Manufacturing Director Research Director Research Scientist Sales Executive Sales Representative
0 No 0.844262 0.725000 0.913706 0.752577 0.814815 0.871795 0.836735 0.881041 0.740000
1 Yes 0.666667 0.833333 0.516129 0.600000 0.800000 0.000000 0.446809 0.526316 0.454545
In [61]:
plt.subplots(figsize=(20,5))
sns.countplot(df_employee.Attrition, hue=df_employee.StockOptionLevel)
df_employee.groupby('Attrition',)['StockOptionLevel'].value_counts(normalize=True).sort_index()
Out[61]:
Attrition  StockOptionLevel
No         0                   0.386861
           1                   0.437956
           2                   0.118410
           3                   0.056772
Yes        0                   0.649789
           1                   0.236287
           2                   0.050633
           3                   0.063291
Name: StockOptionLevel, dtype: float64

Have you reached Stock OptionLevel 2 then the chance for attirion is much lower

In [62]:
df_hrs[df_hrs.JobLevel ==3].groupby('Education', as_index=False)[['JobSatisfaction']].mean().sort_values(by=['JobSatisfaction'])
Out[62]:
Education JobSatisfaction
2 Doctorate 2.333333
4 Master 2.655172
0 Bachelor 2.663265
1 College 2.787879
3 High School 2.800000
In [63]:
#need to change the value to the column to get a better understanding of what the graph says 
#df_employee.Education.replace({1: 'High School', 2:'Undergrad', 3:'Graduate', 4:'Post Graduate', 5:'Doctorate'},inplace=True)
sns.lineplot(x = 'JobLevel', y = 'Attrition', data=df_employee, hue='Education')
Out[63]:
<AxesSubplot:xlabel='JobLevel', ylabel='Attrition'>
In [64]:
df_employee.groupby('StockOptionLevel')['Age'].value_counts(bins=6).sort_index()
Out[64]:
StockOptionLevel                            
0                 (17.956999999999997, 25.0]     74
                  (25.0, 32.0]                  169
                  (32.0, 39.0]                  187
                  (39.0, 46.0]                  102
                  (46.0, 53.0]                   66
                  (53.0, 60.0]                   33
1                 (21.961, 28.333]               84
                  (28.333, 34.667]              145
                  (34.667, 41.0]                169
                  (41.0, 47.333]                101
                  (47.333, 53.667]               53
                  (53.667, 60.0]                 44
2                 (21.962999999999997, 28.0]     25
                  (28.0, 34.0]                   47
                  (34.0, 40.0]                   44
                  (40.0, 46.0]                   21
                  (46.0, 52.0]                   15
                  (52.0, 58.0]                    6
3                 (23.964, 29.833]               20
                  (29.833, 35.667]               23
                  (35.667, 41.5]                 20
                  (41.5, 47.333]                 11
                  (47.333, 53.167]                6
                  (53.167, 59.0]                  5
Name: Age, dtype: int64

Looking at which age tend to have different joblevels is really clear that young people start in level 1 most, of them is in this level for 10 years. If you show some encouagement you will get to an other level faster. most of the people for level 3 are in their late 30s. from 40 years old, both level 4 and 5 are more dominant.

From the two graph above we can see that the youngest people working are single and are morlelylike to leave the company. there are also more people around 30 years old leave but the their relation ship tend to be married, but people then are seeking a new direction or a new job after working a few years at one place.

In [65]:
# Need to Have Numeric Values for Attrition
In [66]:
happy_job = df_Anumber.groupby('JobRole', as_index=False)[['JobSatisfaction','EnvironmentSatisfaction','JobInvolvement']].mean().sort_values(by=['JobInvolvement'])
happy_job.style.apply(highlight_max).apply(highlight_min)
Out[66]:
JobRole JobSatisfaction EnvironmentSatisfaction JobInvolvement
8 Sales Representative 2.734940 2.734940 2.650602
4 Manufacturing Director 2.682759 2.917241 2.682759
2 Laboratory Technician 2.691120 2.718147 2.694981
1 Human Resources 2.557692 2.596154 2.711538
7 Sales Executive 2.754601 2.671779 2.714724
0 Healthcare Representative 2.786260 2.770992 2.732824
3 Manager 2.705882 2.764706 2.774510
5 Research Director 2.700000 2.500000 2.775000
6 Research Scientist 2.773973 2.726027 2.797945
In [67]:
df_Anumber['OverTime'] = df_Anumber['OverTime'].map({'Yes': 1, 'No': 0})
In [68]:
df_employee.groupby('JobRole', as_index=False)[['Age']].mean().sort_values(by=['Age'])
Out[68]:
JobRole Age
8 Sales Representative 30.361446
2 Laboratory Technician 34.096525
6 Research Scientist 34.236301
1 Human Resources 35.500000
7 Sales Executive 36.889571
4 Manufacturing Director 38.296552
0 Healthcare Representative 39.809160
5 Research Director 44.000000
3 Manager 46.764706
In [69]:
df_Anumber[df_Anumber.JobLevel ==3].groupby('Education', as_index=False)[['JobSatisfaction']].mean().sort_values(by=['JobSatisfaction'])
Out[69]:
Education JobSatisfaction
4 5 2.333333
3 4 2.655172
2 3 2.663265
1 2 2.787879
0 1 2.800000
In [70]:
df_employee['Attrition'] = df_employee['Attrition'].map({'Yes': 1, 'No': 0})
df_Anumber.Education.replace({'High School':1, 'Collage':2,'Bachelor':3, 'Master':4, 'Doctorate':5},inplace=True)
In [71]:
role_income = df_employee.groupby('JobRole', as_index=False)[['MonthlyIncome', 'Attrition']].mean().sort_values(
    by=['MonthlyIncome'])
role_income.style.apply(highlight_max).apply(highlight_min)
Out[71]:
JobRole MonthlyIncome Attrition
8 Sales Representative 2626.000000 0.397590
2 Laboratory Technician 3237.169884 0.239382
6 Research Scientist 3239.972603 0.160959
1 Human Resources 4235.750000 0.230769
7 Sales Executive 6924.279141 0.174847
4 Manufacturing Director 7295.137931 0.068966
0 Healthcare Representative 7528.763359 0.068702
5 Research Director 16033.550000 0.025000
3 Manager 17181.676471 0.049020

People eith doctor degree on level 3 have lower overall job satifacation than other education levels

In [72]:
df_Anumber[df_Anumber.JobLevel ==4].groupby('Education', as_index=False)[['YearsSinceLastPromotion','TrainingTimesLastYear','JobSatisfaction']].mean().sort_values(by=['YearsSinceLastPromotion'])
Out[72]:
Education YearsSinceLastPromotion TrainingTimesLastYear JobSatisfaction
3 4 3.821429 2.714286 2.571429
1 2 4.823529 2.352941 3.176471
2 3 4.863636 2.431818 2.636364
4 5 5.555556 2.777778 2.888889
0 1 7.500000 2.875000 2.625000
In [73]:
plt.figure(figsize=(20,10))
sns.boxplot(data=df_employee, x='JobRole', y='YearsAtCompany',hue='Attrition')
Out[73]:
<AxesSubplot:xlabel='JobRole', ylabel='YearsAtCompany'>
In [74]:
plt.figure(figsize=(20,10))
sns.boxplot(data=df_employee, x='JobRole', y='MonthlyIncome',hue='Attrition')
Out[74]:
<AxesSubplot:xlabel='JobRole', ylabel='MonthlyIncome'>

It takes time for people with doctor degree to reach level 4 among people with higher education

In [75]:
plt.figure(figsize=(20,10))
sns.boxplot(data=df_employee, x='JobLevel', y='MonthlyIncome',hue='Attrition')
Out[75]:
<AxesSubplot:xlabel='JobLevel', ylabel='MonthlyIncome'>

People in level 4 that leave has much less income than those who stay

3.7 Surprisng/Unique Findings

These plots need Attrition as a numeric feature.

In [76]:
sns.catplot(x = 'NumCompaniesWorked', y = 'Attrition', data=df_employee, aspect= 3, kind = 'bar')
Out[76]:
<seaborn.axisgrid.FacetGrid at 0x7f1918b51310>

Have you worked for 2-4 companiesyou are less likely to leave.

In [77]:
sns.factorplot(x = 'NumCompaniesWorked', y = 'Attrition', hue = 'Gender', data=df_employee, aspect= 3, kind = 'bar')
Out[77]:
<seaborn.axisgrid.FacetGrid at 0x7f1918bfa610>

Splitting on gender, we can clearly see that the attrtion rate stays up for male working for many companies, but woman are lower

In [78]:
sns.factorplot(x = 'JobLevel', y = 'Attrition', hue = 'Education', data=df_employee, aspect= 4, ci=None)
Out[78]:
<seaborn.axisgrid.FacetGrid at 0x7f1914b60550>

A high education at job level 3 is increasing the attrition rate.

In [79]:
fig, ax1 = plt.subplots(figsize=(20,6))

sns.lineplot(data = role_income, x='JobRole',y='Attrition', sort = False, ax=ax1)
ax2 = ax1.twinx()

sns.barplot(data = role_income, x='JobRole', y='MonthlyIncome', alpha=0.7, ax=ax2)
Out[79]:
<AxesSubplot:xlabel='JobRole', ylabel='MonthlyIncome'>

Sales have hghest attrition and lower income, attrition increase for HR, due to higher income and good jobsatifaction, least attrition for those with most income

From the EDA we obtained some interesting findings.

From the EDA we can some interesting things

  1. SAles representatives tend to be promoted pretty fast, but most of them have low income and are more likely to quite due to to their relative high job satisfation. and worklife balance, but their education siuation says they are undergradeuated so that why they leave.
  2. Many doctors are left on the level 3, for them to reach 4 it take a really long time compared to other people with higher education
  3. have you worked for some companies, you are less likely to leave

END Part 1


In [80]:
sns.factorplot(x = 'Education', y = 'YearsSinceLastPromotion', hue = 'Attrition', data=df_employee, aspect= 4, ci=None)
Out[80]:
<seaborn.axisgrid.FacetGrid at 0x7f19148f8d90>
In [81]:
sns.factorplot(x = 'JobRole', y = 'YearsSinceLastPromotion', hue = 'Department', data=df_employee, aspect= 4, ci=None)
Out[81]:
<seaborn.axisgrid.FacetGrid at 0x7f1914982c10>
In [82]:
sns.factorplot(x = 'JobRole', y = 'TrainingTimesLastYear', hue = 'Attrition', data=df_employee, aspect= 4, ci=None)
Out[82]:
<seaborn.axisgrid.FacetGrid at 0x7f191480c350>
In [83]:
sns.factorplot(x = 'JobRole', y = 'YearsSinceLastPromotion', hue = 'Attrition', data=df_employee, aspect= 4, ci=None)
Out[83]:
<seaborn.axisgrid.FacetGrid at 0x7f1914756f50>
In [83]:

In [84]:
sns.factorplot(x = 'YearsSinceLastPromotion', y = 'TrainingTimesLastYear', hue = 'Attrition', data=df_employee, aspect= 4, ci=None)
Out[84]:
<seaborn.axisgrid.FacetGrid at 0x7f19147efa10>
In [85]:
sns.factorplot(x = 'YearsSinceLastPromotion', y = 'TrainingTimesLastYear', hue = 'JobInvolvement', data=df_employee, aspect= 4, ci=None)
Out[85]:
<seaborn.axisgrid.FacetGrid at 0x7f1914663450>
In [86]:
sns.factorplot(x = 'TrainingTimesLastYear', y = 'YearsSinceLastPromotion', hue = 'JobRole', data=df_employee, aspect= 4, ci=None)
#sns.factorplot(x = 'TrainingTimesLastYear', y = 'YearsSinceLastPromotion', hue = 'Education', data=df_employee, aspect= 4, ci=None)
Out[86]:
<seaborn.axisgrid.FacetGrid at 0x7f191450df10>
In [87]:
sns.factorplot(x = 'YearsSinceLastPromotion', y = 'TrainingTimesLastYear', hue = 'Attrition', data=df_Anumber, aspect= 4, ci=None)
Out[87]:
<seaborn.axisgrid.FacetGrid at 0x7f191460f950>
In [88]:
sns.factorplot(x = 'TrainingTimesLastYear', y = 'JobSatisfaction', hue = 'JobRole', data=df_Anumber, aspect= 4, ci=None)
Out[88]:
<seaborn.axisgrid.FacetGrid at 0x7f1914449c90>