Introduction

The pay gap between men and women in various positions continues to plague many industries, including academia. Publicly funded universities, like Iowa State University in Ames, Iowa, are mandated by law to publish the salary information of all their employees. This provides a unique opportunity to examine pay patterns across departments at a large academic institution. While other websites and databases exist to examine similar datasets or compare pay across universities, these resources often aren’t user friendly or don’t offer the kind of interactive graphics users may wish to see. CyChecks, an R package, was created to change this. The package provides users with publicly available datasets in a tidy form and interactive web graphics to get them started, and it also gives users direct access to the data so that they can apply any R functions they wish.

Some of the services CyChecks could offer include:

  1. Aiding job seekers in negotiating starting salaries.
  2. Shedding light on possible pay inequities with respect to gender.
  3. Identifying the highest paid positions at the university.

Package contents

  1. Datasets consisting of:
    1. Employee/salary/department dataset for 2007-2018
    2. Employee/salary/department/college dataset assumed to be valid for 2018
  2. Functions to:
    1. Download data directly from the iowa.gov website
    2. Anonymize names of downloaded data
    3. Simplify professor position titles
    4. Quickly identify departments with possible gender pay inequities
    5. Launch a Shiny App to help users explore the datasets provided in our package
  3. A Shiny App for interactively visualizing data

Datasets

The state of Iowa offers a large amount of public data at this site. You can access the data through the website, and we recommend signing up for an API token here if you’d like to scrape lots of data. Iowa state employee salaries are available at this site. CyChecks provides a function to easily get data for any given year from these websites. The data from the government website does not include the employee’s home department within the University, and no dataset covering the entire faculty and staff directory is publicly available. However, Iowa State University’s Human Resources Department kindly provided a list of employees with their home departments and associated colleges, valid as of January 1st, 2019. Since the average user cannot reproduce this step, we’ve included in the CyChecks package a full dataset (with names anonymized) of all salary info (from years 2007 to 2018) cross-referenced by department. Additionally, we’ve included a dataset of employee salary data and department information (again, with names anonymized) for fiscal year 2018. Since the directory information is only valid as of 2019, employees who left before that year will not be listed under their respective department in this dataset. This isn’t ideal, but it is a reality of working with University data.

Here is an example of a subset of one of our built-in datasets, sals18. We filter by department (Agronomy) and position (Prof).

library(CyChecks)
library(dplyr)  # filter(), select(), %>%
library(knitr)  # kable()

data("sals18")
sals18 %>%
  filter(department == "AGRONOMY" & position == "PROF") %>%
  select(gender, total_salary_paid, department, position, id) %>%
  head() %>%
  kable()

gender  total_salary_paid  department  position  id
M       105984.0           AGRONOMY    PROF      d0c1b5af
M       141439.1           AGRONOMY    PROF      8e137768
M       119634.0           AGRONOMY    PROF      8c54610e
F       114736.0           AGRONOMY    PROF      08f1e76c
M       132729.0           AGRONOMY    PROF      203b875b
M       99764.0            AGRONOMY    PROF      325fa758
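
Because the data arrive as an ordinary data frame, any R function can be applied to them. As a minimal sketch, the base R code below re-enters the six rows displayed above by hand (so that it runs without the package) and averages total_salary_paid by gender. The object names agronomy_head and mean_by_gender are ours, not part of CyChecks, and six rows taken from head() are far too few to draw any conclusion from.

```r
# The six AGRONOMY / PROF rows shown above, typed in by hand for illustration
agronomy_head <- data.frame(
  gender = c("M", "M", "M", "F", "M", "M"),
  total_salary_paid = c(105984.0, 141439.1, 119634.0, 114736.0, 132729.0, 99764.0),
  id = c("d0c1b5af", "8e137768", "8c54610e", "08f1e76c", "203b875b", "325fa758")
)

# Mean total salary per gender
mean_by_gender <- aggregate(total_salary_paid ~ gender, data = agronomy_head, FUN = mean)

# Number of rows per gender, matched to the rows of the summary
mean_by_gender$n <- as.vector(table(agronomy_head$gender)[mean_by_gender$gender])

print(mean_by_gender)
```

The same pattern should carry over to the full sals18 data frame once the package is loaded.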

Functions

Function 1: Web Scraping (sal_df)

This function lets users access salary information from the State of Iowa government by scraping the data from the web and converting it into a tibble. The function drops two columns from the original dataset: department and base_salary. Department is “Iowa State University” for all rows, so it is redundant. base_salary, while perhaps more useful for examining what the University chooses to pay its employees, is blank for most rows and contains text entries in some cases, for example when employees are paid by the hour. We therefore used total_salary_paid in all our subsequent figures and analyses, even though we know this metric can sometimes be misleading, for instance because of supplements funded by summer grants.

While the user may supply an API token as an argument to this function, a token isn’t necessary for grabbing small amounts of data. Other arguments the user can change include the number of rows to return (limit), the row at which they start (offset), and the year the data come from (fiscal_year).

ex <- sal_df(limit = 10, offset = 1100, fiscal_year = 2010) %>%
  select(-c(base_salary_date, travel_subsistence)) # trimming the dataframe so it's more easily readable
kable(ex, format = "html")

fiscal_year  gender  name                  place_of_residence  position           total_salary_paid
2010         M       BYAMUKAMA EMMANUEL    STORY               POSTDOC RES ASSOC  36000.00
2010         F       BYARS JANA LENA       STORY               ASST PROF          47280.00
2010         M       BYERLY JOHN           MARSHALL            ARCHITECT III      57307.48
2010         M       BYERS JOHN LOUIS      STORY               CASUAL HOURLY      1551.12
2010         F       BYERS KAREN S         DICKINSON           CASUAL HOURLY      6666.92
2010         F       BYG ALTA JEAN         STORY               CLERK III          40586.78
2010         M       BYRD CHRISTOPHER JAM  STORY               CASUAL HOURLY      1329.00
2010         M       BYRD DAVID JOSEPH     BOONE               ASST SCIENTIST I   45477.00
2010         M       BYRD SCOTT GARRISON   STORY               CASUAL HOURLY      1088.00
2010         M       BYRD WILLIAM J        BOONE               PROGRAM DIRECTOR   99543.89
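
The limit and offset arguments make it possible to collect a large result in pages. The sketch below illustrates that paging pattern without touching the network. fetch_page() and records are hypothetical stand-ins (not part of CyChecks) that slice a local data frame, and we assume that offset counts the number of rows to skip. With the real package, the call inside the loop would presumably be a sal_df() request instead.

```r
# Stand-in for the remote table: 25 made-up records
records <- data.frame(
  id = 1:25,
  total_salary_paid = seq(30000, by = 1000, length.out = 25)
)

# Hypothetical paged request: skip `offset` rows, then return at most `limit` rows
fetch_page <- function(limit, offset) {
  if (offset >= nrow(records)) {
    return(records[0, ])  # past the end: empty page
  }
  first <- offset + 1
  last <- min(offset + limit, nrow(records))
  records[first:last, ]
}

# Request pages until an empty one comes back
limit <- 10
offset <- 0
pages <- list()
repeat {
  page <- fetch_page(limit = limit, offset = offset)
  if (nrow(page) == 0) break
  pages[[length(pages) + 1]] <- page
  offset <- offset + limit
}

# Stack the pages into one data frame
all_rows <- do.call(rbind, pages)
nrow(all_rows)
```

Stopping on the first empty page avoids needing to know the total row count in advance.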

Function 2: Anonymize names (anonymize)

This function lets the user anonymize a column of the data frame. While salary information is publicly available, we realize that it is also sensitive, personal information, and we want to give the user the opportunity to anonymize something like names if they wish. The function converts each individual’s name to an alphanumeric id to mask the individual’s real identity. The id is consistent for a given name, so the user can still use group_by() successfully, for instance. This function was used to anonymize all the datasets included in this package.

Here are some examples:

df_exmple <- data.frame("name" = c("Brianna","Gina","Lydia","Stephanie","Yones"),
                        "Salary" = c(5456, 5698, 5647, 5842, 5910)) # Create a dataframe
kable(df_exmple)

name       Salary
Brianna    5456
Gina       5698
Lydia      5647
Stephanie  5842
Yones      5910

anon <- anonymize(df_exmple) # Anonymize the name column

knitr::kable(anon, col.names = c("name" , "Salary", "Anonymous Name"))

name       Salary  Anonymous Name
Brianna    5456    c462a236
Gina       5698    1b9d8880
Lydia      5647    7509a610
Stephanie  5842    04d467d9
Yones      5910    0f22005a
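
To see why a consistent id keeps grouped summaries intact, here is a minimal base R sketch. It is not the implementation behind anonymize(): instead of the hash-like ids shown above, it swaps in simple sequential ids (id001, id002, ...), and the data frame and column names are made up for illustration.

```r
# Made-up salaries in which two people appear in more than one year
df <- data.frame(
  name = c("Brianna", "Gina", "Brianna", "Lydia", "Gina"),
  fiscal_year = c(2017, 2017, 2018, 2018, 2018),
  salary = c(5456, 5698, 5600, 5647, 5800)
)

# Give every distinct name one id, so repeated names share the same id
key <- unique(df$name)
df$id <- sprintf("id%03d", match(df$name, key))

# Drop the real names and keep only the id
anon <- df[, c("id", "fiscal_year", "salary")]

# Grouping by id still collects each person's rows together
mean_by_id <- aggregate(salary ~ id, data = anon, FUN = mean)
print(mean_by_id)
```

Unlike a hash, sequential ids depend on row order, so they are only stable within a single data frame.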

Function 3: Get professor info (get_profs)

This function creates a new data frame with professor positions grouped into 12 categories of simplified position titles. The function was created after the developers noticed that there were many varying professor titles at the University, which made summarizing data difficult. get_profs() filters the position variable of the sals_dept dataset for any entries that contain the string ‘PROF’. Then it creates a new variable called position_simplified, which contains the simplified position category. The table below shows the string that is searched for in the position variable and the corresponding position_simplified category assigned to it. This function was created specifically to run on the sals_dept dataset. If the function is used on a dataset with position abbreviations other than those listed in the table, the resulting data frame will be suboptimal.

string        position_simplified
EMER          emeritus
DIST          distinguished
UNIV          university
MORRILL       Morrill
ADJ           adjunct
AFFIL         affiliate
VSTG          visting
ASST          assistant
ASSOC         associate
COLLAB        collab
CHAIR or CHR  chair
PROF          professor

The table below shows the result of running get_profs() on the sals_dept dataset and filtering for associate professors. Only the first six rows are displayed.

fiscal_year  gender  place_of_residence  position_simplified
2007         M       STORY               associate
2007         M       STORY               associate
2007         M       STORY               associate
2007         M       ROUTT               associate
2007         F       STORY               associate
2007         F       STORY               associate
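
The string-to-category lookup described in the table above can be sketched in base R as follows. This is an illustration, not the source of get_profs(): simplify_position() is a hypothetical helper, the category labels are copied verbatim from the table, and we assume that the first matching string in table order wins (so a title such as ‘ADJ ASST PROF’ is classed as adjunct). Most of the sample titles are invented.

```r
# Hypothetical helper: map raw position titles to simplified categories
simplify_position <- function(position) {
  # Search strings and labels, in the order of the table above
  patterns <- c("EMER", "DIST", "UNIV", "MORRILL", "ADJ", "AFFIL", "VSTG",
                "ASST", "ASSOC", "COLLAB", "CHAIR|CHR", "PROF")
  labels <- c("emeritus", "distinguished", "university", "Morrill", "adjunct",
              "affiliate", "visting", "assistant", "associate", "collab",
              "chair", "professor")
  out <- rep(NA_character_, length(position))
  for (i in seq_along(patterns)) {
    # Only fill titles that no earlier pattern has claimed (first match wins)
    hit <- is.na(out) & grepl(patterns[i], position)
    out[hit] <- labels[i]
  }
  out
}

# Sample titles; the non-professor title is filtered out first
titles <- c("ASST PROF", "ASSOC PROF", "PROF", "ADJ ASST PROF", "PROF EMER", "CLERK III")
profs <- titles[grepl("PROF", titles)]
simplified <- data.frame(position = profs,
                         position_simplified = simplify_position(profs))
print(simplified)
```

Placing the plain ‘PROF’ pattern last is what keeps the more specific titles from all collapsing into the professor category.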

Function 4: Run basic statistics (stats_mf)

This function uses one fiscal year of data to identify departments with possible gender pay disparities. Within a department, the function identifies positions held by more than one female and more than one male employee. For each department, the function then fits a simple linear model of the following form:

total_salary ~ position + gender

The function then sorts the departments by the p-value associated with the gender term of the linear model, with the lowest p-values appearing first. It then assigns a verdict: a p-value below 0.20 elicits a ‘boo’ verdict, and a p-value of 0.20 or higher earns an ‘ok’. While this may seem like a high threshold, we feel that a p-value below 0.20 is an unlikely enough result to warrant further investigation. stats_mf() allows users to quickly filter and find departments with possible pay inequities associated with gender.

data(sals18)
sals18 %>%
  stats_mf() %>%
  filter(verdict == 'boo') %>%
  head()
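
The per-department procedure described above can be sketched with base R on simulated data. This is an illustration under our own assumptions rather than the code inside stats_mf(): the toy departments, salaries, and object names are invented, the response is called total_salary_paid to match the dataset column, and the step that screens positions for both genders is skipped because every toy position already has ten women and ten men.

```r
set.seed(1)

# Simulated salaries: two departments, two positions, ten F and ten M per position
toy <- data.frame(
  department = rep(c("DEPT A", "DEPT B"), each = 40),
  position = rep(rep(c("ASST PROF", "PROF"), each = 20), times = 2),
  gender = rep(c("F", "M"), times = 40)
)
base <- ifelse(toy$position == "PROF", 120000, 80000)
# Build a 20,000 gap for women into DEPT A only
gap <- ifelse(toy$department == "DEPT A" & toy$gender == "F", -20000, 0)
toy$total_salary_paid <- base + gap + rnorm(nrow(toy), sd = 3000)

# Fit salary ~ position + gender within each department, keep the gender p-value
results <- do.call(rbind, lapply(split(toy, toy$department), function(d) {
  fit <- lm(total_salary_paid ~ position + gender, data = d)
  data.frame(department = d$department[1],
             p_val = coef(summary(fit))["genderM", "Pr(>|t|)"])
}))

# Lowest p-values first, then apply the 0.20 threshold
results <- results[order(results$p_val), ]
results$verdict <- ifelse(results$p_val < 0.20, "boo", "ok")
print(results)
```

Because a department with no true gap still has roughly a one in five chance of falling below 0.20, a ‘boo’ is best read as a prompt for a closer look rather than as evidence of inequity.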

Figures produced with package contents

The following figures are used in the Shiny App to help the user visualize pay patterns.

The figure below shows total salary paid ($ thousands) by gender for different professor positions.

Professor 2018 salaries by gender and position in the Agronomy Department

From this figure you can see that only 3 of the 6 professor positions have both males and females represented. The gray lines connect the mean salaries for each position; the negative slope of the lines indicates that females on average earn less than their male counterparts in the same positions. There are also fewer dots in the female category, indicating that more males than females hold professor positions. This type of graph is available in the PROF tab of our Shiny App.

An additional figure that looks at the gender make-up of departments is also included in the Shiny App.

Number and gender of employees in professor positions in the Agronomy Dept. based on departmental affiliations in 2018

From this figure you can see the department has not hired a woman to a professor position since 2014. This is consistent with the previous figure, which showed there were no women in the ‘Assistant Professor’ position.

Shiny App

Feel free to check out the Shiny App for this package by running the following code.

runShiny()

Conclusion

CyChecks equips the user with four functions for working with the Iowa salary data described above: scraping the dataset from the web, anonymizing the names of individuals for the sake of privacy, selecting and filtering professor data, and running basic statistical analyses on data for the years 2007 to 2018. Together they let the user compare wages among Iowa State University departments and investigate equity across gender and career position.

Visualizing the data shows a disparity between men’s and women’s average yearly salaries in STEM departments (e.g. Agronomy, Engineering, Mathematics), where women are scarce across academic positions (e.g. professor, assistant professor, professor emeritus). Conversely, some areas of the university, such as the social sciences (e.g. education, language), are majority female. Either way, these visualizations make it easier for users to interpret trends in the data and to identify where initiatives are needed to close the pay gap in departments that lack gender representation at each position level.

Limitations & Future Work

We were only granted access to departmental affiliations for people employed by Iowa State University as of January 1st, 2019. Employees who left the university before that date are not included in our dataset. This makes interpreting post-doctoral information particularly problematic, due to the transitory nature of that position. Despite these limitations, we believe this work opens the door to conversations about pay inequalities at ISU and can serve as a resource for individuals looking to negotiate starting salaries or request raises. Future work will include providing a way to quickly identify positions within a department that lack gender diversity, incorporating individuals who do not identify with one of the binary gender categories, and including a way to access departmental affiliations from the web.

Package Website/Vignette

Follow the link to check out the CyChecks package and discover the pay patterns in your department!

Enjoy Exploring!