This is a portfolio of data science and analytics projects. The analyses were done in R with packages such as ggplot2, shiny, plotly, dplyr, magrittr, caret, randomForest, and e1071.

Example of using Data Science for a California School District (2019)

  • In this analysis, I worked with HMH's Read180U data, which included Lexile score, total number of software sessions, average number of session minutes, and total minutes spent. The purpose was to use clustering as an exploratory way to find distinct groupings of students. I also ran a multiple linear regression with Lexile score as the outcome variable. The total number of sessions was a statistically significant predictor of Lexile score. The random forest model also suggested that students should complete at least 17 sessions in the software and spend a total of between 407 and 539 minutes (roughly 7 to 9 hours). Diminishing returns also apply: more than 9 hours in the software did not yield higher Lexile scores.
  • Tools and algorithms used: R/RStudio, multiple regression, randomForest, dplyr, decision tree, caret, magrittr, ggplot2.
  • Full analysis and report here.
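
As an illustration of the regression step described above, here is a minimal sketch in base R. It runs on simulated data, so the column names (sessions, avg_minutes, lexile), the sample size, and the simulated relationship are all assumptions for illustration and not the district's data or the results reported above.

```r
# Simulated stand-in data; NOT the district's Read180U data.
set.seed(42)
n <- 200
students <- data.frame(
  sessions    = round(runif(n, 5, 40)),  # hypothetical total number of sessions
  avg_minutes = runif(n, 10, 30)         # hypothetical average minutes per session
)

# Assumed relationship for the simulation: the outcome depends on sessions plus noise.
students$lexile <- 400 + 8 * students$sessions + rnorm(n, sd = 50)

# Multiple linear regression with Lexile score as the outcome variable.
fit <- lm(lexile ~ sessions + avg_minutes, data = students)

# Coefficient table: estimates, standard errors, t values, and p-values.
coef_table <- summary(fit)$coefficients
print(round(coef_table, 4))
```

The p-value column of the coefficient table is where a statement such as "statistically significantly related to Lexile score" would be read from.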


Text Analysis of Kavanaugh Testimony from 9/27/2018

  • This text and NLP analysis examines Kavanaugh's testimony from the 9/27/2018 SCOTUS hearing and asks whether there are any patterns in it. It also gave me a chance to apply text mining techniques to a new data set. I downloaded the testimony, processed it with the "tm" R package, and built a corpus and a term-document matrix. From these I charted word frequencies in his opening statement and generated a word cloud. A sentiment analysis concludes the report.
  • Tools and algorithms used: R/RStudio, tm, ggplot2, sentimentr, wordcloud
  • Full analysis and report here.
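
To show what a term-document matrix and word frequencies look like, here is a minimal sketch. The report itself used the tm package; this version uses base R only so it runs without extra installs, which makes it an approximation of the idea and not the code behind the report. The sample sentences and the tiny stop-word list are made up.

```r
# Made-up sample text; NOT the actual testimony.
docs <- c(
  "The committee asked about the calendar.",
  "The calendar shows the summer schedule.",
  "I answered the committee directly."
)
stop_words <- c("the", "about", "i", "a", "of")  # tiny hypothetical stop-word list

# Lowercase, strip punctuation, split on whitespace, and drop stop words.
tokenize <- function(text) {
  clean <- gsub("[[:punct:]]", "", tolower(text))
  words <- strsplit(clean, "\\s+")[[1]]
  words[nzchar(words) & !(words %in% stop_words)]
}
tokens <- lapply(docs, tokenize)

# Term-document matrix: one row per term, one column per document.
terms <- sort(unique(unlist(tokens)))
tdm <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(tdm) <- terms
colnames(tdm) <- paste0("doc", seq_along(docs))

# Overall term frequencies, most frequent first.
freq <- sort(rowSums(tdm), decreasing = TRUE)
print(tdm)
print(head(freq, 5))
```

A frequency vector like freq is the kind of input a bar chart or word cloud is drawn from.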


K-Means Clustering Using HR-Related Data

  • In this analysis, I worked with fictitious HR data that contained engagement scores, absences, performance scores, terminations, pay rates, recruitment sources, gender, department, and several other pertinent HR fields. Before clustering, the data had to be normalized: k-means is a distance-based algorithm, so variables measured on larger scales would otherwise dominate the distance calculation. Results showed three distinct clusters, primarily separating employees into individual contributor roles and leadership roles. A significant gap in compensation was identified between individual contributors and leaders. Further, pay equity gaps were found across racial groups.
  • Tools & algorithms used: R/RStudio, multiple regression, logistic regression, dplyr, decision tree, randomForest, rpart, rpart.utils, ggplot2
  • Full analysis and report here.
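
The normalize-then-cluster step can be sketched in base R as follows. This is a minimal illustration on simulated data: the column names, group means, and the choice of z-score scaling are assumptions (the analysis may have used a different normalization), and the output does not reproduce the findings above.

```r
# Simulated stand-in data; NOT the HR data set from the analysis.
set.seed(1)
hr <- data.frame(
  pay_rate   = c(rnorm(40, 20, 2), rnorm(40, 35, 3), rnorm(20, 70, 5)),      # hypothetical hourly pay
  engagement = c(rnorm(40, 2.5, 0.3), rnorm(40, 3.5, 0.3), rnorm(20, 4.5, 0.2)) # hypothetical 1-5 score
)

# z-score each column so every variable has mean 0 and standard deviation 1.
# Without this, pay_rate (tens of units) would swamp engagement (about 1 to 5).
hr_scaled <- scale(hr)

# k-means with three centers; nstart reruns from several random starting points
# and keeps the solution with the lowest within-cluster sum of squares.
km <- kmeans(hr_scaled, centers = 3, nstart = 25)

# Cluster sizes, then per-cluster means reported on the original scale.
print(table(km$cluster))
print(aggregate(hr, by = list(cluster = km$cluster), FUN = mean))
```

Reporting cluster means on the original scale, as in the last line, keeps the clusters interpretable even though the algorithm ran on scaled values.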


Using Shiny and Plotly for HR Data

  • This project creates a data-driven website in which users can explore pay rates by department based on HR data. It was built as a project for the Coursera course Developing Data Products.
  • Use of shiny and plotly with HR-related data. Click here.