Exploratory Data Analysis (EDA for short) is the process of investigating, analyzing, and learning more about a dataset prior to performing any sort of statistical analysis. The goal of EDA is to gain a deeper understanding of the data you are working with, typically through graphical visualizations or summary statistics. EDA typically involves looking for trends in the data, spotting outliers, checking for unreliable variables, identifying variables or rows with a high proportion of missing data, checking assumptions about the data, and more along these lines.
Exploratory Data Analysis is usually the first step in the data analysis process, and is often regarded as one of the most critical steps for building an accurate understanding of the data you are working with. Without it, it is difficult to start forming hypotheses about the data, understand what manipulation may need to occur, and gauge the reliability of your variables, among other things.
There isn’t a step-by-step playbook for how to complete EDA on a dataset or what specifics to look at, which can make the process tedious and difficult when figuring out where to start and what to explore. There are some general rules of thumb for which tools can be used to explore the data, but the approach is usually left up to the person exploring the dataset. I often work with large datasets and find it quite difficult to figure out the best way to approach exploring the data in order to pull initial insights out of it. I have been searching for a way to streamline this initial exploratory analysis and take some of the heavy lifting out of EDA, which is where the DataExplorer R package comes into play.
The DataExplorer R package’s main purpose is to help automate the data exploration process so that you, as the user, can spend less time worrying about which techniques to employ and more time actually understanding the data. The package includes functions that scan each variable in your dataset to produce graphical visualizations, functions that summarize the dataset, functions that help with reviewing missing data, and functions that aid in data manipulation to make data processing and formatting quick and easy.
Since the purpose of this package is to help automate exploratory data analysis, you will need a dataset in order to use its functions. The package has functions for both continuous and discrete data, so any type of variable can be included in the dataset. Some of the functions prefer the data in table form, so it is recommended to convert your data to a data table before using the package; this can be done with the data.table() function from the data.table package.
For the purposes of this project, I will walk through most of the functions in this package that summarize the dataset and produce visualizations of the data, the functions that can be used to perform missing data analysis, and the functions that generate a report containing output from most of the functions in the package. At the very end, I will call out the remaining functions I do not cover that may still be of use during EDA.
To demonstrate the DataExplorer package, I will be using a dataset of the 1918 to 2022 National Hockey League (NHL) Stanley Cup Playoff results. This data was obtained from Kaggle, but was originally collected from www.hockey-reference.com and then cleaned for data analysis.
The format of the data is that there is one row for each team that made the playoffs in each year back to 1918. Each row contains information about the team’s gameplay during that year’s playoffs, such as goals for, goals against, and number of wins and losses.
The breakdown of the number of teams included per year are as follows:
1918-1925 (2 teams per year)
1926 (3 teams per year)
1927-1942 (6 teams per year)
1943-1967 (4 teams per year)
1968-1974 (8 teams per year)
1975-1979 (12 teams per year)
1980-Present (16 teams per year)
The column information is as follows:
rank: Rank the team finished
team: Name of the team
year: Year of playoffs
games: Total games played during the playoffs
wins: Total wins during the playoffs
losses: Total losses during the playoffs
ties: Total ties during the playoffs
shootout_wins: Total shootout wins during the playoffs
shootout_losses: Total shootout losses during the playoffs
win_loss_percentage: Win percentage of the team during the playoffs
goals_scored: Goals scored by team during the playoffs
goals_against: Goals scored on team during the playoffs
goal_differential: Goal differential between scored and against during the playoffs
After I downloaded the data, I did some additional manipulation before reading the data in.
SalaryCapEra - I added a column as an indicator of whether the year played falls within the current ‘Salary Cap Era’, the period in which teams have a set amount they can spend on players. I added this to introduce an additional binary variable that may show trends with the other variables. (Y if year is 2006 to present, blank otherwise)
WonCup? - I added a column as an indicator of whether the row corresponds to the team that won the Stanley Cup for the given year. For example, the Colorado Avalanche won the Stanley Cup in 2022, so for the Colorado Avalanche row for the 2022 playoffs, the WonCup? variable is “Y”. I added this for similar reasons to the SalaryCapEra variable. (Y if rank = 1, blank otherwise)
Now that we have an understanding of why EDA is such an important phase of the data analysis process, how the DataExplorer package can help automate EDA, and what the data we plan to use looks like, we can start our walkthrough of what EDA can look like with the DataExplorer package and our NHL data. This example comprises the following parts:
Installing and Loading Relevant Packages
Importing the NHL Data
Completing Initial Exploration into the Data
Completing a Missing Data Analysis
Exploring Variables Using Graphical Visualizations
Creating Reports Using the DataExplorer Package
A Brief Overview of Other Functions in DataExplorer
Package Conclusions
We first need to make sure our packages are loaded so that we can utilize the functions. Since we are using the DataExplorer package, we can use library(DataExplorer) to attach it and get access to the functions within the package. I’m also loading the readr package in order to read in my dataset, and the data.table package to convert my dataset to a table. If you do not have any of these packages installed, you will first need to run install.packages("PackageOfInterest").
#This is our main package that we are exploring
library(DataExplorer)
#This is needed to read in the dataset appropriately
library(readr)
#This is needed to update the data to table form
library(data.table)
Now that we have the workspace set up, we can read in the dataset we will use for this example. I have set the working directory of this session to the location of the dataset. Since the file is saved as a csv, I can use read_csv() to import the data. Once the data is loaded, str() can be used for a quick glance to make sure there were no issues on import.
#set the wd to the location of the dataset file
setwd("~/Desktop/Stat 484/Project")
#read in the data
nhlplayoffs <- read_csv("nhlplayoffsEdit.csv")
#set the data as a data table for use later
nhlplayoffs<-data.table(nhlplayoffs)
#ensure the data was read in correctly
str(nhlplayoffs)
## Classes 'data.table' and 'data.frame': 1009 obs. of 15 variables:
## $ rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ team : chr "Colorado Avalanche" "Tampa Bay Lightning" "New York Rangers" "Edmonton Oilers" ...
## $ year : num 2022 2022 2022 2022 2022 ...
## $ games : num 20 23 20 16 14 12 12 10 7 7 ...
## $ wins : num 16 14 10 8 7 6 5 4 3 3 ...
## $ losses : num 4 9 10 8 7 6 7 6 4 4 ...
## $ ties : num 0 0 0 0 0 0 0 0 0 0 ...
## $ shootout_wins : num 5 1 1 1 1 1 1 2 0 1 ...
## $ shootout_losses : num 1 2 2 2 0 1 1 0 0 0 ...
## $ win_loss_percentage: num 0.8 0.609 0.5 0.5 0.5 0.5 0.417 0.4 0.429 0.429 ...
## $ goals_scored : num 85 67 62 65 37 40 35 23 20 17 ...
## $ goals_against : num 55 61 58 59 40 38 39 32 24 27 ...
## $ goal_differential : num 30 6 4 6 -3 2 -4 -9 -4 -10 ...
## $ SalaryCapEra : chr "Y" "Y" "Y" "Y" ...
## $ WonCup? : chr "Y" NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
Now that we have the data loaded in our session, we can start our exploratory data analysis. The first function we can use from the DataExplorer package is introduce(). This function outputs basic information about our dataset, such as the number of rows, columns, discrete columns, continuous columns, columns with missing data, missing observations, and estimated memory allocation in bytes. This function only has one argument, which is the dataset we are analyzing.
introduce(nhlplayoffs)
## rows columns discrete_columns continuous_columns all_missing_columns
## 1: 1009 15 3 12 0
## total_missing_values complete_rows total_observations memory_usage
## 1: 1635 17 15135 128672
Looking at this output, we can see we have 1009 rows and 15 columns. 3 of the columns are discrete, while 12 are continuous. None of the columns contain only missing data; however, 1635 observations are missing from our dataset. On top of that, we have 17 rows with a value in every column, 15135 total observations in the dataset, and an estimated memory footprint of 128672 bytes. All of these pieces of information are good to know as we start to understand what our dataset comprises.
We can take this to the next level and plot the information returned by introduce() using the plot_intro() function. The output of this function returns much of the information we saw from the previous output, but in graphical form to help visualize the data.
The plot_intro() function follows the form:
plot_intro(data, geom_label_args = list(), title = NULL, ggtheme = theme_gray(), theme_config = list())
where the only argument you are required to pass in is the input dataset. The geom_label_args argument takes a list of additional arguments for the plot’s labels, the title argument adds a title to the output, the ggtheme argument sets a complete ggplot2 theme for the plot, and the theme_config argument takes a list of theme settings to tweak the display.
plot_intro(nhlplayoffs)
As mentioned before, this plot outputs much of the same summary information from the introduce() function, but in a graphical form.
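If we wanted to tweak this output, the optional arguments can be supplied directly. Below is a minimal sketch using only the documented arguments; the title text and theme_minimal() (a standard ggplot2 theme) are illustrative choices:
#ggplot2 supplies the alternate theme used below
library(ggplot2)
#add a title and swap in a different complete theme
plot_intro(nhlplayoffs, title = "NHL Playoff Data Overview", ggtheme = theme_minimal())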
We can draw some useful conclusions from this plot for our exploratory data analysis. We can see that we have many more continuous columns than discrete ones, that about 10% of all our observations are missing, and that only a very small percentage of rows have complete data. Based on this plot, it makes sense to get a better picture of the missing data before continuing further, and luckily there are functions in DataExplorer that can assist with that.
In the previous plot, we saw that 10.8% of our observations are missing and that a small percentage of rows have complete data. But what if we want to know which variables are the biggest contributors to these missing observations?
We can use the plot_missing() DataExplorer function to plot the frequency of missing values for each variable in the data. The plot_missing() function follows the form:
plot_missing(data, group = list(Good = 0.05, OK = 0.4, Bad = 0.8, Remove = 1), missing_only = FALSE, geom_label_args = list(), title = NULL, ggtheme = theme_gray(), theme_config = list(legend.position = c("bottom")))
where the only argument you are required to pass in is the input dataset. The geom_label_args, title, ggtheme, and theme_config arguments serve the same purposes as the arguments of the same names in plot_intro(). The group argument can be used to update the bounds for what we consider a good or bad percentage of missing data, and the missing_only argument specifies whether to plot only variables with missing data or all variables. By default, 0 to 5% missing is ‘Good’, 5 to 40% is ‘OK’, 40 to 80% is ‘Bad’, and a variable should be considered for removal if 80 to 100% of its data is missing. Also by default, all variables are shown, not just those with missing data.
plot_missing(nhlplayoffs, group=list(Good=0.05, OK=0.5, Bad=0.85, Remove=1))
Here, I set the boundaries for our dataset so that ‘good’ data has between 0 and 5% missing, ‘OK’ data has 5 to 50% missing, ‘bad’ data has 50 to 85% missing, and anything above that should be considered for removal. The output shows each variable and where it falls relative to the boundaries we set.
In the NHL dataset, we can see that SalaryCapEra and WonCup? both have a large number of missing rows, and WonCup? should be considered for removal. However, I know these variables are only populated when they are true: SalaryCapEra = Y if the year falls after 2005, and WonCup? = Y if the row corresponds to the team that won the Stanley Cup in the respective year. All other observations are blank for these variables. Let’s say I want to keep these variables and fill in the missing observations with “N”. We can do that with the set_missing() function in the DataExplorer package.
The set_missing() function allows us to quickly set all missing values to a value we pass into the function. This function follows the form:
set_missing(data, value, exclude = NULL)
where the data argument is our input dataset (which must be in data.table form) and the exclude argument gives column indices or names to exclude from the function. The value argument takes a single value, or a list of two values, to set the missing values to; when supplying two values, one should be numeric and one non-numeric so that all column types in the dataset are covered. One thing to note is that this function updates the table directly, so we do not need to assign the output to another object.
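As a sketch of the two-value form (shown commented out and not run here, so the output below still reflects the original missing counts): if the dataset also had numeric columns with missing values, one call could fill both types at once, and exclude can protect columns by name.
#hypothetical sketch: fill numeric NAs with 0 and character NAs with "N",
#leaving the team column untouched
#set_missing(nhlplayoffs, list(0L, "N"), exclude = "team")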
Here we want to set the missing data for the last two columns to “N” since all missing rows indicate a “no”.
set_missing(nhlplayoffs,"N")
## Column [SalaryCapEra]: Set 729 missing values to N
## Column [WonCup?]: Set 906 missing values to N
table(nhlplayoffs$SalaryCapEra)
##
## N Y
## 729 280
table(nhlplayoffs$`WonCup?`)
##
## N Y
## 906 103
As we can see, our two variables that had significant missing data now have the previously blank observations set to “N”.
Another common step in EDA is to plot variables to check assumptions about the data, see if there are any trends or outliers, and start to formulate hypotheses about the variables. The DataExplorer package has quite a few functions whose sole purpose is to graph variables for exploration.
The plot_bar() function plots a bar chart for each discrete feature in the dataset. This function follows the form:
plot_bar(data, with = NULL, by = NULL, by_position = "fill", maxcat = 50, order_bar = TRUE, binary_as_factor = TRUE, title = NULL, ggtheme = theme_gray(), theme_config = list(), nrow = 3L, ncol = 3L, parallel = FALSE)
The only argument that is required is the data argument, which is the input dataset that houses the variables we would like to graph. Some of the other important arguments are as follows:
by: using this argument, you can specify a discrete feature to break the graph down by. This can be helpful if you are investigating trends in the data and want to separate the graph by a variable to see if there is a pattern.
with: this is a continuous feature to sum the data by; the default is to sum by frequency (a sketch using this argument appears after the next plot)
maxcat: this argument specifies the maximum number of categories allowed for each discrete variable, with the default being 50
binary_as_factor: this argument specifies whether binary variables should be treated as categorical variables; the default is TRUE
plot_bar(nhlplayoffs)
The output of this function is a bar plot for each discrete variable. Here we can check which values have the highest frequency in the data. From the plot on the left, we can see that the Canadiens, Bruins, and Maple Leafs have had the most appearances in the NHL playoffs since 1918. The middle and right plots indicate that many more playoff games were played before the salary cap era, and that a very small proportion of our data includes a team that won the Cup (which is to be expected, since one team wins per year).
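Instead of frequency counts, the with argument sums a continuous feature within each category. A minimal sketch, summing total playoff wins:
#sum the wins column within each discrete category rather than counting rows
plot_bar(nhlplayoffs, with = "wins")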
If we were interested in seeing our discrete data broken down by Stanley Cup wins, we can pass the WonCup? variable into the by argument of this function:
plot_bar(nhlplayoffs,by = "WonCup.")
Here we can see that including a feature in the by argument breaks down our output by that specified variable. From the graph on the left, we can look at the relative frequency of the WonCup? variable to see how often a team has won the Cup relative to the number of appearances it has made in the playoffs. The graph on the right is a little less useful here, but gives us a clue that fewer Stanley Cups have been won in the salary cap era.
The plot_boxplot() function creates a boxplot for each continuous feature grouped by a variable we pass in. This function follows the form:
plot_boxplot(data, by, binary_as_factor = TRUE, geom_boxplot_args = list(), scale_y = "continuous", title = NULL, ggtheme = theme_gray(), theme_config = list(), nrow = 3L, ncol = 4L, parallel = FALSE)
The two arguments that are required are the data and by arguments. The data argument is our input dataset. The by argument is a feature or variable name (either continuous or discrete) to group the data by. All other arguments that would be of interest in this function have been discussed in this demonstration thus far.
Since we need a variable to group the data by, we can select the SalaryCapEra variable to see if there are any trends in it across the continuous variables.
plot_boxplot(nhlplayoffs,by= "SalaryCapEra")
The output of this function is a collection of boxplots of the continuous variables in the dataset. We can see that some of the variables, such as goal_differential, goals_scored, wins, and win_loss_percentage, have very similar distributions regardless of whether the games were played in the salary cap era.
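If we wanted outliers to stand out, geom_boxplot_args passes options straight through to ggplot2’s geom_boxplot(). A hedged sketch; the colour choice is purely illustrative:
#highlight outlying points in red on every boxplot
plot_boxplot(nhlplayoffs, by = "SalaryCapEra", geom_boxplot_args = list(outlier.colour = "red"))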
The plot_histogram() function creates a histogram for each continuous feature in the dataset. This function follows the form:
plot_histogram(data, binary_as_factor = TRUE, geom_histogram_args = list(bins = 30L), scale_x = "continuous", title = NULL, ggtheme = theme_gray(), theme_config = list(), nrow = 4L, ncol = 4L, parallel = FALSE)
The only argument that is required is the data argument, where we need to pass in our dataset of interest. All other arguments that would be of interest have been discussed in this demonstration thus far.
plot_histogram(nhlplayoffs)
This function is great for exploring the distribution of each continuous variable. With the output, you can see whether any of the variables are heavily skewed, or whether any follow a normal distribution. Each continuous variable is plotted with frequency on the y axis and an x axis that adjusts to the scale of that variable.
For our NHL data, we can see that a majority of our variables have a right skew: most of the data falls at lower values, with a smaller percentage of the data in the right tail. This is expected, because hockey scores are typically low; goals scored in a game usually fall around 0 to 3, with significantly fewer games having much larger scores.
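If the default 30 bins feel too fine for totals like these, geom_histogram_args lets us coarsen them. A minimal sketch; 15 bins is an arbitrary illustrative choice:
#use 15 wider bins instead of the default 30
plot_histogram(nhlplayoffs, geom_histogram_args = list(bins = 15L))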
The plot_correlation() function outputs a heatmap of the correlations between the variables in our dataset. This function follows the form:
plot_correlation(data, type = c("all", "discrete", "continuous"), maxcat = 20L, cor_args = list(), geom_text_args = list(), title = NULL, ggtheme = theme_gray(), theme_config = list(legend.position = "bottom", axis.text.x = element_text(angle = 90)))
The only argument you are required to supply is the dataset. Some of the other important arguments are as follows:
type: the type argument specifies the type of columns we would like to include in our correlation calculation. We can supply "all", "discrete", or "continuous" based on which features we are interested in.
cor_args: this argument allows us to pass a list of arguments through to cor(). One thing this may be used for is specifying the correlation method used to compute the coefficients (a sketch using this argument appears after the heatmap discussion)
maxcat: this argument specifies the maximum number of categories allowed for each discrete feature. If a variable has more categories than the number specified here, it will be ignored; 20 is the default
plot_correlation(nhlplayoffs)
## 1 features with more than 20 categories ignored!
## team: 48 categories
We can see that the output of this function is a heatmap with all of the correlation coefficients of the variables in our dataset. The deeper the red, the stronger the positive correlation is, and the deeper the blue the stronger the negative correlation.
For our NHL data, we can see some correlation trends that may be worth exploring further in the data analysis phase, such as goals scored vs. wins, goals scored vs. losses, wins vs. games, and goals against vs. goals scored.
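Because many of these variables are skewed, a rank-based coefficient can be a useful cross-check. A hedged sketch using cor_args to switch the method and type to restrict the calculation to continuous features:
#Spearman correlation on the continuous columns only
plot_correlation(nhlplayoffs, type = "continuous", cor_args = list(method = "spearman"))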
There are five additional functions within the DataExplorer package that can be used to visualize data for exploration. The arguments are very similar to the other functions, and any special arguments to consider will be called out below if applicable.
The plot_density() function plots a density estimate for each continuous variable in the dataset. The output looks very similar to that of plot_histogram(), but returns a smoother visualization of each continuous variable’s distribution. The function format is as follows:
plot_density(data, binary_as_factor = TRUE, geom_density_args = list(), scale_x = "continuous", title = NULL, ggtheme = theme_gray(), theme_config = list(), nrow = 4L, ncol = 4L, parallel = FALSE)
Once again, the only argument required is the data argument.
plot_density(nhlplayoffs)
We can see the output of this function is the density plot for each of the continuous variables. This helps us understand the distribution of values for each variable in a similar way to plot_histogram().
The plot_prcomp() function visualizes the output of a principal component analysis of the given data, which it computes using the prcomp() function. The function format is as follows:
plot_prcomp(data, variance_cap = 0.8, maxcat = 50L, prcomp_args = list(scale. = TRUE), geom_label_args = list(), title = NULL, ggtheme = theme_gray(), theme_config = list(), nrow = 3L, ncol = 3L, parallel = FALSE)
The data argument is the only one that is required. The variance_cap argument sets the maximum cumulative explained variance for the principal components shown, and prcomp_args can be used to pass a list of other arguments to prcomp() to customize the principal component analysis being computed.
plot_prcomp(nhlplayoffs)
The output of this function shows which variables contribute the most variability to each principal component. The first plot breaks down the percentage of variance explained by each principal component. Each remaining plot corresponds to one principal component, with the variables listed on the y axis and their relative importance on the x axis. For purposes of space, I have chosen not to output this function to this html, since it can get quite long, but I have included two of the plots that are generated so that you can get an idea of the output:
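If we did want to customize the run, the arguments above can be combined. A hedged sketch; the 0.9 cap is illustrative, and center and scale. are standard prcomp() options passed through prcomp_args:
#show components until 90% of variance is explained; center and scale the data
plot_prcomp(nhlplayoffs, variance_cap = 0.9, prcomp_args = list(scale. = TRUE, center = TRUE))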
The plot_qq() function creates a quantile-quantile plot for each continuous feature. QQ plots are typically used to assess the distribution of a variable: normally distributed data will follow the straight diagonal line on the plot, while skewed or otherwise non-normal data will have points falling far from that line. The general form of the function is:
plot_qq(data, by = NULL, sampled_rows = nrow(data), geom_qq_args = list(), geom_qq_line_args = list(), title = NULL, ggtheme = theme_gray(), theme_config = list(), nrow = 3L, ncol = 3L, parallel = FALSE)
The data argument is the only one that is required. The sampled_rows argument can be used to sample a subset of rows to include in the plot if the dataset has too many. The plots can also be broken out by a feature using the by argument (see the sketch after the next plot).
plot_qq(nhlplayoffs)
We can see that many of the variables have a slight skew, shown by the points at the ends of the graph falling far from the line while the points in the middle fall on it. This is consistent with our earlier observation that many of the variables have a right-skewed distribution.
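If we suspected the skew differs between eras, the by argument splits each QQ plot by a feature, and sampled_rows keeps the plots light. A minimal sketch; the 500-row sample size is an arbitrary choice:
#QQ plots split by salary cap era, drawn from a 500-row sample
plot_qq(nhlplayoffs, by = "SalaryCapEra", sampled_rows = 500L)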
The plot_scatterplot() function creates scatterplots for all features in the dataset, fixing on a feature to use as the y axis variable. This function’s general format is:
plot_scatterplot(data, by, sampled_rows = nrow(data), geom_point_args = list(), scale_x = NULL, scale_y = NULL, title = NULL, ggtheme = theme_gray(), theme_config = list(), nrow = 3L, ncol = 3L, parallel = FALSE)
The two arguments that are required are the data and by arguments. The data argument is our input dataset. The by argument is a feature name to fix the data on. In other words, this will be the variable used as the y variable. All other arguments that would be of interest have been discussed in this demonstration thus far.
split_columns()
It is not uncommon to look only at continuous variables when making scatterplots. We could index our data manually to pass in only the continuous columns, but what if we had a large number of variables and that became tedious? Luckily, the DataExplorer package has a function called split_columns() that splits the dataset into its continuous and discrete variables. The only argument that needs to be passed in is the dataset being split. The output of this function is a list containing a dataset of discrete variables, a dataset of continuous variables, the number of discrete variables, the number of continuous variables, and the number of variables with no observations. Once we have this output, we can pass the new continuous or discrete datasets into our functions, as shown below.
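Here is a minimal sketch of the split on its own, storing the result so the pieces can be inspected and reused:
#split the data into discrete and continuous parts
parts <- split_columns(nhlplayoffs)
#counts of each variable type found
parts$num_discrete
parts$num_continuous
#the continuous columns as their own dataset
head(parts$continuous)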
plot_scatterplot(split_columns(nhlplayoffs)$continuous,by="win_loss_percentage")
Here we fixed the plot on the win_loss_percentage variable, which is why it appears as the y variable in all of the plots. The output is a scatterplot of each continuous variable (since those were the variables passed in via split_columns()) against the fixed by variable. Within EDA we can use this to see if there are any trends or correlations in the data.
The last data visualization function in this package is the plot_str() function. This function creates a network diagram of the data structure. This can be used to view the relationship of variables within datasets. This is more useful when you are looking at related datasets with the same variable information. The general format of the function is:
plot_str(data, type = c("diagonal", "radial"), max_level = NULL, print_network = TRUE)
The data argument is the only argument that is required. The type argument determines what type of diagram we would like to output, the max_level argument takes in an integer of nested levels we would like to include in the output, and the print_network argument specifies if the graph should be plotted.
plot_str(nhlplayoffs)
Here we can see all of the variables and how they are structured within the dataset. This can be used within EDA when there are variables connecting multiple datasets, and it would be beneficial to see how they relate.
While all of these functions are incredibly helpful on their own, the DataExplorer developers recognized that it would be common for users of the package to want to run all of the functions together. Because of this, there are two functions that can be used together to create a report that includes all of the information covered thus far from the package.
The first of these functions is configure_report(). This function is used to configure and customize the report template for the eventual report of all the data exploration output that will be generated. The format of this function is very long, and is as follows:
configure_report(
add_introduce = TRUE,
add_plot_intro = TRUE,
add_plot_str = TRUE,
add_plot_missing = TRUE,
add_plot_histogram = TRUE,
add_plot_density = FALSE,
add_plot_qq = TRUE,
add_plot_bar = TRUE,
add_plot_correlation = TRUE,
add_plot_prcomp = TRUE,
add_plot_boxplot = TRUE,
add_plot_scatterplot = TRUE,
plot_intro_args = list(),
plot_str_args = list(type = "diagonal", fontSize = 35, width = 1000, margin = list(left = 350, right = 250)),
plot_missing_args = list(),
plot_histogram_args = list(),
plot_density_args = list(),
plot_qq_args = list(sampled_rows = 1000L),
plot_bar_args = list(),
plot_correlation_args = list(cor_args = list(use = "pairwise.complete.obs")),
plot_prcomp_args = list(),
plot_boxplot_args = list(),
plot_scatterplot_args = list(sampled_rows = 1000L),
global_ggtheme = quote(theme_gray()),
global_theme_config = list())
The purpose of all of these arguments is to allow customizability to the report that we plan to generate. The arguments mostly fall into two different buckets:
The ‘add_plot_plottype’ arguments indicate whether a given plot should be included in the report output. All of these are set to TRUE by default except add_plot_density, as the signature above shows. This can be useful if one plot is very long and not useful to the analysis; we can set its argument to FALSE to exclude it.
The ‘plot_plottype_args’ arguments allow you to pass specifying arguments to a given plot. This adds additional flexibility to the plots.
The global_ggtheme and global_theme_config arguments are used for setting the theme and configuration of the report.
Once we have our report configuration complete, we can generate a report of all plots and output from this package using create_report(). This function creates a report that includes anything that was specified in the report configuration. By default, it will include all the data visualization or summary functions included in the DataExplorer package. This function has the following general format:
create_report(data, output_format = html_document(toc = TRUE, toc_depth = 6, theme = "yeti"), output_file = "report.html", output_dir = getwd(), y = NULL, config = configure_report(), report_title = "Data Profiling Report")
The only argument that is required is the data argument, which takes the dataset you are interested in creating a data profiling report for. Other important arguments are:
output_format: this is the output format that is generated. The default is an html document with a table of contents and the ‘yeti’ theme
output_file: This is the name that you would like to call the output file. The default is “report.html”
output_dir: This is the directory to output the report to. By default, it will output to the current working directory.
y: this argument is the name of a response variable to pass to any appropriate plotting functions if needed. For example, if we need a y variable to pass to a scatterplot function, we can specify that here.
config: This argument takes a named list of functions to be added to the report. It can be used in place of configure_report() to configure the content of the report.
report_title: This can be used to set the report title
create_report(nhlplayoffs)
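If we wanted a slimmer report, the config argument accepts a tailored configure_report(). A hedged sketch; the file name and title are illustrative:
#skip the PCA and QQ sections, and name the output file and report title
create_report(nhlplayoffs,
              output_file = "nhl_eda_report.html",
              report_title = "NHL Playoffs EDA",
              config = configure_report(add_plot_prcomp = FALSE, add_plot_qq = FALSE))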
These last two functions speed up data exploration significantly, since all of the functions within DataExplorer are called automatically and their output is written to a report file for quick and easy data exploration. With a few clicks, and practically one line of code, a large majority of the EDA process is complete. The time saved can then be used to focus on understanding the data and starting to form hypotheses and statistical tests.
These two functions will not be run in this demonstration since all of the information that would have been included has already been covered, and the output of create_report is an output file.
DataExplorer has several additional functions not covered in the example walk through that might be beneficial during the EDA process. They will be covered at a very high level here:
drop_columns(): This function quickly drops variables from a dataset. It has two arguments: the dataset we are dropping columns from, and an index vector telling the function which columns to drop. You can pass either column names or column positions in the index vector. Note that the function updates the data.table directly when called.
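A minimal sketch (shown commented out, since the call would modify our table in place):
#drop the ties column by name, or equivalently by position (column 7)
#drop_columns(nhlplayoffs, "ties")
#drop_columns(nhlplayoffs, 7L)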
update_columns(): This function quickly updates variables within a dataset. For example, we can use it to change columns to a factor type, or apply a function such as x^2 to square all the data within the columns. It has three arguments: the dataset we are applying this to, an index vector telling the function which columns to update, and what we want to do with the variables, which should be a function or a character string naming the function to be called.
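A minimal sketch (commented out for the same reason):
#convert the two indicator columns to factors
#update_columns(nhlplayoffs, c("SalaryCapEra", "WonCup?"), as.factor)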
dummify(): This function turns any categorical variable into distinct binary columns. For example, if we have a column with three distinct values, dummifying it produces three columns of binary observations, with a 1 in the column that represents the original value of the observation. It has three arguments: the dataset we are applying this to, the maximum number of categories allowed for each feature, and the names of the variables that should be dummified.
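A minimal sketch, assuming the selection argument is named select as in recent versions of the package:
#create one binary column per SalaryCapEra value, leaving other columns alone
dummified <- dummify(nhlplayoffs, select = "SalaryCapEra")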
profile_missing(): This function is similar to plot_missing(), but the output is in tabular form, including the frequency, percentage, and suggested action for missing values. The function’s only argument is the dataset you would like to do the missing data analysis for.
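A minimal sketch; note that our earlier set_missing() call already filled the blanks, so at this point the table would report no missing values:
#tabular summary of missing data per variable
profile_missing(nhlplayoffs)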
group_category(): This function aids in the manipulation of sparse categories, i.e. categories with few populated observations; it groups sparsely populated categories together for further analysis, which can be useful when you have ‘other’-type categories you would like to analyze together. It has seven arguments. The first is the dataset you are grouping categories within. The second is the name of the variable the grouping should take place in. The third is the threshold percentage to group by: if we specify 15%, the categories that make up the bottom 15% of the data will be grouped. The measure argument comes next, and can be used when we want to group the data in one variable by another variable in the dataset, which is what we specify here. The final three arguments are update, category_name, and exclude. The update argument specifies whether the original data table should be updated directly (TRUE) or not (FALSE). The category_name argument is the name of the new grouped category; if not specified, a default name is used. Finally, the exclude argument specifies categories to be excluded from grouping, which is useful if you have sparse categories you do not want grouped.
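A minimal sketch of the summary form; the 20% threshold is illustrative, and update is left at its default of FALSE so the table is not modified:
#view how the teams making up the bottom 20% of rows would be grouped
group_category(nhlplayoffs, feature = "team", threshold = 0.2)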
Hopefully this demonstration shows how the DataExplorer package can help automate and simplify one of the most critical components of the data analysis process. With practically one line of code, create_report(your dataset), you will have a report full of information relevant to your dataset. This lets you spend less time deciding how to approach data exploration and more time actually exploring your data and drawing insights from it. If there are specific trends or patterns you would like to explore further, you can run any of the individual functions from the package to create your own exploration output. And finally, if you choose to manipulate your data in any way as a result of your exploration, the DataExplorer package has numerous functions to make updating your dataset easy.