Exploratory Data Analysis in Python using pandas
This video provides a clear and concise demonstration of how to perform data pre-processing and exploratory data analysis (EDA) using the pandas library in Python. The presenter uses a real-world dataset of NBA player statistics obtained via web scraping and performs various data cleaning and analysis tasks to gain insights into the dataset. The tutorial covers important EDA concepts such as summarizing the data using summary statistics, counting unique values in a column, and grouping data by a specific column.
Data gathering, shape and type
1-Data scraping:
This step scrapes the data directly from the Basketball Reference website and puts the table into a data frame called DF2019. It then removes the redundant headers, and the cleaned table is stored in the raw variable.
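A minimal sketch of the header-cleaning part of this step. The small table is made up and stands in for the scraped page; in practice DF2019 would come from reading the web page itself. We assume here that the redundant headers are rows that repeat the column names, and that they can be recognized by the text "Age" appearing in the Age column.

```python
import pandas as pd

# Made-up stand-in for the scraped table: the header row repeats inside the data
DF2019 = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player", "Player C"],
    "Age": ["25", "31", "Age", "22"],
    "PTS": ["20.1", "15.3", "PTS", "9.8"],
})

# Assumption: a redundant header row has the text "Age" in the Age column
header_rows = DF2019[DF2019["Age"] == "Age"].index

# Drop those rows and keep the cleaned table in raw
raw = DF2019.drop(header_rows)

print(raw)
```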
2-Check missing values:
We use the isnull() function to check the table for any missing values
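As a small illustration, isnull() can be combined with sum() to count the missing values in each column. The sample values are invented for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up sample with a few gaps
raw = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "PTS": [20.1, np.nan, 9.8],
    "3P%": [np.nan, np.nan, 0.35],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing_per_column = raw.isnull().sum()

print(missing_per_column)
```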
3-Data shape:
The data table contains 708 rows and 30 columns
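The row and column counts can be read from the shape attribute. A short sketch on a made-up table (the real dataset is much larger):

```python
import pandas as pd

# Made-up sample table
raw = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Pos": ["PG", "C", "SF"],
})

# shape is an attribute (no parentheses): a (rows, columns) tuple
n_rows, n_cols = raw.shape

print(n_rows, n_cols)
```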
4-Write the table data to a csv file
We use the to_csv() function to write the data into a csv file
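A minimal sketch of this step. The file name nba2019.csv is hypothetical, and writing into a temporary folder and passing index=False are choices made for this example only.

```python
import os
import tempfile

import pandas as pd

# Made-up sample table
raw = pd.DataFrame({"Player": ["Player A", "Player B"], "PTS": [20.1, 15.3]})

# Hypothetical file name; a temporary folder keeps the example free of side effects
csv_path = os.path.join(tempfile.mkdtemp(), "nba2019.csv")

# index=False leaves the row index out of the file
raw.to_csv(csv_path, index=False)

# Read the file back to confirm what was written
restored = pd.read_csv(csv_path)
print(restored)
```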
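5-Show the entire dataset
By default pandas truncates long tables when it displays them. To show all of the data, we call the pd.set_option() function to lift the display limits on rows and columns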
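One way to do this, as a sketch: set the display.max_rows and display.max_columns options to None. Which options the presenter actually sets is an assumption here.

```python
import pandas as pd

# None removes the limit, so rows and columns are no longer truncated on display
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# get_option() reads the current setting back
print(pd.get_option("display.max_rows"))
print(pd.get_option("display.max_columns"))
```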
6-Data type
To find the data type of each column in our dataset, we use the df.dtypes attribute (it is an attribute, so it takes no parentheses)
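A short sketch on a made-up table, showing that a column of numbers stored as text gets a different type from a true numeric column:

```python
import pandas as pd

# Made-up sample table
df = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Age": ["25", "31"],   # numbers stored as text
    "PTS": [20.1, 15.3],   # real floating-point numbers
})

# dtypes lists one data type per column
print(df.dtypes)
```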
Ask questions and find out which commands can help us answer our questions
1-Conditional selection:
We use it when we want to show only the rows or columns of the dataset that match a particular condition
Ex: Which player scored the most Points (PTS) Per Game?
To find out which player scored the most points, we use the max() function, which returns the maximum of a given variable (in our example, the PTS column),
and then select the rows whose PTS value equals that maximum
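The idea can be sketched in two steps on made-up data; the player names and values are placeholders, not figures from the real dataset.

```python
import pandas as pd

# Made-up sample table
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "PTS": [20.1, 36.1, 9.8],
})

# Step 1: highest value in the PTS column
top_pts = df["PTS"].max()

# Step 2: conditional selection keeps only the rows whose PTS equals that maximum
top_scorer = df[df["PTS"] == top_pts]

print(top_scorer)
```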
2-Example Question
Let’s say we want to find out which player had the highest 3-Point Field Goals Per Game (3P).
We take the maximum value of the 3P column and then return the rows that match the condition we define, namely the rows whose 3P value equals that maximum
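The same pattern can be written in a single expression. This sketch uses made-up values; note that a column name starting with a digit, such as 3P, has to be accessed with bracket notation.

```python
import pandas as pd

# Made-up sample table
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "3P": [2.1, 5.1, 0.4],
})

# df["3P"].max() is the condition value; the outer df[...] returns the matching rows
best_3p = df[df["3P"] == df["3P"].max()]

print(best_3p)
```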
Visuals
Create a histogram to visualize our dataset
1-Create a new data frame from our dataset
We create a subset of our dataset by selecting two columns (position and points)
We then assign it to a new data frame called PTS and keep only the rows whose position is one of five positions from the position column
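A possible sketch of this step. The column names Pos and PTS and the five position codes (C, PF, SF, PG, SG) are assumptions made for illustration, as is the use of isin() for the filter.

```python
import pandas as pd

# Made-up sample table; the last row has a combined position we want to exclude
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Pos": ["C", "PF", "SF", "PG", "SG", "SF-SG"],
    "PTS": [12.0, 8.5, 20.1, 15.3, 9.8, 4.2],
})

# Select the two columns of interest
PTS = df[["Pos", "PTS"]]

# Keep only the rows whose position is one of the five listed
positions = ["C", "PF", "SF", "PG", "SG"]
PTS = PTS[PTS["Pos"].isin(positions)]

print(PTS)
```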
2-Show the histogram
To show the histogram, we use the pandas built-in function
We take our new data frame, PTS, and select the PTS column
We then call the built-in hist() function, grouped by position, which displays the histogram
Our data frame has five positions, so we get five different histograms
3-We can change the histogram layout
We can pass the layout argument to hist(), for example layout=(1, 5), to show the histograms in one row and five columns
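A minimal sketch of the grouped histogram with a one-row layout, on made-up data. The column names Pos and PTS, the figsize value, and the off-screen Agg backend are assumptions for this example.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch also runs without a display
import pandas as pd

# Made-up sample: two players per position
PTS = pd.DataFrame({
    "Pos": ["C", "C", "PF", "PF", "SF", "SF", "PG", "PG", "SG", "SG"],
    "PTS": [12.0, 8.5, 10.2, 6.1, 20.1, 7.7, 15.3, 11.0, 9.8, 18.4],
})

# One histogram of PTS per position, arranged as 1 row by 5 columns
axes = PTS.hist(column="PTS", by="Pos", layout=(1, 5), figsize=(16, 3))
```

Leaving out the layout argument lets pandas choose the grid arrangement itself.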
4-Using the built-in pandas box plot function to visualize our data frame
The boxplot() function draws our data as boxes, one per group, instead of histogram bars
We call PTS.boxplot(), the built-in pandas function, on the PTS data frame
As arguments we pass the PTS column and group by position
As we have five positions in our dataset, each position has its own box in the data visualization
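The call described above might look like the following sketch. The data is made up, and the column names Pos and PTS are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch also runs without a display
import pandas as pd

# Made-up sample: two players per position
PTS = pd.DataFrame({
    "Pos": ["C", "C", "PF", "PF", "SF", "SF", "PG", "PG", "SG", "SG"],
    "PTS": [12.0, 8.5, 10.2, 6.1, 20.1, 7.7, 15.3, 11.0, 9.8, 18.4],
})

# column picks the values to plot; by splits them into one box per position
ax = PTS.boxplot(column="PTS", by="Pos")
```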
Heatmap
To make the heatmap, we use the seaborn heatmap() function and pass the corr data frame, which contains the correlation coefficient matrix, as its argument
This gives us a heatmap of the intercorrelation matrix, showing how each variable correlates with every other variable
The white line along the diagonal represents a correlation coefficient of 1, since every variable is perfectly correlated with itself
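A minimal sketch of building corr and plotting it, using made-up numbers. The column names are placeholders, and only default heatmap settings are used, so the colors may differ from those in the video.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch also runs without a display
import pandas as pd
import seaborn as sns

# Made-up numeric sample
df = pd.DataFrame({
    "PTS": [20.1, 15.3, 9.8, 12.0, 25.4],
    "AST": [5.0, 7.2, 1.1, 2.3, 6.8],
    "TRB": [4.1, 3.0, 9.5, 8.2, 5.5],
})

# Correlation coefficient of every column with every other column
corr = df.corr()

# Draw the matrix as a color-coded grid
ax = sns.heatmap(corr)

print(corr)
```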
Scatter Plot
1-We use the select_dtypes() function to select only the numerical data types
2-We assign the result to a variable called number and select only the first five columns
We take our variable, number, and use the .iloc indexer, which selects by integer position. Inside the brackets [], the first colon : selects all of the rows, and :5 after the comma selects the first five columns (positions 0 to 4)
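These two steps can be sketched as follows. The table is made up, and the name first_five is a hypothetical variable introduced for this example.

```python
import pandas as pd

# Made-up sample with two text columns and six numeric columns
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Pos": ["PG", "C", "SF"],
    "Age": [25, 31, 22],
    "G": [70, 65, 80],
    "MP": [30.1, 24.5, 33.0],
    "FG": [7.2, 4.1, 8.0],
    "FGA": [15.0, 8.3, 17.1],
    "PTS": [20.1, 10.3, 22.8],
})

# Keep only the numeric columns
number = df.select_dtypes(include=["number"])

# All rows (:) and the first five columns (:5, positions 0 to 4)
first_five = number.iloc[:, :5]

print(first_five)
```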
3-Create a scatter plot grid for the first five columns
We use seaborn's sns.PairGrid() (note the capitalization) to create a scatter plot grid from our dataset
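A minimal sketch of the grid on made-up data. Mapping plt.scatter onto every cell is one common way to fill a PairGrid and is an assumption about how the grid is drawn here.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample with five numeric columns
number = pd.DataFrame({
    "Age": [25, 31, 22, 28, 34],
    "G": [70, 65, 80, 55, 60],
    "MP": [30.1, 24.5, 33.0, 18.2, 27.4],
    "FG": [7.2, 4.1, 8.0, 2.9, 5.5],
    "FGA": [15.0, 8.3, 17.1, 6.8, 12.2],
})

# PairGrid builds an empty 5 x 5 grid of axes, one cell per pair of columns
g = sns.PairGrid(number)

# map() draws a scatter plot in every cell of the grid
g.map(plt.scatter)
```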