Exploratory Data Analysis in Python using pandas


This video provides a clear and concise demonstration of how to perform data pre-processing and exploratory data analysis (EDA) using the Pandas library in Python. The presenter uses a real-world dataset of NBA Basketball player stats data obtained via web scraping and performs various data cleaning and analysis tasks to gain insights into the dataset. The tutorial covers important EDA concepts such as summarizing the data using summary statistics, counting unique values in a column, and grouping data by a specific column.

 

Data gathering, shape and type

    

1-Data Scraping

The code for this step scrapes the data directly from the Basketball Reference website. The table is loaded into a data frame called DF2019, the redundant header rows that repeat inside the table are removed, and the cleaned result is stored in the raw variable.
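A minimal sketch of this step is given here. The URL, the function names, and the choice of the Age column to detect repeated header rows are assumptions, since the text only names the website and the two variables. The toy table at the end uses invented values so the cleaning logic can be tried without a network connection.

```python
import pandas as pd

# Assumed URL: the text only says "the Basketball Reference website".
URL = "https://www.basketball-reference.com/leagues/NBA_2019_per_game.html"


def drop_repeated_headers(table: pd.DataFrame) -> pd.DataFrame:
    """Remove rows that merely repeat the column names inside the table body."""
    repeated = table["Age"] == "Age"
    return table[~repeated].reset_index(drop=True)


def scrape_player_stats(url: str = URL) -> pd.DataFrame:
    """Download the first HTML table on the page and clean it."""
    tables = pd.read_html(url, header=0)  # needs lxml, or bs4 with html5lib
    df2019 = tables[0]  # called DF2019 in the text
    return drop_repeated_headers(df2019)


# Offline demonstration on an invented table with one repeated header row.
toy = pd.DataFrame(
    {
        "Player": ["Player A", "Player", "Player B"],
        "Age": ["25", "Age", "31"],
        "PTS": ["20.1", "PTS", "8.4"],
    }
)
raw = drop_repeated_headers(toy)
print(raw)
```

Calling scrape_player_stats() would perform the actual download; it is not invoked here because it depends on the live page.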

 

2-Check missing values:

We use the isnull() function to check the table for any missing values.
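As an illustration, isnull() can be combined with sum() to count the missing values in each column. The small table and its values are invented for this sketch.

```python
import pandas as pd

# Invented toy data; None becomes NaN in numeric columns.
raw = pd.DataFrame(
    {
        "Player": ["Player A", "Player B", "Player C"],
        "PTS": [20.1, None, 8.4],
        "3P": [2.0, 1.1, None],
    }
)

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = raw.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```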

 

 

3-Data shape:

The data table contains 708 rows and 30 columns  
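The row and column counts come from the shape attribute of a data frame. A short sketch on an invented toy table:

```python
import pandas as pd

# Invented toy data with 3 rows and 2 columns.
toy = pd.DataFrame(
    {"Player": ["Player A", "Player B", "Player C"], "PTS": [20.1, 31.5, 8.4]}
)

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape
print(f"{n_rows} rows and {n_cols} columns")
```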

4-Write the table data to a csv file

We use the to_csv() function to write the data to a CSV file.
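A sketch of writing the table out and reading it back, assuming a hypothetical file name in the system's temporary directory. The toy values are stored as text, as can happen with scraped tables, so reading the file back also lets pandas infer numeric types.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented toy data; the numbers are stored as strings on purpose.
raw = pd.DataFrame({"Player": ["Player A", "Player B"], "PTS": ["20.1", "8.4"]})

# Hypothetical output location.
csv_path = Path(tempfile.gettempdir()) / "nba_toy.csv"

# index=False keeps the row index out of the file.
raw.to_csv(csv_path, index=False)

# Reading the file back lets pandas infer a numeric type for PTS.
df = pd.read_csv(csv_path)
print(df.dtypes)
```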

 

5-Show the entire dataset

To show all the data without truncation, we use the set_option() function to raise the display limits for rows and columns.
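One way to do this, sketched on an invented 100-row table, is to raise the display.max_rows and display.max_columns options. The exact options used in the video are not stated in the text, so this choice is an assumption.

```python
import pandas as pd

# Invented toy data with more rows than pandas prints by default.
df = pd.DataFrame({"row": range(100)})

# Allow as many rows as the table has, and do not limit the columns.
pd.set_option("display.max_rows", df.shape[0] + 1)
pd.set_option("display.max_columns", None)

print(df)
```

The defaults can be restored afterwards with pd.reset_option("display.max_rows") and pd.reset_option("display.max_columns").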

 

6-Data Type

To find the data type of each column in our dataset, we use the df.dtypes attribute (an attribute, so it is written without parentheses).
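A short sketch on invented data, showing that dtypes returns one entry per column:

```python
import pandas as pd

# Invented toy data with text, integer, and floating-point columns.
df = pd.DataFrame(
    {"Player": ["Player A", "Player B"], "Age": [25, 31], "PTS": [20.1, 8.4]}
)

# dtypes is a Series indexed by column name.
print(df.dtypes)
```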

 

 

 

 

 Ask questions and find out which commands can help us answer our questions

 

1-Conditional Selection:

We use it when we want to show only the rows or columns of the dataset that match a particular condition.

Ex: Which player scored the most Points (PTS) Per Game?

To find out which player scored the most points, we use the max() function, which returns the maximum of a given variable (in our example, the PTS column).
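The idea can be sketched as a boolean mask that keeps only the rows whose PTS value equals the column maximum. The player names and values are invented.

```python
import pandas as pd

# Invented toy data.
df = pd.DataFrame(
    {"Player": ["Player A", "Player B", "Player C"], "PTS": [20.1, 31.5, 8.4]}
)

# The comparison yields True or False per row; indexing keeps the True rows.
top_scorer = df[df["PTS"] == df["PTS"].max()]
print(top_scorer)
```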

 

 

2-Example Question

For example, let’s say we want to find out which player had the highest 3-Point Field Goals Per Game (3P).

The max() function gives us the maximum value of the 3P column, and the conditional selection then returns the rows matching the condition we define.
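A sketch of the same pattern for the 3P column, on invented data with a tie so that more than one row matches. Because 3P starts with a digit it is not a valid Python attribute name, so bracket notation is needed here.

```python
import pandas as pd

# Invented toy data; two players share the highest 3P value.
df = pd.DataFrame(
    {
        "Player": ["Player A", "Player B", "Player C", "Player D"],
        "3P": [2.0, 3.4, 1.1, 3.4],
    }
)

# df.3P would be a syntax error, so the column is accessed as df["3P"].
max_3p = df["3P"].max()
best_shooters = df[df["3P"] == max_3p]
print(best_shooters)
```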

 

Visuals

Create a histogram to visualize our dataset

 

1-Create a new data frame from our dataset

We create a subset of our dataset by selecting two columns (position and points) and assign it to a new data frame called PTS. We then keep only the rows that belong to one of the five positions in the position column.
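A possible version of this step, assuming the columns are named Pos and PTS and that the five positions are labelled C, PF, SF, PG, and SG (the text names neither the position column nor the labels). The rows are invented, including one combined position that gets filtered out.

```python
import pandas as pd

# Invented toy data; "SF-SG" stands for a player listed under two positions.
df = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D", "E", "F"],
        "Pos": ["C", "PF", "SF", "PG", "SG", "SF-SG"],
        "PTS": [12.3, 9.8, 15.1, 20.4, 17.6, 7.2],
    }
)

# Assumed labels for the five positions.
positions = ["C", "PF", "SF", "PG", "SG"]

# Select the two columns, then keep only rows whose position is in the list.
PTS = df[["Pos", "PTS"]]
PTS = PTS[PTS["Pos"].isin(positions)]
print(PTS)
```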

 

 

 

2-Show the histogram

To show the histogram, we use a pandas built-in function. We take our new data frame, PTS, select its PTS column, and then call the built-in hist() function, grouped by position, which displays the histogram.

 

 

Our data frame has five positions, so we get five different histograms.
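The call described above might look like the following sketch. The column names Pos and PTS, the function name, and the toy values are assumptions. The plot is wrapped in a function because drawing it requires matplotlib.

```python
import pandas as pd

# Invented toy data with two players per position.
PTS = pd.DataFrame(
    {
        "Pos": ["C", "C", "PF", "PF", "SF", "SF", "PG", "PG", "SG", "SG"],
        "PTS": [12.3, 8.1, 9.8, 14.2, 15.1, 6.4, 20.4, 11.0, 17.6, 5.9],
    }
)


def plot_points_by_position(frame: pd.DataFrame):
    """Draw one histogram of PTS for each value of Pos (needs matplotlib)."""
    return frame["PTS"].hist(by=frame["Pos"])


print(PTS["Pos"].nunique(), "positions, so one histogram each")
```

In a notebook, plot_points_by_position(PTS) renders the figure directly; in a script, matplotlib.pyplot.show() would be called afterwards.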

 

 

 

3-We can change the histogram layout

 

We can use the layout argument of hist() to show the histograms in one row and five columns.
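A sketch of the layout change, again assuming the Pos and PTS column names and invented values. The layout is given as (rows, columns) and needs at least as many cells as there are groups; the figsize value is an illustrative choice.

```python
import pandas as pd

# Invented toy data with two players per position.
PTS = pd.DataFrame(
    {
        "Pos": ["C", "C", "PF", "PF", "SF", "SF", "PG", "PG", "SG", "SG"],
        "PTS": [12.3, 8.1, 9.8, 14.2, 15.1, 6.4, 20.4, 11.0, 17.6, 5.9],
    }
)

LAYOUT = (1, 5)  # one row, five columns


def plot_points_in_one_row(frame: pd.DataFrame):
    """Draw the per-position histograms side by side (needs matplotlib)."""
    return frame["PTS"].hist(by=frame["Pos"], layout=LAYOUT, figsize=(16, 3))
```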

 

4-Using the built-in pandas box plot function to visualize our data frame

 

The boxplot() function visualizes our data frame as box plots. We call boxplot(), which is a built-in pandas function, on the PTS data frame, and as arguments we pass the PTS column and group by position.

 

As we have five positions in our dataset, each position has its own box in the visualization.
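A sketch of the box plot call under the same assumptions as before (columns named Pos and PTS, invented values, hypothetical function name):

```python
import pandas as pd

# Invented toy data with two players per position.
PTS = pd.DataFrame(
    {
        "Pos": ["C", "C", "PF", "PF", "SF", "SF", "PG", "PG", "SG", "SG"],
        "PTS": [12.3, 8.1, 9.8, 14.2, 15.1, 6.4, 20.4, 11.0, 17.6, 5.9],
    }
)


def plot_points_boxes(frame: pd.DataFrame):
    """Draw one box of PTS per position on shared axes (needs matplotlib)."""
    return frame.boxplot(column="PTS", by="Pos")


# The groups that will each receive a box.
groups = sorted(PTS["Pos"].unique())
print(groups)
```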

 

 

 

 

 Heatmap

To make the heatmap, we use the seaborn heatmap() function and pass it, as an argument, the corr data frame, which contains the correlation coefficient matrix.

 

This gives us a heatmap of the intercorrelation matrix, showing the correlation of each variable with every other variable.

 

The white line along the diagonal represents a correlation coefficient of 1, since each variable is perfectly correlated with itself.
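A sketch of how corr could be built and plotted, using invented values. How the video computes corr is not shown in the text, so the use of DataFrame.corr() on the numeric columns and the heatmap arguments are assumptions. seaborn is imported inside the function so that the correlation step runs even without the plotting library installed.

```python
import pandas as pd

# Invented toy data; Pos is text and is left out of the correlation.
df = pd.DataFrame(
    {
        "Pos": ["C", "PF", "SF", "PG", "SG"],
        "Age": [22, 25, 28, 31, 34],
        "PTS": [8.4, 20.1, 15.3, 12.0, 6.5],
        "AST": [1.2, 5.5, 3.1, 2.4, 0.9],
    }
)

# Pairwise correlation coefficients of the numeric columns.
corr = df.select_dtypes(include="number").corr()
print(corr)


def plot_heatmap(matrix: pd.DataFrame):
    """Draw the correlation matrix as a heatmap (needs seaborn)."""
    import seaborn as sns  # imported lazily; optional dependency

    return sns.heatmap(matrix, vmax=1, square=True)
```

The diagonal of corr is 1 for every variable, which is what appears as the light diagonal line in the plot.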

 

 

 Scatter Plot

 

1-We will use the select_dtypes() function to select only the numerical data types
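A short sketch on invented data. Passing include=["number"] is one common way to keep only numeric columns; the exact argument used in the video is not stated.

```python
import pandas as pd

# Invented toy data with two text columns and two numeric columns.
df = pd.DataFrame(
    {
        "Player": ["Player A", "Player B"],
        "Pos": ["C", "PG"],
        "Age": [25, 31],
        "PTS": [20.1, 8.4],
    }
)

# Keep only integer and floating-point columns.
numeric_only = df.select_dtypes(include=["number"])
print(numeric_only)
```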

 

 

 

2-We will assign the above values to a variable called number and select only the first five columns

We take our variable, number, and use the .iloc indexer, which selects by integer position. Inside the brackets [], the first colon : means we select all of the rows, and :5 after the comma means we select the first five columns.
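A sketch of the slice on an invented table with seven numeric columns. The name selections for the result is an assumption; number follows the text.

```python
import pandas as pd

# Invented toy data with seven numeric columns.
number = pd.DataFrame(
    {
        "Age": [25, 31],
        "G": [70, 64],
        "MP": [30.2, 18.5],
        "FG": [7.1, 3.0],
        "3P": [2.0, 1.1],
        "AST": [5.5, 2.4],
        "PTS": [20.1, 8.4],
    }
)

# All rows (:) and the columns at positions 0 to 4 (:5, end excluded).
selections = number.iloc[:, :5]
print(selections)
```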

 

3-Create Scatterplot grid for the first five columns

We will use the sns.PairGrid() class from seaborn to create a scatterplot grid from our dataset.
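A sketch of the grid, assuming the selected columns are stored in a data frame called selections (a hypothetical name) and that each cell is drawn with matplotlib's scatter function. The values are invented, and the plotting libraries are imported inside the function so the data preparation runs without them.

```python
import pandas as pd

# Invented toy data standing in for the first five numeric columns.
selections = pd.DataFrame(
    {
        "Age": [22, 25, 28, 31, 34],
        "G": [70, 64, 58, 75, 41],
        "MP": [30.2, 18.5, 24.0, 33.1, 12.7],
        "FG": [7.1, 3.0, 4.4, 8.2, 1.9],
        "3P": [2.0, 1.1, 0.4, 3.4, 0.7],
    }
)


def plot_scatter_grid(frame: pd.DataFrame):
    """Draw a scatter plot for every pair of columns (needs seaborn)."""
    import matplotlib.pyplot as plt  # imported lazily; optional dependency
    import seaborn as sns

    grid = sns.PairGrid(frame)
    grid.map(plt.scatter)  # apply the same plot function to every cell
    return grid


# Five columns give a 5 x 5 grid of panels.
n_panels = selections.shape[1] ** 2
print(n_panels, "panels")
```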