Predictive analytics is the branch of data analytics used to predict what will happen in the future. It makes predictions from existing data sets, helping to identify new trends and patterns and to forecast future outcomes from historical data. Rain is an essential part of our lives, and the weather department works to forecast when it will fall; each rainfall lets us collect an enormous amount of data. In this assignment, we will predict rain for the next day by using a historical dataset.
The primary steps involved in predictive analytics are:
Data Collection: This step involves gathering the historical data on which the predictive analysis will be performed, along with any details required for the analysis.
Data Cleaning: Data cleaning is the process of identifying incorrect data and fixing or removing inaccurate records so that the dataset is easier to use. It includes removing erroneous entries, filling in missing values, and uncovering hidden values, because working with inaccurate data is a massive waste of time.
Data Analysis: Data analysis involves exploring the data to identify patterns and to discover useful information hidden in the historical records.
Predictive Modeling: Predictive modeling is a mathematical process, and a commonly used statistical technique, for predicting future outcomes. In this step, various algorithms are used to build models based on the patterns found in the data.
Validation: In this step, we check how well the model performs; it is evaluated for its accuracy before being put to use.
Predictive Deployment: Deployment makes the analytical results available so they can be used in practice.
Model Monitoring: After deployment, we monitor the model to make sure its performance does not degrade over time.
In this assignment, the Jupyter Notebook tool is used to perform a predictive analysis of the given dataset. The Kaggle website hosts many datasets suitable for demonstrating predictive analysis; the sample chosen here is a weather dataset containing about ten years of daily weather observations from many locations across Australia. It includes important attributes such as wind speed at different times of day, wind direction, whether it rained today and whether it rained the following day (RainToday and RainTomorrow), and other relevant measurements.
Step 1: Install and Import Libraries
The first step is to install and import some basic libraries such as numpy, pandas, seaborn, and matplotlib. pip is the package manager for Python packages, so we can use the pip command to install them.
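A minimal setup sketch is given below; the pip command runs once, from a terminal or a notebook cell, and the import aliases (np, pd, sns, plt) are the conventional ones.

# Install the required packages (run once):
#   pip install numpy pandas seaborn matplotlib

import numpy as np               # numerical computing
import pandas as pd              # data tables (DataFrames)
import seaborn as sns            # statistical visualization
import matplotlib.pyplot as plt  # plotting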
Step 2: Data Collection
This step involves collecting the data needed for the analysis. We import the .csv file of the dataset into our notebook as a DataFrame named df so that we can analyze the data easily. We also import the warnings package: the warnings filter maintains an ordered list of filter specifications, each warning is matched against the specifications in turn until one matches, and the matching filter determines what happens to the warning. Save the given dataset as weatherAUS.csv before following the steps below.
We can load the dataset into our notebook with a couple of commands.
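A minimal sketch follows, assuming weatherAUS.csv sits in the notebook's working directory:

import warnings
warnings.filterwarnings('ignore')   # silence warning output, as described above

df = pd.read_csv('weatherAUS.csv')  # read the CSV file into a DataFrame named df
df.head()                           # show the first five rows to confirm the load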
After importing the dataset, we can check the details of the data. There are 22 columns in the dataset, holding several pieces of information about the weather conditions. We can use further commands such as df.shape and df.describe() to inspect the dataset.
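A sketch of these standard pandas inspection calls (df.info(), not mentioned above, is one more standard call that lists column types and non-null counts):

df.shape       # (number of rows, number of columns)
df.describe()  # summary statistics for the numeric columns
df.info()      # column names, data types, and non-null counts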
Step 3: Data Cleaning
This step involves cleaning the data. Data cleaning is the process of identifying incorrect data and fixing or removing inaccurate records to make the dataset easier to use. Common inaccuracies include missing values, misplaced entries, and wrong attribution; data containing such inaccuracies is often called dirty data. When different data sets are combined, the chances of duplicated or mislabeled records increase. A data cleaning process removes these kinds of errors and makes the data more reliable, improving its quality so that predictions are more accurate.
In the first step, we check for null values by using the isnull() command; we can see that all but four of the columns contain many null values.
We can fill those null values with zero.
When we check again, we see that there are no null values left in the dataset.
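A sketch of these three cleaning commands:

df.isnull().sum()  # count the missing values in each column
df = df.fillna(0)  # replace every missing value with zero
df.isnull().sum()  # check again: every count should now be zero

Filling every column with zero is the simplest option; numeric columns are often filled with a mean or median and categorical columns with a mode instead, but zero-filling is enough for this walkthrough.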
Step 4: Data Analysis
Data analysis involves exploring the data to identify patterns and discover useful information in the historical records. After checking details of the dataset such as its columns and rows, we can create a visualization that will be useful in making predictions.
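One simple exploration, assuming the target column is named RainTomorrow as in the Kaggle dataset:

df['RainTomorrow'].value_counts()                # observations per category
df['RainTomorrow'].value_counts(normalize=True)  # the same counts as proportions

The proportions make any class imbalance obvious, which matters for the prediction made in the next step.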
Step 5: Predictive Modeling
Data visualization is the graphical representation of information or data using visual elements such as charts, graphs, or maps; data visualization tools provide the ability to see and understand trends, outliers, and patterns in an easy, intuitive way. Predictive modeling is the mathematical process that uses these patterns to predict future outcomes. In this step, we create a bar graph from the "RainTomorrow" column of the dataset to predict the status of tomorrow's rainfall.
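A minimal sketch of one way to draw this bar graph, using seaborn's countplot:

sns.countplot(x='RainTomorrow', data=df)  # one bar per RainTomorrow category
plt.title('Frequency distribution of RainTomorrow')
plt.ylabel('Number of observations')
plt.show()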
Step 6: Validation
In this step, we check how well our model performs by evaluating it for its accuracy.
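The walkthrough above makes a frequency-based prediction rather than training a model, so as an illustrative sketch only, the block below shows how a simple model could be validated on held-out data with scikit-learn (an extra library, installable with pip install scikit-learn); the feature columns chosen here are an assumption for illustration, not part of the original steps.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Keep only the rows where the label was actually recorded (not zero-filled)
labeled = df[df['RainTomorrow'].isin(['Yes', 'No'])]

# A few numeric columns, chosen purely for illustration
features = ['MinTemp', 'MaxTemp', 'Rainfall', 'Humidity9am', 'Humidity3pm']
X = labeled[features]
y = labeled['RainTomorrow']

# Hold out 20% of the rows so the model is scored on data it has never seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))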
To recap: we imported the necessary libraries and then uploaded the dataset into the Jupyter Notebook tool. The df.head() and df.tail() commands show the top and bottom rows of the dataset, respectively, and we first ran a few commands like these to analyze the data correctly. After collecting the data we cleaned it, and then performed the data analysis. The df.describe() command is used to view the statistical properties of the variables, and using these statistics we visualized the frequency distribution of the RainTomorrow variable.
The graph consists of two bars, No and Yes. The No bar counts the days for which no rain was recorded on the following day, and the Yes bar counts the days for which rain was recorded. Because the No bar has far more entries than the Yes bar, we predict that there will be no rain tomorrow. So, by following all of these steps, we can analyze the data quickly and predict the future.
Dataset:
You can download the dataset for practice by using the link below:
After using the dataset and following the steps given above, you can easily make predictions by using the historical data.
Thank You and Happy Learning!