
Data-Driven Dispatch: Utilizing Supervised Learning for Accurate Predictions | Written by John Lenehan | August 2023

Introduction:

Exploratory Data Analysis (EDA) is an essential step in preparing data for machine learning models. It involves visualizing and analyzing different features of the dataset to gain insights, identify outliers, and make informed decisions regarding feature engineering.

In this process, histograms are plotted to show the distribution of the numerical columns in the dataset. These histograms provide an overview of the data distribution and help in identifying outliers. For instance, in the given dataset the latitude data is found to be bimodal and the longitude data right-skewed. To prepare these features, a logarithmic transformation and standard scaling are applied, while trigonometric functions are later used to encode cyclic features such as the crash hour.

Data encoding is another important step in data preprocessing where non-numerical data is converted into a numerical format. This is typically achieved through label encoding, where each category in a column is assigned a numerical value.

Once the data is preprocessed and encoded, it is split into training and test sets to build a machine learning model. In this project, a K-Nearest Neighbors (KNN) classification model is used to predict results based on the given features. The KNN model utilizes the values of the nearest known data points to classify unknown data points.

The performance of the model is evaluated using metrics like accuracy, precision, recall, and F1 score. These metrics provide insights into the model’s ability to make accurate predictions. In the initial evaluation, the model exhibited high test accuracy, precision, recall, and F1 score. However, the significant difference between the test accuracy and train accuracy indicated overfitting, leading to poor performance on unseen data.

To improve the model’s fit, hyperparameter tuning is performed using k-fold cross-validation. Different distance metrics, such as Euclidean and Manhattan, are explored to find the best combination of hyperparameters that result in optimal model performance.

Overall, the EDA process, data encoding, and hyperparameter tuning play crucial roles in preparing the data and building an effective machine learning model for predicting crash types based on the given features.

Full Article:

Exploratory Data Analysis

Before we can proceed with the machine learning model, we need to perform some exploratory data analysis (EDA). In this step, we will plot histograms of each column in the data frame to understand the distribution of the data. Histograms are useful in EDA because they provide an overview of the data distribution, help identify outliers, and assist in making decisions about feature engineering.

Histograms of columns in the final dataset

To visualize the distribution of the data, we plot histograms of each column in the final dataset. By setting the number of bins to 50, we ensure that the histogram provides a detailed view of the data distribution. The histograms are shown below:
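The plots themselves are not reproduced here, but the histogram step can be sketched with pandas as follows (the DataFrame name collisions is an assumption, since the article does not show its variable names):

```python
import matplotlib.pyplot as plt

# Plot a histogram of every numerical column; 50 bins gives a detailed
# view of each distribution. "collisions" is an assumed DataFrame name.
collisions.hist(bins=50, figsize=(15, 10))
plt.tight_layout()
plt.show()
```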

A cursory look at the column histograms reveals some interesting observations. The latitude data appears to be bimodal, indicating that there are two distinct groups within the data. The longitude data, on the other hand, is right-skewed, suggesting that the values are concentrated towards the lower end of the range. These observations will need to be addressed before the data can be used for machine learning.


Standardizing Latitude-Longitude Data

To ensure that the latitude and longitude data can be better applied for machine learning, we need to standardize these features. Standardization is a technique used in data preprocessing to transform features so that they have similar magnitudes. This is particularly important for machine learning models, as they are generally sensitive to the scale of input features.

To standardize the latitude and longitude data, we use the StandardScaler function from sklearn, which transforms each feature so that it has a mean of 0 and a standard deviation of 1. In addition, because the longitude values are all negative, a logarithmic transformation is applied to their absolute values before scaling.
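The article does not show the exact code for this step; the sketch below assumes the log is taken on the absolute longitude values and that the DataFrame is named collisions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Build a two-column frame: raw latitude and log of |longitude|.
# Using the absolute value is an assumption -- the longitude values are
# all negative and the exact transform isn't shown in the article.
latlon = collisions[["latitude", "longitude"]].copy()
latlon["longitude"] = np.log(latlon["longitude"].abs())

# Standardize both features to mean 0 and standard deviation 1.
scaler = StandardScaler()
collisions[["latitude_scaled", "longitude_scaled"]] = scaler.fit_transform(latlon)
```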

Scaling and Transformation

Scaling is a common technique used in data preprocessing to handle skewed or bimodal distributions. In this case, a logarithmic function is used to rescale the skewed longitude data: the logarithmic transformation makes skewed data more symmetrical and reduces the impact of outlier values.

By applying logarithmic transformation to the longitude data and normalizing the latitude data, we achieve the desired effect. The scaled latitude-longitude data is shown below:

Scaling Crash Hour Data

Another transformation that needs to be applied is to the crash hour data. It appears that the crash hour column has a cyclic nature, with periodic peaks and troughs. To encode this cyclic data, we use trigonometric functions such as sine and cosine.

In this case, we apply a sine transformation to the crash hour data. By converting the input to radians and calculating the sine of the input, we can capture the cyclic nature of the data. The transformed crash hour data is shown below:
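A sketch of this transformation, assuming the column is named crash_hour and holds the hour of day as 0-23:

```python
import numpy as np

# Convert the hour of day to radians on a 24-hour cycle and take the sine,
# so that hour 23 and hour 0 end up numerically close to each other.
collisions["crash_hour_sin"] = np.sin(2 * np.pi * collisions["crash_hour"] / 24)
```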

Finally, we remove the unscaled data from the model to avoid any interference with model predictions. We drop the previous latitude and longitude columns, as well as the crash hour column from the dataset.
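With pandas this is a single drop call (column names assumed):

```python
# Remove the unscaled originals so only the transformed features remain.
collisions = collisions.drop(columns=["latitude", "longitude", "crash_hour"])
```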

Data Encoding

Data encoding is an important step in data preprocessing, where non-numerical data is represented in a numerical format suitable for machine learning algorithms. In this model, we use a technique called label encoding to encode categorical data.

First, we segment the columns we want to keep from the original dataset and make a copy of the dataframe (collisions_ml). Then, we define the categorical columns and use the LabelEncoder function from sklearn to fit and transform the categorical columns.
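A sketch of this encoding step is shown below; the exact column names are not listed in the article, so the ones used here are illustrative:

```python
from sklearn.preprocessing import LabelEncoder

# Work on a copy containing only the columns kept for modelling
# (illustrative column names).
collisions_ml = collisions[["weather_condition", "lighting_condition",
                            "roadway_surface_cond", "crash_type",
                            "latitude_scaled", "longitude_scaled",
                            "crash_hour_sin"]].copy()

categorical_cols = ["weather_condition", "lighting_condition",
                    "roadway_surface_cond"]

# Fit and transform each categorical column to integer labels.
encoder = LabelEncoder()
for col in categorical_cols:
    collisions_ml[col] = encoder.fit_transform(collisions_ml[col])
```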

By encoding the categorical values in the dataset, we ensure that all features are represented in a numerical format compatible with machine learning algorithms. This allows us to proceed with the next steps of model building and evaluation.

Splitting the Train & Test Data

To build an effective machine learning model, it is important to separate the data into training and test sets. The training set is used to train the model on the correct responses, while the test set is used to evaluate the model’s performance. This separation helps to reduce the risk of overfitting and model bias.

In this case, we split the data into an 80-20 ratio, with 80% of the data used for training and the remaining 20% used for testing. We define the crash_type column as the target variable to be predicted, while all other features are used as input variables. The train_test_split function from sklearn is used for this purpose.
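In code, the split might look like this (the fixed random_state is an assumption, added for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Features are everything except the target column.
X = collisions_ml.drop(columns=["crash_type"])
y = collisions_ml["crash_type"]

# 80/20 split between training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```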

K-Nearest Neighbors Classification

For this project, we chose to use a K-Nearest Neighbors (KNN) classification model to predict the results. KNN models work by checking the value of the K nearest known data points around an unknown data point, and then classifying the data point based on the values of those “neighbor” points. It is a non-parametric classifier that does not make any assumptions about the underlying data distribution.


To implement the KNN model, we instantiate the KNeighborsClassifier with an initial number of neighbors (n_neighbors) set to 3 and the distance metric set to Euclidean. We then fit the model to the training data.
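A minimal sketch of this step:

```python
from sklearn.neighbors import KNeighborsClassifier

# KNN with the initial hyperparameters: 3 neighbours, Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
```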

Predictions

Once the model is fitted to the training data, we make predictions on the test data. This allows us to evaluate the model’s performance and assess its ability to accurately predict the crash_type variable.
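Predictions are generated for both the test and the training set, so that the train and test accuracies can be compared later:

```python
# Predict crash_type for the held-out test data and for the training data.
y_pred_test = knn.predict(X_test)
y_pred_train = knn.predict(X_train)
```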

Evaluation

The evaluation of a machine learning model is typically done using several metrics, including accuracy, precision, recall, and F1 score. These metrics provide insights into different aspects of the model’s performance.

Accuracy is the percentage of correct predictions out of all model predictions. Precision is the percentage of true positive predictions out of all positive model predictions. Recall is the percentage of true positive predictions out of all actual positive cases in the dataset. The F1 score combines precision and recall into a single metric by taking their harmonic mean.

In this case, we calculate the accuracy, precision, recall, and F1 score of the KNN model using the appropriate functions from sklearn. We also compare the model’s performances on the train and test sets to assess the model fit.
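A sketch of the evaluation, assuming a binary crash_type target (sklearn's precision, recall and F1 functions default to binary averaging):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)

# Precision, recall and F1 on the test set; the default average="binary"
# assumes a two-class target.
precision = precision_score(y_test, y_pred_test)
recall = recall_score(y_test, y_pred_test)
f1 = f1_score(y_test, y_pred_test)

print(f"Train accuracy: {train_accuracy:.3f}")
print(f"Test accuracy:  {test_accuracy:.3f}")
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
```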

The initial metrics of the KNN model are as follows:

– Training Accuracy: 93.1%
– Test Accuracy: 79.6%
– Train-Test Accuracy Difference: 13.5%
– Precision Score: 82.1%
– Recall Score: 91.1%
– F1 Score: 86.3%

These metrics indicate that the model performed well on the test data, with a high accuracy, precision, recall, and F1 score. However, the significant difference between the train and test accuracies suggests that the model is overfitting the data.

Hyperparameter Tuning

To address the issue of overfitting, we need to fine-tune the model by selecting the best set of hyperparameters. In this case, we focus on two hyperparameters: the number of neighbors (n_neighbors) and the distance metric.

To find the best hyperparameters, we use k-fold cross-validation. This technique splits the data into k subsets, or folds, using each fold in turn as the validation set while the remaining folds are used for training. Averaging performance across the folds reduces the risk of bias introduced by a particular choice of training and validation split.

The initial hyperparameters for the KNN model are n_neighbors = 3 and metric = ‘euclidean’. These will be fine-tuned using the k-fold cross-validation technique.
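One way to run this search is with sklearn's GridSearchCV, which evaluates every hyperparameter combination using k-fold cross-validation. The article describes the k-fold procedure but not the exact tooling, so the sketch below, including the 5 folds and the range of neighbour counts, is an assumption:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Candidate hyperparameters: number of neighbours and distance metric.
param_grid = {
    "n_neighbors": list(range(1, 21)),
    "metric": ["euclidean", "manhattan"],
}

# 5-fold cross-validation over every combination (fold count assumed).
grid_search = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

print("Best hyperparameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)
```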

In conclusion, this article discusses the process of exploratory data analysis, scaling and transformation, data encoding, model building, and evaluation. The aim is to preprocess the data and build a K-Nearest Neighbors classification model to predict the crash_type variable. The initial model performed well on the test data, but showed signs of overfitting. Hyperparameter tuning will be used to improve the performance of the model.

Summary:

Exploratory Data Analysis (EDA) is an essential step in machine learning before building a model. It involves plotting histograms of the data distribution to identify outliers and make decisions on feature engineering. In this analysis, the latitude data was found to be bimodal and the longitude data right-skewed. To prepare the data for machine learning, scaling and transformation techniques were applied: a logarithmic function was used to rescale the skewed data, and trigonometric functions were used to encode cyclic data. Data encoding was also performed to represent non-numerical data in numerical format. The preprocessed data was then split into training and testing sets. A K-Nearest Neighbors (KNN) classification model was fitted to the training data and evaluated using various metrics. The model was found to be overfitting, so hyperparameter tuning was applied using k-fold cross-validation, testing different distance metrics such as Euclidean and Manhattan for the KNN model.


Frequently Asked Questions:

1. What is Data Science and why is it important?
Data Science is a multidisciplinary field that involves extracting knowledge and insights from large amounts of structured and unstructured data. It incorporates various techniques and tools from mathematics, statistics, computer science, and domain-specific knowledge to uncover patterns and solve complex problems. Data Science is crucial in today’s digital era as it enables businesses to make data-driven decisions, enhance operational efficiency, and gain a competitive advantage.

2. What are the steps involved in the Data Science process?
The Data Science process typically involves the following steps:
a) Data Collection: Gathering raw data from various sources.
b) Data Cleaning: Removing irrelevant or erroneous data, and handling inconsistencies.
c) Data Exploration: Analyzing and visualizing the data to gain initial insights.
d) Data Preparation: Transforming and formatting the data to make it suitable for further analysis.
e) Model Building: Constructing statistical or machine learning models to identify patterns and predict outcomes.
f) Model Evaluation: Assessing model performance using appropriate metrics.
g) Deployment: Implementing the model in a production environment and monitoring its performance.

3. What are the key skills required to become a Data Scientist?
To become a successful Data Scientist, one needs to possess a blend of technical and non-technical skills. Some key skills include:
a) Programming: Proficiency in languages like Python or R for data manipulation and analysis.
b) Statistics and Mathematics: A solid understanding of statistical concepts and mathematical techniques used in data analysis.
c) Machine Learning: Familiarity with various algorithms and techniques for building predictive models.
d) Data Visualization: The ability to create insightful visual representations of data to aid in understanding and decision-making.
e) Domain Knowledge: Understanding the specific field or industry where data analysis is being applied.
f) Communication: The capability to effectively explain complex findings to both technical and non-technical stakeholders.

4. What is the difference between Data Science, Machine Learning, and Artificial Intelligence?
While these terms are often used interchangeably, they have distinct meanings:
a) Data Science: The overarching field that encompasses the collection, cleaning, analysis, and interpretation of data to gain insights and solve problems.
b) Machine Learning: Specific algorithms and statistical models that allow computer systems to learn and improve from data without being explicitly programmed.
c) Artificial Intelligence: The broader concept of machines or systems that exhibit human-like intelligence, including problem-solving, natural language processing, and decision-making.

5. How is Data Science used in real-world applications?
Data Science has a wide range of practical applications across various industries, including:
a) Business Analytics: Analyzing customer behavior and market trends to drive data-driven decision making in marketing, sales, and product development.
b) Healthcare: Utilizing electronic health records, genomics, and medical imaging data to improve diagnoses, treatment plans, and patient outcomes.
c) Finance: Predicting stock prices, detecting fraudulent transactions, and assessing credit risk to aid in financial decision making.
d) Transportation and Logistics: Optimizing route planning, inventory management, and supply chain operations for improved efficiency and cost savings.
e) Social Media: Analyzing user behavior and sentiment to improve user experience, personalize recommendations, and drive targeted advertising.
