Assignments

Jump to: [Homeworks] [Projects] [Quizzes] [Exams]

There will be one homework (HW) assignment for each topical unit of the course, due about a week after we finish that unit.

These are intended to build your conceptual analysis skills plus your implementation skills in Python.

  • HW0 : Numerical Programming Fundamentals
  • HW1 : Regression, Cross-Validation, and Regularization
  • HW2 : Evaluating Binary Classifiers and Implementing Logistic Regression
  • HW3 : Neural Networks and Stochastic Gradient Descent
  • HW4 : Trees
  • HW5 : Kernel Methods and PCA

After we complete each unit, there will be a 20-minute quiz (taken online via Gradescope).

Each quiz will be designed to assess your conceptual understanding of that unit.

Expect about 10 questions. Most will be true/false or multiple choice, with perhaps 1-3 short-answer questions.

You can view the conceptual questions in each unit's in-class demos/labs and homework as good practice for the corresponding quiz.

There will be three larger "projects" throughout the semester:

  • Project A: Classifying Images with Feature Transformations
  • Project B: Classifying Sentiment from Text Reviews
  • Project C: Recommendation Systems for Movies

Projects are meant to be open-ended and encourage creativity. They serve as case studies of applying the ML concepts from class to three "real world" use cases: image classification, text classification, and recommendations of movies to users.

Each project will be due approximately 4 weeks after being handed out. Start early! Do not wait until the last few days.

Projects will generally be centered around a particular methodology for solving a specific task and involve significant programming (with some combination of developing core methods from scratch or using existing libraries). You will need to consider some conceptual issues, write a program to solve the task, and evaluate your program through experiments to compare the performance of different algorithms and methods.

Your main deliverable will be a short report (2-4 pages), describing your approach and providing several figures/tables to explain your results to the reader.

You’ll be assessed on effort, the sophistication of your technical approach, the clarity of your explanations, the evidence that you present to support your evaluative claims, and the performance of your implementation. A high-performing approach with little explanation will receive little credit, while a careful set of experiments that illuminate why a particular direction turned out to be a dead end may receive close to full credit.

Machine Learning Fundamentals Handbook – Key Concepts, Algorithms, and Python Code Examples

Tatev Aslanyan

If you're planning to become a Machine Learning Engineer, Data Scientist, or you want to refresh your memory before your interviews, this handbook is for you.

In it, we'll cover the key Machine Learning algorithms you'll need to know as a Data Scientist, Machine Learning Engineer, Machine Learning Researcher, and AI Engineer.

Throughout this handbook, I'll include examples for each Machine Learning algorithm with its Python code to help you understand what you're learning.

Whether you're a beginner or have some experience with Machine Learning or AI, this guide is designed to help you understand the fundamentals of Machine Learning algorithms at a high level.

As an experienced machine learning practitioner, I'm excited to share my knowledge and insights with you.

What You'll Learn

Chapter 1: What is Machine Learning?

Chapter 2: Most Popular Machine Learning Algorithms

  • 2.1 Linear Regression and Ordinary Least Squares (OLS)
  • 2.2 Logistic Regression and MLE
  • 2.3 Linear Discriminant Analysis (LDA)
  • 2.4 Logistic Regression vs LDA
  • 2.5 Naïve Bayes
  • 2.6 Naïve Bayes vs Logistic Regression
  • 2.7 Decision Trees
  • 2.8 Bagging
  • 2.9 Random Forest
  • 2.10 Boosting or Ensemble Techniques (AdaBoost, GBM, XGBoost)

Chapter 3: Feature Selection

  • 3.1 Subset Selection
  • 3.2 Regularization (Ridge and Lasso)
  • 3.3 Dimensionality Reduction (PCA)

Chapter 4: Resampling Techniques

  • 4.1 Cross-Validation (Validation Set, LOOCV, K-Fold CV)
  • 4.2 Optimal k in K-Fold CV
  • 4.3 Bootstrapping

Chapter 5: Optimization Techniques

  • 5.1 Batch Gradient Descent (GD)
  • 5.2 Stochastic Gradient Descent (SGD)
  • 5.3 SGD with Momentum
  • 5.4 Adam Optimizer
  • 6.1 Key Takeaways & What Comes Next
  • 6.2 About the Author — That's Me!
  • 6.3 How Can You Dive Deeper?
  • 6.4 Connect with Me


Prerequisites

To make the most out of this handbook, it'll be helpful if you're familiar with some core ML concepts:

Basic Terminology:

  • Training Data & Test Data: Datasets used to train and evaluate models.
  • Features: Variables used to make predictions, also called independent variables
  • Target Variable: The predicted outcome, also called the dependent variable or response variable

Overfitting Problem in Machine Learning

Understanding overfitting, how it relates to the Bias-Variance Tradeoff, and how you can fix it is very important. We will look at regularization techniques in detail in this guide, too. For a detailed understanding, refer to my article on overfitting and the Bias-Variance Tradeoff.

Foundational Readings for Beginners

If you have no prior statistical knowledge and wish to learn or refresh your understanding of essential statistical concepts, I'd recommend this article: Fundamental Statistical Concepts for Data Science

For a comprehensive guide on kickstarting a career in Data Science and AI, and insights on securing a Data Science job, you can delve into my previous handbook: Launching Your Data Science & AI Career

Tools/Languages to use in Machine Learning

As a Machine Learning Researcher or Machine Learning Engineer, there are many technical tools and programming languages you might use in your day-to-day job. But for this handbook, we'll use the following programming language and tools:

  • Python Basics: Variables, data types, structures, and control mechanisms.
  • Essential Libraries: numpy, pandas, matplotlib, scikit-learn, xgboost
  • Environment: Familiarity with Jupyter Notebooks or PyCharm as an IDE.

Embarking on this Machine Learning journey with a solid foundation ensures a more profound and enlightening experience.

Now, shall we?

Chapter 1: What is Machine Learning?

Machine Learning (ML), a branch of artificial intelligence (AI), refers to a computer's ability to autonomously learn from data patterns and make decisions without explicit programming. Machines use statistical algorithms to enhance system decision-making and task performance.

At its core, ML is a method where computers improve at tasks by learning from data. Think of it like teaching computers to make decisions by providing them examples, much like showing pictures to teach a child to recognize animals.

For instance, by analyzing buying patterns, ML algorithms can help online shopping platforms recommend products (like how Amazon suggests items you might like).

Or consider email platforms that learn to flag spam through recognizing patterns in unwanted mails. Using ML techniques, computers quietly enhance our daily digital experiences, making recommendations more accurate and safeguarding our inboxes.

On this journey, you'll unravel the fascinating world of ML, one where technology learns and grows from the information it encounters. But before doing so, let's look at some basics of Machine Learning that you must know to understand any sort of Machine Learning model.

Types of Learning in Machine Learning:

There are three main ways models can learn:

  • Supervised Learning: Models predict from labeled data (you have both features and labels, X and Y).
  • Unsupervised Learning: Models identify patterns autonomously in unlabeled data (you only have features X, no response variable).
  • Reinforcement Learning: Algorithms learn via action feedback.

Model Evaluation Metrics:

In Machine Learning, whenever you train a model you must evaluate it, and you'll want to use the evaluation metrics most appropriate to the nature of your problem.

Here are the most common ML model evaluation metrics per model type:

1. Regression Metrics:

  • MAE, MSE, RMSE: Measure differences between predicted and actual values.
  • R-Squared: Indicates variance explained by the model.

2. Classification Metrics:

  • Accuracy: Percentage of correct predictions.
  • Precision, Recall, F1-Score: Assess prediction quality.
  • ROC Curve, AUC: Gauge model's discriminatory power.
  • Confusion Matrix: Compares actual vs. predicted classifications.

3. Clustering Metrics:

  • Silhouette Score: Gauges object similarity within clusters.
  • Davies-Bouldin Index: Assesses cluster separation.


Chapter 2: Most Popular Machine Learning Algorithms

In this chapter, we'll simplify the complexity of essential Machine Learning (ML) algorithms. This will be a valuable resource for roles ranging from Data Scientists and Machine Learning Engineers to AI Researchers.

We'll start with basics in 2.1 with Linear Regression and Ordinary Least Squares (OLS), then go into 2.2 which explores Logistic Regression and Maximum Likelihood Estimation (MLE).

Section 2.3 explores Linear Discriminant Analysis (LDA), which is contrasted with Logistic Regression in 2.4. We get into Naïve Bayes in 2.5, offering a comparative analysis with Logistic Regression in 2.6.

In 2.7, we go through Decision Trees, subsequently exploring ensemble methods: Bagging in 2.8 and Random Forest in 2.9. Various popular Boosting techniques unfold in the following segments, discussing AdaBoost in 2.10.1, the Gradient Boosting Model (GBM) in 2.10.2, and concluding with Extreme Gradient Boosting (XGBoost) in 2.10.3.

All the algorithms we'll discuss here are fundamental and popular in the field, and every Data Scientist, Machine Learning Engineer, and AI researcher must know them at least at this high level.

Note that we will not delve into unsupervised learning techniques here, or enter into granular details of each algorithm.

2.1 Linear Regression

When the relationship between two variables is linear, you can use the Linear Regression statistical method. It helps you model the impact of a unit change in one variable (the independent variable) on the values of another variable (the dependent variable).

Dependent variables are often referred to as response variables or explained variables, whereas independent variables are often referred to as regressors or explanatory variables.

When the Linear Regression model is based on a single independent variable, the model is called Simple Linear Regression. When the model is based on multiple independent variables, it's referred to as Multiple Linear Regression.

Simple Linear Regression can be described by the following expression:

Y = β0 + β1X + u

where Y is the dependent variable, X is the independent variable which is part of the data, β0 is the intercept which is unknown and constant, and β1 is the slope coefficient or a parameter corresponding to the variable X which is unknown and constant as well. Finally, u is the error term that the model makes when estimating the Y values.

The main idea behind linear regression is to find the best-fitting straight line, the regression line, through a set of paired (X, Y) data. One example of a Linear Regression application is modeling the impact of flipper length on penguins' body mass, which is visualized below:

[Figure: regression line of penguin body mass on flipper length. Image Source: The Author]

Multiple Linear Regression with three independent variables can be described by the following expression:

Y = β0 + β1X1 + β2X2 + β3X3 + u

where Y is the dependent variable, X1, X2, X3 are the independent variables which are part of the data, β0 is the intercept which is unknown and constant, and β1, β2, β3 are the slope coefficients corresponding to the variables X1, X2, X3, which are unknown and constant as well. Finally, u is the error term that the model makes when estimating the Y values.

2.1.1 Ordinary Least Squares

Ordinary least squares (OLS) is a method for estimating the unknown parameters such as β0 and β1 in a linear regression model. The model is based on the principle of least squares: it minimizes the sum of squares of the differences between the observed dependent variable and the values predicted by the linear function of the independent variable, often referred to as fitted values.

This difference between the real and predicted values of the dependent variable Y is referred to as the residual. What OLS does is minimize the sum of squared residuals. This optimization problem results in the following OLS estimates for the unknown parameters β0 and β1, which are also known as coefficient estimates:

β̂1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²

β̂0 = Ȳ − β̂1X̄

Once these parameters of the Simple Linear Regression model are estimated, the fitted values of the response variable can be computed as follows:

Ŷi = β̂0 + β̂1Xi
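To make these formulas concrete, here is a minimal NumPy sketch (with made-up data) that computes the coefficient estimates and fitted values directly from the closed-form expressions above:

```python
import numpy as np

# Made-up paired observations (X, Y)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates for simple linear regression
beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()

# Fitted values of the response variable
Y_hat = beta0_hat + beta1_hat * X

print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.3f}")
```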

Standard Error

The residuals or the estimated error terms can be determined as follows:

ûi = Yi − Ŷi

It is important to keep in mind the difference between the error terms and the residuals. Error terms are never observed, while residuals are calculated from the data. For each observation, OLS gives us a residual as an estimate of that observation's error term, but not the actual error term itself. So, the true error variance is still unknown.

Also, these estimates are subject to sampling uncertainty. What this means is that we will never be able to determine the exact estimate, the true value, of these parameters from sample data in an empirical application. But we can estimate it by calculating the sample residual variance.

2.1.2 OLS Assumptions

The OLS estimation method makes the following assumptions which need to be satisfied to get reliable prediction results:

  • A1: the Linearity assumption states that the model is linear in parameters.
  • A2: the Random Sample assumption states that all observations in the sample are randomly selected.
  • A3: the Exogeneity assumption states that independent variables are uncorrelated with the error terms.
  • A4: the Homoskedasticity assumption states that the variance of all error terms is constant.
  • A5: the No Perfect Multi-Collinearity assumption states that none of the independent variables is constant and there are no exact linear relationships between the independent variables.

Note that the above description of Linear Regression is from my article named Complete Guide to Linear Regression; check it out for a more detailed treatment of the topic.

2.1.3 Linear Regression in Python

Imagine you have a friend, Alex, who collects stamps. Every month, Alex buys a certain number of stamps, and you notice that the amount Alex spends seems to depend on the number of stamps bought.

Now, you want to create a little tool that can predict how much Alex will spend next month based on the number of stamps bought. This is where Linear Regression comes into play.

In technical terms, we're trying to predict the dependent variable (amount spent) based on the independent variable (number of stamps bought).

Below is some simple Python code using scikit-learn to perform Linear Regression on a created dataset.

  • Sample Data : stamps_bought represents the number of stamps Alex bought each month and amount_spent represents the corresponding money spent.
  • Creating and Training Model : Using LinearRegression() from scikit-learn to create and train our model using .fit() .
  • Predictions : Use the trained model to predict the amount Alex will spend for a given number of stamps. In the code, we predict the amount for 10 stamps.
  • Plotting : We plot the original data points (in blue) and the predicted line (in red) to visually understand our model’s prediction capability.
  • Displaying Prediction : Finally, we print out the predicted spending for a specific number of stamps (10 in this case).

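Here is a minimal sketch of this example (the data values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Sample data: stamps bought per month and the corresponding amount spent
stamps_bought = np.array([2, 4, 5, 7, 9, 11, 12, 14]).reshape(-1, 1)
amount_spent = np.array([5, 9, 11, 16, 20, 24, 26, 30])

# Create and train the model
model = LinearRegression()
model.fit(stamps_bought, amount_spent)

# Predict the amount spent for 10 stamps
predicted_spending = model.predict(np.array([[10]]))

# Plot the original data points (blue) and the predicted line (red)
plt.scatter(stamps_bought, amount_spent, color='blue', label='Actual data')
plt.plot(stamps_bought, model.predict(stamps_bought), color='red', label='Regression line')
plt.xlabel('Stamps Bought')
plt.ylabel('Amount Spent ($)')
plt.legend()
plt.show()

print(f"Predicted spending for 10 stamps: ${predicted_spending[0]:.2f}")
```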

2.2 Logistic Regression

Another very popular Machine Learning technique is Logistic Regression which, though named regression, is actually a supervised classification technique.

Logistic regression is a Machine Learning method that models conditional probability of an event occurring or observation belonging to a certain class, based on a given dataset of independent variables.

When the relationship between two variables is linear and the dependent variable is a categorical variable, you may want to predict a variable in the form of a probability (number between 0 and 1). In these cases, Logistic Regression comes in handy.

This is because during the prediction process in Logistic Regression, the classifier predicts the probability (a value between 0 and 1) of each observation belonging to the certain class, usually to one of the two classes of dependent variable.

For instance, if you want to predict the probability or likelihood that a candidate will be elected or not during an election given the candidate's popularity score, past successes, and other descriptive variables about that candidate, you can use Logistic Regression to model this probability.

So, rather than predicting the response variable, Logistic Regression models the probability that Y belongs to a particular category.

It's similar to Linear Regression with a difference being that instead of Y it predicts the log odds. In statistical terminology, we model the conditional distribution of the response Y , given the predictor(s) X . So LR helps to predict the probability of Y belonging to certain class (0 and 1) given the features P(Y|X=x) .

The name Logistic in Logistic Regression comes from the function this approach is based upon, the Logistic Function. The Logistic Function makes sure that for too-large and too-small values, the corresponding probability is still within the [0,1] bounds.

P(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

In the equation above, P(X) stands for the probability of Y belonging to a certain class (0 or 1) given the features, P(Y|X=x). X stands for the independent variable, β0 is the intercept which is unknown and constant, and β1 is the slope coefficient corresponding to the variable X, which is unknown and constant as well, similar to Linear Regression. Finally, e stands for the exponential function exp().

Odds and Log Odds

Logistic Regression and its estimation technique MLE are based on the terms Odds and Log Odds. Odds are defined as follows:

Odds = P(X) / (1 − P(X))

and Log Odds is defined as follows:

log(Odds) = log( P(X) / (1 − P(X)) ) = β0 + β1X

2.2.1 Maximum Likelihood Estimation (MLE)

While for Linear Regression, we use OLS (Ordinary Least Squares) or LS (Least Squares) as an estimation technique, for Logistic Regression we should use another estimation technique.

We can't use LS in Logistic Regression to find the best-fitting line (to perform estimation) because the errors can then become very large or very small (even negative), while in the case of Logistic Regression we aim for a predicted value in [0,1].

So for Logistic Regression we use the MLE technique, where the likelihood function calculates the probability of observing the outcome given the input data and the model. This function is then optimized to find the set of parameters that results in the largest likelihood over the training dataset.

[Figure: the S-shaped curve produced by the logistic function]

The logistic function will always produce an S-shaped curve like the one above, regardless of the value of the independent variable X, resulting in sensible estimates most of the time.

2.2.2 Logistic Regression Likelihood Function(s)

The Likelihood function can be expressed as follows:

L(β0, β1) = ∏_{i=1..n} p(xi)^yi · (1 − p(xi))^(1 − yi)

So the Log Likelihood function can be expressed as follows:

log L(β0, β1) = log [ ∏_{i=1..n} p(xi)^yi · (1 − p(xi))^(1 − yi) ]

or, after transformation from multipliers to summation, we get:

log L(β0, β1) = Σ_{i=1..n} [ yi · log p(xi) + (1 − yi) · log(1 − p(xi)) ]

Then the idea behind the MLE is to find a set of estimates that would maximize this likelihood function.

  • Step 1: Project the data points onto a candidate line that produces a sample log(odds) value.
  • Step 2: Transform the sample log(odds) to sample probabilities by using the following formula:

p = e^(log(odds)) / (1 + e^(log(odds)))

  • Step 3: Obtain the overall likelihood or overall log likelihood.
  • Step 4: Rotate the log(odds) line again and again, until you find the optimal log(odds) line that maximizes the overall likelihood.

2.2.3 Cut off value in Logistic Regression

If you plan to use Logistic Regression to get a binary {0,1} value at the end, then you need a cut-off point to transform the estimated values per observation from the range [0,1] to a value of 0 or 1.

Depending on your individual case you can choose a corresponding cut-off point, but a popular cut-off point is 0.5. In this case, all observations with a predicted value smaller than 0.5 will be assigned to class 0, and observations with a predicted value larger than or equal to 0.5 will be assigned to class 1.

2.2.4 Performance Metrics in Logistic Regression

Since Logistic Regression is a classification method, common classification metrics such as recall, precision, and the F1 measure can all be used. But there is also a metric commonly used for assessing the performance of the Logistic Regression model, called Deviance.

2.2.5 Logistic Regression in Python

Jenny is an avid book reader. She reads books of different genres and maintains a little journal where she notes down the number of pages and whether she liked the book (Yes or No).

We see a pattern: Jenny typically enjoys books that are neither too short nor too long. Now, can we predict whether Jenny will like a book based on its number of pages? This is where Logistic Regression can help us!

In technical terms, we're trying to predict a binary outcome (like/dislike) based on one independent variable (number of pages).

Here's a simplified Python example using scikit-learn to implement Logistic Regression:

  • Sample Data : pages represents the number of pages in the books Jenny has read, and likes represents whether she liked them (1 for like, 0 for dislike).
  • Creating and Training Model : We instantiate LogisticRegression() and train the model using .fit() with our data.
  • Predictions : We predict whether Jenny will like a book with a particular number of pages (260 in this example).
  • Plotting : We visualize the original data points (in blue) and the predicted probability curve (in red). The green dashed line represents the page number we’re predicting for, and the grey dashed line indicates the threshold (0.5) above which we predict a "like".
  • Displaying Prediction : We output whether Jenny will like a book of the given page number based on our model's prediction.

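Here is a minimal sketch of this example (the page counts and labels are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Sample data: pages per book and whether Jenny liked it (1 = like, 0 = dislike)
pages = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500]).reshape(-1, 1)
likes = np.array([0, 1, 1, 1, 1, 0, 0, 0, 0])

# Create and train the model
model = LogisticRegression()
model.fit(pages, likes)

# Predict for a 260-page book
prob_like = model.predict_proba(np.array([[260]]))[0, 1]

# Plot the data points (blue) and the predicted probability curve (red)
page_range = np.linspace(50, 550, 300).reshape(-1, 1)
plt.scatter(pages, likes, color='blue', label='Actual data')
plt.plot(page_range.ravel(), model.predict_proba(page_range)[:, 1],
         color='red', label='Predicted probability')
plt.axvline(x=260, color='green', linestyle='--', label='260 pages')
plt.axhline(y=0.5, color='grey', linestyle='--', label='Threshold (0.5)')
plt.xlabel('Number of Pages')
plt.ylabel('Probability of Like')
plt.legend()
plt.show()

print(f"Will Jenny like a 260-page book? {'Yes' if prob_like >= 0.5 else 'No'}")
```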

2.3 Linear Discriminant Analysis (LDA)

Another classification technique, closely related to Logistic Regression, is Linear Discriminant Analysis (LDA). Where Logistic Regression is usually used to model the probability of an observation belonging to a class of an outcome variable with 2 categories, LDA is usually used to model the probability of an observation belonging to a class of an outcome variable with 3 or more categories.

LDA offers an alternative approach to model the conditional likelihood of the outcome variable given that set of predictors that addresses the issues of Logistic Regression. It models the distribution of the predictors X separately in each of the response classes (that is, given Y ), and then uses Bayes’ theorem to flip these two around into estimates for Pr(Y = k|X = x).

Note that in the case of LDA these distributions are assumed to be normal. It turns out that the model is very similar in form to logistic regression. In the equation below:

Pr(Y = k | X = x) = π_k f_k(x) / Σ_{l=1..K} π_l f_l(x)

π_k represents the overall prior probability that a randomly chosen observation comes from the kth class. f_k(x), which is equal to Pr(X = x|Y = k), is the density function of X for an observation that comes from the kth class (the density function of the predictors): the probability of X = x given that the observation is from that class.

The left-hand side, Pr(Y = k|X = x), is the posterior probability: the probability that the observation belongs to the kth class, given the predictor value for that observation.

Assuming that f_k(x) is Normal or Gaussian, the normal density takes the following form (this is the one-dimensional setting):

f_k(x) = (1 / (√(2π) σ_k)) · exp( −(x − μ_k)² / (2σ_k²) )

where μ_k and σ_k² are the mean and variance parameters for the kth class. We further assume that σ1² = ··· = σK²: there is a shared variance term across all K classes, which we denote by σ².

LDA then approximates the Bayes classifier by using the following estimates for π_k, μ_k, and σ²:

π̂_k = n_k / n

μ̂_k = (1 / n_k) Σ_{i: yi = k} xi

σ̂² = (1 / (n − K)) Σ_{k=1..K} Σ_{i: yi = k} (xi − μ̂_k)²

where n_k is the number of training observations in the kth class.


2.3.1 Linear Discriminant Analysis in Python

Imagine Sarah, who loves cooking and trying various fruits. She sees that the fruits she likes are typically of specific sizes and sweetness levels.

Now, Sarah is curious: can she predict whether she will like a fruit based on its size and sweetness? Let's use Linear Discriminant Analysis (LDA) to help her predict whether she'll like certain fruits or not.

In technical language, we are trying to classify the fruits (like/dislike) based on two predictor variables (size and sweetness).

  • Sample Data : fruits_features contains two features – size and sweetness of fruits, and fruits_likes represents whether Sarah likes them (1 for like, 0 for dislike).
  • Creating and Training Model : We instantiate LinearDiscriminantAnalysis() and train it using .fit() with our sample data.
  • Prediction : We predict whether Sarah will like a fruit with a particular size and sweetness level ([2.5, 6] in this example).
  • Plotting : We visualize the original data points, color-coded based on Sarah’s like (yellow) and dislike (purple), and mark the new fruit with a red 'x'.
  • Displaying Prediction : We output whether Sarah will like a fruit with the given size and sweetness level based on our model's prediction.

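Here is a minimal sketch of this example (the sizes, sweetness levels, and labels are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Sample data: [size, sweetness] per fruit, and whether Sarah liked it (1/0)
fruits_features = np.array([[3, 7], [2, 8], [3, 6], [4, 7],
                            [1, 4], [2, 3], [3, 2], [4, 3]])
fruits_likes = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Create and train the model
model = LinearDiscriminantAnalysis()
model.fit(fruits_features, fruits_likes)

# Predict for a new fruit with size 2.5 and sweetness 6
new_fruit = np.array([[2.5, 6]])
prediction = model.predict(new_fruit)

# Plot the data (purple = dislike, yellow = like) and mark the new fruit with a red 'x'
plt.scatter(fruits_features[:, 0], fruits_features[:, 1], c=fruits_likes, cmap='viridis')
plt.scatter(new_fruit[:, 0], new_fruit[:, 1], color='red', marker='x', s=100, label='New fruit')
plt.xlabel('Size')
plt.ylabel('Sweetness')
plt.legend()
plt.show()

print(f"Will Sarah like this fruit? {'Yes' if prediction[0] == 1 else 'No'}")
```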

2.4 Logistic Regression vs LDA

Logistic regression is a popular approach for performing classification when there are two classes. But when the classes are well-separated, or the number of classes exceeds 2, the parameter estimates for the logistic regression model are surprisingly unstable.

Unlike Logistic Regression, LDA does not suffer from this instability problem when the number of classes is more than 2. If n is small and the distribution of the predictors X is approximately normal in each of the classes, LDA is again more stable than the Logistic Regression model.

2.5 Naïve Bayes

Another classification method that relies on Bayes Rule, like LDA, is the Naïve Bayes classification approach. For more about Bayes' Theorem, Bayes Rule, and a corresponding example, you can read these articles.

Like Logistic Regression, you can use the Naïve Bayes approach to classify an observation into one of two classes (0 or 1).

The idea behind this method is to calculate the probability of an observation belonging to a class, given the prior probability for that class and the conditional probability of each feature value given that class. That is:

Pr(Y = k | X = x) = π_k f_k(x) / Σ_{l=1..K} π_l f_l(x)

where Y stands for the class of an observation, k is the kth class, and x1, …, xn stand for feature 1 through feature n, respectively. f_k(x) = Pr(X = x|Y = k) is, as in the case of LDA, the density function of X for an observation that comes from the kth class (the density function of the predictors).

If you compare the above expression with the one you saw for LDA, you will see some similarities.

In LDA, we make a very important and strong assumption for simplification purposes: namely, that f_k is the density function for a multivariate normal random variable with class-specific mean μ_k and shared covariance matrix Σ.

This assumption helps replace the very challenging problem of estimating K p-dimensional density functions with the much simpler problem of estimating K p-dimensional mean vectors and one (p × p)-dimensional covariance matrix.

In the case of the Naïve Bayes Classifier, it uses a different approach for estimating f_1 (x), . . . , f_K(x). Instead of making an assumption that these functions belong to a particular family of distributions (for example normal or multivariate normal), we instead make a single assumption: within the k th class, the p predictors are independent. That is:

f_k(x) = f_k1(x1) × f_k2(x2) × ··· × f_kp(xp)

So the Naïve Bayes classifier assumes that the value of a particular variable or feature is independent of the value of any other variable (uncorrelated), given the class/label variable.

For instance, a fruit may be considered to be a banana if it is yellow, oval shaped, and about 5–10 cm long. So, the Naïve Bayes classifier considers that each of these various features of fruit contribute independently to the probability that this fruit is a banana, independent of any possible correlation between the colour, shape, and length features.

Naïve Bayes Estimation

Like Logistic Regression, in the case of the Naïve Bayes classification approach we use Maximum Likelihood Estimation (MLE) as the estimation technique. There is a great article providing a detailed, concise summary of this approach with a corresponding example, which you can find here.

2.5.1 Naïve Bayes in Python

Tom is a movie enthusiast who watches films across different genres and records his feedback—whether he liked them or not. He has noticed that whether he likes a film might depend on two aspects: the movie's length and its genre. Can we predict whether Tom will like a movie based on these two characteristics using Naïve Bayes?

Technically, we want to predict a binary outcome (like/dislike) based on the independent variables (movie length and genre).

  • Sample Data : movies_features contains two features: movie length and genre (encoded as numbers), while movies_likes indicates whether Tom likes them (1 for like, 0 for dislike).
  • Creating and Training Model : We instantiate GaussianNB() (a Naïve Bayes classifier assuming Gaussian distribution of data) and train it with .fit() using our data.
  • Prediction : We predict whether Tom will like a new movie, given its length and genre code ([100, 1] in this case).
  • Plotting : We visualize the original data points, color-coded based on Tom’s like (yellow) and dislike (purple). The red 'x' represents the new movie.
  • Displaying Prediction : We print whether Tom will like a movie of the given length and genre code, as per our model's prediction.

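Here is a minimal sketch of this example (the lengths, genre codes, and labels are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

# Sample data: [movie length in minutes, genre code] and Tom's feedback (1/0)
movies_features = np.array([[120, 1], [150, 2], [90, 1], [140, 3],
                            [100, 2], [80, 1], [160, 3], [110, 2]])
movies_likes = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# Create and train the model (Gaussian Naive Bayes)
model = GaussianNB()
model.fit(movies_features, movies_likes)

# Predict for a new movie: 100 minutes long, genre code 1
new_movie = np.array([[100, 1]])
prediction = model.predict(new_movie)

# Plot the data (purple = dislike, yellow = like) and mark the new movie with a red 'x'
plt.scatter(movies_features[:, 0], movies_features[:, 1], c=movies_likes, cmap='viridis')
plt.scatter(new_movie[:, 0], new_movie[:, 1], color='red', marker='x', s=100, label='New movie')
plt.xlabel('Movie Length (min)')
plt.ylabel('Genre Code')
plt.legend()
plt.show()

print(f"Will Tom like this movie? {'Yes' if prediction[0] == 1 else 'No'}")
```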

2.6 Naïve Bayes vs Logistic Regression

The Naïve Bayes Classifier has proven to be faster, and it has higher bias and lower variance. Logistic Regression has low bias and higher variance. Depending on your individual case, and the bias-variance trade-off, you can pick the corresponding approach.

2.7 Decision Trees

Decision Trees are a supervised and non-parametric Machine Learning method used for both classification and regression purposes. The idea is to create a model that predicts the value of a target variable by learning simple decision rules from the data predictors.

Unlike Linear Regression, or Logistic Regression, Decision Trees are simple and useful model alternatives when the relationship between independent variables and dependent variable is suspected to be non-linear.

Tree-based methods stratify or segment the predictor space into smaller regions. The idea behind building Decision Trees is to divide the predictor space X1, X2, …, Xp into distinct and mutually exclusive regions R_1, R_2, …, R_N, where the regions are in the form of boxes or rectangles. These regions are found by recursive binary splitting, since exactly minimizing the RSS over all possible partitions is not computationally feasible. This approach is often referred to as a greedy approach.

Decision trees are built by top-down splitting. So, in the beginning, all observations belong to a single region. Then, the model successively splits the predictor space. Each split is indicated via two new branches further down on the tree.

This approach is sometimes called greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

Stopping Criteria

There are some common stopping criteria used when building Decision Trees:

  • Minimum number of observations in the leaf.
  • Minimum number of samples for a node split.
  • Maximum depth of tree (vertical depth).
  • Maximum number of terminal nodes.
  • Maximum features to consider for the split.


For example, repeat this splitting process until no region contains more than 100 observations. Let's dive deeper into each of these criteria:

1. Minimum number of observations in the leaf: If a proposed split results in a leaf node with fewer than a defined number of observations, that split might be discarded. This prevents the tree from becoming overly complex.

2. Minimum number of samples for a node split: To proceed with a node split, the node must have at least this many samples. This ensures that there's a significant amount of data to justify the split.

3. Maximum depth of tree (vertical depth): This limits how many times a tree can split. It's like telling the tree how many questions it can ask about the data before making a decision.

4. Maximum number of terminal nodes: This is the total number of end nodes (or leaves) the tree can have.

5. Maximum features to consider for the split: For each split, the algorithm considers only a subset of features. This can speed up training and help in generalization.

When building a decision tree, especially when dealing with a large number of features, the tree can become too big with too many leaves. This will affect the interpretability of the model, and might potentially result in an overfitting problem. Therefore, picking a good stopping criterion is essential for the interpretability and the performance of the model.

RSS/Gini Index/Entropy/Node Purity

When building the tree, we use RSS (for Regression Trees) and the Gini Index/Entropy (for Classification Trees) for picking the predictor and value for splitting the regions. Both the Gini Index and Entropy are often called Node Purity measures because they describe how pure the leaves of the tree are.

For Regression Trees, the split is chosen to minimize the residual sum of squares:

RSS = Σ_{j=1..J} Σ_{i ∈ Rj} (yi − ŷ_Rj)²

where ŷ_Rj is the mean response of the training observations in the jth region.

The Gini index measures the total variance across the K classes. It takes on a small value when all of the class proportions are close to either 1 or 0. This is also why it's called a measure of node purity: the Gini index takes small values when the nodes of the tree contain predominantly observations from the same class.

The Gini index is defined as follows:

G = Σ_{k=1..K} p̂mk (1 − p̂mk)

where pˆmk represents the proportion of training observations in the mth region that are from the kth class.

Entropy is another node purity measure. Like the Gini index, the entropy will take on a small value if the mth node is pure. In fact, the Gini index and the entropy are quite similar numerically, and the entropy can be expressed as follows:

D = −Σ_{k=1..K} p̂mk log(p̂mk)
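As a small illustration, here is a sketch (with made-up class proportions) of how these two purity measures behave for a pure versus a mixed node:

```python
import numpy as np

def gini_index(proportions):
    """Gini index: sum over classes of p_mk * (1 - p_mk)."""
    p = np.asarray(proportions)
    return np.sum(p * (1 - p))

def entropy(proportions):
    """Entropy: -sum over classes of p_mk * log(p_mk), ignoring zero proportions."""
    p = np.asarray(proportions)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

pure_node = [0.9, 0.05, 0.05]    # predominantly one class -> small values
mixed_node = [0.34, 0.33, 0.33]  # evenly mixed classes -> large values

print(f"Gini:    pure = {gini_index(pure_node):.3f}, mixed = {gini_index(mixed_node):.3f}")
print(f"Entropy: pure = {entropy(pure_node):.3f}, mixed = {entropy(mixed_node):.3f}")
```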

Decision Tree Classification Example

Let’s look at an example where we have three features describing consumers' past behaviour:

  • Recency (How recent was the customer’s last purchase?)
  • Monetary (How much money did the customer spend in a given period?)
  • Frequency (How often did this customer make a purchase in a given period?)

We will use the classification version of the Decision Tree to classify customers to 1 of the 3 classes (Good: 1, Better: 2 and Best: 3), given the features describing the customer's behaviour.

In the following tree, where we use the Gini Index as a purity measure, we see that the first feature, which seems to be the most important one, is Recency. Let's look at the tree and then interpret it:

[Figure: decision tree classifying customers into classes 1-3 using Recency, Monetary, and Frequency splits]

For customers with a Recency of 202 or larger (their last purchase was more than 202 days ago), the chance of being assigned to class 1 is 93% (basically, we can label those customers as Good Class customers).

For customers with a Recency less than 202 (they made a purchase recently), we look at their Monetary value, and if it's smaller than 1394, we then look at their Frequency. If the Frequency is smaller than 44, we can then label this customer's class as Better (class 2). And so on.

Decision Trees Python Implementation

Alex is intrigued by the relationship between the number of hours studied and the scores obtained by students. Alex collected data from his peers about their study hours and respective test scores.

He wonders: can we predict a student's score based on the number of hours they study? Let's leverage Decision Tree Regression to uncover this.

Technically, we're predicting a continuous outcome (test score) based on an independent variable (study hours).

  • Sample Data : study_hours contains hours studied, and test_scores contains the corresponding test scores.
  • Creating and Training Model : We create a DecisionTreeRegressor with a specified maximum depth (to prevent overfitting) and train it with .fit() using our data.
  • Plotting the Decision Tree : plot_tree helps visualize the decision-making process of the model, representing splits based on study hours.
  • Prediction & Plotting : We predict the test score for a new study hour value (5.5 in this example), visualize the original data points, the decision tree’s predicted scores, and the new prediction.

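Here is a minimal sketch of this example (the study hours and scores are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Sample data: hours studied and the corresponding test scores
study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
test_scores = np.array([50, 55, 62, 65, 70, 75, 80, 85, 88, 92])

# Limit the depth to keep the tree interpretable and avoid overfitting
model = DecisionTreeRegressor(max_depth=3)
model.fit(study_hours, test_scores)

# Visualize the decision-making process of the fitted tree
plt.figure(figsize=(12, 6))
plot_tree(model, feature_names=['study_hours'], filled=True)
plt.show()

# Predict the score for 5.5 hours of study
new_hours = np.array([[5.5]])
predicted_score = model.predict(new_hours)

# Plot actual data (red), the step-function predictions (orange), and the new prediction (green)
hour_grid = np.linspace(1, 10, 200).reshape(-1, 1)
plt.scatter(study_hours, test_scores, color='red', label='Actual scores')
plt.plot(hour_grid.ravel(), model.predict(hour_grid), color='orange', label='Tree predictions')
plt.scatter(new_hours, predicted_score, color='green', marker='x', s=100, label='Prediction (5.5 h)')
plt.xlabel('Study Hours')
plt.ylabel('Test Score')
plt.grid(True)
plt.legend()
plt.show()

print(f"Predicted score for 5.5 study hours: {predicted_score[0]:.1f}")
```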

The visualization depicts a decision tree model trained on study hours data. Each node represents a decision based on study hours, branching from the top root based on conditions that best forecast test scores. The process continues until reaching a maximum depth or no further meaningful splits. Leaf nodes at the bottom give final predictions, which for regression trees, are the average of target values for training instances reaching that leaf. This visualization highlights the model's predictive approach and the significant influence of study hours on test scores.


The "Study Hours vs. Test Scores" plot illustrates the correlation between study hours and corresponding test scores. Actual data points are denoted by red dots, while the model's predictions are shown as an orange step function, characteristic of regression trees. A green "x" marker highlights a prediction for a new data point, here representing a 5.5-hour study duration. The plot's design elements, such as gridlines, labels, and legends, enhance comprehension of the real versus anticipated values.

2.8 Bagging

One of the biggest disadvantages of Decision Trees is their high variance. You might end up with a model and predictions that are easy to explain but misleading. This would result in making incorrect conclusions and business decisions.

So to reduce the variance of the Decision trees, you can use a method called Bagging. To understand what Bagging is, there are two terms you need to know:

  • Bootstrapping
  • Central Limit Theorem (CLT)

You can find more about Bootstrapping, which is a resampling technique, later in this handbook. For now, you can think of Bootstrapping as a technique that performs sampling from the original data with replacement, which creates a copy of the data very similar to but not exactly the same as the original data.

Bagging is also based on the same idea as the CLT, which is one of the most important, if not the most important, theorems in Statistics. You can read in more detail about the CLT here.

But the idea that is also used in Bagging is that if you take the average of many samples, then the variance is significantly reduced compared to the variance of each of the individual sample based models.

So, given a set of n independent observations Z1, …, Zn, each with variance σ², the variance of the mean Z̄ of the observations is given by σ²/n. So averaging a set of observations reduces variance.

For more statistical details, check out the CLT tutorial referenced above.

Bagging is basically Bootstrap aggregation: it builds B trees using bootstrapped samples. Bagging can be used to improve precision (lower the variance of many approaches) by taking repeated samples from a single training data set.

So, in Bagging, we generate B bootstrapped training samples, based on which B similar (correlated) trees are built and then aggregated to calculate the predictions, taking the average of these predictions across the B samples. Notably, each tree is built on a bootstrap data set, independent of the other trees.

So, in the case of Bagging, all p features are considered in each tree split, which results in similar trees: every time, the strongest predictors are at the top and the weak ones at the bottom, so all of the bagged trees look quite similar to each other.

2.8.1 Bagging in Regression Trees

To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets, and average the resulting predictions. These trees are grown deep, and are not pruned. So each individual tree has high variance, but low bias. Averaging these B trees reduces the variance.

2.8.2 Bagging in Classification Trees

For a given test observation, we can record the class predicted by each of the B trees and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.

2.8.3 OOB Out-of-Bag Error Estimation

When Bagging is applied to decision trees, there is no longer a need to apply Cross Validation to estimate the test error rate. In bagging, we repeatedly fit the trees to Bootstrapped samples – and on average only 2/3 of these observations are used. The other 1/3 are not used during the training process. These are called Out-of-bag observations.

So, on average, there are about B/3 predictions for each observation from trees that did not use it in training. We can take the average of these predicted response values (or the majority class, for classification) to get an out-of-bag prediction per observation, and averaging the resulting errors across all observations yields the OOB estimate of the test error rate.

2.8.4 Bagging in Python

Meet Lucy, a fitness coach who is curious about predicting her clients’ weight loss based on their daily calorie intake and workout duration. Lucy has data from past clients but recognizes that individual predictions might be prone to errors. Let's utilize Bagging to create a more stable prediction model.

Technically, we'll predict a continuous outcome (weight loss) based on two independent variables (daily calorie intake and workout duration), using Bagging to reduce variance in predictions.

Sample output:

True weight loss: [2.  4.5]
Predicted weight loss: [3.1  3.96]
Mean Squared Error: 0.75

  • Sample Data : clients_data contains daily calorie intake and workout duration, and weight_loss contains the corresponding weight loss.
  • Train-Test Split : We split the data into training and test sets to validate the model's predictive performance.
  • Creating and Training Model : We instantiate BaggingRegressor with DecisionTreeRegressor as the base estimator and train it using .fit() with our training data.
  • Prediction & Evaluation : We predict weight loss for the test data, evaluating prediction quality with Mean Squared Error (MSE).
  • Visualizing One of the Base Estimators : Optionally, visualize one tree from the ensemble to understand individual decision-making processes (keeping in mind an individual tree may not perform well, but collectively they produce stable predictions).

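Here is a minimal sketch of this example (the client data is illustrative, so the exact numbers will differ from the sample output above):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Sample data: [daily calorie intake, workout hours] and weight loss (kg)
clients_data = np.array([[2000, 1.0], [2500, 0.5], [1800, 1.5], [2200, 1.0],
                         [2700, 0.2], [1600, 2.0], [2100, 1.2], [2300, 0.8]])
weight_loss = np.array([3.0, 1.5, 4.0, 2.5, 1.0, 4.5, 3.2, 2.0])

# Hold out part of the data to validate predictive performance
X_train, X_test, y_train, y_test = train_test_split(
    clients_data, weight_loss, test_size=0.25, random_state=42)

# Bagging: many trees fit on bootstrapped samples, predictions averaged
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)

print("True weight loss:", y_test)
print("Predicted weight loss:", predictions)
print("Mean Squared Error:", round(mse, 2))

# Optionally inspect one of the base estimators (an individual tree)
one_tree = model.estimators_[0]
```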

2.9 Random Forest

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees.

As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

The split is allowed to use only one of those m predictors. A fresh and random sample of m predictors is taken at each split, and typically we choose m ≈ √p — that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors. This is also the reason why Random Forest is called “random”.

The main difference between bagging and random forests is this choice of predictor subset size m, which decorrelates the trees.

Using a small value of m in building a random forest will typically be helpful when we have a large number of correlated predictors. So, if you have a problem of Multicollinearity, RF is a good method to fix that problem.

So, unlike in Bagging, in the case of Random Forest not all p predictors are considered in each tree split – only m randomly selected ones. This results in dissimilar, decorrelated trees. And because averaging decorrelated trees results in smaller variance, Random Forest is more accurate than Bagging.

2.9.1 Random Forest Python Implementation

Noah is a botanist who has collected data about various plant species and their characteristics, such as leaf size and flower color. Noah is curious if he could predict a plant’s species based on these features.

Here, we’ll utilize Random Forest, an ensemble learning method, to help him classify plants.

Technically, we aim to classify plant species based on certain predictor variables using a Random Forest model.

  • Sample Data : plants_features contains leaf size and flower color, while plants_species indicates the species of the respective plant.
  • Train-Test Split : We separate the data into training and test sets.
  • Creating and Training Model : We instantiate RandomForestClassifier with a specified number of trees (10 in this case) and train it using .fit() with our training data.
  • Prediction & Evaluation : We predict the species for the test data and evaluate the predictions using a classification report which provides precision, recall, f1-score, and support.
  • Visualizing Feature Importances : We utilize a horizontal bar chart to display the importance of each feature in predicting the plant species. Random Forest quantifies the usefulness of features during the tree-building process, which we visualize here.

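Here is a minimal sketch of this example (the plant features and species labels are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample data: [leaf size, flower color code] and the species label (0, 1, or 2)
plants_features = np.array([[3.5, 1], [4.0, 0], [2.8, 1], [5.0, 2],
                            [3.2, 1], [4.5, 0], [2.5, 2], [3.9, 0],
                            [4.8, 2], [3.0, 1], [4.2, 0], [2.7, 2]])
plants_species = np.array([0, 1, 0, 2, 0, 1, 2, 1, 2, 0, 1, 2])

X_train, X_test, y_train, y_test = train_test_split(
    plants_features, plants_species, test_size=0.25, random_state=42)

# Random Forest with 10 trees
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)

# Evaluate predictions with precision, recall, f1-score, and support
predictions = model.predict(X_test)
print(classification_report(y_test, predictions, zero_division=0))

# Feature importances accumulated during the tree-building process
features = ['Leaf Size', 'Flower Color']
plt.barh(features, model.feature_importances_)
plt.xlabel('Importance')
plt.title('Feature Importances')
plt.show()
```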

2.10 Boosting or Ensemble Models

Like Bagging (averaging correlated Decision Trees) and Random Forest (averaging uncorrelated Decision Trees), Boosting aims to improve the predictions resulting from a decision tree. Boosting is a supervised Machine Learning model that can be used for both regression and classification problems.

Unlike Bagging or Random Forest, where the trees are built independently from each other using one of the B bootstrapped samples (copies of the initial training data), in Boosting, the trees are built sequentially and depend on each other. Each tree is grown using information from previously grown trees.

Boosting does not involve bootstrap sampling. Instead, each tree fits on a modified version of the original data set. It’s a method of converting weak learners into strong learners.

In boosting, each new tree is a fit on a modified version of the original data set. So, unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly.

Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response. We then add this new decision tree into the fitted function in order to update the residuals.

Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm. Now let's have a look at the 3 most popular Boosting models in Machine Learning:

2.10.1 Boosting: AdaBoost

The first Ensemble algorithm we will look into today is AdaBoost. Like in all boosting techniques, in the case of AdaBoost the trees are built using the information from the previous tree – and more specifically part of the tree which didn’t perform well. This is called the weak learner (Decision Stump). This Decision Stump is built using only a single predictor and not all predictors to perform the prediction.

So, AdaBoost combines weak learners to make classifications and each stump is made by using the previous stump’s errors. Here is the step-by-step plan for building an AdaBoost model:

  • Step 1: Initial Weight Assignment – assign equal weight to all observations in the sample where this weight represents the importance of the observations being correctly classified: 1/N (all samples are equally important at this stage).
  • Step 2: Optimal Predictor Selection – The first stump is built by obtaining the RSS (in case of regression) or Gini Index/Entropy (in case of classification) for each predictor, and picking the stump that does the best job in terms of prediction accuracy: the stump with the smallest RSS or Gini/Entropy is selected as the next tree.
  • Step 3: Computing Stump Weight based on the Stump's Total Error – The importance of this stump in the final tree is then determined using the total error that this stump is making, where a stump that is no better than a random flip of a coin, with total error equal to 0.5, gets weight 0: Weight = 0.5 * log((1 − Total Error) / Total Error)
  • Step 4: Updating Observation Weights – We increase the weights of the observations which have been incorrectly predicted and decrease the weights of the remaining observations which have been correctly classified, so that the next stump places higher importance on correctly predicting these observations.
  • Step 5: Building the next Stump based on updated weights – using the Weighted Gini Index to choose the next stump.
  • Step 6: Combining B stumps – Then all the stumps are combined with a weighted sum that takes into account their importance.

AdaBoost Python Implementation

Imagine a scenario where we aim to predict house prices based on certain features like the number of rooms and age of the house.

For this example, let's generate synthetic data where:

  • num_rooms: the number of rooms in the house
  • house_age: the age of the house in years
  • price: the price of the house in thousand dollars

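A minimal sketch of this example, with synthetic data generated along those lines (the coefficients and noise level are illustrative):

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: rooms and age drive the price (in thousand dollars)
rng = np.random.RandomState(42)
num_rooms = rng.randint(2, 8, size=100)
house_age = rng.randint(1, 50, size=100)
price = num_rooms * 50 - house_age * 2 + rng.normal(0, 10, size=100)

X = np.column_stack([num_rooms, house_age])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.2, random_state=42)

# AdaBoost with shallow trees (stumps) as the weak learners
model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=1), n_estimators=100,
                          learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("MSE:", round(mean_squared_error(y_test, predictions), 2))
```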

2.10.2 Boosting Algorithm: Gradient Boosting Model (GBM)

AdaBoost and Gradient Boosting are very similar to each other. But compared to AdaBoost, which starts the process by selecting a stump and continues building by using the weak learners from the previous stump, Gradient Boosting starts with a single leaf instead of a tree or a stump.

The outcome corresponding to this chosen leaf is then an initial guess for the outcome variable. Like in the case of AdaBoost, Gradient Boosting uses the previous stump’s errors to build the tree. But unlike in AdaBoost, the trees that Gradient Boost builds are larger than a stump. That’s a parameter where we set a max number of leaves.

To make sure the tree is not overfitting, Gradient Boosting uses the Learning Rate to scale the gradient contributions. Gradient Boosting is based on the idea that taking lots of small steps in the right direction (gradients) will result in lower variance (for testing data).

The major difference between the AdaBoost and Gradient Boosting algorithms is how the two identify the shortcomings of weak learners (for example, decision trees). While the AdaBoost model identifies the shortcomings by using high weight data points, gradient boosting performs the same by using gradients in the loss function (y=ax+b+e , e needs a special mention as it is the error term).

The loss function is a measure indicating how good a model’s coefficients are at fitting the underlying data. A logical understanding of loss function would depend on what we are trying to optimise.

Early Stopping

The special process of tuning the number of iterations for an algorithm (such as GBM and Random Forest) is called “Early Stopping” – a phenomenon we touched upon when discussing the Decision Trees.

Early Stopping performs model optimisation by monitoring the model’s performance on a separate test data set and stopping the training procedure once the performance on the test data stops improving beyond a certain number of iterations.

It avoids overfitting by attempting to automatically select the inflection point where performance on the test dataset starts to decrease while performance on the training dataset continues to improve as the model starts to overfit.

In the context of GBM, early stopping can be based either on an out-of-bag sample set (OOB) or cross-validation (CV). As mentioned earlier, the ideal time to stop training the model is when the validation error has decreased and started to stabilize, before it starts increasing due to overfitting.

To build GBM, follow this step-by-step process:

  • Step 1: Train the model on the existing data to predict the outcome variable
  • Step 2: Compute the error rate using the predictions and the real values (Pseudo Residual)
  • Step 3: Use the existing features and the Pseudo Residual as the outcome variable to predict the residuals again
  • Step 4: Use the predicted residuals to update the predictions from Step 1, while scaling this contribution to the tree with a learning rate (hyperparameter)
  • Step 5: Repeat steps 1–4, the process of updating the pseudo residuals and the tree while scaling with the learning rate, to move slowly in the right direction until there is no longer an improvement or we come to our stopping rule

The idea is that each time we add a new scaled tree to the model, the residuals should get smaller.

At any step m, the Gradient Boosting model produces a model that is an ensemble of the model from the previous step, F(m−1), and the learning rate eta multiplied by the negative derivative of the loss function with respect to the output of the model at step m−1 (the weak learner at step m−1):

F_m(x) = F_(m−1)(x) − η · ∂L(y, F_(m−1)(x)) / ∂F_(m−1)(x)

GBM Python Implementation


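A minimal, hypothetical sketch using scikit-learn's GradientBoostingRegressor on synthetic house-price data similar to the AdaBoost example (data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic house-price data, as in the AdaBoost example
rng = np.random.RandomState(0)
num_rooms = rng.randint(2, 8, size=100)
house_age = rng.randint(1, 50, size=100)
price = num_rooms * 50 - house_age * 2 + rng.normal(0, 10, size=100)

X = np.column_stack([num_rooms, house_age])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.2, random_state=0)

# Small trees, many boosting rounds, each scaled by a learning rate
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                  max_depth=3, random_state=0)
model.fit(X_train, y_train)

print("MSE:", round(mean_squared_error(y_test, model.predict(X_test)), 2))
```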

2.10.3 Boosting Algorithm: XGBoost

One of the most popular Boosting or Ensemble algorithms is Extreme Gradient Boosting (XGBoost).

The difference between the GBM and XGBoost is that in case of XGBoost the second-order derivatives are calculated (second-order gradients). This provides more information about the direction of gradients and how to get to the minimum of the loss function.

Remember that this is needed to identify the weak learner and improve the model by improving the weak learners.

The idea behind XGBoost is that the 2nd-order derivative tends to be more precise in terms of finding the accurate direction. XGBoost also applies advanced regularization in the form of L1 or L2 norms to address overfitting.

Unlike AdaBoost, XGBoost is parallelizable due to its special caching mechanism, making it convenient for handling large and complex datasets. Also, to speed up training, XGBoost uses an Approximate Greedy Algorithm to consider only a limited number of thresholds for splitting the nodes of the trees.

To build an XGBoost model, follow this step-by-step process:

  • Step 1: Fit a Single Decision Tree – In this step, the Loss function is calculated, for example NDCG to evaluate the model.
  • Step 2: Add the Second Tree – This is done such that when this second tree is added to the model, it lowers the Loss function based on 1st and 2nd order derivatives compared to the previous tree (where we also used learning rate eta).
  • Step 3: Finding the Direction of the Next Move – Using the first-order and second-order derivatives, we can find the direction in which the Loss function decreases the most. This is basically the gradient of the Loss function with regard to the output of the previous model.
  • Step 4: Splitting the nodes – To split the observations, XGBoost uses an Approximate Greedy Algorithm based on weighted quantiles that have a similar sum of weights. For finding the split value of a node, it doesn't consider all the candidate thresholds, but instead uses only the quantiles of that predictor.

Optimal Learning Rate can be determined by using Cross Validation & Grid Search.

Simple XGBoost Python Implementation

Imagine you have a dataset containing information about various houses and their prices. The dataset includes features like the number of bedrooms, bathrooms, the total area, the year built, and so on, and you want to predict the price of a house based on these features.

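Here is a hedged sketch of that scenario using the xgboost package (the feature values and prices below are synthetic, made up purely for illustration):

```python
# XGBoost sketch: predicting house prices from a few synthetic features.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(1, 6, n),        # bedrooms
    rng.integers(1, 4, n),        # bathrooms
    rng.integers(50, 400, n),     # total area (m^2)
    rng.integers(1950, 2023, n),  # year built
])
y = X[:, 2] * 3000 + X[:, 0] * 10000 + rng.normal(0, 20000, n)  # synthetic price

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4)
model.fit(X_train, y_train)
print(model.predict(X_test[:5]))  # predicted prices for five unseen houses
```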

Chapter 3: Feature Selection in Machine Learning

The pathway to building effective machine learning models often involves a critical question: which features should we include to generate reliable predictions while keeping the model simple and understandable? This is where subset selection plays a key role.

In Machine Learning, we often deal with a large number of features, and not all of them are important and informative for the model. Including irrelevant variables leads to unnecessary complexity in the Machine Learning model and affects the model's interpretability as well as its performance.

By removing these unimportant variables and keeping only the relatively informative features, we can get a model that is easier to interpret and possibly more accurate.

Let’s look at a specific example of a Machine Learning model for simplicity's sake.

Let’s assume that we are looking at a Multiple Linear Regression model (multiple independent variables and a single response/dependent variable) with a very large number of features. This model is likely to be hard to interpret. On top of that, it might result in inaccurate predictions, since some of those features may be unimportant and do not help to explain the response variable.

The process of selecting important variables in the model is called feature selection or variable selection. This process involves identifying a subset of the p variables that we believe to be related to the dependent or response variable. For this, we would run the regression for all possible combinations of independent variables and select the one that results in the best performing model.

There are various approaches you can use for Features Selection, usually broken down into the following 3 categories:

  • Subset Selection (Best Subset Selection, Step-Wise Feature Selection)
  • Regularisation Techniques (L1 Lasso, L2 Ridge Regressions)
  • Dimensionality Reduction Techniques (PCA)  

3.1 Subset Selection in Machine Learning

Subset Selection in machine learning is a technique designed to identify and use a subset of important features while omitting the rest. This helps create models that are easier to interpret and, in some cases, predict more accurately by avoiding overfitting.

When navigating through numerous features, it becomes vital to selectively choose the ones that significantly impact the predictive model. Subset selection provides a systematic approach to sifting through possible combinations of predictors. It aims to select a subset that effectively represents the data without unnecessary complexity.

  • Best Subset Selection: Examines all possible combinations and selects the most optimal set of predictors.
  • Stepwise Selection: Adds or removes predictors incrementally, which includes forward and backward stepwise selection.
  • Random Subset Selection: Chooses subsets randomly, introducing an element of randomness into model selection.

It’s a balance between using all available predictors, risking model overcomplexity and potential overfitting, and building a too-simple model that may overlook important data patterns.

In this section, we will explore these subset selection techniques. You'll learn how each approach works and affects model performance, ensuring that the models we build are reliable, simple, and effective.

3.1.1 Step-Wise Feature Selection Techniques

One of the popular subset selection techniques is the Step-Wise Feature Selection Technique. Let’s look at two different step-wise feature selection methods:

  • Forward Step-wise Selection
  • Backward Step-wise Selection

Forward Step-Wise Selection: The Forward Step-Wise Feature Selection technique starts with an empty Null model containing only an intercept. We then run p simple regressions and pick the variable whose model has the smallest RSS (Residual Sum of Squares). We do the same with 2-variable regressions, and continue until a stopping criterion is met.

So, Forward Step-Wise Selection begins with a model containing no predictors, and then adds predictors to the model, one at a time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

Forward Step-Wise Selection can be summarized as follows:

Step 1: Let M_0 be the null model, containing no features.

Step 2: For k = 0, …, p−1:

  • Consider all p−k models that augment the predictors in M_k with one additional feature or predictor.
  • Choose the best model among these p−k models, and call it M_(k+1), using performance metrics such as RSS or R-squared.

Step 3: Select the single model with the best performance among M_0, …, M_p (the one with the smallest Cross-Validation Error, C_p, AIC (Akaike Information Criterion), or BIC (Bayesian Information Criterion), or the largest adjusted R-squared, is your best model M*).

So, the idea behind this selection is to start simple and increase the number of predictors in the model. For each number of predictors k, the best model M_k is chosen; then these best M_k's, each with a different number of predictors, are compared, and the single best-performing one is selected.

When n < p – that is, when the number of observations is smaller than the number of predictors – ordinary least squares cannot fit the full model, and Forward Step-Wise Selection is one way to select features so that Linear Regression works in the first place.
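
Here's a minimal sketch of this procedure using scikit-learn's SequentialFeatureSelector (the dataset, the choice of 5 features, and cv=5 are illustrative assumptions; note that it scores candidates by cross-validation rather than raw RSS):

```python
# Forward stepwise selection sketch: add one feature at a time.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```

Setting direction="backward" gives Backward Step-Wise Selection, described next.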

Backward Step-wise Feature Selection: Unlike Forward Step-wise Selection, Backward Step-wise Selection starts with the full model containing all p predictors. The best model with p predictors is then selected.

Next, the algorithm removes, one at a time, the variable with the largest p-value, and the best model is selected again at each size.

Each time, the model is refitted to identify the least statistically significant variable until the stopping rule is reached (for example, all p-values need to be smaller than 5%). Then we compare all these models with different numbers of predictors (the best M_k's) and select the single model with the best performance among M_0, …, M_p (the one with the smallest Cross-Validation Error, C_p, AIC (Akaike Information Criterion), or BIC (Bayesian Information Criterion), or the largest adjusted R-squared, is your best model M*).

Backward Step-Wise Feature Selection can be summarized as follows:

Step 1: Let M_p be the full model, containing all features.

Step 2: For k = p, p−1, …, 1:

  • Consider all k models that contain all but one of the predictors in M_k, for a total of k − 1 features each.
  • Choose the best model among these k models, and call it M_(k−1), using performance metrics such as RSS or R-squared.

Like Forward Step-Wise Selection, the Backward Step-Wise technique searches through only about 1 + p(p+1)/2 models – far fewer than the 2^p models of Best Subset Selection – making it applicable in settings where p is too large for other selection techniques.

Also, Backward Step-Wise Feature Selection is not guaranteed to yield the best model containing a subset of the p predictors. It requires the number of observations or data points n to be larger than the number of model variables p, whereas Forward Step-Wise Selection can be used even when n < p.


3.2 Regularization in Machine Learning

Regularization, also known as Shrinkage, is a widely-used strategy to address the issue of overfitting in machine learning models.

The fundamental concept of regularization involves deliberately introducing a slight bias into the model, with the benefit of notably reducing its variance.

The term "Shrinkage" is derived from the method's ability to pull some of the estimated coefficients toward zero, imposing a penalty on them to prevent them from elevating the model's variance excessively.

Two prominent regularization techniques stand out in practice: Ridge Regression, which leverages the L2 norm, and Lasso Regression, employing the L1 norm.

3.2.1 Ridge Regression (L2 Regularization)

Let's explore an example of multiple linear regression, involving p independent variables or predictors used to model the dependent variable y.

It's worth remembering that Ordinary Least Squares (OLS), provided its assumptions are met, is a widely-adopted estimation technique for determining the parameters of linear regression. OLS seeks the optimal coefficients by minimizing the model's residual sum of squares (RSS). That is:

RSS = Σᵢ (yᵢ − β₀ − Σⱼ βⱼxᵢⱼ)²

where the β's represent the coefficient estimates for the different variables or predictors (X).

Ridge Regression is pretty similar to OLS, except that the coefficients are estimated by minimizing a slightly different cost or loss function. Namely, the Ridge Regression coefficient estimates are the βˆR values that minimize the following loss function:

Σᵢ (yᵢ − β₀ − Σⱼ βⱼxᵢⱼ)² + λ Σⱼ βⱼ² = RSS + λ Σⱼ βⱼ²

where λ (lambda, which is always non-negative, λ ≥ 0) is the tuning parameter or penalty parameter. As this formula shows, in the case of Ridge, the L2 penalty or L2 norm is used.

In this way, Ridge Regression penalizes the coefficients, shrinking them towards zero and reducing the overall model variance – but these coefficients never become exactly zero. So, the model parameters are never set to exactly 0, which means that all p predictors of the model remain intact.

L2 Norm (Euclidean Distance)

L2 norm is a mathematical term that comes from Linear Algebra. It stands for a Euclidean norm which can be represented as follows:

‖β‖₂ = √(β₁² + β₂² + … + βₚ²)

Tuning parameter λ: the tuning parameter λ serves to control the relative impact of the penalty on the regression coefficient estimates. When λ = 0, the penalty term has no effect, and ridge regression produces the ordinary least squares estimates. But as λ → ∞ (gets very large), the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates approach 0. Here's a visual representation of this:

(Figure: ridge coefficient estimates shrinking towards zero as λ grows.)

Why does Ridge Regression Work?

Ridge regression’s advantage over ordinary least squares comes from the earlier introduced bias-variance trade-off phenomenon. As λ, the penalty parameter, increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
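
A minimal Ridge sketch in scikit-learn, where the alpha argument plays the role of λ (the alpha values and dataset are illustrative; the features are standardized because the penalty is scale-sensitive):

```python
# Ridge sketch: coefficients shrink toward zero as alpha (lambda) grows,
# but never become exactly zero.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

for alpha in [0.01, 1, 100]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(alpha, ridge.coef_.round(1))
```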

3.2.2 Lasso Regression (L1 Regularization)

Lasso Regression overcomes this disadvantage of Ridge Regression – that all p predictors stay in the model. Namely, the Lasso Regression coefficient estimates are the βˆL values that minimize:

Σᵢ (yᵢ − β₀ − Σⱼ βⱼxᵢⱼ)² + λ Σⱼ |βⱼ| = RSS + λ Σⱼ |βⱼ|

As with Ridge Regression, the Lasso shrinks the coefficient estimates towards zero. But in the case of the Lasso, the L1 penalty or L1 norm is used, which has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.

So, like many feature selection techniques, Lasso Regression performs variable selection besides solving the overfitting problem.


L1 Norm (Manhattan Distance)

L1 norm is a mathematical term that comes from Linear Algebra. It stands for a Manhattan norm which can be represented as follows:

‖β‖₁ = |β₁| + |β₂| + … + |βₚ|

Why does Lasso Regression Work?

Like Ridge Regression, Lasso Regression's advantage over ordinary least squares comes from the bias-variance trade-off introduced earlier. As λ increases, the flexibility of the lasso regression fit decreases. This leads to decreased variance but increased bias. Additionally, Lasso also performs feature selection.
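
A matching Lasso sketch (again with an illustrative dataset and alpha values); unlike Ridge, a sufficiently large alpha drives some coefficients to exactly zero:

```python
# Lasso sketch: larger alpha (lambda) zeroes out more coefficients.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

for alpha in [0.1, 1, 10]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(alpha, "zeroed:", int((lasso.coef_ == 0).sum()), lasso.coef_.round(1))
```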

3.2.3 Lasso vs Ridge Regression

Lasso Regression shrinks the coefficient estimates towards zero and even forces some of these coefficients to be exactly equal to zero when the tuning parameter λ is sufficiently large. So, unlike Ridge, Lasso performs variable selection in addition to addressing overfitting.

The comparison between Ridge Regression and Lasso Regression becomes clear when the two earlier graphs are put next to each other:

(Figure: the Ridge and Lasso coefficient paths side by side – Ridge coefficients shrink towards zero but never reach it, while Lasso coefficients hit exactly zero.)

If you want to learn regularization in detail, read this tutorial:

Chapter 4: Resampling Techniques in Machine Learning

When we have only training data and we want to make judgments about the performance of the model on unseen data, we can use Resampling Techniques to create artificial test data.

Resampling Techniques are often divided into two categories: Cross-Validation and Bootstrapping. They're usually used for the following three purposes:

  • Model Assessment: evaluate the model performance (to compute test error rate)
  • Model Variance: compute the variance of the model to check how generalizable your model is
  • Model Selection: select model flexibility

For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ.

4.1 Cross-Validation

Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to perform:

  • Model assessment: to evaluate its performance by calculating the test error rate
  • Model selection: to select the appropriate level of flexibility.

You hold out a subset of the training observations from the fitting process, and then apply the statistical learning method to those held out observations.

CV is usually divided into the following three categories:

  • Validation Set Approach
  • K-fold Cross Validation (K-fold CV)
  • Leave One Out Cross Validation (LOOCV)

4.1.1 Validation Set Approach

This is a simple approach to randomly split the data into training and validation sets. This approach usually uses Sklearn’s train_test_split() function.

The model is then trained on the training data (usually 80% of the data) and used to predict the values for the hold-out or Validation Set (usually 20% of the data). The error on this held-out set serves as the estimate of the test error rate.
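
A minimal sketch of the Validation Set approach (the 80/20 split and the dataset are illustrative):

```python
# Validation set approach: train on 80%, estimate the test error on the held-out 20%.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("Validation MSE:", mean_squared_error(y_val, model.predict(X_val)))
```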

4.1.2 Leave One Out Cross Validation (LOOCV)

LOOCV is similar to the Validation Set approach, but each time it leaves one observation out of the training set, uses the remaining n−1 observations to train the model, and calculates the MSE for that single prediction. So, in the case of LOOCV, the model has to be fit n times (where n is the number of observations).

Then this process is repeated for all observations and n times MSEs are calculated. The mean of the MSEs is the Cross-Validation error rate and can be expressed as follows:

CV(n) = (1/n) Σᵢ₌₁ⁿ MSEᵢ

4.1.3 K-fold Cross Validation (K-fold CV)

K-Fold CV is the middle ground between the Validation Set approach (high variance and high bias, but computationally efficient) and LOOCV (low bias but high variance, and computationally inefficient).

In K-Fold CV, the data is randomly split into K equally sized samples (K folds). Each time, one fold is used as validation and the rest as training, and the model is fit K times. The mean of the K MSEs forms the cross-validation test error rate.

Note that the LOOCV is a special case of K-fold CV where K = N, and can be expressed as follows:

CV(K) = (1/K) Σₖ₌₁ᴷ MSEₖ
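
A minimal sketch computing both the K-fold and LOOCV error estimates (the dataset and K = 5 are illustrative; scikit-learn reports negative MSE, so the sign is flipped):

```python
# K-fold CV and LOOCV error estimates for the same model.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_mse = -cross_val_score(model, X, y, cv=kfold,
                             scoring="neg_mean_squared_error").mean()
loocv_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error").mean()
print("5-fold CV MSE:", kfold_mse, "| LOOCV MSE:", loocv_mse)
```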


4.2 Selecting Optimal k in K-fold CV

The choice of k in K-fold is a matter of Bias-Variance Trade-Off and the efficiency of the model. Usually, K-Fold CV and LOOCV provide similar results and their performance can be evaluated using simulated data.

However, LOOCV has lower bias (unbiased) compared to K-fold CV because LOOCV uses more training data than K-fold CV does. But LOOCV has higher variance than K-fold does because LOOCV is fitting the model on almost identical data for each item and the outcomes are highly correlated compared to the outcomes of K-Fold which are less correlated.

Since the mean of highly correlated outcomes has higher variance than the one of less correlated outcomes, the LOOCV variance is higher.

  • K = N (LOOCV): the larger the K → higher variance and lower bias
  • Small K (e.g., K = 2): the smaller the K → lower variance and higher bias

Taking this information into account, we can calculate the performance of the model for various values of K, let's say K = 3, 5, 6, 7, …, 10, or the Type I, Type II, and total classification error in the case of a classification model. Then the best performing model's K can be taken as the optimal K, using the idea of the ROC curve (classification case) or the Elbow method (regression case).


4.3 Bootstrapping

Bootstrapping is another very popular resampling technique that is used for various purposes. One of them is to effectively estimate the variability of the estimates/models or to create artificial samples from an existing sample and improve model performance (like in the case of Bagging or Random Forest).

It is used in many situations where it's hard or even impossible to directly compute the standard deviation of a quantity of interest.

  • It's a very useful way to quantify the uncertainty associated with the statistical learning method and obtain the standard errors/measure of variability.
  • It's not that useful for Linear Regression, since standard R/Python output already provides these results (the standard errors of the coefficients).

Bootstrapping is extremely handy for other methods as well where variability is more difficult to quantify. The bootstrap sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set.

So, Bootstrapping takes the original training sample and resamples from it by replacement, resulting in B different samples. Then for each of these simulated samples, the coefficient estimate is computed. Then, by taking the mean of these coefficient estimates and using the common formula for SE, we calculate the Standard Error of the Bootstrapped model.
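
A minimal bootstrap sketch estimating the standard error of a regression coefficient (the simulated data and B = 1000 resamples are illustrative):

```python
# Bootstrap: resample with replacement, refit, and take the SD of the estimates.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(size=200)  # true coefficient is 2.0

B = 1000
coefs = [LinearRegression().fit(*resample(X, y)).coef_[0] for _ in range(B)]
print("Bootstrapped SE of the coefficient:", np.std(coefs))
```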

Read more about it here.

Chapter 5: Optimization Techniques

Knowing the fundamentals of Machine Learning models and learning how to train them is definitely a big part of becoming a technical Data Scientist. But that's only a part of the job.

In order to use a Machine Learning model to solve a business problem, you need to optimize it after you have established its baseline. That is, you need to tune the hyperparameters of your Machine Learning model to find the combination of values that results in the best performing model (all else being equal).

So, to optimize or tune your Machine Learning model, you need to perform hyperparameter optimization. By finding the optimal combination of hyperparameter values, we can decrease the errors the model produces and build the most accurate model.

A model hyperparameter is a constant in the model. It's external to the model, and its value cannot be estimated from data (it must be specified in advance, before the model is trained). For instance, k in k-Nearest Neighbors (kNN) or the number of hidden layers in Neural Networks.

Hyperparameter optimization methods are usually categorized into:

  • Exhaustive Search or Brute Force Approach (like Grid Search)
  • Gradient Descent (Batch GD, SGD, SGD with Momentum, Adam)
  • Genetic Algorithms

In this handbook, I will discuss only the first two types of optimisation techniques.

5.1 Brute Force Approach (Grid Search)

Exhaustive Search (often referred to as Grid Search or the Brute Force Approach) is the process of looking for the most optimal hyperparameters by checking each candidate combination of hyperparameter values and computing the model's error rate.

Once we create the list of possible values for each of the hyperparameters, for every possible combination of hyper parameter values, we calculate the model error rate and compare it to the current optimal model (one with minimum error rate). During each iteration, the optimal model is updated if the new parameter values result in lower error rate.

The optimisation method is simple. For instance, if you are working with a K-means clustering algorithm, you can manually search for the right number of clusters. But if there are hundreds or thousands of possible combinations of hyperparameter values to consider, the model can take hours or days to train – it becomes incredibly heavy and slow. So most of the time, brute-force search is inefficient.
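
For small grids, though, it remains the standard first tool. Here's a minimal sketch with scikit-learn's GridSearchCV (the estimator, the grid values, and cv=5 are illustrative assumptions):

```python
# Grid search: try every combination in param_grid with 5-fold CV.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```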


When it comes to Gradient Descent-type optimisation techniques, the variants – such as Batch Gradient Descent and Stochastic Gradient Descent – differ in terms of the amount of data used to compute the gradient of the Loss or Cost function.

Let's define this Loss Function by J(θ) where θ (theta) represents the parameter we want to optimize.

The amount of data used is a trade-off between the accuracy of the parameter update and the time it takes to perform such an update. Namely, the larger the data sample we use, the more accurate the adjustment of a parameter – but the much slower the process.

The opposite holds true as well. The smaller the data sample, the less accurate will be the adjustments in the parameter but the process will be much faster.

5.2 Gradient Descent Optimization (GD)

The Batch Gradient Descent algorithm (often just referred to as Gradient Descent or GD), computes the gradient of the Loss Function J(θ) with respect to the target parameter using the entire training data.

We do this by first predicting the values for all observations in each iteration, and comparing them to the given value in the training data. These two values are used to calculate the prediction error term per observation which is then used to update the model parameters. This process continues until the model converges.

The gradient or the first order derivative of the loss function can be expressed as follows:

∇J(θ) = ∂J(θ)/∂θ = (∂J/∂θ₁, …, ∂J/∂θₚ)

Then, this gradient is used to update the previous iterations’ value of the target parameter. That is:

θ = θ − η · ∇J(θ)

  • θ : This represents the parameter(s) or weight(s) of a model that you are trying to optimize. In many contexts, especially in neural networks, θ can be a vector containing many individual weights.
  • η : This is the learning rate. It's a hyperparameter that dictates the step size at each iteration while moving towards a minimum of the cost function. A smaller learning rate might make the optimization more precise but could also slow down convergence, while a larger learning rate might speed up convergence but risks overshooting the minimum. It can lie in [0, 1] but is usually a small number, often between 0.001 and 0.04.
  • ∇J(θ): This is the gradient of the cost function J with respect to the parameter θ. It indicates the direction and magnitude of the steepest increase of J. By subtracting this from the current parameter value (multiplied by the learning rate), we adjust θ in the direction of the steepest decrease of J.
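
Putting the update rule together, here's a from-scratch sketch of batch GD for linear regression (the simulated data, learning rate, and iteration count are illustrative):

```python
# Batch gradient descent on a least-squares loss, using the full dataset per step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)
eta = 0.1
for _ in range(500):
    grad = -2 / len(y) * X.T @ (y - X @ theta)  # gradient of the MSE w.r.t. theta
    theta = theta - eta * grad                  # theta := theta - eta * grad J(theta)
print(theta)  # should be close to [3, -1]
```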

There are two major disadvantages to GD that make this optimization technique less popular, especially when dealing with large and complex datasets. Since each iteration uses (and stores) the entire training data, the computation time can be very large, resulting in an incredibly slow process. On top of that, storing that large amount of data creates memory issues, making GD computationally heavy and slow.


5.3 Stochastic Gradient Descent (SGD)

The Stochastic Gradient Descent (SGD) method, also known as Incremental Gradient Descent, is an iterative approach for solving optimisation problems with a differentiable objective function, exactly like GD.

But unlike GD, SGD doesn't use the entire batch of training data to update the parameter values in each iteration. The SGD method is often referred to as the stochastic approximation of gradient descent, and it aims to find the extreme or zero points of a stochastic model containing parameters that cannot be directly estimated.

SGD minimises this cost function by sweeping through data in the training dataset and updating the values of the parameters in every iteration.

In SGD, all model parameters are improved in each iteration step with only one training sample. So, instead of going through all training samples at once to modify the model parameters, the SGD algorithm improves the parameters by looking at a single, randomly sampled training example (hence the name Stochastic). That is:

θ = θ − η · ∇J(θ, x⁽ⁱ⁾, y⁽ⁱ⁾)

  • η : This is the learning rate. It's a hyperparameter that dictates the step size at each iteration while moving towards a minimum of the cost function. A smaller learning rate might make the optimization more precise but could also slow down the convergence process, while a larger learning rate might speed up convergence but risks overshooting the minimum.
  • ∇ J ( θ , x ( i ), y ( i )): This is the gradient of the cost function J with respect to the parameter θ for a given input x ( i ) and its corresponding target output y ( i ). It indicates the direction and magnitude of the steepest increase of J . By subtracting this from the current parameter value (multiplied by the learning rate), we adjust θ in the direction of the steepest decrease of J .
  • x ( i ): This represents the ith input data sample from your dataset.
  • y ( i ): This is the true target output for the ith input data sample.

In the context of Stochastic Gradient Descent (SGD), the update rule applies to individual data samples x ( i ) and y ( i ) rather than the entire dataset, which would be the case for batch Gradient Descent.

This single-sample step speeds up the process of finding the minimum of the optimization problem, and this is what differentiates SGD from GD. So, SGD consistently adjusts the parameters in an attempt to move in the direction of the global minimum of the objective function.

SGD addresses the slow computation time issue of GD, because it scales well with both big data and with a size of the model. But even though SGD method itself is simple and fast, it is known as a “bad optimizer” because it's prone to finding a local optimum instead of a global optimum.
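
The same toy problem as above, solved with a from-scratch SGD loop (one randomly sampled observation per update; all constants are illustrative):

```python
# SGD: update theta using the gradient of a single random sample per step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)
eta = 0.01
for _ in range(5000):
    i = rng.integers(len(y))                  # pick one random training sample
    grad = -2 * (y[i] - X[i] @ theta) * X[i]  # noisy single-sample gradient
    theta = theta - eta * grad                # theta := theta - eta * grad J(theta; x_i, y_i)
print(theta)
```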


5.4 SGD with Momentum

When the error function is complex and non-convex, instead of finding the global optimum, the SGD algorithm mistakenly moves in the direction of numerous local minima. This results in higher computation time.

In order to address this issue and further improve the SGD algorithm, various methods have been introduced. One popular way of escaping a local minimum and moving in the direction of the global minimum is SGD with Momentum.

The goal of the SGD method with momentum is to accelerate gradient vectors in the direction of the global minimum, resulting in faster convergence.

The idea behind the momentum is that the model parameters are learned by using the directions and values of previous parameter adjustments. Also, the adjustment values are calculated in such a way that more recent adjustments are weighted heavier (they get larger weights) compared to the very early adjustments (they get smaller weights).

The reason for this difference is that with the SGD method we do not determine the exact derivative of the loss function, but we estimate it on a small batch. Since the gradient is noisy, it is likely that it will not always move in the optimal direction.

Momentum then helps to estimate those derivatives more accurately, resulting in better direction choices when moving towards the global minimum.
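
A from-scratch sketch of the momentum update (gamma = 0.9 is an assumed, typical momentum coefficient; data and learning rate are illustrative):

```python
# SGD with momentum: the velocity v is an exponentially weighted sum of
# past gradients, so recent adjustments get larger weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=100)

theta, v = np.zeros(2), np.zeros(2)
gamma, eta = 0.9, 0.01
for _ in range(3000):
    i = rng.integers(len(y))
    grad = -2 * (y[i] - X[i] @ theta) * X[i]  # noisy single-sample gradient
    v = gamma * v + eta * grad                # accumulate a smoothed direction
    theta = theta - v
print(theta)  # should approach [3, -1]
```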

Another reason for the difference in the performance of classical SGD and SGD with Momentum lies in the area referred to as Pathological Curvature, also called the ravine area.

Pathological Curvature, or a ravine area, can be pictured as follows: the orange line represents the path taken by the gradient-based method, while the dark blue line represents the ideal path towards the global optimum.

(Figure: the gradient path (orange) zig-zagging through a ravine versus the ideal path (dark blue).)

To visualise the difference between SGD and SGD with Momentum, compare the two side by side.

(Figure: two gradient paths – SGD without Momentum on the left, SGD with Momentum on the right; the orange pattern represents the path of the gradient in its search for the global minimum.)

5.5 Adam Optimizer

Another popular technique for enhancing SGD optimization procedure is the Adaptive Moment Estimation (Adam) introduced by Kingma and Ba (2015). Adam is the extended version of the SGD with the momentum method.

The main difference compared to the SGD with momentum, which uses a single learning rate for all parameter updates, is that the Adam algorithm defines different learning rates for different parameters.

The algorithm calculates individual adaptive learning rates for each parameter based on estimates of the first two moments of the gradients (the mean and the uncentered variance of the gradients).

So, each parameter has a unique learning rate, which is updated using an exponentially decaying average of the first moments (the mean) and second moments (the variance) of the gradients:

m_t = β₁ · m_(t−1) + (1 − β₁) · ∇J(θ_t)
v_t = β₂ · v_(t−1) + (1 − β₂) · (∇J(θ_t))²
θ_(t+1) = θ_t − η · m̂_t / (√v̂_t + ε)

where m̂_t and v̂_t are the bias-corrected first and second moment estimates.
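
A from-scratch sketch of the Adam update on the same toy regression (the beta values are the paper's suggested defaults; the learning rate and data are illustrative):

```python
# Adam: per-parameter learning rates from decaying averages of the
# gradient (first moment) and the squared gradient (second moment).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 3001):
    i = rng.integers(len(y))
    g = -2 * (y[i] - X[i] @ theta) * X[i]  # noisy single-sample gradient
    m = beta1 * m + (1 - beta1) * g        # first moment (mean)
    v = beta2 * v + (1 - beta2) * g**2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)             # bias corrections
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)
```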

Key Takeaways & What Comes Next

In this handbook, we've covered the essentials and beyond in machine learning. From the basics to advanced techniques, we've unpacked popular ML algorithms used globally in tech and the key optimization methods that power them.

While learning about each concept, we saw some practical examples and Python code, ensuring that you're not just understanding the theory but also its application.

Your Machine Learning journey is ongoing, and this guide is your reference. It's not a one-time read – it's a resource to revisit as you progress and flourish in this field. With this knowledge, you're ready to tackle most of the real-world ML challenges confidently at a high level. But this is just the beginning.

About the Author — That’s Me!

I am Tatev, a Senior Machine Learning and AI Researcher. I have had the privilege of working in Data Science across numerous countries, including the US, UK, Canada, and the Netherlands.

With an MSc and BSc in Econometrics under my belt, my journey in Machine Learning and AI has been nothing short of incredible. Drawing from my technical studies during my Bachelor's and Master's, along with over 5 years of hands-on experience in the Data Science industry, in Machine Learning and AI, I've gathered this high-level summary of ML topics to share with you.

How Can You Dive Deeper?

After studying this guide, if you're keen to dive even deeper and structured learning is your style, consider joining us at LunarTech. Follow the course "Fundamentals to Machine Learning," a comprehensive program that offers an in-depth understanding of the theory, hands-on practical implementation, extensive practice material, and tailored interview preparation to set you up for success at your own pace.

This course is also a part of The Ultimate Data Science Bootcamp, which has earned the recognition of being one of the Best Data Science Bootcamps of 2023 and has been featured in esteemed publications like Forbes, Yahoo, Entrepreneur, and more. This is your chance to be a part of a community that thrives on innovation and knowledge. You can enroll for a Free Trial of The Ultimate Data Science Bootcamp at LunarTech.


Connect with Me:


  • Follow me on LinkedIn for a ton of Free Resources in ML and AI
  • Visit my Personal Website
  • Subscribe to my The Data Science and AI Newsletter


Want to discover everything about a career in Data Science, Machine Learning and AI, and learn how to secure a Data Science job? Download this FREE Data Science and AI Career Handbook

Thank you for choosing this guide as your learning companion. As you continue to explore the vast field of machine learning, I hope you do so with confidence, precision, and an innovative spirit. Best wishes in all your future endeavors!

Co-founder of LunarTech, I harness the power of Statistics, Machine Learning, and Artificial Intelligence to deliver transformative solutions. Applied Data Scientist, MSc/BSc Econometrics.


The University of Chicago

Machine Learning: Concepts and Applications


Instructor: Dr. Nick Feamster


Recommended experience

Intermediate level

Basic Python and Linear Algebra

Skills you'll gain

  • Unsupervised Learning
  • Artificial Neural Network
  • Machine Learning
  • Statistical Classification


There are 9 modules in this course

This course gives you a comprehensive introduction to both the theory and practice of machine learning. You will learn to use Python along with industry-standard libraries and tools, including Pandas, Scikit-learn, and Tensorflow, to ingest, explore, and prepare data for modeling and then train and evaluate models using a wide variety of techniques. Those techniques include linear regression with ordinary least squares, logistic regression, support vector machines, decision trees and ensembles, clustering, principal component analysis, hidden Markov models, and deep learning.

A key feature of this course is that you not only learn how to apply these techniques, you also learn the conceptual basis underlying them so that you understand how they work, why you are doing what you are doing, and what your results mean. The course also features real-world datasets, drawn primarily from the realm of public policy. It is based on an introductory machine learning course offered to graduate students at the University of Chicago and will serve as a strong foundation for deeper and more specialized study.

Machine Learning and the Machine Learning Pipeline

In this module you will be introduced to the machine-learning pipeline and learn about the initial work on your data that you need to do prior to modeling. You will learn about how to ingest data using Pandas, a standard Python library for data exploration and preparation. Next, we turn to the first approach to modeling that we explore in this class, linear regression with ordinary least squares.

What's included

6 videos 2 quizzes 3 ungraded labs

6 videos • Total 49 minutes

  • Course Introduction • 5 minutes • Preview module
  • The Data Science Pipeline • 15 minutes
  • Data Ingestion and Exploration • 3 minutes
  • Lab Walkthrough: Data Exploration with Pandas • 10 minutes
  • Supervised Learning, Linear Models, and Least Squares • 10 minutes
  • Lab Walkthrough: Linear Regression • 3 minutes

2 quizzes • Total 60 minutes

  • Working with Data • 30 minutes
  • Introduction to Linear Regression • 30 minutes

3 ungraded labs • Total 180 minutes

  • Data Basics: Numpy and Pandas • 60 minutes
  • Data Exploration with Pandas • 60 minutes
  • Linear Regression • 60 minutes

Least Squares and Maximum Likelihood Estimation

In this module, you continue the work that we began in the last module with linear regressions. You will learn more about how to evaluate such models and how to select the important features and exclude the ones that are not statistically significant. You will also learn about maximum likelihood estimation, a probabilistic approach to estimating your models.

4 videos 2 quizzes 1 programming assignment 2 ungraded labs

4 videos • Total 16 minutes

  • Linear Regression and Least Squares • 5 minutes • Preview module
  • Lab Walkthrough: Linear Regression on the Prostate Cancer Dataset • 3 minutes
  • Maximum Likelihood Estimation • 5 minutes
  • Lab Walkthrough: Linear Regression and Maximum Likelihood Estimation • 2 minutes

2 quizzes • Total 60 minutes

  • Linear Regression • 30 minutes
  • Maximum Likelihood Estimation • 30 minutes

1 programming assignment • Total 120 minutes

  • Graded Quiz: Manipulating Data & Linear Regressions • 120 minutes

2 ungraded labs • Total 120 minutes

  • Linear Regression on the Prostate Cancer Dataset • 60 minutes
  • Linear Regression and Maximum Likelihood Estimation • 60 minutes

Basis Functions and Regularization

This module introduces you to basis functions and polynomial expansions in particular, which will allow you to use the same linear regression techniques that we have been studying so far to model non-linear relationships. Then, we learn about the bias-variance tradeoff, a key relationship in machine learning. Methods like polynomial expansion may help you train models that capture the relationship in your training data quite well, but those same models may perform badly on new data. You learn about different regularization methods that can help balance this tradeoff and create models that avoid overfitting.

4 videos 2 quizzes 2 ungraded labs

4 videos • Total 25 minutes

  • Basis Functions • 5 minutes • Preview module
  • Lab Walkthrough: Features and Basis Functions • 4 minutes
  • Regularization and the Bias-Variance Tradeoff • 10 minutes
  • Lab Walkthrough: Linear Regression: Regularization • 4 minutes

2 quizzes • Total 60 minutes

  • Polynomial Feature Expansion • 30 minutes
  • Regularization • 30 minutes

2 ungraded labs • Total 120 minutes

  • Features and Basis Functions • 60 minutes
  • Linear Regression: Regularization • 60 minutes

Model Selection and Logistic Regression

In this module, you first learn more about evaluating and tuning your models. We look at cross validation techniques that will help you get more accurate measurements of your model's performance, and then you see how to use them along with pipelines and GridSearch to tune your models. Finally, we look at the theory and practice of our first technique for classification, logistic regression.

4 videos • Total 23 minutes

  • Model Selection and Cross Validation • 7 minutes • Preview module
  • Lab Walkthrough: Model Selection and Pipelines • 5 minutes
  • Logistic Regression • 7 minutes
  • Lab Walkthrough: Logistic Regression • 3 minutes

2 quizzes • Total 60 minutes

  • Model Tuning and Selection • 30 minutes
  • Logistic Regression • 30 minutes

2 ungraded labs • Total 120 minutes

  • Model Selection and Pipelines • 60 minutes
  • Logistic Regression • 60 minutes

More Classifiers: SVMs and Naive Bayes

You will learn about two more classification techniques in this module: first, Support Vector Machines (SVMs) and then Naive Bayes, a quick and highly interpretable approach that uses Bayes' theorem.

4 videos 3 quizzes 3 ungraded labs

4 videos • Total 24 minutes

  • Support Vector Machines • 8 minutes • Preview module
  • Lab Walkthrough: Support Vector Machines • 3 minutes
  • Naive Bayes Classification • 7 minutes
  • Naive Bayes Classification Example • 4 minutes

3 quizzes • Total 150 minutes

  • Graded Quiz: Model Evaluation • 90 minutes
  • Classification with SVMs • 30 minutes
  • Naive Bayes Classifiers • 30 minutes

3 ungraded labs • Total 120 minutes

  • SVMs • 60 minutes
  • Naive Bayes Classification Example • 60 minutes
  • Starter Code for the Quiz • 0 minutes

Tree-Based Models, Ensemble Methods, and Evaluation

In this module, you will first learn about classification using decision trees. We will see how to create models that use individual decision trees, and then ensemble models, which use many trees, such as bagging, boosting, and random forests. Then, we learn more about how to evaluate the performance of classifiers.

5 videos 3 quizzes 3 ungraded labs

5 videos • Total 30 minutes

  • Tree-Based Models • 8 minutes • Preview module
  • Ensembles, Bagging, and Boosting • 6 minutes
  • Lab Walkthrough: Trees and Forests • 4 minutes
  • Evaluation Metrics • 8 minutes
  • Lab Walkthrough: Evaluation • 3 minutes

3 quizzes • Total 180 minutes

  • Trees and Forests Quiz • 120 minutes
  • Trees and Ensembles • 30 minutes
  • Evaluating Models • 30 minutes

Ungraded labs

  • Trees and Forests • 60 minutes
  • Evaluation • 60 minutes

Clustering Methods

To this point, we have been focusing on supervised learning and training models that estimate a target variable that you have specified. In this module, we take our first look at unsupervised learning, a domain of machine learning that uses techniques to find patterns and relationships in data without you ever defining a target. In particular, we look at a variety of clustering techniques, beginning with k-means and hierarchical clustering, and then distribution and density-based clustering.

4 videos • Total 27 minutes

  • Unsupervised Learning (K-Means, Hierarchical) • 11 minutes • Preview module
  • Lab Walkthrough: Clustering • 2 minutes
  • Clustering (KDE, Meanshift, DBSCAN) • 10 minutes
  • Lab Walkthrough: Density and Distribution-Based Clustering • 2 minutes

2 quizzes • Total 60 minutes

  • K-Means and Hierarchical Clustering • 30 minutes
  • Clustering II • 30 minutes

2 ungraded labs • Total 120 minutes

  • Clustering • 60 minutes
  • Density and Distribution-Based Clustering • 60 minutes

Dimensionality Reduction and Temporal Models

You will look at two new techniques in this module. The first is Principal Component Analysis, a powerful dimensionality reduction technique that you can use to project high-dimensional features into lower-dimensional spaces. This can be used for a range of purposes, including feature selection, preventing overfitting, visualizing higher-dimensional data in two- or three-dimensional spaces, and more. Then, you will study hidden Markov models, a technique that you can use to model sequences of states, where each state depends on the one that came before.

4 videos • Total 25 minutes

  • Principal Component Analysis (PCA) • 6 minutes • Preview module
  • Lab Walkthrough: Principal Component Analysis • 5 minutes
  • Temporal Models and Hidden Markov Models • 13 minutes
  • Lab Walkthrough: Hidden Markov Models • 1 minute

2 quizzes • Total 60 minutes

  • Principal Component Analysis • 30 minutes
  • HMMs • 30 minutes

2 ungraded labs • Total 120 minutes

  • Principal Component Analysis (PCA) • 60 minutes
  • Hidden Markov Models on Divvy Bike Trips • 60 minutes

Deep Learning

This module introduces you to one of the most hyped topics in machine learning, deep learning with feed-forward neural networks and convolutional neural networks. You will learn about how these techniques work and where they might be very effective--or very ineffective. We explore how to design, implement, and evaluate such models using Python and Keras.

4 videos • Total 22 minutes

  • Feed-Forward Neural Networks • 10 minutes • Preview module
  • Lab Walkthrough: Feed Forward Neural Networks • 3 minutes
  • Convolutional Neural Networks • 8 minutes
  • Lab Walkthrough: Convolutional Neural Nets • 1 minute

2 quizzes • Total 60 minutes

  • Neural Networks • 30 minutes
  • Convolutional Neural Nets • 30 minutes

2 ungraded labs • Total 120 minutes

  • Feed-forward Neural Nets • 60 minutes
  • Convolutional Neural Nets • 60 minutes


One of the world's premier academic and research institutions, the University of Chicago has driven new ways of thinking since our 1890 founding. Today, UChicago is an intellectual destination that draws inspired scholars to our Hyde Park and international campuses, keeping UChicago at the nexus of ideas that challenge and change the world.



Bloomberg ML EDU presents:

Foundations of Machine Learning

Understand the Concepts, Techniques and Mathematical Frameworks Used by Experts in Machine Learning

About This Course

Bloomberg presents "Foundations of Machine Learning," a training course that was initially delivered internally to the company's software engineers as part of its "Machine Learning EDU" initiative. This course covers a wide variety of topics in machine learning and statistical modeling. The primary goal of the class is to help participants gain a deep understanding of the concepts, techniques and mathematical frameworks used by experts in machine learning. It is designed to make valuable machine learning skills more accessible to individuals with a strong math background, including software developers, experimental scientists, engineers and financial professionals.

The 30 lectures in the course are embedded below, but may also be viewed in this YouTube playlist . The course includes a complete set of homework assignments, each containing a theoretical element and implementation challenge with support code in Python, which is rapidly becoming the prevailing programming language for data science and machine learning in both academia and industry. This course also serves as a foundation on which more specialized courses and further independent study can build.

Please fill out this short online form to register for access to our course's Piazza discussion board. Applications are processed manually, so please be patient. You should receive an email directly from Piazza when you are registered. Common questions from this and previous editions of the course are posted in our FAQ .

The first lecture, Black Box Machine Learning , gives a quick start introduction to practical machine learning and only requires familiarity with basic programming concepts.

Highlights and Distinctive Features of the Course Lectures, Notes, and Assignments

  • Geometric explanation for what happens with ridge, lasso, and elastic net regression in the case of correlated random variables.
  • Investigation of when the penalty (Tikhonov) and constraint (Ivanov) forms of regularization are equivalent.
  • Concise summary of what we really learn about SVMs from Lagrangian duality.
  • Proof of representer theorem with simple linear algebra, emphasizing it as a way to reparametrize certain objective functions.
  • Guided derivation of the math behind the classic diamond/circle/ellipsoids picture that "explains" why L1 regularization gives sparsity (Homework 2, Problem 5)
  • From scratch (in numpy) implementation of almost all major ML algorithms we discuss: ridge regression with SGD and GD (Homework 1, Problems 2.5, 2.6 page 4), lasso regression with the shooting algorithm (Homework 2, Problem 3, page 4), kernel ridge regression (Homework 4, Problem 3, page 2), kernelized SVM with Kernelized Pegasos (Homework 4, 6.4, page 9), L2-regularized logistic regression (Homework 5, Problem 3.3, page 4), Bayesian Linear Regression (Homework 5, problem 5, page 6), multiclass SVM (Homework 6, Problem 4.2, p. 3), classification and regression trees (without pruning) (Homework 6, Problem 6), gradient boosting with trees for classification and regression (Homework 6, Problem 8), multilayer perceptron for regression (Homework 7, Problem 4, page 3)
  • Repeated use of a simple 1-dimensional regression dataset, so it's easy to visualize the effect of various hypothesis spaces and regularizations that we investigate throughout the course.
  • Investigation of how to derive a conditional probability estimate from a predicted score for various loss functions, and why it's not so straightforward for the hinge loss (i.e. the SVM) (Homework 5, Problem 2, page 1)
  • Discussion of numerical overflow issues and the log-sum-exp trick (Homework 5, Problem 3.2)
  • Self-contained introduction to the expectation maximization (EM) algorithm for latent variable models.
  • Develop a general computation graph framework from scratch, using numpy, and implement your neural networks in it.

Prerequisites

The quickest way to see if the mathematics level of the course is for you is to take a look at this mathematics assessment , which is a preview of some of the math concepts that show up in the first part of the course.

  • Solid mathematical background , equivalent to a 1-semester undergraduate course in each of the following: linear algebra, multivariate differential calculus, probability theory, and statistics. The content of NYU's DS-GA-1002: Statistical and Mathematical Methods would be more than sufficient, for example.
  • Python programming required for most homework assignments.
  • Recommended: At least one advanced, proof-based mathematics course
  • Recommended: Computer science background up to a "data structures and algorithms" course
  • (HTF) refers to Hastie, Tibshirani, and Friedman's book The Elements of Statistical Learning
  • (SSBD) refers to Shalev-Shwartz and Ben-David's book Understanding Machine Learning: From Theory to Algorithms
  • (JWHT) refers to James, Witten, Hastie, and Tibshirani's book An Introduction to Statistical Learning

Assignments

GD, SGD, and Ridge Regression

Lasso Regression

SVM and Sentiment Analysis

Kernel Methods

Probabilistic Modeling

Multiclass, Trees, and Gradient Boosting

Computation Graphs, Backpropagation, and Neural Networks


Other tutorials and references

  • Carlos Fernandez-Granda's lecture notes provide a comprehensive review of the prerequisite material in linear algebra, probability, statistics, and optimization.
  • Brian Dalessandro's iPython notebooks from DS-GA-1001: Intro to Data Science
  • The Matrix Cookbook has lots of facts and identities about matrices and certain probability distributions.
  • Stanford CS229: "Review of Probability Theory"
  • Stanford CS229: "Linear Algebra Review and Reference"
  • Math for Machine Learning by Hal Daumé III


David S. Rosenberg

Teaching Assistants

COS 402: Artificial Intelligence

Dean's date reminder:  Unlike the other homeworks, this one must be turned in on time.  This final homework is due on "dean's date," the latest possible due date allowed by university policy.  As per university rules, this also means that you will need written permission from the appropriate dean to turn it in late.

Part I: Written Exercises

The written exercises are available here in pdf.

Part II:  Programming

The topic of this assignment is machine learning for supervised classification problems.  Here are the main components of the assignment:

  • Implementation of the machine learning algorithm of your choice.
  • Comparison of your learning algorithm to those implemented by your fellow students on a small set of benchmark datasets.
  • A systematic experiment of your choice using your algorithm.
  • A short written report describing and discussing what you did and what results you got.

For this assignment, you may choose to work individually or in pairs.  You are encouraged to work in pairs since you are likely to learn more, have more fun and have an easier time overall.  (The written exercises should still be done individually though.)

Note that part of this assignment must be turned in by Thursday, January 8.  See "What to turn in" below.  Also, be sure to plan your time carefully as the systematic experiment may take hours or days to run (depending on what you decide to do for this part).

A machine learning algorithm

The first part of the assignment is to implement a machine learning algorithm of your choice.  We have discussed several algorithms including naive Bayes, decision trees, AdaBoost, SVMs and neural nets.  R&N discuss others including decision stumps and nearest neighbors.  There are a few other algorithms that might be appropriate for this assignment such as the (voted) perceptron algorithm and bagging.  You may choose any of these to implement.  More details of these algorithms are given below, in some cases with pointers for further reading.  For several of these algorithms, there are a number of design decisions that you will need to make; for instance, for decision trees, you will need to decide on a splitting criterion, pruning strategy, etc.  In general, you are encouraged to experiment with these algorithms to try to make them as accurate as you can.  You are welcome to try your own variants, but if you do, you should compare to the standard vanilla version of the algorithm as well.

If you are working individually, you should implement one algorithm.  If you are working with a partner, the two of you together should implement two algorithms.  You are welcome to implement more algorithms if you wish.

If it happens that you have previously implemented a learning algorithm for another class or independent project, you should choose a different one for this homework.

For this assignment, you may wish to do outside reading about your chosen algorithm, but by no means are you required to do so.  Several books on machine learning have been placed on reserve at the Engineering Library.   Although outside reading is allowed, as usual, copying, borrowing, looking at, or in any way making use of actual code that you find on-line or elsewhere is not permitted.  Please be sure to cite all your outside sources (other than lecture and R&N) in your report.

Notes on debugging:   It can be very hard to know if a machine learning program is actually working.  With a sorting program, you can feed in a set of numbers and see if the result is a sorted list.  But with machine learning, we usually do not know what the "correct" output should be.  Here are some suggestions for debugging your program:

  • Run your program on small, hand-built datasets where you know exactly what the answer should be.  (See the sketch after this list.)
  • Compare your results to those of fellow classmates who are implementing the same algorithm.  You can do this directly with them, or you can compare to results appearing on the course website (see below).  Of course, this can be tricky since there may be algorithmic differences that are not indicative of bugs.
  • If your code is broken down into multiple methods or modules (as it should be), check each separately.  If you are working as a pair, take turns checking each other's code. 
  • If you have time, implement the same algorithm in two different ways, for instance, once in java and once in matlab.  Compare the results.
  • Some algorithms come with theoretical guarantees (for instance, we discussed a theoretical upper bound on the training error of AdaBoost).  Check to be sure that your implementation conforms with the theory.  In general, keep your eye out for behavior that seems unreasonable.
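
To make the first suggestion concrete, here is a minimal sanity-check sketch in Python (the course's provided scaffolding is Java, so treat this purely as illustration; the Learner class and its fit/predict interface are hypothetical stand-ins for your own code):

def check_learns_trivial_rule(Learner):
    # A hand-built dataset where the label simply equals the first attribute,
    # so any sensible learner should reach zero training error on it.
    X = [[0, 1], [0, 0], [1, 1], [1, 0]]
    y = [0, 0, 1, 1]
    clf = Learner()
    clf.fit(X, y)                       # hypothetical training interface
    preds = [clf.predict(x) for x in X]
    assert preds == y, "failed to fit a trivially separable dataset"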

Comparison on benchmark datasets

We have set up a mechanism whereby you will be able to compare the performance of your program to that of your fellow students.  The idea is to simulate what happens in the real world where you need to build a classifier using available labeled data, and then you use that classifier on new data that you never got to see or touch during training.  Here is how it works:

We are providing four benchmark datasets described below.  Each dataset includes a set of labeled training examples and another set of unlabeled test examples.  Once you have your learning algorithm written, debugged and working, you should try training your algorithm on the training data and producing a set of predictions on all of the test examples.  These predictions can then be submitted using moodle.  If you then press the "Run Script" button, your submitted predictions will be compared to the correct test labels and the resulting test error rate will be posted here where everyone can see how well everyone else's programs are performing.  The website will show, for each such submission, the date submitted, the author(s) of the program, a short description that you provide of the learning algorithm used, and the test error rate achieved.  The name listed as "author" will be a name that you provide.  So, if you wish to remain anonymous on the website, you can do so by using a made-up name of your choice, or even a random sequence of letters, but something that will allow you to identify your own entry.  (However, please refrain from using a name that might be considered offensive or inappropriate.)

The "description" you provide in submitting your test predictions should clearly describe the algorithm you are using, and any important design decisions you made (such as parameter settings).  This one or two sentence description should be as understandable as possible to others in the class.  For instance, try to avoid cryptic abbreviations.  (In contrast, the "author" you provide in submitting your test predictions can be any name you wish.)  While being brief, try to give enough information that a fellow classmate might have a reasonable chance of reproducing your results.

Once you have seen your results on test data, you may wish to try to improve your algorithm, or you may wish to try another algorithm altogether (although the assignment does not require you to do so).  Once you have done this, you may submit another set of test predictions.  However, to avoid the test sets becoming overused (leading to statistically meaningless results), each student will be limited to submitting three sets of predictions for each benchmark dataset.  Note that this limit is per student, not per team; in other words, if you are working as a pair, then together you can submit up to six sets of predictions per dataset.

Your grade will not depend on how accurate a classifier you are able to produce relative to the rest of the class.  Although you are encouraged to do the best you can, this should not be regarded as anything more than a fun (I hope) communal experiment exploring various machine learning algorithms.  This also means that, in choosing an algorithm to implement, it is more important to choose an algorithm that interests you than to choose one that you expect will give the best accuracy.  The greater a variety of algorithms that are implemented, the more interesting will be the results of our class experiment, even if some of those algorithms perform poorly.

A systematic experiment

The third part of this assignment is to run a single systematic experiment.  For instance, you might want to produce a "learning curve" such as the one shown in R&N Figure 18.11a.  In such a curve, the accuracy (or error) is measured as the training set size is varied over some range.  To be more specific, here is how you might produce such a curve.  The provided test datasets cannot be used here since they are unlabeled, and since you are limited to making only three sets of predictions on each.  Instead, you can split the provided training set into two subsets, one for training, and the other for measuring performance.  For instance, if you have 2000 training examples, you might hold out 1000 of those examples for measuring performance.  You can then run your learning algorithm on successively larger subsets of the remaining 1000 examples, say of size 20, 50, 100, 200, 500, 1000.  Each run of your algorithm will generate a classifier whose error can be measured on the held-out set of 1000 examples.  The results can then be plotted using matlab, gnuplot, excel, etc.  Note that such an experiment, though requiring multiple runs of the same algorithm, can be programmed to run entirely automatically, thus reducing the amount of human effort required, and also substantially reducing the chance of making mistakes.
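
As a rough illustration of such a learning-curve experiment, here is a short Python sketch (the course scaffolding is Java, and scikit-learn's DecisionTreeClassifier below is just a stand-in for whatever learner you implement); it assumes X and y are NumPy arrays holding at least 2000 labeled training examples:

import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in for your own learner

def learning_curve(X, y, sizes=(20, 50, 100, 200, 500, 1000), seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    held = perm[:1000]                  # held-out examples for measuring error
    pool = perm[1000:]                  # pool to draw training subsets from
    results = []
    for m in sizes:
        clf = DecisionTreeClassifier().fit(X[pool[:m]], y[pool[:m]])
        err = float(np.mean(clf.predict(X[held]) != y[held]))
        results.append((m, err))
    return results                      # plot size vs. error with matlab, gnuplot, etc.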

This is just one possible experiment you might wish to run.  There are many other possibilities.  For instance, if you are using neural nets, you might want to plot accuracy as a function of the number of epochs of training.  Or if you are using boosting, you might plot accuracy as a function of the number of rounds.  Another possibility is to compare the accuracy of two different variants of the same algorithm, for instance, decision trees with and without boosting, or decision trees with two different splitting criteria.

This general approach of holding out part of the training set may also be helpful for improving the performance of your learning algorithm without using the "real" test set.  For instance, if your algorithm has a parameter (like the learning rate in neural nets) that needs to be tuned, you can try different settings and see which one seems to work the best on held out data.  (This could also count as a systematic experiment.)  You might then use this best setting of the learning rate to train on the entire training set and to generate test predictions that you submit for posting on the class website.
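
A minimal sketch of this tuning loop, again in Python, with hypothetical train and error functions standing in for your own code:

def tune_learning_rate(train, error, X_pool, y_pool, rates=(0.001, 0.01, 0.1, 1.0)):
    # Hold out part of the training data for validation; never touch the real test set.
    split = len(X_pool) // 2
    X_tr, y_tr = X_pool[:split], y_pool[:split]
    X_val, y_val = X_pool[split:], y_pool[split:]
    best = min(rates, key=lambda lr: error(train(X_tr, y_tr, lr), X_val, y_val))
    # Retrain on all of the training data with the winning rate.
    return train(X_pool, y_pool, best), best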

In general, such held-out sets should consist of about 500-1000 examples for reliable results.

Note that systematic experiments of this kind can take a considerable amount of computation time to complete, in some cases, many hours or even days, depending on the experiment and the algorithm being used.  Therefore, it is very important that you start on this part of the assignment as early as possible.

If you are working as a pair, it is okay to do just one experiment.  However, whatever experiments you do should involve at least two of the algorithms that you implemented.  For instance, you might produce learning curves for both.

A written report

The fourth part of this assignment is to write up your results clearly but concisely in a short report.  Your report should include all of the following (numbers in brackets indicate roughly how many paragraphs you might want to write for each bullet):

  • [1-2] A description of the algorithm(s) that you implemented.  The description should include enough implementation details that a motivated classmate would be able to reproduce your results.
  • [1] A brief description of what strategies you used to test that your program is working correctly since, as noted above, it can be difficult to know if a machine learning program is working.
  • [1] A description of the experiment(s) that you carried out, again, with enough detail for a fellow classmate to reproduce your results.
  • [1] The results of your experiment, possibly summarized by a figure.
  • [1] The accuracy of your algorithm on the provided test sets, and a comparison to other methods used by others in the class.
  • [1-3] A discussion of your results.  For instance, what do your results tell us about the learning algorithm(s) you studied?  Were the results in any way surprising, or were they what you expected, and why?  How do they fit with theory and intuition?  Can you conclude anything about what kind of algorithms might be better for what kind of problems?

If you are working as a pair, you only need to submit a single report (in which case, your report might be slightly longer than indicated by the numbers above).

The code we are providing

We are providing a class DataSet for storing a dataset, and for reading one in from data files that we provide or that you generate yourself for testing.  Each dataset is described by an array of training examples, an array of labels and an array of test examples.  Each example is itself an array of attribute values.  There are two kinds of attributes: numeric and discrete.  Numeric attributes have numeric values, such as age, height, weight, etc.  Discrete attributes can only take on values from a small set of discrete values, for instance, sex (male, female), eye color (brown, blue, green), etc.  Below, we also refer to binary attributes; these are numeric attributes that happen to only take the two values 0 and 1.

Numeric attributes are stored by their actual value as an integer (for simplicity, we don't allow floating point values).  Discrete attributes are stored by an index (an integer) into a set of values.  The DataSet class also stores a description of each attribute including its name, and, in the case of discrete attributes, the list of possible values.  Labels are stored as integers which must be 0 or 1 (we will only consider two-class problems).  The names of the two classes are also stored as part of the DataSet class.

A dataset is read in from three files with the names <stem>.names, <stem>.train and <stem>.test.  The first contains a description of the attributes and classes.  The second and third contain the labeled training examples and unlabeled test examples.  A typical <stem>.names file looks like the following:

yes no
age         numeric
eye-color   brown blue green

The first line must contain the names of the two classes, which in this case are called "yes" and "no".  After this follows a list of attributes.  In this case, the second line of the file says that the first attribute is called "age", and that this attribute takes numeric values.  The next line says that the second attribute is called "eye-color", and that it is a discrete attribute taking the three values "brown", "blue" and "green".

A typical <stem>.train file might look like this:

33   blue   yes
15   green  no
25   green  yes

There is one example per line consisting of a list of attribute values (corresponding to those described in the <stem>.names file), followed by the class label.

A <stem>.test file has exactly the same format except that the label is omitted, such as the following:

33 green
19 blue

The DataSet class has a constructor taking a file-stem as an argument that will read in all three files and set up the public fields of the class appropriately.  The .train and .names files are required, but not the .test file (if no .test file is found, a non-fatal warning message will be printed and an empty test set established).

Working with several different kinds of attributes can be convenient when coding a dataset but a nuisance when writing a machine learning program.  For instance, neural nets prefer all of the data to be numeric, while decision trees are simplest to describe when all attributes are binary.  For this reason, we have provided additional classes that will read in a dataset and convert all of the attributes so that they all have the same type.  This should make your job much, much simpler.  Each of these classes is in fact a subclass of DataSet (see the references listed on the course home page for an explanation of subclasses and how to use them), and each has a constructor taking as argument a file-stem.  The three classes are NumericDataSet, BinaryDataSet and DiscreteDataSet, which convert any dataset into data that is entirely numeric, binary or discrete.  (In addition, BinaryDataSet is a subclass of NumericDataSet.)  So, for instance, if you are using neural nets and want your data to be entirely numeric, simply load the data using a command like this:

ds = new NumericDataSet(filestem);

Using these subclasses inevitably has the effect of changing the attributes.  When converting discrete attributes to numeric or binary, a new binary attribute is created for each value.  For instance, the eye-color attribute will become three new binary attributes: eye-color=brown, eye-color=blue and eye-color=green; if eye-color is blue on some example, then eye-color=blue would be given the value 1, and the others the value 0.  A numeric (non-binary) attribute is converted to binary by creating new binary attributes in a similar fashion.  Thus, the numeric attribute age would be replaced by the binary attributes age>=19, age>=25, age>=33.  If age actually is 25, then age>=19 and age>=25 would be set to 1, while age>=33 would be set to 0.  When converting a numeric (including binary) attribute to discrete, we simply regard it as a discrete attribute that can take on the possible values of the original numeric attribute.  Thus, in this example, age would now become a discrete attribute that can take on the values "15", "19", "25" and "33".  Note that all ordering information has been lost between these values.
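
The conversion scheme described above is easy to mimic; here is a small Python sketch of both transformations (an illustration of the scheme, not the provided Java classes):

def discrete_to_binary(values, vocab):
    # eye-color -> eye-color=brown, eye-color=blue, eye-color=green
    return [[1 if v == u else 0 for u in vocab] for v in values]

def numeric_to_binary(values):
    # age -> age>=19, age>=25, age>=33: one threshold per observed value
    # above the smallest one, as in the example in the text.
    thresholds = sorted(set(values))[1:]
    return [[1 if v >= t else 0 for t in thresholds] for v in values]

print(discrete_to_binary(["blue"], ["brown", "blue", "green"]))   # [[0, 1, 0]]
print(numeric_to_binary([15, 19, 25, 33]))   # the row for 25 is [1, 1, 0]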

If you produce your own dataset, it is important to know that the provided code assumes that numeric attributes can only take on a fairly small number of possible values.  If you try this code out with a numeric attribute that takes a very large number of values, you probably will run into memory and efficiency issues.  All of the provided datasets have been set up so that this should not be a problem.

The DataSet class also includes a method printTestPredictions that will print the predictions of your classifier on the test examples in the format required for submission.  The output of this method should be stored in a file called <stem>.testout and submitted using moodle.

We also are providing an interface called Classifier that your learning program and the computed classifier (hypothesis) should adhere to.  This interface has three methods: predict, which will compute the prediction of the classifier on a given example; algorithmDescription, which simply returns a very brief but understandable description of the algorithm you are using for inclusion on the course website; and author, which returns the "author" of the program as you would like it to appear on the website (can be your real name or a pseudonym).  A typical class implementing this interface will also include a constructor where the actual learning takes place.

A very simple example of a class implementing the Classifier interface is provided in BaselineClassifier.java which also includes a simple main for loading a dataset, training the classifier and printing the results on test data.

All code and data files can be obtained from this directory, or all at once from this zip file.  Data is included in the data subdirectory.

Documentation on the provided Java classes is available here.

The datasets we are providing

We are providing four datasets, all consisting of real-world data suitably cleaned and simplified for this assignment.

The first two datasets consist of optical images of handwritten digits.  Some examples are shown in R&N Figure 20.29 (the data we are providing actually comes from the same source, although ours have been prepared somewhat differently).  Each image is a 14x14 pixel array, with 4 pixel-intensity levels.  The goal is to recognize the digit being represented.  In the first and easier dataset, with file-stem ocr17, the goal is to distinguish 1's from 7's.  In the second and harder dataset, with file-stem ocr49, the goal is to distinguish 4's from 9's.

The third dataset consists of census information.  Each example corresponds to a single individual with attributes such as years of education, age, race, etc.  The goal is to predict whether this individual has an income above or below $50,000.  The file-stem is census.

The fourth dataset consists of DNA sequences of length 60.  The goal is to predict whether the site at the center of this window is a "splice" or "non-splice" site.  The file-stem is dna.

The DNA dataset consists of 1000 training examples and 2175 test examples.  All of the other datasets consist of 2000 training examples and 4000 test examples.

It might happen that you think of ways of figuring out the labels of the test examples, for instance, manually looking at the OCR data to see what digit is represented, or finding these datasets on the web.  Please do not try anything of this kind, as it will ruin the spirit of the assignment.  The test examples should not be used for any purpose other than generating predictions that you then submit.  You should pretend that the test examples will arrive in the future after the classifier has already been built and deployed.

The code that you need to write

Your job is to create one or more Classifier classes implementing your learning algorithm and the generated classifier.  Since we do not intend to do automatic testing of your code, you can do this however you wish.  You also will probably need to write some kind of code to carry out your systematic experiment.

Because we will not be relying on automatic testing, we ask that you make an extra effort to document your code well to make it as readable as possible.

What to turn in

For the purposes of submitting to moodle, we have divided this assignment in two.

On moodle, under the assignment called "A7: Machine Learning (test predictions)", you should turn in the following:

  • For each of the four datasets, a file called <stem>.testout generated by printTestPredictions and containing author, algorithm description and predictions on all test examples.  These should be submitted (and the "Run Script" button pushed) as early as possible so that others can compare their results to yours.  At the latest, you (or your partner, if working as a pair) should submit a first round of test predictions by Thursday, January 8 (although you can continue to submit more rounds of test predictions after this date).  Keep in mind that you may not submit more than three sets of predictions per student and per dataset.  (Moodle will automatically prevent you from doing so.)

In addition, under the assignment called "A7: Machine Learning (code)", you should turn in the following:

  • Any java code that you wrote in a form that will compile and run, should the TA's wish to try it out.
  • A readme.txt file explaining briefly how your code is organized, what data structures you are using, or anything else that will help the TA's understand how your code works overall.

Finally, your program report should be submitted in hard copy as described on the assignments page.

If you are working with a partner, the two of you together only need to submit your code once, and you only need to prepare and turn in a single written report.  Be sure that it is clear who your partner is.  In all cases, the written exercises should be completed and turned in individually.

You do not need to submit any code or anything in writing by Thursday, January 8.  The only thing you need to submit by that date is a single round of predictions on each of the test sets.  The reason this part is due before the rest of the assignment is so that you will have time to compare your results to those of your fellow classmates when you write up your report.  You can continue to submit test predictions (up to three rounds, including the one due on January 8), up until the assignment due date (Tuesday, January 13).

What you will be graded on

You will be graded on completing each of the components of this assignment, as described above.  More emphasis will be placed on your report than on the code itself.  You have a great deal of freedom in choosing how much work you want to put into this assignment, and your grade will in part reflect how great a challenge you decide to take on.  Creativity and ingenuity will be one component of your grade.  Here is a rough breakdown, with approximate point values in parentheses, of how much each component is worth:

  • [20] The learning algorithm (correct and complete implementation; debugging and testing; adequate documentation).
  • [10] Submitting test predictions via moodle and comparing to results of others.
  • [20] A systematic experiment.
  • [10] The overall presentation of the report (should be clear, concise and well written).
  • [15] The discussion of results (should be thoughtful, perceptive and insightful).
  • [10] Overall creativity, ingenuity and ambitiousness.

As noted above, your grade will not at all depend on how well your algorithm actually worked, provided that its poor performance is not due to an incorrect or incomplete implementation.

Full credit for this assignment will be worth around 85 points.  However, exceptional effort will receive extra credit of 5-20 points.

Algorithms you can choose from

Here is a list of machine learning algorithms you can choose from for the programming assignment.  Most of these are described further in the books requested for reserve at the Engineering Library (see below).  A few additional pointers are also provided below.  Be sure to take note of the R&N errata detailed on the written exercises for this homework.

Decision trees

These were discussed in class, and also in R&N Section 18.3.  To implement them, you will need to decide what splitting criterion to use, when to stop growing the tree and what kind of pruning strategy to use.

AdaBoost

This algorithm was discussed in class, and also in R&N Section 18.4.  AdaBoost is an algorithm that must be combined with another "weak" learning algorithm, so you will need to implement at least one other algorithm (which might work well if you are working as a pair).  Natural candidates for weak learning algorithms include decision stumps or decision trees.  You also will need to decide how the weak learner will make use of the weights D_t on the training examples.  One possibility is to design a weak learner that directly minimizes the weighted training error.  The other option is to select a random subset of the training examples on each round of boosting by resampling according to distribution D_t.  This means repeatedly selecting examples from the training set, each time selecting example i with probability D_t(i).  This is done "with replacement", meaning that the same example may appear in the selected subset several times.  Typically, this procedure will be repeated N times, where N is the number of training examples.

Finally, you will need to decide on the number of rounds of boosting.  This is usually in the 100's to 1000's.
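
If you take the resampling route, the core step is easy to get wrong; here is a hedged Python sketch (NumPy arrays assumed, D a probability vector summing to 1):

import numpy as np

def resample_by_weights(X, y, D, rng):
    # Draw N examples with replacement; example i is chosen with probability D[i].
    idx = rng.choice(len(X), size=len(X), replace=True, p=D)
    return X[idx], y[idx]

# usage: Xs, ys = resample_by_weights(X, y, D, np.random.default_rng(0))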

See also this overview paper, or this survey.

Support-vector machines (SVM's)

This algorithm was discussed in class, and also in R&N Section 20.6.  Even so, we did not describe specific algorithms for implementing it, so if you are interested, you will need to do some background reading.  One of the books on reserve is all about kernel machines (including SVM's).  You can also have a look at this tutorial paper, as well as the references therein and some of the other resources and tutorials at www.kernel-machines.org.  The SMO algorithm is a favorite technique for computing SVM's.

Since implementing SVM's can be pretty challenging, you might instead want to implement the very simple (voted) perceptron algorithm, another "large margin" classifier described below which also can be combined with the kernel trick, and whose performance is substantially similar to that of SVM's.

Neural networks

This algorithm was discussed in class, and also in R&N Section 20.5 in considerable detail.  You will need to choose an architecture, and you have a great deal of freedom in doing so.  You also will need to choose a value for the "learning rate" parameter, and you will need to decide how long to train the network for.  You might want to implement just a single-layer neural network, or you might want to experiment with larger multi-layer networks.  You can also try maximizing likelihood rather than minimizing the sum of squared errors as described in R&N Eq. (20.13) and the surrounding text.  It can be proved that taking this approach with a single-layer net has the important advantage that gradient ascent can never get stuck in a local maximum.  (This amounts to a tried and true statistical method called logistic regression.)

Naive Bayes

We discussed this algorithm in class much earlier in the semester, but it can be used as a very simple algorithm for classification learning.  It is described in R&N at the very end of Section 13.6, and also in the middle of Section 20.2.  Although simple, and although the naive independence assumptions underlying it are usually wrong, this algorithm often works better than expected.  In estimating probabilities, you will probably want to work with log probabilities and use Laplace ("add-one") smoothing as in HW#5.  This algorithm works best with discrete attributes.
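
A compact Python sketch of this recipe for discrete attributes (log probabilities plus add-one smoothing; an illustration under the two-class, integer-coded-attribute conventions above, not the provided code):

import math
from collections import defaultdict

def train_nb(X, y, n_values):
    # X[i][j] is the value index of attribute j on example i; labels are 0/1.
    n, d = len(X), len(X[0])
    log_prior, log_cond = {}, {}
    for c in (0, 1):
        Xc = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(Xc) / n)
        for j in range(d):
            counts = defaultdict(int)
            for x in Xc:
                counts[x[j]] += 1
            for v in range(n_values[j]):   # Laplace ("add-one") smoothing
                log_cond[c, j, v] = math.log((counts[v] + 1) / (len(Xc) + n_values[j]))
    return log_prior, log_cond

def predict_nb(x, log_prior, log_cond):
    score = {c: log_prior[c] + sum(log_cond[c, j, v] for j, v in enumerate(x))
             for c in (0, 1)}
    return max(score, key=score.get)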

Decision stumps

This is probably the simplest classifier on this list (and the least challenging to implement).  Decision stumps are briefly touched upon in R&N Section 18.4.  A decision stump is a decision tree consisting of just a single test node.  Given data, it is straightforward to search through all possible choices for the test node to build the decision stump with minimum training error.  These make good, truly weak, weak hypotheses for AdaBoost.
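
For binary attributes the exhaustive search is only a few lines; a sketch in Python (assuming 0/1 attributes and labels):

def best_stump(X, y):
    # Try every attribute and both orientations of the test; keep the one
    # with minimum training error.
    best = None
    for j in range(len(X[0])):
        for p0, p1 in ((0, 1), (1, 0)):   # prediction when x[j] == 0 / == 1
            errs = sum((p1 if x[j] else p0) != label for x, label in zip(X, y))
            if best is None or errs < best[0]:
                best = (errs, j, p0, p1)
    return best   # (training errors, attribute index, predict-if-0, predict-if-1)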

Nearest neighbors

We did not discuss this algorithm in detail in class, but it is discussed in R&N Section 20.4.  The idea is simple: during training, all we do is store the entire training set.  Then given a test example, we find the training example that is closest to it, and predict that the label of the test example is the same as the label of its closest neighbor.  As described, this is the 1-nearest neighbor algorithm.  In the k-nearest neighbor algorithm, we find the k closest training examples and predict with the majority vote of their labels.  In either case, it is necessary to choose a distance function for measuring the distance between examples.
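
A sketch of k-nearest neighbors in Python, using Euclidean distance as one possible choice of distance function (NumPy arrays assumed):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=1):
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training example
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]  # majority vote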

(Voted) perceptron algorithm

The perceptron algorithm (not to be confused with the algorithm given in R&N Figure 20.21) is one of the oldest learning algorithms, and also a very simple algorithm.  Like SVM's, its purpose is to learn a separating hyperplane defined by a weight vector w.  Starting with an initial guess for w, the algorithm proceeds to cycle through the examples in the training set.  If example x is on the "correct" side of the hyperplane defined by w, then no action is taken.  Otherwise, y x (the example scaled by its label) is added to w.  The algorithm has some nice theoretical properties, and can be combined with the kernel trick.  Also, there is a version of the algorithm in which the average of all of the weight vectors computed along the way is used to define the final weight vector of the output hypothesis.  All this is described in this paper.  [Unfortunately, this paper has some annoying typos in Figure 1.  The initialization line should read: "Initialize: k := 1, v_1 := 0, c_1 := 0."  Also, the line that reads "If ŷ = y then..." should instead read "If ŷ = y_i then..."]
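
A sketch of the basic update in Python, with a running average approximating the voted variant from the paper (labels in {-1, +1}, NumPy arrays assumed; this is an illustration, not the paper's exact pseudocode):

import numpy as np

def perceptron(X, y, epochs=10):
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)             # running sum for the averaged variant
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:     # example on the wrong side: correct w
                w = w + label * x
            w_sum += w
    return w, w_sum / (epochs * len(X))  # final and averaged weight vectors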

Bagging

This is an "ensemble" method similar to boosting, somewhat simpler though not quite as effective overall.  As in boosting, we assume access to a "weak" or base learning algorithm.  This base learner is run repeatedly on different subsets of the training set.  Each subset is chosen by selecting N of the training examples with replacement from the training set, where N is the number of training examples.  This means that we select one of the training examples at random, then another, then another and so on N times.  Each time, however, we are selecting from the entire training set, so that some examples will appear more than once, and some won't appear at all.  The base learner is trained on this subset, and the entire procedure is repeated some number of times (usually around 100).  These "weak" or base hypotheses are then combined into a single hypothesis by a simple majority vote.  For more detail, see this paper.  This algorithm works best with an algorithm like decision trees as the base learner.
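
A sketch of bagging in Python, with scikit-learn's decision tree standing in for the base learner (an illustration, not the paper's code; labels assumed to be 0/1):

import numpy as np
from sklearn.tree import DecisionTreeClassifier   # a typical base learner

def bag(X, y, rounds=100, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(rounds):
        idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap sample
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)   # simple majority vote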

If you are interested in implementing some other algorithm not listed here, please contact me first.

Books on reserve at the Engineering Library

  • Leo Breiman, Jerome H. Friedman, Richard A. Olshen and Charles J. Stone. Classification and regression trees. Wadsworth, 1984. 
  • J. Ross Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, 1993.
  • Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000. 
  • Christopher M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.

Steps to Build a Machine Learning Model

In today’s era of a data-rich environment where data generation volume, velocity, and variety are unparalleled, we face both opportunities and challenges. Machine learning models offer a powerful mechanism to extract meaningful patterns, trends, and insights from this vast pool of data, giving us the power to make better-informed decisions and appropriate actions. In this article, we will explore the Fundamentals of Machine Learning and the Steps to build a Machine Learning Model.

Table of Contents

  • Understanding the Fundamentals of Machine Learning
  • Comprehensive Guide to Building a Machine Learning Model
  • Step 1: Data Collection for Machine Learning
  • Step 2: Preprocessing and Preparing Your Data
  • Step 3: Selecting the Right Machine Learning Model
  • Step 4: Training Your Machine Learning Model
  • Step 5: Evaluating Model Performance
  • Step 6: Tuning and Optimizing Your Model
  • Step 7: Deploying the Model and Making Predictions

Understanding the Fundamentals of Machine Learning

Machine learning is the field of study that enables computers to learn from data and make decisions without explicit programming.  Machine learning models play a pivotal role in tackling real-world problems across various domains, reshaping how we approach problems and make decisions.  By using data-driven insights and sophisticated algorithms, machine learning models help us achieve high accuracy and efficiency in solving real-world problems.

Machine learning is crucial in today's data-driven world, where the ability to extract insights and make predictions from vast amounts of data can drive significant advances in almost any field, so understanding its fundamentals is essential.

Machine learning can be seen as a subset of artificial intelligence that focuses on developing algorithms capable of learning hidden patterns and relationships within data, allowing them to generalize and make better predictions or decisions on new data.  The key paradigms include supervised learning, unsupervised learning, and reinforcement learning.

  • Supervised learning involves training a model on labeled data, where the algorithm learns from the input data and its corresponding target (output labels).  The goal is to learn a mapping from input to output, allowing the model to make predictions on new data.  Algorithms include linear regression, logistic regression, decision trees, and more.
  • Unsupervised learning, on the other hand, deals with unlabeled data, where algorithms try to uncover hidden patterns or structures.  Unlike supervised learning, which depends on labeled data to learn relationships for prediction, unsupervised learning operates without such guidance.  Algorithms include clustering methods like k-means and hierarchical clustering, and dimensionality-reduction methods like PCA.
  • Reinforcement learning involves training an agent to interact with an environment and learn optimal actions through trial and error.  It employs a reward-penalty strategy: the agent receives feedback in the form of rewards or penalties based on its actions, allowing it to learn from experience and maximize its reward over time.  Reinforcement learning is applied in areas such as robotics and games.

Some of the key terminologies of ML before building one are:

  • Feature : Features are the pieces of information that we use to train our model to make predictions. In simpler terms, they are the columns or attributes of the dataset that contain the data used for analysis and modeling.
  • Label : The output or target variable that the model aims to predict in supervised learning, also known as the dependent variable.
  • Training set : The portion of the dataset that is used to train the machine learning model. The model learns patterns and relationships in the data from the training set.
  • Validation set : A subset of the dataset that is used to tune the model’s hyperparameters and helps in assessing performance during training of the model.
  • Test Set : It is also a part of the dataset that is used to evaluate our final model performance on unseen data.

Comprehensive Guide to Building a Machine Learning Model

This guide will take you through the process of building a machine learning model, covering everything from data preprocessing to model evaluation and deployment.  By following these steps, you'll learn how to create a robust machine learning model that meets your needs.

Step 1: Data Collection for Machine Learning

Data collection is a crucial step in the creation of a machine learning model, as it lays the foundation for building accurate models.  In this phase, relevant data is gathered from various sources to train the model and enable it to make accurate predictions.  The first step is defining the problem and understanding the requirements of the project: determining the type of data needed (structured or unstructured), and identifying potential sources for gathering it.

Once the requirements are finalized, data can be collected from a variety of sources such as databases, APIs, web scraping , and manual data entry. It is crucial to ensure that the collected data is both relevant and accurate, as the quality of the data directly impacts the generalization ability of our machine learning model. In other words, the better the quality of the data, the better the performance and reliability of our model in making predictions or decisions.

Step 2: Preprocessing and Preparing Your Data

Preprocessing and preparing data is an important step that involves transforming raw data into a format suitable for training and testing models.  This phase aims to clean the data (removing null and garbage values) and to normalize and preprocess it, improving the accuracy and performance of the resulting models.

As Clive Humby said, "Data is the new oil. It's valuable, but if unrefined it cannot be used."  This quote emphasizes the importance of refining data before using it for analysis or modeling.  Just as oil must be refined to unlock its full potential, raw data must undergo preprocessing before it can be used effectively in ML tasks.  Preprocessing typically involves several steps, including handling missing values, encoding categorical variables (converting them to numeric form), scaling numerical features, and feature engineering.  This ensures that the model's performance is optimized and that it generalizes well to unseen data.
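
As a hedged illustration of these preprocessing steps in Python (the column names and values here are hypothetical), scikit-learn pipelines can bundle imputation, scaling, and one-hot encoding:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [25, None, 40],            # hypothetical raw data
                   "income": [50000, 62000, None],
                   "city": ["NY", "SF", None]})
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
])
X = preprocess.fit_transform(df)   # cleaned, scaled, one-hot-encoded features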

Step 3: Selecting the Right Machine Learning Model

Selecting the right machine learning model plays a pivotal role in building a successful solution: with so many algorithms and techniques readily available, the choice of model significantly impacts the accuracy and performance of the result.  The process involves several considerations:

Firstly, understanding the nature of the problem is essential: it may call for classification, regression, clustering, or something else, and different types of problems require different algorithms.

Secondly, familiarize yourself with a variety of machine learning algorithms suitable for your problem type, and evaluate the complexity and interpretability of each.  More complex models such as deep learning may improve performance but are harder to interpret.  The best approach is often to experiment with multiple models, evaluate their metrics, and iteratively check how well each algorithm generalizes to unseen data.

Step 4: Training Your Machine Learning Model

In this phase, we have all the necessary ingredients to train our model effectively.  This involves using the prepared data to teach the model to recognize patterns and make predictions based on the input features.  During training, the preprocessed data is fed into the selected machine learning algorithm, which iteratively adjusts its internal parameters to minimize the difference between its predictions and the actual target values in the training data.  This optimization process often employs techniques like gradient descent.

As the model learns from the training data, it gradually improves its ability to generalize to new or unseen data. This iterative learning process enables the model to become more adept at making accurate predictions across a wide range of scenarios.

Step 5: Evaluating Model Performance

Once you have trained your model, it's time to assess its performance.  Various metrics are used to evaluate model performance, categorized by the type of task: regression or classification.

1. For regression tasks, common evaluation metrics are:

  • Mean Absolute Error (MAE): MAE is the average of the absolute differences between predicted and actual values.
  • Mean Squared Error (MSE): MSE is the average of the squared differences between predicted and actual values.
  • Root Mean Squared Error (RMSE): The square root of the MSE, providing a measure of the typical magnitude of error.
  • R-squared (R2): It is the proportion of the variance in the dependent variable that is predictable from the independent variables.

2. For classification tasks, common evaluation metrics are:

  • Accuracy: Proportion of correctly classified instances out of the total instances.
  • Precision: Proportion of true positive predictions among all positive predictions.
  • Recall: Proportion of true positive predictions among all actual positive instances.
  • F1-score: Harmonic mean of precision and recall, providing a balanced measure of model performance.
  • Area Under the Receiver Operating Characteristic curve (AUC-ROC): Measure of the model’s ability to distinguish between classes.
  • Confusion matrix: A matrix that summarizes the performance of a classification model, showing counts of true positive, true negative, false positive, and false negative instances.

By evaluating the model with these metrics, you can gain insight into its strengths and weaknesses and decide where further refinement and optimization are needed.  (A short example of computing these metrics follows.)
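
Most of these metrics are one call each in scikit-learn; a sketch with toy numbers, purely for illustration:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score, r2_score, recall_score,
                             roc_auc_score)

# Classification: y_score is a predicted probability, needed only for AUC-ROC.
y_true, y_pred, y_score = [0, 1, 1, 0], [0, 1, 0, 0], [0.2, 0.9, 0.4, 0.1]
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))
print(confusion_matrix(y_true, y_pred))

# Regression: RMSE is just the square root of MSE.
yt, yp = [3.0, 5.0, 2.5], [2.8, 5.4, 2.0]
print(mean_absolute_error(yt, yp), mean_squared_error(yt, yp),
      mean_squared_error(yt, yp) ** 0.5, r2_score(yt, yp))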

Step 6: Tuning and Optimizing Your Model

Once the model is trained, the next step is to optimize it further.  Tuning and optimizing helps the model maximize its performance and generalization ability.  This process involves fine-tuning hyperparameters, selecting the best algorithm, and improving features through feature-engineering techniques.  Hyperparameters are parameters set before training begins that control the behavior of the model, such as the learning rate and regularization strength; these should be adjusted carefully.

Techniques like grid search, randomized search, and cross-validation are used to systematically explore the hyperparameter space and identify the best combination of hyperparameters for the model.  Overall, tuning and optimizing the model combines careful selection of parameters, feature engineering, and other techniques to produce a model that generalizes well.
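
As one concrete (and hedged) example, scikit-learn's GridSearchCV combines the grid search and cross-validation described above; the model, parameter grid, and synthetic data here are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=200, random_state=0)  # toy data
param_grid = {"C": [0.01, 0.1, 1, 10]}           # inverse regularization strength
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)                     # 5-fold CV over every grid point
print(search.best_params_, search.best_score_)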

Step 7: Deploying the Model and Making Predictions

Deploying the model and making predictions is the final stage in the journey of creating an ML model.  Once a model has been trained and optimized, it's time to integrate it into a production environment where it can provide real-time predictions on new data.

During model deployment, it's essential to ensure that the system can handle high user loads, operate smoothly without crashes, and be easily updated.  Tools like Docker and Kubernetes help by packaging the model in a way that makes it easy to run on different machines and manage efficiently.  Once deployment is complete, the model is ready to make predictions: unseen inputs are fed into the deployed model to enable real-time decision making.

In conclusion, building a machine learning model involves collecting and preparing data, selecting the right algorithm, tuning it, evaluating its performance, and deploying it for real-time decision-making. Through these steps, we can refine the model to make accurate predictions and contribute to solving real-world problems.

Machine Learning Methods for Solving Assignment Problems in Multi-Target Tracking (arXiv survey: Computer Science > Computer Vision and Pattern Recognition)

Abstract: Data association and track-to-track association, two fundamental problems in single-sensor and multi-sensor multi-target tracking, are instances of an NP-hard combinatorial optimization problem known as the multidimensional assignment problem (MDAP). Over the last few years, data-driven approaches to tackling MDAPs in tracking have become increasingly popular. We argue that viewing multi-target tracking as an assignment problem conceptually unifies the wide variety of machine learning methods that have been proposed for data association and track-to-track association. In this survey, we review recent literature, provide rigorous formulations of the assignment problems encountered in multi-target tracking, and review classic approaches used prior to the shift towards data-driven techniques. Recent attempts at using deep learning to solve NP-hard combinatorial optimization problems, including data association, are discussed as well. We highlight representation learning methods for multi-sensor applications and conclude by providing an overview of current multi-target tracking benchmarks.

Coursera: Machine Learning - All weeks solutions [Assignment + Quiz] - Andrew NG

Recommended Machine Learning Courses:

  • Coursera: Machine Learning
  • Coursera: Deep Learning Specialization
  • Coursera: Machine Learning with Python
  • Coursera: Advanced Machine Learning Specialization
  • Udemy: Machine Learning
  • LinkedIn: Machine Learning
  • Eduonix: Machine Learning
  • edX: Machine Learning
  • Fast.ai: Introduction to Machine Learning for Coders

=== Week 1 ===

Assignments:

  • No Assignment for Week 1
  • Machine Learning (Week 1) Quiz ▸  Introduction
  • Machine Learning (Week 1) Quiz ▸  Linear Regression with One Variable
  • Machine Learning (Week 1) Quiz ▸  Linear Algebra

=== Week 2 ===

Assignments:

  • Machine Learning (Week 2) [Assignment Solution] ▸ Linear regression and get to see it work on data.
  • Machine Learning (Week 2) Quiz ▸  Linear Regression with Multiple Variables
  • Machine Learning (Week 2) Quiz ▸  Octave / Matlab Tutorial

=== Week 3 ===

  • Machine Learning (Week 3) [Assignment Solution] ▸ Logistic regression and apply it to two different datasets
  • Machine Learning (Week 3) Quiz ▸  Logistic Regression
  • Machine Learning (Week 3) Quiz ▸  Regularization

=== Week 4 ===

  • Machine Learning (Week 4) [Assignment Solution] ▸ One-vs-all logistic regression and neural networks to recognize hand-written digits.
  • Machine Learning (Week 4) Quiz ▸  Neural Networks: Representation

=== Week 5 ===

  • Machine Learning (Week 5) [Assignment Solution] ▸ Back-propagation algorithm for neural networks to the task of hand-written digit recognition.
  • Machine Learning (Week 5) Quiz ▸  Neural Networks: Learning

=== Week 6 ===

  • Machine Learning (Week 6) [Assignment Solution] ▸ Regularized linear regression to study models with different bias-variance properties.
  • Machine Learning (Week 6) Quiz ▸  Advice for Applying Machine Learning
  • Machine Learning (Week 6) Quiz ▸  Machine Learning System Design

=== Week 7 ===

  • Machine Learning (Week 7) [Assignment Solution] ▸ Support vector machines (SVMs) to build a spam classifier.
  • Machine Learning (Week 7) Quiz ▸  Support Vector Machines

=== Week 8 ===

  • Machine Learning (Week 8) [Assignment Solution] ▸ K-means clustering algorithm to compress an image. ▸ Principal component analysis to find a low-dimensional representation of face images.
  • Machine Learning (Week 8) Quiz ▸  Unsupervised Learning
  • Machine Learning (Week 8) Quiz ▸  Principal Component Analysis

=== Week 9 ===

  • Machine Learning (Week 9) [Assignment Solution] ▸ Anomaly detection algorithm to detect failing servers on a network. ▸ Collaborative filtering to build a recommender system for movies.
  • Machine Learning (Week 9) Quiz ▸  Anomaly Detection
  • Machine Learning (Week 9) Quiz ▸  Recommender Systems

=== Week 10 ===

  • No Assignment for Week 10
  • Machine Learning (Week 10) Quiz ▸  Large Scale Machine Learning

=== Week 11 ===

  • No Assignment for Week 11
  • Machine Learning (Week 11) Quiz ▸  Application: Photo OCR

Question 5: Your friend in the U.S. gives you a simple regression fit for predicting house prices from square feet.  The estimated intercept is -44850 and the estimated slope is 280.76.  You believe that your housing market behaves very similarly, but houses are measured in square meters.  To make predictions for inputs in square meters, what intercept must you use?  Hint: there are 0.092903 square meters in 1 square foot.  You do not need to round your answer.  (Note: the next quiz question will ask for the slope of the new model.)  I didn't get the answer for this; could anyone please help me with it?
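
One way to reason about this (a worked illustration, not the official solution): the prediction is price = -44850 + 280.76 * sqft, and sqft = sqm / 0.092903, so price = -44850 + (280.76 / 0.092903) * sqm.  Changing the units of the input rescales only the slope, so the intercept stays -44850 and the new slope is about 280.76 / 0.092903 ≈ 3022.1.  A quick check in Python:

SQM_PER_SQFT = 0.092903
intercept_ft, slope_ft = -44850, 280.76
intercept_m = intercept_ft                    # unchanged by a units change in x
slope_m = slope_ft / SQM_PER_SQFT             # ~3022.1 dollars per square meter
sqft = 1000
assert abs((intercept_ft + slope_ft * sqft)
           - (intercept_m + slope_m * (sqft * SQM_PER_SQFT))) < 1e-6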

Please comment below the specific week's quiz blog post, so that I can keep updating that post with new questions and answers.

Good day Akshay, I trust that you are doing well.  I am struggling to pass the week 2 assignment; can you please assist me?  I am desperate to pass this module and I am only getting 0%...  Thank you, I would really appreciate your help.

agarwal-ayushi/Machine-Learning-Assignments

This repository contains all the assignments to be done for the completion of the course COL 774 at IIT Delhi, taught by Professor Parag Singla.

The course webpage is here: http://www.cse.iitd.ac.in/~parags/teaching/col774/


  21. greyhatguy007/Machine-Learning-Specialization-Coursera

    python machine-learning deep-learning neural-network solutions mooc tensorflow linear-regression coursera recommendation-system logistic-regression decision-trees unsupervised-learning andrew-ng supervised-machine-learning unsupervised-machine-learning coursera-assignment coursera-specialization andrew-ng-machine-learning

  22. Coursera: Machine Learning

    by Akshay Daga (APDaga) - April 25, 2021. 4. The complete week-wise solutions for all the assignments and quizzes for the course "Coursera: Machine Learning by Andrew NG" is given below: Recommended Machine Learning Courses: Coursera: Machine Learning. Coursera: Deep Learning Specialization.

  23. Assignment 1: Machine Learning Textbook Analysis

    View COS4852 2024 Assignment 1 (2).pdf from COMPUTER S COS4852 at University of South Africa. COS4852/A1//2024 Tutorial Letter A1/0/2024 Machine Learning COS4852 Year module Department of Computer ... • A first encounter with Machine Learning, Max Welling, 2011. Give the complete URL where you found these textbooks, as well as the file size ...

  24. agarwal-ayushi/Machine-Learning-Assignments

    You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window.

  25. Training distilled machine learning models: non-technical

    Training distilled machine learning models: non-technical. BARDEHLE PAGENBERG Partnerschaft mbB. European Union, Germany April 23 2024. The invention relates to machine learning models such as ...