greends-pml

Links and exercises for the course Practical Machine Learning, Green Data Science, 2nd semester 2023/2024


Instructor: Manuel Campagnolo, ISA/ULisboa

The course follows a flipped classroom model. Work outside class draws on a range of machine learning resources, including the book by Sebastian Raschka, Yuxi (Hayden) Liu, and Vahid Mirjalili, Machine Learning with PyTorch and Scikit-Learn (Packt Publishing, 2022). During classes, the notebooks (Python code) are run on Google Colab.

Links for class resources:

Overview notebook: this notebook provides an overview of the full course and contains pointers to other sources of relevant information and Python scripts.

Sessions: Each description below includes a summary of the topics covered in the session, a description of the assignments, and links to videos or other materials that students should work through.

Introduction (Feb 23, 2024)

The goal of the first class is to give an example of a complex machine learning problem that can easily be solved using freely available resources. The example uses the high-level machine learning package fastai.

Basic concepts (Mar 1, 2024): model, loss, gradient descent

The goal of the following classes, up to April 12, is to understand how deep learning models can be trained and used to solve regression and classification problems. We start by applying the machine learning approach to well-known statistical problems like linear regression, to illustrate the stepwise approach followed in ML. We use synthetic data generated from a linear or quadratic regression, where one can control the underlying model and the amount of noise. Then, we consider the Iris tabular data set, with 4 explanatory variables and a categorical label that can be one of three species.
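The model, loss, and gradient descent steps described above can be sketched on synthetic linear data. This is a minimal illustration in plain Python, not the course notebook: the true parameters, noise level, learning rate, and number of steps are all assumed values chosen for the example.

```python
import random

random.seed(0)

# Synthetic data from a known linear model y = 2x + 1 plus Gaussian noise.
# (true_w, true_b and noise_sd are assumptions made for this illustration.)
true_w, true_b, noise_sd = 2.0, 1.0, 0.1
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [true_w * x + true_b + random.gauss(0.0, noise_sd) for x in xs]


def mse(w, b):
    """Mean squared error loss of the model y_hat = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w, b = 0.0, 0.0        # initial parameter values
learning_rate = 0.1    # assumed step size
n = len(xs)

for step in range(500):
    # Gradients of the MSE loss with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Gradient descent update: move the parameters against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}, loss = {mse(w, b):.4f}")
```

Because the data are synthetic, the fitted `w` and `b` can be compared directly with the values used to generate them, and increasing `noise_sd` shows how noise affects the fit.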

Linear regression examples (Mar 15, 2024): epochs, perceptron, batches, train and test, overfitting

This session extends the previous class. We discuss some additional core ML concepts and we extend the approach to classification problems (discrete labels). The model (the perceptron) is still very simple and closely related to linear regression.
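The concepts listed for this session (epochs, batches, train and test sets, and the perceptron) can be put together in a short sketch. The example below is an illustration under assumed settings, not the course code: the two-class data set, the 80/20 split, the batch size, and the learning rate are all hypothetical choices.

```python
import random

random.seed(1)

# Synthetic two-class data: label 1 if x1 + x2 > 0, else 0.
# Points too close to the boundary are dropped so the classes are separable.
data = []
while len(data) < 200:
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    if abs(x1 + x2) > 0.2:
        data.append(((x1, x2), 1 if x1 + x2 > 0 else 0))

# Hold out 20% of the examples to measure performance on unseen data.
random.shuffle(data)
train, test = data[:160], data[160:]


def predict(w, b, x):
    """Perceptron output: a step function applied to a linear combination."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


def accuracy(w, b, examples):
    """Fraction of examples classified correctly."""
    return sum(predict(w, b, x) == y for x, y in examples) / len(examples)


w, b = [0.0, 0.0], 0.0
learning_rate, batch_size, n_epochs = 0.1, 16, 20

for epoch in range(n_epochs):          # one epoch = one pass over the training set
    random.shuffle(train)
    for start in range(0, len(train), batch_size):
        batch = train[start:start + batch_size]
        # Accumulate the perceptron updates over the mini-batch.
        dw, db = [0.0, 0.0], 0.0
        for x, y in batch:
            error = y - predict(w, b, x)   # -1, 0 or +1
            dw[0] += error * x[0]
            dw[1] += error * x[1]
            db += error
        # Apply the averaged update once per batch.
        w[0] += learning_rate * dw[0] / len(batch)
        w[1] += learning_rate * dw[1] / len(batch)
        b += learning_rate * db / len(batch)

train_acc = accuracy(w, b, train)
test_acc = accuracy(w, b, test)
print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```

Comparing `train_acc` with `test_acc` is the simplest check for overfitting: a model that does much better on the training set than on the held-out set has memorized rather than generalized.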

Regression vs classification problems; assessing ML performance (Mar 22, 2024): cross-entropy, confusion matrix, accuracy metrics

The main goal of this session is to understand how one can evaluate the accuracy of a classifier.
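As a small worked example of the quantities named in this session, the sketch below computes a confusion matrix, accuracy, per-class precision and recall, and the cross-entropy loss in plain Python. The six labels and predicted probabilities are made up for illustration; only the three Iris species names come from the data set.

```python
import math

classes = ["setosa", "versicolor", "virginica"]

# Hypothetical true labels and predicted class probabilities for six examples.
y_true = [0, 0, 1, 1, 2, 2]
probs = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],   # a versicolor example predicted as virginica
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]

# Predicted class = index of the largest probability.
y_pred = [max(range(len(p)), key=p.__getitem__) for p in probs]

# Confusion matrix: rows are true classes, columns are predicted classes.
n = len(classes)
confusion = [[0] * n for _ in range(n)]
for t, p in zip(y_true, y_pred):
    confusion[t][p] += 1

# Overall accuracy: fraction of examples on the diagonal.
accuracy = sum(confusion[i][i] for i in range(n)) / len(y_true)

# Per-class recall (diagonal / row sum) and precision (diagonal / column sum).
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
precision = [confusion[i][i] / sum(row[i] for row in confusion) for i in range(n)]

# Cross-entropy: mean negative log-probability assigned to the true class.
cross_entropy = -sum(math.log(p[t]) for p, t in zip(probs, y_true)) / len(y_true)

for name, row in zip(classes, confusion):
    print(f"{name:>10}: {row}")
print(f"accuracy = {accuracy:.3f}, cross-entropy = {cross_entropy:.3f}")
```

Note that accuracy only depends on the predicted class, while cross-entropy also penalizes correct predictions that were made with low confidence.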

Neural networks (April 5, 2024): an implementation with PyTorch

In this session we extend the perceptron to a more complex model with multiple layers, called a neural network. We discuss how a neural network can be created and trained with PyTorch. Two data sets are used to illustrate the construction: the tabular Iris data set used before, and a more complex data set (MNIST) whose examples are images. At this point the images are read simply as vectors of numbers, in the same tabular way as the Iris data set.
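A minimal sketch of how such a network might be defined and trained in PyTorch is given below. It is not the course notebook: the input is random stand-in data with the same shape as Iris (4 features, 3 classes), and the hidden layer size, learning rate, and number of epochs are assumed values. For MNIST read as vectors, the first layer would instead take 28 x 28 = 784 inputs.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in data with the same shape as Iris: 150 examples, 4 features, 3 classes.
# (Random values, used only to show the mechanics; this is not the real data set.)
X = torch.randn(150, 4)
y = torch.randint(0, 3, (150,))

# A small multilayer network: 4 inputs -> 16 hidden units -> 3 class scores.
model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 3),
)

loss_fn = nn.CrossEntropyLoss()   # expects raw scores (logits) and integer labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = loss_fn(model(X), y).item()

for epoch in range(100):
    optimizer.zero_grad()         # reset gradients from the previous step
    loss = loss_fn(model(X), y)   # forward pass and loss computation
    loss.backward()               # backpropagation: compute gradients
    optimizer.step()              # gradient descent update of the weights

final_loss = loss.item()
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

The loop has the same structure as the hand-written gradient descent used for linear regression; PyTorch only automates the gradient computation and the parameter update.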

Convolutional neural networks (April 12, 2024): parameters for convolutional layers; an implementation with PyTorch

In this session, we improve on the model used in the previous session for the MNIST data set. Since the examples are images, it makes sense to exploit the spatial context within each image. This can be done with convolutions over the images. Therefore, we add 2D-convolution and maxpool layers to the previous model and create a convolutional neural network using PyTorch.
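When stacking convolution and maxpool layers, it helps to track how the kernel size, stride, and padding change the image size. The sketch below does this arithmetic for a 28 x 28 MNIST image, using the standard output-size formula for layers without dilation. The layer configuration (two 3 x 3 convolutions with padding 1, each followed by 2 x 2 max pooling, with 8 and 16 output channels) is a hypothetical example, not necessarily the architecture used in class.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_param_count(in_channels, out_channels, kernel_size):
    """Number of weights plus biases in a 2D convolutional layer."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


size = 28                                                # MNIST images are 28 x 28
size = conv_output_size(size, kernel_size=3, padding=1)  # 3x3 convolution -> 28
size = conv_output_size(size, kernel_size=2, stride=2)   # 2x2 max pooling -> 14
size = conv_output_size(size, kernel_size=3, padding=1)  # 3x3 convolution -> 14
size = conv_output_size(size, kernel_size=2, stride=2)   # 2x2 max pooling -> 7

# Assumed channel counts: 1 (grayscale input) -> 8 -> 16.
out_channels_2 = 16
flattened = out_channels_2 * size * size   # inputs to the first linear layer

print(f"final feature map: {size} x {size}, flattened size: {flattened}")
print(f"parameters in conv 1: {conv_param_count(1, 8, 3)}")
print(f"parameters in conv 2: {conv_param_count(8, 16, 3)}")
```

The flattened size is the number of input features that the first fully connected layer after the convolutional part must accept.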

Transfer learning (April 24, 2024): using and fine-tuning pre-trained models

We have seen how machine learning models are created and trained with PyTorch. However, when applying our model (e.g. a CNN) to a larger data set (e.g. CIFAR10), we encounter several problems, such as:

  1. the accuracy is low because the model is not good enough,
  2. training from scratch requires a lot of computational resources.

In the first session (Introduction), we discussed a short script that uses the high-level package fastai to load a pre-trained convolutional neural network and classify images downloaded from the internet. Similarly, we now adapt the code discussed earlier to load and improve a pre-trained model called ResNet18. We will see how to access a pre-trained model in PyTorch and fine-tune it to our data set. This addresses both concerns listed above.

Production (May 3, 2024): saving and deploying models with gradio
Tabular data (May 10, 2024): preprocess tabular data
Feature engineering and data visualization (May 17, 2024): t-SNE, UMAP, processing pipeline
Random forests (May 23, 2024)
Gradient boosting and other ensemble techniques (May 30, 2024): AdaBoost and XGBoost; variable importance; sklearn cheat sheet


Model evaluation and hyperparameter tuning (June 7, 2024): K-fold cross-validation, grid search, ROC curve and AUC

For the following topics, see Chap 6 notebook at https://github.com/rasbt/machine-learning-book/ and corresponding sections in the Overview notebook.
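To make the idea of K-fold cross-validation concrete, the sketch below splits a set of example indices into k folds, each used once for validation while the remaining folds are used for training. The function name `k_fold_indices` is hypothetical and written for this illustration. In practice one would typically shuffle the data first and rely on a library implementation such as scikit-learn's `KFold`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for fold_size in fold_sizes:
        val = indices[start:start + fold_size]
        train = indices[:start] + indices[start + fold_size:]
        yield train, val
        start += fold_size


folds = list(k_fold_indices(10, 3))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train = {train_idx}, validation = {val_idx}")
```

A model would be trained k times, once per fold, and the k validation scores averaged to estimate performance for a given choice of hyperparameters.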


Main on-line resources
Some other useful links