Course Syllabus for

Scalable Data Science and Distributed Machine Learning
Skalbar data science och distribuerad maskininlärning

EDA080F, 6 credits

Valid from: Autumn 2020
Decided by: Professor Thomas Johansson
Date of establishment: 2022-01-19

General Information

Division: Computer Science (LTH)
Course type: Third-cycle course
Teaching language: English


The student should become familiar with: scalable data processes and partitioning methods such as random forests; scaling up neural networks such as CNNs, RNNs and GANs; and scalable machine learning pipelines for typical decision problems, such as prediction, A/B testing and anomaly detection.


Knowledge and Understanding

For a passing grade the doctoral student must show in assignments that the introduced concepts (see Course Contents) have been understood and can be applied to a given problem.

Competences and Skills

For a passing grade the doctoral student must solve given real-world or realistic problems in the respective assignments, using the concepts and theories introduced in the course.

Judgement and Approach

For a passing grade the doctoral student must be able to determine which method to apply in a given problem context, and be able to determine the quality of a result obtained by applying the taught methods.

Course Contents

The course is given in three modules. In addition to lectures by the organizers, there will be invited guest speakers from industry.

Module 1 – Introduction to Data Science: Introduction to fault-tolerant distributed file systems and computing. The whole data science process illustrated with industrial case studies. Practical introduction to scalable data processing to ingest, extract, load, transform, and explore (un)structured datasets. Scalable machine learning pipelines to model, train/fit, validate, select, tune, test and predict or estimate in an unsupervised and a supervised setting using nonparametric and partitioning methods such as random forests. Introduction to distributed vertex-programming.

Module 2 – Distributed Deep Learning: Introduction to the theory and implementation of distributed deep learning. Classification and regression using generalised linear models, including different learning, regularization, and hyperparameter tuning techniques. The feedforward deep network as a fundamental network, and the advanced techniques to overcome its main challenges, such as overfitting, vanishing/exploding gradients, and training speed. Various deep neural networks for various kinds of data: for example, CNNs for scaling up neural networks to process large images, RNNs for scaling up deep neural models to long temporal sequences, and autoencoders and GANs.

Module 3 – Decision-making with Scalable Algorithms: Theoretical foundations of distributed systems and analysis of their scalable algorithms for sorting, joining, streaming, sketching, optimising and computing in numerical linear algebra, with applications in scalable machine learning pipelines for typical decision problems (e.g. prediction, A/B testing, anomaly detection) with various types of data (e.g. time-indexed, space-time-indexed and network-indexed). Privacy-aware decisions with sanitized (cleaned, imputed, anonymised) datasets and data streams. Practical applications of these algorithms to real-world examples (e.g. mobility, social media, machine sensors and logs). Illustration via industrial use cases.

In the first course module, we aim to ensure that all students understand the basic concepts and tools in deep learning.

Course Literature

Specific material and literature are announced and distributed in connection with the course instances.

Instruction Details

Type of instruction: Lectures. Lectures are given module-wise in block sessions.

Examination Details

Examination format: Written assignments. Hand-ins (assignments) can include practical parts.
Grading scale: Fail, Pass

Admission Details

Course Occasion Information

Contact and Other Information

Course coordinator: Elin A. Topp
