I'm proposing several topics for research internships. Feel free to contact me if one of them motivates you.

Master DISS : Machine Learning - 2022/2023

Introduction
This is the page of the Machine Learning class of the DISS International Master at Lyon 1 University.

Rooms - Schedule

Please refer to the weekly schedule here: https://adelb.univ-lyon1.fr/
Classes are mostly held in rooms TD10 and TP9, in the Nautibus building.

Program, classes and content

Below is an overview of the courses and practicals. Note that this schedule is still a work in progress and might change during the semester.
As the classes take place, I'll add my presentation slides, some notebooks, etc.

Day | Topic | Resources
Wednesday Sep. 7 (a.m.) | Introduction, Data Description | Exercises - Slides
Wednesday Sep. 14 (a.m.) | Machine Learning Basics | Exercises - Slides
Wednesday Sep. 21 (a.m.) | Supervised ML 2, classification | Exercises - Slides
Wednesday Sep. 28 (a.m.) | Supervised Learning: Advanced Methods (+Project) | Exercises - Slides
Wednesday Oct. 5 (a.m.) | Clustering beyond k-means | Exercises - Slides
Wednesday Oct. 12 (a.m.) | Graphs 1 | Exercises - Slides - Cheatsheet intro - Cheatsheet mat. - Cheatsheet centrality
Wednesday Oct. 19 (a.m.) | Presentations - Project |
Thursday Oct. 20 (p.m.), ROOM: TD12 (14h-17h30) | Deep Learning 1 (M. Lefort) |
Wednesday Oct. 26 (a.m.) | Graphs 2 | Exercises - Slides
Thursday Oct. 27 (p.m.), ROOM: TD12 (14h-17h30) | Deep Learning 2 (M. Lefort) |
Wednesday Nov. 2 (a.m.) | Deep Learning 3 (M. Lefort) |
Wednesday Nov. 9 (a.m.) | Deep Learning 4 (M. Lefort) |
Wednesday Nov. 16 (a.m.) | Dimensionality Reduction beyond PCA | Exercises - Slides
Wednesday Nov. 23 (a.m.) | Deep Learning Advanced 1 (G. Andresini) | Slides 1 - Slides 2
Wednesday Nov. 23 (p.m.) | Deep Learning Advanced 2 (G. Andresini) |
Wednesday Nov. 30 (a.m.) | Deep Learning Advanced 3 (G. Andresini) |
Wednesday Nov. 30 (p.m.) | Deep Learning Advanced 4 (G. Andresini) |
Wednesday Dec. 7 (a.m. + p.m.) | Bayesian Optimisation |
Wednesday Dec. 14 (a.m.) | Final Exam |


Data

I propose to work on a movie dataset, available on Kaggle: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
This dataset is quite rich, because it has both metadata about movies (title, duration, budget...) and information about viewers' scores.
(alternative download link - 2022 backup)
The uncompressed file is about 1 GB (real data!).
If you need a smaller one, the metadata file we use in the first classes is much smaller (34 MB) and can be accessed directly here.
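As a starting point, the metadata file can be loaded with pandas along the following lines. This is only a minimal sketch, not the code used in class: load_metadata is a hypothetical helper, and the column names budget and runtime (as well as the file name movies_metadata.csv mentioned below) are assumptions about the Kaggle download that you should check against your own copy.

```python
import io

import pandas as pd


def load_metadata(source, numeric_columns=("budget", "runtime")):
    """Read a movie metadata CSV and convert selected columns to numbers.

    `source` can be a file path or a file-like object. The default column
    names are assumptions; adapt them to the file you actually downloaded.
    """
    # low_memory=False reads the file in one pass, which avoids
    # mixed-type warnings on large files with messy columns
    df = pd.read_csv(source, low_memory=False)
    for column in numeric_columns:
        if column in df.columns:
            # Values that cannot be parsed become NaN instead of raising
            df[column] = pd.to_numeric(df[column], errors="coerce")
    return df


# Tiny in-memory sample so the sketch runs without the real file
sample = io.StringIO(
    "title,budget,runtime\n"
    "Movie A,1000000,95\n"
    "Movie B,not available,120\n"
)
movies = load_metadata(sample)
print(movies.dtypes)
print(movies.isna().sum())
```

With the real data, the call would look like load_metadata("movies_metadata.csv"), assuming that is the name of the file on your disk. Counting missing values right after loading is a cheap way to spot columns that need cleaning before any analysis.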
Some network datasets:
  • Small network: Game of Thrones (.graphml)
  • Airports with countries and positions (.graphml)
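The .graphml files can be opened in Python with networkx. The sketch below is one possible way to get a first look at such a file; summarize_graph is a hypothetical helper, and the toy graph only stands in for the real datasets so that the example runs on its own.

```python
import os
import tempfile

import networkx as nx


def summarize_graph(path):
    """Load a GraphML file and return a few basic statistics."""
    graph = nx.read_graphml(path)
    # Degree centrality: fraction of the other nodes each node is linked to
    centrality = nx.degree_centrality(graph)
    most_central = max(centrality, key=centrality.get)
    return {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "most_central": most_central,
    }


# Toy graph written to a temporary file, standing in for a downloaded dataset
toy = nx.Graph()
toy.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])
with tempfile.TemporaryDirectory() as folder:
    toy_path = os.path.join(folder, "toy.graphml")
    nx.write_graphml(toy, toy_path)
    summary = summarize_graph(toy_path)

print(summary)
```

Checking the number of nodes first is also a quick way to decide whether a graph is small enough to explore comfortably in a visualization tool.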

Tools

For tutorials and the experimental part of lectures, you will need some software, detailed below.

Python

Most of the experiments are done in Python. If you're not familiar with this language, there are numerous tutorials on the web; a good one, for instance, is from w3schools. If you want to be all set up for the experiments, here is a list of the packages we will use. Note that some of them are only available through pip, not Anaconda. If you're using Anaconda, you can nevertheless install them with the pip command (pip install package_name).
  • notebook. Jupyter notebook
  • pandas. Tabular data manipulation
  • scikit-learn. Machine learning/data mining
  • seaborn. Plotting library
  • networkx. Generic network analysis
  • cdlib. Community detection
A quick tutorial for pandas: here
A quick tutorial about Python data structures (lists, dictionaries, sets...): here
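To check that your environment is ready, a short script such as the one below can report which of the packages listed above are missing. It is a sketch using only the standard library; find_missing is a hypothetical helper, and the only non-obvious mapping is that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# pip package name -> name used in `import` statements
# (scikit-learn is imported as sklearn; the others use the same name)
PACKAGES = {
    "notebook": "notebook",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "seaborn": "seaborn",
    "networkx": "networkx",
    "cdlib": "cdlib",
}


def find_missing(packages=PACKAGES):
    """Return the pip names of the packages that cannot be found."""
    missing = []
    for pip_name, import_name in packages.items():
        # find_spec returns None when the top-level module is not installed
        if importlib.util.find_spec(import_name) is None:
            missing.append(pip_name)
    return missing


missing = find_missing()
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All course packages are available.")
```

Run it with the same Python interpreter that your notebooks use, since packages installed in another environment will not be visible.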

Gephi

Gephi is software for basic graph manipulation and visualization. Although it offers little in terms of graph analysis, it is really convenient for exploring and visualizing graphs of small to medium size (< 1000 nodes).
It can be downloaded here: Gephi.
Gephi requires Java and suffers from a few bugs on Windows (but there is no better alternative). Here are solutions to common problems:

Exams

Partial Exam 1 (15

Please choose, in groups of 1 or 2 people, an article to present. You'll present this article in 15 to 20 minutes (+ questions) on Wednesday October 12. You can choose among the suggestions below, or suggest another article that interests you.
  • Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., & Lopez, A. (2020). A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing, 408, 189-215.
  • Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30.
  • Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. Advances in neural information processing systems, 28.

Partial Exam 2: Project

  • For this project, you can work in groups of 1 or 2. Exceptionally, groups of 3 are allowed, but in that case you need to make clear which aspects each person worked on.
  • You will apply the tools we have learnt during the class to the dataset of your choice. You can also apply tools you have seen in other classes, but you must mainly use methods from this class. A few recommendations to find a dataset.
  • The grade is split equally between two aspects: 1) you show technically that you know how to analyse data, 2) you show through comments and discussion that you understood the class. The report should thus have approximately as much text as code. You should, for instance, comment on the choice of methods, their performance, the transformations applied to variables, your confidence in the results obtained, etc.
  • Your result should be provided as a Jupyter notebook (+ .py dependency files if needed), plus a PDF if needed.
  • You should send your project to me (email/Discord) before Sunday December 18, 23:59.

Final exam

The final exam will take place on December 14, with questions from all teachers on all parts of the class.