Master DISS : Machine Learning - 2023/2024

Introduction
This is the page of the Machine Learning class of the DISS International Master at Lyon 1 University.

Rooms - Schedule

Please refer to the weekly schedule here: https://adelb.univ-lyon1.fr/

Program, classes and content

Below is an overview of the courses and practicals.
As classes take place, I'll add my presentation slides, some notebooks, etc.
This organization is subject to change.

Day | Topic | Resources
Wednesday Sep. 6 (a.m.) | Introduction, Data Description | Slides - Practical
Wednesday Sep. 13 (a.m.) | Unsupervised: Clustering beyond k-means | Slides - Practical
Wednesday Sep. 13 (p.m.) | Supervised Machine Learning: Basics | Slides - Practical
Wednesday Sep. 20 (a.m.) | Supervised ML 2 | Slides - Practical
Wednesday Sep. 27 (a.m.) | Data Challenge | GitHub with data
Wednesday Oct. 4 (a.m.) | Networks 1 | Slides - Practical 1 - Practical 2 - CheatSheet_intro - CheatSheet_matrices - CheatSheet_centralities
Wednesday Oct. 11 (a.m.) | Mathieu Lefort: Deep Neural Networks (lecture) | Moodle link
Wednesday Oct. 11 (p.m.) | Presentations |
Wednesday Oct. 18 (a.m.) | Networks 2 + Project presentation | Slides
Wednesday Oct. 25 (a.m.) | Mathieu Lefort: Deep Neural Networks (lecture) | Moodle link
Wednesday Nov. 8 (a.m.) | Mathieu Lefort: Deep Neural Networks (practical) | Moodle link
Wednesday Nov. 15 (a.m.) | (Bruno Yun) - ML pipeline | Content
Wednesday Nov. 15 (p.m.) | Mathieu Lefort: Deep Neural Networks (practical) | Moodle link
Wednesday Nov. 22 (a.m.) | (Bruno Yun) - ML for Textual Data | Content
Wednesday Nov. 29 (a.m.) | (Bruno Yun) - ML with Large Language Models (part 1) | Content
Wednesday Nov. 29 (p.m.) | (Bruno Yun) - ML with Large Language Models (part 2) | Content
Wednesday Dec. 6 (a.m.) | Session working on the project (reminder: the project must be uploaded to Tomuss by December 17) |
Wednesday Dec. 13 (9:45-11:45 a.m.) | Written exam |
Wednesday Jan. 17 (9:45 a.m.) | Article presentation V2 (Nautibus TD12) | Link to schedule


Data

Data Exploration
  • A synthetic, toy dataset about used cars: cars_synthetic.csv
  • A real dataset about used cars: usedCarsVW.csv. This dataset comes from Kaggle. The complete dataset is available here.
Clustering
Supervised Learning
Networks

Article presentation

Proposed articles
    • Karczmarek, P., Kiersztyn, A., Pedrycz, W., & Al, E. (2020). K-Means-based isolation forest. Knowledge-based systems, 195, 105659.
    • Carreira-Perpiñán, M. A. (2015). A review of mean-shift algorithms for clustering. arXiv preprint arXiv:1503.00687.
    • Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. ACM Sigmod record, 28(2), 49-60.
    • Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine learning, 2, 139-172.
    • He, X., Zhao, K., & Chu, X. (2021). AutoML: A survey of the state-of-the-art. Knowledge-Based Systems, 212, 106622.
    • Ran, X., Zhou, X., Lei, M., Tepsan, W., & Deng, W. (2021). A novel k-means clustering algorithm with a noise algorithm for capturing urban hotspots. Applied Sciences, 11(23), 11202.
    • Zafar, M. B., Valera, I., Rodriguez, M. G., & Gummadi, K. P. (2017, April). Fairness constraints: Mechanisms for fair classification. In Artificial intelligence and statistics (pp. 962-970). PMLR.
    • Mercioni, M. A., & Holban, S. (2019, June). A survey of distance metrics in clustering data mining techniques. In Proceedings of the 3rd International Conference on Graphics and Signal Processing (pp. 44-47).
    • Nguyen, H. L., Woon, Y. K., & Ng, W. K. (2015). A survey on data stream clustering and classification. Knowledge and information systems, 45, 535-569.
    • Shwartz-Ziv, R., & Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81, 84-90.
    • LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4), 541-551.

Tools

For the tutorials and the experimental part of the lectures, you will need the software detailed below.

Python

Most of the experiments are done in Python. If you're not familiar with this language, there are numerous tutorials on the web; a good one, for instance, is from w3schools. If you want to be all set up for the experiments, here is a list of packages we will use. Note that some of them are only available through pip, not through Anaconda. If you're using Anaconda, you can nevertheless install them with the pip command (pip install package_name).
  • notebook. Jupyter notebook
  • pandas. Tabular data manipulation and analysis
  • scikit-learn. Machine learning/Data mining
  • seaborn. Plotting library
  • networkx. Generic network analysis
  • cdlib. Community detection
A quick tutorial for pandas: here
A quick tutorial about Python data structures (lists, dictionaries, sets...): here
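To check that your environment is ready, the short sketch below tests whether each package from the list above can be found by Python. It only uses the standard library. The helper name check_packages is hypothetical (it is not part of the course material), and the only assumption beyond the list is that scikit-learn is imported under the name sklearn.

```python
from importlib.util import find_spec

# pip package name -> name used in "import ..." (they differ for scikit-learn).
PACKAGES = {
    "notebook": "notebook",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "seaborn": "seaborn",
    "networkx": "networkx",
    "cdlib": "cdlib",
}


def check_packages(packages=PACKAGES):
    """Return {pip_name: bool}, True when the module can be located."""
    return {
        pip_name: find_spec(module) is not None
        for pip_name, module in packages.items()
    }


if __name__ == "__main__":
    for pip_name, found in check_packages().items():
        status = "OK" if found else f"missing (try: pip install {pip_name})"
        print(f"{pip_name:15s} {status}")
```

Any package reported as missing can then be installed with the pip command mentioned above.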

Gephi

Gephi is a tool for basic graph manipulation and visualization. Although it offers little in terms of graph analysis, it is really convenient for exploring and visualizing graphs of small to medium size (< 1000 nodes).
It can be downloaded here: Gephi.
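If you build a graph in Python with networkx, one way to open it in Gephi is to export it to the GEXF format, which Gephi can read. The sketch below is only an illustration: the helper name export_for_gephi and the file name karate.gexf are hypothetical, and the Zachary karate club graph shipped with networkx is used as a stand-in for your own data.

```python
import networkx as nx


def export_for_gephi(graph, path):
    """Write a networkx graph to a GEXF file that Gephi can open."""
    nx.write_gexf(graph, path)
    return path


if __name__ == "__main__":
    # Small example graph bundled with networkx (34 nodes).
    G = nx.karate_club_graph()
    export_for_gephi(G, "karate.gexf")
```

The resulting file can then be opened from Gephi's File > Open menu.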
Gephi requires Java, and suffers from a few bugs on Windows (but there is no better alternative). Here are solutions to common problems:

Exams

Coefficients

  • CT - final exam: 50%
  • Presentation: 10%
  • Project Rémy Cazabet: 20%
  • Project Mathieu Lefort: 10%
  • Project Bruno Yun: 10%

Project

  • To complete the project, you can form groups of 2 or 3. You must clearly indicate who worked on which part. Grades can be individualized.
  • The aim of the project is to take a real dataset and analyze it using techniques and tools discussed in class. You can use tools we haven't covered in class, but a significant portion of the project should be about applying what you've learned. Here are some tips for finding a dataset.
  • The deliverables consist of two parts: 1) The code. It can be in the form of a notebook, .py files, or a combination of both. 2) A report-type PDF document, no longer than 6 pages + possibly an appendix of figures with captions. The report must be well-prepared. It should describe the data and present a clear and relevant analysis.
  • The grade will evaluate 1) the use of various methods taught in class, 2) the quality of the document (readability, clarity of explanations, usage of precise and appropriate terminology). The code itself won't be graded, but it will be checked to ensure the work was indeed done by the students.
  • The project is due by Sunday, December 17, 23:59, and should be submitted on Tomuss, either as a file or as a URL linking to a repository like GitHub/GitLab.

Final exam

The final exam will take place on December 13, with questions from all teachers covering all parts of the class.

2022 Exam (my part)

Note on the article presentation V2

  • Presentation duration: 20 minutes in total (15 minutes of presentation, 5 minutes of questions). Respecting the time limit is part of the grade.
  • Select Key Information: Focus on the most important points of your article. Don't include everything; choose what's most relevant to your audience.
  • Charts and Tables: When presenting a chart or table, ensure you:
    • Highlight the main takeaways.
    • Direct the audience's attention to significant data points.
    • Provide a brief explanation to set the context.
  • Questions: Anticipate and address the most significant or likely questions your audience might have about your article.
  • Keep Slides Clear:
    • Avoid using complete sentences; use bullet points or short phrases instead.
    • Ensure slides are readable and not crowded with text.
  • Presentation Length: Aim for approximately one slide per minute (for instance, 12 slides for a 12-minute presentation). This keeps your content well paced and prevents rushing.
  • Quality over Quantity: Focus on presenting high-quality, relevant information rather than trying to include everything.
  • Address Uncertainties: If there's something in the article you don't understand, be transparent about it. This can lead to productive discussions.
  • Originality:
    • Avoid copying slides from online sources; always create your own content.
    • Use your own figures, diagrams, and visuals whenever possible.
  • Storytelling: Frame your presentation as a narrative or story. This helps in engaging your audience and making complex information more accessible.
  • Pacing: Speak clearly and at a moderate pace. Speaking too quickly can make it difficult for your audience to follow along.
  • Relevance of Plots: Only include plots or graphs if there's sufficient time to explain them. If the audience can't understand a plot's relevance, it's best to leave it out.