Schedule
Week 1 – Welcome
17 January
- what is data mining
- course logistics
- creating a simple musical artist data set (rating songs on this playlist)
- nearest neighbor algorithm
19 January
- lecture: algorithms and data
- Anaconda – getting started
Week 2 – learning the basics
24 January
- getting started with IPython notebooks:
- collaborative filtering
- nearest neighbor algorithm
- intro to NumPy (NumPy notebook)
26 January
- intro to Pandas (Pandas notebook)
- intro to Pearson (Pearson notebook)
Week 3 – Recommendation systems
31 January
- Pearson
- Matrix Factorization (PDF; ends abruptly but hopefully gives a reasonable intro)
- Matrix Factorization lab (Matrix Factorization Notebook)
- Matrix Factorization Techniques for Recommender Systems by Koren, Bell, and Volinsky
- Matrix Factorization and Neighbor Based Algorithms for the Netflix Prize Problem by Takács et al.
2 February
- Introduction to sklearn
- RAT #1 Chapters 2, 3 and Matrix Factorization
Week 4 – Classification and Normalization
7 February
- Titanic Task
- Chapter 4: Intro to kNN classification
- mostly a lab day
9 February
- Chapter 4: Normalization
- Titanic Task
Week 5 – Entropy and Decision Trees
14 February
- Decision Tree team worksheet
- Decision Tree Python Notebook
16 February
- TBD
Week 6 – Naïve Bayes Classifier
21 February
- Intro to Naïve Bayes
- Pen or Pencil spreadsheet
- Naïve Bayes Python Notebook
23 February
- Challenge
Week 7 – Decision Trees Continued
28 February
- RAT 2: Entropy, Decision Trees, Probability and Naïve Bayes
2 March
Week 8
7 March – Spring Break
9 March – Spring Break
Week 9 – Bayes and Unstructured Text
14 March
- Analyzing Text Notebook
- Other files needed:
16 March
- Lab Day
Week 10 – Bayes and Unstructured Text
21 March
23 March
- Naïve Bayes and unstructured text
- 20 Newsgroups dataset (notes) (125xp)
- TBD: Twitter Sentiment Analysis dataset
Week 11 – Clustering
28 March
- clustering
- dog breed Google Sheet
- dog distances sorted
- Team Task: hierarchical clustering
- k-means clustering
- clustering notebook
30 March
- hierarchical clustering
Week 12 – Focus on Notebooks
4 April
- walkthrough of Naïve Bayes Notebook
6 April
Week 13 – Feature Selection
11 April
13 April
- Regression
- Cricket Chirps vs Temperature
- House Prices Dataset (100xp)
- Combined Cycle Power Plant Data Set (50xp)
  Note: there are 5 sheets of data that you need to merge.
Week 14 – Feature Selection and PCA
18 April
- Advice on How to start on a Data Science Problem
- Lasso Regression
- Machine Learning Ensemble Methods
- Principal Component Analysis
- Eigenfaces
20 April
- PCA
Week 15
25 April
- Presentations