Syllabus
cs419 Data Mining
The course covers basic data mining techniques including those used for collective intelligence applications. We will experiment with these methods using Python.
Topics covered
- Collaborative filtering
- Item-based filtering
- Nearest-neighbor algorithms
- Distance methods (Manhattan, Euclidean, Minkowski)
- Evaluation of machine learning techniques
- Naïve Bayes classification methods
- Probability density functions
- Naïve Bayes and unstructured text
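As a preview of the distance methods listed above, here is a minimal Python sketch, not the textbook's code. The function names and the parameter r are illustrative assumptions. Minkowski distance generalizes the other two: r = 1 gives Manhattan distance and r = 2 gives Euclidean distance.

```python
def minkowski(a, b, r):
    """Minkowski distance between two equal-length numeric sequences.

    r is the order of the distance (an illustrative parameter name):
    r = 1 is Manhattan distance, r = 2 is Euclidean distance.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Straight-line distance."""
    return minkowski(a, b, 2)


# Two hypothetical users who each rated the same two items.
user_a = (0, 0)
user_b = (3, 4)
print(manhattan(user_a, user_b))
print(euclidean(user_a, user_b))
```

A smaller distance means the two rating vectors are more alike, which is the idea nearest-neighbor methods build on.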
Format
The course takes an active, hands-on approach to learning. Students will spend much of the class time exploring, experimenting, and evaluating code using their own laptops. Class time is divided between short lectures, individual experimentation with programming, working on code with a partner, team projects, and quizzes. During the first week all students will be assigned to permanent teams of around 5 people.
Every week you will read one chapter of the textbook. I will not regurgitate that material in a lecture. I will assume you have read and understood it. There will be ‘readiness assessment tests’ on textbook material not covered in class. Class time will be spent practicing what we have learned. The emphasis is not on theory but on learning development skills that can be used in the workplace.
Laptops
I am assuming that nearly everyone has a laptop. We will be working with laptops during a large percentage of our class. It doesn’t matter if your laptop runs Microsoft Windows, is a Mac, or an Ubuntu machine. Usually when I write this in a syllabus I add “It doesn’t matter if it is 5 years old. It also doesn’t matter how powerful it is–even a basic netbook will work.” In the case of this class, it does matter. Doing data mining even on small datasets is processor intensive. If you are using a netbook you will need to reduce the size of the data.
Required book
A Programmer’s Guide to Data Mining: The Ancient Art of the Numerati, which is a free online book. You can get extra credit by pointing out errors in this book.
Evaluation
Grading is based on a method developed by Professor Lee Sheldon at Indiana University. It is based on obtaining experience points (XP). The number of XP determines what level you are at. You start the class at Level Zero and with 0 XP. The level you obtain at the end of the semester determines your final grade. Here is the chart:
If you are at Level Four or lower when mid-semester reports are due, I will report your work as unsatisfactory.
Activities
All activities are optional. The total number of XP exceeds 2400. You only need 2150 for an A.
Readiness Assessment Tests – 480XP
There will be approximately 8 short multiple-choice readiness assessment tests (RATs) given during the course. Each quiz will be taken individually; immediately afterward, the same test will be taken as a team. Each individual quiz is worth on average 30XP; each team quiz is also worth on average 30XP. You will have advance notice of these quizzes. In addition, there may be unannounced mini-quizzes. They may be individual only, or individual and team.
Programming Practice – 500XP
Programming practice will typically be slight remixes of the book exercises and done in class. There will be at least 7 practices. Programming practice can be done individually or with a partner.
Programming Demo – 150-300XP
You can elect to extend the code for a particular programming practice. During class I will offer possible extensions. You will give a short, 15-minute demo of the code to the class. You will gain up to 150XP for each demo project. This can be done individually or with a partner, and can be repeated with a different partner.
Class Presentation – 200XP
You can elect to give a short, 15-minute presentation on some machine learning topic (I will offer suggestions throughout the semester). You can do this individually or with a partner. This can be repeated.
Exam – 400XP
Throughout the last half of the semester I will post questions & problems. These are part of the final exam for the course. You can elect to complete the work anytime between the time the problem is posted and the final exam period. You will gain 10% more XP if you complete the work within a week of the problem being posted.
Team In Class Projects and Worksheets – 400XP
Team projects may be programming tasks, design tasks, or other work.
Team Participation – 100XP
Each student will rate the helpfulness of all members of their team. Individual team participation scores will be the sum of the points they receive from other members of their team. Each team member distributes 100 points to other members of the team. The average team participation score will be 100 points. The rater must differentiate some of their ratings (they cannot assign the same rating to all members).
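The scoring rule above can be sketched in Python. This is an illustration under stated assumptions, not the official grading script: the function name, the avatar names, and the point values are hypothetical, and the check that a rater "must differentiate" is interpreted here as "at least two different values among the ratings given".

```python
def participation_scores(ratings):
    """Return each member's score: the sum of points received from teammates.

    ratings[rater][teammate] is the number of points the rater gives
    that teammate (a hypothetical data layout chosen for this sketch).
    """
    members = set(ratings)
    scores = {member: 0 for member in members}
    for rater, given in ratings.items():
        # Each rater rates every other team member, and nobody else.
        if set(given) != members - {rater}:
            raise ValueError(f"{rater} must rate each other team member once")
        # Each rater distributes exactly 100 points.
        if sum(given.values()) != 100:
            raise ValueError(f"{rater} must distribute exactly 100 points")
        # Assumed reading of the rule: not every rating may be identical.
        if len(given) > 1 and len(set(given.values())) == 1:
            raise ValueError(f"{rater} must differentiate some ratings")
        for teammate, points in given.items():
            scores[teammate] += points
    return scores


# Hypothetical four-person team, identified by made-up avatar names.
example = {
    "ada": {"bo": 40, "cy": 35, "di": 25},
    "bo": {"ada": 30, "cy": 45, "di": 25},
    "cy": {"ada": 35, "bo": 35, "di": 30},
    "di": {"ada": 40, "bo": 30, "cy": 30},
}
scores = participation_scores(example)
average = sum(scores.values()) / len(scores)
print(scores, average)
```

Because every rater hands out exactly 100 points, the team's total always equals 100 times the number of members, which is why the average score is 100 regardless of how the points are split.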
Projects – 500XP
Throughout the semester I will be offering different data mining/machine learning challenges (for example, this one on Caterpillar Tractor tubes). Solve the problem, gain XP. Simple as that!
Book Corrections/Suggestions 2-75XP per suggestion
You will lose XP if you do not attend class during the last week (student presentations).
Announcements, discussions, and questions
I will communicate with the class via Piazza and the course web page. Please sign up for Piazza at piazza.com/umw/fall2015/cs419
For questions about any aspect of the course, including homework and labs, please use Piazza rather than email. If you want, you can tag questions as ‘private’.
You are responsible for checking your email and Piazza every 24 hours and the web page at least weekly.
Accommodations for Students with Special Needs
Any student with a documented disability may receive a special accommodation to complete any requirements of this course. If you have a disability or believe you have one, you may wish to self-identify. You may do so by providing documentation to the Office of Disability Services located in Room 203 of George Washington Hall (Phone: Voice 540-654-1266, Fax: 540-654-1163). Appropriate accommodations may then be provided for you. If you have a condition that may affect your ability to exit the premises in an emergency or that may cause an emergency during class, you are encouraged to discuss this in confidence with me and/or anyone at the Office of Disability Services. This office can also answer any questions you have about the Americans with Disabilities Act (ADA).
Academic Integrity
I assume you are an ethical student and a person with integrity. I expect that you will follow the university honor code (see http://rosemary.umw.edu/CSHonorCode.html). Please use common sense and ask yourself what would a person with integrity do? To help you, I would like to make three comments related to this:
Plagiarism
Plagiarism means presenting some other person’s work as your own. This can mean using some other person’s words without acknowledging their source, or using some other person’s ideas. Copying another student’s work (homework or exam) is also plagiarism. Plagiarism will minimally result in an automatic zero for that submission.
Collusion
Collusion is unauthorized collaboration that produces work which is then presented as work completed independently by the student. Collusion includes participating in group discussions that develop solutions which everyone copies. Penalties for plagiarism and collusion include receiving a failing grade for that work.
Classroom Behavior
I ask that you respect the other people in the class. I recognize that your life circumstances may require you to receive cell phone calls during class. If this is the case, please set your cell phone on vibrate and discreetly leave the class to accept calls. During tests, if you walk out of the classroom, or consult/display your cell phone, I will assume you are done with the test and collect your grading sheet.
Avatar Names
During the first week of class I will ask you for your avatar name. This is the name that will appear on the Experience Point Google Spreadsheet that will be viewable by everyone in the class. If you wish to remain anonymous, don’t share your avatar name with anyone. To further protect the anonymity of those who wish to remain anonymous, the spreadsheet may also be populated by fictitious avatar names.