cs419 Data Mining
The course covers basic data mining techniques including those used for collective intelligence applications. We will experiment with these methods using Python.
- Collaborative filtering
- Item-based filtering
- Nearest-neighbor algorithms
- Distance methods (Manhattan, Euclidean, Minkowski)
- Evaluation of machine learning techniques
- Naïve Bayes classification methods
- Probability density functions
- Naïve Bayes and unstructured text
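As a rough illustration of the distance methods listed above, the sketch below computes Minkowski distance in plain Python. Manhattan and Euclidean distance are the special cases r = 1 and r = 2. The function name `minkowski` and the sample rating vectors are made up for illustration and are not taken from the course materials.

```python
def minkowski(a, b, r):
    """Minkowski distance of order r between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Sum the r-th powers of the absolute differences, then take the r-th root.
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)


# Hypothetical ratings that two users gave to the same three items.
user_a = [5.0, 1.0, 4.0]
user_b = [2.0, 5.0, 4.0]

manhattan = minkowski(user_a, user_b, 1)  # r = 1: sum of absolute differences
euclidean = minkowski(user_a, user_b, 2)  # r = 2: straight-line distance
```

A smaller distance means the two users rated the items more similarly, which is the idea a nearest-neighbor recommender builds on.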
The course takes an active, hands-on approach to learning. Students will spend much of the class time exploring, experimenting, and evaluating code using their own laptops. Class time is divided between short lectures, individual experimentation with programming, working on code with a partner, team projects, and quizzes. During the first week all students will be assigned to permanent teams of around 5 people.
Every week you will read one chapter of the textbook. I will not regurgitate that material in a lecture. I will assume you have read and understood it. There will be ‘readiness assessment tests’ on textbook material not covered in class. Class time will be spent practicing what we have learned. The emphasis is not on theory but on learning development skills that can be used in the workplace.
I am assuming that nearly everyone has a laptop. We will be working with laptops during a large percentage of our class. It doesn’t matter if your laptop runs Microsoft Windows, is a Mac, or an Ubuntu machine. Usually when I write this in a syllabus I add “It doesn’t matter if it is 5 years old. It also doesn’t matter how powerful it is–even a basic netbook will work.” In the case of this class, it does matter. Doing data mining even on small datasets is processor intensive.
The textbook is A Programmer’s Guide to Data Mining: The Ancient Art of the Numerati, which is a free online book.
Grading is based on a method developed by Professor Lee Sheldon at Indiana University. It is based on obtaining experience points (XP). The number of XP determines what level you are at. You start the class at Level Zero and with 0 XP. The level you obtain at the end of the semester determines your final grade. Here is the chart:
If you are at Level Four or lower when mid-semester reports are due, I will report your work as unsatisfactory.
All activities are optional. The total number of XP exceeds 2400. You only need 2150 for an A.
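The arithmetic behind that claim can be checked with a short sketch. It assumes the XP figure in each category heading of this syllabus is the maximum available in that category; the dictionary name `category_xp` is just an illustrative choice.

```python
# (low, high) XP available per category, copied from the category headings.
# Community Participation is listed as a range, so both ends are kept.
category_xp = {
    "Readiness Assessment Tests": (350, 350),
    "Python Notebooks": (500, 500),
    "Data Challenges": (500, 500),
    "Pencil exercises": (500, 500),
    "Team Projects": (500, 500),
    "Team Participation": (100, 100),
    "Community Participation": (50, 100),
}

min_total = sum(low for low, _ in category_xp.values())
max_total = sum(high for _, high in category_xp.values())

xp_for_a = 2150
# XP you could skip and still reach an A, using the lower total.
slack = min_total - xp_for_a
```

Under that assumption, both totals are above 2400, so there is room to skip some activities and still reach an A.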
Readiness Assessment Tests – 350XP
There will be approximately 5 short multiple-choice readiness assessment tests (RATs) given during the course. Each quiz will be taken individually; then, immediately after, the same test will be taken as a team. Each individual quiz is worth on average 35XP; each team quiz is also worth on average 35XP. You will have advance notice of these quizzes. In addition, there may be unannounced mini-quizzes. They may be individual only, or individual and team.
Python Notebooks – 500XP
Python notebooks are like an interactive textbook displayed in your browser. Each notebook explains a topic, for example the NumPy Python library, accompanied by a short set of programming questions. There will be approximately 10 notebooks. These are to be done individually, but you are free to ask other people in the class for help.
Data Challenges – 500XP
Following many of the notebooks are challenges that allow you to practice what you learned on a new dataset. There will be 6-8 of these challenges, each worth 50-100XP.
Pencil exercises – 500XP
To help you understand various machine learning methods, there will be a set of worksheets designed to be completed by hand. For example, there is a worksheet that asks you to compute entropy for a particular problem set. Some of these are done with a partner and others with your team.
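For checking a hand-computed entropy answer, something like the following sketch may be useful. It assumes the worksheet means Shannon entropy in bits over a set of class labels; the function name `entropy` and the sample labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Each class with proportion p contributes p * log2(1 / p).
    return sum((n / total) * log2(total / n) for n in counts.values())


# Hypothetical label sets: an even split and a set with a single class.
even_split = entropy(["yes", "yes", "no", "no"])
single_class = entropy(["yes", "yes", "yes", "yes"])
```

An even two-class split gives the maximum of 1 bit, while a set containing only one class gives 0.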
Team Projects – 500XP
Throughout the semester I will be offering different data mining/machine learning challenges (for example, this one on Caterpillar Tractor tubes). Solve the problem, gain XP. Simple as that!
Team Participation – 100XP
Each student will rate the helpfulness of all members of their team. Individual team participation scores will be the sum of the points they receive from other members of their team. Each team member distributes 100 points to other members of the team. The average team participation score will be 100 points. The rater must differentiate some of their ratings (they cannot assign the same rating to all members).
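The scoring rule can be sketched in a few lines. This is only an illustration under stated assumptions: every team member submits ratings, the names and point values are invented, and the helper `participation_scores` is hypothetical rather than the tool actually used for grading.

```python
def participation_scores(ratings):
    """ratings maps each rater to {teammate: points given}.

    Returns the total points each member received from teammates.
    Assumes every team member appears as a rater.
    """
    scores = {member: 0 for member in ratings}
    for rater, given in ratings.items():
        if rater in given:
            raise ValueError(f"{rater} may only rate other team members")
        if sum(given.values()) != 100:
            raise ValueError(f"{rater} must distribute exactly 100 points")
        # A rater may not give every teammate the same number of points.
        if len(given) > 1 and len(set(given.values())) == 1:
            raise ValueError(f"{rater} must differentiate some ratings")
        for teammate, points in given.items():
            scores[teammate] += points
    return scores


# Hypothetical three-person team.
ratings = {
    "ada": {"bo": 60, "cy": 40},
    "bo": {"ada": 55, "cy": 45},
    "cy": {"ada": 45, "bo": 55},
}
scores = participation_scores(ratings)
average = sum(scores.values()) / len(scores)
```

Because each member hands out exactly 100 points, the points received across the team always sum to 100 times the team size, which is why the average score is 100.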
Community Participation – 50-100XP
Announcements, discussions, and questions
For questions about any aspect of the course, including homework and labs, please use Piazza rather than email. If you want, you can tag questions as ‘private’.
You are responsible for checking your email and Piazza every 24 hours and the web page at least weekly.
Accommodations for Students with Special Needs
Any student with a documented disability may receive a special accommodation to complete any requirements of this course. If you have a disability or believe you have one, you may wish to self-identify. You may do so by providing documentation to the Office of Disability Services located in Room 203 of George Washington Hall (Phone: Voice 540-654-1266, Fax: 540-654-1163). Appropriate accommodations may then be provided for you. If you have a condition that may affect your ability to exit the premises in an emergency or that may cause an emergency during class, you are encouraged to discuss this in confidence with me and/or anyone at the Office of Disability Services. This office can also answer any questions you have about the Americans with Disabilities Act (ADA).
I assume you are an ethical student and a person with integrity. I expect that you will follow the university honor code (see http://rosemary.umw.edu/CSHonorCode.html). Please use common sense and ask yourself what a person with integrity would do. To help you, I would like to make three comments related to this:
Plagiarism means presenting some other person’s work as your own. This can mean using some other person’s words without acknowledging their source, or using some other person’s ideas. Copying another student’s work (homework or exam) is also plagiarism. Plagiarism will minimally result in an automatic zero for that submission.
Collusion is unauthorized collaboration that produces work which is then presented as work completed independently by the student. Collusion includes participating in group discussions that develop solutions which everyone copies. Penalties for plagiarism and collusion include receiving a failing grade for that work.
I ask that you respect the other people in the class. I recognize that your life circumstances may require you to receive cell phone calls during class. If this is the case, please set your cell phone on vibrate and discreetly leave the class to accept calls. During tests, if you walk out of the classroom, or consult/display your cell phone, I will assume you are done with the test and collect your grading sheet.
During the first week of class I will ask you for your avatar name. This is the name that will appear on the Experience Point Google Spreadsheet that will be viewable by everyone in the class. If you wish to remain anonymous, don’t share your avatar name with anyone. To further protect the anonymity of those who wish to remain anonymous, the spreadsheet may also be populated by fictitious avatar names.