January 19, 2013
I am a fan of MOOCs (Massive Open Online Courses). These courses give people everywhere (well, at least those with Internet access) the chance to learn from the top people in a field. For example, Peter Norvig, head of research at Google and the author of the top textbook on artificial intelligence, and Sebastian Thrun, a research professor at Stanford who led the development of a robotic vehicle that won the DARPA challenge race to drive a 150-mile mountainous course, taught a free online course on artificial intelligence. 160,000 people in 190 countries enrolled in the course. This is a phenomenal shift in education. Students no longer need to be accepted to elite universities to have access to the highest quality instruction. I’ve used MOOCs in several of my courses, ranging from introductory computer science courses to upper-level courses.
November 24, 2012
My book, A Programmer’s Guide to Data Mining: The Ancient Art of the Numerati, is to be translated into Chinese and published by Posts and Telecommunications Press, the largest publisher of computer books in China. This Chinese version will be available in Mainland China, Taiwan, Hong Kong, and Macau. I am super excited about this opportunity to help more people learn about data mining. As always, the English version of the book is available for free at guidetodatamining.com. Also forthcoming is an English paperback edition for under $20. Now I just need to finish the revisions.
July 23, 2012
I am a Buddhist practitioner. On 15 July I presented the Sunday talk ‘My poodle and creating an enlightened society’ at the Religious Science church Center for Spiritual Living in Las Cruces, New Mexico. The talk was about bodhisattvas: beings who work tirelessly for the benefit of others.
April 18, 2012
Jeanette Gundel presented a talk, “Underspecification of cognitive status in reference production: the grammar-pragmatics interface,” at the Workshop on Bridging the Gap between Computational, Empirical, and Theoretical Approaches to Reference at the Annual Cognitive Science Meeting in Boston. The talk was about our work on the Givenness Hierarchy.
April 4, 2012
There are two common indices for scientific productivity: the h-index and the g-index. I have gone up in both these ratings. My h-index is now 13. This means that I have 13 articles, each of which has been cited at least 13 times. The Wikipedia entry for the h-index says “a value for h of about 12 might be typical for advancement to tenure (associate professor) at major research universities.” To put that number in perspective, Sebastian Thrun, a robotics professor at Stanford and developer of the Google Self-Driving Car, has an h-index of 95. So while I am very happy with my 13, it isn’t that great in the scheme of things. My g-index is now 39: my 39 most-cited publications have together received at least 39 squared (1,521) citations. My Erdős number is 3.
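For the curious, here is a minimal sketch of how the two indices can be computed from a list of per-paper citation counts. The function names and the sample counts are made up for illustration (they are not my actual citation data), and the g-index here is capped at the number of papers, which is one common convention.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts only get smaller from here on
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count  # running sum of the top `rank` papers
        if total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts for five papers.
sample = [10, 8, 5, 4, 3]
print("h-index:", h_index(sample))
print("g-index:", g_index(sample))
```

Because the g-index looks at the combined citations of the top papers, a few heavily cited papers can pull it well above the h-index, which is why my 39 is so much larger than my 13.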
January 5, 2012
If you are a student looking for a cool individual study project or are just interested in a project for its own sake, you might consider updating an existing resource, Cluster by Night. Cluster by Night (CbN) is a live CD approach to setting up an HPC (High Performance Computing) cluster for MPI work. (MPI is a programming library that allows you to write programs for computing clusters.) We’ve used CbN for the last several years in our operating systems class. What distinguishes CbN from other approaches (for example, the popular PelicanHPC) is that it can work with an existing network. With other approaches the master node on the cluster hands out IP addresses; with CbN the cluster nodes receive their IP addresses from the existing DHCP server. I think Cluster by Night is an awesome resource.
April 6, 2011
Jeanette Gundel, Nancy Hedberg, and I finally got our paper, “Underspecification of Cognitive Status in Reference Production: Some Empirical Predictions,” accepted to Topics in Cognitive Science, a journal of the Cognitive Science Society. It will appear in the special issue on “Production of Referring Expressions: Bridging the Gap between Computational and Empirical Approaches to Reference.” Within the Givenness Hierarchy framework we outlined in our 1993 paper, lexical items included in referring forms are assumed to conventionally encode two kinds of information: conceptual information about the speaker’s intended referent and procedural information about the assumed cognitive status of that referent in the mind of the addressee. In the current paper we explore the role of underspecification of cognitive status in reference processing. We show how this framework accounts for a number of experimental results in the literature.
March 10, 2011
I just presented a half-day session on data and text mining at the Digital Jumpstart Workshop at the University of Kansas. I am grateful to the co-directors of the Institute for Digital Research in the Humanities for inviting me to this event, which was open to KU faculty, staff, and graduate students. Links to the resources I covered at the workshop are available at Resources for the Digital Jumpstart Workshop.
March 2, 2011
I just presented an invited paper “Don’t throw the analysis out with the bath water: Lessons learned from Modern Standard Arabic geographical classification” at the University of Kansas. The talk was sponsored by the departments of Linguistics and Slavic Languages. The abstract is as follows:
In corpus linguistics we throw out information. For example, in collecting corpora we necessarily omit some information about the extralinguistic context and only record that which we consider relevant for the purpose of our current research. In the analysis stage, we often remove data without thinking. One clear example of this is the routine practice of removing frequent words (commonly referred to as ‘stop words’) in a pre-processing step before analysis. In this talk I describe my work in Modern Standard Arabic geographical classification to illustrate the importance of being more mindful when we make these decisions about what to keep and what to discard. For example, I will show that it is possible to geographically classify text solely using words that some researchers have described as being fluff, superfluous, and non-significant. I will also describe how the paucity of metadata in commonly available Arabic corpora hampers research such as this.
February 27, 2011
I just attended, as an invited participant, the Maryland Institute for Technology in the Humanities’ API Workshop, which was held on February 25th and 26th. The workshop alternated between presentations, lightning talks, and what the organizers called ‘unconferencing’. The highlights for me were the talks given by Mano Marks on Google’s Maps API, Google Fusion Tables, and Google Refine. I’ve spent more time than I care to remember cleaning up language data. For example, I spent weeks cleaning up a Guarani lexicon. Google Refine is a tool that helps automate that process. If I had used it for the Guarani lexicon I would have been done in a day! Google Fusion Tables is an amazingly easy way to create map mashups using just a spreadsheet. The maps you create can be embedded on your web page. Even though I am gushing about Mano Marks’ talks, the other presentations were equally valuable.