January 2016 Items of Interest

Happy New Year!

If last year was a good year for the Hampton Roads data science and analytics community — and it was — 2016 is shaping up to be a great year.

We began 2015 with one nebulously defined Meetup — Tidewater Analytics — and over the course of the year several things happened.

First, Tidewater Analytics turned out to be more successful than I had ever imagined it would be. Since January 2015 we’ve had about twenty presentations by different people, either on what they did with analytics in their jobs and lives or on some analytics topic of interest. All the presentations were well attended, and an analytics community has slowly taken root. I could not be happier about that.

Second, two additional, related Meetups were established: the 757 R User Group, and Tidewater Big Data Enthusiasts. Both of these meetups are relatively new, but both were met with enthusiasm and seem to be developing nicely.

The one sour note to 2015 was that the 757 Python User Group is on the verge of going away. Brian McGill, the former organizer, announced last autumn that he was stepping down, and no one appears to have stepped up to take it over. I just got an e-mail from the Meetup.com folks, and they said that if no one takes over the organizer role by 16 January, it will be shut down. Hopefully someone will pick up the baton and keep it going (after all, Python is one of the most popular languages for data science).

Looking ahead to 2016, I see several new things happening.

To begin with, Tidewater Analytics has found a focus, and that focus is machine learning in the context of data science competitions. Specifically, our monthly meetings are going to be de facto self-taught machine learning modules, and we are going to aim at competing in a Kaggle — or similar — competition by the end of the year.

Also, Cathy Green — a colleague who has my absolute highest regard when it comes to all things data — is organizing a Meetup to address the myriad opportunities in Open Data. This will include commercial opportunities as well as the use of Open Data in research settings and civic endeavors. I anticipate this will kick off in April, and will meet on the first Tuesday of the month. This is a hugely exciting development, and I think that when combined with the other data-related Meetups, it provides Hampton Roads with across-the-board resources for learning and promoting data science and analytics.

January Meetups:

Machine Learning Working Group: Saturday, 09 January

3:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is a hands-on working group for those interested in working with the underlying algorithms and code that will be discussed at the Tuesday night Tidewater Analytics meeting.

Tidewater Analytics: Tuesday, 12 January

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is the kickoff meeting for our year of machine learning. The topic this month is regression trees. This will be more of an educational presentation of the topic. Those interested in hands-on work with the algorithms and code should attend the working group meeting on Saturday, 09 January.

757 R User Group: Tuesday, 19 January

6:30 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

Dr. Patrick Kilduff, a fisheries consultant, will give an overview of how to develop R packages.

Tidewater Big Data Enthusiasts: Tuesday, 26 January

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

Dr. Chuck Cartledge will lead a presentation on “Tools and Techniques to Visualize Big Data,” based on ideas from Nathan Yau’s book, “Data Points: Visualization That Means Something.”

Office Hours: Saturday, 30 January

4:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is a monthly hands-on working group for those new to R programming. It focuses on getting started and the basics of R.

MOOCs and other educational venues:

In view of the machine learning focus in Tidewater Analytics this year, I’m going to mention four online learning resources for machine learning. Not all of them are currently running, but there are resources from all of them that are accessible.

Coursera/Stanford (Hastie/Tibshirani): Statistical Learning

This is the online course that goes with the book we’ll be using as a primary resource in Tidewater Analytics. It is a very good course, and well worth taking. Unfortunately, although it has been offered more than once — and will probably be offered again — there are no active sessions just now. In the meantime, the videos from the course are available in several places around the Internet, including this R-Bloggers site.

Coursera/Stanford (Andrew Ng): Machine Learning

Andrew Ng is one of the names in machine learning, and many people swear by this course. I personally find it too oriented toward neural networks and toward using Octave as the analysis package. Still, it has value. It looks to me as though they offer the course on a regular basis, with one starting again in late January.

Coursera/University of Washington (Pedro Domingos): Machine Learning

Like Trevor Hastie, Robert Tibshirani, and Andrew Ng, Pedro Domingos is one of the names in machine learning. Although the course itself is not active just now, the videos are available for download. I found the videos for this course to be a little bit wanting as they are pretty much just the camera focused on him giving lectures. Not that you can’t learn from that, but I would never consider this as a primary source of learning. This is more like augmentation material.

Penn State University: Applied Data Mining and Statistical Learning

These are really just the course notes for PSU’s Stat 897D course, but that course uses “Introduction to Statistical Learning” as the text, and the course notes augment the book pretty well.

Books:

Introduction to Statistical Learning

Since we are using this book as a primary resource for our Tidewater Analytics work this year, I thought it would be appropriate to say something about it.

First, two of the authors – Trevor Hastie and Robert Tibshirani – are two of the names in this field. They each have significant accomplishments under their belts going back decades, and they are the principal authors of a previous book, “Elements of Statistical Learning,” which is a canonical detailed technical treatment of the field.

My understanding is that because “Elements…” is so technical, they teamed up with Gareth James and Daniela Witten to write this somewhat less technical introduction that is more accessible to those just getting into it.

And this book is relatively accessible. It assumes you know some basic statistics, and it assumes you know some basic R. But beyond that it does not assume much on the part of the reader.

The things I like most about it are the treatments of the higher-level general concepts. By that I mean, they have pretty good discussions in the first two chapters on parametric vs. non-parametric approaches, the trade-off between model accuracy and model interpretability, assessing model accuracy, the bias-variance trade-off, and regression vs. classification.

The thing I really don’t like is that they sometimes gloss over details. For example, while discussing how to prune a regression tree in Chapter 8, they refer to using a certain parameter, alpha. But there is really no discussion of where to get alpha. This is frustrating, and it happens quite a bit throughout the book.
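
For readers with the same question, the cost-complexity criterion behind alpha can at least be made concrete. The sketch below is in Python rather than the book’s R, and everything in it is an assumption for illustration: the subtree sizes and error values are invented, and `CANDIDATE_SUBTREES` and `best_subtree` are hypothetical names. For a given alpha, it picks the subtree that minimizes the residual sum of squares plus alpha times the number of terminal nodes. In practice, alpha is typically chosen by cross-validation: try a range of values and keep the one with the lowest held-out error.

```python
# Hypothetical candidate subtrees obtained by pruning one large regression tree.
# Each entry is (number of terminal nodes, training residual sum of squares).
# These numbers are invented purely for illustration.
CANDIDATE_SUBTREES = [(1, 100.0), (2, 60.0), (4, 40.0), (8, 34.0)]


def best_subtree(alpha, subtrees=CANDIDATE_SUBTREES):
    """Return the (leaves, rss) pair that minimizes rss + alpha * leaves."""
    return min(subtrees, key=lambda tree: tree[1] + alpha * tree[0])


if __name__ == "__main__":
    # A larger alpha penalizes tree size more heavily, so smaller trees win.
    for alpha in (0.0, 2.0, 15.0, 50.0):
        leaves, rss = best_subtree(alpha)
        print(f"alpha={alpha}: {leaves} terminal nodes, RSS={rss}")
```

With alpha at zero the largest tree always wins, since nothing penalizes its size; as alpha grows, the selection moves toward smaller trees.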

Also, in the print version of the book the colors used in the graphics can be pretty challenging. For example, in Figure 2.9 they depict a true function, f, in black, and then estimates of f in green, blue, and orange. In the print version, it is extremely difficult to separate out the black, blue, and green curves. And this color combination is seen throughout the book. (I should add that in the PDF version there is no problem with this.)

Speaking of which, you can either buy the print version or download the PDF version free. I’m one of those people who likes to pay authors for their hard work, and I bought the book. But I also use the PDF version when I really need to drill into a graphic.

On balance this is a good book for getting your feet wet in the machine learning waters. It would be difficult to teach yourself anything using just this book, and I would recommend at a minimum finding the videos that go with it (they, too, are available for free on at least one site on the Internet). Better still, Hastie and Tibshirani offer a free course via Coursera every once in a while. This is good because you can take advantage of the discussion boards, the class wiki, and so on.

Miscellaneous:

A Very Short History of Data Science

I have avoided using the term “data science” in the past, because I found it to be a bit vague, overloaded, and pretentious. But that is starting to change. I’ve started to see real and valid definitions of it that align with work that I, and others I know, can credibly say we do from time to time.

This recent article in Forbes by Gil Press adds to my growing willingness to use the term. I was particularly happy to find the 1962 John Tukey article, “The Future of Data Analysis,” which lays the groundwork for having this “other” discipline that is not entirely statistics, but overlaps considerably while encompassing much more.

Both the Forbes article and the Tukey article are worth a read if you have a few minutes.

Analytics Blog of the Month: AnnMaria’s Blog

This is a pretty interesting blog, by a pretty interesting lady.

AnnMaria is AnnMaria De Mars. She’s a statistician and technology executive, and also Ronda Rousey’s mother (and Ronda Rousey, in case you don’t know, is the world-famous mixed martial arts fighter who got clobbered in an Ultimate Fighting Championship match with Holly Holm this past autumn). And AnnMaria is a bit of a martial artist herself, having won some judo acclaim over the years.

Anyway, AnnMaria is an interesting woman who writes interesting posts. Not all of her posts are about statistics or data science or analytics, but enough of them are that I have her on my Data Science feed.

One blog post that recently got my attention was Winners Know When to Quit. It opened with the following sentence:

That idea that winners never quit is complete and total bullshit.

Ya gotta love her.

If you’re looking for a new blog to check out, give hers a shot. She might not be your exact cup of tea, but I’m pretty sure you will find her to be thought-provoking and worth some occasional real consideration.

 
