March 2016 Items of Interest

March Meetups:

Machine Learning Working Group: Saturday, 05 March

3:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is a hands-on working group for those interested in working with the underlying algorithms and code that will be discussed at the Tuesday night Tidewater Analytics meeting.

Tidewater Analytics: Tuesday, 08 March

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is the third month of our machine learning track. The topic this month is regularization and tree pruning, and an introduction to ensemble methods. This will be more of an educational presentation of the topic. Those interested in hands-on work with the algorithms and code should attend the working group meeting on Saturday, 05 March.

757 Python User Group: Thursday, 10 March

7:00 pm at The Hatch, 111 Granby, downtown Norfolk

This is the kickoff meeting of the newly reconstituted 757 Python User Group, thanks to Jesse Wright, a CS grad student at ODU who stepped up as organizer last month. This particular meeting is a “Meet-and-Greet”: come meet fellow Pythonistas and put in your $0.02 worth on what the direction of the group ought to be.

757 R User’s Group: Tuesday, 15 March

7:00 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

Steve Mortimer, a Data Scientist at Dominion Enterprises, will discuss APIs from both a consumer’s point of view and a provider’s point of view (i.e., how to write an API for your R model).

Try.Py – Learn Python: Wednesday, 16 March

7:00 pm at The Hatch, 111 Granby, downtown Norfolk

Jay Gendron, a Data Scientist at Booz-Allen, started this particular Python group for beginners to learn the very basics of programming in the Python environment.

Tidewater Big Data Enthusiasts: Tuesday, 22 March

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

Dr. Chuck Cartledge, adjunct professor in CS at ODU, will lead this month’s discussion, which will center around the “variety” aspect of Big Data.

Office Hours: Saturday, 26 March

3:30 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

This is a monthly hands-on working group for those new to R programming. It focuses on getting started and the basics of R.

MOOCs and other educational venues:

The MOOC and online learning space has gotten entirely too big to track on a monthly, case-by-case basis. I am going to switch to listing things that catch my eye.

Harvard University Statistics 110 – Probability

This online course is by Dr. Joe Blitzstein, a highly-regarded statistician and author of the well-reviewed “Introduction to Probability” (which I believe goes hand-in-hand with these videos). I’ve seen these videos referenced a number of times around the Internet (all very positively), so I’m going to go out on a limb here and say that if you are interested in learning probability, this is probably a good path.

Books:

Dr. Patrick Kilduff, a local fisheries consultant, pointed this out to me…

The book, “From Linear Models to Machine Learning: Regression and Classification, with Examples in R,” is actually not finished yet. But the author, Dr. Norm Matloff, who writes the “Mad (Data) Scientist” blog, is 50% finished and has it posted online for review and comment.

So this is a chance not only to learn cool stuff, but to contribute to its development and make it cooler still.

Thanks Patrick!

Miscellaneous:

New Release of RStudio

The newest release of RStudio v0.99.878 is on the street with lots of cool new and upgraded features.

To ggplot2…Or Not To ggplot2

This is an interesting and fun read.

In case you don’t know, ggplot2 is a graphics package in R that was developed by data science rock star Hadley Wickham. It is based on Leland Wilkinson’s book, “The Grammar of Graphics,” and it has attained near-Biblical stature among statisticians and data scientists who use R.

In early February, Dr. Jeff Leek, a well-known data scientist at Johns Hopkins University and key figure in the Coursera Data Science specializations, wrote a post entitled “Why I Don’t Use ggplot2” on his blog, Simply Statistics. And he makes a lot of good points.

A week or so later, however, Dr. David Robinson — a well-known blogger and data scientist at Stack Overflow — responded with a blog post entitled, “Why I Use ggplot2” on his blog, Variance Explained. And he, too, makes a lot of good points.

Between the two posts, and various commentaries around the Internet, there is a lot of food for thought on the question of which package you should use for graphics in R.

Kaggle vs. the Real World

This is a fascinating blog post.

Will McGinnis entered a Kaggle competition…spent about 30 minutes on his first submission…and came in 1113th place out of 1762 entries.

But the quality of his submission, as measured by the standard area under the ROC curve (AUC), was 0.96290, compared to the winner’s 0.97024. In other words, the winning individual or team probably spent significantly more time and only achieved an improvement in AUC of about 0.00734.

And his point is that in real-world practical terms, sometimes getting to the absolute best is not the most efficient or effective use of time. Here it is:
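For anyone who wants to check the arithmetic, here is a minimal Python sketch that uses only the rank and score figures quoted above. The variable names are mine, purely for illustration, and nothing here is pulled from Kaggle or from the original post.

```python
# Figures as quoted in this post; variable names are illustrative only.
rank, entries = 1113, 1762
auc_quick, auc_winner = 0.96290, 0.97024

# Absolute gap in AUC between the winner and the 30-minute submission.
auc_gap = auc_winner - auc_quick

# The same gap expressed as a fraction of the winning score.
relative_gap = auc_gap / auc_winner

# Share of entries that finished below the 30-minute submission.
share_ranked_lower = (entries - rank) / entries

print(f"AUC gap: {auc_gap:.5f}")
print(f"Gap relative to the winning score: {relative_gap:.2%}")
print(f"Entries ranked lower: {share_ranked_lower:.1%}")
```

The gap works out to well under one percent of the winning score, which is the whole point: most of the leaderboard is separated by very thin margins.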

Over Optimizing: A Story About Kaggle

Data Scientists Do Arithmetic

Some years ago I got interested in software agents. In the very, very early stages of that interest, I happened to find myself at the Naval Postgraduate School in Monterey, CA for a seminar on something that was loosely related. Although it was only loosely related, a certain faculty member who was known as an expert on software agents spoke a bit. Afterwards, I tracked him down and asked him, very naively, how software agents worked.

What he replied with was interesting, and I’ve never forgotten it.

He said that fundamentally, everything a computer does is arithmetic. Just adding and subtracting numbers.

The conversation went into a bit more depth than that, but his point was that it’s not magic. You just need to keep peeling back the layers.

I was reminded of that encounter when I read this Signal vs Noise blog article, “Data Scientists Mostly Just Do Arithmetic, And That’s A Good Thing.” It goes hand-in-hand with the article about Kaggle and the real world, and serves as a reminder that despite all the hoopla about Extreme Gradient Boosting algorithms and Neural Networks and such, most business problems are much, much more mundane than that.

Storing Data in DNA

Every once in a while there’s an article that is so cool and weird that it’s just worth reading.

There is such an article in a recent edition of The New Scientist about a genetic researcher — Karin Ljubic Fister — who got frustrated with the limitations of computer storage, and figured out how to store binary data in plant DNA. 

There is a blog post — Landscapes of Data Infection — about the article, but it’s really worth checking out the original in The New Scientist: Interview with Karin Ljubic Fister.

Analytics Blog of the Month: Understanding Bayes

This is a really good series of posts that attempt to provide meaningful but bite-size chunks of Bayesian inference and analysis to the masses. The author has finished five of the 16 posts he proposes to make, and he also provides links to other really good resources.

If you are planning on trying to learn Bayesian inference and analysis on your own, this is among the best places to start that I’ve come across.

 
