March 2016 Items of Interest

March Meetups:

Machine Learning Working Group: Saturday, 05 March

3:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is a hands-on working group for those interested in working with the underlying algorithms and code that will be discussed at the Tuesday night Tidewater Analytics meeting.

Tidewater Analytics: Tuesday, 08 March

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is the third month of our machine learning track. The topics this month are regularization and tree pruning, plus an introduction to ensemble methods. This will be more of an educational presentation of the topic. Those interested in hands-on work with the algorithms and code should attend the working group meeting on Saturday, 05 March.

757 Python User Group: Thursday, 10 March

7:00 pm at The Hatch, 111 Granby, downtown Norfolk

This is the kickoff meeting of the newly reconstituted 757 Python User Group, thanks to Jesse Wright, a CS grad student at ODU who stepped up as organizer last month. The focus of this particular meeting is a “Meet-and-Greet” with fellow Pythonistas, and a chance to put in your $0.02 worth on what the direction of the group ought to be.

757 R User’s Group: Tuesday, 15 March

7:00 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

Steve Mortimer, a Data Scientist at Dominion Enterprises, will discuss APIs from both a consumer’s point of view and a provider’s point of view (i.e., how to write an API for your R model).

Try.Py – Learn Python: Wednesday, 16 March

7:00 pm at The Hatch, 111 Granby, downtown Norfolk

Jay Gendron, a Data Scientist at Booz Allen, started this particular Python group for beginners to learn the very basics of programming in the Python environment.

Tidewater Big Data Enthusiasts: Tuesday, 22 March

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

Dr. Chuck Cartledge, adjunct professor in CS at ODU, will lead this month’s discussion, which will center around the “variety” aspect of Big Data.

Office Hours: Saturday, 26 March

3:30 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

This is a monthly hands-on working group for those new to R programming. It focuses on getting started and the basics of R.

MOOCs and other educational venues:

The MOOC and online learning space has gotten entirely too big to track on a monthly, case-by-case basis. I am going to switch to listing things that catch my eye.

Harvard University Statistics 110 – Probability

This online course is by Dr. Joe Blitzstein, a highly-regarded statistician and author of the well-reviewed “Introduction to Probability” (which I believe goes hand-in-hand with these videos). I’ve seen these videos referenced a number of times around the Internet (all very positively), so I’m going to go out on a limb here and say that if you are interested in learning probability, this is probably a good path.

Books:

Dr. Patrick Kilduff, a local fisheries consultant, pointed this out to me…

The book, “From Linear Models to Machine Learning: Regression and Classification, with Examples in R”, is actually not finished yet. But the author, Dr. Norm Matloff, who writes the “Mad (Data) Scientist” blog, is 50% finished and has it posted online for review and comment.

So this is a chance not only to learn cool stuff, but to contribute to its development and make it cooler still.

Thanks Patrick!

Miscellaneous:

New Release of RStudio

The newest release of RStudio, v0.99.878, is on the street with lots of cool new and upgraded features.

To ggplot2…Or Not To ggplot2

This is an interesting and fun read.

In case you don’t know, ggplot2 is a graphics package in R that was developed by data science rock star Hadley Wickham. It is based on Leland Wilkinson’s book, “The Grammar of Graphics,” and it has attained near-Biblical stature among statisticians and data scientists who use R.

In early February, Dr. Jeff Leek, a well-known data scientist at Johns Hopkins University and key figure in the Coursera Data Science specializations, wrote a blog post entitled “Why I Don’t Use ggplot2” on his blog, Simply Statistics. And he makes a lot of good points.

A week or so later, however, Dr. David Robinson, a well-known blogger and data scientist at Stack Overflow, responded with a blog post entitled “Why I Use ggplot2” on his blog, Variance Explained. And he, too, makes a lot of good points.

Between the two posts, and various commentaries around the Internet, there is a lot of food for thought on the issue of which package you should use for graphics in R.

Kaggle vs. the Real World

This is a fascinating blog post.

Will McGinnis entered a Kaggle competition…spent about 30 minutes on his first submission…and came in 1113th place out of 1762 entries.

But the quality of his submission, as measured by the standard area under the ROC curve (AUC), was 0.96290, compared with the winner’s 0.97024. In other words, the winning individual or team probably spent significantly more time and only achieved a difference in AUC of about 0.00734.
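For anyone who wants to check the arithmetic, here is a minimal Python sketch. It uses only the two AUC scores quoted above; the variable names are just for illustration and do not come from the original post.

```python
# AUC scores quoted in the post (area under the ROC curve).
winner_auc = 0.97024   # the winning entry
quick_auc = 0.96290    # the 30-minute first submission

# Absolute gap between the two scores.
gap = winner_auc - quick_auc

# The same gap expressed relative to the winning score.
relative_gap = gap / winner_auc

print(f"Absolute AUC gap: {gap:.5f}")           # 0.00734
print(f"Relative to winner: {relative_gap:.2%}")  # 0.76%
```

Seen this way, the gap between a half-hour effort and first place is well under one percent of the winning score.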

And his point is that in real-world practical terms, sometimes getting to the absolute best is not the most efficient or effective use of time. Here it is:

Over Optimizing: A Story About Kaggle

Data Scientists Do Arithmetic

Some years ago I got interested in software agents. In the very, very early stages of that interest, I happened to find myself at the Naval Postgraduate School in Monterey, CA for a seminar on something that was loosely related. Although it was only loosely related, a certain faculty member who was known as an expert on software agents spoke a bit. Afterwards, I tracked him down and asked him, very naively, how software agents worked.

What he replied with was interesting, and I’ve never forgotten it.

He said that fundamentally, everything a computer does is arithmetic. Just adding and subtracting numbers.

The conversation went into a bit more depth than that, but his point was that it’s not magic. You just need to keep peeling back the layers.

I was reminded of that encounter when I read this Signal vs Noise blog article, “Data Scientists Mostly Just Do Arithmetic, And That’s A Good Thing”. It goes hand-in-hand with the article about Kaggle and the real world, and serves as a reminder that despite all the hoopla about Extreme Gradient Boosting algorithms and Neural Networks and such, most business problems are much, much more mundane than that.

Storing Data in DNA

Every once in a while there’s an article that is so cool and weird that it’s just worth reading.

There is such an article in a recent edition of The New Scientist about a genetic researcher — Karin Ljubic Fister — who got frustrated with the limitations of computer storage, and figured out how to store binary data in plant DNA. 

There is a blog post — Landscapes of Data Infection — about the article, but it’s really worth checking out the original in The New Scientist: Interview with Karin Ljubic Fister.
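To get a feel for how binary data could map onto DNA at all, here is a toy Python sketch. It is not the researcher’s actual scheme, which is not described here; it simply assumes a hypothetical mapping of two bits per nucleotide (00 to A, 01 to C, 10 to G, 11 to T), and the function names are made up for illustration.

```python
# Hypothetical two-bits-per-base mapping, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of nucleotide letters, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Reverse the mapping: nucleotide letters back to the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

A real scheme would also have to deal with biological constraints and error correction, none of which this sketch attempts.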

Analytics Blog of the Month: Understanding Bayes

This is a really good series of posts that attempt to provide meaningful but bite-size chunks of Bayesian inference and analysis to the masses. The author has finished five of the 16 posts he proposes to make, and he also provides links to other really good resources.

If you are planning on trying to learn Bayesian inference and analysis on your own, this is among the best places to start that I’ve come across.

 

February 2016 Items of Interest

February Meetups:

Machine Learning Working Group: Saturday, 06 February

3:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is a hands-on working group for those interested in working with the underlying algorithms and code that will be discussed at the Tuesday night Tidewater Analytics meeting.

Tidewater Analytics: Tuesday, 09 February

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is the second month of our machine learning track. The topic this month is classification trees. This will be more of an educational presentation of the topic. Those interested in hands-on work with the algorithms and code should attend the working group meeting on Saturday, 06 February.

757 R User’s Group: Tuesday, 16 February

6:30 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

“Open Mic Night”: an opportunity for everyone to share their favorite tips and tricks…ask questions about specific issues in R that have been vexing them…and discuss future topics.

Try.Py – Learn Python: Wednesday, 17 February

7:00 pm at The Hatch, 111 Granby, downtown Norfolk

Great news! Two new Python User Groups are starting up. Jay Gendron has started this one, Try.Py, for beginners to learn the very basics of programming in the Python environment. He’ll be using Google’s Python learning modules, and trying to get people up to the point where they are ready to move on to more intermediate stuff at the regular 757 Python User Group.

Speaking of which, the 757 PUG has been saved from oblivion by Jesse Wright, a computer science grad student at ODU. The first meeting will be in March, and it will be listed appropriately in next month’s blog post.

Tidewater Big Data Enthusiasts: Tuesday, 23 February

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

Dr. Ilyas Ustun will talk about his research into using smartphone technology to visualize your own trips and to detect vehicle starts and stops without GPS, and about how bits of data can benefit traffic engineers and transportation planners and help optimize traffic signals to decrease delays.

Office Hours: Saturday, 27 February

3:30 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

This is a monthly hands-on working group for those new to R programming. It focuses on getting started and the basics of R.

MOOCs and other educational venues:

Coursera

Coursera continues to dominate the MOOC space when it comes to data science and analytics. They are developing more and more specialization tracks as they go, so that business model must be working (or at least showing promise).

Here is a listing of all their current data analysis courses. There are 11 specialization tracks listed, and 89 individual courses:

Coursera Data Analysis Courses

Miscellaneous:

Machine Learning is Fun

This is a wonderful two-part article that I wish I had run across when I was first starting to get interested in machine learning. Yes, there is some math, and yes, some Python code, but all in all it’s pretty accessible without too much technical stuff.

Machine Learning is Fun – Part I

Machine Learning is Fun – Part II

Top 10 Data Mining Algorithms in Plain English

This is another article I wish I had run across when I first started this stuff.

Top 10 Data Mining Algorithms…

Top 100 R-Bloggers Posts of 2015

The title says it all. This is a great collection of the top 100 R-Bloggers posts from 2015, with an added 28 thrown in for good measure. Well worth slowly perusing.

Top 100 R-Bloggers Posts of 2015

R Package Round Up

Also from R-Bloggers, an interesting collection of R packages. This post lists the top 20 packages downloaded from CRAN in 2015, along with the author’s personal top five packages.

R Packages

How I became a Data Scientist

This is a 34-minute YouTube video by Owen Zhang, who went from being a software developer in a corporate IT environment to being a data scientist in a startup. This was presented at the 2015 Open Data Science Conference, and the focus is on the practical lessons he learned along with some interesting points he observed during his transition.

Analytics Blog of the Month: No Free Hunch

This is Kaggle’s blog, and as one might expect, it provides a wealth of interesting and practical information related to data science and predictive analytics. Of particular note this month was the announcement of the release of Kaggle Datasets.

January 2016 Items of Interest

Happy New Year!

If last year was a good year for the Hampton Roads data science and analytics community — and it was — 2016 is shaping up to be a great year.

We began 2015 with one nebulously defined Meetup — Tidewater Analytics — and over the course of the year several things happened.

First, Tidewater Analytics turned out to be more successful than I had ever imagined it would be. Since January 2015 we’ve had about twenty presentations by different people on either what they did with analytics in their jobs/lives, or about some analytics topic of interest. All the presentations were well attended, and an analytics community has slowly taken root. I could not be happier about that.

Second, two additional, related Meetups were established: the 757 R User Group, and Tidewater Big Data Enthusiasts. Both of these meetups are relatively new, but both were met with enthusiasm and seem to be developing nicely.

The one sour note to 2015 was that the 757 Python User Group is on the verge of going away. Brian McGill, the former organizer, announced last autumn that he was stepping down, and no one appears to have stepped up to take it over. I just got an e-mail from the Meetup.com folks, and they said that if no one takes over the organizer role by 16 January, it will be shut down. Hopefully someone will pick up the baton and keep it going (after all, Python is one of the most popular languages for data science).

Looking forward to 2016, now, I see several new things happening.

To begin with, Tidewater Analytics has found a focus, and that focus is machine learning in the context of data science competitions. Specifically, our monthly meetings are going to be de facto self-taught machine learning modules, and we are going to aim at competing in a Kaggle — or similar — competition by the end of the year.

Also, Cathy Green — a colleague who has my absolute highest regard when it comes to all things data — is organizing a Meetup to address the myriad opportunities in Open Data. This will include commercial opportunities as well as the use of Open Data in research settings and civic endeavors. I anticipate this will kick off in April, and will meet on the first Tuesday of the month. This is a hugely exciting development, and I think that when combined with the other data-related Meetups, it provides Hampton Roads with across-the-board resources for learning and promoting data science and analytics.

January Meetups:

Machine Learning Working Group: Saturday, 09 January

3:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is a hands-on working group for those interested in working with the underlying algorithms and code that will be discussed at the Tuesday night Tidewater Analytics meeting.

Tidewater Analytics: Tuesday, 12 January

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

This is the kickoff meeting for our year of machine learning. The topic this month is regression trees. This will be more of an educational presentation of the topic. Those interested in hands-on work with the algorithms and code should attend the working group meeting on Saturday, 09 January.

757 R User’s Group: Tuesday, 19 January

6:30 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

Dr. Patrick Kilduff, a fisheries consultant, will give an overview on how to develop R packages.

Tidewater Big Data Enthusiasts: Tuesday, 26 January

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

Dr. Chuck Cartledge will lead a presentation on “Tools and Techniques to Visualize Big Data,” based on ideas from Nathan Yau’s book, “Data Points: Visualization That Means Something”.

Office Hours: Saturday, 30 January

4:00 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

This is a monthly hands-on working group for those new to R programming. It focuses on getting started and the basics of R.

MOOCs and other educational venues:

In view of the machine learning focus in Tidewater Analytics this year, I’m going to mention four online learning resources for machine learning. Not all of them are currently running, but there are resources from all of them that are accessible.

Coursera/Stanford (Hastie/Tibshirani): Statistical Learning

This is the online course that goes with the book we’ll be using as a primary resource in Tidewater Analytics. It is a very good course, and well worth taking. Unfortunately, although it has been offered more than once — and will probably be offered again — there are no active sessions just now. In the meantime, the videos from the course are available in several places around the Internet, including this R-Bloggers site.

Coursera/Stanford (Andrew Ng): Machine Learning

Andrew Ng is one of the names in machine learning, and many people swear by this course. I personally find it to be too oriented towards neural networks and towards using Octave as the analysis package. But still, it has value. It looks to me as though they offer the course on a regular basis, with one starting again in late January.

Coursera/University of Washington (Pedro Domingos): Machine Learning

Like Trevor Hastie, Robert Tibshirani, and Andrew Ng, Pedro Domingos is one of the names in machine learning. Although the course itself is not active just now, the videos are available for download. I found the videos for this course to be a little bit wanting as they are pretty much just the camera focused on him giving lectures. Not that you can’t learn from that, but I would never consider this as a primary source of learning. This is more like augmentation material.

Penn State University: Applied Data Mining and Statistical Learning

These are really just the course notes for PSU’s Stat 897D course, but the fact is that that course uses “Introduction to Statistical Learning” as the text, and the course notes augment the book pretty well.

Books:

Introduction to Statistical Learning

Since we are using this book as a primary resource for our Tidewater Analytics work this year, I thought it would be appropriate to say something about it.

First, two of the authors – Trevor Hastie and Robert Tibshirani – are two of the names in this field. They each have significant accomplishments under their belts going back decades, and they are the principal authors of a previous book, “Elements of Statistical Learning,” which is a canonical detailed technical treatment of the field.

My understanding is that because “Elements…” is so technical, they teamed up with Gareth James and Daniela Witten to write this somewhat less technical introduction that is more accessible to those just getting into it.

And this book is relatively accessible. It assumes you know some basic statistics, and it assumes you know some basic R. But beyond that it does not assume much on the part of the reader.

The things I like most about it are the treatments of the higher-level general concepts. By that I mean, they have pretty good discussions in the first two chapters on parametric vs. non-parametric approaches, the trade-off between model accuracy and model interpretability, assessing model accuracy, the bias-variance trade-off, and regression vs classification.

The thing I really don’t like is that they sometimes gloss over details. For example, while discussing how to prune a regression tree in Chapter 8, they refer to using a certain parameter, alpha. But there is really no discussion of where to get alpha. This is frustrating, and it happens quite a bit throughout the book.
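As a rough illustration of what alpha does in cost-complexity pruning: each candidate subtree is scored as its residual sum of squares (RSS) plus alpha times its number of terminal nodes, and the subtree with the lowest score is kept. In practice alpha is usually treated as a tuning parameter and picked by cross-validation. The Python sketch below only shows the scoring step; the subtree sizes and RSS values are made-up numbers, and the function names are hypothetical.

```python
# Hypothetical candidate subtrees obtained by pruning one large regression tree.
# Each entry is (number of terminal nodes, training RSS); values are invented.
SUBTREES = [(1, 100.0), (2, 60.0), (4, 40.0), (8, 34.0)]


def penalized_cost(n_leaves, rss, alpha):
    """Cost-complexity criterion: RSS plus alpha times the tree size."""
    return rss + alpha * n_leaves


def best_subtree(subtrees, alpha):
    """Return the (n_leaves, rss) pair with the smallest penalized cost."""
    return min(subtrees, key=lambda t: penalized_cost(t[0], t[1], alpha))


# Larger alpha values push the choice toward smaller trees.
for alpha in (0.0, 5.0, 50.0):
    print(alpha, best_subtree(SUBTREES, alpha))
# 0.0 (8, 34.0)
# 5.0 (4, 40.0)
# 50.0 (1, 100.0)
```

With alpha at zero the biggest tree always wins on training error; as alpha grows, extra leaves have to earn their keep, which is why the value is normally chosen on held-out data rather than by eye.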

Also, in the print version of the book the colors used in the graphics can be pretty challenging. For example, in Figure 2.9 they depict a true function, f, in black, and then estimates of f in green, blue, and orange. In the print version, it is extremely difficult to separate out the black, blue, and green curves. And this color combination is seen throughout the book. (I should add that in the PDF version there is no problem with this.)

Speaking of which, you can either buy the print version or download the PDF version for free. I’m one of those people who likes to pay authors for their hard work, and I bought the book. But I also use the PDF version when I really need to drill into a graphic.

On balance this is a good book for getting your feet wet in the machine learning waters. It would be difficult to teach yourself anything using just this book, and I would recommend at a minimum finding the videos that go with it (they, too, are available for free on at least one site on the Internet). Better still, Hastie and Tibshirani offer a free course via Coursera every once in a while. This is good because you can take advantage of the discussion boards, the class wiki, and so on.

Miscellaneous:

A Very Short History of Data Science

I have avoided using the term “data science” in the past, because I found it to be a bit vague, over-loaded, and pretentious. But that is starting to change. I’ve started to see real and valid definitions of it that align with what I can credibly say I and others I know do from time to time.

This recent article in Forbes by Gil Press adds to my growing willingness to use the term. I was particularly happy to find the 1962 John Tukey article, “The Future of Data Analysis,” which lays the groundwork for having this “other” discipline that is not entirely statistics, but overlaps considerably while encompassing much more.

Both the Forbes article and the Tukey article are worth a read if you have a few minutes.

Analytics Blog of the Month: AnnMaria’s Blog

This is a pretty interesting blog, by a pretty interesting lady.

AnnMaria is AnnMaria De Mars. She’s a statistician and technology executive, and also Ronda Rousey’s mother (and Ronda Rousey, in case you don’t know, is the world-famous mixed martial arts fighter who got clobbered in an Ultimate Fighting Championship match with Holly Holm this past autumn). And AnnMaria is a bit of a martial artist herself, having won some judo acclaim over the years.

Anyway, AnnMaria is an interesting woman who writes interesting posts. Not all of her posts are about statistics or data science or analytics, but enough of them are that I have her on my Data Science feed.

One blog post that recently got my attention was Winners Know When to Quit. It opened with the following sentence:

That idea that winners never quit is complete and total bullshit.

Ya gotta love her.

If you’re looking for a new blog to check out, give hers a shot. She might not be your exact cup of tea, but I’m pretty sure you will find her to be thought-provoking and worth some occasional real consideration.

 

December 2015 Items of Interest

December Meetups:

Tidewater Analytics: Tuesday, 08 December

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

The feature presentation will be by Cathy Green, an independent Data Architect and Business Intelligence consultant. She will give an overview of the analytic features native to Excel 2013.

Following Cathy’s presentation, there will be a brief discussion related to the 2016 Kaggle Machine Learning group’s activities.

757 R User’s Group: Tuesday, 15 December

6:30 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

Keith Brown, a risk analyst for USAA, will discuss Hadley Wickham’s package ggplot2 for graphics.

Office Hours: No meeting this month.

Tidewater Big Data Enthusiasts: No meeting this month.

MOOCs and other educational venues:

TED Artificial Intelligence Playlist

A very interesting collection of six TED talks focused on artificial intelligence. All under 20 minutes long, and all accessible to lay people.

Books:

Recommended R Readings

This is a fantastic list of R books, ranging from “R for Dummies” on up to the most complex and sophisticated use cases. Well worth keeping on hand.

Miscellaneous:

The Hardest Parts of Data Science

This blog post is really spot on. Fitting models has become almost automated in some situations, but even when not, modern software really makes it pretty easy.

What’s really hard — and important — is defining the problem to begin with, and then measuring the solution.

Yanir Seroussi provides some excellent observations and thoughts on the matter in this blog post.

The Identity of Statistics in Data Science

So what is Data Science?

I avoid the term “Data Science” as much as I can. I think it’s vague and a bit pretentious. But a fellow named Tommy Jones, writing in AMSTATNEWS (the membership magazine of the American Statistical Association), came up with a definition that I think is pretty good. It’s in an article called “The Identity of Statistics in Data Science,” and the context is figuring out where statistics and data science intersect. (This has been a huge topic of discussion and debate in the statistics community, the computer science community, and the nebulous data science community.)

He basically considers data science “Supply Chain Management” for data products. It begins with a real-world problem and ends with a report. And in between there might be data wrangling, model development and validation, database considerations, visualizations, and so on. It’s a good article, and worth a read if you are thinking about including the term “data science” in your working vocabulary.

Analytics Blog of the Month:

I cannot say enough good things about David Robinson’s blog, Variance Explained. This guy takes difficult but important issues, and explains them in ways that, although they may take some work, really clarify the issues. His posting is a little sporadic, but the blog is well worth following, because when he does post, it’s almost always a gem. A few of my favorites include:

K-means Clustering is not a free lunch

Understanding the beta distribution

Cleaning and visualizing genomic data

If you’re going to follow one data analytics blog, this one would be my recommendation.

November 2015 Items of Interest

November Meetups:

Tidewater Analytics: Tuesday, 10 November

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

The feature presentation will be by Steve Miller, an Operations Researcher for Newport News shipyard. He will give an overview of simulation and its different modes, along with ties to analytics.

Following Steve’s presentation, the Kaggle Machine Learning group will meet and review the Titanic Survivor data set and Data Camp tutorial.

757 R User’s Group: Tuesday, 17 November

6:30 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

Keith Brown, a risk analyst for USAA, will discuss Hadley Wickham’s package ggplot2 for graphics.

Office Hours: Saturday, 21 November

4:00 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

An informal gathering to review basic R features and functions for those who are relatively new to R. The long-term orientation will be towards machine learning and the Kaggle competition being planned by Tidewater Analytics.

This month the topics will be writing functions in R, and finding/installing R packages for using R functions that other people have written.

Tidewater Big Data Enthusiasts: Tuesday, 24 November

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, Norfolk.

This month’s topic will be Medicare Payments to the Tidewater Area. It will be a hands-on exercise with Hadoop and Hive to examine 11 million records and show the financial impact of selected procedures in various ZIP codes in the Tidewater area.

MOOCs and other educational venues:

Coursera:

As mentioned last month, Coursera now has a number of interesting specialization tracks under the general heading of Data Science. They include the following:

Data Science, with Johns Hopkins. This was their flagship course, and a lot of people locally, nationally, and globally got their feet wet in data science through this excellent set of one-month classes.

Big Data, with UC San Diego

Executive Data Science, with Johns Hopkins University

Business Analytics, with the University of Pennsylvania

Data Mining, with University of Illinois

Genomic Data Science, with Johns Hopkins University

Check out the web sites for start dates.

EdX:

EdX has a structure similar to Coursera’s in which they team up with top-notch universities on various topics. They are really upping their game, and some of their relevant offerings are the following:

Data Science and Analytics in Context, with Columbia University

Data Analysis for Life Sciences (Statistics with R), with Harvard University. This is a multi-class offering.

Introduction to Statistics, with UC Berkeley. This is a multi-class offering that includes descriptive statistics, inferential statistics, probability, etc.

Introduction to R Programming, with Microsoft. Self-paced.

Check out the web sites for start dates. Some of the courses have come and gone, but the material has been archived and is still available for self-study.

Revolutions Index of Online R Courses:

This is an excellent listing of online R courses.

Books:

Advanced R, by Hadley Wickham

I don’t consider myself an “advanced” R user, but I’ve moved well beyond the beginner phase and was looking for something to help get me to the next level. It turned out this book was just the ticket. Although described as “advanced,” the topics, examples, and writing are all easily accessible for anyone who has progressed to the late-beginner stage or beyond. One of the best books I’ve bought on programming in general, and R in particular.

Moneyball, by Michael Lewis

If there was one single thing that turned my head towards data analytics some years ago, it was reading this book. Since that first read, I have probably read it four more times over the years. A recent situation induced me to read it again, and I found that even after all these years, it’s still an awesome book. If you want to see how data analytics can be a literal and figurative game-changer, this is the book to read.

Miscellaneous:

What to do with small data?

There has been a lot of press about “big data,” and many an enterprise has built out a human and physical infrastructure to deal with it. But what about small data? Often times big data solutions fail in small(er) data scenarios. This is an interesting article that discusses a number of approaches to the problem.

Hadley Wickham AMAs (Ask Me Anything)

For those who don’t know of him, Hadley Wickham is the Chief Scientist at RStudio, and the author of numerous R packages. He had an AMA on Reddit in late September that was awesome and well worth reading, and he will have another AMA on Friday, 13 November to discuss new stuff in ggplot2.

Edward Tufte podcast

Edward Tufte is to data visualization what Hadley Wickham is to R packages. He’s the guy who — probably more than any one person — put data visualization in the popular imagination with his series of stunning books. In any case, this podcast is about 50 minutes, and you can hear his thoughts on the current data visualization landscape and where things are going.

Not So Standard Deviations podcast

Roger Peng — of the Coursera Data Science series — has teamed up with data scientist and blogger Hilary Parker to create an ongoing series of podcasts that discuss various aspects of data analytics and data science. Kind of goofy at times, but enjoyable and often informative.

ODU Business Gateway unWINEd

5:00 – 7:00 pm, Tuesday, 17 November at Bon Secours Cancer Institute, 155 Kingsley Lane, Norfolk

This is a social networking event that the Business Gateway has about once a month. It’s free…there are snacks and wine…and it’s a good place to meet people. Dress tends to be business casual.

Hampton Roads Dev Fest

Saturday, 14 November

This seems to be a local Microsoft love-in. Not that there’s anything wrong with that, but I see little, if any, focus on data science or analytics. And I find it odd that there are no R speakers, considering Microsoft bought Revolution Analytics recently. Nonetheless, if you are a programmer you might enjoy it and/or get something out of it.

October 2015 Items of Interest

October Meetups:

Tidewater Analytics: Tuesday, 13 October

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

The feature presentation will be by Brian Magill, a senior programmer at Science Systems and Applications (as well as the organizer for the 757 Python User Group). He will discuss the programming language Julia, which has been compared favorably to R and Python.

Following Brian’s presentation, the Kaggle Machine Learning group will meet.

757 R User’s Group: Tuesday, 20 October

6:30 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

Jay Gendron, of Booz Allen, will discuss Hadley Wickham’s packages tidyr and reshape2, both of which are used for data wrangling.

Tidewater Big Data Enthusiasts: Tuesday, 27 October

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, Norfolk.

Topics this month include an overview of the Big Data ecosystem (e.g., Amazon Web Services, Google Analytics, Microsoft Azure), and a hands-on introduction to Amazon Web Services.

MOOCs and other educational venues:

Coursera:

Coursera has started a number of interesting specialization tracks under the general heading of Data Science. They include the following:

Big Data, with UC San Diego

Executive Data Science, with Johns Hopkins University

Business Analytics, with the University of Pennsylvania

Data Mining, with University of Illinois

Genomic Data Science, with Johns Hopkins University

Check out the web sites for start dates.

EdX:

EdX is picking up the pace, too. One of the new specializations that jumped out at me is a six-part Data Analysis for Life Sciences, with Harvard University. That starts on 15 October.

The Rosalind Project

I cannot recommend this enough if you are learning how to code. Named after Rosalind Franklin, the real hero behind the discovery of DNA’s structure, this site has a series of increasingly difficult coding problems (similar to Project Euler) that challenge you in a variety of different ways and definitely help you increase your skill level. Unlike Euler’s focus on mathematics, Rosalind is basically about bioinformatics, and the topics fall under the broad headings of string algorithms, combinatorics, graphs and graph algorithms, sorting, set theory, and probability. Good stuff!

Miscellaneous:

Data Science Weekly Newsletter

This is one of those subscriptions that I highly recommend. There’s always good stuff in here. Well worth taking a couple of minutes to sign up for.

September 2015 Items of Interest

September Meetups:

757 Python User’s Group

There will not be a September meeting of the 757 PUG.

Tidewater Analytics: Tuesday, 08 September

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

There will be a general overview of machine learning and an overview of Kaggle competitions, with a discussion about the formation of a Tidewater Kaggle group.

757 R User’s Group: Tuesday, 15 September

6:30 pm at 757 Creative Space, 259 Granby, Suite 250, downtown Norfolk.

The topic will be Hadley Wickham’s data manipulation package, dplyr.

Tidewater Big Data Enthusiasts: Tuesday, 22 September

7:00 pm, Tuesday, 22 September at 757 Creative Space, 259 Granby St. Suite 250, Norfolk

Dr. Chuck Cartledge, computer scientist and adjunct professor at Tidewater Community College, is kicking off the area’s newest meetup on Big Data.

MOOCs and other educational venues:

Coursera:

Mining Massive Data Sets

A great complement to Tidewater Big Data Enthusiasts.

EdX:

Introduction to R Programming

A free course. Interesting to note that this is sponsored by Microsoft. As many may know, Microsoft bought Revolution Analytics a few months ago. This furthers their advance into R and analytics.

Open Education:

Introduction to Excel VBA Programming

I think Excel in general, but Excel VBA in particular, is one of the most underrated and underused tools in data analytics. I’ve done a bunch of Excel VBA programming over the years, and really like it. Although I migrated from the Microsoft ecosystem quite some time ago, I’d highly recommend this to anyone who is interested in analytics and has access to current Excel applications. This will fit particularly well with Cathy Green’s December presentation on Excel 2013.

And somewhat related to this, a Reddit post with resources for becoming an Excel master.

Books and such:

Mastering RStudio

Although not released yet, this seems as though it will be a nice little book to have.

Intermediate Python

One of the biggest complaints in Python land is that although there are tons and tons of good beginners’ books, there is an absolute dearth of good intermediate books.

I downloaded and scanned this particular book, and I’d say it’s decent. I would have liked more depth in some areas, but it’s good and worth having.

Fluent Python

As with “Intermediate Python,” above, this fills a niche that’s been needing filling. I just got my copy, and really like it.

Miscellaneous:

Hadley Wickham at the DC Statistical Programming Meetup

Wow! This is a big deal. Hadley Wickham will be speaking at a Meetup in DC on Wednesday, 16 September. The topic will be, “Creating Fluent Interfaces in R”, which he describes as follows:

A fluent interface lets you easily express yourself in code. Over time a fluent interface retreats to your subconscious. You don’t need to bring it to mind; the code just flows out of your fingers. I strive for this fluency in all the packages I write, and while I don’t always succeed, I think I’ve learned some valuable lessons along the way.

This should be great, and transferable to other programming languages. I plan on going.

Interview with Hadley Wickham

Not a particularly extensive interview, but he does offer some interesting insights into Big Data as well as career planning advice for aspiring Data Scientists.

The blog on which that interview exists also has some interviews with other Data Scientists that are worth checking out.

Curated List of Data Science Blogs

This is a pretty good list. I subscribe to many of the individual blogs.

Gartner’s 2015 Hype Cycle

This is always a fun, interesting, and informative product to check out. Not that Gartner is always spot on, but enough business leaders pay attention to this that it sometimes turns out to be a self-fulfilling prophecy.
