August 2015 Items of Interest

August Meetups:

757 Python User’s Group: Tuesday, 04 August.

7:00 pm at the Hampton Public Library, 4207 Victoria Blvd., Hampton.

Brian Magill, senior programmer at Science Systems and Applications, will discuss Python Iterators and Generators, with an emphasis on Generators.
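For anyone who wants a preview of the topic, here is a minimal sketch (not taken from the talk) contrasting a generator function with a hand-written iterator class. The names `countdown` and `Countdown` are made up for illustration.

```python
def countdown(n):
    """Generator: yields n, n-1, ..., 1 lazily, one value per request."""
    while n > 0:
        yield n
        n -= 1


class Countdown:
    """Iterator class that does the same job via the iterator protocol."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        # Signal exhaustion by raising StopIteration.
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value


gen_values = list(countdown(3))
iter_values = list(Countdown(3))
print(gen_values, iter_values)
```

Both produce the same sequence; the generator version simply lets Python handle the bookkeeping that the class spells out by hand.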

Tidewater Analytics: Tuesday, 11 August

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

Steve Mortimer, data scientist at Dominion Enterprises, will lead a hands-on workshop on Git and GitHub.

757 R User’s Group:

There will be no R User Group meeting in August. The next meeting will be on Tuesday, 15 September.

Other Meetup News:

Tidewater Big Data Enthusiasts:

7:00 pm, Tuesday, 22 September at 757 Creative Space, 259 Granby St. Suite 250, Norfolk.

Dr. Chuck Cartledge, computer scientist and adjunct professor at Tidewater Community College, is kicking off the area’s newest meetup on Big Data.

Team Tidewater Kaggle group forming:

An open group for competing in Kaggle competitions will be forming in September. This will be open to anyone who wants to learn, improve, and simply participate at any level, and there will be room for a very diverse range of skills and interests.

There will be more information about this at both the August and September Tidewater Analytics Meetups. But if you think you might be interested, this would be a good blog post to take a peek at: Kaggle Competition Tips and Summaries, as well as this Kaggle R Tutorial on Machine Learning.

MOOCs and other educational venues:


Reasoning, Data Analysis, and Writing

Nationally known and highly regarded data scientist Hilary Mason has said that the single biggest skill lacking in data science is the ability to tell a narrative about the data. This Coursera specialization, taught by Duke University, seems like a very good candidate for helping address that shortfall.

Big Data University

Introduction to R

I cannot vouch for the quality of the instruction, but they seem to hit on all the right topics for a beginner course.

Data Camp:

Intermediate R

As with the Introduction to R course, above, I cannot vouch for this. But again, the topics seem to be spot on.

Books and such:

Exploratory Data Analysis (EDA) with R

Dr. Roger Peng, who teaches the R portion of Coursera’s Johns Hopkins Data Science specialization, put out this excellent book on Exploratory Data Analysis with R.

From the web site description: This book teaches you to use R to effectively visualize and explore complex datasets. Exploratory data analysis is a key part of the data science process because it allows you to sharpen your question and refine your modeling strategies. This book is based on the industry-leading Johns Hopkins Data Science Specialization, the most widely subscribed data science training program ever created.


Python 3.5 released

Here is a summary of new and improved features in the recently released Python 3.5.

Johns Hopkins Data Science Hack-a-Thon

It’s a bit of a haul, but the same Johns Hopkins folks who teach the Coursera Data Science track are sponsoring a Data Science Hack-a-Thon in Baltimore on 21 – 23 September 2015.

Data Science Machine Learning Cheat Sheets:

KDnuggets has an excellent post that lists over 50 data science and machine learning cheat sheets.

Curated Data Science Information on GitHub:

Analytics Vidhya has a pretty cool post on the various Data Science resources available on GitHub.

Posted in Uncategorized | Leave a comment

July 2015 Items of Interest


757 Python User’s Group: Tuesday, 07 July.

The 757 Python User’s Group will meet at the Hampton Public Library, 4207 Victoria Blvd., Hampton.

The topic this month is Errors and Exceptions in Python 3. Rick Jones will discuss the following related topics: types of errors and traceback messages, exceptions vs assertions, general exception structure (try, except, else, finally), raising exceptions and user-defined exceptions, exception anti-patterns, and logging exceptions.

The Python code and the presentation slides will be distributed at the Meetup, and posted on the Tidewater Analytics GitHub site afterwards.
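Until the presenter's material is posted, here is a small sketch of my own (not Rick's code) showing how the try / except / else / finally pieces fit together with a user-defined exception and logging. The names `NegativeValueError` and `safe_sqrt` are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


class NegativeValueError(ValueError):
    """User-defined exception (hypothetical name, for illustration only)."""


def safe_sqrt(x):
    """Return the square root of x, or None if x is negative."""
    try:
        if x < 0:
            # Raise our own exception type for invalid input.
            raise NegativeValueError("negative input: {}".format(x))
        result = x ** 0.5
    except NegativeValueError:
        # log.exception records the message plus the traceback.
        log.exception("could not take square root")
        return None
    else:
        # Runs only when the try block raised nothing.
        return result
    finally:
        # Runs on every path, success or failure.
        log.info("safe_sqrt(%r) finished", x)


good = safe_sqrt(9)
bad = safe_sqrt(-1)
```

Catching only the specific exception, rather than a bare `except:`, is one way to avoid a common exception anti-pattern: unrelated errors such as a `TypeError` still propagate to the caller.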

Tidewater Analytics: Tuesday, 14 July:

Tidewater Analytics will meet at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

Chris Ovide, a local entrepreneur, will present his current project that involves using binary logistic regression to determine a person’s preferences in facial attraction. The main presentation will be Polyglot Persistence. Dr. Chuck Cartledge, of Old Dominion University, will discuss the variety of relational and NoSQL data stores that are currently in use, and compare and contrast them from the CRUD (create-read-update-delete) perspective.
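For readers unfamiliar with CRUD, here is a minimal sketch of the four operations against a relational store, using Python's built-in `sqlite3` module with an in-memory database. It is not from the presentation; the `members` table and its columns are invented for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT, lang TEXT)"
)

# Create: insert a row.
conn.execute("INSERT INTO members (name, lang) VALUES (?, ?)", ("Ada", "R"))

# Read: fetch it back.
row = conn.execute(
    "SELECT name, lang FROM members WHERE name = ?", ("Ada",)
).fetchone()

# Update: change one column.
conn.execute("UPDATE members SET lang = ? WHERE name = ?", ("Python", "Ada"))
updated = conn.execute(
    "SELECT name, lang FROM members WHERE name = ?", ("Ada",)
).fetchone()

# Delete: remove the row and count what is left.
conn.execute("DELETE FROM members WHERE name = ?", ("Ada",))
remaining = conn.execute("SELECT COUNT(*) FROM members").fetchone()[0]

conn.close()
print(row, updated, remaining)
```

A NoSQL store exposes the same four operations through a different interface (for example, put / get / delete on keys or documents), which is what makes CRUD a convenient yardstick for comparison.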

757 R User’s Group: Tuesday, 21 July:

The 757 R User’s Group will meet at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk.

Nipun Rahman will give a presentation on the sqldf package, which allows manipulation of data frames using SQL commands.

The main meeting will be preceded by a 30 – 45 minute beginner’s tutorial session — starting at 5:45 pm — for those who are new to R.

Other Meetup News:

Tidewater Analytics Big Data Enthusiasts: Dr. Chuck Cartledge, who will be giving the Polyglot Persistence presentation at the July Tidewater Analytics meeting, has created a new local Meetup that will focus on Big Data. From the Meetup site:

We are a group of people in the Tidewater area who are interested in exploring, sharing, and understanding Big Data. We have a mixture of people with interests and expertise in various aspects of big data, what it is, how it works, how it affects each of us, and assist people establishing personal networks of friends and colleagues with similar interests.

A collection of things that I hope the group will explore, includes:

1.  What is “Big Data,” how does it affect me, and how am I supplying “Big Data??”  Probably include a simple demonstration of using Pig to process Medicare data, and then displaying the results via R.

2.  Where can I get my hands on some Big Data??  Medicare payments, pharmaceutical payments, census data, ZCATs, etc.   Some sort of demo on getting data from these places, how the data has to be munged, what kinds of problems exist with all data sets.

3.  What kinds of tools are available for processing Big Data??  The world doesn’t end at Hadoop or Cassandra.  There are other tools/applications that might be a better fit.

4.  How do I visualize all this data??  Getting Big data is fun.  Analysing can be a challenge.  When it is all over, how can the data be made real with some sort of visualization techniques.

5.  What are the challenges with real-time Big Data??  Firstly, what does real-time mean??  Secondly, what kinds of tools are available to handle masses of real-time data.

6.  How does the “Internet of Things” affect what we call Big Data??  As more and more things (cars, phones, refrigerators, wearable devices) are wired, and more and more data is being collected, how does that affect what we do with Big Data??

As we talk and share ideas, other topics will come up and we will follow them to see where they go.

Come ready to share ideas, experiences, and interest in all things Big Data.

Link: Tidewater Big Data Enthusiasts

MOOCs and other online educational venues:


Genomic Data Science Certificate. Dr. Jeff Leek, of Johns Hopkins University and one of the principals behind the Data Science Specialization track, is offering a 7-part (plus Capstone) certificate program in Genomic Data Science. In view of the growing interest and prevalence of BioTech — along with Dr. Leek’s proven track record with Coursera — this is probably a pretty decent investment.

Data Science Certificate. Speaking of Jeff Leek, the ongoing 9-part (plus Capstone) Data Science Specialization begins again on 6 July. A number of Tidewater Analytics members have taken various courses in this, and the reviews have been uniformly good.

Duke University:

Data, Statistical Inference, and Modeling. Dr. Mine Çetinkaya-Rundel, Assistant Professor of the Practice in the Department of Statistical Science at Duke University, is offering an online, non-credit certificate course that “explores methods of acquiring and validating data, analyzing and modeling data, and interpreting results correctly without relying on statistical jargon.  A step-by-step technique shows how to use R statistical programming language for practical applications.  Examples and projects focus on real-world phenomena and have common usage in a wide variety of professions.  An instructor provides webinars (a.k.a., virtual office hours) to guide your learning through complex problems and projects.”

Online Events and Topics:

Intro to SparkR. 1:00 pm, Wednesday, 15 July

SparkR is an R package that combines the power of R and Spark. It provides a lightweight front-end to Apache Spark by exposing the Spark API allowing users to run interactive Spark jobs from the R shell. The free webinar will feature Shivaram Venkataraman, Co-author of SparkR. Here is the registration link.

Neural Networks for Newbies

A good video for anyone interested in neural networks and just getting started at it.

Books and such:

Introduction to Statistics (with Python). This looks like a really good free book. If I were not so committed to R right now in my statistics work, I’d be all over this.

Machine Learning in Python: Essential Techniques for Predictive Analysis. A relatively new book with good reviews.

Probabilistic Programming & Bayesian Methods for Hackers. This seems to be a free, downloadable book with lots of good information on programming Bayesian methods. It is self-described as “An intro to Bayesian methods and probabilistic programming from a computation/understanding-first, mathematics-second point of view.”


Master R Developer Workshop: Although not until September, this is one that you cannot start planning for too early. Hadley Wickham will be conducting this workshop in Washington DC on 14 – 15 September, 2015. The price is pretty hefty at $1,500 plus some sort of $55 registration fee. But if you’re serious about being an R developer, it probably doesn’t get any better than this.

Open Source Data Science Master’s: This is a quirky little website that’s worth taking a look at if you are reading this blog.

R 3.2.1 Released: Many new features and bug fixes. Check it out: R 3.2.1


June 2015 Items of Interest


757 Python User’s Group: Tuesday, 02 June.

The 757 Python User’s Group will meet at 7:00 pm at the Hampton Public Library, 4207 Victoria Blvd., Hampton. No topics are announced at this time.

Tidewater Analytics: Tuesday, 09 June:

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk. Dr. Vicki Garcia of Old Dominion University will discuss her background and use of analytics in studying behavioral ecology. Rick Jones will present a relatively non-technical overview of time series analysis for the purpose of introducing the fundamental concepts and vocabulary.

757 R User’s Group: Tuesday, 16 June:

6:30 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk. Dr. Patrick Kilduff will present base plotting in R. There will be a pre-meeting at 6:00 pm for new R users.

MOOCs and other educational venues:


Programming for Everybody (Python): 01 Jun – 09 Jul. This course aims to teach everyone the basics of programming computers using Python. The course has no pre-requisites and avoids all but the simplest mathematics. Anyone with moderate computer experience should be able to master the materials in this course.

Text Mining and Analytics: 08 Jun – 04 Jul. Explore algorithms for mining and analyzing big text data to discover interesting patterns, extract useful knowledge, and support decision making.

Questionnaire Design for Social Surveys: 01 Jun – 10 Aug. This course will cover the basic elements of designing and evaluating questionnaires. It will review the process of responding to questions, challenges and options for asking questions about behavioral frequencies, practical techniques for evaluating questions, mode specific questionnaire characteristics, and review methods of standardized and conversational interviewing.

Data Science Signature Track: Ongoing. This is a pretty good series of nine one-month courses that covers the range of data science topics. Although some of the topics demand a longer and more rigorous treatment, the early topics are great introductions to things like GitHub, markdown, R, and data munging. If you are interested in analytics, these courses are a great place to start.


The Analytics Edge: 02 Jun – 25 Aug. Data is transforming business, social interactions, and the future of our society. In this course, you will learn how to use data and analytics to give an edge to your career and your life. It will examine real world examples of how analytics have been used to significantly improve a business or industry. These examples include Moneyball, eHarmony, the Framingham Heart Study, Twitter, IBM Watson, and Netflix.

edX has also just announced a new series of Big Data courses. The series consists of two courses focused on Apache Spark. If you are not familiar with Spark, it is a very fast engine for large-scale data processing. It claims to perform up to 100 times faster than Hadoop. The courses are free but verifiable certificates can be purchased for $50 per course. Here are the two courses:

Introduction to Big Data with Apache Spark: 01 Jun – 29 Jun. Organizations use their data for decision support and to build data-intensive products and services, such as recommendation, prediction, and diagnostic systems. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then teach students how to use PySpark (part of Apache Spark) to deliver against these expectations. The course assignments include Log Mining, Textual Entity Recognition, and Collaborative Filtering exercises that teach students how to manipulate data sets using parallel processing with PySpark.

Scalable Machine Learning: 29 Jun – 03 Aug. This course introduces the underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines. It presents an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. You will gain hands-on experience applying these principles using Apache Spark, a cluster computing system well-suited for large-scale machine learning tasks. You will implement scalable algorithms for fundamental statistical models (linear regression, logistic regression, matrix factorization, principal component analysis) while tackling key problems from various domains: online advertising, personalized recommendation, and cognitive neuroscience.

Data Camp:

Kaggle R Tutorial on Machine Learning: I have mixed feelings about Data Camp, but this particular course seems like it might be interesting and useful.

Online Events:

Wednesday, 10 June: RStudio Webinar – Hadley Wickham – Getting Your Data Into R
Registration Link

Books and such:

Hadley Wickham’s Advanced R was published in September 2014, but a pretty good review of it was just posted a few days ago on R-Bloggers. Sounds like a book worth getting.

Dr. Norman Matloff’s new book, Parallel Computing for Data Science, will be published in June. There are no reviews (since it has not been published), but there is this from the publisher’s blurb:

Parallel Computing for Data Science: With Examples in R, C++ and CUDA is one of the first parallel computing books to concentrate exclusively on parallel data structures, algorithms, software tools, and applications in data science. It includes examples not only from the classic “n observations, p variables” matrix format but also from time series, network graph models, and numerous other structures common in data science. The examples illustrate the range of issues encountered in parallel programming.

With the main focus on computation, the book shows how to compute on three types of platforms: multicore systems, clusters, and graphics processing units (GPUs). It also discusses software packages that span more than one type of hardware and can be used from more than one type of programming language. Readers will find that the foundation established in this book will generalize well to other languages, such as Python and Julia.


Top 10 Data Mining Algorithms in Plain English

RStudio v0.99 now available. Highlights include:

* User-customizable code snippets that automate common editing tasks.

* New and improved code completion.

* A new data viewer that supports large datasets, filtering, searching, and sorting.

* Code diagnostics in the source code editor.

PyCharm 4.5 now available. Highlights include:

* In-line debugger.

* Improved IPython notebook integration.

* Temporary scratch files.

Note: There were some bugs in PyCharm 4.5, and what you probably want to download is PyCharm 4.5.1.



May 2015 Items of Interest


MOOCs and other educational venues:

  • Applied Logistic Regression – Ohio State is offering this eight-week course via Coursera starting Monday, 11 May. Logistic regression is an increasingly popular statistical model that has become the standard method in many domains for regression analysis of binary response data. Note: Enrollment for this course will close on Wednesday, May 13, 2015.
  • Text Mining and Text Analytics – Natural language processing is rapidly entering the mainstream and gaining credibility. A four-part Text Analytics series is being offered that covers (1) Text Mining, (2) Natural Language Processing, (3) Python’s Natural Language Processing Toolkit, and (4) Sentiment Analysis. This seems like a decent entry point into text analytics.

Books and such:




April 2015 Items of Interest


  • 757 Python Users Group: The 757 PUG will meet at 7:00 pm on Tuesday, 07 April at the Hampton Public Library, 4207 Victoria Blvd, Hampton. The topic will be Cython, a tool that can speed up Python dramatically.
  • Tidewater Analytics: The regular meeting of Tidewater Analytics is at 7:00 pm on Tuesday, 14 April at 757 Creative Space, Norfolk. The main topic this month will be Bayesian Methods. The main topic will be preceded by a short member show-and-tell that highlights an individual member’s work in analytics.
  • R User Group: The R User Group will meet at 6:30 pm on Tuesday 21 April at 757 Creative Space, Norfolk. The main topic this month will be data types and data structures in R. There will be a 30-minute “Introduction to R” session at 6:00 pm for those new to R. Also, there will be an opportunity after the main presentation for members to discuss issues they are having in R.

MOOCs and other educational venues:

  • The Eindhoven University of Technology, Netherlands, is offering a six-week course in Process Mining via Coursera, starting 01 April. This is described as “the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy to use software the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.”
  • Coursera, EdX, and Udacity all continue to offer a variety of courses — often affiliated with top-notch universities — related to analytics.

Books and such:

  • Jeff Leek, one of the principal architects and instructors in Coursera’s 9-month Data Science program, published a nice little book called “The Elements of Data Analytic Style” (modeled loosely on Strunk and White’s classic, “The Elements of Style”). The book is free, but they are accepting donations. You can find it at https://leanpub.com/datastyle



March 2015 Items of Interest


757 Python Social: The 757 Python User Group is having a dinner/social at 7:00 pm on Tuesday, 03 March at Marker 20, 21 E Queens Way, Hampton, VA. Brian Magill, the organizer, has a background and interest in analytics; he gave an excellent presentation on Python NumPy in January. If you have Python and analytics proclivities, and would like to help shape future Python presentations, this would be a good opportunity to meet Brian and have a discussion.

Tidewater Analytics meeting: The regular meeting of Tidewater Analytics is at 7:00 pm on Tuesday, 10 March at 757 Creative Space.

R User Group: The still-forming R User Group will meet on either Tuesday or Wednesday during the third week of March at 757 Creative Space. Keep an eye out for updates.

MOOCs and other educational venues:

There are a couple of analytics-oriented MOOCs getting started that might be of interest:


Books and such:

If you do much command line work, the 4th edition of O’Reilly’s “Effective Awk Programming: Universal Text Processing and Pattern Matching” is scheduled for release this month. Awk is one of the pillars of command line data science, and by all accounts this is a pretty decent release.
