June 2015 Items of Interest


757 Python User’s Group: Tuesday, 02 June:

The 757 Python User’s Group will meet at 7:00 pm at the Hampton Public Library, 4207 Victoria Blvd., Hampton. No topics have been announced at this time.

Tidewater Analytics: Tuesday, 09 June:

7:00 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk. Dr. Vicki Garcia of Old Dominion University will discuss her background and use of analytics in studying behavioral ecology. Rick Jones will present a relatively non-technical overview of time series analysis for the purpose of introducing the fundamental concepts and vocabulary.

757 R User’s Group: Tuesday, 16 June:

6:30 pm at 757 Creative Space, 259 Granby St. Suite 250, downtown Norfolk. Dr. Patrick Kilduff will present base plotting in R. There will be a pre-meeting at 6:00 pm for new R users.

MOOCs and other educational venues:


Programming for Everybody (Python): 01 Jun – 09 Jul. This course aims to teach everyone the basics of programming computers using Python. The course has no prerequisites and avoids all but the simplest mathematics. Anyone with moderate computer experience should be able to master the materials in this course.

Text Mining and Analytics: 08 Jun – 04 Jul. Explore algorithms for mining and analyzing big text data to discover interesting patterns, extract useful knowledge, and support decision making.

Questionnaire Design for Social Surveys: 01 Jun – 10 Aug. This course will cover the basic elements of designing and evaluating questionnaires. It will review the process of responding to questions, challenges and options for asking questions about behavioral frequencies, practical techniques for evaluating questions, mode-specific questionnaire characteristics, and methods of standardized and conversational interviewing.

Data Science Signature Track: Ongoing. This is a pretty good series of nine one-month courses that cover the range of data science topics. Although some of the topics demand a longer and more rigorous treatment, the early courses are great introductions to things like GitHub, markdown, R, and data munging. If you are interested in analytics, these courses are a great place to start.


The Analytics Edge: 02 Jun – 25 Aug. Data is transforming business, social interactions, and the future of our society. In this course, you will learn how to use data and analytics to give an edge to your career and your life. It will examine real-world examples of how analytics have been used to significantly improve a business or industry. These examples include Moneyball, eHarmony, the Framingham Heart Study, Twitter, IBM Watson, and Netflix.

edX has also just announced a new series of Big Data courses. The series consists of two courses focused on Apache Spark. If you are not familiar with Spark, it is a very fast engine for large-scale data processing; the project claims it can run programs up to 100 times faster than Hadoop MapReduce in memory. The courses are free, but verified certificates can be purchased for $50 per course. Here are the two courses:

Introduction to Big Data with Apache Spark: 01 Jun – 29 Jun. Organizations use their data for decision support and to build data-intensive products and services, such as recommendation, prediction, and diagnostic systems. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then teach students how to use PySpark (part of Apache Spark) to deliver against these expectations. The course assignments include Log Mining, Textual Entity Recognition, and Collaborative Filtering exercises that teach students how to manipulate data sets using parallel processing with PySpark.

Scalable Machine Learning: 29 Jun – 03 Aug. This course introduces the underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines. It presents an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. You will gain hands-on experience applying these principles using Apache Spark, a cluster computing system well-suited for large-scale machine learning tasks. You will implement scalable algorithms for fundamental statistical models (linear regression, logistic regression, matrix factorization, principal component analysis) while tackling key problems from various domains: online advertising, personalized recommendation, and cognitive neuroscience.

Data Camp:

Kaggle R Tutorial on Machine Learning: I have mixed feelings about Data Camp, but this particular course seems like it might be interesting and useful.

Online Events:

Wednesday, 10 June: RStudio Webinar – Hadley Wickham – Getting Your Data Into R
Registration Link

Books and such:

Hadley Wickham’s Advanced R was published in September 2014, but a pretty good review of it was just posted a few days ago on R-Bloggers. Sounds like a book worth getting.

Dr. Norman Matloff’s new book, Parallel Computing for Data Science, will be published in June. There are no reviews (since it has not been published), but there is this from the publisher’s blurb:

Parallel Computing for Data Science: With Examples in R, C++ and CUDA is one of the first parallel computing books to concentrate exclusively on parallel data structures, algorithms, software tools, and applications in data science. It includes examples not only from the classic “n observations, p variables” matrix format but also from time series, network graph models, and numerous other structures common in data science. The examples illustrate the range of issues encountered in parallel programming.

With the main focus on computation, the book shows how to compute on three types of platforms: multicore systems, clusters, and graphics processing units (GPUs). It also discusses software packages that span more than one type of hardware and can be used from more than one type of programming language. Readers will find that the foundation established in this book will generalize well to other languages, such as Python and Julia.


Top 10 Data Mining Algorithms in Plain English

RStudio v0.99 now available. Highlights include:

* User-customizable code snippets for automating common editing tasks.

* New and improved code completion.

* A new data viewer that supports large datasets, filtering, searching, and sorting.

* Code diagnostics in the source code editor.

PyCharm 4.5 now available. Highlights include:

* In-line debugger.

* Improved IPython notebook integration.

* Temporary scratch files.

Note: PyCharm 4.5 shipped with some bugs, so what you probably want to download is PyCharm 4.5.1.

