Monday, January 11, 2016

Data science for the rest of us

A couple of weeks ago I watched an interesting Microsoft webinar called Data Science for the rest of us. I have been interested in data science ever since I read the excellent book Doing Data Science: Straight Talk from the Frontline by Cathy O’Neil and Rachel Schutt, and articles like Data Scientist: The Sexiest Job of the 21st Century sparked this interest even more.

In this webinar Brandon Rohrer (@_brohrer_) uses a number of great examples to explain some key ingredients, or trade secrets, of doing data science in easy-to-understand terms. Here’s a quick recap (although I really recommend that you watch the video):
  • Trade secret 1: Data is not the starting point (and you have to ask sharp questions). I really like the definition formulated by Jeff Leek (@jtleek) in Data science done well looks easy, which is a big problem: “Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.” So you first need a precise question, and only then do you look for the right data or, as indicated in the webinar, data that is relevant, connected, accurate and sufficient. I’m not a data scientist, but this really seems like the hardest part (or as the article For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights puts it).
  • Trade secret 2: Turn your data into a picture (check out the example used in the webinar). People effortlessly recognize and classify objects among tens of thousands of possibilities, so visualizing your data can help you make sense of it. (For an interesting scientific article on this topic, take a look at How does the brain solve visual object recognition?)

  • Trade secret 3: Data science can only answer five questions: how much or how many? [regression], which category does something belong to? [classification], which groups exist in a dataset? [clustering], is something weird? [anomaly detection] and which action should you take? [reinforcement learning].
  • Trade secret 4: Machine learning is simple. This statement is a little exaggerated, but the analogy between mastering a foreign language and mastering machine learning does hold. You need to learn the lingo: everyone probably knows tables, either in Excel or in a database, but data scientists refer to the rows of a table as data points or samples. The columns of the table typically each describe a specific characteristic, which data scientists call a feature.
  • Trade secret 5: There are a lot of right ways to solve a specific problem. If you look at the Machine learning algorithm cheat sheet for Microsoft Azure Machine Learning Studio, you will notice that there are a lot of different ways to solve a specific problem (with certain nuances such as the number of features available or the speed of calculating the model), but in most cases the choice apparently does not matter that much.
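
To make the vocabulary from trade secrets 3 and 4 a bit more concrete, here is a minimal Python sketch. It is not taken from the webinar: the table, the column names (size_m2, rooms, price_k) and the helper fit_line are all made up for illustration. It shows a table whose rows are samples and whose columns are features, and then answers a "how much?" question with a very simple regression (an ordinary least-squares line through one feature).

```python
# Hypothetical table: each row is one house (a sample, or data point),
# each column is one characteristic (a feature). All values are invented.
table = [
    {"size_m2": 50, "rooms": 2, "price_k": 150},
    {"size_m2": 80, "rooms": 3, "price_k": 240},
    {"size_m2": 100, "rooms": 4, "price_k": 300},
    {"size_m2": 120, "rooms": 4, "price_k": 360},
]

feature_names = ["size_m2", "rooms"]  # columns used as input
target_name = "price_k"               # column we want to predict

# Rows become samples (lists of feature values); the target is kept apart.
samples = [[row[name] for name in feature_names] for row in table]
targets = [row[target_name] for row in table]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "How much?" is a regression question: predict the price from the size only.
sizes = [sample[0] for sample in samples]
slope, intercept = fit_line(sizes, targets)
predicted = slope * 90 + intercept  # price estimate for a 90 m2 house

print(f"{len(samples)} samples, {len(feature_names)} features")
print(f"estimated price for 90 m2: {predicted:.1f}k")
```

In practice you would probably reach for a library or a tool like Azure Machine Learning Studio rather than writing the fit by hand; the point here is only to show how rows, columns and the "how much?" question map onto samples, features and regression.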


To get an overview of other Microsoft webinars on similar topics, check out Big Data and Advanced Analytics: On-demand and upcoming live webinars.