February 26, 2014

Data Science - a DIY approach

Everybody wants to become a data scientist, but given the huge number of tools and techniques available, no one really seems to know where to start or how to go about acquiring the right skills.
image taken from http://www.rosebt.com

While teaching data science to students at the Praxis Business School, I have realised that a huge amount of information is freely available on the web and, thanks to Google, one can find it very easily. If one is motivated enough, one can learn data science on one's own. In an earlier post I gave a very broad Introduction to Data Science, but in this post, which I plan to update every now and then, I have tried to assemble a collection of tutorials, grouped by topic, that will help anyone get a good grip on this subject by actually doing something practical.

R : Statistical Programming Language

Forget about SAS, SPSS and other proprietary languages that want to extort money from you and simply invest your time and effort in mastering R, a free but excellent tool that is used most widely in data science.

Here is a "Level 0 tutorial for getting started with R" [ alternate link ] by April Galyardt that will get you started. This tutorial will tell you how to install R, but you should also consider installing RStudio, a free graphical development environment that makes R easier to use.

Galyardt's tutorial is very comprehensive, but if you are the impatient type for whom doing anything in depth is too tiring, and yet you want to be familiar with this cutting edge technology, then I suggest that you go through this 5-part Beginner's Guide to R from Computerworld.

Both these tutorials require you to install R on your machine, which is not too difficult, but if even that is too much for you, I suggest that you visit Data Camp, or Try R at the O'Reilly Code School. This will give you a flavour of what it is that people do with R.

Statistics

The ancient science of statistics has got a new lease of life and celebrity status because of a sudden surge of interest in data science. If you leave aside Map-Reduce and the associated world of Hadoop, then [ statistics + statistical tools ] is virtually synonymous with data science. Hence it is essential to get a good grip on statistics.

Statistics is an ancient science that is taught in many undergraduate and postgraduate courses, so there are many excellent books on it, but today's data scientist must be proficient not only in statistics but also in a tool like R.

An excellent place to start learning Statistics along with R is the open source, copyright free OpenIntro website. Not only will this allow you to download an excellent book in PDF format, but it will also offer you a set of lab exercises based on R where you can try out what you have learnt.

Other fairly comprehensive tutorials on statistics with R are available at Chi Yau's r-tutor.com, from Kelly Black of Clarkson University and from King of Coastal Carolina University. However, if you are seriously interested in a career in data science, you should purchase a regular textbook on Statistics for reference purposes, and for this you can look at either Statistics for Management by Levin or Applied Business Statistics by Black.

Traditional statistics is generally restricted to descriptive and inferential statistics, which are well covered in these links. Data science, however, is more interested in predictive statistics, and this is the subject of areas like Data Mining and Machine Learning that will be addressed later. In the meantime, check out this book for a quick Introduction to Data Science and also download a PDF file of the book.
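To make the difference between describing data and predicting from it a little more concrete, here is a minimal sketch in Python. The numbers are invented purely for illustration, and the helper names (describe, fit_line, predict) are my own and not taken from any of the tutorials linked here. The first function summarises data that we already have; the other two fit a straight line by ordinary least squares and use it to guess a value we have not yet seen.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration: advertising spend and sales.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 45.0, 65.0, 85.0, 105.0]


def describe(values):
    """Descriptive statistics: summarise the data we already have."""
    return {"mean": mean(values), "stdev": stdev(values)}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(x, slope, intercept):
    """Predictive step: estimate y for an x that is not in the data."""
    return slope * x + intercept


summary = describe(sales)
slope, intercept = fit_line(ad_spend, sales)
forecast = predict(60.0, slope, intercept)

print(summary)
print(f"forecast for a spend of 60: {forecast:.1f}")
```

Real predictive work uses far richer models than a straight line, but the shape of the task is the same: learn from the data you have, then estimate what you do not have.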

For a look at using R for more advanced tasks, you can turn to this "meta book" that points you to other books that show you how to tackle more complex problems.

Python with Anaconda

Python is another very strong challenger for the position of the best, or most powerful, tool for data science and it makes sense to get the hang of it. This is because it not only supports much of the functionality that R provides, through the Pandas library, but is also quite compatible with Hadoop and the world of Map-Reduce.

A quick way to get the hang of Python without installing it on your machine is to try out the Python tutorial at AfterHoursProgramming or somewhere similar. But a better way to get going would be to download Anaconda, which not only installs Python but also provides a rich GUI to execute Python programs in addition to the standard command line approach. Once you have your Python (with or without Anaconda) in place, you can try out this workshop from Open Tech School free of cost. Or you can jump straight into this very comprehensive tutorial on using Python for data science (alternate) that actually uses data from a Kaggle competition.

Visualisation

Visualisation is another key area of data science because a business wants data scientists to tell a good story with the data. My blog on visualisation is a nice starting point for all those who want to get started quickly with free tools like Google Charts. Tableau is an excellent and widely used visualisation tool with a public version available for free download, which you can try out with the free tutorials available in my blog post. Other excellent tools are Google Fusion Tables, which you can explore here, and Chartbuilder, which you can try out online here or learn more about in this tutorial.


Data Mining / Machine Learning

As we said earlier, traditional statistics is generally limited to descriptive and inferential work, but the world wants predictions, and predictive statistics is where we get into data mining and machine learning. Please understand that data mining and machine learning are complex subjects and you need to get a good grounding in the algorithms that are used for Classification, Clustering, Association Rules, Collaborative Filtering, Text Analytics and other complex tasks. A quick overview of all these techniques is available in this Overview of Data Mining Techniques.
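To give a flavour of what one of these tasks looks like in practice, here is a minimal sketch of Classification in plain Python, using the k-nearest-neighbours idea: a new point gets the label that is most common among the labelled points closest to it. The data and the function name knn_classify are invented for illustration and do not come from any of the linked tutorials; real projects would normally use a library rather than hand-written code like this.

```python
from collections import Counter
from math import dist

# Hypothetical labelled examples, invented for illustration:
# (height in cm, weight in kg) -> label
training = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((160.0, 58.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
    ((190.0, 95.0), "large"),
]


def knn_classify(point, examples, k=3):
    """Label `point` by majority vote among its k nearest examples."""
    # Sort the labelled examples by Euclidean distance to the new point.
    nearest = sorted(examples, key=lambda ex: dist(point, ex[0]))[:k]
    # Count the labels of the k closest examples and pick the most common.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify((158.0, 54.0), training))
print(knn_classify((183.0, 88.0), training))
```

The choice of k = 3 is arbitrary here; in practice the number of neighbours, and the way the features are scaled, have to be chosen by testing on data that was held back.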

It is difficult to take a short cut through this subject, but if you have a basic idea of what all this means then you can try out the examples and exercises given in this book on R-DataMining. Rattle is an excellent add-on to R that gives you a GUI for Data Mining. You can download Rattle and then try it out with this short but descriptive tutorial and this data.

On the other hand, if you are more of a programmer and less of a statistician, you may prefer to use Python for your task. A good way to get started is to download and read "A Programmer's Guide to Data Mining", which not only introduces the subject but also gives loads of ready-made Python code for you to try out.

A very comprehensive tutorial featuring Python (data collection, cleaning), R (data analysis) and D3 (visualisation) is available from Jake Porway's website on Data Without Borders.

Free Downloadable Books
Here is a list of 12 books that you can legally download and use in your quest for mastery over data science.

Putting it all together : The Big Picture
There is more to data science than being able to run statistical analysis with R. One needs to be able to "play" with data and seek out hidden patterns that, at first glance, do not seem to exist. For example, look at this "Island of Games" data puzzle to see what we mean. Data Science is much more than just a bag of tools and techniques; it is a way of doing things. So you learn by doing, and this blog post lists a set of activities that you can do, and if you can do them well, you can consider yourself a data scientist. But in case you cannot do all that, just do a Kaggle project with this tutorial and you are on your way.

Evergreen SQL

Last but not the least, a knowledge of relational databases and SQL is an absolute must for anyone who is interested in data science, but since SQL is so well known and so widely used in the technology community, I refrain from singing its praises or giving any pointers to how it can be learnt.

If you go through all this then you are almost there on your way to becoming a data scientist. The only thing missing is Map-Reduce and Hadoop, which are addressed in a subsequent post, Demystify Map Reduce and Hadoop with this DIY tutorial. In the meantime, subscribe to the Twitter feed of KDNuggets to keep yourself abreast of what is happening in the world of data scientists, whose profession the wise men at Harvard have declared to be the Sexiest Job of the Twenty First Century.

This survey has addressed the tools and techniques that data scientists use in their daily job. What is missing here is an understanding of the business domain, like Retail, Finance or Telecom, where these tools need to be used to deliver business value. To learn all this you may consider joining the One Year program on Business Analytics at the Praxis Business School, Kolkata, where we teach all this and more. Sorry for that little advertisement but I need to earn a livelihood as well :-)


I will be updating this post based on changes in my knowledge and perception of this field, so you might see some changes every now and then. Last updated: 28 Mar 2014, 04 Apr 2014, 31 May 2014, 19 Jun 2014, 19 Jul 2014, 28 Aug 2014, 12 Sep 2014

4 comments:

Nakul Konapur 12:04 am  

Fantastic blog Sir!!....All in one place :)

Ravi Murugesan 9:39 am  

An excellent overview of how one can learn data science. With all the opportunities you've mentioned, no-one can say they don't have the means to learn!

Sudeep Mallick 9:37 pm  

Great blog sir! Please also include weka in your list. It is open source, light weight and good to learn about data pre processing tasks like transformation using its simple GUI

Ankit Ballav 12:59 pm  

Now..I can see the future of Data Science...
