Image taken from http://www.rosebt.com
In the process of teaching data science to students at the Praxis Business School, I have realised that a huge amount of information is freely available on the web and that, thanks to Google, one can find it very easily. If one is motivated enough, one can learn data science on one's own. In an earlier post I gave a very broad Introduction to Data Science, but in this post, which I plan to update every now and then, I have tried to assemble a collection of tutorials, grouped by topic, that will help anyone get a good grip on the subject by actually doing something practical.
R: Statistical Programming Language
Forget about SAS, SPSS and other proprietary languages that want to extort money from you and simply invest your time and effort in mastering R, a free but excellent tool that is among the most widely used in data science.
Here is a "Level 0 tutorial for getting started with R" by April Galyardt that will get you started. This tutorial will tell you how to install R, but you should also consider installing RStudio, a free graphical development environment that makes R easier to use.
Galyardt's tutorial is very comprehensive, but if you are the impatient type for whom doing anything in depth is too tiring and yet you want to be familiar with this cutting-edge technology, then I suggest that you go through this 5-part Beginner's Guide to R from Computerworld.
Both these tutorials require you to install R on your machine, which is not too difficult, but if even that is too much for you, I suggest that you visit Data Camp, or Try R at the O'Reilly Code School. This will give you a flavour of what it is that people do with R.
Statistics
The ancient science of statistics has got a new lease of life and celebrity status because of a sudden surge of interest in data science. If you leave aside Map-Reduce and the associated world of Hadoop, then [ statistics + statistical tools ] is virtually synonymous with data science. Hence it is essential to get a good grip on statistics.
Since statistics is an ancient science that is taught in many undergraduate and postgraduate courses, there are many excellent books on it, but today's data scientist must be proficient not only in statistics but also in a tool like R.
An excellent place to start learning Statistics along with R is the open source, copyright free OpenIntro website. Not only does it allow you to download an excellent book in PDF format, but it also offers a set of lab exercises based on R where you can try out what you have learnt.
Other fairly comprehensive tutorials on statistics with R are available from Kelly Black of Clarkson University and from King of Coastal Carolina University. However, if you are seriously interested in a career in data science, you should purchase a regular textbook on Statistics for reference purposes, and for this you can look at either Statistics for Management by Levin or Applied Business Statistics by Black.
Traditional statistics is generally restricted to descriptive and interpretive (inferential) statistics, which are well covered in these links. Data science, however, is more interested in predictive statistics, and this is handled in areas like Data Mining and Machine Learning that will be addressed later. In the meantime, check out this book for a quick Introduction to Data Science; you can also download a PDF file of the book.
Python with Anaconda
Python is another very strong challenger for the position of the best, or most powerful, tool for data science and it makes sense to get the hang of it. This is because it not only supports much of the functionality that R provides but is also quite compatible with Hadoop and the world of Map-Reduce.
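To give a flavour of the statistical functionality that Python offers out of the box, here is a minimal sketch that uses only the standard library's statistics module. The daily_sales figures are made up purely for illustration and do not come from any of the tutorials mentioned in this post.

```python
import statistics

# Hypothetical sample: daily sales figures, invented for this illustration.
daily_sales = [12, 15, 11, 19, 15, 22, 15, 18]

# Basic descriptive statistics, comparable to what R's summary functions report.
summary = {
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "mode": statistics.mode(daily_sales),
    "stdev": statistics.stdev(daily_sales),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For serious work you would probably move on to the richer libraries that ship with distributions such as Anaconda, but the standard library is enough to check that your installation works.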
A quick way to get the hang of Python without installing it on your machine is to try out the Python tutorial at AfterHoursProgramming or somewhere similar. But a better way to get going would be to download Anaconda, which not only installs Python but also provides a rich GUI to execute Python programs in addition to the standard command line approach. Once you have your Python (with or without Anaconda) in place, you can try out this workshop from Open Tech School free of cost.
Visualisation
Visualisation is another key area of data science because a business wants data scientists to tell a good story with the data. My blog on visualisation is a nice starting point for all those who want to get started quickly with free tools like Google Charts. Tableau is an excellent and widely used visualisation tool with a public version available for free download, which you can try out with the free tutorials listed in my blog post. Another excellent tool is Google Fusion Tables, which you can explore here.
Data Mining / Machine Learning
As we said earlier, traditional statistics is generally limited to descriptive and interpretive (inferential) work, but the world wants predictions, and predictive statistics is where we get into data mining and machine learning. Please understand that data mining and machine learning are complex subjects and you need to get a good grounding in the algorithms that are used for Classification, Clustering, Association Rules, Text Analytics and other complex tasks.
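To make the idea of Classification a little more concrete, here is a minimal sketch of a k-nearest-neighbours classifier in plain Python. It is only one of many classification algorithms, the training data is invented for illustration, and the function name knn_predict is hypothetical rather than taken from any library or from the books mentioned here.

```python
import math
from collections import Counter

# Hypothetical training data: (height_cm, weight_kg) -> label. Values are made up.
training = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((160.0, 58.0), "small"),
    ((175.0, 80.0), "large"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]

def knn_predict(point, examples, k=3):
    """Predict a label by majority vote among the k nearest examples."""
    # Sort the examples by Euclidean distance from the point to be classified.
    by_distance = sorted(examples, key=lambda ex: math.dist(point, ex[0]))
    # Count the labels of the k closest examples and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((158.0, 54.0), training))
print(knn_predict((178.0, 82.0), training))
```

The sketch predicts a label for a point it has never seen, which is exactly the step that descriptive statistics does not take. Real projects would normally scale the features and validate the choice of k, both of which are left out here.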
It is difficult to take a short cut through this subject, but if you have a basic idea of what all this means then you can try out the examples and exercises given in this book on R-DataMining. Rattle is an excellent add-on to R that gives you a GUI for Data Mining. You can download Rattle and then try it out with this short but descriptive tutorial and this data.
On the other hand, if you are more of a programmer and less of a statistician, you may prefer to use Python for your task. A good way to get started is to download and read "A Programmer's Guide to Data Mining", which not only introduces the subject but also gives loads of ready-made Python code for you to try out.
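As a small taste of what data mining code in Python can look like, here is a minimal sketch of the two basic measures behind Association Rules: support and confidence. It is not taken from the book; the baskets are invented for illustration, the function names are hypothetical, and the sketch assumes that the antecedent occurs in at least one basket.

```python
# Hypothetical shopping baskets, made up for this illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of how often the consequent appears when the antecedent does."""
    # Assumes the antecedent occurs at least once; otherwise this divides by zero.
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

print(support({"bread", "milk"}, baskets))
print(confidence({"bread"}, {"milk"}, baskets))
```

A rule such as "bread implies milk" is usually considered interesting only when both its support and its confidence are above thresholds that the analyst chooses.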
Last but not least, relational databases and SQL are an absolute must for anyone who is interested in data science, but since they are so well known and so widely used in the technology community, I will refrain from singing their praises or giving any pointers to how they can be learnt.
If you go through all this then you are almost there on your way to becoming a data scientist. The only thing missing is Map-Reduce and Hadoop, which I will write about only when I personally know enough about that subject. In the meantime, subscribe to the Twitter feed of KDNuggets to keep yourself abreast of what is happening in the world of Data Scientists, whose job the wise men at Harvard have declared to be the Sexiest Job of the Twenty First Century.
This survey has addressed the tools and techniques that a data scientist uses in the daily job. What is missing here is an understanding of the business domain (like Retail, Finance, Telecom) where these tools need to be used to deliver business value. To learn all this you may consider joining the One Year program on Business Analytics at the Praxis Business School, Kolkata, where we teach all this and more. Sorry for that little advertisement but I need to earn a livelihood as well :-)
I will be updating this post based on changes in my knowledge and perception of this field, so you might see some changes every now and then.