Wednesday, January 21, 2015

A Very Rough Guide to Getting Started in Data Science: Part I, MOOCs

Introduction


Data science is a very hot, perhaps the hottest, field right now. Sports analytics has been my primary area of interest, and it's a field that has seen amazing growth in the last decade. It's no surprise that the most common question I'm asked is about becoming a data scientist. This will be a first set of rough notes attempting to answer this question from my own personal perspective. Keep in mind that this is only my opinion and there are many different ways to do data science and become a data scientist.

Data science is using data to answer a question. This could be something as simple as making a plot by hand, or using Excel to take the average of a set of numbers. The important parts of this process are knowing which question to ask, deciding what information you'd need to answer it, picking a method that takes this data and produces results relevant to your question and, most importantly, knowing how to properly interpret these results so you can be confident that they actually answer your question.

Knowing the questions requires some domain expertise, either yours or someone else's. Unless you're a data science researcher, data science is a tool you apply to another domain.

If you have the data you feel should answer your question, you're in luck. Frequently you'll have to go out and collect the data yourself, e.g. by scraping it from the web. Even if you already have the data, it's common to have to process it to remove bad records, correct errors and put it into a form better suited for analysis. A popular tool for this phase of data analysis is a scripting language, typically something like Python, Perl or Ruby. These are high-level programming languages that are very good at web work as well as manipulating data.
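As a rough illustration of this cleanup phase, here's a minimal Python sketch using only the standard library. The data, the column names and the `clean_rows` helper are all made up for the example; real data will need its own rules for what counts as "bad" and what can be repaired.

```python
import csv
import io

# Hypothetical raw data: stray whitespace, inconsistent team case,
# a missing value, and a mistyped number (letter O instead of zero).
RAW = """player,team,points
 Alice ,BOS,21
Bob,bos,
Carol,NYK,1O
Dave,NYK,17
"""


def clean_rows(text):
    """Return cleaned records, dropping rows whose points can't be parsed."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["player"].strip()        # remove stray whitespace
        team = row["team"].strip().upper()  # normalize team codes
        try:
            points = int(row["points"])     # must be a whole number
        except (TypeError, ValueError):
            continue                        # skip rows we can't repair
        cleaned.append({"player": name, "team": team, "points": points})
    return cleaned


rows = clean_rows(RAW)
for record in rows:
    print(record)
```

Here the rows for Bob and Carol are dropped rather than guessed at. Whether to drop, repair or flag a bad row is a judgment call that depends on your question.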

If you're dealing with a large amount of data, you'll find that it's convenient to store it in a structured way that makes it easier to access, manipulate and update in the future. This will typically be a relational database of some type, such as PostgreSQL, MySQL or SQL Server. These all use some dialect of the query language SQL.
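To give a feel for what storing and querying data with SQL looks like, here's a small sketch using SQLite through Python's built-in `sqlite3` module. SQLite is my choice only because it needs no server setup; the `games` table and its contents are invented for the example, and the same basic SQL should carry over to PostgreSQL, MySQL or SQL Server with minor dialect differences.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per team per game.
conn.execute("CREATE TABLE games (team TEXT, opponent TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?)",
    [
        ("BOS", "NYK", 101),
        ("BOS", "LAL", 95),
        ("NYK", "BOS", 99),
    ],
)
conn.commit()

# Average points scored per team, computed by the database itself.
averages = conn.execute(
    "SELECT team, AVG(points) FROM games GROUP BY team ORDER BY team"
).fetchall()
conn.close()

for team, avg_points in averages:
    print(team, avg_points)
```

The point is that the question ("what's each team's scoring average?") is expressed as a query, and the database does the bookkeeping.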

Methodology and interpretation are the most difficult, broadest and most important parts of data science. You'll see methodology referred to as statistical learning, machine learning, artificial intelligence and data mining; these can be covered in statistics, computer science, engineering or other classes. Interpretation is traditionally the domain of statistics, but it is always taught together with methodology.

You can start learning much of this material freely and easily with MOOCs. Here's an initial list.

MOOCs

Data Science Basics

Johns Hopkins: The Data Scientist’s Toolbox. Overview of version control, Markdown, Git, GitHub, R, and RStudio. Started January 5, 2015. Coursera.

Johns Hopkins: R Programming. R-based. Started January 5, 2015. Coursera.

Scripting Languages

Intro to Computer Science. Python-based. Take anytime. Udacity; videos and exercises are free.


Programming Foundations with Python. Python-based. Take anytime. Udacity; videos and exercises are free.



MIT: Introduction to Computer Science and Programming Using Python. Python-based. Class started January 9, 2015. edX.



Databases and SQL

Stanford: Introduction to Databases. XML, JSON, SQL; uses SQLite for SQL. Self-paced. Coursera.


Machine Learning

Stanford: Machine Learning. Octave-based. Class started January 19, 2015. Coursera.


Stanford: Statistical Learning. R-based. Class started January 19, 2015. Stanford OpenEdX.


