Skip to main content

Probability and Cumulative Dice Sums

A Very Rough Guide to Getting Started in Data Science: Part I, MOOCs

Introduction


Data science is a very hot, perhaps the hottest, field right now. Sports analytics has been my primary area of interest, and it's a field that has seen amazing growth in the last decade. It's no surprise that the most common question I'm asked is about becoming a data scientist. This will be a first set of rough notes attempting to answer this question from my own personal perspective. Keep in mind that this is only my opinion and there are many different ways to do data science and become a data scientist.

Data science is using data to answer a question. This could be doing something as simple as making a plot by hand, or using Excel to take the average of a set of numbers. The important parts of this process are knowing which questions to ask, deciding what information you'd need to answer it, picking a method that takes this data and produces results relevant to your question and, most importantly, how to properly interpret these results so you can be confident that they actually answer your question.

Knowing the questions requires some domain expertise, either yours or someone else's. Unless you're a data science researcher, data science is a tool you apply to another domain.

If you have the data you feel should answer your question, you're in luck. Frequently you'll have to go out and collect the data yourself, e.g. scraping from the web. Even if you already have the data, it's common to have to process the data to remove bad data, correct errors and put it into a form better suited for analysis. A popular tool for this phase of data analysis is a scripting language; typically something like Python, Perl or Ruby. These are high-level programming languages that very good at web work as well as manipulating data.

If you're dealing with a large amount of data, you'll find that it's convenient to store it in a structured way that makes it easier to access, manipulate and update in the future. This will typically be a relational database of some type, such as PostgreSQL, MySQL or SQL Server. These all use the programming language SQL.

Methodology and interpretation are the most difficult, broadest and most important parts of data science. You'll see methodology referenced as statistical learning, machine learning, artificial intelligence and data mining; these can be covered in statistics, computer science, engineering or other classes. Interpretation is traditionally the domain of statistics, but this is always taught together with methodology.

You can start learning much of this material freely and easily with MOOCs. Here's an initial list.

MOOCs

Data Science Basics

Johns Hopkins: The Data Scientist’s Toolbox. Overview of version control, markdown, git, GitHub, R, and RStudio. Started January 5, 2015. Coursera.

Johns Hopkins: R Programming. R-based. Started January 5, 2015. Coursera.

Scripting Languages

Intro to Computer Science. Python-based. Take anytime. Udacity; videos and exercises are free.


Programming Foundations with Python. Python-based. Take anytime. Udacity; videos and exercises are free.



MIT: Introduction to Computer Science and Programming Using Python. Python-based. Class started January 9, 2015. edX.



Databases and SQL

Stanford: Introduction to Databases. XML, JSON, SQL; uses SQLite for SQL. Self-paced. Coursera.


Machine Learning

Stanford: Machine Learning. Octave-based. Class started January 19, 2015. Coursera.


Stanford: Statistical Learning. R-based. Class started January 19, 2015. Stanford OpenEdX.



Comments

  1. Hello,
    The Article on A Very Rough Guide to Getting Started in Data Science is nice .it give detail information about data science .Thanks for Sharing the information about it.
    big data scientist

    ReplyDelete

Post a Comment

Popular posts from this blog

Mining the First 3.5 Million California Unclaimed Property Records

As I mentioned in my previous article  the state of California has over $6 billion in assets listed in its unclaimed property database .  The search interface that California provides is really too simplistic for this type of search, as misspelled names and addresses are both common and no doubt responsible for some of these assets going unclaimed. There is an alternative, however - scrape the entire database and mine it at your leisure using any tools you want. Here's a basic little scraper written in Ruby . It's a slow process, but I've managed to pull about 10% of the full database in the past 24 hours ( 3.5 million out of about 36 million). What does the distribution of these unclaimed assets look like?  Among those with non-zero cash reported amounts: Total value - $511 million Median value - $15 Mean value - $157 90th percentile - $182 95th percentile - $398 98th percentile - $1,000 99th percentile - $1,937 99.9th percentile - $14,203 99.99th percen

Mixed Models in R - Bigger, Faster, Stronger

When you start doing more advanced sports analytics you'll eventually starting working with what are known as hierarchical, nested or mixed effects models . These are models that contain both fixed and random effects . There are multiple ways of defining fixed vs random random effects , but one way I find particularly useful is that random effects are being "predicted" rather than "estimated", and this in turn involves some "shrinkage" towards the mean. Here's some R code for NCAA ice hockey power rankings using a nested Poisson model (which can be found in my hockey GitHub repository ): model <- gs ~ year+field+d_div+o_div+game_length+(1|offense)+(1|defense)+(1|game_id) fit <- glmer(model, data=g, verbose=TRUE, family=poisson(link=log) ) The fixed effects are year , field (home/away/neutral), d_div (NCAA division of the defense), o_div (NCAA division of the offense) and game_length (number of overtime

A Bayes' Solution to Monty Hall

For any problem involving conditional probabilities one of your greatest allies is Bayes' Theorem . Bayes' Theorem says that for two events A and B, the probability of A given B is related to the probability of B given A in a specific way. Standard notation: probability of A given B is written \( \Pr(A \mid B) \) probability of B is written \( \Pr(B) \) Bayes' Theorem: Using the notation above, Bayes' Theorem can be written:  \[ \Pr(A \mid B) = \frac{\Pr(B \mid A)\times \Pr(A)}{\Pr(B)} \] Let's apply Bayes' Theorem to the Monty Hall problem . If you recall, we're told that behind three doors there are two goats and one car, all randomly placed. We initially choose a door, and then Monty, who knows what's behind the doors, always shows us a goat behind one of the remaining doors. He can always do this as there are two goats; if we chose the car initially, Monty picks one of the two doors with a goat behind it at random. Assume we pick Door 1 an