Kaggle Workshops
Workshops, https://www.kaggle.com/, 2019
These are a collection of all the workshops I’ve run at Kaggle, from July 2016 to December 2019.
Practical Model Evaluation (with AutoML)
How do you know which machine learning model is going to work best for a specific problem? Learning how to evaluate machine learning models is an important part of the data science workflow. You’ll need it for everything from picking your final submissions for a Kaggle competition to choosing which model your team should put into production.
We know how important model evaluation is, so we’ve put together a three-day workshop to walk you through the model evaluation process from start to finish. We’ll go beyond just optimization metrics, though, and talk about factors for model selection relevant to working data scientists.
Day 1 Notebook: Figuring out what matters for you
Day 2 Notebook: Training models with automated machine learning
Day 3 Notebook: Evaluating our models
Utility Script Competition
As part of the Utility Script Competition we ran on Kaggle, I wrote this notebook with guidelines for writing more professional data science code.
SQL Summer Camp
As part of the SQL Summer Camp I wrote a workshop as an introduction to BigQuery ML. It is based on the official documentation tutorial. In this tutorial, you use the sample Google Analytics sample dataset for BigQuery to create a model that predicts whether a website visitor will make a transaction. For information on the schema of the Analytics dataset, see BigQuery export schema.
Intro to API’s
This was a three day event held during Kaggle CareerCon 2019. Each day we learned about a new part of developing an API and put it into practice. By day 3, you’ll have written and deployed an API of your very own!
Day 1: The Basics of Rest APIs – What They Are and How to Design One. By the end of this day you’ll have written the OpenAPI specification for your API.
Day 2: How to Make an API for an Existing Python Machine Learning Project. By the end of this day, you’ll have a Flask app that you can use to serve your model.
Day 3: How to deploy your API on your choice of services – Heroku or Google Cloud. By the end of this day, you’ll have deployed your model and will be able to actually use your API! (Note that, in order to use Google Cloud, you’ll need to create a billing account, which requires a credit card. If you don’t have access to a credit, you can still use Heroku.)
Getting Started with Automated Data Pipelines
The Getting Started with Automated Data Pipelines series is a set of three notebooks and livestreams (recordings are available) designed to help you get started with creating data pipeline that allow you to automate the process of moving and transforming data.
Day 1: Versioning & Creating Datasets from GitHub Repos
Day 2: Validation & Creating Datasets from URL’s
Day 3: ETL & Creating Datasets from Kernel Output
Dashboarding with Notebooks
Want to learn how to combine the speed of spinning up a notebook with the ease of an automatically updating dashboard? Then this is the event for you! Each day from December 17th to December 21st 2018 you’ll get a practical, hands-on exercise that won’t take more than 20 minutes but will help you refine your dashboarding skills.
Day 1: Determining what information should be monitored with a dashboard. Notebook, Livestream Recording
Day 2: How to create effective dashboards in notebooks, Python Notebook, R Notebook, Livestream
Day 3: Running notebooks with the Kaggle API, Notebook, Livestream
Day 4: Scheduling notebook runs using cloud services, Notebook, Livestream
Day 5: Testing and validation, Python Notebook, R Notebook, Livestream
JupyterCon 2018 Workshops
At JupyterCon 2018 I gave two workshops. You can find all my materials in these two notebooks:
Reproducible research best practices (highlighting Kaggle Kernels)
I Do, We Do, You Do: Supporting active learning with notebooks
5-Day Challenges
I ran several educational 5-Day Challenges on different topics in 2017 & 2018. Each challenge consists of five short exercises designed to give you hands-on practice with a different data science technique. This notebook collects links to the exercises for each challenge so you can work through them at your own pace.
5-Day Data Challenge
Topic: Getting started with data science
Level: Beginner
Language: Python and R
Daily tasks:
Day 1: Reading data into a kernel
Day 2: Plot a Numeric Variable with a Histogram
Day 3: Perform a t-test
Day 4: Visualize categorical data with a bar chart
Day 5: Using a Chi-Square Test
New to data science? Need a quick refresher? This five day challenge will give you the guidance and support you need to kick-start your data science journey.
By the time you finish this challenge, you will:
Read in and summarize data
Visualize both numeric and categorical data
Know when and how to use two foundational statistical tests (t-test and chi-squared)
All the material for this challenge is in one notebook.
5-Day Data Challenge: Regression
Topic: Regression
Level: Intermediate (should already be familiar with R)
Language: R
Daily tasks:
Day 2: Learn how to fit & evaluate a model with diagnostic plots
Day 4: Learn how to fit & interpret a multiple regression model
Day 5: Learn how to use Elastic Net to select input variables
By the time you finish this challenge, you’ll understand how and when to implement three foundational regression techniques. Each day we will cover one aspect of regression analysis in depth.
How to pick the right regression technique for your data
How to use diagnostic plots to check your model
How to interpret and communicate your model
Visualizing your model
Comparing models & selecting variables
We’ll work with real datasets to help develop an intuitive understanding of how each type of model works and how to interpret the results.
SQL Scavenger Hunt (not a 5-Day Challenge, but follows a similar format)
Topic: How to query data in SQL
Level: Beginner
Language: Python and SQL
Daily tasks:
In our SQL Scavenger Hunt, you’ll learn how to use SQL to get data from BigQuery databases. Each day you’ll learn about a core SQL technique and practice using it to get the data you need to answer real-world questions like:
How many GitHub users made more than ten commits on January 1, 2015?
Which five cities had the highest air pollution last week?
You’ll also learn best practices for working with BIG datasets.
SQL (short for “Structured Query Language”) is the primary way to get data out of relational databases. It’s also the third most popular software tool for data science, right after Python and R, and a key skill for aspiring data scientists to develop.
This challenge is also available as a Learn track
Python: 5-Day Data Challenge: Data Cleaning
Topic: Data cleaning
Level: Beginner to intermediate (should already be familiar with Python)
Language: Python
Daily tasks:
Data cleaning is a key part of data science, but it can be deeply frustrating. Why are some of your text fields garbled? What should you do about those missing values? Why aren’t your dates formatted correctly? How can you quickly clean up inconsistent data entry? In this five day challenge, you’ll learn why you’ve run into these problems and, more importantly, how to fix them!
In this challenge we’ll learn how to tackle some of the most common data cleaning problems so you can get to actually analyzing your data faster. We’ll work through five hands-on exercises with real, messy data and answer some of your most commonly-asked data cleaning questions.
R: 5-Day Data Challenge: Data Cleaning
Topic: Data cleaning
Level: Beginner to intermediate (should already be familiar with R)
Language: R
Daily tasks:
Day 1: Reading in common data file formats: .json, .txt and .xlsx
Day 5: Cleaning numbers (percentages, money, dates and times)
Data cleaning is a necessary part of data science, but it can be deeply frustrating. What are you supposed to do with this .json file? How can you handle all these missing values in your data? Is there a fast way to get rid of duplicate entries? In this challenge, we’ll learn how to solve some common data cleaning problems.
This challenge is in R and covers different topics from the earlier Python version of the Data Cleaning 5-Day Challenge so even if you did the last challenge, you’ll discover some new tips and tricks! Here’s what we’ll be covering: