Introduction to Python for data analysis

"This tutorial introduces students to the Python programming language for data analysis. By the end of this module, students will be familiar with basic Python syntax and understand why Python is well suited to data analysis."

Information

The estimated time to complete this training module is 4h.

The prerequisites to take this module are:

You should already have everything installed for this module!
We will be using Jupyter Notebook, a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text.

Contact Andréanne Proulx if you have questions on this module, or if you want to check that you successfully completed all the exercises.

Before starting:

Open the terminal
Type jupyter notebook
If you're not automatically directed to a webpage, copy the URL printed in the terminal and paste it into your browser
Once on the webpage, click “New” in the top-right corner and select “Python 3”
You now have a Jupyter notebook

The little box you see once in the notebook is called a “cell.” You can enter (multi-line) code by typing in the cell, and then run the code by pressing “Shift+Enter.” To create a new cell, press the “+” button on the left-hand side of the toolbar at the top of the screen.
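As a quick check that the notebook works, a cell along the following lines could be typed in and run with “Shift+Enter.” This is only an illustrative sketch: the variable names and values are arbitrary and are not part of the course material.

```python
# A multi-line cell: every line runs, in order, when you press Shift+Enter
greeting = "Hello, Jupyter!"   # arbitrary example text
numbers = [1, 2, 3]            # arbitrary example values
total = sum(numbers)           # add the numbers together

print(greeting)
print("The sum of", numbers, "is", total)
```

The printed output appears directly below the cell, and the variables stay available in any cell you run afterwards.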

You are ready for this tutorial, and you are strongly encouraged to type along with the presentation!

Resources

This module was presented by Ross Markello during the QLSC 612 course in 2020.

All the tutorial notes related to the video below are available here.

Exercises

For this part, we will use the famous scikit-learn dataset iris, which consists of 3 different types of irises (Setosa, Versicolour, and Virginica), with information about petal and sepal length and width stored in a 150x4 numpy.ndarray.

Before starting to write code, set up the environment so that it loads the required modules. In a Jupyter Notebook, import the following libraries:
```
# imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
%matplotlib inline
```
Load the iris dataset
```
data = load_iris()
```
Explore the dataset using .keys()
Print the shape and type of data
Store ‘data’ and ‘feature_names’ in distinct variables
Create a pandas dataframe with ‘data’ and use feature_names for column names
Get the summary statistics for this dataframe using .describe()
Subset the dataframe to keep only the first 50 rows
Try to answer this question: Are there any extreme sepal length values?
- Reminder: a value is extreme when it lies more than 3.9 standard deviations from the mean, where the distance is computed as (value - mean) / std. For this one, you might need to use a for loop.
What about other features of the flowers? Try automating the previous operation by writing a function named find_extreme_values()
Read about the boxplot function in matplotlib to get familiar with the Python documentation. What does it tell us? https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.boxplot.html
Use this function to draw a boxplot of the features' distributions. Try adding a title and axis labels.
Save the dataframe in CSV format and the plot as a PNG.
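
If the extreme-value step is unclear, the reminder's formula can be sketched in plain Python with a for loop. This is one possible approach under stated assumptions, not the expected solution: the `threshold` parameter, the returned (position, value) pairs, and the choice to flag values on both sides of the mean are all assumptions, and the sample numbers are made up rather than taken from the iris data.

```python
import statistics


def find_extreme_values(values, threshold=3.9):
    """Return (position, value) pairs lying more than `threshold`
    standard deviations from the mean (in either direction)."""
    values = list(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation

    extremes = []
    if std == 0:
        return extremes  # all values identical, so none can be extreme

    for position, value in enumerate(values):
        z_score = (value - mean) / std
        if abs(z_score) > threshold:
            extremes.append((position, value))
    return extremes


# Made-up measurements (not the iris data): 29 ordinary values and one outlier
sample = [5.0] * 29 + [50.0]
print(find_extreme_values(sample))
```

Because the function accepts any sequence of numbers, it could also be called once per dataframe column inside a loop to cover the other features.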

Note: The Internet is your best friend. Whenever you are stuck, online resources and blogs (such as Stack Overflow) can help you figure it out.

If you are done, you can play around with different functions. Try to answer interesting questions you might have using the data.

Follow up with Andréanne Proulx to validate that you completed the exercise correctly.
🎉 🎉 🎉 you completed this training module! 🎉 🎉 🎉

More resources

There are hundreds of excellent resources online for learning Python and/or data science. A few good ones:

Codecademy offers interactive programming courses for many languages and tools, including Python and git
A Whirlwind Tour of Python is an excellent intro to Python by Jake VanderPlas; Jupyter notebooks are available here
Another excellent and free online book is Allen Downey's “Think Python”
Object Oriented Programming in Python 3 (https://realpython.com/python3-object-oriented-programming/)
Jake VanderPlas's Python Data Science Handbook is also available online as a set of notebooks
Kaggle maintains a nice list of data science and Python tutorials
Neuromatch Academy also has great tutorials available for Python in a computational neuroscience context.
- Tutorial 1: https://compneuro.neuromatch.io/tutorials/W0D1_PythonWorkshop1/student/W0D1_Tutorial1.html
- Tutorial 2: https://compneuro.neuromatch.io/tutorials/W0D2_PythonWorkshop2/chapter_title.html

If you are curious and eager to learn more, you can also try out this tutorial, which inspired much of the content you saw today: Introduction to Python.