Introduction to Python for data analysis

"This tutorial introduces students to the Python programming language for data analysis. By the end of this module, students will be familiar with basic Python syntax and understand why Python is well suited to data analysis."

Information

The estimated time to complete this training module is 4h.

The prerequisites to take this module are:

  • You should already have everything installed for this module!
  • We will be using Jupyter Notebook, a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Contact Andréanne Proulx if you have questions on this module, or if you want to check that you successfully completed all the exercises.

Before starting:

  1. Open the terminal
  2. Type jupyter notebook
  3. If you're not automatically directed to a webpage, copy the URL printed in the terminal and paste it in your browser
  4. Once on the webpage, click “New” in the top-right corner and select “Python 3”
  5. You now have a Jupyter notebook open

The little box you see once in the notebook is called a “cell.” You can enter (multi-line) code by typing in the cell, and then run the code by pressing “Shift+Enter.” To create a new cell, press the “+” button on the left-hand side of the toolbar at the top of the screen!
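As a first cell to try, here is a small multi-line snippet. The variable names and numbers are made up for illustration and are not part of the original tutorial notes. Type it into a cell and press “Shift+Enter” to run all the lines together.

```python
# A first multi-line cell: every line below runs together on Shift+Enter.
petal_lengths = [1.4, 4.7, 6.0]  # made-up values, for illustration only

# Add up the values one by one with a for loop.
total = 0
for length in petal_lengths:
    total += length

# Divide by the number of values to get the mean.
mean_length = total / len(petal_lengths)
print("Mean petal length:", round(mean_length, 2))
```

Variables defined in one cell stay available in the cells you run afterwards, so you can build up an analysis step by step.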

You are ready for this tutorial, and you are strongly encouraged to type along with the presentation!

Resources

This module was presented by Ross Markello during the QLSC 612 course in 2020.

All the tutorial notes related to the video below are available here.

Exercises

For this part, we will use the famous scikit-learn dataset iris, which consists of 3 different types of irises (Setosa, Versicolour, and Virginica) with information about petal and sepal length and width stored in a 150x4 numpy.ndarray.

  1. Before starting to write some code, you want to set up the environment so that it loads the required modules. In a Jupyter Notebook, import the following libraries:

    # imports 
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    %matplotlib inline
    
  2. Load the iris dataset

    data = load_iris()
    
  3. Explore the dataset using .keys()

  4. Print the shape and type of data

  5. Store ‘data’ and ‘feature_names’ in distinct variables

  6. Create a pandas dataframe with ‘data’ and use feature_names for column names

  7. Get the summary statistics for this dataframe using .describe()

  8. Subset the dataframe to keep only the first 50 rows

  9. Try to answer this question: Are there any extreme sepal length values?

    • Reminder: extreme values lie more than 3.9 standard deviations from the mean, where the z-score is (value - mean) / std. For this one, you might need to use a for loop.

  10. What about other features of the flowers? Try automating the previous operation by writing a function named find_extreme_values()

  11. Read about the boxplot function in matplotlib to get familiar with Python documentation. What does it tell us? https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.boxplot.html

  12. Use this function to draw a boxplot of each feature's distribution. Try adding a title and axis labels.

  13. Save the dataframe in CSV format and the plot as a PNG file.
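If you are unsure how to approach the z-score check in steps 9 and 10, here is a minimal sketch on a small made-up table rather than on the iris data. Only the function name find_extreme_values comes from the exercise; its arguments, the toy dataframe, and the choice to compare the absolute z-score against the threshold are assumptions made for illustration. It also uses pandas column arithmetic instead of an explicit for loop over values, which is one of several valid approaches.

```python
import pandas as pd

# Toy data, made up for illustration (this is NOT the iris dataset).
toy = pd.DataFrame({
    "height": [10.0] * 19 + [100.0],  # one value far from the rest
    "width": [1.0, 2.0] * 10,         # no unusual values
})


def find_extreme_values(df, column, threshold=3.9):
    """Return the rows of `df` whose z-score in `column` exceeds `threshold`.

    The signature is a hypothetical choice; the exercise only fixes the name.
    """
    values = df[column]
    # z-score: distance from the mean, in units of standard deviation
    z_scores = (values - values.mean()) / values.std()
    # Assumption: values far below the mean count as extreme too, hence abs()
    return df[z_scores.abs() > threshold]


# Run the same check on every column instead of repeating the code.
for column in toy.columns:
    extremes = find_extreme_values(toy, column)
    print(column, "->", len(extremes), "extreme value(s)")
```

Writing the check as a function means that trying another threshold or another column only requires changing an argument.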

Note: The Internet is your best friend. Whenever you are stuck, online resources and blogs (such as Stack Overflow) can help you figure it out.
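For the plotting and saving steps (11 to 13), the general pattern can be sketched on made-up numbers as follows. The column names, title, axis labels, and file names are placeholders chosen for illustration, so adapt them to the iris dataframe you built.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy measurements, made up for illustration (this is NOT the iris dataset).
toy = pd.DataFrame({
    "feature_a": [4.9, 5.1, 5.0, 5.4, 4.6],
    "feature_b": [3.0, 3.5, 3.2, 3.9, 3.1],
})

fig, ax = plt.subplots()
# Given a 2D array, boxplot draws one box per column.
ax.boxplot(toy.to_numpy())
# Boxes sit at x = 1, 2, ...; label each one with its column name.
ax.set_xticks(range(1, len(toy.columns) + 1))
ax.set_xticklabels(toy.columns)
ax.set_title("Toy feature distributions")
ax.set_xlabel("Feature")
ax.set_ylabel("Value")

# Placeholder file names; both files are written to the current directory.
toy.to_csv("toy_data.csv", index=False)  # index=False leaves out the row numbers
fig.savefig("toy_boxplot.png")
```

In a notebook with %matplotlib inline, the figure is also displayed under the cell once it has run.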

If you are done, you can play around with different functions. Try to answer interesting questions you might have using the data.

  • Follow up with Andréanne Proulx to validate that you completed the exercise correctly.
  • 🎉 🎉 🎉 you completed this training module! 🎉 🎉 🎉

More resources

There are hundreds of excellent resources online for learning Python and/or data science. A few good ones:

If you are curious and eager to learn more, you can also try out this tutorial, which inspired much of the content you saw today: Introduction to Python