4. Getting started with Pandas – Introduction to Data Wrangling, Cleaning, Analysis, and Visualization with Python and Pandas

Import Pandas

In the first blank cell, type the following command to import the Pandas library into our Jupyter Notebook:

import pandas as pd

To run the command, click the “Run” button in the top toolbar or press Shift + Return.

Libraries are sets of instructions that Python can use to perform specialized functions. This import statement not only imports the Pandas library but also gives it the alias “pd.” Using this alias saves us from having to type out the full name “pandas” each time we need to use it.

By default, Pandas will display at most 60 rows and 20 columns. However, we can change those settings if we want to see more rows and columns. For this workshop, let’s set the display settings to show up to 100 rows:

pd.options.display.max_rows = 100

If you don’t see an error when you run the cell (that is, if there is no output), you can move on to the next step. In programming, doing things right often means that nothing visible happens. This is what we like to call a silent success.
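If you would like to confirm a silent success, you can read the setting back. The lines below are a minimal sketch that assumes only that Pandas is installed; pd.get_option() is a standard Pandas function for looking up a display option by name:

```python
import pandas as pd

# Raise the row limit, as in the step above.
pd.options.display.max_rows = 100

# Read the setting back to confirm that the change took effect.
current_max_rows = pd.get_option("display.max_rows")
print(current_max_rows)  # prints 100

# The column limit is a separate option that we have not changed.
print(pd.get_option("display.max_columns"))
```

Checking a value like this is optional. It is simply a quick way to reassure yourself when a command produces no output.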

Read in a CSV file as a DataFrame

Next, we will read in our dataset, which is saved as a CSV file. We will work specifically with the refugee-arrivals-by-destination.csv dataset. Make sure the dataset is saved in the same location as your Jupyter Notebook, in this case the pandas_workshop folder on your Desktop.

To read in a CSV file, we will use the method pd.read_csv() and pass it the path to our file:

refugee_df = pd.read_csv('refugee-arrivals-by-destination.csv', delimiter=",", encoding='utf-8')

With this command, we have created a Pandas DataFrame object, which is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
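To make the “dict of Series objects” idea concrete, here is a small sketch that builds a DataFrame by hand. The column names and values are made up for illustration and are not taken from the workshop dataset:

```python
import pandas as pd

# Made-up values for illustration only; not from the workshop dataset.
example_df = pd.DataFrame({
    "city": pd.Series(["Springfield", "Riverton", "Lakeside"]),
    "year": pd.Series([2005, 2006, 2007]),
    "share": pd.Series([0.25, 0.5, 0.75]),
})

print(example_df.shape)   # (rows, columns): prints (3, 3)
print(example_df.dtypes)  # one data type per column

# Selecting a single column gives back a Series.
year_column = example_df["year"]
print(type(year_column).__name__)  # prints Series
```

Each column holds one kind of data (text, whole numbers, decimals), while the DataFrame as a whole can mix them, much like the columns of a spreadsheet.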

It is common practice to abbreviate DataFrame as “df”, as in refugee_df. When reading in the CSV file, we also specified the delimiter and the encoding. The delimiter is the character that separates, or “delimits”, the columns in our dataset. For CSV files, the delimiter is usually a comma. UTF stands for “Unicode Transformation Format”, and the “8” means the encoding works in 8-bit units: each character is stored as one to four bytes. UTF-8 is one of the most widely used text encodings. In Python 3, strings are sequences of Unicode code points, and an encoding such as UTF-8 determines how those code points are stored as bytes in a file. Specifying the encoding tells Pandas how to decode the bytes in the file into text correctly.
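The sketch below shows what the delimiter and encoding arguments do, using a tiny made-up table held in memory instead of a file on disk. The data, the semicolon delimiter, and the variable names are assumptions chosen for illustration:

```python
import io
import pandas as pd

# Made-up, semicolon-separated text containing a non-ASCII character.
raw_text = "city;arrivals\nSão Paulo;12\nDenver;7\n"

# Encode the text to bytes, which is how it would be stored in a file.
raw_bytes = raw_text.encode("utf-8")

# Tell Pandas which character separates the columns
# and how to decode the bytes back into text.
sample_df = pd.read_csv(io.BytesIO(raw_bytes), delimiter=";", encoding="utf-8")

print(sample_df.columns.tolist())  # prints ['city', 'arrivals']
print(sample_df.loc[0, "city"])    # prints São Paulo
```

If the delimiter did not match the file, Pandas would read each line as a single column. If the encoding did not match, characters such as “ã” could be garbled or cause an error.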

Python Methods and Attributes

Objects in Python (and other programming languages) are basically containers that can hold data and/or functions inside them. When a function is inside an object, we usually call the function a “method.” When data is inside an object, we usually call it an “attribute.” For example, in the command we ran above, we used the “.read_csv()” method to open the “refugee-arrivals-by-destination.csv” file and passed it the “delimiter=","” and “encoding='utf-8'” arguments. An example of an attribute is “pd.options”, which we used earlier to change the display settings.

The terminology isn’t that important, though. What we do need to know is that you can access these “methods” and “attributes” with a . (a dot or period). When we added sort(), append(), pop(), and lower() to our library app, we briefly saw how some methods are contained inside certain objects in Python: sort, append, and pop belong to List objects, and lower belongs to String objects.
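As a quick illustration of dot access, the sketch below calls a few methods and reads one attribute. The book titles and the small DataFrame are made up for this example:

```python
import pandas as pd

# Methods are called with parentheses.
titles = ["Beloved", "Dune", "Emma"]
titles.append("Ulysses")   # list method: adds an item to the end
lowered = "Dune".lower()   # string method: returns a lowercase copy

# Made-up DataFrame for illustration only.
example_df = pd.DataFrame({"title": titles, "copies": [3, 1, 2, 4]})

n_rows, n_cols = example_df.shape  # attribute: no parentheses
first_rows = example_df.head(2)    # method: parentheses, optional argument

print(lowered)          # prints dune
print(n_rows, n_cols)   # prints 4 2
print(len(first_rows))  # prints 2
```

A handy rule of thumb: if the name after the dot is followed by parentheses, it is a method doing some work; if not, it is an attribute holding a value.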

For more info on methods and attributes, review the “Objects in Python” lesson in the Intro to Python workshop.

Terms used in lesson

Jupyter Notebook: The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

Pandas: Pandas is a software library written for the Python programming language for data manipulation and analysis.

Library: A Python library is a reusable collection of code (sets of instructions) that you can import and use in your own scripts.

DataFrame: A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns. It is similar to a spreadsheet.
