9. Sort Columns, Groupby Columns, & Count values – Introduction to Data Wrangling, Cleaning, Analysis, and Visualization with Python and Pandas

Stacking requests

In this lesson, we will be using commands that stack various requests such as methods, parameters, operators, and more to define the command. Pandas encourages this kind of stacking, but it can seem overwhelming at first to beginners. For example, as we will see below, a command could include two or more methods that stack on top of each other, and end with a slice operator to view only the top N rows of the results. In addition, a command can include specific parameters to call out a particular column or sort the data in descending order.

We will move slowly through each of the following commands to break them down.

Sort columns

To sort a DataFrame, we can use the .sort_values() method with the parameter by= and including the name of the column we want to sort by written in quotation marks.

For example, we can sort the DataFrame by the percentages of total refugee arrivals:

refugee_df.sort_values(by='percent_total', ascending=False)[:15]

Note: In the command above, we used the “by=” parameter to specify that the data be sorted according to the “percent_total” column and we added the “ascending=False” parameter in order to request that the data be displayed with the highest percentage first. By default, Pandas will sort in “ascending” order, from the smallest value to the largest value. We also added a Python list slice (i.e., [:15]) to view just the top 15 rows.

Groupby Columns

We can group data and perform calculations on the groups using the .groupby() method. For example, to see the breakdown of the number of arrivals by country of origin, we can use the following command:

refugee_df.groupby('origin')

This command created a Groupby object—grouped data—that we can use to perform calculations such as counting the number of non-blank values in each column for each arrival by country of origin.

Next, we will use the following command to count the number of refugee arrivals by country of origin, with the output showing the top twenty rows sorted by descending order:

refugee_df.groupby('origin')['arrivals'].count().sort_values(ascending=False)[:20]

These results show us the total number of arrivals by country of origin across the 2005-2015 period, in descending order, sliced for the top 20 results.

Let’s unpack the command to better understand these results:

We have three stacked methods here: .groupby(), .count(), and .sort_values().
groupby(‘origin’)[‘arrivals’]: For the Groupby object we defined in the previous step, groupby(‘origin’), we are isolating the “arrivals” column. Basically, we are asking to view the number of refugee arrivals by country of origin.
.count(): This method counts non-blank cells for each column or row. The results we see in the output show the total number of refugee arrivals by country of origin.
.sort_values(ascending=False): This method specifies how we want our output to be sorted. We include the ascending=False parameter in order to request that the data be displayed with the highest percentage first.
[:20]: This Python slide specifies that we just want to see the top 20 rows.

Count values

We can count the number of unique values in a column by using the .value_counts() method.

refugee_df['state'].value_counts()

These results show us how many refugees were resettled in each state across the 2005-2015 period. We can see the full list of states noted in the DataFrame, and these include the District of Columbia, Puerto Rico and Guam. We can also note that Wyoming is missing from the list, which can be confirmed by filtering the DataFrame to select only certain values.

refugee_df[refugee_df['state'] == 'Wyoming']

Terms used in lesson:

.sort_values(): Use the .sort_values() method to sort the data within a column in your DataFrame

.groupby(): Use the .groupby() method to group data and perform calculations on the groups in your DataFrame

.count(): Use the .count() method to count non-blank cells for each column or row.

.value_counts(): Use the .value_counts() method to count the number of unique values in a column.

Lesson 10