Working With NetBeez Data Using Pandas

This tutorial will explore using the Python library called Pandas to analyze NetBeez data and help network engineers troubleshoot network performance issues. In this example we will use ping test data. However, the NetBeez API has additional data that can be used from agents, alerts, other tests, such as HTTP and iperf, and WiFi metrics.

To work through this tutorial a basic understanding of Python is needed. At minimum, readers should be able to run Python (version 3) code. In the next sections we will cover:

  • Introduction to Pandas
  • How to get started
  • Preparing the data
  • Graphing and visualizing the data

Introduction to Pandas

Pandas is a powerful and easy-to-use open-source data analysis and manipulation tool. Dataframes are a core component of Pandas and help to transform and visualize data. Think of dataframes as a class that stores data in a two-dimensional array.

Getting Started

To follow this tutorial, you can download sample ping data from two targets produced from a NetBeez agent. In alternative, you can use your own NetBeez data by using the NetBeez API (I am working on a separate blog post that explains how to query the NetBeez API to export data into a CSV file – subscribe to our blog if you want to be alerted when that’s available).

First, we will need to import our data from the CSVs into dataframes with Pandas, then assign them to variables. The following lines of code imports the CSV files located in the Downloads/Panda_Data/ folder within my macOS folder (~/Downloads/Panda_Data/)

google_target_data = pd.read_csv('~/Downloads/Pandas_Data/www.google.com_data.csv')
youtube_target_data = pd.read_csv('~/Downloads/Pandas_Data/youtube.com_data.csv')
agent_id target_id test_template_id target_name test_name heir_type interface_type network_interface_id data_id error_code ts value
66 5 21 youtube.com YouTube PingTest wireless 109 1659102041813-66-21 1659102041813 65.5
66 5 21 youtube.com YouTube PingTest wireless 109 1659102046828-66-21 195 1659102046828
66 5 21 youtube.com YouTube PingTest wireless 109 1659102051841-66-21 1659102051841 65.8
66 5 21 youtube.com YouTube PingTest wireless 109 1659102056856-66-21 1659102056856 65.5
66 5 21 youtube.com YouTube PingTest wireless 109 1659102061870-66-21 1659102061870 66

Data Preparation

Describe is a great method to gain insight into numerical columns which returns statistics like count (total values counted), mean, standard deviation, minimum, maximum, and quartiles. To execute methods on a particular column, enter a column name string inside of square brackets after the dataframe name like youtube_target_data[‘value’].

describe = youtube_target_data['value'].describe()
value
count 716
mean 60.43882682
standard_deviation 9.774659733
min 42.6
25% 47.725
50% 65.3
75% 66
max 118

The below methods are used to get these values individually:

count = youtube_target_data['value'].count()
mean = youtube_target_data['value'].mean()
median = youtube_target_data['value'].median()
min_value = youtube_target_data['value'].min()
max_value = youtube_target_data['value'].max()
standard_deviation = youtube_target_data['value'].std()

Dropping Data

Pandas provide a method to easily drop unnecessary columns from our dataframe. If the error column is not needed, it can be dropped with the below code:

youtube_target_data.drop(columns=['error_code'], inplace=True)
agent_id target_id test_template_id target_name test_name heir_type interface_type network_interface_id data_id ts value
66 5 21 youtube.com YouTube PingTest wireless 109 1659102041813-66-21 1659102041813 65.5
66 5 21 youtube.com YouTube PingTest wireless 109 1659102046828-66-21 1659102046828
66 5 21 youtube.com YouTube PingTest wireless 109 1659102051841-66-21 1659102051841 65.8
66 5 21 youtube.com YouTube PingTest wireless 109 1659102056856-66-21 1659102056856 65.5
66 5 21 youtube.com YouTube PingTest wireless 109 1659102061870-66-21 1659102061870 66

Subsetting Data

If not all of the columns are necessary to analyze the data, a subset can be created with all rows and only a select few columns. In the case of the sample data, a new subset dataframe is created with only the columns: target_name, ts, and value.

youtube_target_data = youtube_target_data[['target_name', 'ts', 'value']]
target_name ts value
youtube.com 1659102041813 65.5
youtube.com 1659102046828
youtube.com 1659102051841 65.8
youtube.com 1659102056856 65.5
youtube.com 1659102061870 66

Sampling Data

There are methods for getting information on the first five or last five rows in a dataframe. The .head() displays the first five rows and the .tail() method displays the last five rows. If a specific number of rows are desired, an integer can be added inside of the parentheses. In the case of .head(10), the first ten rows are displayed.

youtube_target_data.head(3)
target_name ts value
youtube.com 1659102041813 65.5
youtube.com 1659102046828
youtube.com 1659102051841 65.8

Pandas provides a way to sample rows at random with the .sample() method. You will need to indicate how many rows you would like sampled inside of the parenthesis like below:

random_samples = youtube_target_data.sample(n=10)
target_name ts value
youtube.com 1659102955206 70.4
youtube.com 1659102809799 65.6
youtube.com 1659104163801 65.9
youtube.com 1659104931449 65
youtube.com 1659103782564 43.5
youtube.com 1659104765897 45.8
youtube.com 1659102061870 66
youtube.com 1659103386351 42.9
youtube.com 1659102940183 66.3
youtube.com 1659102217733 65.9

There are methods for filtering the largest and smallest rows by a particular column. To return three columns with the greatest value, .nlargest(3, ‘value’). Conversely, .nsmallest(3, ‘value’) will return the three rows with the smallest value in the ‘value’ column. The first input parameter is the number of rows to be returned and the second input parameter will be the column to filter by greatest and least.

youtube_target_data.nlargest(3, 'value')
target_name ts value
youtube.com 1659105177567 118
youtube.com 1659103230926 96.6
youtube.com 1659103245953 74.7

Missing Data

If missing data exists inside of the dataset, the below method can be used to check which columns have missing data and returns sums of n/a values in each column.

na_totals = youtube_target_data.isna().sum()
column_name count
target_name 0
ts 0
value 1

One way to work with missing data is to drop the rows that have n/a values. For instance, if there are 717 rows in the dataframe and one row is missing data, using .dropna() will remove the one row of missing data leaving 716 rows.

youtube_target_data = youtube_target_data.dropna()

Another way to work with missing data is to fill in the missing values with a default value, mean, medium, or any value. The .fillna() method will need to be called on the column with the value inside of the parentheses. In the sample data, row two was missing a value in the ‘value’ column. After running the below command, the mean will replace any missing data.

youtube_target_data['value'] = youtube_target_data['value'].fillna(mean)
target_name ts value
youtube.com 1659102041813 65.5
youtube.com 1659102046828 60.4388
youtube.com 1659102051841 65.8

Visualizing Data

There are multiple ways to plot data with Python. To plot the sample data, the library Matplotlib may be used. This will help to give a visual insight into our data. First, the libraries will need to be imported:

import matplotlib.pyplot as plt
import datetime as datetime
import matplotlib.dates as mdates

Next, the column of timestamps will be converted to datetime values for both of our sample datasets:

youtube_target_data['ts'] = youtube_target_data['ts'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000))
google_target_data['ts'] = google_target_data['ts'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000))

The x-axis tick labels could be formatted with the below code to only display hours and minutes:

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))

The sample data will need to be passed to our graph with the below code:

plt.plot(youtube_target_data['ts'],youtube_target_data['value'], color='g', label='Youtube', linewidth=0.75)
plt.plot(google_target_data['ts'],google_target_data['value'], color='b', label='Google', linewidth=0.75)

A label can be added to the y-axis with the below code. The values in the sample data are in milliseconds.

plt.ylabel('ms')

To extend the graph lines to the edge of the graph, the graph’s min and max will need to be adjusted:

plt.xlim([min(google_target_data['ts']), max(google_target_data['ts'])])

Finally, to show the graph with the two plotted lines and the graph’s legend, the below code will be used:

plt.legend()
plt.show()

Conclusion

Pandas is a powerful Python library that can help to analyze and transform data. This tutorial only scratched the surface of what Pandas is capable of. For more information on Pandas, the documentation is a great resource. More information on using NetBeez’s API can be found here.

Source