In this post, we will deal with data from ECDC and we will explain basic data manipulation in Python with the Pandas package.
In our day, data is everywhere in enormous size and depth. Data science is an emerging field that penetrates every aspect of our life and, lately, it has proved to be an extraordinary weapon for predicting infections from Covid-19 and organizing strategies to limit the damage.
To import Pandas
and Matplotlib
packages we code:
import pandas as pd
import matplotlib.pyplot as plt
We download the excel file locally from ECDC site and open it using the read_excel
function of Pandas library. We have named the file as data.xls
in our case.
df=pd.read_excel("data.xlsx", engine="openpyxl")
We can first explore the data and the columns of the dataframe df:
We observe the columns of the dataframe — in our case, we will use the columns: dateRep
, cases
and deaths
. Additionally, the name of the country is stored in column countriesAndTerritories
.
We next select ‘Italy
’ as the country under study. A new column is created named DateTime
of type datetime where we store the day. In the following, we create a new dataframe with the name df_italia which is the same as the dataframe df_italia_sorted
but it is sorted according to the column DateTime
df_italia=df[df.countriesAndTerritories=='Italy']
df_italia['DateTime']=pd.to_datetime(df_italia['dateRep'],format="%d/%m/%Y")
#We sort according to DateTime
df_italia_sorted=df_italia.sort_values(by='DateTime')
df_italia_selected=df_italia_sorted[df_italia_sorted.month>4]
We are interested in data after the month of April (i.e., May, June, July, August, … etc) so we choose to filter using the column month
and create a new dataframe df_italia_selected.
Since the data in columns cases
and deaths
may have great variation, it is practical in order to understand the trend to use a moving average. We choose a moving average of seven days and we create two new columns (Moving Average Cases
and Moving Average Deaths
) where we store the average values of cases
and deaths
.
#Calculate moving average
df_italia_selected['Moving Average Cases']=df_italia_selected.cases.rolling(7,min_periods=1).mean()
df_italia_selected['Moving Average Deaths']=df_italia_selected.deaths.rolling(7,min_periods=1).mean()
We now plot the cases and deaths as functions of time. We choose the red color for cases and blue for deaths. It is useful to plot cases
and deaths
in the same figure with common x-axis in order to understand possible connection and relation. So, we use the subplots function and first create figure fig and axis ax1 (this will be the axis for the cases and it will be the left axis). We then create ax2 using twinx
function. The values for deaths will be our right axis. A dashed line is used for the average values.
#Figure
fig, ax1=plt.subplots()
color1='tab:red'
ax1.plot(df_italia_selected['DateTime'], df_italia_selected['cases'], color=color1)
ax1.plot(df_italia_selected['DateTime'], df_italia_selected['Moving Average Cases'], color=color1,linestyle='dashed')
ax1.set_xlabel('Data')
ax1.set_ylabel('Cases',color=color1)
ax1.tick_params(axis='y',labelcolor=color1)
locs, labels=plt.xticks()
plt.setp(labels,rotation=90)
ax2=ax1.twinx() #instantiate a second axes that shares the same x-axis
color2='tab:blue'
ax2.plot(df_italia_selected['DateTime'], df_italia_selected['deaths'], color=color2)
ax2.plot(df_italia_selected['DateTime'], df_italia_selected['Moving Average Deaths'], color=color2,linestyle='dashed')
ax2.set_ylabel('Deaths',color=color2)
ax2.tick_params(axis='y',labelcolor=color2)
fig.tight_layout() #otherwise the right y-label is slightly clipped
The figure below is the program output.
Cases and deaths as a function of data for Italy