pandas4xr
Alan G. Isaac
January 29, 2014
1 Introduction to Exchange Rate Data with Python
As a preliminary, we import some modules we will be using. We adopt conventional short names.
In [2]:
#preliminaries
import datetime
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
Some of these libraries change rapidly, so it is good to know what version you are using.
In [68]:
print("numpy version {}".format(np.__version__))
print("matplotlib version {}".format(mpl.__version__))
print("pandas version {}".format(pd.__version__))
numpy version 1.8.0
matplotlib version 1.3.1
pandas version 0.12.0
1.1 Learning Outcomes
• understand the Series and DataFrame data types of Pandas
• access and analyze online data
• access and analyze CSV files
• use Pandas and Matplotlib to make basic time-series charts
• smooth data with a moving average
2 Pandas: Basic Data Analysis
Pandas background:
• website: http://pandas.pydata.org
• Wes McKinney's intro: http://vimeo.com/59324550
2.1 Data Types
Core information: http://pandas.pydata.org/pandas-docs/dev/dsintro.html
Two important data types defined by pandas: Series and DataFrame.
A DataFrame is like a spreadsheet; a Series is like a spreadsheet column. We often think of a Series as observations on a single variable. Note: all the data in a Series has a common type. Extensive documentation is available online at the address given above.
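The remark about a common type can be checked directly. The following is a minimal sketch, not part of the original notebook; the names ints, mixed, and frame are made up for illustration, and a reasonably recent pandas is assumed.

```python
import pandas as pd

# A Series stores all of its values with one common type (its dtype).
ints = pd.Series([10, 11, 12, 13], name="ints")
mixed = pd.Series([10, 11.5, 12, 13], name="mixed")  # one float upcasts every value

# A DataFrame may hold a different dtype in each column.
frame = pd.DataFrame({"ints": ints, "mixed": mixed})

print(ints.dtype)    # an integer dtype
print(mixed.dtype)   # a floating-point dtype
print(frame.dtypes)  # one dtype per column
```

Mixing an integer and a float in one Series does not raise an error; pandas silently stores every value as a float.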
Series
In [3]:
s1 = pd.Series([10,11,12,13], name='my series')
print(s1)
print("the first element is {}".format(s1[0]))
s1[0] = 20
print("Now the first element is {}".format(s1[0]))
print(s1.describe())
0    10
1    11
2    12
3    13
Name: my series, dtype: int64
the first element is 10
Now the first element is 20
count     4.000000
mean     14.000000
std       4.082483
min      11.000000
25%      11.750000
50%      12.500000
75%      14.750000
max      20.000000
dtype: float64
You can index and slice a series. See http://pandas.pydata.org/pandas-docs/dev/indexing.html
In [31]:
print(s1[0])
print("") #blank line
print(s1[0::2])
20

0    20
2    12
Name: my series, dtype: int64
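Indexing by position and indexing by label are easy to confuse when the index is the default 0, 1, 2, ... The sketch below separates the two by giving the Series string labels. It is an illustration only: the labels "a" to "d" are hypothetical, and it uses the .iloc and .loc accessors of current pandas.

```python
import pandas as pd

# Same values as s1 above, but with string labels instead of 0..3.
s = pd.Series([20, 11, 12, 13], index=["a", "b", "c", "d"], name="my series")

first_by_position = s.iloc[0]   # position-based: the first element
first_by_label = s.loc["a"]     # label-based: the element labeled "a"
every_second = s.iloc[0::2]     # positions 0 and 2
label_slice = s.loc["b":"c"]    # label slices include both endpoints

print(first_by_position, first_by_label)
print(every_second)
print(label_slice)
```

Note that the label slice "b":"c" returns two elements, whereas a positional slice excludes its stop position.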
A Series has an associated plot method, which depends on Matplotlib.
In [26]:
ax = s1.plot()
ax.legend()
Out [26]:
<matplotlib.legend.Legend at 0xa5a4908>
Note that the Series is plotted against its index.
In [29]:
s1.index
Out [29]:
Int64Index([0, 1, 2, 3], dtype=int64)
DataFrame
We can create a DataFrame many ways. One way is to combine multiple Series objects.
In [4]:
s2 = pd.Series([30,31,32,33], name="series two")
#first approach: concatenation of series
df1 = pd.concat([s1,s2], axis=1)
df1
Out [4]:
   my series  series two
0         20          30
1         11          31
2         12          32
3         13          33
In [9]:
#another approach, giving new names
df1 = pd.DataFrame(dict(s1=s1,s2=s2))
df1
Out [9]:
   s1  s2
0  20  30
1  11  31
2  12  32
3  13  33
You can easily extract a Series: each column name is available as an attribute of your DataFrame (provided the name is a valid Python identifier).
In [16]:
print(df1.s1)
print("") #blank line
print(type(df1.s1))
0    20
1    11
2    12
3    13
Name: s1, dtype: int64

<class 'pandas.core.series.Series'>
You can index and slice a DataFrame.
See: http://pandas.pydata.org/pandas-docs/dev/indexing.html
In [15]:
print(df1.iloc[0]) #get the first row (by position)
print("") #blank line
print(df1[::2]) #get every second row
s1    20
s2    30
Name: 0, dtype: int64

   s1  s2
0  20  30
2  12  32
Often we want our index to be named.
In [5]:
df1.index.name = "period"
df1
Out [5]:
        s1  s2
period
0       20  30
1       11  31
2       12  32
3       13  33
A DataFrame also has a plot method.
In [6]:
df1.plot();
We can create DataFrames with time-series data from files or from the web. We are particularly interested in data from FRED (Federal Reserve Economic Data, maintained by the Federal Reserve Bank of St. Louis).
2.2 Get Data from FRED
In [17]:
#pandas 0.12 API; later releases moved this module to the separate pandas-datareader package
import pandas.io.data as webdata
tstart = datetime.datetime(2010, 1, 1)
tend = datetime.datetime(2013, 12, 15)
#get the EURUSD exchange rate
exuseu = webdata.DataReader("EXUSEU", "fred", tstart, tend);
print(type(exuseu))
print(exuseu.head())
<class 'pandas.core.frame.DataFrame'>
            EXUSEU
DATE
2010-01-01  1.4266
2010-02-01  1.3680
2010-03-01  1.3570
2010-04-01  1.3417
2010-05-01  1.2563
In [11]:
print(exuseu.index) #this is a DatetimeIndex
print("") #blank line
print(exuseu.describe())
ax1 = exuseu.plot()
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-01-01 00:00:00, ..., 2013-12-01 00:00:00]
Length: 48, Freq: None, Timezone: None

          EXUSEU
count  48.000000
mean    1.333623
std     0.055664
min     1.222300
25%     1.296875
50%     1.327100
75%     1.366200
max     1.446000
Smooth Data with Moving Average
In [52]:
#pandas 0.12 API; later versions use exuseu['EXUSEU'].rolling(4).mean()
exuseu['mavg'] = pd.rolling_mean(exuseu, 4) #create a new column
print(exuseu.head()) #we lose 3 observations
exuseu[3:].plot()
            EXUSEU      mavg
DATE
2010-01-01  1.4266       NaN
2010-02-01  1.3680       NaN
2010-03-01  1.3570       NaN
2010-04-01  1.3417  1.373325
2010-05-01  1.2563  1.330750
Out [52]:
<matplotlib.axes.AxesSubplot at 0xdc3c5f8>
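To see where the three missing values come from, the moving average can be reproduced offline. This is a sketch under assumptions: it hard-codes the first five EXUSEU values printed above instead of downloading them, and it uses the Series.rolling method of later pandas releases rather than the pd.rolling_mean function of pandas 0.12.

```python
import pandas as pd

# First five monthly EXUSEU values, copied from the output shown above.
dates = pd.date_range("2010-01-01", periods=5, freq="MS")  # month-start dates
rate = pd.Series([1.4266, 1.3680, 1.3570, 1.3417, 1.2563],
                 index=dates, name="EXUSEU")

# 4-period moving average: the first 3 entries are NaN because
# their windows contain fewer than 4 observations.
mavg = rate.rolling(window=4).mean()

# The April 2010 value by hand: the mean of the first four observations.
by_hand = rate.iloc[:4].mean()

print(mavg)
print(by_hand)
```

A window of length 4 always costs 3 observations at the start of the sample, which is why the chart above is drawn from the fourth row onward.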
In [36]:
exusuk = webdata.DataReader("EXUSUK", "fred", tstart, tend);
exjpus = webdata.DataReader("EXJPUS", "fred", tstart, tend);
print(type(exusuk))
print(exusuk.head())
<class 'pandas.core.frame.DataFrame'>
            EXUSUK
DATE
2010-01-01  1.6158
2010-02-01  1.5618
2010-03-01  1.5058
2010-04-01  1.5332
2010-05-01  1.4669
In [18]:
#get multiple series (requires pandas 0.13+)
#erates = webdata.DataReader(["EXUSEU", "EXUSUK", "EXJPUS"], "fred", tstart, tend)
In [57]:
erates = pd.DataFrame(dict(eurusd=exuseu.EXUSEU,
                           gbpusd=exusuk.EXUSUK,
                           jpyusd100=100/exjpus.EXJPUS))
erates.plot();
2.3 Currency Returns
Define the raw foreign currency return:
$$dls_t = \frac{S_t - S_{t-1}}{S_{t-1}} = \frac{S_t}{S_{t-1}} - 1$$
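Both forms of the return definition can be computed by hand and compared with the pct_change method. The sketch below is illustrative: it hard-codes the first five EXUSEU values printed earlier, and the variable names are made up for the example.

```python
import pandas as pd

# First five monthly EUR-USD values, copied from the EXUSEU output above.
dates = pd.date_range("2010-01-01", periods=5, freq="MS")
S = pd.Series([1.4266, 1.3680, 1.3570, 1.3417, 1.2563],
              index=dates, name="eurusd")

lagged = S.shift(1)                 # S_{t-1}; the first entry is NaN
ret_diff = (S - lagged) / lagged    # (S_t - S_{t-1}) / S_{t-1}
ret_ratio = S / lagged - 1          # S_t / S_{t-1} - 1
ret_builtin = S.pct_change()        # built-in equivalent

print(pd.DataFrame({"diff": ret_diff, "ratio": ret_ratio, "builtin": ret_builtin}))
```

The first row is NaN in all three columns because there is no previous observation to compare with.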
In [39]:
dls = erates.pct_change()
print(dls.head())
dls = dls[1:] #slice off the NaN
print(dls.corr())
dls.plot();
              eurusd    gbpusd    jpyusd
DATE
2010-01-01       NaN       NaN       NaN
2010-02-01 -0.041077 -0.033420  0.010668
2010-03-01 -0.008041 -0.035856 -0.006356
2010-04-01 -0.011275  0.018196 -0.029283
2010-05-01 -0.063651 -0.043243  0.016088

          eurusd    gbpusd    jpyusd
eurusd  1.000000  0.724467  0.077517
gbpusd  0.724467  1.000000  0.186015
jpyusd  0.077517  0.186015  1.000000
In [40]:
pd.scatter_matrix(dls, diagonal='kde');
In [50]:
fig, ax = plt.subplots(1,1)
corr = dls.corr()
plt.imshow(corr, cmap='Blues', interpolation='none')
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns);
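The matrix returned by corr can be reconstructed from covariances and standard deviations, which makes clear what the heat map displays. The sketch below uses synthetic random data, not the exchange-rate returns: columns "a" and "b" share a common component and "c" is independent. All names and numbers in it are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic "returns": a and b share a common factor, c is independent.
rng = np.random.RandomState(0)
common = rng.normal(size=200)
returns = pd.DataFrame({
    "a": common + 0.5 * rng.normal(size=200),
    "b": common + 0.5 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = returns.corr()   # pairwise Pearson correlations
cov = returns.cov()     # pairwise sample covariances
sd = returns.std()      # sample standard deviations

# corr[i, j] = cov[i, j] / (sd[i] * sd[j])
corr_by_hand = cov / np.outer(sd, sd)

print(corr)
print(corr_by_hand)
```

The matrix is symmetric with ones on the diagonal, so the heat map carries the same information above and below the diagonal.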
Riskiness of Currency Returns
In [51]:
# Code from Thomas Wiecki's Financial Analysis in Python
plt.scatter(dls.mean(), dls.std())
plt.xlabel('Mean Currency Returns')
plt.ylabel('Std Currency Returns')
for label, x, y in zip(dls.columns, dls.mean(), dls.std()):
    plt.annotate(
        label,
        xy = (x, y), xytext = (20, -20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
2.4 Writing and Reading CSV
In [18]:
#get quarterly real GDP data from FRED
gdpc1 = webdata.DataReader("GDPC1", "fred", tstart, tend)
print(gdpc1.describe())
gdpc1['ly'] = np.log(gdpc1['GDPC1']) #create new column
print(gdpc1.describe())
              GDPC1
count     15.000000
mean   15220.846667
std      379.293620
min    14597.700000
25%    14918.200000
50%    15242.100000
75%    15536.800000
max    15839.300000

              GDPC1         ly
count     15.000000  15.000000
mean   15220.846667   9.630131
std      379.293620   0.024938
min    14597.700000   9.588619
25%    14918.200000   9.610336
50%    15242.100000   9.631817
75%    15536.800000   9.650967
max    15839.300000   9.670249
We can save our data set as a .csv file.
In [9]:
gdpc1.to_csv('data/temp.csv') #the data directory must already exist
Take a look at the file you created. (Use type on Windows or cat on other systems.)
In [13]:
!type data\temp.csv
DATE,GDPC1,ly
2010-01-01,14597.7,9.58861926104003
2010-04-01,14738.0,9.598184471326945
2010-07-01,14839.3,9.60503434579752
2010-10-01,14942.4,9.611958088355454
2011-01-01,14894.0,9.60871372627059
2011-04-01,15011.3,9.616558529804554
2011-07-01,15062.1,9.619936933863796
2011-10-01,15242.1,9.63181661502333
2012-01-01,15381.6,9.640927268858126
2012-04-01,15427.7,9.643919873974424
2012-07-01,15534.0,9.650786448979593
2012-10-01,15539.6,9.651146883564618
2013-01-01,15583.9,9.653993609023946
2013-04-01,15679.7,9.660122161068307
2013-07-01,15839.3,9.670249472472733
Naturally you can read CSV data from disk.
In [14]:
df2 = pd.read_csv('data/temp.csv', index_col='DATE', parse_dates=True)
df2.head()
Out [14]:
              GDPC1        ly
DATE
2010-01-01  14597.7  9.588619
2010-04-01  14738.0  9.598184
2010-07-01  14839.3  9.605034
2010-10-01  14942.4  9.611958
2011-01-01  14894.0  9.608714
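The write-then-read round trip can also be tried without touching the disk. The sketch below is an assumption-laden illustration: it writes to an in-memory io.StringIO buffer in place of data/temp.csv, hard-codes the first three GDPC1 values shown above, and requires Python 3.

```python
import io
import pandas as pd

# First three quarterly GDPC1 values, copied from the file listing above.
dates = pd.to_datetime(["2010-01-01", "2010-04-01", "2010-07-01"])
gdp = pd.DataFrame({"GDPC1": [14597.7, 14738.0, 14839.3]}, index=dates)
gdp.index.name = "DATE"  # becomes the header of the first CSV column

# Write CSV text to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
gdp.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back, using the DATE column as a parsed DatetimeIndex.
restored = pd.read_csv(io.StringIO(csv_text), index_col="DATE", parse_dates=True)
print(restored)
```

Without index_col and parse_dates, the dates would come back as an ordinary column of strings and the frame would get a default integer index.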
In [19]:
dly = gdpc1['ly'].diff()
print(dly) #note the NaN; lose an obs to diff
print("") #blank line
dly = dly[1:] #slice it off
print(dly)
#plot it
dly.plot();
DATE
2010-01-01         NaN
2010-04-01    0.009565
2010-07-01    0.006850
2010-10-01    0.006924
2011-01-01   -0.003244
2011-04-01    0.007845
2011-07-01    0.003378
2011-10-01    0.011880
2012-01-01    0.009111
2012-04-01    0.002993
2012-07-01    0.006867
2012-10-01    0.000360
2013-01-01    0.002847
2013-04-01    0.006129
2013-07-01    0.010127
Name: ly, dtype: float64

DATE
2010-04-01    0.009565
2010-07-01    0.006850
2010-10-01    0.006924
2011-01-01   -0.003244
2011-04-01    0.007845
2011-07-01    0.003378
2011-10-01    0.011880
2012-01-01    0.009111
2012-04-01    0.002993
2012-07-01    0.006867
2012-10-01    0.000360
2013-01-01    0.002847
2013-04-01    0.006129
2013-07-01    0.010127
Name: ly, dtype: float64
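The difference of logs is used here as a growth rate because, for small changes, it is close to the percentage change. A minimal sketch of that comparison follows; it hard-codes the first four GDPC1 values shown earlier, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# First four quarterly GDPC1 values, copied from the CSV listing above.
dates = pd.to_datetime(["2010-01-01", "2010-04-01", "2010-07-01", "2010-10-01"])
gdp = pd.Series([14597.7, 14738.0, 14839.3, 14942.4], index=dates, name="GDPC1")

log_growth = np.log(gdp).diff().dropna()   # ln(Y_t) - ln(Y_{t-1})
pct_growth = gdp.pct_change().dropna()     # Y_t / Y_{t-1} - 1

# Largest absolute gap between the two growth measures.
gap = (pct_growth - log_growth).abs().max()

print(pd.DataFrame({"log": log_growth, "pct": pct_growth}))
print(gap)
```

For quarterly growth rates of about one percent the two measures differ by less than one hundredth of a percentage point, and the log difference is always the smaller of the two.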
2.5 Use Matplotlib with Pandas Data Types
Here we want to plot two series with different frequencies in one chart, so we turn to direct use of Matplotlib.
In [23]:
fig, ax1 = plt.subplots(1,1)
ax1.plot(dly.index, dly, 'b-')
ax1.set_ylabel('GDP growth', color='b')
ticks = ax1.get_xticks()
ax1.set_xticks((ticks[0],ticks[-1]))
ax2 = ax1.twinx() #use same x-axis with separate y-scale
ax2.plot(exuseu.index, exuseu, 'r-');
ax2.set_ylabel('EUR-USD', color='r');
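The twin-axis technique can be tried without downloading anything. The sketch below is a stand-alone illustration under assumptions: it hard-codes a few of the quarterly growth rates and monthly exchange rates printed earlier, and it selects the non-interactive Agg backend so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Quarterly GDP growth (log differences) and monthly EUR-USD, from outputs above.
quarterly = pd.Series(
    [0.009565, 0.006850, 0.006924],
    index=pd.to_datetime(["2010-04-01", "2010-07-01", "2010-10-01"]))
monthly = pd.Series(
    [1.4266, 1.3680, 1.3570, 1.3417, 1.2563],
    index=pd.date_range("2010-01-01", periods=5, freq="MS"))

fig, ax1 = plt.subplots(1, 1)
ax1.plot(quarterly.index, quarterly.values, "b-")
ax1.set_ylabel("GDP growth", color="b")

ax2 = ax1.twinx()  # second Axes sharing the x-axis, with its own y-scale
ax2.plot(monthly.index, monthly.values, "r-")
ax2.set_ylabel("EUR-USD", color="r")
```

Because each series is plotted against its own dates, the two frequencies need not be aligned or resampled first.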
3 END