Graphics in R and for Presentations: Why, what, and how
R Bohn, Feb. 16, 2016
Why figures/plots?
• Exploratory data analysis
  – What’s there: coverage in your data set
    • Also look at averages
    • Example: Expensive cars; old cars
  – Gross patterns = interactions of various kinds
  – Outliers, strange phenomena
• To explain results
• To persuade
• All built from same toolset
Exploratory Data Analysis: Pictures into insights
What does it tell us?
Box plot of kilometres, with a price scatter plot.
The average KM is around 60-65 thousand KM, while the highest price vehicles are generally lower than 50 thousand KM.
Boxplot of KM by month
This one did not turn out the way the text had shown, since the boxplot is supposed to be separated for each month. Nonetheless, we see that month of manufacture is not a predictor of price, which makes sense.
8. Correlation
9. Scatter plot of Price and KM
Histogram price w log xform
customers would surely never buy a diamond were it not for the social norm of presenting one when proposing. And, there are *fewer* consumers who can afford a diamond larger than one carat. Hence, we shouldn’t expect the market for bigger diamonds to be as competitive as that for smaller ones, so it makes sense that the variance as well as the price would increase with carat size.
Often the distribution of any monetary variable will be highly skewed and vary over orders of magnitude. This can result from path-dependence (e.g., the rich get richer) and/or the multiplicative processes (e.g., year on year inflation) that produce the ultimate price/dollar amount. Hence, it’s a good idea to look into compressing any such variable by putting it on a log scale (for more take a look at this guest post on Tal Galili’s blog).
Indeed, we can see that the prices for diamonds are heavily skewed, but when put on a log10 scale seem much better behaved (i.e., closer to the bell curve of a normal distribution). In fact, we can see that the data show some evidence of bimodality on the log10 scale, consistent with our two-class, “rich-buyer, poor-buyer” speculation about the nature of customers for diamonds.
Let’s re-plot our data, but now let’s put price on a log10 scale:
library(ggplot2)
library(scales)  # provides log10_trans()
# Histogram of price on the raw scale
p = qplot(price, data=diamonds, binwidth=100) +
    theme_bw() +
    ggtitle("Price")
p
# Same histogram with the x axis on a log10 scale
p = qplot(price, data=diamonds, binwidth = 0.01) +
    scale_x_log10() +
    theme_bw() +
    ggtitle("Price (log10)")
p
# Scatter plot of price against carat, price on a log10 scale
p = qplot(carat, price, data=diamonds) +
    scale_y_continuous(trans=log10_trans()) +
    theme_bw() +
    ggtitle("Price (log10) by Cubed-Root of Carat")
p
Advice: Start from an example
Online version: http://www.cookbook-r.com/Graphs/
UCSD library: http://ucsd.worldcat.org/oclc/825073944
PDF link
R has 4 graphics systems!
• Base R
• Lattice = Trellis graphics
• ggplot2 system
• grid
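To make the difference between the systems concrete, here is a minimal sketch of the same scatter plot drawn in three of them. The data set (the built-in mtcars) and the variables are assumptions chosen for illustration, not taken from the slides. grid is the low-level drawing layer that lattice and ggplot2 are built on, so it is rarely called directly for a plot like this.

```r
# Same scatter plot (horsepower vs. fuel economy) in three graphics systems.
# mtcars ships with R; the lattice and ggplot2 packages must be installed.
library(lattice)
library(ggplot2)

# 1. Base R: draws immediately on the current graphics device
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower", ylab = "Miles per gallon", main = "Base R")

# 2. Lattice (Trellis): formula interface; returns an object that is drawn when printed
p_lattice <- xyplot(mpg ~ hp, data = mtcars, main = "Lattice")
print(p_lattice)

# 3. ggplot2: the plot is built up in layers; also drawn when printed
p_gg <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  ggtitle("ggplot2")
print(p_gg)
```

Base R draws as a side effect, while lattice and ggplot2 return plot objects that can be stored, modified, and printed later.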
Start with scatter plots!
• 70% of your plots = some form of scatter
• For categorical data: box plots, or jitter
• Practice reading plots. Extract insights from visual information.
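library(ggplot2)
# cyl is numeric in mtcars; it is wrapped in factor() because a continuous
# variable cannot be mapped to point shape
ggplot(data=mtcars, aes(x=hp, y=mpg, shape=factor(cyl), color=factor(cyl))) +
  geom_point(size=3) +
  facet_grid(am~vs) +
  labs(title="Automobile Data by Engine Type", x="Horsepower", y="Miles Per Gallon")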
Box plot: continuous vs discrete
Jittered categorical data
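library(ggplot2)
data(Salaries, package="carData")  # the data set moved from car to carData in car 3.0
ggplot(Salaries, aes(x=rank, y=salary)) +
  geom_boxplot(fill="cornflowerblue", color="black", notch=TRUE) +
  geom_point(position="jitter", color="blue", alpha=.5) +
  geom_rug(sides="l", color="black")  # the argument is sides, not side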
Multi var scatter plot
Figure source: http://solomonmessing.files.wordpress.com/2014/01/ggpairs3.png
Common problem: over-plotting
Solving over-plotting
• Smaller plot points
• Change to isoquants or heat map
• Transparency (alpha channel)
• Jitter
• Show marginal distributions along axes
• Alter scales to spread data out more.
• Several of the above
Solutions to overplotting
1. Shrink the physical symbol size = 2
2. Change shape of symbols
3. Make partially transparent alpha=.2
4. Random subset of the data points (50K → 5K)
5. Jitter the points (random offsets)
6. Change to isoquants or heat maps
• R graphics cookbook http://www.cookbook-r.com/Graphs/
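Several of the remedies listed above can be sketched in ggplot2 as follows. The diamonds data set (which ships with ggplot2) and the specific sizes, alpha values, and bin counts are assumptions picked for illustration; they are starting points to tune, not recommended settings.

```r
library(ggplot2)

set.seed(1)       # reproducible random subset
d <- diamonds     # roughly 54,000 rows, heavily over-plotted as a scatter

# Items 1 and 3: smaller symbols plus partial transparency
p_alpha <- ggplot(d, aes(carat, price)) +
  geom_point(size = 0.5, alpha = 0.2)

# Item 4: plot a random subset of the points (here 5,000 rows)
d_small <- d[sample(nrow(d), 5000), ]
p_subset <- ggplot(d_small, aes(carat, price)) +
  geom_point()

# Item 5: jitter, useful when values pile up on a few discrete levels
p_jitter <- ggplot(d_small, aes(cut, price)) +
  geom_jitter(width = 0.3, height = 0, alpha = 0.3)

# Item 6: heat map of counts per 2-D bin instead of individual points
p_heat <- ggplot(d, aes(carat, price)) +
  geom_bin2d(bins = 60)

print(p_alpha)    # print any of the four objects to draw it
```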
Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Springer, Use R)
http://www.theusrus.de/blog/alpha-transparency-explained/
Example: http://www.statmethods.net/graphs/scatterplot.html
# High Density Scatterplot with Color Transparency
pdf("c:/scatterplot.pdf")
x <- rnorm(1000)
y <- rnorm(1000)
plot(x,y, main="PDF Scatterplot Example", col=rgb(0,100,0,50,maxColorValue=255), pch=16)
dev.off()
Multiple plots on a page
library(ggplot2)
library(gridExtra)
data(Salaries, package="carData")  # older versions of car shipped this data set itself
p1 <- ggplot(data=Salaries, aes(x=rank)) + geom_bar()
p2 <- ggplot(data=Salaries, aes(x=sex)) + geom_bar()
p3 <- ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary)) + geom_point()
# Arrange the three plots side by side on one page
grid.arrange(p1, p2, p3, ncol=3)
Graphics to explain/persuade
Katrina 1
Gulf of Mexico Oil & Natural Gas Facts
Item | Gulf of Mexico | Total U.S. | % from Gulf of Mexico
Oil (million barrels per day)
Federal Offshore Crude Oil Production (4/05) | 1.562 | 5.488 | 28.50%
Total Gulf Coast Region Refinery Capacity (as of 1/1/05) | 8.068 | 17.006 | 47.40%
Total Gulf Coast Region Crude Oil Imports | 6.49 | 10.753 | 60.40%
- of which into ports in LA, MS and AL | 2.524 | 10.753 | 23.50%
- of which into LOOP | 0.906 | 10.753 | 8.50%
Natural Gas (billion cubic feet per day)
Federal Offshore Marketed Production (3/05) | 10.4 | 54.1 | 19.20%
Energy Information Administration. Data as of June 2005 unless otherwise noted.
Nice graphic, but data table cluttered. What is message?
Let’s try a graph in place of table
Katrina 2: Excel’s default
[Figure: default Excel bar chart. Categories: Offshore oil production; Imports; - of which into LOOP. Series: Gulf Mexico, US Total. Y axis: 0 to 18.]
• Shows sizes well
• Gulf of Mexico (blue) looks small relative to red bars
• Fonts too small to read
Version 3: stacked bar; throw out data
[Figure: stacked bar chart titled "America depends on Gulf Coast for Oil". Categories: Offshore oil production; Imports; Refinery Capacity. Series: other US, Gulf Mexico. Y axis: Mill Barrels/day, 0 to 18.]
1) Clarifies relative roles of Gulf Coast, non-Gulf; imports vs offshore
2) Columns re-ordered (why?)
3) Y axis labeled
4) Gulf looks bigger
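A stacked bar of this kind can be sketched in ggplot2 from the three oil rows of the original table. The numbers come from the table; the object names, the column order, and the font size are assumptions for illustration, and this is not the Excel chart shown on the slide.

```r
library(ggplot2)

# Gulf and U.S. totals from the table, in million barrels per day
oil <- data.frame(
  item     = c("Offshore oil production", "Imports", "Refinery capacity"),
  gulf     = c(1.562, 6.490, 8.068),
  us_total = c(5.488, 10.753, 17.006)
)
oil$other_us <- oil$us_total - oil$gulf                    # non-Gulf remainder
oil$pct_gulf <- round(100 * oil$gulf / oil$us_total)       # Gulf share, whole percent

# Long format: one row per bar segment
long <- data.frame(
  item   = factor(rep(oil$item, 2), levels = oil$item),
  region = factor(rep(c("Other US", "Gulf of Mexico"), each = nrow(oil)),
                  levels = c("Other US", "Gulf of Mexico")),
  value  = c(oil$other_us, oil$gulf)
)

p_oil <- ggplot(long, aes(x = item, y = value, fill = region)) +
  geom_col() +                                             # segments stack by default
  labs(title = "America depends on Gulf Coast for Oil",
       x = NULL, y = "Million barrels/day", fill = NULL) +
  theme_bw(base_size = 18)                                 # large fonts for projection
print(p_oil)
```

Dropping the sub-rows (LA/MS/AL ports, LOOP, natural gas) is the "throw out data" step: only the rows that carry the message are plotted.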
Gulf Coast Vital to America for Oil!!
[Figure: stacked bar chart titled "Gulf Coast Vital for Oil!". Categories: Offshore oil production; Imports; Refinery Capacity. Series: other US, Gulf Mexico. Y axis: Mill Barrels/day, 0 to 20. Bars labeled 60% and 47%.]
What to do with the diagram?
This graphic by the French engineer Charles J. Minard is considered one of the greatest ever created to display statistical information. In June 1812, an army of 422,000 crossed the Russian-Polish border, represented by the thick stream on the upper left. By the time the army reached Moscow in October, Napoleon was down to 100,000 troops. The retreat in below-zero temperatures was equally disastrous, and only 4,000 of the original soldiers made it back to the Russian-Polish border. From E. J. Marey, La méthode graphique (Paris, 1885), 73; reprinted in Edward R. Tufte, The Visual Display of Quantitative Information.
Napoleon invades Russia, 1812-13
Growth Rate of Chinese Textile Export & Total Textile Export (Jan. – July 2005)
[Figure: horizontal bar chart. Categories: SOEs; Joint Ventures; Private Sector. Series: Growth Rate (%), Total Export (%). X axis: 0 to 80.]
Source: China Customs, Guangdong Branch: Report on Chinese Textile and Clothing Exports from January to July, 2005
Distorting graphs
Rate of change shows much different picture
Deliberate distortions for propaganda
See my blog, Art2science.org for critique
Common distortions
• Omitting the zero line
• Using area versus linear distance
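Both distortions can be reproduced with a few lines of R. The sketch below uses made-up values (two groups at 96 and 100, and a symbol whose radius doubles); all names and numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```r
library(ggplot2)

# Hypothetical data: two values that differ by about 4%
d <- data.frame(group = c("A", "B"), value = c(96, 100))

# Distorted: the visible axis starts at 95 instead of zero
p_truncated <- ggplot(d, aes(group, value)) +
  geom_col() +
  coord_cartesian(ylim = c(95, 100), expand = FALSE) +
  ggtitle("Zero line omitted")

# Honest: bars start at zero, so heights are proportional to the values
p_zero <- ggplot(d, aes(group, value)) +
  geom_col() +
  ggtitle("Zero line shown")

# Apparent ratio of visible bar heights (B relative to A) in each version
ratio_truncated <- (100 - 95) / (96 - 95)
ratio_zero      <- 100 / 96

# Area versus linear distance: doubling a circle's radius to show a
# doubled value makes the drawn area four times as large
radius     <- c(1, 2)
area_ratio <- (pi * radius[2]^2) / (pi * radius[1]^2)

print(p_truncated)
print(p_zero)
```

With the truncated axis the visible part of bar B is five times as tall as bar A, although the underlying values differ by only about 4%.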
How many words? 3 styles
• Paragraph: Full sentences and paragraphs
  – 100 words, 1 minute to read
• Bullets: Terse outline
  – 50 words, 2 minutes
• Minimalist: Headlines
  – 15 words, 0 to 4 minutes
China’s Accession to the WTO
Upon China’s accession to the WTO, textile trade became subject to the relevant provisions of the Accession Agreements: Paragraph 242 of the Report of the Working Group on China’s Accession to the WTO and Section 16 of the Protocol on China’s Accession to the WTO.
The Accession Agreements postulate utterly discriminatory safeguards provisions that allow the United States and other countries to substantially restrict China’s textile exports.
These Agreements significantly differ from the ‘standard’ WTO disciplines on safeguards which stands in direct contradiction to the spirit of the WTO and puts China in an extremely disadvantageous position relative to other textile exporters.
Relationship with Local Governments
• Problems:
  1) Unsure of local disaster management capacity
  2) Concentration on counter-terrorism grants
  3) Miscommunication between FEMA/Local government
• Action Plan:
  - Increase Grants to Local Governments
  - “Disaster Management Council”
  - Disaster Preparedness Assessment
  - Close and direct information line
Example of bullet style
“Wellhead” Prices
Bolivian Phases of LNG Project: Extraction and transportation to port
Example of minimalist style?
How many words? Tradeoffs
• Choices:
  – Paragraph: Full sentences and paragraphs
  – Bullets: Terse outline
  – Minimalist: Headlines
• Brute force/elegant
• Is the audience paying close attention?
• Use it later for documentation?
Other kinds of information
• Diagrams, pictures: visual interest, detail, depth
• Using tables to simplify complex information…(Or not!)
Garment factories
Export channels
Global Value Chain
[Diagram stages, as labeled:]
• Fibers (natural and synthetic)
• Textile Companies: spinning yarn; weaving, knitting, finishing fabric
• Apparel Manufacturers: designing, cutting, sewing, buttonholing, ironing
• Brand-named apparel companies; overseas buying offices; trading companies
• Retail Outlets (all retail outlets): department stores, specialty stores, mass merchandise chains, discount chains, factory outlets
source: Appelbaum and Gereffi 1994
What conclusion should the audience draw? Should it be stated explicitly?
Raw Material
Geography of the project
• Tarija gas fields in the southwest of the country
• Additional costs putting the pipeline through Peru
• Only one-third of our investment will be in Bolivia
Nobody used a laser pointer!
Tables useful, but require care
Project Evaluation V
Scenario ($ millions except price) | Bolivia NPV | Partner NPV
Low Taxes (30% tax rate) | 2,423 | 826
High Prices ($7 initial) | 3,985 | 657
Low Prices ($5 initial) | 1,883 | -25
Bad Construction (up $1b) | 2,472 | 181
High Interest Rates (15%) | 2,570 | 204
Poor Margins (80% gross) | 2,563 | 195
Agreement on Safeguards | Accession Agreements
Applicable on a non-discriminatory basis to product categories regardless of country of origin | Applicable to product categories originating exclusively from China
Applicable for a certain period of time: 4 years in general, 8 years in exceptional cases | No time limits
If safeguard measures exceed one year, they should be gradually liberalized | No such requirement
Allows a country affected by safeguards to suspend concessions under Article I of GATT 1994 (suspension right) vis-à-vis a country applying the safeguards | Suspension right is limited
Substantially limits re-imposition of safeguards | Limits on re-imposition of safeguards are weak
FUNDING FOR HIGHWAYS, ALL UNITS OF GOVERNMENT, 2003 (in millions)
ITEM | HIGHWAY TRUST FUND ACCOUNT | OTHER FUNDS & ACCOUNTS | TOTAL FEDERAL | STATE AGENCIES & DC | LOCAL GOV'T | TOTAL
Highway User Revenues:
Motor-Fuel and Vehicle Taxes 27,737 - 27,737 43,722 2,171 73,630
Tolls - - - 5,013 1,217 6,230
Subtotal 27,737 - 27,737 48,735 3,388 79,860
Other Taxes and Fees:
Property Taxes and Assessments - - - - 6,902 6,902
General Fund Appropriations 3/ - 1,686 1,686 3,400 16,344 21,430
Other Taxes and Fees - 434 434 3,382 4,008 7,824
Subtotal - 2,120 2,120 6,782 27,254 36,156
Investment Income and Other Receipts 18 - 18 2,749 5,176 7,943
Total Current Income 27,755 2,120 29,875 58,266 35,818 123,959
Bond Issue Proceeds 7/ - - - 9,526 4,899 14,425
Grand Total Receipts 27,755 2,120 29,875 67,792 40,717 138,384
Intergovernmental Payments:
Federal Government:
Highway Trust Fund (29,211) - (29,211) 28,512 699 -
All Other Funds - (1,422) (1,422) 800 622 -
State Agencies:
Highway-User Imposts - - - (11,720) 11,720 -
All Other Funds - - - (2,677) 2,677 -
Local Governments - - - 1,743 (1,743) -
Subtotal (29,211) (1,422) (30,633) 16,658 13,975 -
Funds Drawn from or Placed in Reserves 3,146 - 3,146 3,927 (1,645) 5,428
Total Funds Available 1,690 698 2,388 88,377 53,047 143,812
Data Source: US DOT, Federal Highway Administration, Table HF-10, December 2004
Original table of data
Gulf of Mexico Oil & Natural Gas Facts
Item | Gulf of Mexico | Total U.S. | % from Gulf of Mexico
Oil (million barrels per day)
Federal Offshore Crude Oil Production (4/05) | 1.562 | 5.488 | 28.50%
Total Gulf Coast Region Refinery Capacity (as of 1/1/05) | 8.068 | 17.006 | 47.40%
Total Gulf Coast Region Crude Oil Imports | 6.49 | 10.753 | 60.40%
- of which into ports in LA, MS and AL | 2.524 | 10.753 | 23.50%
- of which into LOOP | 0.906 | 10.753 | 8.50%
Natural Gas (billion cubic feet per day)
Federal Offshore Marketed Production (3/05) | 10.4 | 54.1 | 19.20%
Energy Information Administration. Data as of June 2005 unless otherwise noted.
Cleaning up the table
Gulf of Mexico Oil & Natural Gas Facts
Oil (million barrels per day) | Gulf of Mexico | Total U.S. | % from Gulf of Mexico
Federal Offshore Crude Oil Production (4/05) | 1.5 | 5.4 | 28.5%
Total Gulf Coast Region Refinery Capacity (as of 1/1/05) | 8.0 | 17.0 | 47.4%
Total Gulf Coast Region Crude Oil Imports | 6.4 | 10.7 | 60.4%
Disaster Mitigation (2) - Flood Area Management -
§ Flood Insurance Management: reduce the number of people living in flood-prone areas
§ Increase Flood Insurance Rates
§ Provide Resettlement Money
§ Refusing to Rebuild in Flooded Areas
Missouri Buyout Program:
Item | 1993 Flood | 1995 Flood | May 2002 flood (as of 6/25/02)
Sandbagging sites in Arnold | 60 | 3 | 0
FEMA Public Assistance to Arnold | 1,436,277 | 71,414 | 0
Applications from Arnold for Individual Assistance | 52 | 26 | 1
Spreadsheet
• Flip to Excel and explain Financial Appendices here.
Quiroga’s First Move: Propose Joint Venture w/ 30% tax rate
Possible Outcomes:
• Left rejects 30%, propose 50% PBT tax
1. Left rejects JV proposal, HC Law reform fails. Negotiation begins with Left to stave off riots
2. Quiroga gets support for JV from coalition partners, keeps Left from taking the streets
3. Gov’t fails to secure enough capital to retain majority ownership
Some Personal peeves (not all agree)
• Number the slides
• USE Powerpoint’s text outline ability
  • (Titles especially)
• When importing from spreadsheets, reformat
  • Larger fonts
  • Remove inessential information
  • Informative labels on rows + columns
  • Format numbers
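When the numbers pass through R on their way to a slide, the "format numbers" step can be done there. The sketch below reuses the FEMA assistance figures from the Missouri buyout table; the column names and the example shares are hypothetical and exist only to show the formatting calls.

```r
# Raw values as they might arrive from a spreadsheet
# (FEMA figures from the Missouri buyout table; column names are hypothetical)
raw <- data.frame(
  flood = c("1993", "1995", "2002"),
  fema_public_assistance = c(1436277, 71414, 0)
)

# Thousands separators make large numbers readable at a glance
raw$assistance_fmt <- format(raw$fema_public_assistance, big.mark = ",",
                             scientific = FALSE, trim = TRUE)

# Shares: keep one decimal place and attach the unit (example values)
share     <- c(0.28462, 0.47442, 0.60355)
share_fmt <- sprintf("%.1f%%", 100 * share)

print(raw[, c("flood", "assistance_fmt")], row.names = FALSE)
print(share_fmt)
```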
Time management -- went very well
Conclusions
1. Ideas come first; what do you want to say?
  – Weak Ideas presented elegantly
  – versus Good Ideas presented plainly
  – Does it depend on intelligence of audience?
2. Logical flow of ideas second
3. Good slides only third
  – Key message clearly visible
  – Minimal extraneous junk
  – Never just “grab and go” from other documents
4. Charts and graphics make huge difference + or -
5. Multimedia effects last: Animation, music…
Great slides take time to polish & repolish
Good memos and presentations: building to inescapable conclusion
1. Fact A - prove it
2. If A and B, then C = logical inference
3. Fact B - prove it
4. Counterarguments to C = prove invalid
5. Therefore, audience must agree with C if they accepted the previous steps
When it matters a lot, take the time to do it right…
Do as we say, not as we do
• Most Faculty are not good role models (though it’s getting better)
• Varies by discipline (and person)
Thank you!