CRASH COURSE IN STATA EC501 Gabriella Conti University of Essex



OBJECTIVE

Introduce the use of Stata for:
- Data management
- Estimation
  - Cross sections
  - Time series
  - Panel data
- Testing and prediction


OVERVIEW

- What is Stata
- Stata resources
- Getting started
- Language syntax
- Storage types
- Formats
- Inputting data
- Do-files, Ado-files, Log-files
- Examining the data


STATA

Stata is a statistical package for managing, analyzing, and graphing data.

- User-friendly:
  - Command-driven language
  - Interactive
- Stata power: http://www.stata.com/capabilities/
- Which Stata: type about
- Latest version: Stata 8.2

            Stata/SE   Intercooled   Small
maxvar        32,767         2,047      99
matsize       11,000           800      40


STATA RESOURCES (1)

- Stata itself:
  - help [command or topic name];
  - whelp [command or topic name];
  - help contents;
  - search/net search/findit [command or topic name].
- Stata manuals (version 8):
  - Getting Started [GS]
  - User’s Guide [U]
  - Reference [R]
  - Cross-sectional time-series [XT]
  - Time series [TS]
  - Graphics [G]
  - …and lots more…
- Stata website: http://www.stata.com
- FAQs: http://www.stata.com/support/faqs
- Statalist: http://www.stata.com/support/statalist
- Data sets used in manuals: http://www.stata-press.com/data/


STATA RESOURCES (2)

- Stata Technical Bulletin [STB], now The Stata Journal
- The Boston College Software Archive (user-written commands):
  - net from http://fmwww.bc.edu/RePEc/bocode/
  - ssc install [command]
- Stata is web-aware!
- UCLA Academic Technology Services: http://www.ats.ucla.edu/stat/stata
- Other resources: http://www.stata.com/links/resources1.html


GETTING STARTED (1)

Stata windows:
- Results window [Ctrl+1 or click the results icon]
- Graph window [Ctrl+2]
- Viewer window [Ctrl+3 or click the viewer icon]: help, search, net search, view
- Command window [Ctrl+4]:
  - Type commands here (use the Page Up and Page Down keys to recall past commands)
  - Press Return to execute the command
- Review window [Ctrl+5]: past commands appear here (click on a command, and it will appear in the Command window)
- Variables window [Ctrl+6]: variables appear here (click on a variable, and it will appear in the Command window, or wherever the Target in the Variables window specifies)
- Data editor [Ctrl+7, or click the data editor icon, or type edit in the Command window]
- Data browser [click the data browser icon or type browse in the Command window]
- Do-file editor [Ctrl+8 or click the do-file editor icon]


GETTING STARTED (2)

Stata toolbar (icons):
- Open: open a Stata dataset.
- Save: save a Stata dataset.
- Print: print contents of active window.
- Log: start or stop, pause or resume a log file.
- Viewer: open viewer window, or bring to the front.
- Results: open results window, or bring to the front.
- Graph: open graph window, or bring to the front.
- Do-file editor: open do-file editor, or bring window to the front.
- Data editor: open data editor, or bring window to the front.
- Data browser: open data browser, or bring window to the front.
- More: continue when paused in long output.
- Break: stop the current task. This returns the system to the state it was in before you issued the command.


GETTING STARTED (3)

Commands interface: one of the main changes in Stata 8 is that it now has a Menu toolbar (in the style of SPSS). This enables the user to select an item from a pull-down menu which opens a dialogue box in which you can build Stata commands.

The dialogue boxes are very useful for learning how to build commands with a complicated syntax (e.g. graphs).

The command issued by the dialogue box is submitted as if you had typed it by hand. Therefore, if you cannot remember the syntax of a command, using the dialogue box and then checking the command in the Review window (or pressing the Page Up key) is a good way to get a reminder.


BASIC LANGUAGE SYNTAX

[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [using filename] [, options]

[if exp] and [in range] restrict a command to the observations that satisfy a condition or fall in a range; with drop and keep they select which observations to discard or retain.

Logical operators to use with [if exp]: & (and), | (or), ! (not)

Relational operators that can be used in [if exp]: ==, !=, >, >=, <, <=
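A minimal sketch of how the pieces of the syntax diagram fit together, assuming the auto example dataset that ships with Stata (loaded with sysuse); the variables used are those of that dataset, not of the course data.

```stata
* Load the example dataset shipped with Stata
sysuse auto, clear

* command [varlist] [if exp]: summarize two variables for cars with mpg above 20
summarize price weight if mpg > 20

* [in range]: list the first five observations
list make price mpg in 1/5

* [by varlist:] needs sorted data; the sort option sorts on the fly
by foreign, sort: summarize price

* Logical and relational operators combined in [if exp]
count if foreign == 1 & (price < 5000 | mpg >= 30)
count if !(rep78 == 3)

* [, options]: detail is an option of summarize
summarize price, detail
```

Note that == tests equality, while a single = is reserved for assignment in [=exp].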


STORAGE TYPES (1)

A number may contain a sign, an integer part, a decimal point, a fraction part, an e or E, and a signed integer exponent. Numbers may not contain commas; e.g.: the number 2,210 must be typed as 2210 (or 2210. or 2210.0).

Numbers can be stored in one of five variable types: byte, int, long, float (the default), or double. The table shows the minimum and maximum values for each storage type.

Storage type   Minimum                 Maximum                 Closest to 0 without being 0   Bytes
byte           -127                    100                     ±1                             1
int            -32,767                 32,740                  ±1                             2
long           -2,147,483,647          2,147,483,620           ±1                             4
float          -1.70141173319*10^38    1.70141173319*10^36     ±10^-36                        4
double         -8.9884656743*10^307    +8.9884656743*10^308    ±10^-323                       8
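As an illustration of choosing storage types, the sketch below creates a few variables with explicit types and then lets compress pick the smallest type that holds the values without loss. The variable names (flag, year, id, wage, x, seq) are made up for the example.

```stata
clear
set obs 3

* Choose the storage type explicitly when generating a variable
generate byte   flag = 1
generate int    year = 1992
generate long   id   = 1000000 + _n
generate double wage = 1234.56789

* float is the default type when none is given
generate x = 0.1

* seq holds only 1, 2, 3, so long is wasteful here
generate long seq = _n

* describe reports the storage type of each variable
describe

* compress demotes variables to the smallest type that loses no information
compress
describe
```

After compress, seq should be stored as a byte, while id stays long because its values exceed the int range.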


STORAGE TYPES (2)

A string is a sequence of printable characters, and is typically enclosed in double quotes. The quotes are not considered a part of the string. They merely delimit the beginning and end of the string.

The special string “”, often called the null string, is considered by Stata to be a missing value.

1. String variables often contain identifying information, such as the name of the city or state. Such strings are typically listed, but are not used directly in statistical analysis, although the data might be sorted on the string or datasets might be merged on the basis of one or more string variables.

2. Occasionally, strings contain information that is to be used directly in the analysis, such as the sex, which might be coded “male” or “female”. Stata prefers such information to be numerically encoded and stored in numeric variables. Stata’s statistical routines treat string variables as if every observation records a numeric missing value. However, Stata provides two pairs of commands for converting string variables into numeric (and back again): encode/decode and destring/tostring.

3. Strings may contain the character representation of a number, e.g.: “2.3”. You can convert it directly into a numeric variable using the real() function (with generate), or the destring command.

Strings are stored in string variables with storage types str1, str2, …, str80. The storage type merely sets the maximum length of the string, not its actual length; thus, “example” has length 7 whether it is stored as a str7, a str10, or even a str80. On the other hand, an attempt to assign the string “example” to a str6 would result in “exampl”.

The maximum length of a string is 80 characters in Intercooled Stata or Small Stata and 244 in Stata/SE. String literals may exceed 80/244 characters, but only the first 80/244 are significant.
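The conversions between string and numeric variables can be sketched as follows. The three observations and the variable names (sex, score and the generated variables) are invented for illustration.

```stata
clear
input str6 sex str4 score
"male"   "2.3"
"female" "4.1"
"female" "3.0"
end

* encode maps each distinct string to an integer code with a value label
encode sex, generate(sex_code)

* decode goes the other way, from a labelled numeric variable back to a string
decode sex_code, generate(sex_again)

* real() converts the character representation of a number inside generate
generate score_num = real(score)

* destring does the same at the variable level
destring score, generate(score_ds)

list
```

encode assigns codes in alphabetical order of the strings, so here “female” becomes 1 and “male” becomes 2.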


FORMATS (1)

The syntax for a Stata numeric format:
- first type % to indicate the start of the format
- then optionally type - if you want the result left-aligned
- then optionally type 0 if you want to retain leading zeros (honored only with the f format)
- then type a number w stating the width of the result
- then type .
- then type a number d stating the number of digits to follow the decimal point
- then type
  - either e for scientific notation; e.g.: 1.00e+03
  - or f for fixed format; e.g.: 1000.0
  - or g for general format; Stata chooses based on the number being displayed
- then optionally type c to indicate comma format (not allowed with e)


FORMATS (2)

The default formats for the numeric variable types are:
  byte    %8.0g
  int     %8.0g
  long    %12.0g
  float   %9.0g
  double  %10.0g

The syntax for a string format is:
- first type % to indicate the start of the format
- then optionally type - if you want the result left-aligned
- then type a number indicating the width of the result
- then type s

The default format for a string variable is %ws, where w is the declared length of the variable, or %9s, whichever is wider.
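A short sketch of numeric and string formats in use, again assuming the auto example dataset. The particular widths and numbers are arbitrary choices for illustration.

```stata
sysuse auto, clear

* display accepts a format directly in front of the value
display %9.2f 1234.5678
display %9.2e 1234.5678
display %12.2fc 1234567.891
display %05.1f 3.14159

* format changes how a variable is shown, not the values that are stored
format price %9.2fc
format make %-18s
list make price in 1/3

* describe shows the display format attached to each variable
describe make price
```

Because a format affects only the display, calculations on price give the same results before and after the format command.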


FILE EXTENSIONS

- Data file (Stata format): filename.dta
- Do-file: filename.do
- Dictionary file: filename.dct
- Log-file: filename.smcl (only readable in Stata)
- Log-file: filename.log (text file)
- Ado-file: filename.ado


INPUTTING DATA (1)

- Check memory: memory
- If not enough memory has been assigned to Stata, you may get the message:

  no room to add more observations
      An attempt was made to increase the number of observations beyond what is currently possible. You have the following alternatives:
      1. Store your variables more efficiently; see help compress. (Think of Stata's data area as the area of a rectangle; Stata can trade off width and length.)
      2. Drop some variables or observations; see help drop.
      3. Increase the amount of memory allocated to the data area using the set memory command; see help memory.
  r(901);

- Set memory: set memory
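A minimal sketch of adjusting the memory allocation, assuming a release such as Stata 8 in which the data area has a fixed size. The value 50m is an arbitrary example; in Stata 12 and later, memory is managed automatically and set memory is ignored.

```stata
* set memory is refused while data are in memory, so clear first
clear

* Allocate 50 megabytes to the data area (example value)
set memory 50m

* Report the current allocation and usage
memory
```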


INPUTTING DATA (2)

1a. use filename [, clear nolabel] (or click the folder icon)

- For datasets already in Stata format (*.dta).
- If filename is specified without an extension, .dta is assumed.
- clear permits the data to be loaded even if there is a dataset already in memory and even if that dataset has changed since the data were last saved.
- nolabel prevents value labels from being loaded. It is unlikely that you will ever use it.

1b. use [varlist] [if exp] [in range] using filename [, clear nolabel]

- Only a subset of the data is loaded.
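Both forms of use can be tried with a sketch like the following. It first saves a copy of the auto example dataset under the hypothetical name auto_copy so that there is a .dta file to read.

```stata
* Create a small .dta file to read back (auto_copy is a hypothetical filename)
sysuse auto, clear
save auto_copy, replace

* 1a. Load the whole dataset, replacing whatever is in memory
use auto_copy, clear

* 1b. Load only four variables, and only the foreign cars
use make price mpg foreign if foreign == 1 using auto_copy, clear
describe
```

Loading a subset in this way is useful when the full file is too large for the memory allocated to Stata.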


INPUTTING DATA (3)

2. insheet [varlist] using filename [, double [no]names [ comma | tab | delimiter("char") ] clear ]

- For files created by spreadsheet or database programs (e.g. Excel).
- For text (ASCII) files where there is one observation per line and the values are separated by tabs or commas (*.csv). The first line of the file may or may not contain the variable names.
- double forces Stata to store variables as doubles rather than floats.
- comma, tab, and delimiter("char") tell Stata how values are separated in the file. Specifying comma or tab only speeds up insheet processing, since insheet can determine for itself whether the separator is a tab or a comma. If values in the file are separated by semicolons, specify delimiter(";").
- clear specifies that it is okay for the new data to replace what is currently in memory.
- Best point: insheet using filename is all you need.
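To try insheet without an external spreadsheet, the sketch below writes part of the auto example dataset to a comma-separated file with outsheet and reads it back. The filename auto_sub.csv is hypothetical. In recent Stata releases import delimited supersedes insheet, although insheet still runs.

```stata
* Write a comma-separated file with variable names in the first line
sysuse auto, clear
keep make price mpg
outsheet using auto_sub.csv, comma replace

* The short form: insheet works out the separator and the header line itself
insheet using auto_sub.csv, clear
describe

* The same read with the options spelled out
insheet using auto_sub.csv, comma names clear
```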


INPUTTING DATA (4)

3a. infile varlist [_skip[(#)] [varlist [_skip[(#)] ...]]] using filename [if exp] [in range] [, automatic byvariable(#) clear]

- For data in either free or comma-separated-value format (unformatted ASCII (text) data).
- If filename is specified without an extension, .raw is assumed.
- The file must contain only the data, not the variable names.
- automatic causes creation of value labels from the nonnumeric data read.
- byvariable(#) specifies that the external file is organized by variables rather than by observations.
- clear specifies that it is okay for the new data to replace what is currently in memory.
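A sketch of reading free-format data with infile, assuming a raw file produced from the auto example dataset with outfile. The filename auto_free is hypothetical; outfile adds the .raw extension and writes no variable names.

```stata
* Write a free-format text file, auto_free.raw
sysuse auto, clear
outfile make price mpg using auto_free, replace

* String variables must be declared with their storage type in the varlist
infile str18 make price mpg using auto_free, clear
list in 1/3

* _skip(1) skips the second value on each line (price)
infile str18 make _skip(1) mpg using auto_free, clear
describe
```

Unlike insheet, infile needs the variable names and types spelled out, because the file itself carries no header.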


INPUTTING DATA (5)

3b. infile using dfilename [if exp] [in range] [, automatic using(filename2) clear ]

- For ASCII (text) data in fixed format with a dictionary.
- A dictionary describes the contents of the file and will allow reading files in fixed or free format. dfilename contains the dictionary. If dfilename is specified without an extension, .dct is assumed.
- using(filename2) specifies the name of the file containing the data. If using() is not specified, the data are assumed to follow the dictionary in dfilename or, if the dictionary specifies the name of some other file, that file is assumed to contain the data. If using(filename2) is specified, filename2 is used to obtain the data even if the dictionary itself says otherwise.
- automatic causes creation of value labels from the nonnumeric data read.
- clear specifies that it is okay for the new data to replace what is currently in memory.

E.g.:

  dictionary using D:\DATA\LFS\RAW\OTT92.txt {
      _column(1)   year       %2f
      _column(3)   quarter    %1f
      _column(4)   region     %2f
      _column(31)  sex        %1f
      _column(32)  age        %2f
      _column(45)  education  %1f
      _column(59)  workcond   %2f
      _column(61)  workweek   %1f
      _column(62)  workday    %1f
      _column(63)  workhour   %2f
      _column(65)  usualday   %1f
  }


INPUTTING DATA (6)

4a. infix using dfilename [if exp] [in range] [, using(filename2) clear ]

4b. infix specifications using filename [if exp] [in range] [, clear]

- For data in fixed-column format.
- In the first syntax, dfilename contains the dictionary. If dfilename is specified without an extension, .dct is assumed.
- using(filename2) specifies the name of the file containing the data. If using() is not specified, the data are assumed to follow the dictionary in dfilename or, if the dictionary specifies the name of some other file, that file is assumed to contain the data. If using(filename2) is specified, filename2 is used to obtain the data even if the dictionary itself says otherwise.
- clear specifies that it is okay for the new data to replace what is currently in memory.

E.g.: infix year 1-2 quarter 3 region 4-5 sex 31 age 32-33 education 45 workcond 59-60 workweek 61 workday 62 workhour 63-64 usualday 65 using D:\DATA\LFS\RAW\OTT92.txt


INPUTTING DATA (7)

5. Stat/Transfer: http://www.stattransfer.com/

- Automatically converts data from other formats to .dta format.

6. edit [varlist] [if exp] [in range] [, nolabel]

- edit brings up a spreadsheet-style data editor for entering new data and editing existing data.

7. input [varlist] [, automatic label ]

- input allows you to type data directly into the dataset in memory.

8. odbc load [options]

- odbc allows Stata to load data from ODBC sources. Type help odbc for more on this.
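For very small datasets, input is often the quickest route. A minimal sketch with three invented observations (name, age and wage are hypothetical variables):

```stata
clear

* Declare the variables, then type one observation per line; end stops data entry
input str10 name age wage
"Ann"   34 12.5
"Bob"   45 15
"Carla" 29 .
end

list
```

A single period stands for a numeric missing value, as in the last observation.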


DO-FILES

- Instead of using Stata interactively, you can use do-files. Highly recommended.
- A do-file is a standard ASCII text file that includes commands. Filename must include the extension .do.
- Stata users can use any text editor to create do-files, or they can use the built-in do-file editor.
- You can include comments using the indicators *, /* */, //, ///.
- You can change the end-of-line delimiter for long lines. E.g.: #delimit ;
  Once you change the line delimiter to semicolon, all lines, even short ones, must end in semicolons.
- A do-file is executed by Stata:
  - when you type do filename in the command window;
  - when you click the “do current file” button in the do-file editor.
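The comment indicators and the delimiter switch can be seen together in a short do-file. This is a sketch only: the name example.do and the regression are invented, and the data are again the auto example dataset.

```stata
* example.do: a hypothetical do-file
version 8
clear
set more off

/* block comments can span
   several lines */
sysuse auto, clear        // a comment at the end of a line

* three slashes continue a command on the next line
summarize price mpg ///
    weight length

* after #delimit ; every command must end in a semicolon
#delimit ;
regress price mpg weight
    length foreign ;
#delimit cr
```

#delimit cr restores the carriage return as the end-of-line delimiter. The #delimit command and the /// continuation are meant for do-files, not for commands typed interactively.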


ADO-FILES

- An ado-file is an ASCII text file that contains a Stata program.
- When you type a command that Stata does not know (i.e. it is not a built-in command), it looks in certain places for an ado-file of that name. If Stata finds it, Stata loads and executes it, so it appears to you as if the ado-command is just another command built into Stata.
- Use the which command to determine if a command is built in or implemented as an ado-file.
- Stata looks for ado-files in seven directories. Use the command sysdir to see where they are on your computer.
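The commands mentioned above can be tried directly. The two commands passed to which are examples only; the paths reported depend on the installation.

```stata
* Reports whether a command is built in or where its ado-file lives
which summarize
which xtreg

* Lists the system directories Stata uses
sysdir

* Shows the order in which directories are searched for ado-files
adopath
```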


LOG FILES

log (or click the log icon)
log using filename [, append replace [ text | smcl ] ]
log { on | off | close }
cmdlog
cmdlog using filename [, append replace ]
cmdlog { on | off | close }

- log allows you to make a full record of your Stata session. A log is a file containing what you type and Stata's output. It is useful to include the commands to start and stop the logging in the do-file itself.
- cmdlog allows you to make a record of what you type during your Stata session. A command log contains only what you type and so is a subset of a full log. Command logs are always straight ASCII text files and this makes them easy to convert into do-files.
- Full logs are recorded in one of two formats: SMCL (Stata Markup and Control Language) or text (meaning ASCII). The default is SMCL, but set logtype can change that, or you can specify an option [ text | smcl ] to state the format you wish.
- log or cmdlog, typed without arguments, reports the status of logging.
- log using and cmdlog using open a log file. log close and cmdlog close close the file.
- In between, log off and cmdlog off suspend logging temporarily, and log on and cmdlog on resume it.
- append specifies that results are to be appended onto the end of an already existing file. If the file does not already exist, a new file is created.
- replace specifies that filename, if it already exists, is to be overwritten, and so is an alternative to append.
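A typical logging pattern inside a do-file might look like the sketch below. The log name session1 is hypothetical, and the commands being logged use the auto example dataset.

```stata
* Close any log left open by an earlier run; capture suppresses the error if none is open
capture log close

* Start a plain-text log, overwriting an existing session1.log
log using session1, replace text

sysuse auto, clear
summarize price mpg

* Suspend logging: the describe output is not recorded
log off
describe
log on

* Report the logging status, then close the file
log
log close
```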


CONTROLLING OUTPUT

- -more- may appear in your results window when you are trying to output a long listing.
  - To see the next line: press Enter.
  - To see the next screen: press any other key, or click on the -more- at the bottom of the results window, or click the “go” icon.
- set more off / set more on: switch the -more- pause off or on. Very useful in do-files.
- break: to interrupt a Stata command at any time, use the “break” button, or type q in the command window.


NEXT TIME: LAB #1

- Examining the data: describe, list, codebook, summarize, inspect, tabulate
- Organising datasets: rename, drop, keep, generate, replace, egen, sort, append, merge

…and more...