SOCY7709: Quantitative Data Management

Instructor: Natasha Sarkisian

String Variables, Dates, and Formats for Numeric Variables

String Variable Formats

A string is a sequence of characters and is typically enclosed in double quotes. When considering

whether a string constitutes a distinct value (i.e., different from another string), capitalization

matters, and leading and trailing spaces – spaces before or after the text – matter as well. As we

already saw, numbers can be treated as strings.

In Stata 12, string variables can be up to 244 characters long, with storage types str1 through str244. In Stata 13 and 14, string variables can contain up to two billion characters: the types str1 through str2045 define fixed-length strings of up to 2045 characters, and the strL type holds very long strings. If you display a string that contains a lot of text using the list command, Stata shows only as many characters as fit the width of your screen and cuts off the rest. To see more (up to 2045 characters), use the notrim option:

. list stringvar, notrim

To see all text for a given observation, regardless of how long it is (can be many pages given the new limits in recent Stata versions):

. display _asis stringvar[5]

When we refer to string values, we usually use quotes to delimit the beginning and the end. If there are leading spaces in a quote, we’d need to include them to get the exact match, e.g., “ text” is a different string from “text”, so we need to include the spaces to refer to that first value.

When we enter data in Stata, we can enter them without quotes, but in that case, any leading or

trailing spaces will be automatically removed. If a string is entered in quotes, it is accepted as is.

In addition to regular double quotes "" for enclosing strings, Stata also allows compound double quotes: `" and "'. That is, instead of typing "text", you can type `"text"'. The second version is used in programming because it allows the quoted string itself to contain double quotes (with regular quotes that is not possible, because Stata would treat any embedded quotation mark as the end of the string).
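As a small illustration of compound double quotes, the sketch below displays and stores a string that itself contains double quotes. The variable name answer and its text are made up for this example.

```stata
* Compound quotes `" "' let the string contain regular double quotes
display `"She answered "yes" to question 5"'

* Hypothetical one-observation dataset holding a quotation
clear
set obs 1
gen str30 answer = `"She said "maybe""'
list answer
```

With regular quotes, the same text would end at the quotation mark before maybe.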

Missing values for strings are coded using null string – "" – not using either a period "." or a

blank space " ".
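The points above about capitalization, leading spaces, and missing strings can be checked with a short sketch. The variable names s, exact, and empty are hypothetical and used only for illustration.

```stata
* Four observations that look similar but are different strings
clear
set obs 4
gen str10 s = "text"
replace s = " text" in 2     // leading space
replace s = "Text"  in 3     // different capitalization
replace s = ""      in 4     // null string = missing

* Only observation 1 is an exact match for "text"
gen byte exact = (s == "text")

* Only the null string counts as missing
gen byte empty = missing(s)
list
```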

String variables can be formatted for display in different ways:

                              string
%fmt          Description     Example
-------------------------------------------------------
right-justified
    %#s       string          %15s

left-justified
    %-#s      string          %-20s

centered
    %~#s      string          %~12s
-------------------------------------------------------

The centered format is for use with display only.
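A brief sketch of how these formats might be applied, assuming a made-up string variable called name:

```stata
clear
set obs 3
gen str20 name = "Boston"

* Left-justify the variable in a 20-character column
format name %-20s
list name

* Switch back to right-justified
format name %20s
list name

* The centered format works with display only
display %~12s "Stata"
```

The format command changes only how values are shown; the stored strings are not modified.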

Basic Operations with Strings

Converting strings to numbers and vice versa:

We already learned to use tostring and destring to convert between strings and numbers. Those commands are useful for converting variables that actually contain numbers to and from string versus numeric format. Note that if you want to convert all string variables in your dataset that contain numbers saved as strings (e.g., because a lot of them were created that way in the dataset provided to you, but they are truly numeric), you can also use destring without specifying variables, i.e.:

. destring, replace

Sometimes, most of the variable values are numeric, but some values are text-based – e.g., if

missing values are coded as X. In such cases, we can use ignore option:

. destring, replace ignore(X)

If a string variable contains nonnumeric characters that are not specified with the ignore option, then no changes will be made at all (unless the force option is also specified). Note that if a cell contains both a number and a character we specified to ignore, the character will be omitted and the number will be used. E.g., if there is a string variable with percentage values coded as “58%” etc., we can use:

. destring varname, gen(varname_v2) ignore(%)

or

. destring varname, gen(varname_v2) percent

The latter option also divides values by 100, turning them into proportions.
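To make the difference between the two options concrete, here is a minimal sketch on a made-up variable pct (not part of the GSS data):

```stata
clear
set obs 3
gen str4 pct = "58%"
replace pct = "7%"   in 2
replace pct = "100%" in 3

* ignore() drops the % sign and keeps the number as is: 58, 7, 100
destring pct, gen(pct_num) ignore("%")

* percent drops the % sign and divides by 100: .58, .07, 1
destring pct, gen(pct_prop) percent

list
```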

Another way to convert between string and numeric formats is to use the functions string(n,s) and real(s):

. sum id

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

id | 2765 1383 798.3311 1 2765

. gen stringid=string(id, "%05.0f")

. list stringid in 1/10

+----------+

| stringid |

|----------|

1. | 00001 |

2. | 00002 |

3. | 00003 |

4. | 00004 |

5. | 00005 |

|----------|

6. | 00006 |

7. | 00007 |

8. | 00008 |

9. | 00009 |

10. | 00010 |

+----------+

. gen idreal=real(stringid)

. sum idreal id

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

idreal | 2765 1383 798.3311 1 2765

id | 2765 1383 798.3311 1 2765

Or for date:

. gen datestring=string(date, "%td")

. list datestring in 1/10

+-----------+

| datestr~g |

|-----------|

1. | 30may2002 |

2. | 03jun2002 |

3. | 23feb2002 |

4. | 26mar2002 |

5. | 04may2002 |

|-----------|

6. | 02jun2002 |

7. | 21feb2002 |

8. | 09may2002 |

9. | 07may2002 |

10. | 13may2002 |

+-----------+

The format (such as %05.0f and %td in the examples above) can also be supplied from a separate string variable, which is useful if different formats are desired for different observations.
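A minimal sketch of observation-specific formats, assuming hypothetical variables x (the number) and fmt (the format to apply to it):

```stata
clear
set obs 2
gen x = 3.14159

* Each observation carries its own display format
gen str8 fmt = "%4.2f"
replace fmt = "%6.4f" in 2

* string() accepts the format as a string expression
gen str10 out = string(x, fmt)
list
```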

But if we are dealing with variables where there are actual words, these commands are less

useful. Instead, we could use encode and decode.

. tab marital, nol

marital |

status | Freq. Percent Cum.

------------+-----------------------------------

1 | 1,269 45.90 45.90

2 | 247 8.93 54.83

3 | 445 16.09 70.92

4 | 96 3.47 74.39

5 | 708 25.61 100.00

------------+-----------------------------------

Total | 2,765 100.00

. decode marital, gen(marstring)

. tab marstring

marital |

status | Freq. Percent Cum.

--------------+-----------------------------------

divorced | 445 16.09 16.09

married | 1,269 45.90 61.99

never married | 708 25.61 87.59

separated | 96 3.47 91.07

widowed | 247 8.93 100.00

--------------+-----------------------------------

Total | 2,765 100.00

. tab marstring, nol

marital |

status | Freq. Percent Cum.

--------------+-----------------------------------

divorced | 445 16.09 16.09

married | 1,269 45.90 61.99

never married | 708 25.61 87.59

separated | 96 3.47 91.07

widowed | 247 8.93 100.00

--------------+-----------------------------------

Total | 2,765 100.00

. des marstring

storage display value

variable name type format label variable label

--------------------------------------------------------------------------------

marstring str13 %13s marital status

. sum marstring

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

marstring | 0

And going back:

. encode marstring, gen(marnumeric)

. tab marnumeric

marital |

status | Freq. Percent Cum.

--------------+-----------------------------------

divorced | 445 16.09 16.09

married | 1,269 45.90 61.99

never married | 708 25.61 87.59

separated | 96 3.47 91.07

widowed | 247 8.93 100.00

--------------+-----------------------------------

Total | 2,765 100.00

. tab marnumeric, nol

marital |

status | Freq. Percent Cum.

------------+-----------------------------------

1 | 445 16.09 16.09

2 | 1,269 45.90 61.99

3 | 708 25.61 87.59

4 | 96 3.47 91.07

5 | 247 8.93 100.00

------------+-----------------------------------

Total | 2,765 100.00

In contrast, if we tried to apply destring:

. destring marstring, gen(test)

marstring contains nonnumeric characters; no generate

. destring marstring, gen(test) force

marstring contains nonnumeric characters; test generated as byte

(2765 missing values generated)

. tab test

no observations
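Note that encode assigned the codes alphabetically above, so they no longer match the original marital codes. One way to control the coding, sketched here on a made-up three-row dataset, is to define the value label first and pass it to encode with the label() option. The label name marlbl and the variable marcode are hypothetical.

```stata
clear
set obs 3
gen str13 marstring = "married"
replace marstring = "widowed"  in 2
replace marstring = "divorced" in 3

* Define the desired coding before encoding
label define marlbl 1 "married" 2 "widowed" 3 "divorced"

* encode uses the existing label instead of alphabetical order
encode marstring, gen(marcode) label(marlbl)
list marstring marcode, nolabel
```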

Combining strings:

The following function of egen concatenates multiple variables to produce one string variable:

concat(varlist) [, format(%fmt) decode maxlength(#) punct(pchars)]

The values of string variables are unchanged. Values of numeric variables are converted to

string, as is, or are converted using a numeric format under the format(%fmt) option or decoded

under the decode option, in which case maxlength() may also be used to control the maximum

label length used. By default, variables are added end to end: punct(pchars) may be used to

specify punctuation, such as a space, punct(" "), or a comma, punct(,).

. egen text=concat(age hrs1 marstring), format(%2.1f) decode punct(,)

. list text in 1/10

+-------------------+

| text |

|-------------------|

1. | 27.0,20.0,married |

2. | 37.0,49.0,married |

3. | 33.0,44.0,married |

4. | 62.0,20.0,married |

5. | 70.0,3.0,married |

|-------------------|

6. | 29.0,43.0,married |

7. | 25.0,49.0,married |

8. | 58.0,.,married |

9. | 51.0,37.0,married |

10. | 43.0,.,married |

+-------------------+
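The same kind of result can be built by hand with the + operator and string(), which gives control over the format of each piece. This is only a sketch on made-up data; the variable names are illustrative.

```stata
clear
set obs 2
gen age = 27
replace age = 62 in 2
gen str7 status = "married"

* Convert the number with its own format, then join with a comma
gen str20 text2 = string(age, "%2.0f") + "," + status
list text2
```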

Cleaning strings:

itrim(s) – collapses multiple consecutive blank spaces inside the string into a single space; trim(s) – removes all leading spaces (spaces prior to actual text) and trailing spaces (those after the text); ltrim(s) – removes leading spaces only; rtrim(s) – removes trailing spaces only. E.g.:

. gen testvar=" We love Stata"

. gen testtrim=ltrim(testvar)

. tab testtrim

testtrim | Freq. Percent Cum.

--------------------------+-----------------------------------

We love Stata | 2,765 100.00 100.00

--------------------------+-----------------------------------

Total | 2,765 100.00

. replace testtrim=itrim(testvar)

(2765 real changes made)

. tab testtrim

testtrim | Freq. Percent Cum.

--------------------------+-----------------------------------

We love Stata | 2,765 100.00 100.00

--------------------------+-----------------------------------

Total | 2,765 100.00
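Because extra spaces are hard to see in printed output, the following sketch uses a made-up string with three leading spaces, three spaces between words, and two trailing spaces, and compares lengths after each cleaning step.

```stata
clear
set obs 1
gen str30 raw = "   We   love   Stata  "

* Remove leading and trailing spaces only
gen str30 both = trim(raw)

* Additionally collapse runs of internal spaces to one space
gen str30 clean = itrim(trim(raw))

gen len_both  = length(both)
gen len_clean = length(clean)
list both clean len_both len_clean
```

Nesting itrim() around trim() is a common way to fully standardize spacing before comparing strings.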

Changing case:

lower(s) converts strings into lowercase, and upper(s) converts strings into uppercase. proper(s) capitalizes the first letter and any other letter immediately following a character that is not a letter (e.g., space, period, etc.). E.g.:

. gen marstring_up=upper(marstring)

. tab marstring_up

marstring_up | Freq. Percent Cum.

--------------+-----------------------------------

DIVORCED | 445 16.09 16.09

MARRIED | 1,269 45.90 61.99

NEVER MARRIED | 708 25.61 87.59

SEPARATED | 96 3.47 91.07

WIDOWED | 247 8.93 100.00

--------------+-----------------------------------

Total | 2,765 100.00

. gen city="st.louis" in 1/100

(2665 missing values generated)

. replace city="new york" in 101/200

(100 real changes made)

. replace city="boston" in 201/300

(100 real changes made)

. replace city=proper(city)

(300 real changes made)

. tab city

city | Freq. Percent Cum.

------------+-----------------------------------

Boston | 100 33.33 33.33

New York | 100 33.33 66.67

St.Louis | 100 33.33 100.00

------------+-----------------------------------

Total | 300 100.00

Measuring strings:

length(s) function evaluates the length of a string. wordcount(s) evaluates the number of words

in the string (a word is defined as a set of characters that start and terminate with spaces, start

with the beginning of the string, or terminate with the end of the string). E.g.:

. gen marlength=length(marstring)

. tab marlength

marlength | Freq. Percent Cum.

------------+-----------------------------------

7 | 1,516 54.83 54.83

8 | 445 16.09 70.92

9 | 96 3.47 74.39

13 | 708 25.61 100.00

------------+-----------------------------------

Total | 2,765 100.00

. gen marcount=wordcount(marstring)

. tab marcount

marcount | Freq. Percent Cum.

------------+-----------------------------------

1 | 2,057 74.39 74.39

2 | 708 25.61 100.00

------------+-----------------------------------

Total | 2,765 100.00

Advanced Operations with Strings

Changing strings:

We already learned that you can add two strings using “+”. You can also multiply strings by a

number to duplicate the same text:

. replace city=city*3

city was str8 now str24

(300 real changes made)

. tab city

city | Freq. Percent Cum.

-------------------------+-----------------------------------

BostonBostonBoston | 100 33.33 33.33

New YorkNew YorkNew York | 100 33.33 66.67

St.LouisSt.LouisSt.Louis | 100 33.33 100.00

-------------------------+-----------------------------------

Total | 300 100.00

abbrev(s,n) -- abbreviates strings to some number of characters between 5 and 32

. gen marabb=abbrev(marstring, 6)

. tab marabb

marabb | Freq. Percent Cum.

------------+-----------------------------------

divo~d | 445 16.09 16.09

marr~d | 1,269 45.90 61.99

neve~d | 708 25.61 87.59

sepa~d | 96 3.47 91.07

wido~d | 247 8.93 100.00

------------+-----------------------------------

Total | 2,765 100.00

reverse(s) – reverses the text:

. gen marrev=reverse(marstring)

. tab marrev

marrev | Freq. Percent Cum.

--------------+-----------------------------------

decrovid | 445 16.09 16.09

deirram | 1,269 45.90 61.99

deirram reven | 708 25.61 87.59

detarapes | 96 3.47 91.07

dewodiw | 247 8.93 100.00

--------------+-----------------------------------

Total | 2,765 100.00

Searching in strings:

strmatch(s1,s2) -- returns 1 if s1 matches the pattern s2; otherwise, it returns 0. In s2, "?" means

that one character goes here, and "*" means that any number of characters (including possibly

zero characters) go here.

. gen match=strmatch(marstring, "married")

. tab marstring match

marital | match

status | 0 1 | Total

--------------+----------------------+----------

divorced | 445 0 | 445

married | 0 1,269 | 1,269

never married | 708 0 | 708

separated | 96 0 | 96

widowed | 247 0 | 247

--------------+----------------------+----------

Total | 1,496 1,269 | 2,765

. gen match2=strmatch(marstring, "*married")

. tab marstring match2

marital | match2

status | 0 1 | Total

--------------+----------------------+----------

divorced | 445 0 | 445

married | 0 1,269 | 1,269

never married | 0 708 | 708

separated | 96 0 | 96

widowed | 247 0 | 247

--------------+----------------------+----------

Total | 788 1,977 | 2,765
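Two further uses of strmatch are sketched below on made-up data: the "?" wildcard for exactly one character, and lower() to make the match ignore capitalization. The variable names are hypothetical.

```stata
clear
set obs 3
gen str13 status = "Married"
replace status = "never married" in 2
replace status = "widowed"       in 3

* Lowercase first so that "Married" also matches
gen byte any_married = strmatch(lower(status), "*married")

* "?" stands for exactly one character
gen byte wid = strmatch(status, "wido?ed")
list
```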

strpos(s1,s2) -- returns the number indicating the position in s1 at which s2 is first found; if s2 is not found, it returns 0.

. gen marpos=strpos(marstring, "married")

. tab marpos

marpos | Freq. Percent Cum.

------------+-----------------------------------

0 | 788 28.50 28.50

1 | 1,269 45.90 74.39

7 | 708 25.61 100.00

------------+-----------------------------------

Total | 2,765 100.00

. gen marpos2=strpos(marstring, "ed")

. tab marpos2

marpos2 | Freq. Percent Cum.

------------+-----------------------------------

6 | 1,516 54.83 54.83

7 | 445 16.09 70.92

8 | 96 3.47 74.39

12 | 708 25.61 100.00

------------+-----------------------------------

Total | 2,765 100.00

subinstr(s1,s2,s3,n) – takes s1 and replaces the first n occurrences of s2 within s1 with s3. If n is

missing (.), all occurrences are replaced.

. gen marsub=subinstr(marstring, "ed", "ing", .)

. tab marsub

marsub | Freq. Percent Cum.

---------------+-----------------------------------

divorcing | 445 16.09 16.09

marriing | 1,269 45.90 61.99

never marriing | 708 25.61 87.59

separating | 96 3.47 91.07

widowing | 247 8.93 100.00

---------------+-----------------------------------

Total | 2,765 100.00
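The role of the last argument n is easiest to see on a word with repeated letters. A minimal sketch using display and a made-up string:

```stata
* Replace only the first occurrence of "a"
display subinstr("banana", "a", "o", 1)

* Replace the first two occurrences
display subinstr("banana", "a", "o", 2)

* Replace all occurrences (n is missing)
display subinstr("banana", "a", "o", .)
```

These display bonana, bonona, and bonono, respectively.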

subinword(s1,s2,s3,n) – takes string s1 and replaces the first n occurrences of s2 as a word (i.e.,

space separated entity) within s1 with s3. If n is missing (.), all occurrences are replaced.

. gen marchanged=subinword(marstring, "married", "wedded",.)

. tab marchanged

marchanged | Freq. Percent Cum.

-------------+-----------------------------------

divorced | 445 16.09 16.09

never wedded | 708 25.61 41.70

separated | 96 3.47 45.17

wedded | 1,269 45.90 91.07

widowed | 247 8.93 100.00

-------------+-----------------------------------

Total | 2,765 100.00

For more complex search and substitution operations, see also the functions regexm, regexr, and regexs, which require knowing how to work with so-called regular expressions.
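As a rough sketch of what these three functions do, the example below uses a made-up three-row version of marstring. The patterns are illustrative only: regexm tests for a match, regexs(1) returns the first parenthesized piece of the most recent match, and regexr replaces the first match.

```stata
clear
set obs 3
gen str13 marstring = "married"
replace marstring = "never married" in 2
replace marstring = "widowed"       in 3

* 1 if the value consists of exactly two lowercase words
gen byte two_words = regexm(marstring, "^[a-z]+ [a-z]+$")

* Capture the word that precedes "married"
gen str13 firstword = regexs(1) if regexm(marstring, "^([a-z]+) married$")

* Replace "never" at the start of the string
gen str13 swapped = regexr(marstring, "^never", "not")
list
```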

Extracting portion of strings:

substr(s,n1,n2) – we already used this one when working with dates; it returns the substring of s, starting at position n1, for a length of n2. If n1 is negative, the starting position is counted from the end of the string; n2 cannot be negative. If n2 is . (missing), the entire remaining portion of the string starting at position n1 is returned.
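The three cases described above can be tried directly with display on a made-up string:

```stata
* Start at position 2, take 3 characters
display substr("abcdef", 2, 3)

* Negative n1: start 2 characters from the end
display substr("abcdef", -2, 2)

* Missing n2: everything from position 3 onward
display substr("abcdef", 3, .)
```

These display bcd, ef, and cdef, respectively.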

word(s, n) - returns the nth word in the string s. Positive numbers count words from the

beginning of the string, and negative numbers count words from the end of string. E.g.:

. gen marstring2=word(marstring, 2)

(2057 missing values generated)

. tab marstring2

marstring2 | Freq. Percent Cum.

------------+-----------------------------------

married | 708 100.00 100.00

------------+-----------------------------------

Total | 708 100.00

A function in egen command, ends, gives the first word of a string with the “head” option, the

last word using “last” option, or everything EXCEPT the first word using “tail” option (if there is

nothing that occurs after the first space or the first punctuation sign of choice, then the result will

be an empty string). How the words are separated is determined by the option punct – the default

is a space, but it can be comma etc. For example, if we have a string “rock, paper, scissors” and

we use egen to extract the last item:

. egen marstring3=ends(marstring), last punct(,) trim

Note that the trim option removes any leading or trailing spaces.
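To see the three options side by side, here is a sketch applied to the “rock, paper, scissors” string itself; the variable names game, item_first, item_last, and item_rest are made up for illustration.

```stata
clear
set obs 1
gen str25 game = "rock, paper, scissors"

* First item: everything before the first comma
egen item_first = ends(game), head punct(,)

* Last item: everything after the last comma, spaces trimmed
egen item_last = ends(game), last punct(,) trim

* Everything except the first item
egen item_rest = ends(game), tail punct(,) trim
list
```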

Nesting Operations

You can nest operations in one another. For example:

. gen maroperate=proper(abbrev(reverse(marstring*2), 15))

. tab maroperate

maroperate | Freq. Percent Cum.

----------------+-----------------------------------

Decroviddecro~D | 445 16.09 16.09

Deirram Reven~N | 708 25.61 41.70

Deirramdeirram | 1,269 45.90 87.59

Detarapesdeta~S | 96 3.47 91.07

Dewodiwdewodiw | 247 8.93 100.00

----------------+-----------------------------------

Total | 2,765 100.00

Or we can convert ID into string with some leading zeros, and then select the portion that omits

the last two digits:

. gen newid = substr(string(id, "%05.0f"), 1, length(string(id, "%05.0f")) - 2)

. list newid in 1/10

+-------+

| newid |

|-------|

1. | 000 |

2. | 000 |

3. | 000 |

4. | 000 |

5. | 000 |

|-------|

6. | 000 |

7. | 000 |

8. | 000 |

9. | 000 |

10. | 000 |

+-------+

. tab newid

newid | Freq. Percent Cum.

------------+-----------------------------------

000 | 99 3.58 3.58

001 | 100 3.62 7.20

002 | 100 3.62 10.81

003 | 100 3.62 14.43

004 | 100 3.62 18.05

005 | 100 3.62 21.66

006 | 100 3.62 25.28

007 | 100 3.62 28.90

008 | 100 3.62 32.51

009 | 100 3.62 36.13

010 | 100 3.62 39.75

011 | 100 3.62 43.36

012 | 100 3.62 46.98

013 | 100 3.62 50.60

014 | 100 3.62 54.21

015 | 100 3.62 57.83

016 | 100 3.62 61.45

017 | 100 3.62 65.06

018 | 100 3.62 68.68

019 | 100 3.62 72.30

020 | 100 3.62 75.91

021 | 100 3.62 79.53

022 | 100 3.62 83.15

023 | 100 3.62 86.76

024 | 100 3.62 90.38

025 | 100 3.62 94.00

026 | 100 3.62 97.61

027 | 66 2.39 100.00

------------+-----------------------------------

Total | 2,765 100.00
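Since the %05.0f format always yields a five-character string when id has at most five digits (an assumption that holds here, as id runs from 1 to 2765), the same result could also be obtained by simply taking the first three characters. A sketch on made-up id values:

```stata
clear
set obs 3
gen id = 7
replace id = 123  in 2
replace id = 2765 in 3

* Original approach: drop the last two characters
gen str3 newid  = substr(string(id, "%05.0f"), 1, length(string(id, "%05.0f")) - 2)

* Shorter equivalent when the string is always 5 characters wide
gen str3 newid2 = substr(string(id, "%05.0f"), 1, 3)
list
```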

Dealing with Date Variables

Stata stores dates as the number of units elapsed since January 1, 1960; depending on the format, the units are milliseconds, days, weeks, months, quarters, or half-years. So if we want to be able to use date procedures in Stata (e.g., calculate the number of months between some events), we should store date variables in Stata format. Coding and interpretation of date and time values in Stata are as follows:

+---------------------------------------------------------------------
|        |            | ----- Numerical value & interpretation ------
| Format | Meaning    | Value = -1    | Value = 0     | Value = 1
|--------+------------+---------------+---------------+---------------
| %tc    | clock      | 31dec1959     | 01jan1960     | 01jan1960
|        |            | 23:59:59.999  | 00:00:00.000  | 00:00:00.001
|        |            |               |               |
| %td    | days       | 31dec1959     | 01jan1960     | 02jan1960
|        |            |               |               |
| %tw    | weeks      | 1959w52       | 1960w1        | 1960w2
|        |            |               |               |
| %tm    | months     | 1959m12       | 1960m1        | 1960m2
|        |            |               |               |
| %tq    | quarters   | 1959q4        | 1960q1        | 1960q2
|        |            |               |               |
| %th    | half-years | 1959h2        | 1960h1        | 1960h2
|        |            |               |               |
| %tg    | generic    | -1            | 0             | 1
|        |            |               |               |
| %ty    | year       | 1959          | 1960          | 1961
|        |            |               |               |
| %tC    | clock      | 31dec1959     | 01jan1960     | 01jan1960
|        |            | 23:59:59.999  | 00:00:00.000  | 00:00:00.001
+---------------------------------------------------------------------

(Note: %tC with capital C includes leap seconds).
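A quick way to get a feel for this coding is to display a few numbers under different formats. The values 15490 and 577 are taken from the GSS and HRS examples in these notes; the rest is only a sketch.

```stata
* Day 0 is January 1, 1960
display %td 0

* The same kind of number, 15490 days later
display %td 15490

* Month 577 counted from 1960m1
display %tm 577

* Going the other way: build the numbers from their components
display mdy(5, 30, 2002)
display ym(2008, 2)
```

The last two lines return 15490 and 577, i.e., the numeric values behind 30may2002 and 2008m2.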

We will work with the interview date variable in GSS 2002 as an example.

DATEINTV

Date of interview

Survey Question: Date of interview.

Range of Valid Numeric Responses

Minimum value= 1 Maximum value= 9998

Response Categories

Category Label Frequency

0 Not applicable 0

9999 Not available 18

Column: 1276 Width: 4 Type: numeric

Text: REMARKS: This variable consists of the month (Cols. 5734-5735) and date (Cols. 5736-

5737) on which the interview was conducted. Collapsed information by month is listed above for

convenience of display only.

. sum dateintv

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

dateintv | 2747 383.1736 120.219 206 626

One way to manage this would be to split the original variable into day and month and use a numeric importing function. To split it, it might be easier to use it as a string, so we convert the original variable into a string using the tostring command:

. tostring dateintv, gen(datestr2)

datestr2 generated as str3

. gen month=substr(datestr2, 1, 1)

. gen day=substr(datestr2, 2, 2)

(18 missing values generated)

. tab month

month | Freq. Percent Cum.

------------+-----------------------------------

. | 18 0.65 0.65

2 | 557 20.14 20.80

3 | 745 26.94 47.74

4 | 703 25.42 73.16

5 | 526 19.02 92.19

6 | 216 7.81 100.00

------------+-----------------------------------

Total | 2,765 100.00

. tab day, m

day | Freq. Percent Cum.

------------+-----------------------------------

| 18 0.65 0.65

01 | 79 2.86 3.51

02 | 77 2.78 6.29

03 | 60 2.17 8.46

04 | 79 2.86 11.32

05 | 55 1.99 13.31

06 | 95 3.44 16.75

07 | 87 3.15 19.89

08 | 90 3.25 23.15

09 | 80 2.89 26.04

10 | 83 3.00 29.04

11 | 122 4.41 33.45

12 | 101 3.65 37.11

13 | 134 4.85 41.95

14 | 86 3.11 45.06

15 | 103 3.73 48.79

16 | 94 3.40 52.19

17 | 65 2.35 54.54

18 | 104 3.76 58.30

19 | 97 3.51 61.81

20 | 101 3.65 65.46

21 | 99 3.58 69.04

22 | 120 4.34 73.38

23 | 110 3.98 77.36

24 | 83 3.00 80.36

25 | 119 4.30 84.67

26 | 92 3.33 87.99

27 | 84 3.04 91.03

28 | 110 3.98 95.01

29 | 72 2.60 97.61

30 | 54 1.95 99.57

31 | 12 0.43 100.00

------------+-----------------------------------

Total | 2,765 100.00

Now we convert these back into numbers:

. destring month, replace

month has all characters numeric; replaced as byte

(18 missing values generated)

. destring day, replace

day has all characters numeric; replaced as byte

(18 missing values generated)

. tab month, m

month | Freq. Percent Cum.

------------+-----------------------------------

2 | 557 20.14 20.14

3 | 745 26.94 47.09

4 | 703 25.42 72.51

5 | 526 19.02 91.54

6 | 216 7.81 99.35

. | 18 0.65 100.00

------------+-----------------------------------

Total | 2,765 100.00

. sum day

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

day | 2747 15.97306 8.259537 1 31
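The string-based split above relies on every month having a single digit (February through June here). A purely numeric alternative, sketched below on made-up values of dateintv, would also handle two-digit months; the names month_n and day_n are hypothetical.

```stata
clear
set obs 4
gen dateintv = 530
replace dateintv = 206  in 2
replace dateintv = 1015 in 3     // hypothetical October 15
replace dateintv = .    in 4

* Month = everything above the last two digits
gen byte month_n = floor(dateintv/100)

* Day = the last two digits
gen byte day_n = mod(dateintv, 100)
list
```

Missing values of dateintv remain missing in both new variables.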

We also need the year, but such a variable already exists:

. tab year

gss year |

for this |

respondent | Freq. Percent Cum.

------------+-----------------------------------

2002 | 2,765 100.00 100.00

------------+-----------------------------------

Total | 2,765 100.00

Now we need to import date information from these three numeric variables into a single

numeric variable that is coded in the way that Stata understands; here are various possibilities of

importing date from numeric variables – the structure of the command would be, for example:

gen varname= mdyhms(M, D, Y, h, m, s)

where mdyhms is the function you use and M, D, Y, h, m, s in parentheses are replaced with

names of variables where information on each component is stored. Here are all possible

functions:

%tc | mdyhms(M, D, Y, h, m, s)
%tc | dhms(td, h, m, s)
%tc | hms(h, m, s)
    |
%tC | Cmdyhms(M, D, Y, h, m, s)
%tC | Cdhms(td, h, m, s)
%tC | Chms(h, m, s)
    |
%td | mdy(M, D, Y)
    |
%tw | yw(Y, W)
%tm | ym(Y, M)
%tq | yq(Y, Q)
%th | yh(Y, H)
%ty | Y

So for our example:

. gen intervdate=mdy(month, day, year)

(18 missing values generated)

. sum intervdate

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

intervdate | 2747 15436.14 35.72223 15377 15517

. list intervdate in 1/10

+----------+

| interv~e |

|----------|

1. | 15490 |

2. | 15494 |

3. | 15394 |

4. | 15425 |

5. | 15464 |

|----------|

6. | 15493 |

7. | 15392 |

8. | 15469 |

9. | 15467 |

10. | 15473 |

+----------+

. format intervdate %td

. list intervdate in 1/10

+-----------+

| intervd~e |

|-----------|

1. | 30may2002 |

2. | 03jun2002 |

3. | 23feb2002 |

4. | 26mar2002 |

5. | 04may2002 |

|-----------|

6. | 02jun2002 |

7. | 21feb2002 |

8. | 09may2002 |

9. | 07may2002 |

10. | 13may2002 |

+-----------+
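Once a date is stored this way, date arithmetic becomes simple subtraction. The sketch below uses two dates from the listing above; the new variable names are made up, and the reference date of January 1, 2002 is chosen only for illustration.

```stata
clear
set obs 2
gen intervdate = mdy(5, 30, 2002)
replace intervdate = mdy(2, 23, 2002) in 2
format intervdate %td

* Days elapsed since the start of the year
gen days_elapsed = intervdate - mdy(1, 1, 2002)

* Convert the daily date to a monthly date
gen intervmonth = mofd(intervdate)
format intervmonth %tm

* Extract the calendar month as a number
gen byte mon = month(intervdate)
list
```

Subtracting two monthly dates created with mofd() gives the number of months between events.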

Here is an alternative example from HRS where the precision is only in months, not in days:

. use C:\hrs2008\stata\H08A_R.dta, clear

. gen date2008=ym(LA501, LA500)

. tab date2008

date2008 | Freq. Percent Cum.

------------+-----------------------------------

577 | 24 0.14 0.14

578 | 2,781 16.15 16.29

579 | 3,031 17.60 33.90

580 | 2,674 15.53 49.43

581 | 2,233 12.97 62.40

582 | 1,968 11.43 73.83

583 | 1,526 8.86 82.69

584 | 961 5.58 88.27

585 | 842 4.89 93.16

586 | 505 2.93 96.10

587 | 322 1.87 97.97

588 | 325 1.89 99.85

589 | 25 0.15 100.00

------------+-----------------------------------

Total | 17,217 100.00

. format date2008 %tm

. tab date2008

date2008 | Freq. Percent Cum.

------------+-----------------------------------

2008m2 | 24 0.14 0.14

2008m3 | 2,781 16.15 16.29

2008m4 | 3,031 17.60 33.90

2008m5 | 2,674 15.53 49.43

2008m6 | 2,233 12.97 62.40

2008m7 | 1,968 11.43 73.83

2008m8 | 1,526 8.86 82.69

2008m9 | 961 5.58 88.27

2008m10 | 842 4.89 93.16

2008m11 | 505 2.93 96.10

2008m12 | 322 1.87 97.97

2009m1 | 325 1.89 99.85

2009m2 | 25 0.15 100.00

------------+-----------------------------------

Total | 17,217 100.00
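
Monthly (%tm) values count months since 1960m1 rather than days, so monthly and daily dates cannot be mixed directly. The sketch below checks this coding against the table above and shows the conversion functions mofd() and dofm(); the variable name intervmonth is made up for illustration.

```stata
* ym() returns the number of months elapsed since 1960m1
display ym(2008, 2)                    // 577, the first value in the table above
display %tm 577                        // 2008m2
* mofd() converts a daily date to a monthly date
assert mofd(mdy(5, 30, 2002)) == ym(2002, 5)
* dofm() goes the other way and returns the first day of that month
assert dofm(ym(2002, 5)) == mdy(5, 1, 2002)
* hypothetical use with the daily variable created earlier:
* gen intervmonth = mofd(intervdate)
* format intervmonth %tm
```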

Sometimes, however, date information is stored in string variables, or we could convert it into strings; such strings could contain words, not only numbers. For our example, we could convert our original variable into a string and then prepend a string containing the year (and the leading zero for the month):

. tostring dateintv, gen(datestr)

datestr generated as str3

. replace datestr="20020" + datestr

datestr was str3 now str8

(2765 real changes made)

Now we will generate an actual date variable in a format that Stata recognizes, using a so-called mask. The conversion from string works as follows:

Format | String-to-numeric conversion function

-------+-----------------------------------------

%tc | clock(string, mask)

%tC | Clock(string, mask)

%td | date(string, mask)

%tw | weekly(string, mask)

%tm | monthly(string, mask)

%tq | quarterly(string, mask)

%th | halfyearly(string, mask)

%ty | yearly(string, mask)

%tg | no function necessary; read as numeric

-------------------------------------------------

The mask specifies the order in which the elements appear; e.g., for a string "June 3, 2010" or "6-3-2010" we would use "MDY" (month, day, and year); for a string "3jun2010 16:30:26", the mask would be "DMYhms". The full commands for these examples (not in our dataset!) would be:

. gen date2008= date(intervdate, "MDY")

. gen double time2008 = clock(time, "DMYhms")

Then we would format these:

. format date2008 %td

. format time2008 %tc
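
A quick way to check a mask before applying it to a variable is to try it on a literal string with display. This is only a sketch. It also illustrates why the clock() result is stored as a double: %tc values are milliseconds since 01jan1960 00:00:00, which are far too large for a float to hold exactly.

```stata
* both strings from the text parse to the same daily date with the "MDY" mask
assert date("June 3, 2010", "MDY") == mdy(6, 3, 2010)
assert date("6-3-2010", "MDY") == mdy(6, 3, 2010)
display %td date("June 3, 2010", "MDY")                  // 03jun2010
* date-times: milliseconds since 01jan1960 00:00:00
assert clock("3jun2010 16:30:26", "DMYhms") == dhms(mdy(6, 3, 2010), 16, 30, 26)
display %tc clock("3jun2010 16:30:26", "DMYhms")         // 03jun2010 16:30:26
* the raw number of milliseconds, well beyond float precision
display %20.0f clock("3jun2010 16:30:26", "DMYhms")
```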

Now back to our dataset:

. gen date= date(datestr, "YMD")

(18 missing values generated)

. sum date

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

date | 2747 15436.14 35.72223 15377 15517

Finally, we format this variable:

. format date %td

. list date in 1/10

+-----------+

| date |

|-----------|

1. | 30may2002 |

2. | 03jun2002 |

3. | 23feb2002 |

4. | 26mar2002 |

5. | 04may2002 |

|-----------|

6. | 02jun2002 |

7. | 21feb2002 |

8. | 09may2002 |

9. | 07may2002 |

10. | 13may2002 |

+-----------+

A few additional functions to work with dates include those that allow us to extract specific

components (month, week, year, etc.) from a date:

. sum date

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

date | 2747 15436.14 35.72223 15377 15517

. gen month=month(date)

(18 missing values generated)

. tab month

month | Freq. Percent Cum.

------------+-----------------------------------

2 | 557 20.28 20.28

3 | 745 27.12 47.40

4 | 703 25.59 72.99

5 | 526 19.15 92.14

6 | 216 7.86 100.00

------------+-----------------------------------

Total | 2,747 100.00

. gen yr=year(date)

(18 missing values generated)

. tab yr

yr | Freq. Percent Cum.

------------+-----------------------------------

2002 | 2,747 100.00 100.00

------------+-----------------------------------

Total | 2,747 100.00

. gen daynum=day(date)

(18 missing values generated)

. sum daynum

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

daynum | 2747 15.97306 8.259537 1 31

. gen weekday=dow(date)

(18 missing values generated)

. tab weekday

weekday | Freq. Percent Cum.

------------+-----------------------------------

0 | 213 7.75 7.75

1 | 466 16.96 24.72

2 | 430 15.65 40.37

3 | 416 15.14 55.52

4 | 439 15.98 71.50

5 | 384 13.98 85.48

6 | 399 14.52 100.00

------------+-----------------------------------

Total | 2,747 100.00
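
dow() codes the day of the week as 0 = Sunday through 6 = Saturday. Attaching value labels makes the table easier to read. The lines below are a sketch; the label name dowlbl is an arbitrary choice.

```stata
* 30 May 2002 (observation 1 above) was a Thursday, so dow() returns 4
assert dow(mdy(5, 30, 2002)) == 4
* value labels for the 0-6 coding (the label name dowlbl is hypothetical)
label define dowlbl 0 "Sunday" 1 "Monday" 2 "Tuesday" 3 "Wednesday" 4 "Thursday" 5 "Friday" 6 "Saturday"
* then, in our dataset:
* label values weekday dowlbl
* tab weekday
```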

. gen week=week(date)

(18 missing values generated)

. sum week

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

week | 2747 14.18092 5.116454 6 26

Note that the first week of a year is the first 7-day period of the year.

When dates are coded in days (%td format), the following functions are available: month(…), week(…), year(…), day(…), and dow(…). When the time measure is more precise and values are coded in milliseconds (%tc format), we can extract other components as well: hh(…), mm(…), and ss(…) return the hour, minute, and second of a date-time value, while hours(ms), minutes(ms), and seconds(ms) convert a duration in milliseconds into hours, minutes, or seconds. There are additional functions for different levels of precision; you can look them up:

. help datetime_functions
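
For date-time (%tc) values, a minimal sketch of extracting components follows. It uses a scalar built from the literal string "3jun2010 16:30:26" rather than a variable from our dataset.

```stata
* build a %tc value (milliseconds since 01jan1960 00:00:00)
scalar t = clock("3jun2010 16:30:26", "DMYhms")
display hh(t)              // 16
display mm(t)              // 30
display ss(t)              // 26
* dofc() drops the time of day and returns a %td daily date
display %td dofc(t)        // 03jun2010
assert month(dofc(t)) == 6 & year(dofc(t)) == 2010
```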

Numeric Variable Formats

Numbers in Stata can be stored in 5 different types of variables.

There are three integer types:

byte – for integers from -127 to 100, ideal for categorical variables

int – for integers from -32,767 to 32,740

long – for integers up to about 2 billion (±2,147,483,620)

And two types for numbers with fractions:

float (the default) – about 7 digits of accuracy (every integer up to 2^24 = 16,777,216 can be stored exactly; beyond that, values may be rounded)

double – about 16 digits of accuracy

When you create a new numeric variable and do not specify its storage type, the new variable is made a float, unless you have previously used the “set type” command. For example:

. gen hrs40=(hrs1>=40) if hrs1<.

. des hrs40

storage display value

variable name type format label variable label

--------------------------------------------------------------------------------------

hrs40 float %9.0g

. set type double

. drop hrs40

. gen hrs40=(hrs1>=40) if hrs1<.

(1036 missing values generated)

. des hrs40

storage display value

variable name type format label variable label

--------------------------------------------------------------------------------------

hrs40 double %10.0g

To set the default back:

. set type float

To create a specific variable with a storage type that differs from the default (float), specify the type in the gen or egen command:

. gen byte hrs40=(hrs1>=40) if hrs1<.

If you declare a variable as an integer (byte, int or long), but make it equal to something that in

fact contains fractions, the fractional part will be truncated (not rounded but just cut off!). For

example:

. sum hrs1

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

hrs1 | 1729 41.77675 14.62304 1 89

. gen hrs1d10=hrs1/10

(1036 missing values generated)

. sum hrs1d10

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

hrs1d10 | 1729 4.177675 1.462304 .1 8.9

. gen byte hrs1d10_b=hrs1d10

(1036 missing values generated)

. sum hrs1d10_b

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

hrs1d10_b | 1729 3.95026 1.485213 0 8
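
If rounding rather than truncation is what you want, apply round() yourself before storing the result in an integer type. A small sketch with literal numbers; the variable name hrs1d10_r in the last line is made up.

```stata
* storing a fraction in an integer type cuts off the decimals, like int()
display int(8.9)        // 8
* round(), floor(), and ceil() make the intended behavior explicit
display round(8.9)      // 9
display floor(8.9)      // 8
display ceil(8.1)       // 9
assert round(4.5) == 5 & int(4.5) == 4
* hypothetical use with our variable:
* gen byte hrs1d10_r = round(hrs1d10)
```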

In most cases, it doesn’t make sense to worry too much about setting the storage type, except in those cases where the default (float) causes an undesirable loss of precision. For example, if your IDs are very large numbers (more than 7 digits) and you store them as the default (float), they can be rounded and therefore no longer uniquely identify individuals. Store such IDs using long (for up to 9 digits) or double; saving them as a string variable is another safe option.
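
The rounding of large numbers in a float can be seen directly with the float() function, which rounds its argument to float precision. A sketch with a made-up 9-digit ID:

```stata
* every integer up to 2^24 = 16,777,216 fits exactly in a float ...
assert float(16777216) == 16777216
* ... but the next integer does not
assert float(16777217) != 16777217
* a hypothetical 9-digit ID is silently changed when stored as float
display %12.0f float(123456789)     // 123456792
* long and double hold it exactly, e.g.
* gen long id_long = 123456789
```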

Float and double can also cause problems if we want to use exact comparisons with fractions: because of the way they are stored (in binary format), they might be a tiny bit off from, say, the 1.3 that is displayed to us. So do your comparisons based on intervals rather than exact values when dealing with fractions. For example:

. tab hrs1d10 if hrs1d10>6 & hrs1d10<7

hrs1d10 | Freq. Percent Cum.

------------+-----------------------------------

6.1 | 2 5.00 5.00

6.2 | 6 15.00 20.00

6.3 | 3 7.50 27.50

6.4 | 4 10.00 37.50

6.5 | 18 45.00 82.50

6.6 | 4 10.00 92.50

6.8 | 3 7.50 100.00

------------+-----------------------------------

Total | 40 100.00

. list id if hrs1d10==6.1

. list id if hrs1d10==6.2

. list id if hrs1d10==6.3

. list id if hrs1d10==6.4

. list id if hrs1d10==6.5

+------+

| id |

|------|

33. | 33 |

408. | 408 |

453. | 453 |

758. | 758 |

1105. | 1105 |

|------|

1264. | 1264 |

1340. | 1340 |

1414. | 1414 |

1520. | 1520 |

1702. | 1702 |

|------|

1947. | 1947 |

1957. | 1957 |

2096. | 2096 |

2156. | 2156 |

2269. | 2269 |

|------|

2277. | 2277 |

2327. | 2327 |

2743. | 2743 |

+------+

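Only 6.5 produced matches above because 6.5 (13/2) has an exact binary representation, while 6.1, 6.2, 6.3, and 6.4 do not. This problem never occurs with byte, int, long, or string variables, or with integer numbers stored as float (as long as they do not exceed 16,777,216), so if you want to use exact conditions, multiply your variable by, say, 10, 100, or 1000 to get rid of decimals.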

. gen hrs1dm=hrs1d10*10

(1036 missing values generated)

. des hrs1dm

storage display value

variable name type format label variable label

--------------------------------------------------------------------------------------

hrs1dm float %9.0g

. list id if hrs1dm==61

+------+

| id |

|------|

865. | 865 |

2000. | 2000 |

+------+
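
Another common way to make an exact comparison work is to round the constant to float precision with float(), so that both sides of == carry the same rounding error. A sketch, assuming hrs1d10 is stored as a float as above:

```stata
* a number typed in a command is a double; the stored value is a float
assert 6.1 != float(6.1)
* 6.5 has an exact binary representation, so no mismatch arises
assert 6.5 == float(6.5)
* with our variable, this should list the two cases the table above shows for 6.1:
* list id if hrs1d10 == float(6.1)
```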

If your dataset is large, using small storage types like byte can save a lot of memory, but that can be accomplished after all the variables are created, before saving the dataset, using the compress command. It automatically stores variables in smaller types if that is possible without losing precision. It also checks whether strings can be stored as shorter strings.

. compress

emailhr was int now byte

chathr was int now byte

artshr was int now byte

emhrh was int now byte

emhrw was int now byte

wwwhrw was int now byte

emhro was int now byte

wwwhro was int now byte

chldprb was int now byte

chldhlp was int now byte

hrs40 was double now byte

(47,005 bytes saved)

You can also change the storage type of a specific variable using the recast command:

recast type varlist [, force]

where type is byte, int, long, float, double, str1, str2, ..., str2045, or strL. For example:

. recast byte hrs1dm

. des hrs1dm

storage display value

variable name type format label variable label

--------------------------------------------------------------------------------------

hrs1dm byte %9.0g

. recast byte hrs1d10

hrs1d10: 786 values would be changed; not changed

. sum hrs1d10

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

hrs1d10 | 1729 4.177675 1.462304 .1 8.9

. recast byte hrs1d10, force

hrs1d10: 786 values changed

. sum hrs1d10

Variable | Obs Mean Std. Dev. Min Max

-------------+--------------------------------------------------------

hrs1d10 | 1729 3.95026 1.485213 0 8

Note that force makes recast unsafe: variables get the new storage type even if that causes a loss of precision, the introduction of missing values, or, for string variables, the truncation of strings.

Display Formats for Numeric Variables

We already saw that formatting date variables tells Stata that the numbers represent dates, so that it displays them correctly. You can also modify the display format of other numeric variables, again using the format command:

format varlist %fmt

Here are the variable formats for numeric variables (from help format):

Numerical
   %fmt      Description      Example
-------------------------------------------------------
right-justified
   %#.#g     general          %9.0g
   %#.#f     fixed            %9.2f
   %#.#e     exponential      %10.7e
   %21x      hexadecimal      %21x
   %16H      binary, hilo     %16H
   %16L      binary, lohi     %16L
   %8H       binary, hilo     %8H
   %8L       binary, lohi     %8L

right-justified with commas
   %#.#gc    general          %9.0gc
   %#.#fc    fixed            %9.2fc

right-justified with leading zeros
   %0#.#f    fixed            %09.2f

left-justified
   %-#.#g    general          %-9.0g
   %-#.#f    fixed            %-9.2f
   %-#.#e    exponential      %-10.7e

left-justified with commas
   %-#.#gc   general          %-9.0gc
   %-#.#fc   fixed            %-9.2fc
-------------------------------------------------------
You may substitute comma (,) for period (.) in any of
the above formats to make comma the decimal point. In
%9,2fc, 1000.03 is 1.000,03. Or you can use “set dp comma.”

The %g format is usually specified as %width.0g, with 0 decimal places, but what that actually means is that Stata decides how many digits to display to the right of the decimal point depending on how many digits there are in total, whereas in %f the number of digits after the decimal point is fixed by the format. Also, %g switches to an exponential (%e) display if the number is too large or too small for the width, while %f does not.

. des spsei

storage display value

variable name type format label variable label

--------------------------------------------------------------------------------------

spsei float %3.2f spsei r's spouse's socioeconomic index

. list spsei in 7/8

+-------+

| spsei |

|-------|

7. | 64.1 |

8. | 29.2 |

+-------+

. format spsei %3.2f

. list spsei in 7/8

+-------+

| spsei |

|-------|

7. | 64.10 |

8. | 29.20 |

+-------+

. format spsei %09.2f

. list spsei in 7/8

+-----------+

| spsei |

|-----------|

7. | 000064.10 |

8. | 000029.20 |

+-----------+

. format spsei %3.2e

. list spsei in 7/8

+---------+

| spsei |

|---------|

7. | 6.4e+01 |

8. | 2.9e+01 |

+---------+
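
The same formats can be tried on single numbers with display, which is a convenient way to preview a format before applying it to a variable. This is a sketch; the numbers are arbitrary, and the expected output is noted only for the fixed formats.

```stata
display %9.2f 64.1              // 64.10, right-justified in 9 columns
display %09.2f 64.1             // 000064.10
display %12.2fc 1234567.891     // 1,234,567.89
display %-9.2f 64.1             // 64.10, left-justified
* %g chooses the number of decimals itself and may switch to exponential
display %9.0g 64.1
display %9.0g 123456789012
* %e always uses exponential notation
display %10.3e 64.1
```

Changing a display format never changes the stored value; only the way it is shown.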

The default formats are:

byte %8.0g

int %8.0g

long %12.0g

float %9.0g

double %10.0g

You can also change the default format for displaying all coefficients using the set cformat command; e.g., to show only 2 decimal places, we can use the following command prior to running our regression models:

. set cformat %9.2f
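
The same formatting can also be requested for a single model through the cformat() display option, which leaves the session default alone. A minimal sketch using the auto dataset that ships with Stata (not our dataset):

```stata
sysuse auto, clear
* default coefficient display
regress price mpg weight
* two decimal places for coefficients, standard errors, and confidence limits
regress price mpg weight, cformat(%9.2f)
* replaying the results with a different format does not re-estimate the model
regress, cformat(%9.3f)
```

Only the coefficient table display changes; the stored estimates in e(b) keep their full precision.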