DATA Step, Macro, Functions and more

Update the missing variables

Accepted Solution Solved
Reply
Regular Contributor
Posts: 162
Accepted Solution

Update the missing variables

Dear all,

 

The attached dataset contains the missing "state" and "stateab". These two variables can be matched with the "STCOD".

 

Could anyone suggest a program that can help me to update the missing variables?

 

Thank you.

 

MSPAK

Attachment
Attachment

Accepted Solutions
Solution
‎09-25-2015 06:23 AM
Contributor hbi
Contributor
Posts: 66

Re: Update the missing variables

Hi,

 

Your dataset is pretty small, so a PROC SQL update statement would work nicely.

 

The first step involves creating a lookup table; I included a frequency column (informational only and not needed). But notice that for our nation's capital, there are two spelling variations ("District of Columbia" and "District Of Columbia"). The PROC SORT gets rid of the duplicate.

 

Next, a series of two PROC SQL update statement updates rows for which state is missing and then updates rows for which stateab is missing. Since your dataset is relatively small, an "in-place" PROC SQL update works just fine. If your dataset were much larger with millions of observations, then a PROC SQL update may be suboptimal.

 

Enjoy!

 

 

/* first, create a lookup table */
PROC SQL;
  CREATE TABLE distinct_states AS 
  SELECT STCOD, state, stateab, COUNT(*) as state_frequency
  FROM full
  WHERE NOT MISSING(state)
  GROUP BY STCOD, state, stateab
  ORDER BY 2, 1, 4 DESC;
QUIT;


/* note that your dataset has District of Columbia spelled two different 
   ways, thus get rid of one of them */
PROC SORT DATA=distinct_states nodupkey;
  BY STCOD;
RUN;


PROC SQL;
/* only update where state is missing */
  UPDATE full
  SET state = (SELECT MAX(state)
               FROM distinct_states
               WHERE full.STCOD = distinct_states.STCOD)
  WHERE MISSING(state);
QUIT;


PROC SQL;
/* only update where abbreviated state is missing */
  UPDATE full
  SET stateab = (SELECT MAX(stateab)
                 FROM distinct_states
                 WHERE full.STCOD = distinct_states.STCOD)
  WHERE MISSING(stateab);
QUIT;

 

View solution in original post


All Replies
Super User
Posts: 17,775

Re: Update the missing variables

Have you tried a merge?
Regular Contributor
Posts: 162

Re: Update the missing variables

Dear Reeza,

 

Yes, I tried. Thank you.

Valued Guide
Posts: 765

Re: Update the missing variables

Hi ... you can use functions ...

 

data fixed;
set z.full;
if missing(state) then do;
   state = fipname(stcod);
   stateab = fipstate(stcod);
end;
run;

 

Also, the variable COUNTY has three different forms in the data set:  just the county name, county + state, county name followed by the word "County".  . Why not make it consistent, just the county name ...

 


data fixed;
set z.full;
if missing(state) then do;
   state = fipname(stcod);
   stateab = fipstate(stcod);
end;
* get rid of the STATE, get rid of text 'COUNTY';
if find(county,',') then county=scan(county,1,',');
else
if findw(county,'County') then county = tranwrd(county,'County',' ');
run;

Solution
‎09-25-2015 06:23 AM
Contributor hbi
Contributor
Posts: 66

Re: Update the missing variables

Hi,

 

Your dataset is pretty small, so a PROC SQL update statement would work nicely.

 

The first step involves creating a lookup table; I included a frequency column (informational only and not needed). But notice that for our nation's capital, there are two spelling variations ("District of Columbia" and "District Of Columbia"). The PROC SORT gets rid of the duplicate.

 

Next, a series of two PROC SQL update statement updates rows for which state is missing and then updates rows for which stateab is missing. Since your dataset is relatively small, an "in-place" PROC SQL update works just fine. If your dataset were much larger with millions of observations, then a PROC SQL update may be suboptimal.

 

Enjoy!

 

 

/* first, create a lookup table */
PROC SQL;
  CREATE TABLE distinct_states AS 
  SELECT STCOD, state, stateab, COUNT(*) as state_frequency
  FROM full
  WHERE NOT MISSING(state)
  GROUP BY STCOD, state, stateab
  ORDER BY 2, 1, 4 DESC;
QUIT;


/* note that your dataset has District of Columbia spelled two different 
   ways, thus get rid of one of them */
PROC SORT DATA=distinct_states nodupkey;
  BY STCOD;
RUN;


PROC SQL;
/* only update where state is missing */
  UPDATE full
  SET state = (SELECT MAX(state)
               FROM distinct_states
               WHERE full.STCOD = distinct_states.STCOD)
  WHERE MISSING(state);
QUIT;


PROC SQL;
/* only update where abbreviated state is missing */
  UPDATE full
  SET stateab = (SELECT MAX(stateab)
                 FROM distinct_states
                 WHERE full.STCOD = distinct_states.STCOD)
  WHERE MISSING(stateab);
QUIT;

 

Valued Guide
Posts: 765

Re: Update the missing variables

[ Edited ]

Hi.  There are functions to convert state codes (fips numbers) to state names and two character abbreviations.  Every observation in the data set has a state code.

Regular Contributor
Posts: 162

Re: Update the missing variables

Dear MikeZded,

 

Did you mean that SAS system contains a database of state codes, with their names and abbreviation? May I know which SAS function can perform this procedure?

 

Thank you.

 

Regards,

MSPAK

Regular Contributor
Posts: 162

Re: Update the missing variables

Hi again,

 

Thanks. I read your suggested program. I think this only works for U.S. data. Do you think this is applicable to other countries as well?

 

Thank you very much.

 

MSPAK

Regular Contributor
Posts: 162

Re: Update the missing variables

Thanks MikeZded and hbi for your programs. Both are the correct solution.
Valued Guide
Posts: 765

Re: Update the missing variables

[ Edited ]

Correct, the various functions that deal with state and zip codes are US only.  There are a lot of such functions (see below from SAS online HELP).  Also the SQL solution for US state codes worked fine, but given that there are two functions that did the same thing,in two statements, it seems like a lot of code for US-based data.  However , it would be a solution for an instance when functions would not work, e.g. non-US data.

 


State and Zip Code Functions

 

FIPNAME 

converts FIPS codes to uppercase state names

FIPNAMEL 

converts FIPS codes to mixed-case state names

FIPSTATE
converts FIPS codes to two-character postal codes

GEODISTANCE
returns the geodetic distance between two latitude and longitude coordinates

STFIPS
converts state postal codes to FIPS state codes

STNAME
converts state postal codes to uppercase state names

STNAMEL
converts state postal codes to mixed-case state names

ZIPCITY
returns a city name and the two-character postal code that corresponds to a zip code

ZIPCITYDISTANCE
returns the geodetic distance between two zip code locations

ZIPFIPS
converts zip codes to FIPS state codes

ZIPNAME
converts zip codes to uppercase state names

ZIPNAMEL
converts zip codes to mixed-case state names

ZIPSTATE
converts zip codes to state postal codes 

 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 357 views
  • 1 like
  • 4 in conversation