cleaning column data

Reply
Occasional Contributor
Posts: 5

cleaning column data

In LOCATION dataset, a STATES column has two letter coded values like 'CT, NY, NJ, IL, AZ, CA' and some of the values  in the same column are displayed with full names of the states like 'COLORADO, DELAWARE, FLORIDA, INDIANA' 

what is the code to make all these full name states into their respective two lettered codes.

 

 

Super User
Super User
Posts: 8,609

Re: cleaning column data

Posted in reply to anushagandla

Do you have a list of what country should decode to which 2 letter abbreviation?  If so then just put some if statements in a datastep or merge the information on.  For example - and you have not provided any of the necessary information such as test data in the form of a datastep to write this acurately:

data codes;
  country="Canada"; code="CA"; output;
run;

proc sql;
  create table WANT as
  select  A.COUNTRY, 
             B.CODE
  from    HAVE A
  left join CODES B
  on       A.COUNTRY=B.COUNTRY;
quit;
Occasional Contributor
Posts: 5

Re: cleaning column data

hi,

 

the LOCATION dataset is huge dataset in which thousands of records are with full names of the states instead of two lettered codes.

how can we write if statement for soo many records with full names?

 

 

Super User
Posts: 21,464

Re: cleaning column data

Posted in reply to anushagandla

anushagandla wrote:

hi,

 

the LOCATION dataset is huge dataset in which thousands of records are with full names of the states instead of two lettered codes.

how can we write if statement for soo many records with full names?

 

 


That's why cleaning data sucks.

 

Run a proc freq on the column and get all distinct entries. 

Map each one you don't want to the new code using IF/THEN statements or a custom format. A Format is easier because you can create a format from a data set. SASMAPS will have a mapping of state names to abbreviations.

 

You only the write the statement once, but if you have 50 states, it could be up to 50 linesm, a format would definitely be less code.

 

There's some examples in here on how to create formats from a data set and using PUT to recode variables.

http://www2.sas.com/proceedings/sugi30/001-30.pdf

 

 

Super User
Posts: 12,148

Re: cleaning column data

Posted in reply to anushagandla

By any chance does this data set have a ZIPcode variable? Then the easiest may be to use the ZIPSTATE function.

If you have Zip+4 codes , such as 90049-1392 you only want to use the first 5 characters, or the digits corresponding to the base 5 zipcode. A numeric zip of 1040 would be treated as 01040.

 

data want;

   set have;

   ST2Char= zipstate(zipcodevariable);

run;

 

would be as start. If you have the Zip+4 then zipstate(substr(zipcodevariable,1,5))

Respected Advisor
Posts: 4,274

Re: cleaning column data

[ Edited ]
Posted in reply to anushagandla

@anushagandla

If these are US states and you don't have to deal with typos as well then below code should do the job.

data have;
  infile datalines dlm=',';
  input states:$40. @@;
  output;
datalines;
CT, NY, NJ, IL, AZ, CA,COLORADO, DELAWARE, FLORIDA, INDIANA
;
run;

proc sql;
  create table cntlin_source as
    select distinct
        'StateToCode' as fmtname
      , 'c' as type
      , statename2 as start
      , statecode as label
    from
      sashelp.zipcode
    ;
quit;

proc format cntlin=cntlin_source;
run;

data want;
  set have;
  state_codes=put(upcase(compress(states)),$StateToCode.);
run;
Respected Advisor
Posts: 3,157

Re: cleaning column data

[ Edited ]
Posted in reply to anushagandla

Another way to save some typing is to leverage SASHELP.ZIPCODE, where you will find all the states' full name as well as the two letter state code. Then you can choose either Proc SQL / Hash/Format to standardize your variable.

Ask a Question
Discussion stats
  • 6 replies
  • 135 views
  • 3 likes
  • 6 in conversation